Math for LLMs

Probabilistic Models

Math for Specific Models / Probabilistic Models

Notes

Probabilistic models describe data, hidden structure, parameters, and predictions with distributions. They make uncertainty a first-class object rather than an afterthought.

Overview

The core operations are:

p(x)=\sum_z p(x,z),\qquad p(\theta\mid D)=\frac{p(D\mid\theta)\,p(\theta)}{p(D)}.

Marginalization handles hidden variables. Bayes rule updates beliefs. Likelihood trains parameters. These ideas appear in naive Bayes, mixture models, HMMs, variational inference, VAEs, uncertainty estimation, and probabilistic interpretations of neural networks.
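These core operations can be sketched numerically on a small discrete model. The joint table below is made up for illustration; marginalization is a sum over rows and conditioning is division by the marginal.

```python
import numpy as np

# Hypothetical 2x3 joint p(z, x): rows index hidden z, columns observed x.
joint = np.array([[0.10, 0.25, 0.15],
                  [0.20, 0.05, 0.25]])
assert np.isclose(joint.sum(), 1.0)

# Marginalization: p(x) = sum_z p(x, z)
p_x = joint.sum(axis=0)

# Conditioning via Bayes rule: p(z | x) = p(x, z) / p(x)
p_z_given_x = joint / p_x            # broadcasts the marginal over rows
assert np.allclose(p_z_given_x.sum(axis=0), 1.0)
```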

Prerequisites

  • Probability rules, expectation, and conditional probability
  • Log likelihood and cross-entropy
  • Linear and neural model sections
  • Basic optimization

Companion Notebooks

  • theory.ipynb: demonstrates likelihoods, Bayes updates, naive Bayes, Gaussian mixtures, EM, HMM forward recursion, Monte Carlo, ELBO intuition, and calibration.
  • exercises.ipynb: ten practice problems covering probability normalization, MLE/MAP, Bayes rule, mixture responsibilities, HMMs, and diagnostics.

Learning Objectives

After this section, you should be able to:

  • Distinguish joint, marginal, conditional, likelihood, posterior, and predictive distributions.
  • Compute MLE and MAP estimates in simple models.
  • Apply Bayes rule with conjugate priors.
  • Explain naive Bayes and the generative/discriminative distinction.
  • Compute mixture responsibilities and one EM update.
  • Explain graphical-model factorization and conditional independence.
  • Run HMM forward and Viterbi-style recursions at a high level.
  • Explain approximate inference, ELBO, and posterior predictive checks.

Table of Contents

  1. Probabilistic Modeling View
  2. Likelihood and Estimation
  3. Bayes Rule
  4. Naive Bayes and Discriminative Contrast
  5. Latent Variable Models
  6. Expectation Maximization
  7. Graphical Models
  8. Hidden Markov Models
  9. Approximate Inference
  10. Diagnostics

Object Map

observed data:       x, y
latent variables:    z
parameters:          theta
prior:               p(theta)
likelihood:          p(D | theta)
posterior:           p(theta | D)
prediction:          p(y_new | x_new, D)

1. Probabilistic Modeling View

This part develops the probabilistic modeling view: uncertainty-aware modeling that keeps track of observed variables, hidden variables, parameters, and decisions.

  • Random variables: represent uncertain quantities. Formula: X, Y, Z
  • Joint distribution: model all variables together. Formula: p(x,y,z)
  • Conditional distribution: predict one variable given another. Formula: p(y\mid x)
  • Marginalization: sum or integrate out hidden variables. Formula: p(x)=\sum_z p(x,z)
  • Decision rule: turn probabilities into actions. Formula: a^\star=\arg\min_a E[L(a,Y)\mid x]

1.1 Random variables

Main idea. Represent uncertain quantities.

Core relation:

X, Y, Z

Probabilistic models make uncertainty explicit. Instead of producing only a point prediction, they specify distributions over observations, classes, hidden states, or parameters. This lets us compute likelihoods, posteriors, predictive uncertainty, and principled decisions.

Worked micro-example. If a coin prior is \mathrm{Beta}(2,2) and we observe 7 heads and 3 tails, the posterior is \mathrm{Beta}(9,5). The posterior mean is 9/(9+5)\approx 0.643, which is less extreme than the raw frequency 0.7 because the prior contributes pseudo-counts.

Implementation check. Confirm that probabilities normalize, log probabilities are finite, latent responsibilities sum to one, and held-out likelihood improves for the right reason.

AI connection. Random variables are the basic vocabulary for every model in this section, from naive Bayes to VAEs.

Common mistake. Do not confuse the likelihood p(D\mid\theta) with the posterior p(\theta\mid D). They are related by Bayes rule but answer different questions.
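The coin example can be checked numerically using only the standard Beta–Bernoulli conjugate update:

```python
# Beta(2, 2) prior plus 7 heads and 3 tails gives a Beta(9, 5) posterior.
a0, b0 = 2.0, 2.0                              # prior pseudo-counts
heads, tails = 7, 3
a_post, b_post = a0 + heads, b0 + tails        # Beta(9, 5)
posterior_mean = a_post / (a_post + b_post)    # 9/14, about 0.643
raw_frequency = heads / (heads + tails)        # 0.7
```

The posterior mean sits between the prior mean 0.5 and the raw frequency 0.7, exactly as the text describes.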

1.2 Joint distribution

Main idea. Model all variables together.

Core relation:

p(x,y,z)


1.3 Conditional distribution

Main idea. Predict one variable given another.

Core relation:

p(y\mid x)


1.4 Marginalization

Main idea. Sum or integrate out hidden variables.

Core relation:

p(x)=\sum_z p(x,z)


1.5 Decision rule

Main idea. Turn probabilities into actions.

Core relation:

a^\star=\arg\min_a E[L(a,Y)\mid x]

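The decision rule can be made concrete with a hypothetical asymmetric loss matrix (the numbers are invented for illustration):

```python
import numpy as np

# Posterior over Y given x (assumed for illustration) and a loss matrix L[a, y].
p_y = np.array([0.7, 0.3])             # p(Y=0 | x), p(Y=1 | x)
loss = np.array([[0.0, 10.0],          # action 0: free if Y=0, costly if Y=1
                 [1.0,  0.0]])         # action 1: small fixed cost if Y=0

expected_loss = loss @ p_y             # E[L(a, Y) | x] for each action a
best_action = int(np.argmin(expected_loss))
# Even though Y=0 is more probable, the asymmetric loss makes action 1 optimal.
```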

2. Likelihood and Estimation

This part studies likelihood and estimation: how the data probability, viewed as a function of the parameters, drives MLE and MAP fitting.

  • Likelihood: view data probability as a function of parameters. Formula: L(\theta)=p(D\mid\theta)
  • Log likelihood: products become sums. Formula: \ell(\theta)=\sum_i\log p(x_i\mid\theta)
  • MLE: choose parameters maximizing data likelihood. Formula: \hat\theta_\mathrm{MLE}=\arg\max_\theta\ell(\theta)
  • MAP: include a prior over parameters. Formula: \hat\theta_\mathrm{MAP}=\arg\max_\theta[\log p(D\mid\theta)+\log p(\theta)]
  • Predictive distribution: integrate parameter uncertainty when Bayesian. Formula: p(x_\star\mid D)=\int p(x_\star\mid\theta)\,p(\theta\mid D)\,d\theta

2.1 Likelihood

Main idea. View data probability as a function of parameters.

Core relation:

L(\theta)=p(D\mid\theta)


2.2 Log likelihood

Main idea. Products become sums.

Core relation:

\ell(\theta)=\sum_i\log p(x_i\mid\theta)


2.3 MLE

Main idea. Choose parameters maximizing data likelihood.

Core relation:

\hat\theta_\mathrm{MLE}=\arg\max_\theta\ell(\theta)


2.4 MAP

Main idea. Include a prior over parameters.

Core relation:

\hat\theta_\mathrm{MAP}=\arg\max_\theta\left[\log p(D\mid\theta)+\log p(\theta)\right]

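A minimal sketch comparing MLE and MAP for a Bernoulli coin with a Beta(2, 2) prior; the MAP expression below is the standard posterior-mode formula for this conjugate pair:

```python
heads, tails = 7, 3
a, b = 2.0, 2.0                                  # Beta(2, 2) prior

# MLE: maximize the log likelihood alone.
theta_mle = heads / (heads + tails)              # 0.7

# MAP: maximize log likelihood + log prior; mode of the Beta(9, 5) posterior.
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)   # 8/12, about 0.667
```

The prior pulls the MAP estimate toward 0.5, acting as a regularizer in the low-data regime.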

2.5 Predictive distribution

Main idea. Integrate parameter uncertainty when Bayesian.

Core relation:

p(x_\star\mid D)=\int p(x_\star\mid\theta)\,p(\theta\mid D)\,d\theta


3. Bayes Rule

This part studies Bayes rule: how priors and likelihoods combine into posteriors and predictions.

  • Posterior: update beliefs after data. Formula: p(\theta\mid D)=p(D\mid\theta)\,p(\theta)/p(D)
  • Evidence: normalizing probability of the data. Formula: p(D)=\int p(D\mid\theta)\,p(\theta)\,d\theta
  • Conjugacy: some priors yield closed-form posteriors. Formula: \mathrm{Beta}+\mathrm{Bernoulli}\rightarrow\mathrm{Beta}
  • Prior strength: prior counts can regularize low-data estimates. Formula: \alpha, \beta
  • Posterior predictive: predictions average over posterior uncertainty. Formula: p(y\mid x,D)

3.1 Posterior

Main idea. Update beliefs after data.

Core relation:

p(\theta\mid D)=p(D\mid\theta)\,p(\theta)/p(D)

AI connection. Bayes rule is the core update that turns observations into revised uncertainty.
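Bayes rule can be checked on a grid: discretize theta, multiply prior by likelihood, and normalize by the evidence. A minimal sketch for the 7-heads, 3-tails coin data with a uniform prior:

```python
import numpy as np

# Discretize theta on a grid; uniform prior over grid points.
thetas = np.linspace(0.01, 0.99, 99)
prior = np.full_like(thetas, 1.0 / len(thetas))

# Likelihood of 7 heads and 3 tails at each theta.
likelihood = thetas**7 * (1 - thetas)**3

# Bayes rule: posterior = likelihood * prior / evidence.
unnorm = likelihood * prior
evidence = unnorm.sum()                 # grid approximation of p(D)
posterior = unnorm / evidence
assert np.isclose(posterior.sum(), 1.0)
```

With a uniform prior, the posterior mode coincides with the MLE at theta = 0.7.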

3.2 Evidence

Main idea. Normalizing probability of data.

Core relation:

p(D)=\int p(D\mid\theta)\,p(\theta)\,d\theta


3.3 Conjugacy

Main idea. Some priors yield closed-form posteriors.

Core relation:

\mathrm{Beta}+\mathrm{Bernoulli}\rightarrow\mathrm{Beta}


3.4 Prior strength

Main idea. Prior counts can regularize low-data estimates.

Core relation:

\alpha, \beta


3.5 Posterior predictive

Main idea. Predictions average over posterior uncertainty.

Core relation:

p(y\mid x,D)

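The posterior predictive can be estimated by Monte Carlo: sample parameters from the posterior and average the predictive probability. A sketch using the running Beta(9, 5) coin posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample theta from the Beta(9, 5) posterior, then average p(heads | theta).
theta_samples = rng.beta(9, 5, size=100_000)
p_heads = theta_samples.mean()   # MC estimate of the integral; close to 9/14
```

For Bernoulli data the integral has the closed form 9/14, so the Monte Carlo estimate can be checked exactly; for richer models, sampling is often the only practical route.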

4. Naive Bayes and Discriminative Contrast

This part studies naive Bayes and contrasts generative classifiers with discriminative ones.

  • Generative classifier: model class prior and feature likelihood. Formula: p(y,x)=p(y)\,p(x\mid y)
  • Naive assumption: features are conditionally independent given the class. Formula: p(x\mid y)=\prod_j p(x_j\mid y)
  • Classification: choose the largest posterior class. Formula: \arg\max_y p(y)\prod_j p(x_j\mid y)
  • Discriminative model: model p(y\mid x) directly. Formula: p_\theta(y\mid x)
  • Calibration: probabilistic outputs should match frequencies. Formula: P(Y=y\mid\hat p_y=c)\approx c

4.1 Generative classifier

Main idea. Model class prior and feature likelihood.

Core relation:

p(y,x)=p(y)\,p(x\mid y)


4.2 Naive assumption

Main idea. Features are conditionally independent given class.

Core relation:

p(x\mid y)=\prod_j p(x_j\mid y)


4.3 Classification

Main idea. Choose largest posterior class.

Core relation:

\arg\max_y p(y)\prod_j p(x_j\mid y)

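A minimal naive Bayes classification in log space, with made-up class priors and per-feature Bernoulli likelihoods; working with log probabilities avoids underflow when many features multiply:

```python
import numpy as np

log_prior = np.log(np.array([0.6, 0.4]))   # p(y) for 2 classes
# p(x_j = 1 | y) for 3 binary features, rows = classes (numbers invented)
p_feat = np.array([[0.8, 0.3, 0.5],
                   [0.2, 0.7, 0.6]])
x = np.array([1, 0, 1])                    # observed feature vector

# log p(y) + sum_j log p(x_j | y), evaluated per class
log_lik = (x * np.log(p_feat) + (1 - x) * np.log(1 - p_feat)).sum(axis=1)
scores = log_prior + log_lik
pred = int(np.argmax(scores))              # class with largest posterior score
```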

4.4 Discriminative model

Main idea. Model p(y|x) directly.

Core relation:

p_\theta(y\mid x)


4.5 Calibration

Main idea. Probabilistic outputs should match frequencies.

Core relation:

P(Y=y\mid\hat p_y=c)\approx c

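A quick reliability check: bin predictions by confidence and compare the mean predicted probability to the empirical frequency. The synthetic labels below are drawn to be calibrated by construction, so the two quantities should roughly match in every bin:

```python
import numpy as np

rng = np.random.default_rng(0)
p_hat = rng.uniform(size=5000)                       # predicted probabilities
y = (rng.uniform(size=5000) < p_hat).astype(int)     # calibrated by construction

# Assign each prediction to one of 10 equal-width confidence bins.
bins = np.linspace(0.0, 1.0, 11)
idx = np.digitize(p_hat, bins) - 1

for b in range(10):
    mask = idx == b
    print(f"bin {b}: mean p_hat={p_hat[mask].mean():.2f}, "
          f"empirical freq={y[mask].mean():.2f}")
```

On real model outputs the two columns can diverge badly (overconfidence), which is what calibration diagnostics are designed to catch.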

5. Latent Variable Models

This part studies latent variable models, in which hidden variables explain structure in the observed data.

  • Latent variable: a hidden cause explains observed data. Formula: p(x,z)=p(z)\,p(x\mid z)
  • Mixture model: data comes from one of several components. Formula: p(x)=\sum_k\pi_k\,p(x\mid z=k)
  • Responsibilities: posterior component probabilities. Formula: \gamma_{ik}=p(z_i=k\mid x_i)
  • Identifiability: different latent labelings can represent the same distribution (z labels can permute)
  • Representation learning: latent variables are probabilistic hidden features. Formula: z\rightarrow x

5.1 Latent variable

Main idea. Hidden cause explains observed data.

Core relation:

p(x,z)=p(z)\,p(x\mid z)


5.2 Mixture model

Main idea. Data comes from one of several components.

Core relation:

p(x)=\sum_k\pi_k\,p(x\mid z=k)


5.3 Responsibilities

Main idea. Posterior component probabilities.

Core relation:

\gamma_{ik}=p(z_i=k\mid x_i)

AI connection. Responsibilities are soft cluster assignments and the heart of mixture-model EM.
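Responsibilities follow directly from Bayes rule within the mixture; a minimal 1-D Gaussian sketch with invented parameters:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.array([-2.0, 0.1, 3.0])            # a few 1-D observations
pi = np.array([0.5, 0.5])                 # mixing weights (assumed)
mu = np.array([-1.0, 2.0])                # component means (assumed)
sigma = np.array([1.0, 1.0])

# gamma_ik is proportional to pi_k * N(x_i | mu_k, sigma_k), normalized over k.
weighted = pi * normal_pdf(x[:, None], mu, sigma)
gamma = weighted / weighted.sum(axis=1, keepdims=True)
assert np.allclose(gamma.sum(axis=1), 1.0)
```

The point at -2.0 is assigned almost entirely to the left component, while 0.1 sits between the means and gets a genuinely soft split.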

5.4 Identifiability

Main idea. Different latent labels can represent the same distribution.

Core relation:

z labels can permute (label switching)


5.5 Representation learning

Main idea. Latent variables are probabilistic hidden features.

Core relation:

$z \rightarrow x$


6. Expectation Maximization

This part studies expectation maximization as uncertainty-aware modeling. Keep track of observed variables, hidden variables, parameters, and decisions.

| Subtopic | Question | Formula |
| --- | --- | --- |
| E-step | compute posterior over latent variables | $q_i(z) = p(z \mid x_i, \theta^{\text{old}})$ |
| M-step | maximize expected complete-data log likelihood | $\theta^{\text{new}} = \arg\max_\theta E_q[\log p(x, z \mid \theta)]$ |
| Lower bound | EM improves an evidence lower bound | $\log p(x) \ge E_q[\log p(x, z) - \log q(z)]$ |
| Gaussian mixture updates | means become responsibility-weighted averages | $\mu_k = \sum_i \gamma_{ik} x_i / \sum_i \gamma_{ik}$ |
| Local optima | EM depends on initialization | $\theta_0$ matters |

6.1 E-step

Main idea. Compute posterior over latent variables.

Core relation:

$q_i(z) = p(z \mid x_i, \theta^{\text{old}})$


6.2 M-step

Main idea. Maximize expected complete-data log likelihood.

Core relation:

$\theta^{\text{new}} = \arg\max_\theta E_q[\log p(x, z \mid \theta)]$


6.3 Lower bound

Main idea. EM improves an evidence lower bound.

Core relation:

$\log p(x) \ge E_q[\log p(x, z) - \log q(z)]$


6.4 Gaussian mixture updates

Main idea. Means become responsibility-weighted averages.

Core relation:

$\mu_k = \sum_i \gamma_{ik} x_i / \sum_i \gamma_{ik}$

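The E-step and M-step above combine into a minimal EM loop for a 1-D two-component Gaussian mixture. This sketch fixes both variances at 1 and uses made-up data purely for illustration:

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.2]   # two clear clusters near -2 and +2
weights = [0.5, 0.5]
mus = [-1.0, 1.0]                          # theta_0: initialization matters

for _ in range(20):
    # E-step: responsibilities gamma_ik = p(z_i = k | x_i, theta_old)
    gammas = []
    for x in data:
        joint = [w * normal_pdf(x, m) for w, m in zip(weights, mus)]
        total = sum(joint)
        gammas.append([j / total for j in joint])
    # M-step: responsibility-weighted means and mixing weights
    for k in range(2):
        nk = sum(g[k] for g in gammas)
        mus[k] = sum(g[k] * x for g, x in zip(gammas, data)) / nk
        weights[k] = nk / len(data)

print([round(m, 2) for m in mus])  # means move toward the two cluster centers
```

Rerunning with a poor initialization (for example both means started at the same point) illustrates the local-optima caveat in Section 6.5.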

6.5 Local optima

Main idea. EM depends on initialization.

Core relation:

The initialization $\theta_0$ matters; different starting points can converge to different local optima.


7. Graphical Models

This part studies graphical models as uncertainty-aware modeling. Keep track of observed variables, hidden variables, parameters, and decisions.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Directed model | factorize a joint distribution by parents | $p(x_{1:n}) = \prod_i p(x_i \mid \mathrm{pa}_i)$ |
| Undirected model | factorize by potentials over cliques | $p(x) = Z^{-1} \prod_c \psi_c(x_c)$ |
| Conditional independence | graph structure encodes independence assumptions | $X \perp Y \mid Z$ |
| Inference | compute marginals or MAP assignments | $p(x_i \mid \text{evidence})$ |
| Message passing | reuse local computations on graphs | $m_{i \rightarrow j}(x_j)$ |

7.1 Directed model

Main idea. Factorize a joint distribution by parents.

Core relation:

$p(x_{1:n}) = \prod_i p(x_i \mid \mathrm{pa}_i)$

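The parent factorization can be made concrete with a hypothetical two-node network $A \rightarrow B$; the conditional-probability-table values below are invented for illustration:

```python
# p(a, b) = p(a) * p(b | a) for a toy chain A -> B
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True:  {True: 0.9, False: 0.1},
               False: {True: 0.2, False: 0.8}}

def joint(a, b):
    """Joint probability from the parent factorization."""
    return p_a[a] * p_b_given_a[a][b]

# Marginalizing the factorized joint recovers p(b)
p_b_true = sum(joint(a, True) for a in (True, False))
print(round(p_b_true, 2))  # 0.3*0.9 + 0.7*0.2 = 0.41
```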

7.2 Undirected model

Main idea. Factorize by potentials over cliques.

Core relation:

$p(x) = Z^{-1} \prod_c \psi_c(x_c)$


7.3 Conditional independence

Main idea. Graph structure encodes independence assumptions.

Core relation:

$X \perp Y \mid Z$


7.4 Inference

Main idea. Compute marginals or MAP assignments.

Core relation:

$p(x_i \mid \text{evidence})$


7.5 Message passing

Main idea. Reuse local computations on graphs.

Core relation:

$m_{i \rightarrow j}(x_j)$


8. Hidden Markov Models

This part studies hidden markov models as uncertainty-aware modeling. Keep track of observed variables, hidden variables, parameters, and decisions.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Markov state | current hidden state depends on previous hidden state | $p(z_t \mid z_{t-1})$ |
| Emission | observation depends on current hidden state | $p(x_t \mid z_t)$ |
| Forward recursion | compute filtered probabilities | $\alpha_t(j) = p(x_{1:t}, z_t = j)$ |
| Viterbi | find most likely hidden state path | $\max_{z_{1:T}} p(z_{1:T}, x_{1:T})$ |
| Sequence modeling bridge | HMMs are probabilistic predecessors of neural sequence models | $z_t$ hidden state |

8.1 Markov state

Main idea. Current hidden state depends on previous hidden state.

Core relation:

$p(z_t \mid z_{t-1})$


8.2 Emission

Main idea. Observation depends on current hidden state.

Core relation:

$p(x_t \mid z_t)$


8.3 Forward recursion

Main idea. Compute filtered probabilities.

Core relation:

$\alpha_t(j) = p(x_{1:t}, z_t = j)$


AI connection. This is dynamic programming for probabilistic sequence inference.

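A direct transcription of the forward recursion for a two-state HMM with binary observations; all parameter values are illustrative:

```python
# alpha_1(j) = pi_j * B[j][x_1]
# alpha_t(j) = (sum_i alpha_{t-1}(i) * A[i][j]) * B[j][x_t]
pi = [0.6, 0.4]                    # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]       # transitions p(z_t | z_{t-1})
B = [[0.9, 0.1], [0.2, 0.8]]       # emissions p(x_t | z_t), x in {0, 1}
obs = [0, 1, 1]

alpha = [pi[j] * B[j][obs[0]] for j in range(2)]
for x in obs[1:]:
    alpha = [sum(alpha[i] * A[i][j] for i in range(2)) * B[j][x]
             for j in range(2)]

likelihood = sum(alpha)            # p(x_{1:T}) by marginalizing z_T
print(round(likelihood, 5))        # joint probability of the observations
```

In practice the recursion is run in log space or with per-step normalization, since raw $\alpha$ values underflow on long sequences.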

8.4 Viterbi

Main idea. Find most likely hidden state path.

Core relation:

$\max_{z_{1:T}} p(z_{1:T}, x_{1:T})$

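Replacing the forward recursion's sum with a max and keeping backpointers gives Viterbi decoding; a sketch with illustrative parameters:

```python
# delta_t(j) = max over state paths ending in j at time t
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]       # transitions p(z_t | z_{t-1})
B = [[0.9, 0.1], [0.2, 0.8]]       # emissions p(x_t | z_t)
obs = [0, 1, 1]

delta = [pi[j] * B[j][obs[0]] for j in range(2)]
back = []
for x in obs[1:]:
    step, new_delta = [], []
    for j in range(2):
        best_i = max(range(2), key=lambda i: delta[i] * A[i][j])
        step.append(best_i)                       # backpointer into t-1
        new_delta.append(delta[best_i] * A[best_i][j] * B[j][x])
    back.append(step)
    delta = new_delta

# Backtrack from the best final state to recover the argmax path
state = max(range(2), key=lambda j: delta[j])
path = [state]
for step in reversed(back):
    state = step[state]
    path.append(state)
path.reverse()
print(path)  # [0, 1, 1]: state 0 explains x=0, state 1 explains x=1
```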

8.5 Sequence modeling bridge

Main idea. HMMs are probabilistic predecessors of neural sequence models.

Core relation:

The hidden state $z_t$ plays the role that a learned hidden vector plays in an RNN.


9. Approximate Inference

This part studies approximate inference as uncertainty-aware modeling. Keep track of observed variables, hidden variables, parameters, and decisions.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Monte Carlo | estimate expectations with samples | $E[f(X)] \approx n^{-1} \sum_i f(x_i)$ |
| Variational inference | approximate posterior with a tractable family | $q_\phi(z) \approx p(z \mid x)$ |
| ELBO | optimize a lower bound on log evidence | $E_q[\log p(x, z) - \log q(z)]$ |
| Reparameterization | differentiate through random variables when possible | $z = \mu + \sigma\epsilon$ |
| Amortized inference | use a neural network to predict variational parameters | $q_\phi(z \mid x)$ |

9.1 Monte Carlo

Main idea. Estimate expectations with samples.

Core relation:

$E[f(X)] \approx n^{-1} \sum_i f(x_i)$

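A quick sanity check of the sample-mean estimator, here for $E[X^2]$ with $X \sim \mathcal{N}(0,1)$, whose true value is $\mathrm{Var}(X) = 1$:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

n = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
estimate = sum(x * x for x in samples) / n
print(round(estimate, 2))  # close to 1.0; error shrinks like 1/sqrt(n)
```

Doubling $n$ does not halve the error: Monte Carlo error decays as $O(n^{-1/2})$, which is why variance-reduction tricks matter.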

9.2 Variational inference

Main idea. Approximate posterior with a tractable family.

Core relation:

$q_\phi(z) \approx p(z \mid x)$


9.3 ELBO

Main idea. Optimize a lower bound on log evidence.

Core relation:

$E_q[\log p(x, z) - \log q(z)]$


AI connection. The ELBO connects classical latent-variable models to VAEs and modern variational methods.

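The bound can be checked exactly on a toy model with one binary latent variable and invented joint probabilities: the ELBO equals $\log p(x)$ when $q$ is the exact posterior and is strictly smaller for any other $q$:

```python
import math

# Toy model: fixed observation x, binary latent z, joint p(x, z) given directly
p_joint = {0: 0.1, 1: 0.3}                       # p(x, z) for z in {0, 1}
log_evidence = math.log(sum(p_joint.values()))   # log p(x)

def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a distribution q over z."""
    return sum(q[z] * (math.log(p_joint[z]) - math.log(q[z]))
               for z in (0, 1) if q[z] > 0)

# Exact posterior p(z | x) makes the bound tight
posterior = {z: p_joint[z] / sum(p_joint.values()) for z in (0, 1)}
print(abs(elbo(posterior) - log_evidence) < 1e-12)  # True: bound is tight
print(elbo({0: 0.5, 1: 0.5}) < log_evidence)        # True: other q's fall short
```

The gap between $\log p(x)$ and the ELBO is exactly $\mathrm{KL}(q \,\|\, p(z \mid x))$, which is why maximizing the ELBO over $q$ performs approximate posterior inference.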

9.4 Reparameterization

Main idea. Differentiate through random variables when possible.

Core relation:

$z = \mu + \sigma\epsilon$

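A sketch of why the reparameterization helps: writing $z = \mu + \sigma\epsilon$ makes $z$ differentiable in $\mu$, so the gradient of $E[z^2]$ can be estimated by averaging $2z \cdot \partial z / \partial \mu$ over samples. The true value here is $2\mu$; the example numbers are illustrative:

```python
import random

random.seed(1)

mu, sigma = 0.5, 1.0
n = 200_000

# z = mu + sigma * eps gives dz/dmu = 1, so
# d/dmu E[z^2] = E[2z * dz/dmu] = E[2z] = 2 * mu
grad = sum(2.0 * (mu + sigma * random.gauss(0.0, 1.0)) for _ in range(n)) / n
print(round(grad, 1))  # close to 2 * mu = 1.0
```

Without the reparameterization, the expectation's dependence on $\mu$ sits inside the sampling distribution, and one must fall back on higher-variance score-function (REINFORCE) estimators.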

9.5 Amortized inference

Main idea. Use a neural network to predict variational parameters.

Core relation:

$q_\phi(z \mid x)$


10. Diagnostics

This part studies diagnostics as uncertainty-aware modeling. Keep track of observed variables, hidden variables, parameters, and decisions.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Log-likelihood | track data fit under the probabilistic model | $\ell(D)$ |
| Posterior predictive checks | simulate from the fitted model and compare to data | $x^{\mathrm{rep}} \sim p(x \mid D)$ |
| Calibration curves | check probability quality | $\mathrm{ECE}$ |
| Sensitivity to priors | posterior can change under weak data | $p(\theta)$ |
| Ablations | compare generative, discriminative, latent, and neural versions | $\Delta L,\ \Delta S$ |

10.1 Log-likelihood

Main idea. Track data fit under the probabilistic model.

Core relation:

$\ell(D)$


10.2 Posterior predictive checks

Main idea. Simulate from the fitted model and compare to data.

Core relation:

$x^{\mathrm{rep}} \sim p(x \mid D)$


AI connection. A probabilistic model should generate data that looks like the data it claims to explain.


10.3 Calibration curves

Main idea. Check probability quality.

Core relation:

$\mathrm{ECE}$

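A minimal expected-calibration-error computation with equal-width confidence bins; the binning scheme and toy data are illustrative, not a fixed standard:

```python
# ECE = sum over bins of (bin weight) * |avg confidence - accuracy|
def ece(confidences, correct, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf = 1.0 into last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += len(b) / total * abs(avg_conf - accuracy)
    return err

# Perfectly calibrated toy data: 0.9 confidence, 9/10 correct
confs = [0.9] * 10
hits = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
print(round(ece(confs, hits), 3))  # 0.0
```

If the same predictions were all wrong, the ECE would jump to 0.9: the model would claim high confidence while achieving zero accuracy.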

10.4 Sensitivity to priors

Main idea. Posterior can change under weak data.

Core relation:

$p(\theta)$


10.5 Ablations

Main idea. Compare generative, discriminative, latent, and neural versions.

Core relation:

$\Delta L,\ \Delta S$



Practice Exercises

  1. Normalize a discrete probability vector.
  2. Compute Bernoulli log likelihood and MLE.
  3. Compute a Beta-Bernoulli posterior.
  4. Compute a MAP estimate with pseudo-counts.
  5. Classify with naive Bayes.
  6. Compute Gaussian mixture responsibilities.
  7. Perform one EM mean update.
  8. Run one HMM forward step.
  9. Estimate an expectation by Monte Carlo.
  10. Write a probabilistic-model debugging checklist.

Why This Matters for AI

Modern AI systems need uncertainty: calibrated classifiers, latent-variable generative models, retrieval confidence, Bayesian decision rules, and probabilistic sequence models. Neural networks often produce the parameters of probability distributions, so probabilistic modeling explains what those outputs mean.

Bridge to RNN and LSTM Math

Hidden Markov models are probabilistic sequence models with latent states. RNNs replace explicit latent-state probabilities with learned hidden vectors and differentiable recurrence.

References