Probabilistic models describe data, hidden structure, parameters, and predictions with distributions. They make uncertainty a first-class object rather than an afterthought.
Overview
The core operations are marginalization, which handles hidden variables; Bayes rule, which updates beliefs; and likelihood maximization, which trains parameters. These ideas appear in naive Bayes, mixture models, HMMs, variational inference, VAEs, uncertainty estimation, and probabilistic interpretations of neural networks.
Prerequisites
- Probability rules, expectation, and conditional probability
- Log likelihood and cross-entropy
- Linear and neural model sections
- Basic optimization
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates likelihoods, Bayes updates, naive Bayes, Gaussian mixtures, EM, HMM forward recursion, Monte Carlo, ELBO intuition, and calibration. |
| exercises.ipynb | Ten practice problems for probability normalization, MLE/MAP, Bayes, mixture responsibilities, HMMs, and diagnostics. |
Learning Objectives
After this section, you should be able to:
- Distinguish joint, marginal, conditional, likelihood, posterior, and predictive distributions.
- Compute MLE and MAP estimates in simple models.
- Apply Bayes rule with conjugate priors.
- Explain naive Bayes and the generative/discriminative distinction.
- Compute mixture responsibilities and one EM update.
- Explain graphical-model factorization and conditional independence.
- Run HMM forward and Viterbi-style recursions at a high level.
- Explain approximate inference, ELBO, and posterior predictive checks.
Table of Contents
- Probabilistic Modeling View
- 1.1 Random variables
- 1.2 Joint distribution
- 1.3 Conditional distribution
- 1.4 Marginalization
- 1.5 Decision rule
- Likelihood and Estimation
- 2.1 Likelihood
- 2.2 Log likelihood
- 2.3 MLE
- 2.4 MAP
- 2.5 Predictive distribution
- Bayes Rule
- 3.1 Posterior
- 3.2 Evidence
- 3.3 Conjugacy
- 3.4 Prior strength
- 3.5 Posterior predictive
- Naive Bayes and Discriminative Contrast
- 4.1 Generative classifier
- 4.2 Naive assumption
- 4.3 Classification
- 4.4 Discriminative model
- 4.5 Calibration
- Latent Variable Models
- 5.1 Latent variable
- 5.2 Mixture model
- 5.3 Responsibilities
- 5.4 Identifiability
- 5.5 Representation learning
- Expectation Maximization
- 6.1 E-step
- 6.2 M-step
- 6.3 Lower bound
- 6.4 Gaussian mixture updates
- 6.5 Local optima
- Graphical Models
- 7.1 Directed model
- 7.2 Undirected model
- 7.3 Conditional independence
- 7.4 Inference
- 7.5 Message passing
- Hidden Markov Models
- 8.1 Markov state
- 8.2 Emission
- 8.3 Forward recursion
- 8.4 Viterbi
- 8.5 Sequence modeling bridge
- Approximate Inference
- 9.1 Monte Carlo
- 9.2 Variational inference
- 9.3 ELBO
- 9.4 Reparameterization
- 9.5 Amortized inference
- Diagnostics
- 10.1 Log-likelihood
- 10.2 Posterior predictive checks
- 10.3 Calibration curves
- 10.4 Sensitivity to priors
- 10.5 Ablations
Object Map
- observed data: x, y
- latent variables: z
- parameters: theta
- prior: p(theta)
- likelihood: p(D | theta)
- posterior: p(theta | D)
- prediction: p(y_new | x_new, D)
1. Probabilistic Modeling View
This part introduces the probabilistic modeling view: uncertainty-aware modeling that keeps explicit track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Idea | Formula |
|---|---|---|
| Random variables | represent uncertain quantities | X ~ p(x) |
| Joint distribution | model all variables together | p(x, y) |
| Conditional distribution | predict one variable given another | p(y \| x) = p(x, y) / p(x) |
| Marginalization | sum or integrate out hidden variables | p(x) = sum_z p(x, z) |
| Decision rule | turn probabilities into actions | y_hat = argmax_y p(y \| x) |
1.1 Random variables
Main idea. Represent uncertain quantities.
Core relation: a random variable X takes values with probabilities given by a distribution, X ~ p(x).
Probabilistic models make uncertainty explicit. Instead of producing only a point prediction, they specify distributions over observations, classes, hidden states, or parameters. This lets us compute likelihoods, posteriors, predictive uncertainty, and principled decisions.
Worked micro-example. If a coin prior is Beta(a, b) and we observe 7 heads and 3 tails, the posterior is Beta(a + 7, b + 3). The posterior mean is (a + 7) / (a + b + 10), which is less extreme than the raw frequency 0.7 because the prior contributes pseudo-counts.
Implementation check. Confirm that probabilities normalize, log probabilities are finite, latent responsibilities sum to one, and held-out likelihood improves for the right reason.
AI connection. Inputs, labels, hidden states, and parameters are all treated as random variables; this shared vocabulary underlies every model in this section.
Common mistake. Do not confuse the likelihood p(D | theta) with the posterior p(theta | D). They are related by Bayes rule but answer different questions.
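The worked micro-example above can be checked numerically. This sketch assumes a uniform Beta(1, 1) prior for concreteness; with a different prior, only the pseudo-counts change:

```python
# Posterior update for a coin, assuming (for concreteness) a uniform Beta(1, 1) prior.
a, b = 1.0, 1.0                                # prior pseudo-counts for heads and tails
heads, tails = 7, 3                            # observed data

a_post, b_post = a + heads, b + tails          # conjugate Beta update
posterior_mean = a_post / (a_post + b_post)    # E[theta | D]
mle = heads / (heads + tails)                  # raw frequency

print(a_post, b_post)            # 8.0 4.0
print(round(posterior_mean, 3))  # 0.667, pulled toward 0.5 relative to the MLE 0.7
```

Because the update only adds counts, the code doubles as a normalization check: the posterior mean must land between the prior mean and the MLE.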
1.2 Joint distribution
Main idea. Model all variables together.
Core relation: p(x, y) = p(x) p(y | x).
The joint distribution is the complete model: every conditional and marginal used later is derived from it by the product and sum rules.
1.3 Conditional distribution
Main idea. Predict one variable given another.
Core relation: p(y | x) = p(x, y) / p(x).
Conditioning focuses the joint on what was observed; prediction is reading off a conditional of the joint.
1.4 Marginalization
Main idea. Sum or integrate out hidden variables.
Core relation: p(x) = sum_z p(x, z) for discrete z, or the corresponding integral for continuous z.
Marginalization removes hidden variables by averaging over their possible values; it is how latent-variable models assign probability to the observed data alone.
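The sum rule is easy to verify on a small discrete joint. The table below is a made-up joint over a latent z and an observed x, used only to illustrate the mechanics:

```python
# Marginalizing a latent variable out of a small discrete joint p(x, z).
# Keys are (z, x); z in {0, 1}, x in {0, 1, 2}. Entries sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.25, (0, 2): 0.05,   # z = 0
    (1, 0): 0.20, (1, 1): 0.10, (1, 2): 0.30,   # z = 1
}

# p(x) = sum_z p(x, z)
p_x = {}
for (z, x), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

print({x: round(p, 2) for x, p in p_x.items()})   # {0: 0.3, 1: 0.35, 2: 0.35}
print(round(sum(p_x.values()), 10))               # 1.0: the marginal still normalizes
```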
1.5 Decision rule
Main idea. Turn probabilities into actions.
Core relation: y_hat = argmax_y p(y | x), or more generally the action that minimizes expected loss under p(y | x).
A decision rule converts a distribution into an action; different loss functions can yield different optimal actions from the same posterior.
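Minimizing expected loss can differ from picking the most probable class. The toy loss matrix below is hypothetical, chosen to make the two rules disagree:

```python
# Turning a posterior into an action by minimizing expected loss.
p_y = {"spam": 0.7, "ham": 0.3}                 # posterior p(y | x)
loss = {                                        # loss[action][true_label], illustrative values
    "delete": {"spam": 0.0, "ham": 10.0},       # deleting real mail is very costly
    "keep":   {"spam": 1.0, "ham": 0.0},
}

expected = {a: sum(p_y[y] * loss[a][y] for y in p_y) for a in loss}
best = min(expected, key=expected.get)

print(expected)   # delete: 3.0, keep: 0.7
print(best)       # 'keep': argmax probability says spam, but the loss flips the action
```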
2. Likelihood and Estimation
This part studies likelihood and estimation: fitting parameters under uncertainty while keeping track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Idea | Formula |
|---|---|---|
| Likelihood | view data probability as a function of parameters | L(theta) = p(D \| theta) |
| Log likelihood | products become sums | sum_i log p(x_i \| theta) |
| MLE | choose parameters maximizing data likelihood | argmax_theta p(D \| theta) |
| MAP | include a prior over parameters | argmax_theta p(D \| theta) p(theta) |
| Predictive distribution | integrate parameter uncertainty when Bayesian | int p(y_new \| theta) p(theta \| D) dtheta |
2.1 Likelihood
Main idea. View data probability as a function of parameters.
Core relation: L(theta) = p(D | theta) = prod_i p(x_i | theta) for i.i.d. data.
The same expression p(D | theta) is a probability of the data when theta is fixed and a likelihood function of theta when the data are fixed.
2.2 Log likelihood
Main idea. Products become sums.
Core relation: log p(D | theta) = sum_i log p(x_i | theta).
Taking logs turns products into sums, which is numerically stable and makes gradients tractable; maximizing the log likelihood is equivalent to maximizing the likelihood.
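The numerical-stability point is concrete: a product of many small probabilities underflows in floating point, while the sum of logs stays finite.

```python
import math

# Products of many small probabilities underflow; sums of logs do not.
probs = [0.01] * 400          # 400 i.i.d. observations, each with probability 0.01

product = 1.0
for p in probs:
    product *= p              # 0.01**400 = 1e-800 underflows to exactly 0.0

log_lik = sum(math.log(p) for p in probs)   # stays finite: 400 * log(0.01)

print(product)                 # 0.0
print(round(log_lik, 2))       # -1842.07
```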
2.3 MLE
Main idea. Choose parameters maximizing data likelihood.
Core relation: theta_hat_MLE = argmax_theta log p(D | theta).
For a Bernoulli coin with h heads in n flips, the MLE is h / n; with 7 heads in 10 flips it is 0.7.
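The Bernoulli closed form h / n can be sanity-checked against a brute-force search over candidate parameters:

```python
import math

# MLE for a Bernoulli coin: a grid search recovers the closed form h / n.
heads, tails = 7, 3
n = heads + tails

def log_lik(theta):
    return heads * math.log(theta) + tails * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]   # avoid theta = 0 and theta = 1
best = max(grid, key=log_lik)

print(best)          # 0.7
print(heads / n)     # 0.7
```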
2.4 MAP
Main idea. Include a prior over parameters.
Core relation: theta_hat_MAP = argmax_theta [log p(D | theta) + log p(theta)].
The log prior acts as a regularizer; for a coin with a Beta(a, b) prior, the MAP estimate is (h + a - 1) / (n + a + b - 2).
2.5 Predictive distribution
Main idea. Integrate parameter uncertainty when Bayesian.
Core relation: p(y_new | D) = int p(y_new | theta) p(theta | D) dtheta.
A fully Bayesian prediction averages over all parameter values weighted by their posterior probability, rather than committing to a single point estimate.
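For the Beta-Bernoulli coin the predictive integral has a closed form, which a Monte Carlo average over posterior samples should reproduce. This sketch again assumes a Beta(1, 1) prior:

```python
import random

# Posterior predictive for the Beta-Bernoulli coin, assuming a Beta(1, 1) prior.
# Closed form: p(heads_new | D) equals the posterior mean (a + h) / (a + b + n).
a, b, h, t = 1.0, 1.0, 7, 3
closed_form = (a + h) / (a + b + h + t)        # 8 / 12

# Monte Carlo check: average theta over posterior samples, since p(heads | theta) = theta.
random.seed(0)
samples = [random.betavariate(a + h, b + t) for _ in range(100_000)]
mc_estimate = sum(samples) / len(samples)

print(round(closed_form, 3))   # 0.667
print(round(mc_estimate, 2))   # close to 0.67
```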
3. Bayes Rule
This part studies Bayes rule: updating beliefs from data while keeping track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Idea | Formula |
|---|---|---|
| Posterior | update beliefs after data | p(theta \| D) = p(D \| theta) p(theta) / p(D) |
| Evidence | normalizing probability of data | p(D) = int p(D \| theta) p(theta) dtheta |
| Conjugacy | some priors yield closed-form posteriors | Beta(a, b) -> Beta(a + h, b + t) |
| Prior strength | prior counts can regularize low-data estimates | posterior mean = (a + h) / (a + b + n) |
| Posterior predictive | predictions average over posterior uncertainty | int p(x_new \| theta) p(theta \| D) dtheta |
3.1 Posterior
Main idea. Update beliefs after data.
Core relation: p(theta | D) = p(D | theta) p(theta) / p(D).
The posterior combines the prior with the likelihood and renormalizes by the evidence.
AI connection. Bayes rule is the core update that turns observations into revised uncertainty.
3.2 Evidence
Main idea. Normalizing probability of data.
Core relation: p(D) = int p(D | theta) p(theta) dtheta.
The evidence normalizes the posterior and scores the model as a whole; comparing evidence across models is the basis of Bayesian model comparison.
3.3 Conjugacy
Main idea. Some priors yield closed-form posteriors.
Core relation: a Beta(a, b) prior with a Bernoulli likelihood yields a Beta(a + h, b + t) posterior after h heads and t tails.
Conjugate pairs keep the posterior in the same family as the prior, so updating reduces to updating a few sufficient statistics.
3.4 Prior strength
Main idea. Prior counts can regularize low-data estimates.
Core relation: posterior mean = (a + h) / (a + b + n) for a Beta(a, b) prior and n = h + t flips.
The prior pseudo-counts a and b pull estimates toward the prior mean; their influence shrinks as n grows.
3.5 Posterior predictive
Main idea. Predictions average over posterior uncertainty.
Core relation: p(x_new | D) = int p(x_new | theta) p(theta | D) dtheta.
For the Beta-Bernoulli model this integral has a closed form: the predictive probability of heads equals the posterior mean (a + h) / (a + b + n).
4. Naive Bayes and Discriminative Contrast
This part studies naive Bayes and the generative/discriminative contrast, keeping track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Idea | Formula |
|---|---|---|
| Generative classifier | model class prior and feature likelihood | p(y \| x) = p(y) p(x \| y) / p(x) |
| Naive assumption | features are conditionally independent given class | p(x \| y) = prod_j p(x_j \| y) |
| Classification | choose largest posterior class | argmax_y p(y) prod_j p(x_j \| y) |
| Discriminative model | model p(y \| x) directly | p(y = 1 \| x) = sigmoid(w^T x + b) |
| Calibration | probabilistic outputs should match frequencies | fraction correct ~ stated confidence |
4.1 Generative classifier
Main idea. Model class prior and feature likelihood.
Core relation: p(y | x) = p(y) p(x | y) / p(x).
A generative classifier models how each class generates features, then inverts with Bayes rule to classify.
4.2 Naive assumption
Main idea. Features are conditionally independent given class.
Core relation: p(x | y) = prod_j p(x_j | y).
Assuming the features are conditionally independent given the class drastically reduces the number of parameters, at the cost of ignoring feature correlations.
4.3 Classification
Main idea. Choose largest posterior class.
Core relation: y_hat = argmax_y [log p(y) + sum_j log p(x_j | y)].
Working in log space avoids underflow when many per-feature probabilities multiply.
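The classification rule above fits in a few lines. The word probabilities and prior below are made-up numbers used only to show the mechanics of the log-space argmax:

```python
import math

# A minimal naive Bayes classification step in log space (toy, illustrative numbers).
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
# p(word present | class): one independent Bernoulli per word, per the naive assumption.
p_word = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham":  {"offer": 0.2, "meeting": 0.7},
}
observed = {"offer": 1, "meeting": 0}   # word-presence features for one email

scores = {}
for c in log_prior:
    s = log_prior[c]
    for w, present in observed.items():
        p = p_word[c][w] if present else 1.0 - p_word[c][w]
        s += math.log(p)
    scores[c] = s

print(max(scores, key=scores.get))   # 'spam'
```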
4.4 Discriminative model
Main idea. Model p(y|x) directly.
Core relation: model p(y | x) directly, for example p(y = 1 | x) = sigmoid(w^T x + b) in logistic regression.
Discriminative models spend their capacity on the decision boundary rather than on modeling the feature distribution.
4.5 Calibration
Main idea. Probabilistic outputs should match frequencies.
Core relation: among predictions made with confidence q, roughly a fraction q should be correct.
Calibration is checked empirically by binning predictions by confidence and comparing the mean predicted probability to the observed accuracy in each bin.
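The binning procedure is mechanical. The predictions and labels below are made up for illustration; in practice they would come from a held-out set:

```python
# Sketch of a reliability check: bin predictions by confidence, compare to accuracy.
preds  = [0.9, 0.8, 0.9, 0.7, 0.3, 0.2, 0.6, 0.4, 0.8, 0.1]   # predicted p(y = 1)
labels = [1,   1,   0,   1,   0,   0,   1,   1,   1,   0]      # true outcomes

bins = {}   # bin index -> (sum of confidences, sum of positive outcomes, count)
for p, y in zip(preds, labels):
    i = min(int(p * 5), 4)          # 5 equal-width bins over [0, 1]
    conf, pos, n = bins.get(i, (0.0, 0, 0))
    bins[i] = (conf + p, pos + y, n + 1)

for i in sorted(bins):
    conf, pos, n = bins[i]
    print(f"bin {i}: mean confidence {conf / n:.2f}, empirical frequency {pos / n:.2f}")
```

A well-calibrated model shows the two columns roughly agreeing in every bin; large gaps indicate over- or under-confidence.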
5. Latent Variable Models
This part studies latent variable models: hidden structure under uncertainty, keeping track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Idea | Formula |
|---|---|---|
| Latent variable | hidden cause explains observed data | p(x) = sum_z p(z) p(x \| z) |
| Mixture model | data comes from one of several components | p(x) = sum_k pi_k p(x \| theta_k) |
| Responsibilities | posterior component probabilities | gamma_k = pi_k p(x \| theta_k) / sum_j pi_j p(x \| theta_j) |
| Identifiability | different latent labels can represent the same distribution | labels can permute |
| Representation learning | latent variables are probabilistic hidden features | z as a learned code |
5.1 Latent variable
Main idea. Hidden cause explains observed data.
Core relation: p(x) = sum_z p(z) p(x | z).
A latent variable is never observed; it is a modeling device that explains structure in the data and is integrated out to score observations.
5.2 Mixture model
Main idea. Data comes from one of several components.
Core relation: p(x) = sum_k pi_k p(x | theta_k), with mixing weights pi_k that sum to one.
Each data point is assumed to come from one of K components, but we never observe which one.
5.3 Responsibilities
Main idea. Posterior component probabilities.
Core relation: gamma_k(x) = pi_k p(x | theta_k) / sum_j pi_j p(x | theta_j).
AI connection. Responsibilities are soft cluster assignments and the heart of mixture-model EM.
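The responsibility formula can be evaluated directly for one point under a two-component 1-D Gaussian mixture; the parameters here are illustrative, not fitted:

```python
import math

# Responsibilities for one point under a two-component 1-D Gaussian mixture.
def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

pi = [0.5, 0.5]                  # mixing weights
mu = [0.0, 4.0]                  # component means
sigma = [1.0, 1.0]               # component standard deviations
x = 1.0                          # observed point, closer to component 0

weighted = [pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
gamma = [w / sum(weighted) for w in weighted]

print([round(g, 3) for g in gamma])   # component 0 takes most of the responsibility
print(sum(gamma))                     # 1.0: responsibilities always normalize
```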
5.4 Identifiability
Main idea. Different latent labels can represent the same distribution.
Core relation: permuting the component labels of z leaves the distribution p(x) unchanged.
Because labels can permute, the likelihood alone cannot distinguish relabelings, and comparing learned components across runs requires matching them up first.
5.5 Representation learning
Main idea. Latent variables are probabilistic hidden features.
Core relation: z serves as a probabilistic hidden representation of x.
This is the bridge to representation learning: VAEs and related models learn distributions over latent codes rather than deterministic features.
6. Expectation Maximization
This part studies expectation maximization: alternating latent-variable inference and parameter estimation, keeping track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Idea | Formula |
|---|---|---|
| E-step | compute posterior over latent variables | q(z) = p(z \| x, theta_old) |
| M-step | maximize expected complete-data log likelihood | argmax_theta E_q[log p(x, z \| theta)] |
| Lower bound | EM improves an evidence lower bound | log p(x \| theta) >= ELBO(q, theta) |
| Gaussian mixture updates | means become responsibility-weighted averages | mu_k = sum_i gamma_ik x_i / sum_i gamma_ik |
| Local optima | EM depends on initialization | theta_0 matters |
6.1 E-step
Main idea. Compute posterior over latent variables.
Core relation: q(z) = p(z | x, theta_old), i.e. compute responsibilities under the current parameters.
The E-step fills in the missing latent variables in expectation rather than by hard assignment.
6.2 M-step
Main idea. Maximize expected complete-data log likelihood.
Core relation: theta_new = argmax_theta E_q[log p(x, z | theta)].
With the latent posterior fixed, the M-step is an ordinary maximum-likelihood problem on the expected complete-data log likelihood.
6.3 Lower bound
Main idea. EM improves an evidence lower bound.
Core relation: $\log p(x \mid \theta) \ge \mathbb{E}_q[\log p(x, z \mid \theta)] - \mathbb{E}_q[\log q(z)]$
6.4 Gaussian mixture updates
Main idea. Means become responsibility-weighted averages.
Core relation: $\mu_k = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}$ where $\gamma_{ik} = p(z_i = k \mid x_i)$
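The mean update can be sketched directly from the formula; the data points and responsibilities below are illustrative:

```python
def m_step_means(xs, resp):
    """M-step for a Gaussian mixture: means as responsibility-weighted averages."""
    n_components = len(resp[0])
    means = []
    for k in range(n_components):
        num = sum(r[k] * x for x, r in zip(xs, resp))  # sum_i gamma_ik * x_i
        den = sum(r[k] for r in resp)                  # sum_i gamma_ik
        means.append(num / den)
    return means

# With hard 0/1 responsibilities the update reduces to per-cluster averages.
xs = [0.0, 2.0, 10.0, 12.0]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
means = m_step_means(xs, resp)  # → [1.0, 11.0]
```

With soft responsibilities, every point contributes to every mean in proportion to its posterior weight.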
6.5 Local optima
Main idea. EM depends on initialization.
Core relation: $\theta_0$ matters; different initializations can converge to different local optima.
7. Graphical Models
This part studies graphical models as uncertainty-aware modeling. Keep track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Question | Formula |
|---|---|---|
| Directed model | factorize a joint distribution by parents | $p(x_1, \ldots, x_n) = \prod_i p(x_i \mid \mathrm{pa}(x_i))$ |
| Undirected model | factorize by potentials over cliques | $p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)$ |
| Conditional independence | graph structure encodes independence assumptions | $x \perp y \mid z$ |
| Inference | compute marginals or MAP assignments | $p(x_i) = \sum_{x_{\setminus i}} p(x)$ |
| Message passing | reuse local computations on graphs | $m_{i \to j}(x_j)$ |
7.1 Directed model
Main idea. Factorize a joint distribution by parents.
Core relation: $p(x_1, \ldots, x_n) = \prod_i p(x_i \mid \mathrm{pa}(x_i))$
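The parent factorization can be checked numerically on a small chain $a \to b \to c$; the conditional probability tables below are illustrative:

```python
# A three-node chain a -> b -> c with illustrative conditional probability tables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def joint(a, b, c):
    """Factorize by parents: p(a, b, c) = p(a) p(b | a) p(c | b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
# total is 1 up to float error: normalized per-node tables give a normalized joint
```

This is the practical payoff of directed factorization: three tiny tables define a valid eight-entry joint without storing it explicitly.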
7.2 Undirected model
Main idea. Factorize by potentials over cliques.
Core relation: $p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)$
7.3 Conditional independence
Main idea. Graph structure encodes independence assumptions.
Core relation: $x \perp y \mid z$ when every path between $x$ and $y$ is blocked given $z$
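The independence claim of a chain $a \to b \to c$ ($a \perp c \mid b$) can be verified numerically; the tables below are illustrative:

```python
# Numerical check of conditional independence on the chain a -> b -> c:
# given b, c should be independent of a, i.e. p(c | a, b) = p(c | b).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

def p_c_given_ab(a, b, c):
    """Condition on (a, b) by normalizing the joint over c."""
    return joint(a, b, c) / sum(joint(a, b, cc) for cc in (0, 1))

checks = [abs(p_c_given_ab(a, b, c) - p_c_given_b[b][c])
          for a in (0, 1) for b in (0, 1) for c in (0, 1)]
# every check is ~0: the graph's independence assumption holds in the numbers
```

This kind of brute-force check is a useful debugging habit for small graphical models before trusting a claimed factorization.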
7.4 Inference
Main idea. Compute marginals or MAP assignments.
Core relation: $p(x_i) = \sum_{x_{\setminus i}} p(x)$ or $\hat{x} = \arg\max_{x} p(x)$
7.5 Message passing
Main idea. Reuse local computations on graphs.
Core relation: $m_{i \to j}(x_j) = \sum_{x_i} \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i)$
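Message passing can be sketched on the same chain $a \to b \to c$: the marginal of $c$ is computed by two local summations rather than one global enumeration. The tables are illustrative:

```python
# Computing p(c) on the chain a -> b -> c by passing messages forward.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

# Message from a into b: sum out a locally.
msg_b = {b: sum(p_a[a] * p_b_given_a[a][b] for a in (0, 1)) for b in (0, 1)}
# Message from b into c reuses msg_b; no global enumeration over (a, b, c) needed.
p_c = {c: sum(msg_b[b] * p_c_given_b[b][c] for b in (0, 1)) for c in (0, 1)}
```

On a chain of length $n$ with $k$ states this costs $O(n k^2)$ instead of $O(k^n)$, which is the point of reusing local computations.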
8. Hidden Markov Models
This part studies hidden markov models as uncertainty-aware modeling. Keep track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Question | Formula |
|---|---|---|
| Markov state | current hidden state depends on previous hidden state | $p(z_t \mid z_{t-1})$ |
| Emission | observation depends on current hidden state | $p(x_t \mid z_t)$ |
| Forward recursion | compute filtered probabilities | $\alpha_t(j)$ |
| Viterbi | find most likely hidden state path | $\arg\max_{z_{1:T}} p(z_{1:T}, x_{1:T})$ |
| Sequence modeling bridge | HMMs are probabilistic predecessors of neural sequence models | $z_t$ hidden state |
8.1 Markov state
Main idea. Current hidden state depends on previous hidden state.
Core relation: $p(z_t \mid z_{1:t-1}) = p(z_t \mid z_{t-1})$
8.2 Emission
Main idea. Observation depends on current hidden state.
Core relation: $p(x_t \mid z_t)$
8.3 Forward recursion
Main idea. Compute filtered probabilities.
Core relation: $\alpha_t(j) = p(x_t \mid z_t = j) \sum_i \alpha_{t-1}(i) \, p(z_t = j \mid z_{t-1} = i)$
AI connection. This is dynamic programming for probabilistic sequence inference.
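The forward recursion can be sketched for a two-state HMM over binary observations; the initial distribution, transition matrix, and emission matrix below are illustrative:

```python
def forward(obs, pi, A, B):
    """HMM forward recursion over a discrete observation sequence."""
    n = len(pi)
    # alpha[j] = p(x_1, z_1 = j)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for t in range(1, len(obs)):
        # alpha_t(j) = B[j][x_t] * sum_i alpha_{t-1}(i) * A[i][j]
        alpha = [B[j][obs[t]] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return alpha  # sum(alpha) = p(x_{1:T}), the sequence likelihood

pi = [0.5, 0.5]                   # initial state distribution
A = [[0.9, 0.1], [0.1, 0.9]]      # transition matrix, rows sum to 1
B = [[0.8, 0.2], [0.3, 0.7]]      # emission matrix over 2 observation symbols
alpha = forward([0, 1, 0], pi, A, B)
likelihood = sum(alpha)
```

For long sequences the raw recursion underflows, so production code works with scaled alphas or log probabilities.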
8.4 Viterbi
Main idea. Find most likely hidden state path.
Core relation: $\delta_t(j) = p(x_t \mid z_t = j) \max_i \delta_{t-1}(i) \, p(z_t = j \mid z_{t-1} = i)$
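Viterbi is the forward recursion with max in place of sum, plus backpointers to recover the path. The two-state parameters below are illustrative:

```python
def viterbi(obs, pi, A, B):
    """Most likely hidden state path via max-product dynamic programming."""
    n = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    back = []
    for t in range(1, len(obs)):
        ptrs, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            ptrs.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs[t]])
        delta = new_delta
        back.append(ptrs)
    # Backtrack from the best final state through the stored pointers.
    path = [max(range(n), key=lambda j: delta[j])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]      # sticky transitions
B = [[0.8, 0.2], [0.3, 0.7]]
path = viterbi([0, 0, 1], pi, A, B)  # → [0, 0, 0]
```

With sticky transitions a single surprising observation does not flip the decoded state, which is why the path stays in state 0 here.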
8.5 Sequence modeling bridge
Main idea. HMMs are probabilistic predecessors of neural sequence models.
Core relation: $z_t$ hidden state, analogous to an RNN's learned hidden vector.
9. Approximate Inference
This part studies approximate inference as uncertainty-aware modeling. Keep track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Question | Formula |
|---|---|---|
| Monte Carlo | estimate expectations with samples | $\frac{1}{S} \sum_s f(x^{(s)})$ |
| Variational inference | approximate posterior with a tractable family | $\min_q \mathrm{KL}(q \,\|\, p(z \mid x))$ |
| ELBO | optimize a lower bound on log evidence | $\mathbb{E}_q[\log p(x, z) - \log q(z)]$ |
| Reparameterization | differentiate through random variables when possible | $z = \mu + \sigma \odot \epsilon$ |
| Amortized inference | use a neural network to predict variational parameters | $q_\phi(z \mid x)$ |
9.1 Monte Carlo
Main idea. Estimate expectations with samples.
Core relation: $\mathbb{E}_p[f(x)] \approx \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}), \quad x^{(s)} \sim p$
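A minimal Monte Carlo sketch: estimate $\mathbb{E}[x^2]$ for $x \sim \mathrm{Uniform}(0, 1)$, whose exact value is $1/3$, so the estimate can be checked against ground truth. The sample size is illustrative:

```python
import random

random.seed(0)  # fixed seed for reproducibility
# Estimate E[x^2] for x ~ Uniform(0, 1); the exact answer is 1/3.
samples = [random.random() ** 2 for _ in range(100_000)]
estimate = sum(samples) / len(samples)
```

The standard error shrinks as $1/\sqrt{S}$, so 100,000 samples put the estimate within roughly $\pm 0.003$ of the truth.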
9.2 Variational inference
Main idea. Approximate posterior with a tractable family.
Core relation: $q^* = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}(q(z) \,\|\, p(z \mid x))$
9.3 ELBO
Main idea. Optimize a lower bound on log evidence.
Core relation: $\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q(z) \,\|\, p(z \mid x)) \ge \mathrm{ELBO}(q)$
AI connection. The ELBO connects classical latent-variable models to VAEs and modern variational methods.
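The bound can be verified exactly in a toy model with one binary latent, where the evidence is a two-term sum. The probabilities below are illustrative:

```python
import math

# Toy latent-variable model: z in {0, 1} with the observed x already fixed,
# so the likelihood p(x | z) collapses to two numbers.
p_z = [0.5, 0.5]
p_x_given_z = [0.9, 0.2]
joint = [p_z[k] * p_x_given_z[k] for k in (0, 1)]   # p(x, z)
log_evidence = math.log(sum(joint))                 # log p(x)

def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a distribution q over the latent."""
    return sum(q[k] * (math.log(joint[k]) - math.log(q[k])) for k in (0, 1))

posterior = [j / sum(joint) for j in joint]
# elbo(posterior) equals log_evidence; any other q gives a strictly smaller value.
```

The gap between `log_evidence` and `elbo(q)` is exactly $\mathrm{KL}(q \,\|\, p(z \mid x))$, which is why maximizing the ELBO over $q$ recovers the posterior.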
9.4 Reparameterization
Main idea. Differentiate through random variables when possible.
Core relation: $z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$
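The trick can be sketched without any autodiff framework: sample the noise once, then treat $z$ as a deterministic function of the parameters. The values of `mu` and `sigma` are illustrative:

```python
import random

random.seed(0)
# Sampling z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1):
# the randomness lives entirely in eps, so z is a deterministic (and hence
# differentiable) function of (mu, sigma) given the sampled noise.
mu, sigma = 2.0, 0.5
eps = [random.gauss(0.0, 1.0) for _ in range(50_000)]
zs = [mu + sigma * e for e in eps]
mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / len(zs)
```

In a VAE, this is what lets gradients flow from the decoder loss back into the encoder's predicted `mu` and `sigma`.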
9.5 Amortized inference
Main idea. Use a neural network to predict variational parameters.
Core relation: $q_\phi(z \mid x)$, with one set of encoder parameters $\phi$ shared across all data points
10. Diagnostics
This part studies diagnostics as uncertainty-aware modeling. Keep track of observed variables, hidden variables, parameters, and decisions.
| Subtopic | Question | Formula |
|---|---|---|
| Log-likelihood | track data fit under the probabilistic model | $\sum_i \log p(x_i \mid \theta)$ |
| Posterior predictive checks | simulate from the fitted model and compare to data | $x^{\mathrm{rep}} \sim p(x^{\mathrm{rep}} \mid x)$ |
| Calibration curves | check probability quality | $P(y = 1 \mid \hat{p} = p) \approx p$ |
| Sensitivity to priors | posterior can change under weak data | $p(\theta \mid x) \propto p(x \mid \theta) \, p(\theta)$ |
| Ablations | compare generative, discriminative, latent, and neural versions | held-out log likelihood |
10.1 Log-likelihood
Main idea. Track data fit under the probabilistic model.
Core relation: $\ell(\theta) = \sum_i \log p(x_i \mid \theta)$
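A minimal sketch for Bernoulli data, reusing the 7-heads/3-tails example from earlier in the section:

```python
import math

def bernoulli_log_likelihood(data, theta):
    """Sum of log p(x_i | theta) for binary outcomes."""
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)

data = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # 7 heads, 3 tails
ll_mle = bernoulli_log_likelihood(data, 0.7)   # MLE: the raw frequency
ll_off = bernoulli_log_likelihood(data, 0.5)   # any other theta fits worse
```

Tracking this quantity on held-out data, not just training data, is what separates a fit diagnostic from a memorization diagnostic.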
10.2 Posterior predictive checks
Main idea. Simulate from the fitted model and compare to data.
Core relation: $x^{\mathrm{rep}} \sim \int p(x^{\mathrm{rep}} \mid \theta) \, p(\theta \mid x) \, d\theta$
AI connection. A probabilistic model should generate data that looks like the data it claims to explain.
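A posterior predictive check for the coin example can be sketched by sampling parameters from the posterior and then data from the likelihood. A uniform $\mathrm{Beta}(1, 1)$ prior is assumed here purely for illustration:

```python
import random

random.seed(0)
# Beta(1 + 7, 1 + 3) = Beta(8, 4) posterior after 7 heads / 3 tails
# under an assumed uniform Beta(1, 1) prior.
def sample_replicated_heads(n_flips=10):
    theta = random.betavariate(8, 4)                  # parameter from the posterior
    return sum(random.random() < theta for _ in range(n_flips))  # data from likelihood

reps = [sample_replicated_heads() for _ in range(20_000)]
mean_heads = sum(reps) / len(reps)
# replicated head counts center near 10 * (8/12), the posterior-mean prediction
```

If the observed statistic (here, 7 heads) sits far in the tail of the replicated distribution, the model is misfitting that aspect of the data.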
10.3 Calibration curves
Main idea. Check that predicted probabilities match observed frequencies.
Core relation: $P(y = 1 \mid \hat{p} = p) \approx p$
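A calibration curve can be sketched by bucketing predictions and comparing mean confidence with the empirical positive rate in each bucket. The toy predictions and labels are illustrative:

```python
def calibration_bins(probs, labels, n_bins=5):
    """Bucket predictions by confidence; return (mean prob, fraction positive) pairs."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # clamp p == 1.0 into the last bin
        bins[idx].append((p, y))
    out = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            out.append((mean_p, frac_pos))
    return out

probs = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.5, 0.5]
labels = [0, 0, 1, 1, 1, 0, 1, 0]
curve = calibration_bins(probs, labels)
```

A well-calibrated model has `mean_p` close to `frac_pos` in every bucket; here the 0.9 bucket is overconfident (only 75% of its predictions are positive).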
10.4 Sensitivity to priors
Main idea. Posterior can change under weak data.
Core relation: $p(\theta \mid x) \propto p(x \mid \theta) \, p(\theta)$; with weak data the prior term dominates
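Prior sensitivity can be sketched with the Beta-Bernoulli posterior mean under two different priors; the prior strengths and sample sizes are illustrative:

```python
# Posterior mean of a Beta-Bernoulli model: (alpha + heads) / (alpha + beta + n).
def posterior_mean(alpha, beta, heads, tails):
    return (alpha + heads) / (alpha + beta + heads + tails)

# Weak data (n = 10): a uniform prior and a strong Beta(10, 10) prior disagree.
weak = (posterior_mean(1, 1, 7, 3), posterior_mean(10, 10, 7, 3))
# Strong data (n = 1000): the same two priors barely matter.
strong = (posterior_mean(1, 1, 700, 300), posterior_mean(10, 10, 700, 300))
```

Re-running an analysis under a few plausible priors, as above, is the cheapest sensitivity diagnostic: if conclusions flip, the data are not strong enough to overrule the prior.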
10.5 Ablations
Main idea. Compare generative, discriminative, latent, and neural versions.
Core relation: compare held-out $\sum_i \log p(x_i \mid \theta)$ across model variants
Practice Exercises
- Normalize a discrete probability vector.
- Compute Bernoulli log likelihood and MLE.
- Compute a Beta-Bernoulli posterior.
- Compute a MAP estimate with pseudo-counts.
- Classify with naive Bayes.
- Compute Gaussian mixture responsibilities.
- Perform one EM mean update.
- Run one HMM forward step.
- Estimate an expectation by Monte Carlo.
- Write a probabilistic-model debugging checklist.
Why This Matters for AI
Modern AI systems need uncertainty: calibrated classifiers, latent-variable generative models, retrieval confidence, Bayesian decision rules, and probabilistic sequence models. Neural networks often produce the parameters of probability distributions, so probabilistic modeling explains what those outputs mean.
Bridge to RNN and LSTM Math
Hidden Markov models are probabilistic sequence models with latent states. RNNs replace explicit latent-state probabilities with learned hidden vectors and differentiable recurrence.
References
- Kevin Murphy, "Probabilistic Machine Learning: An Introduction", 2022: https://probml.github.io/pml-book/book1.html
- Christopher Bishop, "Pattern Recognition and Machine Learning", 2006.
- A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", 1977: https://academic.oup.com/jrsssb/article/39/1/1/7027539
- Lawrence Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", 1989.