Linear models are the first serious model family in machine learning: simple enough to solve and inspect, strong enough to be useful, and foundational for understanding optimization, regularization, classification, and modern neural network heads.
Overview
The basic prediction is $\hat{y} = w^\top x + b$.
For a design matrix $X$, all predictions at once are $\hat{y} = Xw + b\mathbf{1}$.
From this one form we get least squares, ridge regression, lasso, logistic regression, softmax regression, linear probes, and the final LM head of a transformer.
Prerequisites
- Vectors, matrices, dot products, and norms
- Gradients and convex optimization basics
- Probability and cross-entropy for classification
- Train/validation evaluation vocabulary
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates least squares, gradient descent, ridge, lasso-style shrinkage intuition, logistic regression, calibration, SVD conditioning, and linear probes. |
| exercises.ipynb | Ten practice problems for closed-form solves, gradients, regularization, classification probabilities, and diagnostics. |
Learning Objectives
After this section, you should be able to:
- Write linear regression and classification models in vectorized form.
- Derive the least-squares normal equations.
- Explain projection, pseudoinverse, rank, and conditioning.
- Implement gradient descent and check its gradient.
- Explain ridge, lasso, elastic net, and the bias-variance tradeoff.
- Compute logistic and softmax probabilities.
- Interpret linear probes and LM heads in modern AI.
- Build a diagnostic checklist for linear models.
Table of Contents
- Linear Prediction
- 1.1 Feature vector
- 1.2 Affine score
- 1.3 Design matrix
- 1.4 Vectorized prediction
- 1.5 Linear decision boundary
- Least Squares Regression
- 2.1 Residuals
- 2.2 Squared loss
- 2.3 Normal equations
- 2.4 Projection view
- 2.5 Pseudoinverse
- Gradient Descent
- 3.1 Gradient
- 3.2 Update
- 3.3 Learning rate
- 3.4 Stochastic gradients
- 3.5 Feature scaling
- Regularization
- 4.1 Ridge
- 4.2 Ridge solution
- 4.3 Lasso
- 4.4 Elastic net
- 4.5 Bias variance
- Linear Classification
- 5.1 Logistic regression
- 5.2 Binary cross-entropy
- 5.3 Softmax regression
- 5.4 Margin
- 5.5 Linear separability
- Optimization Geometry
- 6.1 Convexity
- 6.2 Condition number
- 6.3 SVD view
- 6.4 Collinearity
- 6.5 Whitening
- Evaluation
- 7.1 Train validation split
- 7.2 MSE and MAE
- 7.3 Accuracy and log loss
- 7.4 Calibration
- 7.5 Cross-validation
- Linear Models in AI
- 8.1 Baseline strength
- 8.2 Linear probes
- 8.3 LM head
- 8.4 Logit lens
- 8.5 Interpretability
- Implementation Details
- 9.1 Intercept handling
- 9.2 Numerical solve
- 9.3 Standardization leakage
- 9.4 Rank checks
- 9.5 Residual diagnostics
- Diagnostics
- 10.1 Shape checks
- 10.2 Gradient check
- 10.3 Regularization path
- 10.4 Influence
- 10.5 Baseline comparison
Shape Map
| Object | Symbol | Shape |
|---|---|---|
| features | X | (n, d) |
| targets | y | (n,) |
| weights | w | (d,) |
| predictions | y_hat | (n,) |
| multi-class weights | W | (K, d) |
| logits | Z | (n, K) |
1. Linear Prediction
This part studies linear prediction as the simplest useful supervised-learning model family. The important habit is to connect algebra, geometry, optimization, and diagnostics.
| Subtopic | Idea | Formula |
|---|---|---|
| Feature vector | represent each example as coordinates | $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ |
| Affine score | combine features by weighted sum plus bias | $\hat{y} = w^\top x + b$ |
| Design matrix | stack examples row-wise | $X \in \mathbb{R}^{n \times d}$ |
| Vectorized prediction | predict all examples at once | $\hat{y} = Xw + b\mathbf{1}$ |
| Linear decision boundary | classification threshold creates a hyperplane | $w^\top x + b = 0$ |
1.1 Feature vector
Main idea. Represent each example as coordinates.
Core relation: $x = (x_1, \dots, x_d) \in \mathbb{R}^d$.
Linear models are not weak because they are simple. They are useful because their assumptions are explicit. A linear model says that the target is explained by additive feature contributions, after whatever feature map has been chosen.
Worked micro-example. If $x = (2, 1)$, $w = (0.5, -1)$, and $b = 0.2$, then $\hat{y} = 0.5 \cdot 2 - 1 \cdot 1 + 0.2 = 0.2$. The whole model is one dot product plus a bias, but the feature design determines how expressive that dot product can be.
Implementation check. Always inspect shapes, feature scaling, rank, residuals, and validation loss. A closed-form solution can still generalize poorly if the features or split are wrong.
AI connection. Deep networks learn the feature map, but the prediction step that follows is still a dot product over features; feature design is the part of a linear model you control directly.
Common mistake. Do not interpret a coefficient without considering feature scaling and correlation. A large coefficient may reflect units or collinearity, not causal importance.
1.2 Affine score
Main idea. Combine features by weighted sum plus bias.
Core relation: $\hat{y} = w^\top x + b$.
1.3 Design matrix
Main idea. Stack examples row-wise.
Core relation: $X \in \mathbb{R}^{n \times d}$, where row $i$ is $x_i^\top$.
1.4 Vectorized prediction
Main idea. Predict all examples at once.
Core relation: $\hat{y} = Xw + b\mathbf{1}$.
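A minimal NumPy sketch of the vectorized form, using the shapes from the Shape Map (the data here is synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # design matrix, shape (n, d)
w = np.array([0.5, -1.0, 2.0])   # weights, shape (d,)
b = 0.2                          # bias

y_hat = X @ w + b                # all n predictions at once, shape (n,)
assert y_hat.shape == (5,)
```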
1.5 Linear decision boundary
Main idea. Classification threshold creates a hyperplane.
Core relation: the decision boundary is the hyperplane $\{x : w^\top x + b = 0\}$.
2. Least Squares Regression
This part develops least squares regression, the canonical way to fit a linear model to real-valued targets.
| Subtopic | Idea | Formula |
|---|---|---|
| Residuals | prediction errors form a vector | $r = y - Xw$ |
| Squared loss | penalize large residuals quadratically | $L(w) = \tfrac{1}{2}\lVert y - Xw \rVert_2^2$ |
| Normal equations | set the gradient to zero | $X^\top X w = X^\top y$ |
| Projection view | least squares projects $y$ onto the column space of $X$ | $\hat{y} = X(X^\top X)^{-1} X^\top y$ |
| Pseudoinverse | handle rank-deficient or rectangular systems | $w = X^{+} y$ |
2.1 Residuals
Main idea. Prediction errors form a vector.
Core relation: $r = y - Xw$.
2.2 Squared loss
Main idea. Penalize large residuals quadratically.
Core relation: $L(w) = \tfrac{1}{2}\lVert y - Xw \rVert_2^2$.
2.3 Normal equations
Main idea. Set the gradient to zero.
Core relation: $X^\top X w = X^\top y$.
AI connection. This is the closed-form anchor for least-squares learning.
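A sketch of the closed-form solve on synthetic data. Solving the normal equations with `np.linalg.solve` is fine for small, well-conditioned problems; `np.linalg.lstsq` is the safer general-purpose default (see 9.2):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Solve X^T X w = X^T y without forming an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the QR/SVD-based solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
```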
2.4 Projection view
Main idea. Least squares projects $y$ onto the column space of $X$.
Core relation: $\hat{y} = X(X^\top X)^{-1} X^\top y = Py$, where $P$ is the orthogonal projector onto $\mathrm{col}(X)$ (assuming $X$ has full column rank).
2.5 Pseudoinverse
Main idea. Handle rank-deficient or rectangular systems.
Core relation: $w = X^{+} y$, where $X^{+}$ is the Moore–Penrose pseudoinverse.
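When $X$ is rank-deficient the normal equations have infinitely many solutions; `np.linalg.pinv` returns the minimum-norm one. A small sketch with an exactly duplicated column:

```python
import numpy as np

X = np.array([[1.0, 2.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0]])    # last two columns identical: rank 2
y = np.array([1.0, 2.0, 3.0])

w = np.linalg.pinv(X) @ y           # minimum-norm least-squares solution
print(np.linalg.matrix_rank(X), w)  # rank 2; duplicate columns share weight
```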
3. Gradient Descent
This part develops gradient descent, the iterative solver that replaces the closed-form solve when problems get large.
| Subtopic | Idea | Formula |
|---|---|---|
| Gradient | loss gradient points uphill in parameter space | $\nabla_w L = X^\top (Xw - y)$ |
| Update | move opposite the gradient | $w \leftarrow w - \eta\, \nabla_w L$ |
| Learning rate | step size must respect curvature | $0 < \eta < 2/\lambda_{\max}(X^\top X)$ |
| Stochastic gradients | estimate the gradient from mini-batches | $\nabla_w L \approx \tfrac{n}{\lvert B \rvert} \sum_{i \in B} x_i (x_i^\top w - y_i)$ |
| Feature scaling | rescale features to improve conditioning | $x_j \leftarrow (x_j - \mu_j)/\sigma_j$ |
3.1 Gradient
Main idea. Loss gradient points uphill in parameter space.
Core relation: for $L(w) = \tfrac{1}{2}\lVert Xw - y \rVert_2^2$, $\nabla_w L = X^\top (Xw - y)$.
3.2 Update
Main idea. Move opposite the gradient.
Core relation: $w \leftarrow w - \eta\, \nabla_w L$.
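A minimal full-batch gradient-descent loop for the loss $\tfrac{1}{2}\lVert Xw - y \rVert_2^2$ on synthetic data, with the step size chosen from the curvature bound in 3.3:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w = np.zeros(3)
eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # safely below 2 / lambda_max

for step in range(500):
    grad = X.T @ (X @ w - y)  # gradient of 0.5 * ||Xw - y||^2
    w -= eta * grad

print(w)  # close to w_true
```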
3.3 Learning rate
Main idea. Step size must respect curvature.
Core relation: $0 < \eta < 2/\lambda_{\max}(X^\top X)$ for the loss above; larger steps diverge along the sharpest curvature direction.
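The bound is easy to verify empirically; a sketch that runs the same update just below and just above it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)

eta_max = 2.0 / np.linalg.eigvalsh(X.T @ X).max()

for eta in (0.9 * eta_max, 1.1 * eta_max):
    w = np.zeros(3)
    for _ in range(200):
        w -= eta * X.T @ (X @ w - y)
    print(f"eta = {eta:.2e}  ->  |w| = {np.linalg.norm(w):.3e}")
# Below the bound the iterates stay bounded; above it they blow up.
```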
3.4 Stochastic gradients
Main idea. Estimate gradient from mini-batches.
Core relation: $\nabla_w L \approx \tfrac{n}{\lvert B \rvert} \sum_{i \in B} x_i (x_i^\top w - y_i)$ for a mini-batch $B$.
3.5 Feature scaling
Main idea. Rescale features to improve conditioning.
Core relation: $x_j \leftarrow (x_j - \mu_j)/\sigma_j$ for each feature $j$.
4. Regularization
This part develops regularization, the standard tool for trading a little bias for a large reduction in variance.
| Subtopic | Idea | Formula |
|---|---|---|
| Ridge | penalize the squared parameter norm | $L(w) = \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2$ |
| Ridge solution | shift the Gram matrix spectrum | $w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$ |
| Lasso | penalize absolute values to encourage sparsity | $L(w) = \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1$ |
| Elastic net | combine L1 sparsity and L2 shrinkage | $L(w) = \lVert y - Xw \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2$ |
| Bias variance | regularization trades bias for lower variance | error $=$ bias$^2$ $+$ variance $+$ noise |
4.1 Ridge
Main idea. Penalize squared parameter norm.
Core relation: $L(w) = \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2$.
4.2 Ridge solution
Main idea. Shift the Gram matrix spectrum.
Core relation: $w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
AI connection. Ridge is one of the cleanest examples of regularization improving numerical stability.
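A sketch of the closed-form ridge solve; the identity shift makes the system solvable even when $X^\top X$ is exactly singular:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.c_[X, X[:, 0]]            # duplicate a column: X^T X is singular
y = rng.normal(size=50)

lam = 1.0
d = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge)                   # finite and stable despite exact collinearity
```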
4.3 Lasso
Main idea. Penalize absolute values to encourage sparsity.
Core relation: $L(w) = \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1$.
4.4 Elastic net
Main idea. Combine L1 sparsity and L2 shrinkage.
Core relation: $L(w) = \lVert y - Xw \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2$.
4.5 Bias variance
Main idea. Regularization trades bias for lower variance.
Core relation: expected error $=$ bias$^2$ $+$ variance $+$ irreducible noise; increasing $\lambda$ raises bias and lowers variance.
5. Linear Classification
This part develops linear classification, where the same affine score is passed through a link function to produce class probabilities.
| Subtopic | Idea | Formula |
|---|---|---|
| Logistic regression | map affine score to probability | $p = \sigma(w^\top x + b)$, $\sigma(z) = 1/(1 + e^{-z})$ |
| Binary cross-entropy | negative Bernoulli log-likelihood | $\ell = -[\,y \log p + (1 - y) \log(1 - p)\,]$ |
| Softmax regression | multi-class linear logits | $p_k = e^{z_k} / \sum_j e^{z_j}$, $z = Wx + b$ |
| Margin | signed distance proxy for confidence | $m = y\,(w^\top x + b)$, $y \in \{-1, +1\}$ |
| Linear separability | a hyperplane can perfectly split classes only in some feature spaces | $y_i (w^\top x_i + b) > 0$ for all $i$ |
5.1 Logistic regression
Main idea. Map affine score to probability.
Core relation: $p(y{=}1 \mid x) = \sigma(w^\top x + b)$ with $\sigma(z) = 1/(1 + e^{-z})$.
5.2 Binary cross-entropy
Main idea. Negative Bernoulli log-likelihood.
Core relation: $\ell = -[\,y \log p + (1 - y) \log(1 - p)\,]$.
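A sketch of the sigmoid and binary cross-entropy, with the usual clipping so that log(0) never occurs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([2.0, -1.0, 0.0])   # affine scores
y = np.array([1.0, 0.0, 1.0])    # labels
p = sigmoid(z)
print(p, binary_cross_entropy(y, p))
```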
5.3 Softmax regression
Main idea. Multi-class linear logits.
Core relation: $p_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$ with logits $z = Wx + b$.
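A numerically stable softmax subtracts the row-wise maximum before exponentiating, which leaves the probabilities unchanged but prevents overflow; a sketch:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # shift; probabilities unchanged
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

Z = np.array([[2.0, 1.0, 0.1],
              [1000.0, 999.0, 998.0]])    # naive exp(1000) would overflow
P = softmax(Z)
print(P, P.sum(axis=1))                   # each row sums to 1
```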
5.4 Margin
Main idea. Signed distance proxy for confidence.
Core relation: $m = y\,(w^\top x + b)$ for labels $y \in \{-1, +1\}$.
5.5 Linear separability
Main idea. A hyperplane can perfectly split classes only in some feature spaces.
Core relation: the data are linearly separable when some $(w, b)$ satisfies $y_i (w^\top x_i + b) > 0$ for all $i$.
6. Optimization Geometry
This part develops the optimization geometry behind linear models: why some fits are easy and others are numerically fragile.
| Subtopic | Idea | Formula |
|---|---|---|
| Convexity | linear regression and logistic regression losses are convex in $w$ | $\nabla_w^2 L \succeq 0$ |
| Condition number | ill-conditioned features slow optimization | $\kappa = \lambda_{\max}/\lambda_{\min}$ |
| SVD view | singular values reveal stable and unstable directions | $X = U \Sigma V^\top$ |
| Collinearity | correlated features make coefficients unstable | $X^\top X$ nearly singular |
| Whitening | decorrelate features when appropriate | $\tilde{x} = C^{-1/2}(x - \mu)$ |
6.1 Convexity
Main idea. Linear regression and logistic regression losses are convex in w.
Core relation: $\nabla_w^2 L \succeq 0$, so every local minimum is global.
6.2 Condition number
Main idea. Ill-conditioned features slow optimization.
Core relation: $\kappa = \lambda_{\max}/\lambda_{\min}$, the eigenvalue ratio of $X^\top X$.
6.3 SVD view
Main idea. Singular values reveal stable and unstable directions.
Core relation: $X = U \Sigma V^\top$.
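A sketch connecting the SVD to the condition number from 6.2: a nearly duplicated column produces a tiny trailing singular value and a huge $\kappa$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.c_[X, X[:, 0] + 1e-6 * rng.normal(size=100)]  # near-duplicate column

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s)                         # trailing singular value is tiny
print((s.max() / s.min()) ** 2)  # condition number of X^T X: enormous
```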
6.4 Collinearity
Main idea. Correlated features make coefficients unstable.
Core relation: $X^\top X$ is nearly singular, so coefficients can swing wildly while predictions barely change.
6.5 Whitening
Main idea. Decorrelate features when appropriate.
Core relation: $\tilde{x} = C^{-1/2}(x - \mu)$, where $C$ is the feature covariance.
7. Evaluation
This part develops evaluation: how to tell whether a fitted linear model actually generalizes.
| Subtopic | Idea | Formula |
|---|---|---|
| Train validation split | measure generalization on held-out data | fit on train, score on validation |
| MSE and MAE | regression errors emphasize different residual behavior | $\mathrm{MSE} = \tfrac{1}{n} \sum_i (y_i - \hat{y}_i)^2$, $\mathrm{MAE} = \tfrac{1}{n} \sum_i \lvert y_i - \hat{y}_i \rvert$ |
| Accuracy and log loss | classification quality needs both decisions and probabilities | log loss $= -\tfrac{1}{n} \sum_i \log p_i(y_i)$ |
| Calibration | predicted probabilities should match empirical frequencies | $P(y{=}1 \mid \hat{p} \approx q) \approx q$ |
| Cross-validation | average validation metrics over multiple folds | $K$ folds |
7.1 Train validation split
Main idea. Measure generalization on held-out data.
Core relation: fit parameters on the training split; report metrics on the held-out validation split.
7.2 MSE and MAE
Main idea. Regression errors emphasize different residual behavior.
Core relation: $\mathrm{MSE} = \tfrac{1}{n} \sum_i (y_i - \hat{y}_i)^2$ and $\mathrm{MAE} = \tfrac{1}{n} \sum_i \lvert y_i - \hat{y}_i \rvert$; MSE punishes outliers much harder.
7.3 Accuracy and log loss
Main idea. Classification quality needs both decisions and probabilities.
Core relation: accuracy $= \tfrac{1}{n} \sum_i \mathbf{1}[\hat{y}_i = y_i]$; log loss $= -\tfrac{1}{n} \sum_i \log p_i(y_i)$.
7.4 Calibration
Main idea. Predicted probabilities should match empirical frequencies.
Core relation: among examples with predicted probability near $q$, the empirical positive rate should be near $q$.
7.5 Cross-validation
Main idea. Average validation over multiple folds.
Core relation: split the data into $K$ folds; train on $K-1$ folds, validate on the held-out fold, rotate through all $K$, and average.
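A minimal K-fold sketch around the closed-form ridge solve from 4.2; a library helper such as scikit-learn's `KFold` does the same bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

K, lam = 5, 0.1
folds = np.array_split(rng.permutation(len(y)), K)

scores = []
for k in range(K):
    val = folds[k]
    tr = np.concatenate([folds[j] for j in range(K) if j != k])
    w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(3), X[tr].T @ y[tr])
    scores.append(np.mean((X[val] @ w - y[val]) ** 2))  # fold MSE

print(np.mean(scores))  # cross-validated MSE
```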
8. Linear Models in AI
This part surveys where linear models appear inside modern AI systems.
| Subtopic | Idea | Formula |
|---|---|---|
| Baseline strength | linear models expose whether features already solve the task | $\hat{y} = w^\top \phi(x) + b$ |
| Linear probes | frozen representations can be tested by a linear head | $\hat{y} = Wh + b$, $h$ frozen |
| LM head | language models end with a linear projection to vocabulary logits | $\mathrm{logits} = W_{\text{vocab}}\, h$ |
| Logit lens | intermediate states can be projected with the LM head | $\mathrm{logits}_\ell = W_{\text{vocab}}\, h_\ell$ |
| Interpretability | linear weights are easier to inspect than deep nonlinear interactions | $w_j$ = feature effect |
8.1 Baseline strength
Main idea. Linear models expose whether features already solve the task.
Core relation: $\hat{y} = w^\top \phi(x) + b$ for a fixed feature map $\phi$; if this baseline is already accurate, the task lives in the features.
8.2 Linear probes
Main idea. Frozen representations can be tested by a linear head.
Core relation: $\hat{y} = Wh + b$, where $h$ is a frozen hidden representation.
AI connection. Linear probes are a practical way to test what a deep representation already contains.
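A probe is just a linear classifier trained on frozen features. A sketch assuming scikit-learn is available; `H` is a random stand-in for hidden states extracted from a frozen network:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 64))                # stand-in for frozen hidden states
labels = (H[:, 0] + H[:, 1] > 0).astype(int)  # property we hope H encodes

probe = LogisticRegression(max_iter=1000).fit(H[:400], labels[:400])
print(probe.score(H[400:], labels[400:]))     # high accuracy => linearly decodable
```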
8.3 LM head
Main idea. Language models end with a linear projection to vocabulary logits.
Core relation: $\mathrm{logits} = W_{\text{vocab}}\, h$ for the final hidden state $h$.
AI connection. Even the final step of a giant language model is a linear model over hidden features.
8.4 Logit lens
Main idea. Intermediate states can be projected with the LM head.
Core relation: $\mathrm{logits}_\ell = W_{\text{vocab}}\, h_\ell$ for the hidden state at layer $\ell$.
8.5 Interpretability
Main idea. Linear weights are easier to inspect than deep nonlinear interactions.
Core relation: $w_j$ is the effect on the score of a unit change in feature $j$, holding the other features fixed.
9. Implementation Details
This part covers the implementation details that separate a correct linear-model fit from a silently wrong one.
| Subtopic | Idea | Formula |
|---|---|---|
| Intercept handling | bias can be a separate scalar or a column of ones | $X \leftarrow [\mathbf{1}, X]$ |
| Numerical solve | prefer stable solvers over an explicit matrix inverse | lstsq (QR/SVD), not $(X^\top X)^{-1} X^\top y$ |
| Standardization leakage | fit scaling parameters on train data only | $\mu, \sigma$ from the training split |
| Rank checks | inspect rank before trusting coefficients | $\mathrm{rank}(X) \stackrel{?}{=} d$ |
| Residual diagnostics | plot residuals to find nonlinearity or outliers | $r_i = y_i - \hat{y}_i$ vs. $\hat{y}_i$ |
9.1 Intercept handling
Main idea. Bias can be a separate scalar or a column of ones.
Core relation: appending a column of ones, $X \leftarrow [\mathbf{1}, X]$, folds the bias into $w$.
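A sketch of the column-of-ones trick; after the solve, the first entry of the weight vector is the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=100)

X1 = np.c_[np.ones(len(X)), X]       # prepend the ones column
w1, *_ = np.linalg.lstsq(X1, y, rcond=None)
b, w = w1[0], w1[1:]                 # bias folded into the first weight
print(b, w)                          # b close to 3.0
```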
9.2 Numerical solve
Main idea. Prefer stable solvers over explicit matrix inverse.
Core relation: solve $\min_w \lVert Xw - y \rVert_2$ with a QR- or SVD-based routine such as lstsq, rather than forming $(X^\top X)^{-1}$.
9.3 Standardization leakage
Main idea. Fit scaling parameters on train data only.
Core relation: compute $\mu$ and $\sigma$ on the training split only, then apply them unchanged to validation and test data.
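A sketch of leakage-free standardization: the validation split is transformed with training statistics and never refit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
train, val = X[:80], X[80:]

mu = train.mean(axis=0)        # statistics from the training split only
sigma = train.std(axis=0)

train_std = (train - mu) / sigma
val_std = (val - mu) / sigma   # reuse train mu/sigma; never refit on val
```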
9.4 Rank checks
Main idea. Inspect rank before trusting coefficients.
Core relation: check whether $\mathrm{rank}(X) = d$; a rank-deficient design means the coefficients are not uniquely determined.
9.5 Residual diagnostics
Main idea. Plot residuals to find nonlinearity or outliers.
Core relation: plot $r_i = y_i - \hat{y}_i$ against $\hat{y}_i$; visible structure signals nonlinearity, heteroscedasticity, or outliers.
10. Diagnostics
This part collects the diagnostics worth running on every linear-model fit.
| Subtopic | Idea | Formula |
|---|---|---|
| Shape checks | confirm matrix dimensions before fitting | $X$: $(n, d)$, $w$: $(d,)$, $\hat{y}$: $(n,)$ |
| Gradient check | compare analytic and finite-difference gradients | $\partial L / \partial w_j \approx [L(w + \epsilon e_j) - L(w - \epsilon e_j)] / (2\epsilon)$ |
| Regularization path | plot coefficients as $\lambda$ changes | $w_\lambda$ vs. $\lambda$ |
| Influence | outliers can dominate squared loss | one large $(y_i - \hat{y}_i)^2$ term can dominate $L$ |
| Baseline comparison | compare linear, regularized, and nonlinear models | validation metric per model family |
10.1 Shape checks
Main idea. Confirm matrix dimensions before fitting.
Core relation: $X$ is $(n, d)$, $w$ is $(d,)$, $\hat{y}$ is $(n,)$; multi-class $W$ is $(K, d)$ and logits are $(n, K)$.
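A few asserts mirroring the Shape Map catch most wiring bugs before any math runs; a sketch:

```python
import numpy as np

n, d, K = 32, 8, 5
X = np.zeros((n, d))
w = np.zeros(d)
W = np.zeros((K, d))

assert X.shape == (n, d) and w.shape == (d,)
assert (X @ w).shape == (n,)      # predictions
assert (X @ W.T).shape == (n, K)  # multi-class logits
```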
10.2 Gradient check
Main idea. Compare analytic and finite-difference gradients.
Core relation: $\partial L / \partial w_j \approx [L(w + \epsilon e_j) - L(w - \epsilon e_j)] / (2\epsilon)$.
AI connection. Linear models are perfect for learning how to verify optimization code.
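A central-difference check against the analytic gradient of $\tfrac{1}{2}\lVert Xw - y \rVert_2^2$; a relative error near 1e-8 or below suggests the analytic gradient is correct:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w = rng.normal(size=4)

def loss(w):
    return 0.5 * np.sum((X @ w - y) ** 2)

analytic = X.T @ (X @ w - y)

eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    numeric[j] = (loss(w + e) - loss(w - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)) / np.max(np.abs(analytic)))
```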
10.3 Regularization path
Main idea. Plot coefficients as $\lambda$ changes.
Core relation: trace $w_\lambda$ as $\lambda$ sweeps from large to small; coefficients that survive heavy regularization carry the most robust signal.
10.4 Influence
Main idea. Outliers can dominate squared loss.
Core relation: a single large residual contributes $(y_i - \hat{y}_i)^2$ to the loss and can drag the whole fit toward one point.
10.5 Baseline comparison
Main idea. Compare linear, regularized, and nonlinear models.
Core relation: compare validation metrics for a plain linear model, a regularized one, and a nonlinear model; the gaps show whether features or capacity are the bottleneck.
Practice Exercises
- Compute an affine prediction.
- Solve a small least-squares problem.
- Derive and compute the squared-loss gradient.
- Run a few gradient-descent steps.
- Compute a ridge solution.
- Compare coefficient shrinkage under different lambdas.
- Compute logistic probability and binary cross-entropy.
- Compute softmax probabilities.
- Check feature standardization and leakage.
- Write a linear-model debugging checklist.
Why This Matters for AI
Linear models appear inside modern AI more often than one might expect. A classifier head is linear. A language model head is linear. A linear probe tests representation quality. Ridge and lasso teach regularization. Least squares teaches projection geometry. Logistic regression teaches probabilistic classification and cross-entropy.
Bridge to Neural Networks
Neural networks can be viewed as learned feature maps followed by linear heads. The next section studies what changes when the feature map itself becomes trainable and nonlinear.
References
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, "The Elements of Statistical Learning", 2nd ed.: https://web.stanford.edu/~hastie/ElemStatLearn/
- Christopher Bishop, "Pattern Recognition and Machine Learning", 2006.
- Arthur Hoerl and Robert Kennard, "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics, 1970: https://www.tandfonline.com/doi/abs/10.1080/00401706.1970.10488634
- Robert Tibshirani, "Regression Shrinkage and Selection via the Lasso", 1996: https://academic.oup.com/jrsssb/article/58/1/267/7027929