Math for LLMs

Linear Models

Linear models are the first serious model family in machine learning: simple enough to solve and inspect, strong enough to be useful, and foundational for understanding optimization, regularization, classification, and modern neural network heads.

Overview

The basic prediction is:

\hat y = w^\top x + b.

For a design matrix X, all predictions are:

\hat y = Xw + b\mathbf{1}.

From this one form we get least squares, ridge regression, lasso, logistic regression, softmax regression, linear probes, and the final LM head of a transformer.

Prerequisites

  • Vectors, matrices, dot products, and norms
  • Gradients and convex optimization basics
  • Probability and cross-entropy for classification
  • Train/validation evaluation vocabulary

Companion Notebooks

| Notebook | Purpose |
| --- | --- |
| theory.ipynb | Demonstrates least squares, gradient descent, ridge, lasso-style shrinkage intuition, logistic regression, calibration, SVD conditioning, and linear probes. |
| exercises.ipynb | Ten practice problems for closed-form solves, gradients, regularization, classification probabilities, and diagnostics. |

Learning Objectives

After this section, you should be able to:

  • Write linear regression and classification models in vectorized form.
  • Derive the least-squares normal equations.
  • Explain projection, pseudoinverse, rank, and conditioning.
  • Implement gradient descent and check its gradient.
  • Explain ridge, lasso, elastic net, and the bias-variance tradeoff.
  • Compute logistic and softmax probabilities.
  • Interpret linear probes and LM heads in modern AI.
  • Build a diagnostic checklist for linear models.

Table of Contents

  1. Linear Prediction
  2. Least Squares Regression
  3. Gradient Descent
  4. Regularization
  5. Linear Classification
  6. Optimization Geometry
  7. Evaluation
  8. Linear Models in AI
  9. Implementation Details
  10. Diagnostics

Shape Map

features:       X       shape (n, d)
targets:        y       shape (n,)
weights:        w       shape (d,)
predictions:    y_hat   shape (n,)
multi-class W:  W       shape (K, d)
logits:         Z       shape (n, K)

1. Linear Prediction

This part studies linear prediction as the simplest useful supervised-learning model family. The important habit is to connect algebra, geometry, optimization, and diagnostics.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Feature vector | represent each example as coordinates | x \in \mathbb{R}^d |
| Affine score | combine features by weighted sum plus bias | \hat y = w^\top x + b |
| Design matrix | stack examples row-wise | X \in \mathbb{R}^{n \times d} |
| Vectorized prediction | predict all examples at once | \hat y = Xw + b\mathbf{1} |
| Linear decision boundary | classification threshold creates a hyperplane | w^\top x + b = 0 |

1.1 Feature vector

Main idea. Represent each example as coordinates.

Core relation:

x \in \mathbb{R}^d

Linear models are not weak because they are simple. They are useful because their assumptions are explicit. A linear model says that the target is explained by additive feature contributions, after whatever feature map has been chosen.

Worked micro-example. If x=[2,3], w=[0.5,-1], and b=4, then \hat y = 0.5 \cdot 2 - 1 \cdot 3 + 4 = 2. The whole model is one dot product plus a bias, but the feature design determines how expressive that dot product can be.

Implementation check. Always inspect shapes, feature scaling, rank, residuals, and validation loss. A closed-form solution can still generalize poorly if the features or split are wrong.

AI connection. In a deep network the hand-designed feature vector is replaced by a learned representation, but the final prediction over that representation is still a dot product.

Common mistake. Do not interpret a coefficient without considering feature scaling and correlation. A large coefficient may reflect units or collinearity, not causal importance.

1.2 Affine score

Main idea. Combine features by weighted sum plus bias.

Core relation:

\hat y = w^\top x + b


1.3 Design matrix

Main idea. Stack examples row-wise.

Core relation:

X \in \mathbb{R}^{n \times d}


1.4 Vectorized prediction

Main idea. Predict all examples at once.

Core relation:

\hat y = Xw + b\mathbf{1}
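
Code sketch. A minimal NumPy illustration with made-up numbers; it checks that the batched matrix form agrees with per-example dot products.

```python
import numpy as np

# Hypothetical data: n = 3 examples, d = 2 features.
X = np.array([[2.0, 3.0],
              [1.0, 0.0],
              [0.5, -1.0]])   # shape (n, d)
w = np.array([0.5, -1.0])     # shape (d,)
b = 4.0

# Vectorized prediction: one matrix-vector product plus a broadcast bias.
y_hat = X @ w + b             # shape (n,)

# The same predictions computed one example at a time.
y_hat_loop = np.array([w @ x + b for x in X])

assert np.allclose(y_hat, y_hat_loop)
print(y_hat)                  # [2.   4.5  5.25]
```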


1.5 Linear decision boundary

Main idea. Classification threshold creates a hyperplane.

Core relation:

w^\top x + b = 0


2. Least Squares Regression

This part studies least squares regression, the canonical way to fit a linear model by minimizing squared error.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Residuals | prediction errors form a vector | r = y - Xw |
| Squared loss | penalize large residuals quadratically | L(w) = \frac{1}{2}\Vert Xw - y\Vert_2^2 |
| Normal equations | set the gradient to zero | X^\top X w = X^\top y |
| Projection view | least squares projects y onto the column space of X | \hat y = P_X y |
| Pseudoinverse | handle rank-deficient or rectangular systems | w = X^+ y |

2.1 Residuals

Main idea. Prediction errors form a vector.

Core relation:

r = y - Xw


2.2 Squared loss

Main idea. Penalize large residuals quadratically.

Core relation:

L(w) = \frac{1}{2}\Vert Xw - y\Vert_2^2


2.3 Normal equations

Main idea. Set the gradient to zero.

Core relation:

X^\top X w = X^\top y


AI connection. This is the closed-form anchor for least-squares learning.
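
Code sketch. A minimal check on synthetic data, assuming NumPy: the normal-equations solve and np.linalg.lstsq (an SVD-based least-squares routine) agree on a well-conditioned problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) w = X^T y with a linear solver
# instead of forming an explicit inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from a numerically safer least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))   # True here; can fail when X^T X is ill-conditioned
```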


2.4 Projection view

Main idea. Least squares projects y onto the column space of X.

Core relation:

\hat y = P_X y, \qquad P_X = X(X^\top X)^{-1}X^\top \ \text{(for full-rank } X\text{)}


2.5 Pseudoinverse

Main idea. Handle rank-deficient or rectangular systems.

Core relation:

w = X^+ y
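
Code sketch. A rank-deficient toy system, assuming NumPy: solving the normal equations directly would fail here because X^T X is singular, while the pseudoinverse returns the minimum-norm least-squares solution.

```python
import numpy as np

# The third column duplicates the first, so rank(X) = 2 and X^T X is singular.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [2.0, 0.0, 2.0]])
y = np.array([1.0, 2.0, 3.0, 2.0])

w = np.linalg.pinv(X) @ y          # minimum-norm solution
print(np.linalg.matrix_rank(X))    # 2
print(w)                           # duplicated columns share the same coefficient
```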


3. Gradient Descent

This part studies gradient descent, the iterative alternative to the closed-form solve and the workhorse of modern training.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Gradient | loss gradient points uphill in parameter space | \nabla_w L = X^\top(Xw - y) |
| Update | move opposite the gradient | w_{t+1} = w_t - \eta \nabla_w L |
| Learning rate | step size must respect curvature | 0 < \eta < 2/\lambda_{\max}(X^\top X) |
| Stochastic gradients | estimate gradient from mini-batches | g_B = X_B^\top(X_B w - y_B) |
| Feature scaling | rescale features to improve conditioning | x_j \leftarrow (x_j - \mu_j)/\sigma_j |

3.1 Gradient

Main idea. Loss gradient points uphill in parameter space.

Core relation:

\nabla_w L = X^\top(Xw - y)


3.2 Update

Main idea. Move opposite the gradient.

Core relation:

w_{t+1} = w_t - \eta \nabla_w L
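
Code sketch. Full-batch gradient descent on the squared loss, assuming NumPy, with the step size chosen from the curvature bound discussed in 3.3 below.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.05 * rng.normal(size=n)

# Any 0 < eta < 2 / lambda_max(X^T X) converges for this quadratic loss.
eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()

w = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w - y)   # gradient of 0.5 * ||Xw - y||^2
    w -= eta * grad

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_star, atol=1e-6))   # converged to the least-squares solution
```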


3.3 Learning rate

Main idea. Step size must respect curvature.

Core relation:

0 < \eta < 2/\lambda_{\max}(X^\top X)


3.4 Stochastic gradients

Main idea. Estimate gradient from mini-batches.

Core relation:

g_B = X_B^\top(X_B w - y_B)


3.5 Feature scaling

Main idea. Rescale features to improve conditioning.

Core relation:

x_j \leftarrow (x_j - \mu_j)/\sigma_j
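
Code sketch. A hypothetical two-feature design with wildly different units, assuming NumPy: standardization collapses the condition number of the Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Illustrative raw features on very different scales.
X = np.column_stack([1000.0 * rng.normal(size=n),   # e.g. a size in square feet
                     0.01 * rng.normal(size=n)])    # e.g. a small rate

print(np.linalg.cond(X.T @ X))    # enormous: gradient descent crawls

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
print(np.linalg.cond(Xs.T @ Xs))  # near 1: well conditioned
```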


4. Regularization

This part studies regularization, the standard way to control model complexity and stabilize ill-posed fits.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Ridge | penalize squared parameter norm | L = \frac{1}{2}\Vert Xw - y\Vert^2 + \frac{\lambda}{2}\Vert w\Vert^2 |
| Ridge solution | shift the Gram matrix spectrum | w = (X^\top X + \lambda I)^{-1} X^\top y |
| Lasso | penalize absolute values to encourage sparsity | L = \frac{1}{2}\Vert Xw - y\Vert^2 + \lambda\Vert w\Vert_1 |
| Elastic net | combine L1 sparsity and L2 shrinkage | \lambda_1\Vert w\Vert_1 + \lambda_2\Vert w\Vert_2^2 |
| Bias variance | regularization trades bias for lower variance | E[(\hat f - f)^2] = \mathrm{bias}^2 + \mathrm{var} + \sigma^2 |

4.1 Ridge

Main idea. Penalize squared parameter norm.

Core relation:

L = \frac{1}{2}\Vert Xw - y\Vert^2 + \frac{\lambda}{2}\Vert w\Vert^2


4.2 Ridge solution

Main idea. Shift the Gram matrix spectrum.

Core relation:

w = (X^\top X + \lambda I)^{-1} X^\top y


AI connection. Ridge is one of the cleanest examples of regularization improving numerical stability.
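
Code sketch. The ridge closed form on synthetic data, assuming NumPy: the lambda I shift keeps the solve well posed and shrinks the coefficient norm as lambda grows.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 40, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

for lam in [0.0, 1.0, 10.0, 100.0]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(f"lambda = {lam:6.1f}   ||w|| = {np.linalg.norm(w):.4f}")   # norm shrinks
```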


4.3 Lasso

Main idea. Penalize absolute values to encourage sparsity.

Core relation:

L = \frac{1}{2}\Vert Xw - y\Vert^2 + \lambda\Vert w\Vert_1
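
Code sketch. The lasso has no closed form, but its proximal operator does: soft-thresholding. This is a minimal ISTA-style iteration on synthetic sparse data, assuming NumPy; it is one common solver among several.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: shrink each coordinate toward zero, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(4)
n, d = 80, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 0.7]          # sparse ground truth
y = X @ w_true + 0.05 * rng.normal(size=n)

lam = 5.0
eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
w = np.zeros(d)
for _ in range(1000):
    # Gradient step on the smooth part, then the L1 proximal step.
    w = soft_threshold(w - eta * X.T @ (X @ w - y), eta * lam)

print(np.round(w, 3))   # most coordinates end up exactly zero
```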


4.4 Elastic net

Main idea. Combine L1 sparsity and L2 shrinkage.

Core relation:

\lambda_1\Vert w\Vert_1 + \lambda_2\Vert w\Vert_2^2


4.5 Bias variance

Main idea. Regularization trades bias for lower variance.

Core relation:

E[(\hat f - f)^2] = \mathrm{bias}^2 + \mathrm{var} + \sigma^2


5. Linear Classification

This part studies linear classification, where the affine score is turned into class decisions and probabilities.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Logistic regression | map affine score to probability | p(y=1 \mid x) = \sigma(w^\top x + b) |
| Binary cross-entropy | negative Bernoulli log likelihood | -\left[y\log p + (1-y)\log(1-p)\right] |
| Softmax regression | multi-class linear logits | p_k = \exp(w_k^\top x)/\sum_j \exp(w_j^\top x) |
| Margin | signed distance proxy for confidence | y(w^\top x + b) |
| Linear separability | a hyperplane can perfectly split classes only in some feature spaces | y_i(w^\top x_i + b) > 0 |

5.1 Logistic regression

Main idea. Map affine score to probability.

Core relation:

p(y=1 \mid x) = \sigma(w^\top x + b)


5.2 Binary cross-entropy

Main idea. Negative Bernoulli log likelihood.

Core relation:

-\left[y\log p + (1-y)\log(1-p)\right]
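
Code sketch. Logistic probabilities and binary cross-entropy, assuming NumPy; the clip guards against log(0) when a probability saturates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0) at saturated probabilities
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean()

z = np.array([2.0, -1.0, 0.0, 3.0])   # hypothetical affine scores w^T x + b
y = np.array([1.0, 0.0, 1.0, 0.0])

p = sigmoid(z)
print(np.round(p, 3))                        # [0.881 0.269 0.5   0.953]
print(round(binary_cross_entropy(y, p), 3))  # the confident mistake (last example) dominates
```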


5.3 Softmax regression

Main idea. Multi-class linear logits.

Core relation:

p_k = \exp(w_k^\top x)/\sum_j \exp(w_j^\top x)
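
Code sketch. A numerically stable softmax, assuming NumPy: subtracting the max logit changes nothing mathematically (softmax is shift-invariant) but prevents exp from overflowing.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift-invariance makes this free
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])   # naive exp(1000) would overflow
print(np.round(softmax(logits), 3))             # [[0.09  0.245 0.665]]
```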


5.4 Margin

Main idea. Signed distance proxy for confidence.

Core relation:

y(w^\top x + b)


5.5 Linear separability

Main idea. A hyperplane can perfectly split classes only in some feature spaces.

Core relation:

y_i(w^\top x_i + b) > 0


6. Optimization Geometry

This part studies the geometry of the optimization landscape: curvature, conditioning, and the directions that make fitting easy or hard.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Convexity | linear regression and logistic regression losses are convex in w | \nabla^2 L \succeq 0 |
| Condition number | ill-conditioned features slow optimization | \kappa = \lambda_{\max}/\lambda_{\min} |
| SVD view | singular values reveal stable and unstable directions | X = U\Sigma V^\top |
| Collinearity | correlated features make coefficients unstable | X^\top X nearly singular |
| Whitening | decorrelate features when appropriate | \mathrm{Cov}(X) \approx I |

6.1 Convexity

Main idea. Linear regression and logistic regression losses are convex in w.

Core relation:

\nabla^2 L \succeq 0


6.2 Condition number

Main idea. Ill-conditioned features slow optimization.

Core relation:

\kappa = \lambda_{\max}/\lambda_{\min}


6.3 SVD view

Main idea. Singular values reveal stable and unstable directions.

Core relation:

X = U\Sigma V^\top
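
Code sketch. Least squares through the SVD, assuming NumPy: the singular values give the condition number directly, and inverting them solves the problem.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s.max() / s.min())          # condition number of X

# Least squares via SVD: rotate, divide by singular values, rotate back.
w = Vt.T @ ((U.T @ y) / s)
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_ref))      # True
```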


6.4 Collinearity

Main idea. Correlated features make coefficients unstable.

Core relation:

X^\top X nearly singular


6.5 Whitening

Main idea. Decorrelate features when appropriate.

Core relation:

\mathrm{Cov}(X) \approx I


7. Evaluation

This part studies evaluation: how to measure whether a fitted linear model actually generalizes.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Train validation split | measure generalization on held-out data | L_{\mathrm{val}} |
| MSE and MAE | regression errors emphasize different residual behavior | \mathrm{MSE} = n^{-1}\sum r_i^2 |
| Accuracy and log loss | classification quality needs both decisions and probabilities | -\log p_y |
| Calibration | predicted probabilities should match empirical frequencies | P(y=1 \mid \hat p = c) \approx c |
| Cross-validation | average validation over multiple folds | K folds |

7.1 Train validation split

Main idea. Measure generalization on held-out data.

Core relation:

L_{\mathrm{val}}


7.2 MSE and MAE

Main idea. Regression errors emphasize different residual behavior.

Core relation:

\mathrm{MSE} = n^{-1}\sum_i r_i^2, \qquad \mathrm{MAE} = n^{-1}\sum_i \vert r_i \vert


7.3 Accuracy and log loss

Main idea. Classification quality needs both decisions and probabilities.

Core relation:

-\log p_y


7.4 Calibration

Main idea. Predicted probabilities should match empirical frequencies.

Core relation:

P(y=1 \mid \hat p = c) \approx c


7.5 Cross-validation

Main idea. Average validation over multiple folds.

Core relation:

K folds


8. Linear Models in AI

This part studies where linear models appear inside modern AI systems, from baselines to the heads of large networks.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Baseline strength | linear models expose whether features already solve the task | \hat y = Xw |
| Linear probes | frozen representations can be tested by a linear head | h = f_{\theta_0}(x),\ \hat y = Wh |
| LM head | language models end with a linear projection to vocabulary logits | z = hW_E^\top |
| Logit lens | intermediate states can be projected with the LM head | z_\ell = h_\ell W_E^\top |
| Interpretability | linear weights are easier to inspect than deep nonlinear interactions | w_j feature effect |

8.1 Baseline strength

Main idea. Linear models expose whether features already solve the task.

Core relation:

\hat y = Xw


8.2 Linear probes

Main idea. Frozen representations can be tested by a linear head.

Core relation:

h = f_{\theta_0}(x), \quad \hat y = Wh


AI connection. Linear probes are a practical way to test what a deep representation already contains.
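
Code sketch. A linear probe in miniature, assuming NumPy and scikit-learn. The random matrix H is a stand-in for frozen hidden states h = f_theta0(x); in practice H would come from a real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)

# Stand-in for frozen representations, shape (n, hidden_dim),
# with a planted linearly decodable property.
H = rng.normal(size=(500, 64))
y = (H[:, :3].sum(axis=1) > 0).astype(int)

H_train, H_val = H[:400], H[400:]
y_train, y_val = y[:400], y[400:]

# The probe is nothing but a linear classifier on frozen features.
probe = LogisticRegression(max_iter=1000).fit(H_train, y_train)
print(probe.score(H_val, y_val))   # high accuracy => the property is linearly readable
```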


8.3 LM head

Main idea. Language models end with a linear projection to vocabulary logits.

Core relation:

z = hW_E^\top


AI connection. Even the final step of a giant language model is a linear model over hidden features.


8.4 Logit lens

Main idea. Intermediate states can be projected with the LM head.

Core relation:

z_\ell = h_\ell W_E^\top


8.5 Interpretability

Main idea. Linear weights are easier to inspect than deep nonlinear interactions.

Core relation:

w_j (per-feature effect)


9. Implementation Details

This part studies the implementation details that separate a correct linear-model fit from a silently wrong one.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Intercept handling | bias can be a separate scalar or a column of ones | \tilde X = [X, \mathbf{1}] |
| Numerical solve | prefer stable solvers over explicit matrix inverses | \mathrm{solve}(X^\top X, X^\top y) |
| Standardization leakage | fit scaling parameters on train data only | \mu_{\mathrm{train}}, \sigma_{\mathrm{train}} |
| Rank checks | inspect rank before trusting coefficients | \mathrm{rank}(X) |
| Residual diagnostics | plot residuals to find nonlinearity or outliers | r_i = y_i - \hat y_i |

9.1 Intercept handling

Main idea. Bias can be a separate scalar or a column of ones.

Core relation:

\tilde X = [X, \mathbf{1}]


9.2 Numerical solve

Main idea. Prefer stable solvers over an explicit matrix inverse.

Core relation:

\mathrm{solve}(X^\top X,\ X^\top y)


9.3 Standardization leakage

Main idea. Fit scaling parameters on train data only.

Core relation:

\mu_{\mathrm{train}},\ \sigma_{\mathrm{train}}
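
Code sketch. Leakage-safe standardization, assuming NumPy: the mean and standard deviation come from the training split only and are then reused on validation data.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_val = X[:80], X[80:]

mu = X_train.mean(axis=0)        # fit scaling parameters on train only
sigma = X_train.std(axis=0)

X_train_s = (X_train - mu) / sigma
X_val_s = (X_val - mu) / sigma   # same parameters; recomputing them here would leak
```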


9.4 Rank checks

Main idea. Inspect rank before trusting coefficients.

Core relation:

\mathrm{rank}(X)


9.5 Residual diagnostics

Main idea. Plot residuals to find nonlinearity or outliers.

Core relation:

r_i = y_i - \hat y_i


10. Diagnostics

This part studies diagnostics: quick checks that catch shape bugs, gradient bugs, and misleading fits.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Shape checks | confirm matrix dimensions before fitting | X: (n, d),\ w: (d,) |
| Gradient check | compare analytic and finite-difference gradients | \nabla L |
| Regularization path | plot coefficients as \lambda changes | w(\lambda) |
| Influence | outliers can dominate squared loss | r_i^2 |
| Baseline comparison | compare linear, regularized, and nonlinear models | \Delta L |

10.1 Shape checks

Main idea. Confirm matrix dimensions before fitting.

Core relation:

X: (n, d), \quad w: (d,)


10.2 Gradient check

Main idea. Compare analytic and finite-difference gradients.

Core relation:

\nabla L


AI connection. Linear models are perfect for learning how to verify optimization code.
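
Code sketch. A central finite-difference gradient check for the squared loss, assuming NumPy: if the analytic gradient is right, the two estimates agree to numerical precision.

```python
import numpy as np

def loss(w, X, y):
    r = X @ w - y
    return 0.5 * r @ r

def grad(w, X, y):
    return X.T @ (X @ w - y)   # the analytic gradient we want to verify

rng = np.random.default_rng(8)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w = rng.normal(size=4)

eps = 1e-6
g_num = np.zeros_like(w)
for j in range(w.size):
    e = np.zeros_like(w)
    e[j] = eps
    # Central difference along coordinate j.
    g_num[j] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * eps)

print(np.max(np.abs(g_num - grad(w, X, y))))   # should be tiny (~1e-6 or less)
```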


10.3 Regularization path

Main idea. Plot coefficients as lambda changes.

Core relation:

w(\lambda)


10.4 Influence

Main idea. Outliers can dominate squared loss.

Core relation:

r_i^2


10.5 Baseline comparison

Main idea. Compare linear, regularized, and nonlinear models.

Core relation:

\Delta L



Practice Exercises

  1. Compute an affine prediction.
  2. Solve a small least-squares problem.
  3. Derive and compute the squared-loss gradient.
  4. Run a few gradient-descent steps.
  5. Compute a ridge solution.
  6. Compare coefficient shrinkage under different lambdas.
  7. Compute logistic probability and binary cross-entropy.
  8. Compute softmax probabilities.
  9. Check feature standardization and leakage.
  10. Write a linear-model debugging checklist.

Why This Matters for AI

Linear models appear inside modern AI more often than they first seem. A classifier head is linear. A language model head is linear. A linear probe tests representation quality. Ridge and lasso teach regularization. Least squares teaches projection geometry. Logistic regression teaches probabilistic classification and cross-entropy.

Bridge to Neural Networks

Neural networks can be viewed as learned feature maps followed by linear heads. The next section studies what changes when the feature map itself becomes trainable and nonlinear.
