Regression Analysis

<- Back to Chapter 7: Statistics | Next Chapter: Optimization ->


"All models are wrong, but some are useful."

  • George E. P. Box

Overview

Regression analysis is the statistics chapter's canonical home for modeling how a response changes with predictors. The central question is conditional: given features $\mathbf{x}$, what should we predict about $Y$, how uncertain is that prediction, and what structure in the data supports that conclusion? That question is older than machine learning, but it is also the blueprint for supervised learning. Ordinary least squares is the simplest version of fitting a model to data. Ridge and Lasso are among the oldest regularization methods. Logistic regression is still one of the most reliable probabilistic classifiers in production. Generalized linear models show that many apparently different predictive tasks share one algebraic skeleton: a linear predictor followed by a link function and a response distribution.

This section does not try to absorb every topic that happens to touch linear models. It will not re-teach generic bias-variance language from Estimation Theory, the full hypothesis-testing toolkit from Hypothesis Testing, posterior derivations from Bayesian Inference, or the full optimization algorithms that belong in Chapter 8: Optimization. It also avoids swallowing the broader model zoo already developed in Math for Specific Models: Linear Models. Instead, it focuses on what belongs here: the regression problem setup, the design-matrix viewpoint, ordinary least squares, projection geometry, Gauss-Markov, regression-specific inference and diagnostics, Ridge and Lasso as statistical estimators, logistic and Poisson regression, the generalized linear model framework, and the practical habits that make regression a strong baseline rather than a fragile ritual.

For AI, regression matters far beyond introductory coursework. A linear probe on embeddings is regression or classification on frozen features. A calibration head is a regression-style mapping from scores to probabilities. Weight decay is Ridge in another language. Sparse adaptation and feature selection inherit Lasso geometry. Click models, rate models, latency models, reward models, and tabular baselines all live inside the regression family. Even when we later fit far richer nonlinear models, regression remains the standard against which those models must justify their extra complexity.

Prerequisites

Companion Notebooks

Notebook | Description
theory.ipynb | Interactive derivations and simulations for OLS geometry, diagnostics, regularization, logistic and Poisson regression, GLMs, and ML-facing regression workflows
exercises.ipynb | 10 graded exercises covering covariance-to-slope derivations, matrix OLS, intervals, multicollinearity, Ridge/Lasso, logistic and Poisson models, influence, and linear-probe style applications

Learning Objectives

After completing this section, you will:

  • Define regression as conditional modeling of a response given predictors and distinguish that goal from raw association summaries
  • Build and interpret the design matrix, including intercepts, categorical coding, interaction terms, and feature engineering choices
  • Derive the ordinary least-squares estimator in both scalar and matrix form and explain it as orthogonal projection
  • State the Gauss-Markov theorem precisely and explain why OLS is BLUE under the classical assumptions
  • Compute regression-specific standard errors, confidence intervals, and prediction intervals without confusing them
  • Diagnose lack of fit through residual analysis, leverage, influence, heteroscedasticity checks, and multicollinearity measures
  • Explain how Ridge and Lasso change the geometry of estimation and why they stabilize or sparsify solutions
  • Derive logistic regression as a Bernoulli GLM and interpret odds, log-odds, and decision boundaries
  • Explain Poisson regression for count/rate data and unify linear, logistic, and Poisson regression under the GLM framework
  • Recognize when basis expansions, robust losses, cross-validation, or regularization are needed beyond plain OLS
  • Connect regression mathematics to linear probes, calibration, tabular baselines, weight decay, sparse adaptation, and low-rank updates in modern ML systems

Table of Contents

  1. Intuition
  2. Formal Definitions
  3. Core Theory I: OLS and Inference
  4. Core Theory II: Model Assessment and Diagnostics
  5. Core Theory III: Regularized and Generalized Regression
  6. Advanced Topics
  7. Applications in Machine Learning

1. Intuition

1.1 From Correlation to Conditional Modeling

The first regression lesson is that association is not yet a model. In Descriptive Statistics, we used covariance and correlation to summarize how variables move together. Those summaries are useful, but they do not answer the regression question. Regression asks what we should predict about a response once predictors are observed. That makes regression directional. We are not merely measuring whether height and weight move together; we are asking how to predict one from the other, how uncertainty changes after conditioning, and how much of the variation in the response remains unexplained.

This distinction matters because conditional questions are the ones that operational systems ask. If an ad platform sees a feature vector for a user-request pair, it must predict click probability. If an LLM service sees prompt features and infrastructure state, it must predict latency, token throughput, or failure risk. If a medical triage model sees demographics and lab values, it must predict a target outcome under uncertainty. These are regression questions even when the eventual target is binary, count-valued, or transformed through a link function.

Regression also introduces a second distinction: the difference between the response surface we wish to model and the sample we happened to observe. The data do not contain a visible line waiting to be discovered. They contain noisy observations from which we infer a conditional pattern. That is why regression mixes geometry, probability, and decision-making. We need a function class, an estimator, an uncertainty model, and a diagnostic toolkit.

REGRESSION AS CONDITIONAL MODELING
======================================================================

  descriptive statistics
      -> covariance(X, Y)
      -> correlation(X, Y)
      -> association summary

  regression
      -> model E[Y | X]
      -> estimate parameters from data
      -> predict new outcomes
      -> quantify residual uncertainty

======================================================================

Three examples make the shift concrete.

Example 1: house prices. Correlation tells us price and square footage move together. Regression tells us how expected price changes after conditioning on square footage, location, and age, and it gives a predictive interval for a new house.

Example 2: model latency. Correlation might reveal that token length correlates with latency. Regression lets us incorporate token length, batch size, cache status, and GPU type together, then separate systematic effects from residual noise.

Example 3: reward-model preference data. Even when the ultimate model is logistic rather than Gaussian, the logic is still regression: predict a response from features and interpret coefficients in the conditional model.

Non-examples help too.

  • A scatter plot with a visually rising cloud is not yet a regression analysis.
  • A high correlation does not imply a stable predictive slope under confounding or extrapolation.
  • A fitted line is not automatically causal. Causal interpretation belongs later, especially in Chapter 22: Causal Inference.

For AI: much of supervised learning is a regression problem with different observation models. The output layer may produce a real number, a Bernoulli probability, or a count rate, but the central object is still a conditional prediction rule driven by features.

1.2 Why Regression Is the Blueprint for Supervised Learning

Regression is the blueprint for supervised learning because it packages all the key design choices in a small, interpretable setting. We choose a feature representation. We choose a model class. We define a loss or likelihood. We fit parameters from data. We evaluate predictive performance. We regularize to control overfitting. We inspect uncertainty and residual structure. That is already most of machine learning in miniature.

The linear predictor

$$\eta(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

is especially important because it appears everywhere. Linear regression uses it directly for the conditional mean. Logistic regression passes it through a sigmoid to obtain a probability. Poisson regression exponentiates it to obtain a nonnegative rate. Neural networks stack many affine maps and nonlinearities, but the final readout layer is usually still a regression-style map from learned features to outputs.

This is why regression remains a benchmark even in the age of large models. If a deep model cannot beat a well-specified regression baseline, the extra complexity may be wasted. And if a complex model does beat the baseline, regression still provides the interpretable reference point for what the richer model gained: nonlinearity, interactions, hierarchy, sequence dependence, or richer uncertainty handling.

There is also a computational reason regression is foundational. OLS has a closed-form solution under full rank. Ridge has a closed-form solution too. These exact solutions let us study shrinkage, degrees of freedom, overparameterization, and conditioning without algorithmic distractions. When we later move to iterative optimization, we already know the statistical target those algorithms are trying to approximate.

For AI systems, regression-style models appear in at least five recurring roles:

  1. baseline predictive model for structured features
  2. interpretable audit model on top of black-box features
  3. probe or classifier on frozen representations
  4. calibration layer for risk scores and probabilities
  5. regularized linear adapter or head during fine-tuning

Each role inherits the same core mathematics: linear predictors, loss functions, regularization, and diagnostic reasoning.

1.3 When Linear Structure Works and When It Breaks

Linear structure works surprisingly well when the feature representation already captures the relevant nonlinearity. If embeddings from a strong pretrained model make semantic classes nearly linearly separable, then a logistic regression head can be excellent. If engineered features already encode curvature, then linear regression in the expanded feature space can compete with more complex models. If a local neighborhood is narrow enough, linear approximation may be the right first-order model even when the global system is nonlinear.

But linear structure breaks in recognizable ways. Additivity fails when interactions dominate. Global linearity fails when effects change sign across regions. Constant-variance assumptions fail when residual spread grows with the mean. Gaussian noise fails for binary, count, or heavy-tailed responses. Independent observations fail in panel and time-series settings. Feature stability fails when predictors are collinear, sparse, or high-dimensional.

The right response is not to abandon regression at the first sign of strain. Instead, we expand the toolkit. Basis expansions and splines preserve the regression framework while adding nonlinear flexibility. Ridge and Lasso control instability and high-dimensionality. GLMs adapt the response distribution and link function. Weighted or robust losses address heteroscedasticity and outliers. Residual diagnostics reveal when the model class is misaligned with the data.

WHEN LINEARITY HELPS
======================================================================

  raw features --poor--> linear model
  learned / engineered features --better--> linear model
  linear model + regularization --> stable baseline
  linear model + basis expansion --> nonlinear response through linear fit

  if failures remain:
      interactions
      heavy tails
      heteroscedasticity
      temporal dependence
      causally unstable features

======================================================================

Three examples:

Example 1: tabular risk prediction. A linear model with carefully scaled features and interactions can be strong, cheap, and interpretable.

Example 2: embedding classification. A simple logistic head on a frozen encoder can be competitive because the encoder already learned nonlinear structure.

Example 3: counts with exposure. Plain OLS is awkward when the response is a count. Poisson regression preserves the linear-predictor logic while respecting count behavior.

Non-examples:

  • Using a linear model on raw pixels and concluding the task is impossible. The issue may be the feature representation rather than regression itself.
  • Reading coefficient signs literally under severe multicollinearity.
  • Trusting in-sample fit even when residual plots show curvature or changing variance.

1.4 Historical Timeline

Year | Contributor | Contribution
1805 | Adrien-Marie Legendre | Publishes least squares for fitting astronomical data
1809 | Carl Friedrich Gauss | Connects least squares to Gaussian errors and likelihood reasoning
1886 | Francis Galton | Coins "regression" through regression toward the mean
1901 | Karl Pearson | Develops correlation and related geometric summaries feeding regression analysis
1922 | R. A. Fisher | Sharpens likelihood and linear-model inference foundations
1937-1947 | Neyman, Pearson, Wald, and others | Build the modern hypothesis-testing and estimation context surrounding regression inference
1951 | Gauss-Markov theorem formalized in modern textbooks | Clarifies BLUE status of OLS under classical assumptions
1970 | Hoerl and Kennard | Popularize Ridge regression for collinearity and ill conditioning
1972 | Nelder and Wedderburn | Generalized linear models unify Gaussian, Bernoulli, and Poisson regression
1977 | Cook | Introduces Cook's distance for influence diagnostics
1996 | Tibshirani | Introduces Lasso as an L1-regularized estimator
2001 | Ng and Jordan | Compare logistic regression and Naive Bayes in a modern ML framing
2000s-2010s | Statistical learning literature | Makes regularized regression a core high-dimensional tool
2019-2026 | Interpretability and representation analysis | Linear probes become standard for studying deep model representations
2022-2026 | Parameter-efficient fine-tuning literature | Low-rank and sparse adaptation methods reconnect modern AI to regularized linear estimation

The timeline is worth remembering because regression is not an obsolete prelude to machine learning. It is one of the oldest places where prediction, regularization, probabilistic modeling, and diagnostics were all developed together.


2. Formal Definitions

2.1 Response, Predictors, and Conditional Expectation

We begin with data

$$\{(y_i,\mathbf{x}_i)\}_{i=1}^n,$$

where $y_i$ is the response for observation $i$ and $\mathbf{x}_i \in \mathbb{R}^p$ is a predictor vector. Regression asks us to model some aspect of the conditional distribution of $Y$ given $\mathbf{X}$. In the simplest linear-regression setting, the target is the conditional mean:

$$\mathbb{E}[Y \mid \mathbf{X} = \mathbf{x}] \approx \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$

That sentence already hides several important choices. First, the response may be continuous, binary, count-valued, or otherwise constrained. Second, the predictors may be numerical, categorical, or engineered from raw inputs. Third, the approximation may be exact in the assumed model or only an intentionally useful simplification. Fourth, the word "conditional" means the interpretation of a coefficient is always with the other included predictors held fixed.

The conditional-mean perspective is especially important because it separates regression from broader supervised-learning language. Some tasks care about the entire conditional distribution. Some care only about the median or a quantile. Some care about classification risk rather than mean squared error. But linear regression begins with the conditional mean because it provides the cleanest geometric and probabilistic theory.

We will use the standard linear model notation

$$Y_i = \beta_0 + \sum_{j=1}^p \beta_j X_{ij} + \varepsilon_i,$$

or in vector form,

$$Y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i,$$

where $\mathbf{x}_i$ is understood to include a leading 1 if an intercept is present. The noise term $\varepsilon_i$ captures the part of the response not explained by the included predictors.

Examples:

  • house price from size, location, and age
  • token-generation latency from context length, model size, and batch state
  • click probability from user, query, and ad features
  • crash count from usage volume and deployment characteristics

Non-examples:

  • modeling only the joint cloud of $(X, Y)$ without designating a predictive target
  • fitting a line to time-dependent errors while ignoring serial correlation from Time Series
  • interpreting coefficients causally without a causal design

For AI: even when modern systems learn huge feature representations end-to-end, their final tasks are still conditional prediction problems. Regression notation gives us a language for the last-mile map from features to outputs.

2.2 Design Matrix, Linear Predictor, and Feature Coding

Stacking the predictors gives the design matrix

$$X = \begin{bmatrix} \mathbf{x}_1^\top \\ \mathbf{x}_2^\top \\ \vdots \\ \mathbf{x}_n^\top \end{bmatrix} \in \mathbb{R}^{n \times p},$$

and stacking the responses gives $\mathbf{y} \in \mathbb{R}^n$. Then the linear model is

$$\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$

The vector $X\boldsymbol{\beta}$ is the linear predictor evaluated on all observations.

The design matrix is not a passive container. It is the object whose column space determines what the model can express. Each column is one feature pattern across the sample. If a column is duplicated or nearly duplicated, coefficients become unstable. If an important interaction is missing, the column space may be too small to represent the signal. If the intercept is omitted when it is needed, the fitted hyperplane is forced through the origin.

DESIGN MATRIX VIEW
======================================================================

          columns = features
        +----------------------+
 row 1  | 1  x11 x12 ... x1p   | -> y1
 row 2  | 1  x21 x22 ... x2p   | -> y2
  ...   | ...                  | -> ...
 row n  | 1  xn1 xn2 ... xnp   | -> yn
        +----------------------+

  X beta = fitted mean vector in R^n
  column space of X = all fitted-value vectors the model can represent

======================================================================

Feature coding deserves care. Numerical predictors may need centering or scaling, especially before regularization. Categorical predictors require indicator coding or a contrast scheme. Interactions are new columns formed from products such as $x_1 x_2$. Polynomial regression is still linear regression in the coefficients because the nonlinearity is in the features, not in $\boldsymbol{\beta}$.

Three examples:

Example 1: intercept handling. Adding a column of ones lets $\beta_0$ absorb the mean level of the response.

Example 2: binary category coding. For a predictor GPU = {A100, H100}, one indicator column can represent the shift relative to a reference category.

Example 3: interaction term. If prompt length affects latency differently under batch mode, then a feature like length x batch_size can encode that slope change.

Non-examples:

  • adding all dummy variables plus an intercept, which creates perfect multicollinearity
  • regularizing unscaled features and then comparing coefficient magnitudes as if the penalty acted equally
  • assuming "linear model" means predictors themselves must be raw and linear
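
A minimal numpy sketch of these coding choices, with made-up feature names (prompt length, a GPU indicator coded against a reference category, and their interaction); the numbers are purely illustrative:

```python
import numpy as np

# Sketch: hand-building a design matrix with an intercept,
# a reference-coded binary category, and an interaction column.
rng = np.random.default_rng(0)
n = 200
length = rng.uniform(10, 500, size=n)        # e.g. prompt length
gpu_h100 = rng.integers(0, 2, size=n)        # indicator: 1 = H100, 0 = reference (A100)
interaction = length * gpu_h100              # slope of length differs by GPU type

X = np.column_stack([np.ones(n), length, gpu_h100, interaction])
beta_true = np.array([5.0, 0.02, -1.0, 0.01])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS via least squares (QR/SVD under the hood) rather than an explicit inverse.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 3))
```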

2.3 Fitted Values, Residuals, and Sum-of-Squares Quantities

Given an estimator $\hat{\boldsymbol{\beta}}$, the fitted values are

$$\hat{\mathbf{y}} = X \hat{\boldsymbol{\beta}},$$

and the residuals are

$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}.$$

Residuals are not the same as the true errors $\boldsymbol{\varepsilon}$. The true errors depend on the unknown data-generating process. Residuals are the sample leftovers after fitting the model. They are observable proxies we can inspect for structure.

Several sum-of-squares quantities organize regression analysis:

$$\operatorname{RSS} = \sum_{i=1}^n e_i^2, \qquad \operatorname{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2, \qquad \operatorname{ESS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2.$$

With an intercept in the model,

$$\operatorname{TSS} = \operatorname{ESS} + \operatorname{RSS}.$$

This decomposition is a geometric statement: the centered response vector splits into an explained projection and an orthogonal residual component.

Residuals carry many meanings at once. Large residual magnitude suggests poor local fit. Residual patterns across fitted values can reveal nonlinearity or heteroscedasticity. Residual clustering by groups can reveal omitted structure. Residual autocorrelation can signal that the independence assumption fails and Time Series methods are needed.

Fitted values also have a subtle interpretation. They are not just predicted outputs; they are the orthogonal projection of $\mathbf{y}$ onto the model space when OLS is used. That geometric fact will drive most of Section 3.

Examples:

  • A small RSS with badly patterned residuals can still mean the model is misspecified.
  • Two models can have similar $R^2$ but very different residual diagnostics.
  • A low RSS in an overparameterized fit may signal interpolation rather than genuine generalization.
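
A quick simulated check of the decomposition, assuming the model includes an intercept column:

```python
import numpy as np

# Sketch: verify that TSS = ESS + RSS holds when the fit includes an intercept.
rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
resid = y - y_hat

rss = np.sum(resid**2)
tss = np.sum((y - y.mean())**2)
ess = np.sum((y_hat - y.mean())**2)
print(f"TSS={tss:.4f}  ESS+RSS={ess + rss:.4f}")  # should agree up to rounding
```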

2.4 Classical Linear Model Assumptions

The classical linear model is often summarized through the following assumptions.

  1. Linearity in parameters: $\mathbb{E}[Y \mid X] = X\boldsymbol{\beta}$.
  2. Exogeneity / zero conditional mean: $\mathbb{E}[\boldsymbol{\varepsilon} \mid X] = \mathbf{0}$.
  3. Full column rank: the columns of $X$ are linearly independent.
  4. Homoscedasticity: $\operatorname{Var}(\boldsymbol{\varepsilon} \mid X) = \sigma^2 I$.
  5. No autocorrelation: errors are uncorrelated across observations.
  6. Normality is sometimes added for exact small-sample inference, though not for unbiasedness of OLS itself.

Each assumption answers a different question. Linearity specifies the systematic part of the model. Exogeneity underwrites unbiasedness. Full rank ensures the normal equations identify a unique coefficient vector. Homoscedasticity and no autocorrelation simplify the covariance structure. Normality justifies exact $t$ and $F$ distributions in finite samples.

It is a mistake to compress these into a single vague statement such as "the data should be linear." The assumptions do different work. Some matter for estimation consistency, some for variance formulas, some only for exact finite-sample testing. A model can remain useful after one assumption fails, but then we need the right repair: robust standard errors, feature expansion, weighted least squares, or a different response model.

CLASSICAL ASSUMPTIONS AT A GLANCE
======================================================================

  assumption                  what it supports
  --------------------------------------------------------------------
  E[e | X] = 0                unbiased coefficients
  rank(X) = p                 unique OLS solution
  Var[e | X] = sigma^2 I      simple variance formulas, efficiency
  no serial dependence        valid iid-style inference
  normal errors               exact small-sample t and F inference

======================================================================

Examples:

  • If variance grows with the mean, OLS coefficients can still be unbiased, but the classical standard errors are wrong.
  • If two columns are nearly duplicates, predictions may remain good while coefficient interpretation becomes unstable.
  • If errors are correlated over time, Time Series or robust covariance methods become necessary.

2.5 Examples, Non-Examples, and Edge Cases

It helps to see what does and does not count as a straightforward regression setting.

Standard examples

  1. Simple linear regression: one predictor, one response, slope plus intercept.
  2. Multiple linear regression: several predictors combined additively.
  3. Logistic regression: same linear predictor but linked to Bernoulli outcomes through the logit link.
  4. Poisson regression: same linear predictor linked to counts through the log link.
  5. Polynomial regression: still linear in coefficients after basis expansion.

Non-examples

  1. A binary classifier trained with hinge loss is not a regression model, even though it may use a linear score.
  2. A time-series model with lagged dependence and autocorrelated residuals is not safely handled by iid OLS unless the dependence is modeled appropriately.
  3. A causal treatment-effect estimate is not justified merely because a regression coefficient exists.

Edge cases

  • Perfect multicollinearity: $X^\top X$ is singular, so OLS is not uniquely identified.
  • $p > n$ high-dimensional regime: infinitely many interpolating coefficient vectors may exist.
  • Separated logistic data: logistic MLE may diverge without regularization.
  • Outlier-driven fit: a few influential points can dominate the estimator.
  • Missing data: complete-case regression may distort the target if missingness is structured.

These edge cases are not rare exceptions in modern ML. They are normal. Feature spaces are often large, predictors correlated, loss surfaces regularized, and deployment distributions unstable. That is why diagnostics and regularization belong inside regression rather than as afterthoughts.


3. Core Theory I: OLS and Inference

3.1 Simple Linear Regression and Its Link to Correlation

In simple linear regression, we model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i.$$

OLS chooses $(\hat{\beta}_0, \hat{\beta}_1)$ to minimize

$$\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$$

Differentiating with respect to $\beta_0$ and $\beta_1$ and setting the derivatives to zero yields

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2} = \frac{s_{xy}}{s_x^2},$$

and

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

This is the first strong connection back to Descriptive Statistics. The slope is the sample covariance divided by the sample variance. Equivalently, since $r = s_{xy}/(s_x s_y)$,

$$\hat{\beta}_1 = r \frac{s_y}{s_x}.$$

So correlation is not the regression slope, but the slope is built directly from it after rescaling by marginal variability.

The formula reveals useful intuition. If $X$ varies very little, even a moderate covariance produces a large slope, because the line must tilt sharply to explain the response over a narrow range of $X$. If $X$ and $Y$ are uncorrelated in the sample, the slope is near zero. If we scale $X$ by a constant, the slope rescales inversely. These are simple facts, but they foreshadow why feature scaling matters once penalties are introduced.

There is also a geometric fact hiding here: the fitted line passes through $(\bar{x}, \bar{y})$. That follows from the intercept formula and is one of the earliest glimpses of projection geometry.

Examples:

  • standardized variables make the simple-regression slope equal to the correlation
  • centering predictors simplifies interpretation of the intercept
  • when $r$ is high but the sample is tiny, the standard error may still be large

Non-examples:

  • interpreting a large slope as stronger association without considering units
  • comparing unscaled slopes across predictors measured in different units
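
A short numerical sketch on simulated data confirming that the covariance-ratio slope, the rescaled correlation, and a least-squares fit agree:

```python
import numpy as np

# Sketch: the OLS slope equals s_xy / s_x^2 and r * (s_y / s_x).
# The ddof convention cancels in the ratio, as long as it is used consistently.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 3.0 - 0.7 * x + rng.normal(scale=0.5, size=500)

s_xy = np.cov(x, y, ddof=1)[0, 1]
s_x2 = np.var(x, ddof=1)
r = np.corrcoef(x, y)[0, 1]

slope_cov = s_xy / s_x2
slope_corr = r * np.std(y, ddof=1) / np.std(x, ddof=1)
slope_lstsq = np.polyfit(x, y, deg=1)[0]
print(slope_cov, slope_corr, slope_lstsq)  # all three should agree
```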

3.2 Matrix OLS and the Normal Equations

Multiple regression is most naturally written in matrix form. OLS minimizes

$$L(\boldsymbol{\beta}) = \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert_2^2 = (\mathbf{y} - X\boldsymbol{\beta})^\top (\mathbf{y} - X\boldsymbol{\beta}).$$

Expanding and differentiating gives

$$\nabla_{\boldsymbol{\beta}} L(\boldsymbol{\beta}) = -2X^\top \mathbf{y} + 2X^\top X \boldsymbol{\beta}.$$

Setting the gradient to zero yields the normal equations

$$X^\top X \hat{\boldsymbol{\beta}} = X^\top \mathbf{y}.$$

If $X$ has full column rank, $X^\top X$ is invertible and

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}.$$

The formula is compact, but its meaning is rich. $X^\top X$ is the Gram matrix of feature inner products. $X^\top \mathbf{y}$ records how strongly each feature aligns with the response. Solving the system balances those alignments against the redundancy among predictors.

The inverse formula should be read conceptually, not as a numerical recipe. In practice, stable software often uses QR decomposition, SVD, or iterative methods rather than explicitly forming $(X^\top X)^{-1}$. Detailed numerical linear algebra belongs more fully to Chapter 3 and Chapter 8. What belongs here is the statistical target defined by the normal equations.

When $X$ is not full rank, the normal equations may admit many solutions. Then the pseudoinverse

$$\hat{\boldsymbol{\beta}} = X^+ \mathbf{y}$$

provides the minimum-norm solution. That is important in modern overparameterized regression, but the canonical home for pseudoinverse machinery is earlier linear algebra. We preview it here because regression runs into rank deficiency constantly.

For AI: normal-equation structure reappears whenever a linear head is fit on frozen features, whether those features came from a hand-designed pipeline, an encoder, or a transformer hidden state.
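
A brief sketch comparing three routes to the same OLS target on a toy full-rank problem; the shapes and seeds are arbitrary:

```python
import numpy as np

# Sketch: the same OLS estimate computed three ways.
# In practice prefer lstsq (QR/SVD based) over forming (X^T X)^{-1} explicitly.
rng = np.random.default_rng(3)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # QR/SVD-based solver
beta_pinv = np.linalg.pinv(X) @ y                    # minimum-norm pseudoinverse solution

print(np.allclose(beta_normal_eq, beta_lstsq), np.allclose(beta_lstsq, beta_pinv))
```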

3.3 Projection Geometry, the Hat Matrix, and Orthogonality

OLS is best understood geometrically. The set of all fitted-value vectors has the form

$$\mathcal{C}(X) = \{X\boldsymbol{\beta} : \boldsymbol{\beta} \in \mathbb{R}^p\},$$

the column space of $X$. OLS chooses the point in that subspace closest to $\mathbf{y}$:

$$\hat{\mathbf{y}} = \arg\min_{\mathbf{v}\in \mathcal{C}(X)} \lVert \mathbf{y} - \mathbf{v} \rVert_2^2.$$

Therefore $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto $\mathcal{C}(X)$. The residual vector

$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$$

is orthogonal to the entire model space, which means

$$X^\top \mathbf{e} = \mathbf{0}.$$

This orthogonality is exactly the normal equation in disguise. It tells us that the residual has zero sample covariance with every included predictor. That fact is often memorized algebraically, but the geometry is more memorable: once we project, there is no remaining component of $\mathbf{y}$ in any modeled direction.

The projection matrix is the hat matrix

$$H = X(X^\top X)^{-1}X^\top,$$

so named because it maps $\mathbf{y}$ to $\hat{\mathbf{y}}$:

$$\hat{\mathbf{y}} = H\mathbf{y}.$$

It satisfies

$$H^\top = H, \qquad H^2 = H,$$

so it is symmetric and idempotent, the algebraic signature of orthogonal projection.

OLS GEOMETRY
======================================================================

                 y
                /|
               / |
              /  | residual e = y - y_hat
             /   |
            /    |
           *-----* y_hat in column space of X

  y_hat = projection of y onto C(X)
  residual is perpendicular to every modeled direction

======================================================================

Diagonal entries $h_{ii}$ of the hat matrix measure leverage: how unusual observation $i$ is in predictor space. This matters later for diagnostics because high-leverage points can pull the fitted surface strongly, especially when they also have large residuals.

Examples:

  • in simple regression, points far from $\bar{x}$ have higher leverage
  • if $\mathbf{y}$ already lies in the column space, then RSS is zero and the projection is exact
  • the decomposition $\mathbf{y} = \hat{\mathbf{y}} + \mathbf{e}$ is orthogonal, not just additive
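
A small sketch verifying the projection properties numerically; the design here is random and purely illustrative:

```python
import numpy as np

# Sketch: projection properties of the hat matrix on a small random problem.
rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X (X^T X)^{-1} X^T
y_hat = H @ y
e = y - y_hat

print(np.allclose(H, H.T))              # symmetric
print(np.allclose(H @ H, H))            # idempotent
print(np.allclose(X.T @ e, 0))          # residual orthogonal to every column of X
print(np.isclose(np.trace(H), p))       # leverages sum to the number of coefficients
leverage = np.diag(H)                   # h_ii, the leverage of each observation
```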

3.4 Gauss-Markov Theorem

The Gauss-Markov theorem is one of the central results of classical regression. Under the linear model

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \mathbb{E}[\boldsymbol{\varepsilon} \mid X] = \mathbf{0}, \qquad \operatorname{Var}(\boldsymbol{\varepsilon} \mid X) = \sigma^2 I,$$

with $X$ full column rank, the OLS estimator is the Best Linear Unbiased Estimator of $\boldsymbol{\beta}$. That means among all estimators of the form $A\mathbf{y}$ that are unbiased, OLS has the smallest covariance matrix in the positive-semidefinite ordering.

The theorem is easy to misread if we compress the acronym too quickly.

  • Linear refers to estimators linear in $\mathbf{y}$, not to the model alone.
  • Unbiased means $\mathbb{E}[\hat{\boldsymbol{\beta}} \mid X] = \boldsymbol{\beta}$.
  • Best means minimum variance among linear unbiased estimators under the stated covariance assumptions.

The theorem does not say OLS is optimal among all possible estimators. Once we allow bias, Ridge can have lower mean squared error. Once we relax the linear class, other estimators may outperform OLS for prediction. Once homoscedasticity fails, weighted least squares can be more efficient. So Gauss-Markov is powerful, but tightly scoped.

A proof sketch is worth seeing. Let $\tilde{\boldsymbol{\beta}} = A\mathbf{y}$ be any linear unbiased estimator. Unbiasedness implies $AX = I$. Write

$$\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}} + C\mathbf{y},$$

where $CX = 0$. Then

$$\operatorname{Var}(\tilde{\boldsymbol{\beta}} \mid X) = \operatorname{Var}(\hat{\boldsymbol{\beta}} \mid X) + \sigma^2 CC^\top,$$

because the cross term vanishes. Since $\sigma^2 CC^\top$ is positive semidefinite, OLS has no larger variance than $\tilde{\boldsymbol{\beta}}$ in any direction.

This result matters for interpretation. If the classical assumptions are plausible and your priority is unbiased linear estimation, then OLS is not a naive baseline. It is the efficient choice within that class. But if the assumptions are doubtful or prediction under collinearity matters more than unbiasedness, regularization may be preferable.

3.5 Sampling Distribution, Standard Errors, Confidence Intervals, and Prediction Intervals

Conditioned on $X$, the OLS estimator has mean

$$\mathbb{E}[\hat{\boldsymbol{\beta}} \mid X] = \boldsymbol{\beta}$$

under exogeneity, and covariance

$$\operatorname{Var}(\hat{\boldsymbol{\beta}} \mid X) = \sigma^2 (X^\top X)^{-1}$$

under homoscedasticity and no autocorrelation. If we additionally assume Gaussian errors,

$$\hat{\boldsymbol{\beta}} \mid X \sim \mathcal{N}\!\left(\boldsymbol{\beta},\, \sigma^2 (X^\top X)^{-1}\right).$$

Because $\sigma^2$ is unknown, we estimate it with

$$\hat{\sigma}^2 = \frac{\operatorname{RSS}}{n-p},$$

where $p$ counts fitted coefficients including the intercept when present. Then the estimated standard error of coefficient $j$ is

$$\operatorname{SE}(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(X^\top X)^{-1}]_{jj}}.$$

This gives coefficient confidence intervals of the form

$$\hat{\beta}_j \pm t_{n-p,\,1-\alpha/2}\operatorname{SE}(\hat{\beta}_j).$$

These intervals belong to regression specifically because the variance depends on the design matrix. Not all predictors are estimated equally precisely. Precision increases when residual variance is low and the relevant feature column contains strong, independent variation.

Prediction intervals are wider than confidence intervals for the mean response. For a new feature vector $\mathbf{x}_*$, the estimated mean response is $\mathbf{x}_*^\top \hat{\boldsymbol{\beta}}$. Its variance is

$$\hat{\sigma}^2\, \mathbf{x}_*^\top (X^\top X)^{-1}\mathbf{x}_*.$$

But a new observation also contains fresh noise, so the predictive variance becomes

$$\hat{\sigma}^2\left(1 + \mathbf{x}_*^\top (X^\top X)^{-1}\mathbf{x}_*\right).$$

That extra 1 is the irreducible observation noise.

Examples:

  • a coefficient may be statistically uncertain even when overall prediction error is decent
  • prediction far from the center of the design cloud is more uncertain because $\mathbf{x}_*^\top (X^\top X)^{-1}\mathbf{x}_*$ grows
  • a narrow confidence interval for the mean can coexist with a wide prediction interval for an individual observation

Non-examples:

  • treating confidence intervals as posterior credible intervals which belong in Bayesian Inference
  • using iid-style standard errors when errors are heteroscedastic or serially dependent
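
A sketch of these interval computations under the classical homoscedastic Gaussian assumptions, using scipy's t quantiles; the new point is deliberately outside the observed range to show how the interval widens:

```python
import numpy as np
from scipy import stats

# Sketch: coefficient CI and prediction interval for simple regression,
# assuming the homoscedastic Gaussian error model.
rng = np.random.default_rng(5)
n = 80
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=1.5, size=n)

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)

se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_crit = stats.t.ppf(0.975, df=n - p)
ci_slope = (beta_hat[1] - t_crit * se[1], beta_hat[1] + t_crit * se[1])

x_new = np.array([1.0, 12.0])                          # outside the observed x range
mean_pred = x_new @ beta_hat
var_mean = sigma2_hat * x_new @ XtX_inv @ x_new        # uncertainty in the mean response
var_pred = sigma2_hat * (1 + x_new @ XtX_inv @ x_new)  # plus fresh observation noise
pi = (mean_pred - t_crit * np.sqrt(var_pred), mean_pred + t_crit * np.sqrt(var_pred))
print(ci_slope, pi)
```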

4. Core Theory II: Model Assessment and Diagnostics

4.1 R^2, Adjusted R^2, Residual Standard Error, and Goodness of Fit

Once a model is fit, we need summaries of how much variation it explains and how much typical error remains. The most famous measure is

$$R^2 = 1 - \frac{\operatorname{RSS}}{\operatorname{TSS}},$$

which, with an intercept present, measures the fraction of sample variation around $\bar{y}$ explained by the fitted model. Equivalently, $R^2 = \operatorname{ESS}/\operatorname{TSS}$.

$R^2$ is useful, but it is easy to overread. It is a descriptive fit summary for the observed sample, not a guarantee of causal validity, not a proof of predictive generalization, and not a universal score across different response transformations or datasets. In some applications, a modest $R^2$ can still be operationally valuable if decisions depend on ranking or average effects. In others, a very high $R^2$ may just reflect leakage or interpolation.

Adjusted $R^2$ tries to compensate for the fact that ordinary $R^2$ never decreases when new predictors are added:

$$\bar{R}^2 = 1 - \frac{\operatorname{RSS}/(n-p)}{\operatorname{TSS}/(n-1)}.$$

It penalizes unnecessary complexity through degrees of freedom, though it is still a blunt instrument rather than a full model-selection criterion.

The residual standard error

$$\hat{\sigma} = \sqrt{\frac{\operatorname{RSS}}{n-p}}$$

is often more interpretable than $R^2$ because it lives in the units of the response. If the target is latency in milliseconds or revenue in dollars, then $\hat{\sigma}$ tells us the typical residual scale directly.

Examples:

  • A model with lower $R^2$ but better calibrated prediction intervals may be preferable in deployment.
  • Comparing $R^2$ across transformed targets can be misleading.
  • In high-dimensional settings, in-sample $R^2$ can be nearly 1 while out-of-sample performance is poor.
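
A compact helper sketch for these three summaries, assuming the design matrix already carries its intercept column:

```python
import numpy as np

def fit_summaries(X, y):
    """Sketch: R^2, adjusted R^2, and residual standard error for an OLS fit.
    Assumes X already contains an intercept column."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - p)) / (tss / (n - 1))
    rse = np.sqrt(rss / (n - p))
    return r2, adj_r2, rse

rng = np.random.default_rng(17)
x = rng.normal(size=(120, 3))
X = np.column_stack([np.ones(120), x])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(scale=2.0, size=120)
print(fit_summaries(X, y))
```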

4.2 Residual Diagnostics for Linearity, Variance, and Normality

Regression is not finished when coefficients are printed. The residuals are where the model argues back. A residual plot against fitted values or key predictors often reveals more than a coefficient table: curvature suggests missing nonlinear structure, fan-shaped spread suggests heteroscedasticity, clusters suggest omitted groups or interactions, and heavy tails or asymmetry suggest a poor noise model.

The first principle is simple: residuals should look like structureless leftovers if the model is aligned with the data. That does not mean every residual plot must look perfectly random, but strong systematic patterns are warnings that the fitted mean structure is incomplete.

Normality deserves a careful interpretation. Exact small-sample $t$ and $F$ inference assumes Gaussian errors, and QQ-plots are a common diagnostic. But for many predictive tasks, modest non-normality matters far less than model misspecification, heteroscedasticity, or influential points. A perfectly straight QQ-plot does not rescue a badly curved residual-vs-fitted plot.

Three recurring diagnostic questions are worth asking:

  1. Is the mean structure adequate? Look for curvature or interactions hiding in the residuals.
  2. Is the variance roughly constant? Look for widening or narrowing residual spread.
  3. Are there unusual observations or groups? Look for residual clusters, outliers, or leverage-driven distortions.

For AI: diagnostics matter especially for operational regressions, where a low average error can hide systematic failures on rare prompts, particular users, or specific infrastructure regimes.

4.3 Leverage, Influence, and Cook's Distance

Not all observations affect the fit equally. An observation with unusual predictor values has high leverage. In linear regression, the leverage of point $i$ is the diagonal entry $h_{ii}$ of the hat matrix. Large leverage means the point sits far from the center of the design cloud and can pull the fitted hyperplane.

But leverage alone does not guarantee harm. A high-leverage point that lies close to the fitted surface may actually stabilize slope estimation. Influence arises when leverage combines with a large residual. One standard summary is Cook's distance, which measures how much the fitted coefficients or fitted values change when one point is omitted.

The intuition is geometric. Points at the edge of the predictor cloud can act as anchors. If one such point is unusual in both $X$ and $Y$, the regression line may rotate to accommodate it. That can create misleading coefficient estimates even if global fit metrics still look acceptable.

Examples:

  • A benchmark result from a rare hardware configuration may have high leverage in a latency regression.
  • A mislabeled example in a small tabular dataset can dominate a coefficient if it lies far from the rest of the design cloud.
  • In probing tasks, a few representation outliers can drive a probe's apparent separability.

Non-examples:

  • deleting every high-leverage point automatically
  • treating large residuals near the center of the design cloud as if they were high leverage

The right response to influence is thoughtful diagnosis: verify data quality, check whether a missing interaction or group effect explains the point, and report robustness when conclusions hinge on a few observations.
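
A sketch computing leverage and Cook's distance from the hat matrix, with one artificially influential point planted in simulated data:

```python
import numpy as np

# Sketch: leverage and Cook's distance using the closed form
# D_i = e_i^2 h_ii / (p * sigma_hat^2 * (1 - h_ii)^2).
rng = np.random.default_rng(6)
n = 60
x = rng.normal(size=n)
x[0] = 8.0                                   # one point far out in predictor space
y = 0.5 + 1.0 * x + rng.normal(scale=1.0, size=n)
y[0] += 6.0                                  # ... that also has a large residual

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                               # leverage
e = y - H @ y                                # residuals
sigma2_hat = e @ e / (n - p)
cooks_d = (e**2 * h) / (p * sigma2_hat * (1 - h)**2)
print(np.argsort(cooks_d)[-3:])              # indices of the most influential points
```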

4.4 Multicollinearity, Condition Number, and VIF

Multicollinearity occurs when predictors are highly correlated or nearly linearly dependent. The model may still predict well, but individual coefficients become unstable because several different coefficient combinations explain almost the same fitted surface. In the normal equations, this appears through an ill-conditioned matrix $X^\top X$.

The consequences are practical:

  • standard errors inflate
  • coefficient signs can flip under small perturbations
  • different random splits can produce different stories about feature importance
  • interpretation becomes brittle even when prediction stays acceptable

One global diagnostic is the condition number of the scaled design matrix, which measures how sensitive the solution is to perturbations. A feature-specific diagnostic is the variance inflation factor

$$\operatorname{VIF}_j = \frac{1}{1-R_j^2},$$

where $R_j^2$ is obtained by regressing predictor $j$ on the others. Large VIF means predictor $j$ is largely explained by the rest, so its coefficient estimate is imprecise.

Collinearity is not always a bug. In many domains, predictors are naturally related. Embedding features can be highly redundant. Operational covariates such as traffic, queue depth, and latency can move together. The question is whether interpretation of individual coefficients is central. If it is, collinearity is a major problem. If prediction is the goal, regularization may be enough.
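
A sketch of the VIF computation by auxiliary regressions; the near-duplicate columns are constructed to make the inflation obvious:

```python
import numpy as np

def vif(X):
    """Sketch: variance inflation factor for each column of X.
    X should exclude the intercept; an intercept is added to each auxiliary fit."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2_j = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1 / (1 - r2_j)
    return out

# Two nearly duplicated columns should both show large VIFs.
rng = np.random.default_rng(7)
z = rng.normal(size=300)
X = np.column_stack([z, z + 0.01 * rng.normal(size=300), rng.normal(size=300)])
print(np.round(vif(X), 1))
```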

4.5 Heteroscedasticity, Weighted Least Squares, and Robust Standard Errors

Classical OLS variance formulas assume

$$\operatorname{Var}(\varepsilon_i \mid X) = \sigma^2$$

for all $i$. When variance changes with the predictors or the fitted mean, the errors are heteroscedastic. OLS coefficients remain unbiased under exogeneity, but the usual standard errors are wrong and OLS is no longer efficient among linear unbiased estimators.

One remedy is weighted least squares: if observation $i$ has variance proportional to $\sigma_i^2$, we minimize

$$\sum_{i=1}^n \frac{(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2}{\sigma_i^2}.$$

This gives more weight to precise observations and less to noisy ones. In practice, the variance structure is often estimated rather than known.

Another remedy is to keep OLS coefficients but replace the covariance formula with heteroscedasticity-robust standard errors. These "sandwich" estimators preserve the fitted coefficients while correcting uncertainty quantification under milder assumptions. They are often the right default when the mean model is trusted but constant variance is not.

Examples:

  • modeling spend, latency, or count-like quantities often shows variance increasing with the mean
  • sample weighting is natural when some observations aggregate many events while others aggregate few
  • robust standard errors are useful when prediction coefficients remain meaningful but classical inference is too optimistic

Non-examples:

  • using weighted least squares without a plausible weighting rationale
  • assuming robust standard errors fix mean-model misspecification
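
A sketch of both remedies on simulated data where the noise scale grows with the predictor; the weights are treated as known here, which is an idealization:

```python
import numpy as np

# Sketch: weighted least squares with known relative variances, and
# heteroscedasticity-robust (HC0 "sandwich") standard errors for plain OLS.
rng = np.random.default_rng(8)
n = 300
x = rng.uniform(1, 10, size=n)
sigma_i = 0.3 * x                            # noise grows with the predictor
y = 2.0 + 0.5 * x + rng.normal(scale=sigma_i)

X = np.column_stack([np.ones(n), x])

# WLS: weight each observation by 1 / sigma_i^2 (treated as known).
W = np.diag(1.0 / sigma_i**2)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# OLS point estimates with an HC0 robust covariance.
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y
e = y - X @ beta_ols
meat = X.T @ (X * e[:, None]**2)             # X^T diag(e_i^2) X
robust_cov = XtX_inv @ meat @ XtX_inv
print(beta_wls, beta_ols, np.sqrt(np.diag(robust_cov)))
```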

5. Core Theory III: Regularized and Generalized Regression

5.1 Ridge Regression as Stabilized OLS and Shrinkage

Ridge regression solves

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ \lVert \mathbf{y} - X\boldsymbol{\beta}\rVert_2^2 + \lambda \lVert \boldsymbol{\beta}\rVert_2^2 \right\},$$

typically leaving the intercept unpenalized in practical implementations. The solution is

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = (X^\top X + \lambda I)^{-1}X^\top \mathbf{y}.$$

The extra $\lambda I$ stabilizes inversion when $X^\top X$ is ill-conditioned or singular. Geometrically, Ridge shrinks the coefficient vector toward zero. Statistically, it trades a small amount of bias for potentially large variance reduction. That tradeoff often improves prediction, especially under multicollinearity or limited sample size.

Ridge is best understood as shrinkage. If OLS estimates are noisy because many nearly equivalent directions fit the data, then pulling coefficients toward zero can improve out-of-sample stability. The penalty does not usually create exact zeros. Instead, it distributes weight more smoothly across correlated predictors.

There is a direct link back to Bayesian Inference: a Gaussian prior on coefficients makes the MAP estimator a Ridge problem. But here the canonical focus is statistical estimation, not the full posterior.

RIDGE GEOMETRY
======================================================================

  minimize RSS(beta) subject to ||beta||_2^2 <= c

      elliptical RSS contours
               /\
              /  \
             / () \   first touch with
            /  ()  \  round L2 ball
           /   ()   \

  effect: smooth shrinkage, rarely exact zeros

======================================================================

For AI: weight decay in deep learning is the large-scale descendant of Ridge. The context is different, but the bias-variance and shrinkage logic is the same.
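
A closed-form Ridge sketch on standardized features showing the smooth shrinkage of coefficients as the penalty grows; the collinear pair of columns is constructed for illustration:

```python
import numpy as np

# Sketch: closed-form Ridge on standardized features; the intercept is handled
# by centering y, so only slopes are penalized.
rng = np.random.default_rng(9)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)      # two nearly collinear columns
beta_true = np.array([2.0, 0.0, -1.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

Xs = (X - X.mean(0)) / X.std(0)                    # scale before penalizing
yc = y - y.mean()

for lam in [0.0, 1.0, 10.0, 100.0]:
    beta_ridge = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    print(lam, np.round(beta_ridge, 2))            # coefficients shrink as lam grows
```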

5.2 Lasso Regression and Sparse Feature Selection

Lasso solves

$$\hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \left\{ \lVert \mathbf{y} - X\boldsymbol{\beta}\rVert_2^2 + \lambda \lVert \boldsymbol{\beta}\rVert_1 \right\}.$$

Unlike Ridge, the L1 penalty has corners. Those corners are what make exact zeros common. Lasso therefore performs estimation and variable selection simultaneously.

The geometry explains the sparsity. In low dimensions, RSS contours are smooth ellipses, while the L1 constraint region is a diamond. The first tangency often occurs at a corner, which corresponds to one or more coefficients being exactly zero. This is why Lasso is especially attractive when many predictors are plausibly irrelevant or when interpretability requires a compact model.

Lasso is not magic. With strongly correlated predictors, it may pick one variable somewhat arbitrarily and discard another equally informative one. Its selection can be unstable across samples. Still, it often yields a useful sparse approximation, especially in high-dimensional settings where plain OLS is impossible.

The Bayesian preview is again straightforward: an independent Laplace prior yields a MAP objective with an L1 penalty. But full Bayesian sparsity modeling belongs back in Bayesian Inference.

LASSO GEOMETRY
======================================================================

  minimize RSS(beta) subject to ||beta||_1 <= c

      elliptical RSS contours
               /\
              /  \
             / /\ \
            /_/  \_\

  effect: corners encourage exact zero coefficients

======================================================================
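
A coordinate-descent sketch using the soft-thresholding update implied by the penalty above; this is a teaching implementation, not a tuned solver:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Sketch: coordinate descent for min ||y - X b||^2 + lam * ||b||_1,
    matching the penalty scaling above (no intercept, columns pre-scaled)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]              # partial residual without feature j
            b[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return b

rng = np.random.default_rng(10)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = [3.0, -2.0, 1.5]   # sparse truth
y = X @ beta_true + rng.normal(scale=0.5, size=n)
print(np.round(lasso_cd(X, y, lam=50.0), 2))                # most entries exactly zero
```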

5.3 Logistic Regression as a Bernoulli GLM

When the response is binary, plain linear regression is poorly aligned with the problem because predictions can fall outside $[0,1]$ and variance is not constant. Logistic regression fixes this by modeling

$$Y_i \mid \mathbf{x}_i \sim \operatorname{Bernoulli}(\pi_i), \qquad \log\frac{\pi_i}{1-\pi_i} = \mathbf{x}_i^\top \boldsymbol{\beta}.$$

Equivalently,

$$\pi_i = \sigma(\mathbf{x}_i^\top \boldsymbol{\beta}) = \frac{1}{1+e^{-\mathbf{x}_i^\top \boldsymbol{\beta}}}.$$

The coefficient interpretation changes from mean shifts to log-odds shifts. A one-unit increase in predictor $x_j$ changes the log-odds by $\beta_j$, holding the other predictors fixed. Exponentiating gives an odds ratio of $e^{\beta_j}$.

Logistic regression is fit by maximum likelihood, not by closed-form normal equations. The log-likelihood is

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ y_i \log \pi_i + (1-y_i)\log(1-\pi_i) \right].$$

Maximizing this is equivalent to minimizing cross-entropy loss, which is why logistic regression is a direct ancestor of modern classification training.

Logistic regression deserves to be called the canonical binary baseline in ML. It is fast, probabilistic, well calibrated relative to many margin-based methods, and often strong on high-quality features.
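
A Newton (IRLS) sketch for maximizing the Bernoulli log-likelihood on simulated data; the odds-ratio printout ties back to the coefficient interpretation above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25):
    """Sketch: logistic regression by Newton's method (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(X @ beta)
        w = pi * (1 - pi)                       # Bernoulli variances
        grad = X.T @ (y - pi)                   # score vector
        hess = X.T @ (X * w[:, None])           # X^T diag(w) X
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(11)
n = 500
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=n) < sigmoid(X @ beta_true)).astype(float)

beta_hat = fit_logistic_newton(X, y)
print(np.round(beta_hat, 2), np.round(np.exp(beta_hat[1:]), 2))  # coefficients and odds ratios
```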

5.4 Poisson Regression for Counts and Rates

For nonnegative counts, Poisson regression models

$$Y_i \mid \mathbf{x}_i \sim \operatorname{Poisson}(\mu_i), \qquad \log \mu_i = \mathbf{x}_i^\top \boldsymbol{\beta}.$$

The log link keeps the mean positive while preserving a linear predictor on the transformed scale. This makes Poisson regression natural for counts such as incidents, clicks, requests, or failures.

The model has a built-in mean-variance relation:

$$\mathbb{E}[Y_i \mid \mathbf{x}_i] = \operatorname{Var}(Y_i \mid \mathbf{x}_i) = \mu_i.$$

That is sometimes appropriate and sometimes too restrictive. If variance is much larger than the mean, overdispersion is present and a richer model may be needed. Still, Poisson regression is the canonical place to begin when counts or event rates are involved.

An especially useful extension is an offset. If exposure differs across observations, we write

$$\log \mu_i = \log(\text{exposure}_i) + \mathbf{x}_i^\top \boldsymbol{\beta},$$

which models rates rather than raw counts. This is common in production monitoring, advertising, epidemiology, and event logging.

Examples:

  • incidents per thousand requests
  • user clicks per impression volume
  • failures per machine-hour

Non-examples:

  • using OLS on small counts with many zeros and then interpreting negative fitted values
  • assuming Poisson is always valid when counts are observed
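
A Newton-style sketch of Poisson regression with a log-exposure offset; the function and variable names are illustrative rather than a library API:

```python
import numpy as np

def fit_poisson(X, y, offset=None, n_iter=25):
    """Sketch: Poisson regression with a log link and an optional
    log-exposure offset, fit by Newton's method."""
    if offset is None:
        offset = np.zeros(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(offset + X @ beta)
        grad = X.T @ (y - mu)                   # score for the Poisson log-likelihood
        hess = X.T @ (X * mu[:, None])          # X^T diag(mu) X
        beta += np.linalg.solve(hess, grad)
    return beta

# Incidents per exposure: the offset log(exposure) turns a count model into a rate model.
rng = np.random.default_rng(12)
n = 400
x = rng.normal(size=n)
exposure = rng.uniform(100, 10000, size=n)      # e.g. requests served
rate = np.exp(-6.0 + 0.4 * x)                   # incidents per request
y = rng.poisson(exposure * rate)

X = np.column_stack([np.ones(n), x])
print(np.round(fit_poisson(X, y, offset=np.log(exposure)), 2))  # roughly [-6.0, 0.4]
```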

5.5 Generalized Linear Models

Generalized linear models unify several important regressions under one template:

  1. a response distribution from the exponential family
  2. a linear predictor $\eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}$
  3. a link function $g$ such that $g(\mu_i) = \eta_i$, where $\mu_i = \mathbb{E}[Y_i \mid \mathbf{x}_i]$

Linear regression, logistic regression, and Poisson regression are the flagship examples. The shared structure matters because it lets us reuse intuition: feature coding, regularization, deviance, and coefficient interpretation all revolve around the same linear predictor even when the response model changes.

GLMs clarify what is preserved and what changes. The predictor remains linear in the coefficients. The loss now comes from a likelihood suited to the response distribution. Variance is no longer constant in general. Interpretation is tied to the link scale: means for Gaussian identity, log-odds for Bernoulli logit, log-rates for Poisson log.

This perspective is also the bridge from classical regression to many ML losses. Cross-entropy is Bernoulli or categorical GLM likelihood. Count losses are Poisson-like. Regularization layers on top of the same framework.


6. Advanced Topics

6.1 Basis Expansions, Polynomial Regression, and Spline Intuition

A linear model can represent nonlinear relationships once the features are expanded. If we replace $x$ by basis functions

$$\phi_1(x), \ldots, \phi_K(x),$$

then the model

$$\mathbb{E}[Y \mid X = x] = \beta_0 + \sum_{k=1}^K \beta_k \phi_k(x)$$

is still linear in the coefficients. Polynomial regression uses powers such as $x^2$ and $x^3$. Spline methods use localized basis functions that produce smoother, more flexible fits without the instability of high-degree global polynomials.

This is an important conceptual lesson: "linear model" refers to linearity in parameters, not necessarily linearity in the raw input variable. Feature engineering can therefore recover a great deal of nonlinear structure while preserving the familiar regression machinery.

The danger is overfitting and numerical instability. As the basis grows, the design matrix can become ill conditioned. Regularization and validation then become essential.
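
A sketch of a polynomial basis expansion fit with a small ridge term to tame the conditioning; the degree and penalty are arbitrary choices:

```python
import numpy as np

# Sketch: a nonlinear response fit by a model that is linear in its coefficients,
# via polynomial basis columns plus a small ridge penalty for numerical stability.
rng = np.random.default_rng(13)
n = 200
x = rng.uniform(-2, 2, size=n)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=n)

degree, lam = 7, 1e-3
Phi = np.column_stack([x**k for k in range(degree + 1)])   # 1, x, x^2, ..., x^7
beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)

x_grid = np.linspace(-2, 2, 9)
Phi_grid = np.column_stack([x_grid**k for k in range(degree + 1)])
print(np.round(Phi_grid @ beta - np.sin(2 * x_grid), 2))   # fitted curve tracks sin(2x) closely
```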

6.2 High-Dimensional Regression and the p > n Regime

When $p > n$, plain OLS no longer has a unique solution because the column rank of $X$ cannot exceed $n$. There are infinitely many coefficient vectors that interpolate the training data whenever interpolation is possible. This regime is not exotic in modern ML. It is normal.

The statistical question changes. Instead of asking for the unique unbiased linear estimator, we ask which inductive bias selects among many compatible fits. Minimum-norm interpolation, Ridge shrinkage, Lasso sparsity, and elastic-net tradeoffs all become ways to impose structure.

High-dimensional regression also changes how we interpret classic fit metrics. Training error can be near zero while generalization depends almost entirely on regularization and feature structure. This is where the chapter's regression material begins to touch the later themes of optimization and modern overparameterized learning.
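
A sketch contrasting the minimum-norm interpolator with a ridge fit when p > n; the shapes and penalty value are arbitrary:

```python
import numpy as np

# Sketch: with p > n, the pseudoinverse picks the minimum-norm interpolator,
# while ridge trades exact interpolation for smaller coefficients.
rng = np.random.default_rng(14)
n, p = 30, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:5] = rng.normal(size=5)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_min_norm = np.linalg.pinv(X) @ y                      # interpolates the training data
beta_ridge = np.linalg.solve(X.T @ X + 10.0 * np.eye(p), X.T @ y)

print(np.allclose(X @ beta_min_norm, y))                   # True: zero training error
print(np.linalg.norm(beta_min_norm), np.linalg.norm(beta_ridge))
```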

6.3 Robust Regression and Outlier-Resistant Losses

OLS squares residuals, so large residuals receive disproportionately large weight. That is desirable when Gaussian noise is appropriate, but it can be fragile under heavy tails or contamination. Robust regression replaces or supplements the squared loss with functions less dominated by extreme points, such as absolute loss, Huber loss, or Tukey-style redescending losses.

The point is not to become indifferent to unusual observations. Sometimes outliers are the most important data in the set. The point is to distinguish signal from estimator fragility. A robust fit asks: would the substantive conclusion survive if one or two extreme observations were slightly perturbed?

Robust methods are especially relevant in production logging and telemetry, where instrumentation glitches, bursts, or upstream failures can create residual behavior far from ideal Gaussian assumptions.
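
A sketch of Huber-loss regression via iteratively reweighted least squares, with a few planted gross outliers to show the contrast with OLS:

```python
import numpy as np

# Sketch: Huber-loss regression by iteratively reweighted least squares.
# Points with residuals beyond `delta` are downweighted instead of squared.
def huber_irls(X, y, delta=1.0, n_iter=50):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)           # start from the OLS fit
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
    return beta

rng = np.random.default_rng(16)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[:5] += 20.0                                              # a few gross outliers
X = np.column_stack([np.ones(n), x])
ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(ols, 2), np.round(huber_irls(X, y), 2))     # Huber fit stays near (1, 2)
```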

6.4 Model Selection, Cross-Validation, and Information Criteria

Choosing the predictor set, basis complexity, or regularization strength is itself a statistical problem. Two broad strategies dominate.

  1. Cross-validation: estimate out-of-sample predictive performance by repeated train-validation splits.
  2. Information criteria: penalize fit by complexity through quantities such as AIC or BIC.

Cross-validation is the more ML-native view because it targets prediction directly. Information criteria are lighter-weight analytic summaries tied to likelihood and effective complexity. Neither is universally superior. For small datasets, interpretability, or likelihood-based comparison, information criteria can be useful. For flexible models and prediction-driven workflows, cross-validation is often safer.

The key regression lesson is that in-sample fit is not model selection. A larger model always fits at least as well in sample. Selection needs an explicit complexity penalty, validation split, or prior structural preference.
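
The toy simulation below illustrates that lesson on arbitrary made-up data: as polynomial degree grows, in-sample error can only shrink, while 5-fold cross-validation error eventually turns back up.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-2, 2, 60)
y = np.sin(1.5 * x) + 0.3 * rng.standard_normal(x.size)

def poly_val_mse(train_idx, val_idx, degree):
    """Fit a polynomial of the given degree on train indices, score on validation."""
    coefs = np.polyfit(x[train_idx], y[train_idx], degree)
    pred = np.polyval(coefs, x[val_idx])
    return float(np.mean((y[val_idx] - pred) ** 2))

all_idx = np.arange(x.size)
folds = np.array_split(rng.permutation(x.size), 5)

for degree in (1, 3, 5, 9, 12):
    in_sample = poly_val_mse(all_idx, all_idx, degree)
    cv = np.mean([poly_val_mse(np.setdiff1d(all_idx, f), f, degree) for f in folds])
    # In-sample error never worsens with degree; CV error eventually rises
    print(f"degree {degree:2d}  in-sample MSE {in_sample:.3f}  5-fold CV MSE {cv:.3f}")
```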

6.5 Omitted-Variable Bias, Extrapolation, and Causal Caution

Regression coefficients are conditional associations under the included model. If an important omitted variable affects both a predictor and the response, the included coefficient can absorb that missing structure. This is omitted-variable bias. The formula for the bias in simple settings is worth knowing, but the main lesson here is conceptual: regression alone does not create causal identification.
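
For reference, the standard two-predictor version of that bias formula: suppose the true conditional mean is

\mathbb{E}[Y \mid x_1, x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2,

but we fit the short regression of Y on x_1 alone. Writing \delta_1 for the slope of the auxiliary regression of x_2 on x_1, the short-regression slope satisfies

\mathbb{E}[\hat{\beta}_1^{\text{short}}] = \beta_1 + \beta_2 \delta_1,

so the bias \beta_2 \delta_1 vanishes only when the omitted variable is irrelevant (\beta_2 = 0) or unrelated to the included predictor (\delta_1 = 0).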

Extrapolation is a different but equally common danger. A linear model fit on a certain design region can behave wildly or implausibly outside that region. The algebra happily outputs a prediction; the statistics do not guarantee it is meaningful.

Because a full treatment belongs in Chapter 22: Causal Inference, we keep the causal preview brief here. Regression is a powerful conditional-prediction tool. Sometimes, with the right design assumptions, it also supports causal interpretation. But the latter requires extra structure beyond the regression formula itself.


7. Applications in Machine Learning

7.1 Linear Regression Baselines for Tabular and Structured Prediction

For structured data, linear and generalized linear models are often the first serious baseline, not a toy baseline. They are cheap to fit, easy to debug, and robust under limited data. When feature engineering is strong, they can be surprisingly competitive.

This matters because benchmark culture sometimes skips directly to larger black-box models. That is risky. A regression baseline can expose leakage, show whether the task is mostly linear, and identify which groups or ranges are poorly captured before heavy modeling begins.

7.2 Logistic Regression Baselines, Calibration, and Probability Outputs

Logistic regression remains one of the strongest default classifiers when well-structured features are available. Its outputs are probabilities, its coefficients are interpretable, and its convex objective avoids many training pathologies of richer models.

It is also a natural calibration tool. Platt scaling and related post-processing methods are logistic-style regressions on scores. This connects classical GLM ideas directly to modern confidence calibration for classifiers and rankers.
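
A minimal sketch of that connection (scikit-learn assumed, scores and labels simulated): the calibrator is nothing more than a one-feature logistic regression from a raw score to a probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Pretend these are uncalibrated scores from an upstream classifier,
# evaluated on a held-out calibration split with known labels
scores = rng.normal(0, 2.0, 500)
labels = (rng.random(500) < 1 / (1 + np.exp(-0.6 * scores - 0.5))).astype(int)

# Platt-style scaling: logistic regression from score to probability
calibrator = LogisticRegression()
calibrator.fit(scores.reshape(-1, 1), labels)

new_scores = np.array([[-3.0], [0.0], [3.0]])
print(np.round(calibrator.predict_proba(new_scores)[:, 1], 3))
```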

7.3 Linear Probes on Embeddings and Hidden States

A linear probe asks whether a representation contains information that a simple linear readout can extract. If a concept is linearly decodable from a hidden state, that suggests the representation organizes that concept accessibly, though not necessarily causally or robustly.

Probe results should be interpreted carefully. High probe accuracy may reflect representation quality, dataset artifacts, or probe capacity interacting with class imbalance. Still, the reason probes are everywhere in interpretability is that regression and classification on frozen features are simple, cheap, and mathematically transparent.
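
In code, a probe is correspondingly small. The sketch below uses random matrices as stand-ins for frozen hidden states (scikit-learn assumed) and sweeps the regularization strength, which is one of the probe-capacity knobs mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# Stand-in for frozen hidden states: 1000 examples, 256-dim embeddings,
# with a noisily linearly decodable binary concept
H = rng.standard_normal((1000, 256))
w_concept = rng.standard_normal(256)
labels = (H @ w_concept + 10.0 * rng.standard_normal(1000) > 0).astype(int)

train, test = np.arange(800), np.arange(800, 1000)

# The probe is just regularized logistic regression on the frozen features
for C in (0.01, 1.0, 100.0):
    probe = LogisticRegression(C=C, max_iter=1000).fit(H[train], labels[train])
    acc = probe.score(H[test], labels[test])
    print(f"C={C:<6} held-out probe accuracy: {acc:.3f}")
```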

7.4 Ridge, Lasso, Weight Decay, and Sparse or Low-Rank Adaptation

Regularized regression concepts reappear constantly in deep learning. Weight decay is Ridge-like shrinkage. Sparse penalties echo Lasso-style feature selection. Low-rank adaptation methods can be viewed as imposing structured low-dimensional coefficient updates rather than full unrestricted parameter movement.

This does not mean a transformer fine-tuning run is literally a classical regression problem. The surrounding optimization and representation learning are much richer. But the regularization logic is shared: restrict flexibility, stabilize estimation, and encode a prior preference for simpler or smaller updates.
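
One way to see the shared logic, under the simplifying assumption of a plain linear model trained by full-batch gradient descent: adding a decay term to the gradient drives the weights to the closed-form Ridge solution, because the decay term is exactly the gradient of an L2 penalty.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

lam = 5.0

# Closed-form Ridge solution
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Gradient descent on squared error with weight decay:
# the decay term lam * w is the gradient of 0.5 * lam * ||w||^2
w = np.zeros(p)
lr = 1e-3
for _ in range(20000):
    grad = X.T @ (X @ w - y)          # gradient of 0.5 * ||y - Xw||^2
    w -= lr * (grad + lam * w)        # weight decay step = Ridge gradient step

print("max |w_decay - w_ridge|:", float(np.max(np.abs(w - w_ridge))))
```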

7.5 GLMs in Production Systems

GLMs remain useful in production because many operational targets are naturally binary, counted, or rate-based. Examples include:

  • click-through prediction
  • conversion probability
  • count of failures per unit time
  • support-ticket arrivals
  • incidents per request volume
  • moderation-trigger probability

In such systems, GLMs often win on a combination of interpretability, training speed, monitoring ease, and calibration. They also make excellent fallback models when richer systems become unstable or hard to debug.
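
As one illustrative sketch of the counts-per-exposure pattern (statsmodels assumed available; the load feature, exposure values, and coefficients below are invented), a Poisson GLM with a log-exposure offset recovers multiplicative effects on an incident rate:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Toy incident counts with varying request volume (exposure)
n = 500
load = rng.uniform(0, 1, n)                 # hypothetical traffic/load feature
exposure = rng.uniform(1e3, 1e5, n)         # requests observed per window
rate = np.exp(-9.0 + 1.2 * load)            # incidents per request
counts = rng.poisson(rate * exposure)

X = sm.add_constant(load)
model = sm.GLM(counts, X, family=sm.families.Poisson(), offset=np.log(exposure))
result = model.fit()

print(result.params)          # intercept near -9.0, slope near 1.2
print(np.exp(result.params))  # multiplicative effects on the incident rate
```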


8. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Treating regression as "just correlation with a fitted line" | Regression is conditional modeling with an estimator, uncertainty model, and diagnostics, not merely an association summary | Start from \mathbb{E}[Y \mid X] and define the response, predictors, and assumptions explicitly |
| 2 | Interpreting coefficients causally by default | Conditional association is not causal identification | Use causal language only with supporting design assumptions and methods |
| 3 | Comparing raw coefficient magnitudes across differently scaled features | Units and scaling affect coefficients directly | Standardize when needed and compare on a common scale or through partial effects |
| 4 | Reading a high R^2 as proof of a good model | High fit can coexist with leakage, extrapolation failure, or patterned residuals | Check residuals, validation performance, and domain plausibility |
| 5 | Ignoring residual plots because p-values look fine | Misspecified mean structure can survive coefficient significance tests | Inspect residual-vs-fitted plots, QQ-plots, and groupwise residual behavior |
| 6 | Trusting OLS coefficients under strong multicollinearity | Near dependence inflates variance and destabilizes interpretation | Check condition numbers and VIFs; use Ridge or rethink the feature set |
| 7 | Assuming homoscedasticity because OLS ran without error | OLS fitting does not validate the constant-variance assumption | Use residual diagnostics, robust SEs, or weighted least squares when needed |
| 8 | Using linear regression for binary probabilities without checking range and variance issues | OLS can predict outside [0, 1] and is poorly matched to Bernoulli noise | Use logistic regression or another appropriate binary model |
| 9 | Treating Lasso feature selection as stable truth | Correlated features can make selection unstable across samples | Examine selection stability and consider Elastic Net or grouped structure |
| 10 | Forgetting to scale predictors before regularization | L1 and L2 penalties act on coefficient magnitude, so scale changes the penalty's practical effect | Center/scale predictors before penalized fitting unless there is a deliberate reason not to |
| 11 | Using random train/test shuffles when observations are time dependent | Temporal leakage makes generalization estimates too optimistic | Use chronological splits and time-series methods where dependence matters |
| 12 | Extrapolating far beyond the observed predictor range because the equation still returns a number | The fitted hyperplane is only supported by the observed design region | Flag extrapolation explicitly and gather data in the relevant regime |

9. Exercises

These exercises are mirrored in exercises.ipynb, where each problem is followed by a runnable scaffold cell and a full reference solution.

  1. Exercise 1 * - From Covariance to Slope Let (x_i, y_i) be paired observations. (a) Derive the simple-regression slope from the least-squares objective. (b) Rewrite the slope as sample covariance divided by sample variance. (c) Express it in terms of the sample correlation coefficient and sample standard deviations. (d) Explain why scaling x changes the slope but not the fitted values after reparameterization.

  2. Exercise 2 * - Matrix OLS and Residual Orthogonality Given a small design matrix X with intercept and response vector \mathbf{y}, (a) compute \hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}, (b) compute the fitted values and residuals, (c) verify numerically that X^\top \mathbf{e} = \mathbf{0}, and (d) interpret the orthogonality condition geometrically.

  3. Exercise 3 * - Regression Intervals For a fitted simple regression with known residual variance estimate, (a) compute a confidence interval for the slope, (b) compute a confidence interval for the mean response at a new point, (c) compute a prediction interval for a new observation, and (d) explain why the prediction interval is wider.

  4. Exercise 4 ** - Diagnosing Multicollinearity Using a design matrix with highly correlated predictors, (a) compute the condition number, (b) compute VIF values, (c) compare OLS coefficients before and after a small perturbation, and (d) explain why prediction can remain stable while interpretation becomes unstable.

  5. Exercise 5 ** - Ridge vs Lasso Under Correlated Predictors Fit OLS, Ridge, and Lasso to a small dataset with redundant features. (a) Compare coefficient norms, (b) identify which coefficients Lasso sets to zero, (c) compare prediction error on a held-out point, and (d) explain the different geometric behavior of the L1 and L2 penalties.

  6. Exercise 6 ** - Logistic Regression and Odds Ratios For a binary-outcome dataset, (a) compute predicted probabilities from a logistic model, (b) convert a coefficient to an odds ratio, (c) identify the decision boundary under threshold 0.5, and (d) explain why a linear-probability model is less appropriate here.

  7. Exercise 7 ** - Poisson Regression for Event Rates Using count data and exposure values, (a) build a log-rate model with an offset, (b) compute predicted rates, (c) interpret a coefficient multiplicatively, and (d) explain what overdispersion would look like.

  8. Exercise 8 *** - Leverage, Influence, and Cook's Distance In a small regression problem with one unusual predictor point, (a) compute leverage scores, (b) approximate or compute Cook's distance, (c) compare the fit with and without the influential point, and (d) explain whether the point looks like valuable structure or likely contamination.

  9. Exercise 9 *** - Linear Probes and Regularized Readouts Treat a matrix of frozen embeddings as predictors and labels as the response. (a) Fit a regularized linear or logistic readout, (b) compare weak and strong regularization, (c) explain how this mirrors a probe on hidden states, and (d) state one reason probe accuracy should not be overinterpreted as causal understanding.


10. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---------|-----------|
| Design matrix thinking | Encourages explicit feature auditing, leakage detection, and representation inspection before scaling to heavier models |
| OLS projection geometry | Clarifies what a linear head can and cannot express on frozen embeddings or engineered features |
| Gauss-Markov | Explains why OLS is still a principled baseline under classical assumptions rather than a naive first try |
| Regression-specific intervals | Supports uncertainty estimates for scores, forecasts, and operational targets |
| Residual diagnostics | Reveals systematic failure modes in slices, regimes, and user groups that average loss can hide |
| Multicollinearity analysis | Matters for correlated features, embeddings, and attribution stories in high-dimensional ML systems |
| Ridge shrinkage | Reappears as weight decay and other stability-inducing penalties in modern training |
| Lasso sparsity | Connects to sparse feature selection, pruning-style thinking, and structured compression |
| Logistic regression | Remains the canonical interpretable probabilistic classifier and calibration primitive |
| Poisson / rate modeling | Useful for counts, incidents, clicks, requests, and monitoring metrics in production |
| GLM framework | Shows that many common ML losses are structured likelihood choices around a linear predictor |
| Basis expansion | Demonstrates how feature engineering can add nonlinearity without abandoning the regression framework |
| Robust regression | Helps production systems stay useful when logs contain bursts, glitches, or contaminated points |
| Cross-validation and information criteria | Keep model-selection discipline grounded in predictive performance rather than in-sample fit |
| Linear probes | Make regression central to representation analysis and interpretability research |

Regression matters for AI because it sits at the boundary between interpretable statistics and scalable machine learning. It is simple enough to analyze exactly, rich enough to support real deployment tasks, and foundational enough that its ideas keep reappearing under different names in larger systems.


11. Conceptual Bridge

Regression analysis closes Chapter 7 by turning the chapter's statistical ideas into a general supervised-learning language. From Descriptive Statistics, it inherits covariance, correlation, and variance summaries. From Estimation Theory, it inherits estimator language, likelihood reasoning, and interval logic. From Hypothesis Testing, it inherits coefficient testing, model-comparison logic, and the discipline of uncertainty-aware claims. From Bayesian Inference, it inherits the prior-based interpretation of regularization and the distinction between point estimates and full posterior uncertainty. From Time Series, it inherits caution about dependence and the danger of pretending sequential residual structure is iid noise.

Forward, regression hands off directly to Chapter 8: Optimization. The normal equations show that some statistical problems have closed forms, but modern ML depends on gradient-based procedures precisely because most interesting models do not. Ridge and Lasso already preview constrained and penalized objectives. Logistic regression already previews convex likelihood optimization. Cross-validation and hyperparameter tuning already preview the practical question of how optimization targets are selected in the first place.

The bridge also extends outward. Math for Specific Models: Linear Models will revisit many of these ideas inside a broader landscape of classifiers, Bayesian linear models, double descent, and low-rank adaptation. Chapter 22: Causal Inference will revisit regression from the perspective of identification rather than only prediction. And many later ML chapters will quietly reuse the same ideas: linear predictors, regularization, probabilistic readouts, and diagnostics on residual errors or representation quality.

ASCII CURRICULUM POSITION
======================================================================

  Descriptive Statistics
      -> covariance, correlation, scaling
             |
             v
  Estimation Theory
      -> estimators, likelihood, intervals
             |
             v
  Hypothesis Testing
      -> coefficient tests, model comparison
             |
             v
  Bayesian Inference
      -> priors, MAP, uncertainty-aware regularization
             |
             v
  Time Series
      -> dependence cautions for residual structure
             |
             v
  Regression Analysis
      -> OLS, diagnostics, regularization, GLMs, supervised-learning blueprint
             |
             +---------------------> Optimization
             +---------------------> Linear Models
             +---------------------> Causal Inference

======================================================================

One way to summarize the whole chapter arc is this: statistics began by describing data, then learned how to estimate hidden quantities, then learned how to test claims, then learned how to update beliefs, then learned how to respect temporal dependence, and finally learned how to model responses from predictors. Regression is where those threads become the standard language of supervised prediction.

References

  1. Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer.
  2. Weisberg, Sanford. Applied Linear Regression. Wiley.
  3. Montgomery, Douglas C., Elizabeth A. Peck, and G. Geoffrey Vining. Introduction to Linear Regression Analysis. Wiley.
  4. Gelman, Andrew, Jennifer Hill, and Aki Vehtari. Regression and Other Stories. Cambridge University Press.
  5. Kutner, Michael H., Christopher Nachtsheim, John Neter, and William Li. Applied Linear Statistical Models. McGraw-Hill.
  6. Berkeley Stat 153 course notes on multiple linear regression, Spring 2026 edition.
  7. Gregorka, B. K. Linear-model course materials for STAT 714, University of South Carolina, 2023.
  8. MIT lecture notes on logistic regression from 15.075, 2003 edition.
  9. Geyer, Charles. University of Minnesota notes on generalized linear models.
  10. Tibshirani, Robert. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B (1996).
  11. Hoerl, Arthur E., and Robert W. Kennard. "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics (1970).
  12. Nelder, John A., and Robert W. M. Wedderburn. "Generalized Linear Models." Journal of the Royal Statistical Society, Series A (1972).
  13. Cook, R. Dennis. "Detection of Influential Observation in Linear Regression." Technometrics (1977).