Math for LLMs

Vectors and Spaces, Part 1: Intuition through Affine Subspaces and Convexity (Sections 1-8)

1. Intuition

1.1 What Are Vectors and Spaces?

A vector is first encountered as an arrow with magnitude and direction, but that picture is only the beginning. In modern linear algebra, a vector is any element of a set that supports two operations:

  1. addition of vectors
  2. multiplication of a vector by a scalar

If those operations satisfy the right axioms, the set is called a vector space. The power of the abstraction is that the "things" inside the space do not need to be arrows in physical space. They may be coordinate tuples, polynomials, matrices, functions, sequences, or even gradient fields.

Concrete examples help fix the idea:

  • $(3, -1, 2) \in \mathbb{R}^3$ is a vector
  • a polynomial such as $1 + 2x - 5x^2$ is a vector in a polynomial space
  • a matrix in $\mathbb{R}^{m \times n}$ is a vector in a matrix space
  • a continuous function $f : [0,1] \to \mathbb{R}$ is a vector in a function space

The unifying idea is linear structure. If you can sensibly add two objects of the same kind and scale the result by numbers, then linear algebra is likely available.

1.2 Why Vectors and Spaces Are Central to AI

Nearly every important object in deep learning is a vector or lives in a vector space:

  • token embeddings are vectors in $\mathbb{R}^d$
  • hidden states are vectors in a residual stream
  • query, key, and value representations are vectors in learned subspaces
  • parameter sets are points in a very large Euclidean space
  • gradients are vectors in the same parameter space
  • logits are vectors in vocabulary space
  • probability outputs live in the probability simplex, which sits inside a vector space but is not itself one

The Transformer made this especially explicit. Attention compares a query vector $q$ with key vectors $k_j$ using dot products, then produces an output as a weighted sum of value vectors $v_j$:

$$\mathrm{Attention}(q, K, V) = \sum_{j=1}^{n} \alpha_j v_j, \qquad \alpha_j = \frac{\exp(q^\top k_j / \sqrt{d_k})}{\sum_{\ell=1}^{n} \exp(q^\top k_\ell / \sqrt{d_k})}.$$

That is pure vector-space computation: inner products, scaling, exponentiation of scores, and linear combination. The semantic geometry of embeddings became operational in NLP through word2vec (Mikolov et al., 2013), and the geometry of projected subspaces became central to model architecture through self-attention (Vaswani et al., 2017). In large open models such as Llama 3, parameter space itself has hundreds of billions of coordinates at the frontier scale (Dubey et al., 2024), so geometric reasoning is not optional; it is the only way to think clearly about what the model can represent.
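
That computation is short enough to write out directly. Below is a minimal numpy sketch of single-query attention; the toy dimensions and variable names are illustrative assumptions, not from any particular library.

import numpy as np

def attention(q, K, V):
    # Scores are scaled inner products q^T k_j.
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)
    # Softmax turns scores into positive weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # Output is a weighted sum (linear combination) of the value vectors.
    return weights @ V

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(attention(q, K, V))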

1.3 Geometric, Algebraic, and Abstract Views

There are three complementary ways to think about vectors.

View      | Core idea                                      | Strength                                         | Typical AI use
Geometric | A vector is an arrow with length and direction | Builds intuition for angle, distance, projection | Similarity, attention, clustering
Algebraic | A vector is an ordered tuple of numbers        | Easy to compute with componentwise rules         | Arrays, tensors, embeddings, gradients
Abstract  | A vector is any element of a vector space      | Generalizes linear algebra beyond coordinates    | Function spaces, kernels, NTK, RKHS

The geometric view tells you why cosine similarity matters. The algebraic view tells you how to implement it. The abstract view tells you why the same mathematics reappears for functions, distributions, and operators.

One idea, three views

Geometric:             Algebraic:              Abstract:

    y                     [v1, v2, ... , vn]      v in V
    ^                                              where V has
    |   / v                                         vector addition
    |  /                                            and scalar multiplication
    | /
----+--------------> x

1.4 Where Vectors and Spaces Appear in AI

The flow through a language model can be read as a sequence of moves between vector spaces:

Token IDs
  -> embedding vectors in R^d
  -> sequence matrix X in R^(n x d)
  -> projected query/key/value spaces
  -> attention outputs as weighted sums
  -> feed-forward maps R^d -> R^d_ff -> R^d
  -> logits in R^|V|
  -> probabilities in the simplex Delta^(|V|-1)

Even when the last object is not a vector space, it usually sits inside one. The simplex of probability vectors is an affine slice of $\mathbb{R}^{|V|}$ with nonnegativity constraints. That pattern is common in AI: model objects often live in structured subsets of ambient vector spaces.

1.5 The Hierarchy of Structure

A useful way to organize the subject is by how much extra structure is available.

set
  -> abelian group (addition)
  -> vector space (addition + scalar multiplication)
     -> normed space (size)
        -> Banach space (complete normed space)
     -> inner product space (angles and lengths)
        -> Hilbert space (complete inner product space)

This branching picture is more accurate than a single chain. A normed space need not come from an inner product. An inner product space automatically induces a norm, and a Hilbert space is therefore also a Banach space, but not every Banach space is Hilbert. Most day-to-day ML uses finite-dimensional Euclidean spaces, where these distinctions collapse because all norms are equivalent and completeness is automatic. In theoretical ML and functional analysis, they matter a great deal.

1.6 Historical Timeline

The language of vectors and spaces was assembled gradually.

Period                 | Figure or development                        | Importance
Ancient mechanics      | Directed quantities in geometry and physics  | Proto-vector intuition
17th century           | Galileo and Newton                           | Composition of velocity and force
1843-1844              | Hamilton and Grassmann                       | Algebraic extension beyond ordinary 3D geometry
Late 19th century      | Peano and axiomatic linear algebra           | Abstract vector space formulation
Early 20th century     | Hilbert                                      | Infinite-dimensional inner product spaces
1920s-1930s            | Banach and functional analysis               | Normed complete spaces and operator theory
20th century computing | Numerical linear algebra                     | Matrix computation becomes practical at scale
2013                   | word2vec                                     | Semantic geometry becomes an engineering object
2017                   | Transformer                                  | Dot-product geometry becomes core architecture
2020s                  | Foundation models                            | Representation geometry becomes a first-class research topic

The modern lesson is simple: vectors started as geometry, became algebra, and now serve as the organizing language of computation.


2. Vectors in R^n

2.1 Definition and Notation

A vector in $\mathbb{R}^n$ is an ordered $n$-tuple of real numbers:

$$\mathbf{v} = (v_1, v_2, \ldots, v_n) \in \mathbb{R}^n.$$

In linear algebra, the default convention is usually the column vector:

$$\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}.$$

Key notation:

  • $v_i$ is the $i$-th component or coordinate
  • $\mathbf{0} = (0, \ldots, 0)$ is the zero vector
  • $\mathbf{e}_i$ is the $i$-th standard basis vector, with a 1 in position $i$ and 0 elsewhere
  • $\mathbf{v}^\top$ denotes the transpose, turning a column into a row

Every vector in $\mathbb{R}^n$ can be written as

$$\mathbf{v} = v_1 \mathbf{e}_1 + v_2 \mathbf{e}_2 + \cdots + v_n \mathbf{e}_n.$$

That formula is already telling you something profound: coordinates are not the vector itself; they are the coefficients of the vector relative to a chosen basis.

2.2 Vector Addition

Vectors in $\mathbb{R}^n$ add componentwise:

$$(\mathbf{u} + \mathbf{v})_i = u_i + v_i.$$

So if

$$\mathbf{u} = \begin{pmatrix} 1 \\ -2 \\ 5 \end{pmatrix}, \qquad \mathbf{v} = \begin{pmatrix} 3 \\ 4 \\ -1 \end{pmatrix},$$

then

$$\mathbf{u} + \mathbf{v} = \begin{pmatrix} 4 \\ 2 \\ 4 \end{pmatrix}.$$

Vector addition satisfies the familiar algebraic laws:

  • commutativity: $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$
  • associativity: $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$
  • identity: $\mathbf{v} + \mathbf{0} = \mathbf{v}$
  • inverse: $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$

Geometrically, addition follows the parallelogram law: place the two vectors tail to tail, and their sum is the diagonal of the parallelogram they span. Equivalently, place the tail of one vector at the head of the other; the sum runs from the free tail to the free head. That picture becomes important later when we interpret residual connections as additive updates to a representation vector.

Head-to-tail view of vector addition

O ----u----> A ----v----> B
O ----------------------> B
            u + v

2.3 Scalar Multiplication

If $\alpha \in \mathbb{R}$ and $\mathbf{v} \in \mathbb{R}^n$, then scalar multiplication is also componentwise:

$$(\alpha \mathbf{v})_i = \alpha v_i.$$

For example,

$$2 \begin{pmatrix} 1 \\ -3 \\ 4 \end{pmatrix} = \begin{pmatrix} 2 \\ -6 \\ 8 \end{pmatrix}, \qquad -\begin{pmatrix} 1 \\ -3 \\ 4 \end{pmatrix} = \begin{pmatrix} -1 \\ 3 \\ -4 \end{pmatrix}.$$

The main properties are:

  • $\alpha(\mathbf{u} + \mathbf{v}) = \alpha \mathbf{u} + \alpha \mathbf{v}$
  • $(\alpha + \beta)\mathbf{v} = \alpha \mathbf{v} + \beta \mathbf{v}$
  • $\alpha(\beta \mathbf{v}) = (\alpha \beta)\mathbf{v}$
  • $1\mathbf{v} = \mathbf{v}$

Geometrically, positive scaling stretches or shrinks the vector, while negative scaling also reverses direction.

Scalar multiplication changes length and possibly direction

alpha > 1           0 < alpha < 1         alpha < 0

O----->             O-->                  O<-----
stretch             shrink                flip + scale
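
Both operations are componentwise, so they map directly onto array arithmetic. A quick sketch with made-up vectors:

import numpy as np

u = np.array([1.0, -2.0, 5.0])
v = np.array([3.0, 4.0, -1.0])

print(u + v)   # componentwise addition: [4. 2. 4.]
print(2 * u)   # stretch:                [ 2. -4. 10.]
print(-u)      # flip direction:         [-1.  2. -5.]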

2.4 Linear Combinations

A linear combination of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is any expression of the form

$$\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k = \sum_{i=1}^{k} \alpha_i \mathbf{v}_i$$

with scalars $\alpha_1, \ldots, \alpha_k$.

This idea is central because most linear algebra questions reduce to asking which linear combinations are possible. If you know the allowable combinations, you know what a system can generate.

In AI, linear combinations are everywhere:

  • a dense layer computes weighted sums of input coordinates before adding bias
  • attention outputs are weighted sums of value vectors
  • residual streams combine contributions from multiple components additively
  • concept arithmetic in embedding space uses vector addition and subtraction as semantic approximations
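
The weighted sums in the list above are all instances of the same few lines of array code. A sketch with made-up vectors and coefficients:

import numpy as np

v1 = np.array([1.0, 0.0])
v2 = np.array([0.0, 1.0])
v3 = np.array([1.0, 1.0])

alphas = np.array([0.5, 0.3, 0.2])
combo = alphas[0] * v1 + alphas[1] * v2 + alphas[2] * v3
print(combo)  # [0.7 0.5], one reachable point among all linear combinations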

The famous analogy

$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$

is a statement that some semantic relations behave approximately linearly in embedding space (Mikolov et al., 2013). It is not an exact algebraic law, but it is a powerful geometric clue.

2.5 Linear Independence

Vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly independent if the only way to make the zero vector from them is the trivial way:

$$\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0} \quad \Longrightarrow \quad \alpha_1 = \cdots = \alpha_k = 0.$$

If a nontrivial combination gives zero, the vectors are linearly dependent. Dependence means redundancy: at least one vector can be expressed in terms of the others.

Examples:

  • $(1,0)$ and $(0,1)$ are independent in $\mathbb{R}^2$
  • $(1,2)$ and $(2,4)$ are dependent because $(2,4) = 2(1,2)$
  • any set containing $\mathbf{0}$ is automatically dependent

Testing independence computationally is a rank question. Put the vectors as the columns of a matrix $A$. Then:

  • if $\mathrm{rank}(A) = k$, the columns are independent
  • if $\mathrm{rank}(A) < k$, they are dependent

In model analysis, approximate dependence matters as much as exact dependence. If different heads or features occupy nearly the same direction, they carry overlapping information and may be prunable.

Independent vs dependent in R^2

Independent                     Dependent

      ^ y                            ^ y
      |   w                          |  2v
      |  /                           | /
      | /                            |/
------+------> x             -------+------> x
     /                               /
    v                               v

two distinct directions         same line, one is redundant
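
Checking independence numerically is one call. A sketch with the two examples above:

import numpy as np

independent = np.column_stack([[1, 0], [0, 1]])  # columns (1,0) and (0,1)
dependent = np.column_stack([[1, 2], [2, 4]])    # columns (1,2) and (2,4)

print(np.linalg.matrix_rank(independent))  # 2: rank equals k, independent
print(np.linalg.matrix_rank(dependent))    # 1: rank below k, dependent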

2.6 Span

The span of a set of vectors is the set of all linear combinations of those vectors:

$$\mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} = \left\{ \sum_{i=1}^{k} \alpha_i \mathbf{v}_i \;\middle|\; \alpha_i \in \mathbb{R} \right\}.$$

The span tells you every direction reachable from the given generating set.

Typical cases:

  • $\mathrm{span}\{\mathbf{v}\}$ is a line through the origin
  • $\mathrm{span}\{\mathbf{v}, \mathbf{w}\}$ is a plane through the origin if $\mathbf{v}$ and $\mathbf{w}$ are independent
  • $\mathrm{span}\{\mathbf{e}_1, \ldots, \mathbf{e}_n\} = \mathbb{R}^n$

Two facts are worth memorizing:

  1. span is always a subspace
  2. adding a vector outside the current span increases the dimension by one

This is why span is the right language for representational capacity. A low-rank matrix can only output vectors inside a low-dimensional span. A collection of value vectors can only produce attention outputs inside their span.

What span looks like geometrically

span{v}              span{v, w}               span{e1, ..., en}

   line                 plane                    whole ambient space

    /                    __________
   /                    /        /|
--+--->                /________/ |
                      |        |  |
                      |        | /
                      |________|/
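
One practical way to test span membership is least squares: if the residual is numerically zero, the target is a linear combination of the generators. A sketch with made-up vectors:

import numpy as np

# Columns are the generating vectors v = (1,0,1) and w = (0,1,1).
A = np.column_stack([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
b = np.array([2.0, 3.0, 5.0])  # equals 2v + 3w, so it lies in the span

coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
print(coeffs)                          # [2. 3.]
print(np.linalg.norm(A @ coeffs - b))  # ~0: b is in span{v, w}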

3. Abstract Vector Spaces

Reading note: This section introduces the vector space axioms and their most common concrete instances ($\mathbb{R}^n$, function spaces, polynomial spaces). The focus here is on geometric intuition and computational examples. For the full axiomatic treatment - subspace criteria, span, linear independence, quotient spaces, and the abstract theory of bases - see the dedicated section.

-> Full axiomatic treatment: Vector Spaces and Subspaces 2-6

3.1 The Vector Space Axioms

Let $F$ be a field, usually $\mathbb{R}$ or $\mathbb{C}$. A vector space over $F$ is a set $V$ equipped with:

  • vector addition: $+ : V \times V \to V$
  • scalar multiplication: $\cdot : F \times V \to V$

such that, for all $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$ and $\alpha, \beta \in F$, the following laws hold.

Axiom                               | Statement
Additive closure                    | $\mathbf{u} + \mathbf{v} \in V$
Commutativity                       | $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$
Associativity                       | $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$
Zero vector                         | There exists $\mathbf{0} \in V$ with $\mathbf{v} + \mathbf{0} = \mathbf{v}$
Additive inverse                    | For each $\mathbf{v}$ there exists $-\mathbf{v}$ with $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$
Scalar closure                      | $\alpha \mathbf{v} \in V$
Distributivity over vector addition | $\alpha(\mathbf{u} + \mathbf{v}) = \alpha \mathbf{u} + \alpha \mathbf{v}$
Distributivity over scalar addition | $(\alpha + \beta)\mathbf{v} = \alpha \mathbf{v} + \beta \mathbf{v}$
Scalar associativity                | $\alpha(\beta \mathbf{v}) = (\alpha \beta)\mathbf{v}$
Unit scalar                         | $1\mathbf{v} = \mathbf{v}$

Some texts count eight axioms by folding closure into the operation definitions. Listing ten is often pedagogically clearer.

3.2 Consequences of the Axioms

The axioms are minimal, but they already imply a lot.

The zero vector is unique.
If $\mathbf{0}$ and $\mathbf{0}'$ both behave like additive identities, then

$$\mathbf{0} = \mathbf{0} + \mathbf{0}' = \mathbf{0}'.$$

Additive inverses are unique.
If $\mathbf{w}$ and $\mathbf{w}'$ both satisfy $\mathbf{v} + \mathbf{w} = \mathbf{0}$ and $\mathbf{v} + \mathbf{w}' = \mathbf{0}$, then

$$\mathbf{w} = \mathbf{w} + \mathbf{0} = \mathbf{w} + (\mathbf{v} + \mathbf{w}') = (\mathbf{w} + \mathbf{v}) + \mathbf{w}' = \mathbf{0} + \mathbf{w}' = \mathbf{w}'.$$

Zero scalar kills every vector.

$$0\mathbf{v} = (0 + 0)\mathbf{v} = 0\mathbf{v} + 0\mathbf{v}.$$

Add the additive inverse of $0\mathbf{v}$ to both sides to get $0\mathbf{v} = \mathbf{0}$.

Every scalar kills the zero vector.

$$\alpha \mathbf{0} = \alpha(\mathbf{0} + \mathbf{0}) = \alpha \mathbf{0} + \alpha \mathbf{0},$$

so again $\alpha \mathbf{0} = \mathbf{0}$.

Multiplying by $-1$ gives the additive inverse.

$$\mathbf{v} + (-1)\mathbf{v} = (1 + (-1))\mathbf{v} = 0\mathbf{v} = \mathbf{0},$$

so $(-1)\mathbf{v} = -\mathbf{v}$.

These are simple but useful. Many later proofs quietly depend on them.

3.3 Examples of Vector Spaces

The same axioms govern many different mathematical objects.

1. Euclidean space $\mathbb{R}^n$
This is the standard example. Addition and scalar multiplication are componentwise.

2. Polynomial space $P_n$
The set of polynomials of degree at most $n$:

$$P_n = \{a_0 + a_1 x + \cdots + a_n x^n : a_i \in \mathbb{R}\}.$$

Addition is polynomial addition, and scalar multiplication rescales coefficients.

3. Matrix space $\mathbb{R}^{m \times n}$
All $m \times n$ real matrices form a vector space with componentwise addition and scaling.

4. Continuous function space $C([a,b])$
All continuous real-valued functions on $[a,b]$ form a vector space under pointwise operations:

$$(f+g)(x) = f(x) + g(x), \qquad (\alpha f)(x) = \alpha f(x).$$

5. Sequence spaces
Infinite sequences can also form vector spaces. Important examples include $\ell^2$, the square-summable sequences, and $\ell^1$, the absolutely summable sequences.

6. $L^2([0,1])$
Square-integrable functions form an infinite-dimensional vector space that becomes a Hilbert space once we equip it with the usual inner product.

This last example matters in machine learning theory because kernels, Fourier methods, and approximation theory are often most naturally phrased in function spaces rather than coordinate spaces.

3.4 Non-Examples

Students usually learn vector spaces faster by seeing what fails.

Set                                                   | Why it is not a vector space
$\{x \in \mathbb{R}^n : x_i > 0 \text{ for all } i\}$ | Not closed under additive inverse
$\{x \in \mathbb{R}^n : \|x\| = 1\}$                  | Not closed under addition or scaling
Probability simplex $\Delta^{n-1}$                    | Not closed under arbitrary addition or scaling
$\mathbb{Z}^n$ over $\mathbb{R}$                      | Not closed under real scalar multiplication
A line not through the origin                         | Does not contain the zero vector

The probability simplex is especially important in AI. It lives inside a vector space, but it is not itself a vector space because probabilities must remain nonnegative and sum to one.

3.5 Subspaces

A subset $W \subseteq V$ is a subspace if it is itself a vector space under the same operations. The practical test is short:

  1. $\mathbf{0} \in W$
  2. if $\mathbf{u}, \mathbf{v} \in W$, then $\mathbf{u} + \mathbf{v} \in W$
  3. if $\mathbf{v} \in W$ and $\alpha \in F$, then $\alpha \mathbf{v} \in W$

Equivalently, a nonempty set is a subspace if it is closed under linear combinations.

Examples inside $\mathbb{R}^3$:

  • $\{\mathbf{0}\}$ is the trivial subspace
  • any line through the origin is a one-dimensional subspace
  • any plane through the origin is a two-dimensional subspace
  • $\mathbb{R}^3$ itself is a subspace

Non-examples:

  • a translated line or plane not passing through the origin
  • the unit sphere
  • the set of vectors with first coordinate equal to 1

Subspaces are how linear algebra talks about structure. The column space of a matrix, the null space of a map, the subspace spanned by a collection of attention heads, and the tangent space of a model family all follow the same logic: identify the directions that are allowed, closed, and linearly stable.

Subspace vs non-subspace in R^2

Subspace (through origin)      Not a subspace (shifted)

      /                             /
     /                             /
----+----> x                  -----/----> x
   /                             /
  /

contains 0                     misses 0

4. Basis and Dimension

4.1 Basis Definition

A basis for a vector space $V$ is a set of vectors

$$B = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n\}$$

that satisfies two conditions:

  1. the vectors are linearly independent
  2. the vectors span the whole space

These two requirements pull in opposite directions and must hold simultaneously. Independence prevents redundancy; spanning prevents missing directions.

If $B$ is a basis, then every vector $\mathbf{v} \in V$ has a unique expansion

$$\mathbf{v} = \alpha_1 \mathbf{b}_1 + \alpha_2 \mathbf{b}_2 + \cdots + \alpha_n \mathbf{b}_n.$$

The coefficients $\alpha_1, \ldots, \alpha_n$ are the coordinates of $\mathbf{v}$ relative to $B$.

Uniqueness matters. If the same vector had two different coordinate descriptions in the same basis, then subtracting those descriptions would produce a nontrivial linear combination of basis vectors equal to zero, contradicting independence.

In R^2:

one vector          two non-collinear vectors       three vectors
not enough          just enough = basis             spanning but redundant

   /                    ^ y                           ^ y
  /                     |  /                          |  /|
 /                      | /                           | / |
-----------------> x    +--------> x                 +--------> x
                       /                             /   /

4.2 Standard Basis of R^n

The standard basis of $\mathbb{R}^n$ is

$$\mathbf{e}_1 = (1,0,\ldots,0), \quad \mathbf{e}_2 = (0,1,\ldots,0), \quad \ldots, \quad \mathbf{e}_n = (0,0,\ldots,1).$$

In the standard basis, the coordinate vector is just the familiar list of components:

$$\mathbf{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = v_1 \mathbf{e}_1 + \cdots + v_n \mathbf{e}_n.$$

But the standard basis is not sacred. Many problems become simpler after a change of basis:

  • PCA chooses a basis aligned with data variance
  • Fourier analysis chooses sinusoidal basis functions
  • diagonalization chooses an eigenbasis when available
  • orthonormal bases simplify coordinates, projections, and energy calculations

The vector does not change when the basis changes. Only its coordinate description changes.
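
Finding coordinates relative to a non-standard basis is a linear solve: stack the basis vectors as columns and invert the relationship. A sketch with a made-up basis:

import numpy as np

# Basis vectors b1 = (1,1) and b2 = (1,-1) as the columns of B.
B = np.column_stack([[1.0, 1.0], [1.0, -1.0]])
v = np.array([3.0, 1.0])

alpha = np.linalg.solve(B, v)  # coordinates of v relative to this basis
print(alpha)                   # [2. 1.] because v = 2*b1 + 1*b2
print(B @ alpha)               # reconstructs [3. 1.]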

4.3 Dimension

For finite-dimensional vector spaces, every basis has the same number of elements. That common number is the dimension of the space:

$$\dim(V) = \text{number of vectors in any basis of } V.$$

Examples:

  • $\dim(\mathbb{R}^n) = n$
  • $\dim(\mathbb{R}^{m \times n}) = mn$
  • $\dim(P_n) = n + 1$ because $\{1, x, x^2, \ldots, x^n\}$ is a basis
  • $\dim(C([a,b])) = \infty$

Dimension is a count of independent directions, not a count of how many vectors happen to be mentioned in a description. You can describe a plane in $\mathbb{R}^3$ using ten spanning vectors if you want; the dimension is still 2.

In AI practice, dimension is often both a modeling choice and a computational budget. Embedding dimension, head dimension, hidden width, and bottleneck rank are all dimension decisions.

4.4 Dimension and Subspaces

If $W$ is a subspace of a finite-dimensional vector space $V$, then

$$\dim(W) \leq \dim(V).$$

Moreover:

  • if $\dim(W) = \dim(V)$, then $W = V$
  • if $\dim(W) = 0$, then $W = \{\mathbf{0}\}$
  • in $\mathbb{R}^2$, the only subspaces are $\{\mathbf{0}\}$, lines through the origin, and $\mathbb{R}^2$
  • in $\mathbb{R}^3$, the only subspaces are $\{\mathbf{0}\}$, lines through the origin, planes through the origin, and $\mathbb{R}^3$

The codimension of $W$ in $V$ is

$$\mathrm{codim}(W) = \dim(V) - \dim(W).$$

Codimension tells you how many independent directions are missing.

This is the right language for low-rank modeling. If a matrix maps a 4096-dimensional residual stream into a 64-dimensional attention head space, then its image has dimension at most 64, so most of the ambient directions are not expressible in that head.

4.5 Coordinates and Change of Basis

Let $B = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\}$ be a basis for $V$. The coordinate vector of $\mathbf{v}$ relative to $B$ is

$$[\mathbf{v}]_B = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix} \quad \text{where} \quad \mathbf{v} = \alpha_1 \mathbf{b}_1 + \cdots + \alpha_n \mathbf{b}_n.$$

If $B'$ is another basis, then coordinates transform by an invertible matrix:

$$[\mathbf{v}]_{B'} = P\,[\mathbf{v}]_B.$$

The change-of-basis matrix is invertible because coordinates in each basis are unique. If it were not invertible, some nonzero coordinate vector would collapse to zero, contradicting uniqueness.

Suppose a linear map $T$ is represented by matrix $A$ in one basis and by matrix $A'$ in another basis. Then the relationship is

$$A' = P^{-1} A P.$$

This is a similarity transformation. It does not change the underlying map; it changes the coordinates in which the map is described.

That idea appears all over ML:

  • whitening and PCA rotate data into more convenient coordinates
  • orthogonal parameterizations change basis while preserving norms
  • query, key, and value projections choose learned coordinate systems inside the embedding dimension

Same vector, different coordinates

standard basis:              rotated basis:

      y                           b2
      ^                          /
      |    v                    /
      |   /                    /   v
      |  /                    /
------+------> x        -----/----------> b1

The geometric arrow is the same.
Only the coordinate description changes.
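
A quick numerical check of the similarity relation with a made-up map and change of basis; eigenvalues are basis-independent, so they survive the transformation:

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])          # the map in one basis
P = np.array([[1.0, 1.0],
              [0.0, 1.0]])          # an invertible change of basis

A_prime = np.linalg.inv(P) @ A @ P  # the same map in the new basis
print(np.linalg.eigvals(A))         # [2. 3.]
print(np.linalg.eigvals(A_prime))   # [2. 3.]: same map, same spectrum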

4.6 Rank-Nullity Theorem

Let $T : V \to W$ be a linear map with finite-dimensional domain. Then

$$\dim(\ker T) + \dim(\mathrm{im}\,T) = \dim(V).$$

The two terms are named:

  • nullity: $\dim(\ker T)$
  • rank: $\dim(\mathrm{im}\,T)$

For a matrix $A \in \mathbb{R}^{m \times n}$, this becomes

$$\mathrm{rank}(A) + \mathrm{nullity}(A) = n.$$

Interpretation:

  • rank counts how many independent directions survive the map
  • nullity counts how many independent directions are lost completely

If $A$ has rank $r < n$, then every output lies in an $r$-dimensional subspace of $\mathbb{R}^m$, and an $(n-r)$-dimensional family of input directions gets sent to zero.
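
A sketch verifying the identity on a small made-up matrix, using scipy's null_space (in scipy.linalg) to produce a basis for the kernel:

import numpy as np
from scipy.linalg import null_space

# 3x4 matrix whose fourth column repeats the first.
A = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

rank = np.linalg.matrix_rank(A)   # directions that survive: 3
nullity = null_space(A).shape[1]  # directions sent to zero: 1
print(rank, nullity, rank + nullity == A.shape[1])  # 3 1 True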

This theorem explains a great deal of ML engineering:

  • low-rank adaptation works because useful updates often live in small image spaces (Hu et al., 2022)
  • bottleneck layers deliberately compress information into lower-dimensional subspaces
  • overparameterized models have large null spaces in parameter space, which helps explain why many parameter settings realize similar functions

5. Norms and Metric Spaces

5.1 Norms on Vector Spaces

A norm on a vector space $V$ is a function

$$\|\cdot\| : V \to \mathbb{R}$$

satisfying, for all $\mathbf{u}, \mathbf{v} \in V$ and all scalars $\alpha$:

  1. positive definiteness: $\|\mathbf{v}\| \geq 0$, with equality iff $\mathbf{v} = \mathbf{0}$
  2. homogeneity: $\|\alpha \mathbf{v}\| = |\alpha|\,\|\mathbf{v}\|$
  3. triangle inequality: $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$

A norm measures size. Once a norm is available, we can talk about boundedness, convergence, approximation error, and regularization.

5.2 The p-Norms

For $\mathbf{v} \in \mathbb{R}^n$ and $1 \leq p < \infty$, the $p$-norm is

$$\|\mathbf{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}.$$

The limiting case is

$$\|\mathbf{v}\|_\infty = \max_i |v_i|.$$

The most important examples are:

L1 norm

$$\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|.$$

This encourages sparsity in optimization because its unit ball has corners aligned with coordinate axes.

L2 norm

$$\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}.$$

This is the Euclidean length and the most common norm in deep learning.

L-infinity norm

$$\|\mathbf{v}\|_\infty = \max_i |v_i|.$$

This measures worst-case component size and is common in adversarial robustness.

The geometry of the unit ball depends on $p$:

  • $p = 1$: diamond or cross-polytope
  • $p = 2$: sphere
  • $p = \infty$: cube

Those shapes are not cosmetic. They explain different optimization behavior.

Unit balls in 2D

L1 ball              L2 ball              L_inf ball

   /\                  ____               +------+
  /  \                /    \              |      |
  \  /                \____/              |      |
   \/                                     +------+

corners              smooth               flat faces
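
All three vector norms are one call in numpy. A quick sketch:

import numpy as np

v = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(v, 1))       # L1: 7.0
print(np.linalg.norm(v, 2))       # L2: 5.0
print(np.linalg.norm(v, np.inf))  # L_inf: 4.0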

5.3 Norm Equivalence

In finite-dimensional spaces, all norms induce the same notion of convergence. More precisely, if $\|\cdot\|_a$ and $\|\cdot\|_b$ are norms on $\mathbb{R}^n$, then there exist constants $c_1, c_2 > 0$ such that

$$c_1 \|\mathbf{v}\|_a \leq \|\mathbf{v}\|_b \leq c_2 \|\mathbf{v}\|_a \qquad \text{for all } \mathbf{v} \in \mathbb{R}^n.$$

Useful special inequalities:

$$\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1, \qquad \|\mathbf{v}\|_1 \leq \sqrt{n}\,\|\mathbf{v}\|_2, \qquad \|\mathbf{v}\|_2 \leq \sqrt{n}\,\|\mathbf{v}\|_\infty, \qquad \|\mathbf{v}\|_1 \leq n\,\|\mathbf{v}\|_\infty.$$

This means that in finite dimensions, saying "a sequence converges" does not depend on whether you measure error with L1, L2, or L-infinity. In infinite dimensions, this is false. Different norms can induce genuinely different topologies.

5.4 Matrix Norms

Matrices also admit norms. Important ones include:

Frobenius norm

$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)}.$$

This is the Euclidean norm of the matrix viewed as one long vector.

Spectral norm

$$\|A\|_2 = \sigma_{\max}(A).$$

This is the maximum stretch factor of the linear map:

$$\|A\|_2 = \sup_{\|\mathbf{x}\|_2 = 1} \|A\mathbf{x}\|_2.$$

Nuclear norm

$$\|A\|_* = \sum_i \sigma_i(A).$$

This is the sum of singular values. It is a convex surrogate for rank and appears in matrix completion and low-rank learning (Candes and Recht, 2009).

Induced 1-norm and infinity-norm

$$\|A\|_1 = \max_j \sum_i |A_{ij}|, \qquad \|A\|_\infty = \max_i \sum_j |A_{ij}|.$$

In ML:

  • Frobenius norm corresponds to standard weight decay on matrix entries
  • spectral norm controls the Lipschitz constant of a linear layer
  • nuclear norm promotes low-rank structure
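
np.linalg.norm covers these matrix norms as well. A sketch on a diagonal matrix whose singular values (3 and 4) can be read off directly:

import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 4.0]])

print(np.linalg.norm(A, 'fro'))  # Frobenius: 5.0 = sqrt(9 + 16)
print(np.linalg.norm(A, 2))      # spectral: 4.0, the largest singular value
print(np.linalg.norm(A, 'nuc'))  # nuclear: 7.0, the sum of singular values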

5.5 Metric Spaces

A metric space is a set $X$ equipped with a distance function

$$d : X \times X \to \mathbb{R}$$

such that for all $x, y, z \in X$:

  1. $d(x,y) \geq 0$, with equality iff $x = y$
  2. $d(x,y) = d(y,x)$
  3. $d(x,z) \leq d(x,y) + d(y,z)$

Every norm induces a metric by

$$d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|.$$

Not every metric comes from a norm. Edit distance on strings is a metric, but there is no vector subtraction of strings that generates it as a norm.

AI uses many distances:

  • Euclidean distance for retrieval and clustering
  • cosine distance for directional similarity
  • Hamming distance for binary codes
  • edit distance for token sequences
  • Wasserstein distance for distributions
  • KL divergence as a useful non-metric divergence

The important point is conceptual: vector-space geometry gives one family of distances, but machine learning uses broader metric ideas too.

5.6 Convergence in Normed Spaces

A sequence $(\mathbf{v}_n)$ converges to $\mathbf{v}$ in norm if

$$\|\mathbf{v}_n - \mathbf{v}\| \to 0 \qquad \text{as } n \to \infty.$$

A sequence is Cauchy if its terms eventually become arbitrarily close to each other:

$$\forall \varepsilon > 0, \ \exists N \text{ such that } m, n \geq N \implies \|\mathbf{v}_n - \mathbf{v}_m\| < \varepsilon.$$

A normed space is complete if every Cauchy sequence converges to a limit that still belongs to the space. A complete normed space is called a Banach space.

Important examples:

  • $\mathbb{R}^n$ with any norm is complete
  • matrix spaces with standard norms are complete
  • $C([a,b])$ with the sup norm is complete
  • $L^p$ spaces are complete for $1 \leq p \leq \infty$

Completeness matters because many learning algorithms are iterative. Gradient methods, fixed-point solvers, and optimization routines generate sequences. Completeness is the condition that prevents the limit from "falling out of the space."


6. Inner Product Spaces

Reading note: This section develops inner products concretely in $\mathbb{R}^n$ and establishes the geometric tools used throughout this chapter (dot product, angle, Cauchy-Schwarz, orthogonality). The abstract inner product space theory - Gram-Schmidt, orthonormal bases, Hilbert spaces, and orthogonal complements - is developed in full later.

-> Full treatment: Vector Spaces and Subspaces 9

6.1 Inner Products

An inner product on a real vector space $V$ is a function

$$\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$$

satisfying, for all $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$ and scalars $\alpha, \beta$:

  1. symmetry: $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$
  2. linearity: $\langle \alpha \mathbf{u} + \beta \mathbf{v}, \mathbf{w} \rangle = \alpha \langle \mathbf{u}, \mathbf{w} \rangle + \beta \langle \mathbf{v}, \mathbf{w} \rangle$
  3. positive definiteness: $\langle \mathbf{v}, \mathbf{v} \rangle \geq 0$, with equality iff $\mathbf{v} = \mathbf{0}$

An inner product upgrades a vector space with geometry. Once it is available, we can measure lengths, define angles, talk about orthogonality, and project onto subspaces.

For complex vector spaces, the definition is slightly modified: $\langle \mathbf{u}, \mathbf{v} \rangle = \overline{\langle \mathbf{v}, \mathbf{u} \rangle}$ and one argument is conjugate-linear. The core geometric story remains the same.

6.2 Standard Inner Product on R^n

The standard inner product on $\mathbb{R}^n$ is the dot product:

$$\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v} = \sum_{i=1}^{n} u_i v_i.$$

This inner product induces the Euclidean norm:

$$\|\mathbf{v}\|_2 = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}.$$

So in Euclidean space, angle and length are not independent notions. They are both derived from the same primitive object.

6.3 Geometric Interpretation

The angle $\theta$ between nonzero vectors $\mathbf{u}$ and $\mathbf{v}$ is defined by

$$\cos \theta = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|}.$$

Interpretation:

  • $\theta = 0$: same direction, maximal alignment
  • $\theta = \pi/2$: orthogonal, zero alignment
  • $\theta = \pi$: opposite directions

Cosine similarity is exactly this normalized inner product. It is widely used for embeddings because it measures direction rather than raw magnitude.

For attention, the score $q^\top k / \sqrt{d_k}$ combines both angular alignment and magnitude. If the vectors are normalized, the score is proportional to cosine similarity. If they are not, norm effects matter too.

Angle and alignment

same direction          orthogonal            opposite direction

-----> ----->           ----->                <----- ----->
                        |
                        |
                        v

cos(theta) = 1          cos(theta) = 0        cos(theta) = -1
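
Cosine similarity is the dot product divided by the two lengths. A sketch reproducing the three cases above:

import numpy as np

def cosine_similarity(u, v):
    # Normalized inner product: measures direction, not magnitude.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

e1 = np.array([1.0, 0.0])
print(cosine_similarity(e1, np.array([2.0, 0.0])))   #  1.0, same direction
print(cosine_similarity(e1, np.array([0.0, 5.0])))   #  0.0, orthogonal
print(cosine_similarity(e1, np.array([-3.0, 0.0])))  # -1.0, opposite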

6.4 Cauchy-Schwarz Inequality

The fundamental inequality of inner-product geometry is

$$|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\|\,\|\mathbf{v}\|.$$

Equality holds exactly when $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent.

A classic proof uses positivity. For any real $t$,

$$0 \leq \|\mathbf{u} + t\mathbf{v}\|^2 = \langle \mathbf{u} + t\mathbf{v}, \mathbf{u} + t\mathbf{v} \rangle = \|\mathbf{u}\|^2 + 2t\langle \mathbf{u}, \mathbf{v} \rangle + t^2 \|\mathbf{v}\|^2.$$

This quadratic in $t$ must have nonpositive discriminant, so

$$4\langle \mathbf{u}, \mathbf{v} \rangle^2 - 4\|\mathbf{u}\|^2 \|\mathbf{v}\|^2 \leq 0,$$

which implies the result.

Cauchy-Schwarz guarantees that cosine similarity always lies in $[-1, 1]$. It also gives quick upper bounds on correlations, dot products, and approximation error.

6.5 Orthogonality

Vectors are orthogonal if their inner product is zero:

$$\mathbf{u} \perp \mathbf{v} \quad \Longleftrightarrow \quad \langle \mathbf{u}, \mathbf{v} \rangle = 0.$$

An orthogonal set is a collection of pairwise orthogonal nonzero vectors. Such a set is automatically linearly independent.

Proof sketch: if

$$\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0},$$

take the inner product with $\mathbf{v}_j$. All cross-terms vanish, leaving

$$\alpha_j \|\mathbf{v}_j\|^2 = 0,$$

so $\alpha_j = 0$ for every $j$.

An orthonormal set is orthogonal and normalized:

$$\langle \mathbf{u}_i, \mathbf{u}_j \rangle = \delta_{ij}.$$

In an orthonormal basis, coordinates are exceptionally simple:

$$\mathbf{v} = \sum_{i=1}^{n} \langle \mathbf{v}, \mathbf{u}_i \rangle \mathbf{u}_i.$$

You get coefficients by taking inner products. No matrix inversion is needed.

Orthogonality is central in ML because it reduces interference:

  • orthogonal initializations help preserve signal norms through depth (Saxe et al., 2014)
  • orthogonal features are easier to disentangle
  • orthogonal subspaces make decomposition interpretable

6.6 Gram-Schmidt Orthogonalization

Given linearly independent vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$, the Gram-Schmidt procedure constructs an orthonormal basis $\mathbf{u}_1, \ldots, \mathbf{u}_n$ spanning the same subspace.

First set

$$\mathbf{u}_1 = \frac{\mathbf{v}_1}{\|\mathbf{v}_1\|}.$$

Then recursively subtract the components already accounted for:

$$\widetilde{\mathbf{u}}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} \langle \mathbf{v}_k, \mathbf{u}_j \rangle \mathbf{u}_j,$$

and normalize:

$$\mathbf{u}_k = \frac{\widetilde{\mathbf{u}}_k}{\|\widetilde{\mathbf{u}}_k\|}.$$

Each step removes the part of vk\mathbf{v}_k already explained by the previous orthonormal vectors. What remains is orthogonal to the earlier directions.

Conceptually, Gram-Schmidt says:

  1. start with a spanning description
  2. strip away redundancy direction by direction
  3. normalize what survives

Algorithmically, Gram-Schmidt underlies QR decomposition. Numerically, modified Gram-Schmidt and Householder methods are often preferred for stability.

Gram-Schmidt idea for v2

u1 -------------------->

v2 ------------------------>

split v2 into:

proj_u1(v2) -------------->
remainder                  ^
                           |
                           |

Keep the remainder, then normalize it.
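
A compact sketch of the procedure, written in the modified form mentioned above (subtracting each projection from a running remainder, which is numerically preferable):

import numpy as np

def gram_schmidt(vectors):
    basis = []
    for v in vectors:
        w = np.array(v, dtype=float)
        for u in basis:
            w = w - (w @ u) * u          # remove the component along u
        basis.append(w / np.linalg.norm(w))
    return np.array(basis)

U = gram_schmidt([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])
print(np.round(U @ U.T, 10))  # 2x2 identity: rows are orthonormal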

6.7 Orthogonal Complement

If $W$ is a subspace of an inner-product space $V$, its orthogonal complement is

$$W^\perp = \{\mathbf{v} \in V : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}.$$

Important facts:

  • $W^\perp$ is always a subspace
  • in finite dimensions, $\dim(W) + \dim(W^\perp) = \dim(V)$
  • every vector decomposes uniquely as $\mathbf{v} = \mathbf{w} + \mathbf{w}^\perp$ with $\mathbf{w} \in W$ and $\mathbf{w}^\perp \in W^\perp$

This decomposition is one of the most useful ideas in linear algebra. It is how least squares, residuals, and projection error are understood.

For matrices, the row space is orthogonal to the null space, and the column space is orthogonal to the left null space. Those are special cases of the same principle.

Orthogonal decomposition

               v
              /|
             / |
            /  |  v_perp
-----------*---+-----------------> W
          proj_W(v)

v = proj_W(v) + v_perp

6.8 Hilbert Spaces

A Hilbert space is a complete inner-product space. It has both geometry and completeness.

Finite-dimensional inner-product spaces are automatically Hilbert spaces. The interesting cases are infinite-dimensional:

  • $\ell^2$: square-summable sequences
  • $L^2([a,b])$: square-integrable functions
  • reproducing kernel Hilbert spaces (RKHS), central to kernel methods

In a Hilbert space, orthonormal bases and projection theory still work, though bases may now be infinite. Parseval-type identities and Fourier expansions become available:

$$\|f\|^2 = \sum_{i=1}^{\infty} |\langle f, e_i \rangle|^2$$

for suitable orthonormal bases $\{e_i\}$.

Hilbert spaces matter for AI because many learning-theoretic objects are not finite vectors at all; they are functions. Kernel methods, Gaussian processes, and parts of neural network theory live more naturally in Hilbert spaces than in Rn\mathbb{R}^n.


7. Orthogonal Projections

7.1 Projection onto a Subspace

Given a subspace $W$ of an inner-product space and a vector $\mathbf{v}$, the orthogonal projection of $\mathbf{v}$ onto $W$ is the vector in $W$ closest to $\mathbf{v}$:

$$\mathrm{Proj}_W(\mathbf{v}) = \arg\min_{\mathbf{w} \in W} \|\mathbf{v} - \mathbf{w}\|.$$

The key characterization is:

  • $\mathrm{Proj}_W(\mathbf{v}) \in W$
  • $\mathbf{v} - \mathrm{Proj}_W(\mathbf{v}) \in W^\perp$

For a one-dimensional subspace spanned by a nonzero vector $\mathbf{u}$,

$$\mathrm{Proj}_{\mathbf{u}}(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle}\,\mathbf{u}.$$

If $\mathbf{u}$ is already unit length, this simplifies to

$$\mathrm{Proj}_{\mathbf{u}}(\mathbf{v}) = \langle \mathbf{v}, \mathbf{u} \rangle\,\mathbf{u}.$$

Projection is the formal version of "keep the part of the vector that points in the subspace, discard the perpendicular remainder."

Projection onto a line or subspace

          v
         /|
        / |
       /  |  residual = v - Proj_W(v)
------*---+---------------------- W
     Proj_W(v)

7.2 Projection Matrix Properties

If $P$ is the matrix of an orthogonal projection, then:

  • $P^2 = P$ (idempotence)
  • $P^\top = P$ (symmetry)

The first identity says projecting twice is the same as projecting once. The second says the projection is orthogonal rather than oblique.

For projection onto the line spanned by a nonzero vector $\mathbf{u}$,

$$P = \frac{\mathbf{u}\mathbf{u}^\top}{\mathbf{u}^\top \mathbf{u}}.$$

Its eigenvalues are 0 and 1 only:

  • eigenvalue 1 corresponds to directions already in the target subspace
  • eigenvalue 0 corresponds to directions annihilated by projection

The complementary projector is $I - P$, which projects onto the orthogonal complement.

7.3 Projection onto Column Space

Let $A \in \mathbb{R}^{m \times n}$ have full column rank. The orthogonal projector onto the column space of $A$ is

$$P_A = A(A^\top A)^{-1}A^\top.$$

Why this formula works:

  1. every projected vector has the form $A\mathbf{x}$, so it lies in $\mathrm{col}(A)$
  2. the residual must be orthogonal to every column of $A$
  3. orthogonality of the residual gives the normal equations

If the columns of $A$ are orthonormal, then $A^\top A = I$ and the formula simplifies to

$$P_A = AA^\top.$$

This expression appears in least squares, regression, PCA, and low-dimensional approximation.

In Transformer language, learned projections $W_Q$, $W_K$, and $W_V$ are not orthogonal projectors in the strict algebraic sense, but they do map activations into lower-dimensional subspaces where later computations occur. Orthogonal projection is therefore the clean mathematical ideal behind many approximate representation moves.

Projecting b onto col(A)

          b
         /|
        / |
       /  | residual
      /   |
-----*----+----------> col(A)
   A x_hat

Best-fit output lies inside col(A).
The leftover error is orthogonal to col(A).
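
A sketch constructing the projector for a small made-up matrix and checking both the idempotence and the orthogonal-residual properties:

import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])            # full column rank
P = A @ np.linalg.inv(A.T @ A) @ A.T  # projector onto col(A)

b = np.array([1.0, 2.0, 0.0])
p = P @ b                             # closest point to b in col(A)
print(np.allclose(P @ P, P))          # True: P^2 = P
print(np.round(A.T @ (b - p), 10))    # [0. 0.]: residual orthogonal to col(A)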

7.4 Gram Matrix

Given vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ in $\mathbb{R}^d$, the Gram matrix is

$$G_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle.$$

If $V$ is the matrix whose rows are the vectors, then

$$G = VV^\top.$$

Gram matrices have two key properties:

  1. they are symmetric
  2. they are positive semidefinite

Indeed, for any coefficient vector $\mathbf{x}$,

$$\mathbf{x}^\top G \mathbf{x} = \mathbf{x}^\top VV^\top \mathbf{x} = \|V^\top \mathbf{x}\|^2 \geq 0.$$

The Gram matrix is positive definite exactly when the vectors are linearly independent.

In AI:

  • attention score matrices are built from dot products, so they are Gram-like objects before softmax scaling and masking
  • kernel matrices are Gram matrices in feature space
  • covariance and similarity matrices are Gram constructions with centering or normalization added
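
A sketch building a small Gram matrix and confirming positive semidefiniteness through its eigenvalues:

import numpy as np

V = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])   # three vectors as rows

G = V @ V.T                  # pairwise dot products
print(G)
print(np.linalg.eigvalsh(G) >= -1e-10)  # all True: G is positive semidefinite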

7.5 Best Approximation and Least Squares

Orthogonal projection solves the best approximation problem:

$$\min_{\mathbf{w} \in W} \|\mathbf{v} - \mathbf{w}\|_2.$$

If $\mathbf{p} = \mathrm{Proj}_W(\mathbf{v})$, then the residual $\mathbf{r} = \mathbf{v} - \mathbf{p}$ is orthogonal to $W$, and Pythagoras gives

$$\|\mathbf{v}\|^2 = \|\mathbf{p}\|^2 + \|\mathbf{r}\|^2.$$

Least squares is exactly this geometry in matrix form. Given an overdetermined system $A\mathbf{x} \approx \mathbf{b}$, the least-squares solution minimizes

$$\|A\mathbf{x} - \mathbf{b}\|_2^2.$$

The residual must be orthogonal to the column space of $A$:

$$A^\top (A\mathbf{x} - \mathbf{b}) = \mathbf{0},$$

so

$$A^\top A\,\mathbf{x} = A^\top \mathbf{b}.$$

When $A$ has full column rank,

$$\mathbf{x}^* = (A^\top A)^{-1} A^\top \mathbf{b}.$$

Geometrically, $A\mathbf{x}^*$ is the projection of $\mathbf{b}$ onto the column space of $A$.

This is why projection is not just a geometric curiosity. It is the linear algebra under classical regression, representation compression, denoising, and many approximation arguments used throughout machine learning.
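
A sketch fitting a line to made-up data with np.linalg.lstsq and confirming the residual satisfies the normal equations:

import numpy as np

# Fit y = c0 + c1*x to four noisy points.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
A = np.column_stack([np.ones_like(x), x])  # columns: intercept, slope

coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)                          # approximately [0.06, 0.96]
residual = y - A @ coef
print(np.round(A.T @ residual, 10))  # [0. 0.]: A^T r = 0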


8. Affine Subspaces and Convexity

8.1 Affine Subspaces

A linear subspace must pass through the origin. Many important geometric sets do not. An affine subspace is a translated subspace:

$$A = \mathbf{v}_0 + W = \{\mathbf{v}_0 + \mathbf{w} : \mathbf{w} \in W\}.$$

Here $\mathbf{v}_0$ is a base point and $W$ is a linear subspace called the direction space.

Examples:

  • a line not through the origin in $\mathbb{R}^2$
  • a plane $ax + by + cz = d$ in $\mathbb{R}^3$ with $d \neq 0$
  • the probability simplex $\Delta^{n-1}$ before nonnegativity is imposed, because $\sum_i p_i = 1$ is an affine constraint

Affine spaces are not vector spaces unless they happen to pass through the origin. You can subtract two points in an affine space to get a direction vector, but you cannot usually add two points and stay in the set.

Linear subspace vs affine subspace

subspace W                  affine set v0 + W

      /                           /
     /                           /
----+----> x                ----/----> x
   /                           /
  /

through 0                    shifted away from 0

8.2 Convex Sets

A set $C$ is convex if, whenever $\mathbf{u}, \mathbf{v} \in C$ and $\lambda \in [0,1]$,

$$\lambda \mathbf{u} + (1-\lambda)\mathbf{v} \in C.$$

This means the entire line segment joining any two points of the set stays inside the set.

Important convex sets:

  • every subspace
  • every affine subspace
  • every norm ball $\{\mathbf{v} : \|\mathbf{v}\| \leq r\}$ for a true norm
  • every half-space $\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} \leq b\}$
  • the probability simplex

The convex hull of a set $S$, written $\mathrm{conv}(S)$, is the set of all convex combinations of points in $S$. It is the smallest convex set containing $S$.

This matters directly for attention. The output of a single attention head at one position is

$$\sum_j \alpha_j \mathbf{v}_j \quad \text{with } \alpha_j \geq 0, \; \sum_j \alpha_j = 1,$$

so it lies in the convex hull of the value vectors at that position.

Convex vs non-convex

Convex: segment stays inside

*-----------*

Non-convex: segment leaves the set

*---- gap ----*
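
Softmax weights are exactly convex-combination coefficients. A sketch showing that an attention-style output stays inside the convex hull (here, a triangle) of the value vectors:

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

V = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])              # value vectors: triangle corners

alpha = softmax(np.array([2.0, 1.0, 0.5]))
print(alpha, alpha.sum())               # nonnegative weights summing to 1
print(alpha @ V)                        # a point inside the triangle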

8.3 Hyperplanes and Half-Spaces

A hyperplane in $\mathbb{R}^n$ is a set of the form

$$\{\mathbf{x} \in \mathbb{R}^n : \mathbf{w}^\top \mathbf{x} = b\}$$

for some nonzero normal vector $\mathbf{w}$.

Geometrically, a hyperplane is an $(n-1)$-dimensional affine subspace. It cuts space into two half-spaces:

$$\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} \leq b\}, \qquad \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} \geq b\}.$$

Hyperplanes are the basic decision surfaces of linear models. A neuron computes

$$\mathbf{w}^\top \mathbf{x} + b.$$

Thresholding that expression defines a half-space. A ReLU network can therefore be viewed as composing piecewise-linear maps defined by many hyperplane arrangements.

Convex analysis adds a deeper theorem: disjoint convex sets can often be separated by hyperplanes. This is the geometric basis for max-margin classification, linear probes, and many duality arguments.

Hyperplane and half-spaces

half-space        hyperplane         half-space

xxxxx             |                  .....
xxxxx             |  w^T x = b       .....
xxxxx             |                  .....
                  ^
                  normal direction w

8.4 Convex Functions and Sublevel Sets

A function $f : V \to \mathbb{R}$ is convex if for all $\mathbf{u}, \mathbf{v}$ and $\lambda \in [0,1]$,

$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \leq \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v}).$$

Convex functions are important because their sublevel sets

$$\{\mathbf{v} : f(\mathbf{v}) \leq c\}$$

are convex. This makes optimization far more tractable.

Examples:

  • $f(\mathbf{x}) = \|\mathbf{x}\|_2^2$ is convex
  • $f(\mathbf{x}) = \|\mathbf{x}\|_1$ is convex
  • logistic loss and cross-entropy are convex in suitable linear-model settings
  • deep network training objectives are generally not globally convex

The contrast matters. Classical optimization often uses convex structure globally. Deep learning rarely gets global convexity, but convex sets and convex penalties still appear everywhere locally and architecturally.

