Vectors and Spaces, Part 2: High-Dimensional Geometry to References

9. High-Dimensional Geometry

9.1 The Curse of Dimensionality

Low-dimensional intuition is unreliable in high dimensions. This broad phenomenon is often called the curse of dimensionality.

One form is volume concentration. For a $d$-dimensional ball of radius $r$, the fraction of volume inside the inner ball of radius $r-\varepsilon$ is

$$\left(1 - \frac{\varepsilon}{r}\right)^d.$$

As $d$ grows, this shrinks rapidly. Most of the volume moves near the boundary.
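
To make the formula concrete, here is a minimal NumPy sketch (the 10% shell width is an arbitrary choice) evaluating the inner-ball volume fraction at several dimensions.

```python
# Fraction of a d-dimensional ball's volume lying inside radius r - eps,
# here with eps/r = 0.1 (a 10% shell).
for d in [2, 10, 100, 1000]:
    inner_fraction = (1 - 0.1) ** d
    print(f"d={d:5d}: inner fraction = {inner_fraction:.2e}")

# d=    2: inner fraction = 8.10e-01
# d=   10: inner fraction = 3.49e-01
# d=  100: inner fraction = 2.66e-05
# d= 1000: inner fraction = 1.75e-46
```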

Another form is distance concentration. Random points in high-dimensional spaces tend to be at similar distances from each other, so nearest-neighbor structure becomes less discriminative unless the data has meaningful low-dimensional structure.

For embeddings, this means random vectors in high dimension are usually nearly orthogonal. Meaningful similarity is therefore a signal against a strong null background of near-right angles.

Volume concentration intuition

low dimension:            high dimension:

[ core | shell ]          [tiny core | shell shell shell shell]

As dimension grows, most volume moves into the shell.

9.2 Concentration of Measure

Concentration of measure says that in high-dimensional spaces, many reasonably smooth functions vary much less than low-dimensional intuition suggests. Informally: a Lipschitz function on a high-dimensional sphere is almost constant on most of the sphere.

One representative result is that if $\mathbf{x}$ is random on the unit sphere $S^{d-1}$ and $\mathbf{u}$ is a fixed unit vector, then the inner product $\langle \mathbf{x}, \mathbf{u} \rangle$ is sharply concentrated near 0 as $d$ grows.

Consequences:

  • random directions are almost orthogonal
  • norms and averages become more stable than naive geometric pictures suggest
  • random projections preserve more structure than one might expect

These ideas are central to modern high-dimensional probability (Vershynin, 2018) and are one reason random features and approximate nearest neighbor methods work surprisingly well.

9.3 Johnson-Lindenstrauss Lemma

The Johnson-Lindenstrauss lemma formalizes dimension reduction with bounded distortion (Johnson and Lindenstrauss, 1984; Dasgupta and Gupta, 2003).

Given $n$ points in a high-dimensional Euclidean space and an accuracy parameter $\varepsilon \in (0,1)$, there exists a linear map

$$f: \mathbb{R}^d \to \mathbb{R}^k$$

with

$$k = O\!\left(\frac{\log n}{\varepsilon^2}\right)$$

such that all pairwise squared distances are preserved up to multiplicative error $(1 \pm \varepsilon)$:

$$(1-\varepsilon)\|\mathbf{v}_i - \mathbf{v}_j\|_2^2 \leq \|f(\mathbf{v}_i) - f(\mathbf{v}_j)\|_2^2 \leq (1+\varepsilon)\|\mathbf{v}_i - \mathbf{v}_j\|_2^2.$$

Remarkably, the target dimension depends logarithmically on the number of points, not on the original ambient dimension.

Practical message for ML:

  • random projections can compress data while preserving geometry
  • retrieval systems can work in reduced spaces
  • head dimension can be much smaller than model dimension without destroying all pairwise structure
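
As a sanity check on the lemma, the following sketch (assumptions: a plain Gaussian random projection and an arbitrarily chosen target dimension k = 400; NumPy only) projects 500 high-dimensional points and measures how far pairwise squared distances drift.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10_000, 400

X = rng.normal(size=(n, d))               # n points in R^d
P = rng.normal(size=(d, k)) / np.sqrt(k)  # scaled Gaussian projection
Y = X @ P                                 # projected points in R^k

# Compare all pairwise squared distances before and after projection.
def pdist2(Z):
    sq = np.sum(Z**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

orig, proj = pdist2(X), pdist2(Y)
mask = ~np.eye(n, dtype=bool)
ratios = proj[mask] / orig[mask]
print(f"distortion ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")
# The ratios concentrate around 1; increasing k tightens the band.
```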

9.4 Random Vectors in High Dimensions

If $\mathbf{x}$ and $\mathbf{y}$ are independent random unit vectors in $\mathbb{R}^d$, then for large $d$,

$$\mathbb{E}[\langle \mathbf{x}, \mathbf{y} \rangle] = 0, \qquad \mathrm{Var}(\langle \mathbf{x}, \mathbf{y} \rangle) \approx \frac{1}{d}.$$

So the inner product typically has size about $1/\sqrt{d}$, which goes to zero with dimension.

A related fact is

$$\|\mathbf{x} - \mathbf{y}\|_2^2 = \|\mathbf{x}\|_2^2 + \|\mathbf{y}\|_2^2 - 2\langle \mathbf{x}, \mathbf{y} \rangle \approx 2.$$

Thus random unit vectors are not only almost orthogonal; they are also all about the same Euclidean distance apart.

This helps explain why high-dimensional spaces can pack many nearly independent directions. It also explains why cosine similarity is a more informative baseline than raw Euclidean distance in many embedding applications.
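
A small simulation (a NumPy sketch with arbitrary sample sizes) confirms both facts: inner products shrink like $1/\sqrt{d}$, and pairwise distances cluster near $\sqrt{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [10, 100, 1000, 10_000]:
    # Sample pairs of independent random unit vectors in R^d.
    x = rng.normal(size=(2000, d))
    y = rng.normal(size=(2000, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y /= np.linalg.norm(y, axis=1, keepdims=True)

    dots = np.sum(x * y, axis=1)
    dists = np.linalg.norm(x - y, axis=1)
    print(f"d={d:6d}: mean |<x,y>| = {np.abs(dots).mean():.4f} "
          f"(1/sqrt(d) = {1/np.sqrt(d):.4f}), mean dist = {dists.mean():.4f}")
# As d grows, |<x,y>| tracks 1/sqrt(d) and distances approach sqrt(2) = 1.414...
```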

9.5 Geometry of Embedding Spaces

Learned embeddings are not random vectors, but high-dimensional geometry strongly shapes how they behave.

Important empirical observations:

  • simple semantic relations can appear as approximately linear directions (Mikolov et al., 2013)
  • contextual embedding spaces are often anisotropic rather than uniformly spread on a sphere (Ethayarajh, 2019)
  • nominal dimension may be large while intrinsic dimension is much smaller
  • singular values of embedding or feature matrices often decay quickly, indicating effective low-rank structure

Anisotropy matters because cosine similarity becomes biased if almost all vectors occupy a narrow cone. Various centering, whitening, and contrastive training methods attempt to improve isotropy for retrieval and representation quality.

The lesson is subtle: high-dimensional space provides enormous capacity, but learned models do not use that capacity uniformly. They carve out structured manifolds, cones, and low-rank directions inside the ambient space.

Embedding geometry: isotropic vs anisotropic

roughly isotropic              anisotropic / narrow cone

     .   .                           . .
   .   o   .                          . .
     .   .                            . .
   .       .                           . .

spread in many directions         packed into a small angular region

9.6 Softmax Temperature and Simplex Geometry

Softmax maps a logit vector $\mathbf{z} \in \mathbb{R}^n$ to a probability vector:

$$p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)},$$

where $\tau > 0$ is the temperature.

Geometrically:

  • as $\tau \to 0$, the distribution concentrates near a vertex of the simplex
  • as $\tau \to \infty$, the distribution approaches the center of the simplex

So temperature controls how close the output lies to an extreme point versus the interior of the simplex.

This is not just a sampling trick. It is a geometric control knob for how sharply a model commits to one direction in probability space.

Temperature moves probability inside the simplex

high tau                         low tau

   more uniform                   more peaked

   [0.33 0.33 0.34]  ------->     [0.97 0.02 0.01]
        near center                    near a vertex
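
A minimal sketch of temperature-scaled softmax (NumPy, with the standard max-subtraction for numerical stability; the logits are arbitrary) reproduces both regimes.

```python
import numpy as np

def softmax(z, tau=1.0):
    # Temperature-scaled softmax with max-subtraction for stability.
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.0, -0.5])
for tau in [10.0, 1.0, 0.1]:
    print(f"tau={tau:5.1f}: {np.round(softmax(logits, tau), 3)}")
# High tau -> near the simplex center; low tau -> near a one-hot vertex.
```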

10. Linear Maps Between Spaces

10.1 Definition and Properties

A map $T: V \to W$ is linear if for all vectors $\mathbf{u}, \mathbf{v}$ and scalars $\alpha, \beta$,

$$T(\alpha \mathbf{u} + \beta \mathbf{v}) = \alpha T(\mathbf{u}) + \beta T(\mathbf{v}).$$

Equivalently, it preserves addition and scalar multiplication separately.

Immediate consequences:

  • $T(\mathbf{0}) = \mathbf{0}$
  • $T(-\mathbf{v}) = -T(\mathbf{v})$
  • $T$ is completely determined by its action on a basis

For finite-dimensional spaces, every linear map $\mathbb{R}^n \to \mathbb{R}^m$ is represented by a unique matrix $A \in \mathbb{R}^{m \times n}$ such that

$$T(\mathbf{x}) = A\mathbf{x}.$$

Composition of linear maps corresponds to matrix multiplication.

10.2 Kernel and Image

The kernel of a linear map is

$$\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}\}.$$

The image is

$$\mathrm{im}(T) = \{T(\mathbf{v}) : \mathbf{v} \in V\}.$$

Both are subspaces.

Interpretation:

  • the kernel contains directions completely ignored by the map
  • the image contains all outputs the map can possibly produce

Injectivity and surjectivity are controlled by these spaces:

  • $T$ is injective iff $\ker(T) = \{\mathbf{0}\}$
  • $T$ is surjective iff $\mathrm{im}(T) = W$

This is a simple but powerful way to reason about architecture. If a projection matrix has a large kernel, then many input directions are guaranteed to be invisible downstream.
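
As a concrete check, this sketch (NumPy plus SciPy's null_space; the matrix is an arbitrary rank-deficient example) computes a kernel basis and verifies that kernel directions are invisible downstream.

```python
import numpy as np
from scipy.linalg import null_space

# A rank-2 map from R^3 to R^3: the third row is the sum of the first two.
A = np.array([[1., 2., 3.],
              [0., 1., 1.],
              [1., 3., 4.]])

N = null_space(A)                            # orthonormal basis for ker(A)
print("rank =", np.linalg.matrix_rank(A))    # 2
print("null space basis:\n", N.round(3))

# Any input moved along a kernel direction produces the same output.
x = np.array([1.0, -1.0, 2.0])
print(np.allclose(A @ x, A @ (x + 5 * N[:, 0])))  # True
```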

10.3 The Four Fundamental Subspaces

For a matrix $A \in \mathbb{R}^{m \times n}$, the four fundamental subspaces are:

  1. column space $\mathrm{col}(A) \subseteq \mathbb{R}^m$
  2. null space $\mathrm{null}(A) \subseteq \mathbb{R}^n$
  3. row space $\mathrm{row}(A) = \mathrm{col}(A^\top) \subseteq \mathbb{R}^n$
  4. left null space $\mathrm{null}(A^\top) \subseteq \mathbb{R}^m$

They satisfy the orthogonal decompositions

$$\mathbb{R}^n = \mathrm{row}(A) \oplus \mathrm{null}(A), \qquad \mathbb{R}^m = \mathrm{col}(A) \oplus \mathrm{null}(A^\top).$$

Dimensional relations:

  • $\dim(\mathrm{col}(A)) = \dim(\mathrm{row}(A)) = \mathrm{rank}(A)$
  • $\dim(\mathrm{null}(A)) = n - \mathrm{rank}(A)$
  • $\dim(\mathrm{null}(A^\top)) = m - \mathrm{rank}(A)$

These decompositions provide a precise vocabulary for information flow:

  • the row space describes which input combinations matter
  • the null space describes which input directions are lost
  • the column space describes what outputs are expressible
  • the left null space describes output directions orthogonal to everything the map can produce

Four fundamental subspaces

Domain R^n                          Codomain R^m

[ row(A) | null(A) ] --A--> [ col(A) | 0 ]
                               ^
                               |
                     left-null directions live here,
                     orthogonal to every reachable output
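
The dimension bookkeeping is easy to verify numerically; this sketch (NumPy, a random matrix of chosen rank) checks both orthogonal decompositions by counting dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 6, 4, 2

# Build a random m x n matrix of rank r from r outer products.
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))

rank = np.linalg.matrix_rank(A)
print(f"rank(A) = {rank}")              # 2
print(f"dim null(A)   = {n - rank}")    # n - rank = 2
print(f"dim null(A^T) = {m - rank}")    # m - rank = 4
# R^4 = row(A) (+) null(A): 2 + 2 = 4;  R^6 = col(A) (+) null(A^T): 2 + 4 = 6.
```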

10.4 Isomorphisms

A linear map $T: V \to W$ is an isomorphism if it is bijective. Then $V$ and $W$ are isomorphic vector spaces.

For finite-dimensional real spaces, the classification is simple:

$$V \cong W \quad \Longleftrightarrow \quad \dim(V) = \dim(W).$$

So all $n$-dimensional real vector spaces are structurally equivalent to $\mathbb{R}^n$, even if their elements look different.

Important caution: isomorphic does not mean equal. The space of quadratic polynomials and $\mathbb{R}^3$ are not the same set, but they are isomorphic because both have dimension 3.

This is why linear algebra can move freely between concrete coordinates and abstract objects. Once a basis is chosen, every finite-dimensional space looks like ordinary Euclidean coordinate space.

10.5 Dual Spaces

The dual space of $V$ is the space of linear functionals on $V$:

$$V^* = \{f: V \to \mathbb{R} \mid f \text{ is linear}\}.$$

Elements of the dual space eat vectors and return scalars.

If $V$ is finite-dimensional, then

$$\dim(V^*) = \dim(V).$$

With an inner product, every linear functional can be represented as an inner product against a fixed vector:

$$f(\mathbf{x}) = \langle \mathbf{x}, \mathbf{v} \rangle.$$

This is the finite-dimensional form of the Riesz representation idea.

In coordinates, row vectors behave like dual vectors:

$$\mathbf{a}^\top \mathbf{x}$$

is a linear functional of $\mathbf{x}$.

This perspective is useful in attention. A query can be viewed as defining a linear functional on key space: it "asks" how much each key aligns with a certain direction. The score $q^\top k$ is then the action of a dual vector on a primal vector.


11. Special Vector Spaces in AI

11.1 Embedding Space Geometry

Let $E \in \mathbb{R}^{|V| \times d}$ be a token embedding matrix. Each row $E_t$ is the embedding of token $t$.

This space is where discrete symbols become continuous geometry. The hope is that semantically or syntactically related tokens occupy nearby or meaningfully aligned regions. Word2vec made this idea famous by showing that certain analogical relations behave approximately linearly (Mikolov et al., 2013).

Several geometric questions matter in practice:

  • local similarity: are related tokens close under cosine or Euclidean distance?
  • directional structure: do directions correspond to interpretable relations?
  • isotropy: are embeddings spread broadly or collapsed into a narrow cone?
  • intrinsic dimension: how many effective degrees of freedom are really used?

Modern contextual embeddings complicate the picture because a token does not have a single representation; it acquires one through context-dependent computation. Even so, the geometric language remains the same. Probing methods ask whether a property is linearly readable from a representation, which is a question about whether a suitable separating hyperplane exists in the embedding space.

11.2 Representation Space Through Layers

In a Transformer, each layer maps a sequence representation

$$X^{(\ell)} \in \mathbb{R}^{n \times d}$$

to a new one

$$X^{(\ell+1)} = X^{(\ell)} + f^{(\ell)}(X^{(\ell)}).$$

The residual connection means layers do not overwrite the representation from scratch. They add a vector-valued update inside a shared ambient space.

This suggests a productive geometric picture:

  • the residual stream is a shared communication space
  • attention heads read from and write to particular directions or subspaces
  • MLP blocks add structured nonlinear updates in that same space
  • representation learning is partly the art of organizing information so it can be linearly extracted or manipulated later

Mechanistic interpretability often phrases model behavior in exactly these terms: directions, features, subspaces, superposition, and read/write operations into a shared residual stream.

Residual stream picture

X^(l) --------> [ layer computes update f^(l)(X^(l)) ] ----+
  |                                                        |
  +-------------------------- add --------------------------+
                                                           |
                                                           v
                                                        X^(l+1)

11.3 Attention Subspaces

Attention projections are linear maps

$$W_Q, W_K \in \mathbb{R}^{d \times d_k}, \qquad W_V \in \mathbb{R}^{d \times d_v}.$$

For token representations $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$,

$$q = W_Q^\top \mathbf{x}, \qquad k = W_K^\top \mathbf{y}, \qquad v = W_V^\top \mathbf{y}.$$

The raw attention score is

$$q^\top k = \mathbf{x}^\top W_Q W_K^\top \mathbf{y}.$$

So each head defines a bilinear form on the original embedding space. Because $W_Q W_K^\top$ has rank at most $d_k$, each head uses a low-rank interaction relative to the full $d \times d$ ambient space.

This is one of the cleanest examples of linear-algebraic structure in a large model:

  • attention does not compare full vectors directly
  • it compares them after moving them into learned low-dimensional subspaces
  • different heads can be understood as different learned geometric lenses

Attention subspace flow

x ----W_Q^T----> q
                  \
                   \   score = q^T k
                    \
y ----W_K^T----> k
 |
 +---W_V^T----> v

output = weighted sum of v vectors
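
This rank claim is easy to verify; the sketch below (NumPy, arbitrary random projections with d = 512 and d_k = 64) checks that the score matrix $W_Q W_K^\top$ has rank $d_k$ and that the factored and unfactored scores agree.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 512, 64

W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))

M = W_Q @ W_K.T            # the bilinear form x^T M y, shape (512, 512)
print(M.shape, "rank =", np.linalg.matrix_rank(M))   # (512, 512) rank = 64

# The score between two token vectors is the same computed either way.
x, y = rng.normal(size=d), rng.normal(size=d)
print(np.allclose((W_Q.T @ x) @ (W_K.T @ y), x @ M @ y))  # True
```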

11.4 The Probability Simplex

The probability simplex in $\mathbb{R}^n$ is

$$\Delta^{n-1} = \left\{ \mathbf{p} \in \mathbb{R}^n : p_i \geq 0,\; \sum_{i=1}^{n} p_i = 1 \right\}.$$

It is not a vector space, but it is a highly structured convex set:

  • its vertices are the one-hot vectors
  • its center is the uniform distribution
  • it lies in an affine hyperplane defined by $\sum_i p_i = 1$

Softmax maps logits from $\mathbb{R}^n$ into the interior of this simplex. Sampling, calibration, entropy, KL divergence, and decoding strategies all live on this geometric object.

A deeper geometric view uses the Fisher information metric, which turns the simplex into a curved statistical manifold. That perspective becomes important in information geometry and natural-gradient methods.

Probability simplex in 3 classes

          e2
         /\
        /  \
       / p  \
      /      \
     /________\
   e1          e3

vertices = one-hot distributions
center = uniform distribution

11.5 Parameter Space

If a model has $p$ trainable parameters, then its parameter space is

$$\Theta = \mathbb{R}^p.$$

For contemporary language models, $p$ can be enormous. A 7B model has roughly $7 \times 10^9$ coordinates, and frontier open models reach hundreds of billions (Dubey et al., 2024).

Key objects in parameter space:

  • parameter vector $\theta$
  • loss function $L(\theta)$
  • gradient $\nabla L(\theta)$
  • Hessian $\nabla^2 L(\theta)$

This space is flat as a Euclidean manifold, but the loss defined on it can be highly nonconvex. Optimization therefore studies geometry not of the ambient space alone, but of the scalar field drawn on top of it.

Mode connectivity, low-loss valleys, and flat-versus-sharp minima are all geometric statements about subsets of parameter space.

11.6 Function Spaces and the Neural Tangent Kernel

A neural network with parameters $\theta$ defines a function

$$f_\theta : \mathcal{X} \to \mathcal{Y}.$$

As $\theta$ varies, the model traces a family of functions inside a function space. Locally, infinitesimal parameter changes move the model through the tangent directions

$$\frac{\partial f_\theta}{\partial \theta_j}.$$

The neural tangent kernel (NTK) is

$$K(x, x') = \sum_{j=1}^{p} \frac{\partial f_\theta(x)}{\partial \theta_j} \frac{\partial f_\theta(x')}{\partial \theta_j} = \left\langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(x') \right\rangle.$$

So the NTK is an inner product in parameter-gradient space. Jacot et al. (2018) showed that in the infinite-width limit, gradient descent can be analyzed using this kernel viewpoint, making neural network training resemble kernel regression.

This is a good example of why abstract vector spaces matter. The relevant geometry is no longer primarily in input space or parameter space, but in a function space whose coordinates are themselves derivatives.
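
For intuition, the NTK of a tiny network can be assembled by hand. The sketch below (NumPy, a one-hidden-layer scalar network with hand-coded gradients; all sizes are illustrative) forms parameter gradients and takes their inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, width = 3, 16
W1 = rng.normal(size=(width, d_in)) / np.sqrt(d_in)
w2 = rng.normal(size=width) / np.sqrt(width)

def grad_params(x):
    # f(x) = w2 . tanh(W1 x); return df/dtheta as one flat vector.
    h = np.tanh(W1 @ x)
    dW1 = np.outer(w2 * (1 - h**2), x)   # df/dW1, shape (width, d_in)
    dw2 = h                              # df/dw2, shape (width,)
    return np.concatenate([dW1.ravel(), dw2])

def ntk(x, xp):
    # K(x, x') = <grad_theta f(x), grad_theta f(x')>
    return grad_params(x) @ grad_params(xp)

X = rng.normal(size=(4, d_in))
K = np.array([[ntk(x, xp) for xp in X] for x in X])
print(K.round(3))   # a symmetric positive semidefinite 4x4 kernel matrix
```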


12. Direct Sums and Quotient Spaces

12.1 Direct Sum

If $W_1$ and $W_2$ are subspaces of $V$, then

$$V = W_1 \oplus W_2$$

means:

  1. every vector in $V$ can be written as $\mathbf{w}_1 + \mathbf{w}_2$ with $\mathbf{w}_1 \in W_1$ and $\mathbf{w}_2 \in W_2$
  2. the decomposition is unique

Equivalently:

  • $W_1 + W_2 = V$
  • $W_1 \cap W_2 = \{\mathbf{0}\}$

Dimensions add:

$$\dim(W_1 \oplus W_2) = \dim(W_1) + \dim(W_2).$$

The most important example in inner-product spaces is

$$V = W \oplus W^\perp.$$

Direct sums say that a space can be assembled from independent coordinate subsystems.

Direct sum intuition

      w2
      ^
      |
O---->*
  w1   \
        \
         v = w1 + w2

Two independent pieces add to one vector.

12.2 Decomposing R^n

For a matrix $A$, one fundamental decomposition is

$$\mathbb{R}^n = \mathrm{row}(A) \oplus \mathrm{null}(A).$$

Likewise,

$$\mathbb{R}^m = \mathrm{col}(A) \oplus \mathrm{null}(A^\top).$$

For symmetric matrices, eigenspaces corresponding to distinct eigenvalues are orthogonal, allowing spectral decompositions of the form

$$\mathbb{R}^n = E_{\lambda_1} \oplus \cdots \oplus E_{\lambda_k}.$$

In AI, decompositions of hidden spaces into feature subspaces, head subspaces, or signal-versus-noise components are often informal versions of direct-sum thinking.

12.3 Quotient Spaces

Given a subspace $W \subseteq V$, the quotient space $V/W$ identifies vectors that differ by an element of $W$:

$$\mathbf{v}_1 \sim \mathbf{v}_2 \quad \Longleftrightarrow \quad \mathbf{v}_1 - \mathbf{v}_2 \in W.$$

The elements of the quotient are cosets

$$\mathbf{v} + W = \{\mathbf{v} + \mathbf{w} : \mathbf{w} \in W\}.$$

Dimension behaves cleanly:

$$\dim(V/W) = \dim(V) - \dim(W).$$

Conceptually, quotienting collapses all directions in $W$ to zero. If a model is invariant to a set of directions, the meaningful representation can often be thought of as living in a quotient space.

This is the right abstraction for "ignore all variation of type X." It appears in invariant learning, symmetry reduction, and representation factorization.

Quotient-space intuition

Before quotient by W:          After collapsing W-directions:

*----*----*                    [o]   [o]   [o]
*----*----*        ---->          each class = one coset
*----*----*

Points that differ only along W become the same object.

12.4 Tensor Product

The tensor product $V \otimes W$ is the vector space generated by formal bilinear pairs $v \otimes w$ subject to bilinearity rules. For finite-dimensional spaces,

$$\dim(V \otimes W) = \dim(V)\dim(W).$$

If $\{\mathbf{e}_i\}$ is a basis for $V$ and $\{\mathbf{f}_j\}$ is a basis for $W$, then $\{\mathbf{e}_i \otimes \mathbf{f}_j\}$ is a basis for $V \otimes W$.

In coordinates, matrices can be viewed as tensor-product objects:

$$\mathbb{R}^{m \times n} \cong \mathbb{R}^m \otimes \mathbb{R}^n.$$

A rank-1 matrix is a simple tensor:

$$\mathbf{u}\mathbf{v}^\top \leftrightarrow \mathbf{u} \otimes \mathbf{v}.$$

This is relevant for model compression and LoRA. A low-rank update

$$\Delta W = BA$$

is a sum of a small number of rank-1 outer products, so it lives in a low-dimensional, tensorially structured subset of matrix space.

Outer product / simple tensor

column u           row v^T            matrix u v^T

[u1]               [v1 v2 v3]         [u1v1 u1v2 u1v3]
[u2]       x                          [u2v1 u2v2 u2v3]

One column times one row creates a rank-1 matrix.
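
The identity $\Delta W = BA = $ sum of rank-1 outer products can be checked directly; this sketch (NumPy, with illustrative dimensions and adapter rank) does so.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                      # ambient dimension, adapter rank

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, d))
delta_W = B @ A                    # LoRA-style low-rank update

# Same object, written as a sum of r rank-1 simple tensors u_i v_i^T.
as_outer_sum = sum(np.outer(B[:, i], A[i, :]) for i in range(r))

print(np.allclose(delta_W, as_outer_sum))                 # True
print("rank =", np.linalg.matrix_rank(delta_W), "of", d)  # 8 of 512
```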

13. Norms, Regularization, and Geometry

13.1 L2 Regularization as a Geometric Constraint

L2 regularization adds a penalty

$$\lambda \|\theta\|_2^2$$

to the loss. Geometrically, this prefers solutions near the origin and can be viewed through the constrained problem

$$\min_\theta L(\theta) \quad \text{subject to} \quad \|\theta\|_2 \leq r.$$

The L2 ball is smooth and rotationally symmetric. As a result, L2 regularization shrinks parameters continuously rather than forcing exact zeros.

This is the modern descendant of ridge regression (Hoerl and Kennard, 1970). In neural networks it appears as weight decay.

L2 constraint set in 2D

      ____
    /      \
   |   .    |
    \______/

smooth boundary -> shrink smoothly

13.2 L1 Regularization and Sparsity

L1 regularization uses

$$\lambda \|\theta\|_1.$$

Its geometry is different. The L1 ball has corners along the coordinate axes, so when an objective first touches the constraint boundary, it is disproportionately likely to do so at a sparse point.

That geometric fact explains why L1 promotes sparsity (Tibshirani, 1996). The corresponding proximal operator is soft thresholding:

$$\mathrm{prox}_{\lambda \|\cdot\|_1}(x_i) = \mathrm{sign}(x_i)\max(0, |x_i| - \lambda).$$

Small coordinates are driven exactly to zero.
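
A minimal sketch of soft thresholding (NumPy) shows the sparsifying behavior: entries with magnitude below $\lambda$ become exactly zero.

```python
import numpy as np

def soft_threshold(x, lam):
    # prox of lam * ||.||_1: shrink toward zero, clipping small entries to 0.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([3.0, -0.4, 0.05, -2.0, 0.3])
print(soft_threshold(x, lam=0.5))
# [ 2.5 -0.   0.  -1.5  0. ]   small coordinates are exactly zeroed
```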

This matters for:

  • feature selection
  • sparse coding
  • pruning
  • sparse MoE routing and sparse attention variants

L1 constraint set in 2D

        /\
       /  \
       \  /
        \/

corners on coordinate axes -> sparse solutions are favored

13.3 Spectral Regularization

For a matrix-valued parameter $W$, several geometry-aware penalties are common.

Frobenius penalty:

$$\|W\|_F^2,$$

which is entrywise L2 regularization.

Spectral norm control:

$$\|W\|_2 = \sigma_{\max}(W),$$

which bounds the largest one-step amplification of the layer.

Nuclear norm:

$$\|W\|_*,$$

which promotes low rank.

Spectral normalization rescales a weight matrix by an estimate of its top singular value so that the layer has bounded operator norm (Miyato et al., 2018). This is a geometric way of controlling Lipschitz constants and stabilizing training.
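
Here is a minimal sketch of spectral normalization via power iteration (NumPy; practical implementations persist the iteration vectors across training steps, which this standalone version omits).

```python
import numpy as np

def spectral_normalize(W, n_iters=50):
    # Estimate sigma_max(W) by power iteration, then rescale W by it.
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # top singular value estimate
    return W / sigma

W = np.random.default_rng(1).normal(size=(64, 32))
W_sn = spectral_normalize(W)
print(np.linalg.norm(W_sn, 2))   # ~1.0: operator norm capped at one
```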

What spectral control means geometrically

unit ball              after W                after spectral normalization

   ()                    (------)                  (----)
                         long axis = sigma_max     longest stretch capped

Spectral norm controls the biggest stretching direction.

13.4 The Geometry of Gradient Descent

The gradient

$$\nabla L(\theta)$$

is a vector in parameter space pointing in the direction of steepest increase of the loss under the Euclidean metric. Gradient descent updates by

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t).$$

Several geometric facts follow:

  • the gradient is orthogonal to level sets of $L$
  • ill-conditioned curvature creates narrow valleys and zig-zagging updates
  • preconditioning changes the metric of parameter space to make descent more isotropic

If the Hessian has eigenvalues $\lambda_{\max}$ and $\lambda_{\min}$ with large ratio

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}},$$

then the local landscape is elongated and naive gradient descent is inefficient.

Adaptive methods and natural gradient can be understood as changes of geometry. Natural gradient replaces the Euclidean metric with the Fisher information metric, aligning steepest descent with the geometry of the model's predictive distribution (Amari, 1998).

This is the right mental model: optimization is not just arithmetic on parameters; it is motion through a geometric space whose metric determines what "steepest" even means.
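
To see conditioning bite, this sketch (NumPy, a 2D quadratic with $\kappa = 100$ and a step size near the stability limit) runs plain gradient descent: the steep coordinate oscillates while the flat coordinate crawls.

```python
import numpy as np

# Quadratic loss L(x) = 0.5 * x^T H x with condition number 100.
H = np.diag([100.0, 1.0])
eta = 0.9 * 2 / 100.0            # near the stability limit for the steep axis

x = np.array([1.0, 1.0])
for t in range(201):
    if t % 50 == 0:
        print(f"t={t:3d}  x = {x.round(4)}")
    x = x - eta * (H @ x)
# The steep coordinate oscillates and shrinks quickly; the flat coordinate
# decays only like (1 - eta)^t, so overall progress is bottlenecked by kappa.
```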

Gradient descent in a narrow valley

\                /
 \   x ->      /
  \           /
  /   <- x   \
 /           \
/_____________\

Poor conditioning makes updates zig-zag.

14. Common Mistakes

| Mistake | Why it is wrong | Better statement |
| --- | --- | --- |
| "A set closed under addition is automatically a subspace." | It must also contain the zero vector and be closed under scalar multiplication. | Check all subspace conditions, not just one. |
| "Independent vectors are orthogonal." | Orthogonality implies independence for nonzero vectors, but independence does not imply orthogonality. | Use Gram-Schmidt when you want orthogonality. |
| "The zero vector is not in a proper subspace." | Every subspace must contain the zero vector. | If $\mathbf{0} \notin W$, then $W$ is not a subspace. |
| "Cosine similarity tells you Euclidean closeness." | High cosine can coexist with large Euclidean distance when norms differ a lot. | Cosine measures angle; Euclidean distance also depends on magnitude. |
| "All vectors in high dimension are similar." | Random high-dimensional vectors are typically nearly orthogonal, not highly aligned. | Meaningful similarity is rare relative to the random baseline. |
| "A spanning set has as many vectors as the dimension." | Spanning sets can be redundant. Dimension counts basis vectors, not all listed vectors. | Reduce to a linearly independent spanning set to find dimension. |
| "The simplex is a vector space because it contains vectors." | It is not closed under arbitrary scaling or addition. | The simplex is a convex subset of a vector space. |
| "Projection leaves every vector unchanged." | It leaves only vectors already in the target subspace unchanged. | Projection keeps the in-subspace component and removes the orthogonal remainder. |
| "Two spaces of the same dimension are equal." | They are isomorphic, not literally the same set. | Same dimension means same linear structure up to coordinate relabeling. |
| "Null$(A)$ and Null$(A^\top)$ are the same thing." | They live in different ambient spaces unless $A$ is square and special. | Right null space and left null space are distinct objects. |

15. Exercises

These are designed to force both symbolic fluency and geometric interpretation.

  1. Vector space verification. Determine whether each set is a vector space under the stated operations. If it is, justify all axioms; if not, identify a failing axiom.

    • (a) $\{(x,y) \in \mathbb{R}^2 : x \geq 0\}$ with standard operations
    • (b) $\{f:\mathbb{R}\to\mathbb{R} : f(0)=0\}$ with pointwise operations
    • (c) $\{A \in \mathbb{R}^{2 \times 2} : \det(A)=1\}$ with matrix addition
    • (d) $\{\mathbf{p} \in \mathbb{R}^3 : p_1+p_2+p_3=1\}$ with componentwise addition
  2. Subspaces and dimension.

    • (a) Show that $\{(x,y,z)\in\mathbb{R}^3 : x+2y-z=0\}$ is a subspace, find a basis, and compute its dimension.
    • (b) Find the dimension of the intersection of the planes $x+y+z=0$ and $x-y+2z=0$.
    • (c) Let $W=\mathrm{span}\{(1,1,0),(1,0,1),(0,1,-1)\}$. Find a basis for $W$ and determine whether $W=\mathbb{R}^3$.
  3. Inner products and orthogonality. Let $\mathbf{u}=(1,2,-1)$ and $\mathbf{v}=(3,0,1)$.

    • (a) Compute $\langle \mathbf{u}, \mathbf{v} \rangle$.
    • (b) Compute $\|\mathbf{u}\|_2$ and $\|\mathbf{v}\|_2$.
    • (c) Find the angle between the vectors.
    • (d) Compute the projection of $\mathbf{u}$ onto $\mathbf{v}$.
    • (e) Verify the Pythagorean decomposition.
    • (f) Apply Gram-Schmidt to produce an orthonormal basis for $\mathrm{span}\{\mathbf{u}, \mathbf{v}\}$.
  4. Rank-nullity. For

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 1 & 1 & 2 \end{pmatrix},$$

find:

    • (a) a basis for $\mathrm{null}(A)$
    • (b) a basis for $\mathrm{col}(A)$
    • (c) a basis for $\mathrm{row}(A)$
    • (d) the rank and nullity, verifying rank-nullity
    • (e) a nonzero vector in $\mathrm{null}(A^\top)$

  5. Projection matrices.

    • (a) Compute the orthogonal projector onto the line spanned by $(1,2,2)^\top$.
    • (b) Verify $P^2=P$ and $P^\top=P$.
    • (c) Compute $I-P$ and identify its target subspace.
    • (d) For

$$A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix},$$

compute $P_A = A(A^\top A)^{-1}A^\top$ and interpret the result geometrically.

  6. High-dimensional geometry.

    • (a) Suppose $\mathbf{x}, \mathbf{y}$ are independent random unit vectors in $\mathbb{R}^{100}$. What are the expected value and approximate variance of $\langle \mathbf{x}, \mathbf{y} \rangle$?
    • (b) For 500 points and $\varepsilon=0.1$, estimate a Johnson-Lindenstrauss target dimension $k$ up to constants.
    • (c) Show that all distinct standard basis vectors in $\mathbb{R}^d$ are at Euclidean distance $\sqrt{2}$ from one another.
  7. Attention geometry. Let $W_Q \in \mathbb{R}^{512 \times 64}$.

    • (a) What is the maximum possible rank of $W_Q$?
    • (b) What is the dimension of its null space if the rank is maximal?
    • (c) What kinds of input directions are lost under this projection?
    • (d) Why is $W_Q W_K^\top$ low rank?
  8. Norms and distances.

    • (a) For $\mathbf{v}=(3,-4,0,1)^\top$, compute $\|\mathbf{v}\|_1$, $\|\mathbf{v}\|_2$, and $\|\mathbf{v}\|_\infty$.
    • (b) Verify $\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1$.
    • (c) For

$$A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix},$$

compute $\|A\|_F$ and $\|A\|_2$.

  9. Quotient-space thinking. Suppose a representation is unchanged when you add any vector from a subspace $W$. Explain why the meaningful information lives in $V/W$ rather than in $V$ itself.

  10. Regularization geometry. Sketch or describe the L1 and L2 unit balls in $\mathbb{R}^2$, then explain geometrically why L1 tends to produce sparse solutions and L2 typically does not.


16. Why This Matters for AI

| Aspect | Why vectors and spaces matter |
| --- | --- |
| Embeddings | Every token, position, and feature is represented as a vector in a learned space. |
| Attention | Attention is built from inner products, low-rank projections, and weighted sums. |
| Representation learning | Features live in subspaces, not just in scalar coordinates. |
| Compression | Low-rank methods, adapters, and LoRA rely on dimension, span, and rank. |
| Optimization | Gradients are vectors in parameter space, and regularization is geometric constraint design. |
| Generalization theory | RKHS, NTK, margin bounds, and norm-based complexity all use vector-space language. |
| Numerical stability | Orthogonality, conditioning, and norm control determine whether training remains stable. |
| Interpretability | Linear probes, concept vectors, activation steering, and superposition are all geometric analyses. |
| Retrieval and search | Similarity search depends on metrics, angle, and concentration in high dimensions. |
| Probabilistic modeling | Logits live in vector spaces, and probabilities live in simplices embedded in them. |

The short version is that vectors and spaces are not one topic among many. They are the common substrate of model architecture, learning dynamics, representation quality, and theoretical analysis.


17. Conceptual Bridge

Scalars are the atoms of computation. Vectors are organized collections of scalars. Vector spaces tell us which collections can be added and scaled consistently. Norms and inner products add geometry. Linear maps move us between spaces. Convexity constrains what combinations are allowed. High-dimensional geometry tells us why intuition breaks and why large models can still work.

That is the bridge to the next chapters:

sets and functions
  -> vectors and spaces
  -> matrices and linear maps
  -> orthogonality, eigenvalues, and SVD
  -> optimization, probability, and information
  -> neural networks and transformers

If this chapter gives you one durable mental model, let it be this: deep learning is geometry performed by linear maps inside structured high-dimensional spaces, with nonlinearities deciding which geometric regions remain active.


References

Core Linear Algebra and Geometry

  • Dasgupta, S. and Gupta, A. (2003). An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1), 60-65.
  • Johnson, W. B. and Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189-206.
  • Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.

AI and Representation Geometry

  • Dubey, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
  • Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. EMNLP-IJCNLP.
  • Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.

Low Rank, Kernels, and Optimization Geometry

  • Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS.

Regularization and Information Geometry

  • Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251-276.
  • Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55-67.
  • Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks. ICLR.
  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267-288.

