The Language and Philosophy of Linear Algebra

Published: 5th January 2025

"Mathematics reveals its secrets only to those who approach it with pure love, for its own beauty." — Archimedes

Introduction

Linear algebra is the branch of mathematics concerned with vectors, matrices, and linear transformations. In simple terms, it studies linear combinations of quantities and the structures that arise from them. Despite its abstract nature, linear algebra has become a fundamental tool in numerous fields, ranging from physics and engineering to economics and computer science. In particular, linear algebra forms the backbone of modern artificial intelligence (AI) and machine learning (ML). Many core AI algorithms are either explicitly based on linear algebra or can be understood in terms of linear structures.

At its core, linear algebra provides a language to describe data and transformations. In machine learning, data points (e.g., images, feature vectors, word embeddings) are usually represented as vectors in a high-dimensional vector space. Models often consist of parameters that can be organised into matrices and vectors (for example, the weight matrix in a neural network layer or the coefficients in a linear regression model). Operations like computing predictions, updating parameters, or propagating information through a network are essentially linear algebra operations (matrix-vector multiplications, dot products, etc.). Even many nonlinear models rely on linear algebra behind the scenes: for instance, each layer in a deep neural network is a linear transformation followed by a nonlinear activation.

Linear algebra is not only practically useful but also theoretically illuminating. It provides tools to analyse and understand complex systems by breaking them into simpler linear components. In fact, for a nonlinear system, one often performs a linearisation: the derivative (Jacobian) of a multivariate function at a point is a linear map that best approximates the function near that point. For example, principal component analysis (PCA) uses linear algebra (eigenvectors of a covariance matrix) to identify directions of maximum variance in data, which helps in dimensionality reduction and data compression. In reinforcement learning, one can analyse the dynamics of state transitions using matrices, and techniques like the successor representation rely on matrix operations (indeed, the successor representation can be written as $M = I + \gamma T + \gamma^2 T^2 + \cdots$, involving powers of the state transition matrix). Understanding these linear components is often the first step to mastering more complex nonlinear behaviours.

Historical Context

Linear algebra is often treated as a prerequisite, a technical necessity for machine learning. But it is far more. It is the silent foundation upon which the architectures of artificial cognition are built. It defines the axes of thought, the spaces of meaning, the transformations of insight. When we build AI systems, we are not just writing code—we are sculpting geometry, carving intelligence from chaos with every matrix we multiply.

To study linear algebra in this context is not just to learn math. It is to engage with a new kind of metaphysics—one in which thought is structure, and structure is carved in dimensions we cannot see, but which increasingly shape the world we live in.

The development of linear algebra spans centuries and involves many great mathematicians. Solving systems of linear equations has been a problem of interest since ancient times. Methods equivalent to what we now call Gaussian elimination were described in ancient China as early as 200 BC in The Nine Chapters on the Mathematical Art, which included solutions of $3 \times 3$ linear systems. In the West, the introduction of coordinate geometry by René Descartes (1637) linked algebra to geometry, allowing geometric problems (like finding the intersection of lines and planes) to be solved via linear equations. Gottfried Leibniz (around 1693) first considered the use of determinants for solving linear systems, and Gabriel Cramer formulated his determinant-based rule (Cramer's rule) in 1750. Carl Friedrich Gauss advanced the systematic solution of linear systems in the early 19th century, popularising the elimination method that now bears his name.

The 19th century saw linear algebra emerge as a distinct mathematical field. In 1844, Hermann Grassmann published his "Theory of Extension," introducing the concept of abstract vector spaces and operations on them. Around the same time, James Joseph Sylvester and Arthur Cayley began developing matrix theory. Sylvester coined the term "matrix" (Latin for "womb") in 1848, reflecting the idea of a matrix as an object that generates a set of linear combinations (much like a womb generates life). Cayley introduced matrix multiplication and the notion of the matrix inverse by 1856, treating matrices as mathematical objects in their own right. He also realised that composing linear transformations corresponds to multiplying their matrices, and he helped establish fundamental results such as the Cayley-Hamilton Theorem (that every square matrix satisfies its own characteristic polynomial).

By the late 19th and early 20th centuries, the concept of a vector space was formalised (Giuseppe Peano gave the first modern definition of vector spaces in 1888). A theory of linear transformations of finite-dimensional vector spaces emerged around 1900, and linear algebra took its modern abstract form in the first half of the 20th century as ideas from earlier centuries were generalised in the framework of abstract algebra. In the mid-20th century, the advent of computers greatly increased interest in linear algebra for practical computations. Efficient algorithms for Gaussian elimination, eigenvalue problems, and matrix decompositions were developed, enabling the solution of large linear systems and eigenproblems. Linear algebra became an essential tool for modelling and simulations in science and engineering. Today, it underpins many algorithms in data science and AI, where the ability to handle high-dimensional vectors and large matrices is critical.

Linear Algebra in AI and Machine Learning

It is often said that linear algebra is the language in which machine learning is written. From a simple linear regression to a cutting-edge deep neural network, the underlying computations rely on linear algebra. For example:

In data preprocessing, we often centre and scale data using vector operations. Principal Component Analysis (PCA), a widely used technique for reducing data dimensionality, is essentially an application of eigenvalues and eigenvectors of the data's covariance matrix.
In linear regression, the goal is to find a weight vector $w$ such that $Xw \approx y$, where $X$ is the matrix of input features (each row is an example, each column a feature) and $y$ is the vector of target values. This can be solved by linear algebra methods (the normal equations, involving $X^T X$ and $X^T y$) or by iterative optimisation methods that move through weight space via linear algebra steps (gradient descent uses vector subtraction along the gradient direction).
Logistic regression and other classifiers also rely on linear algebra for computing the model output (a weighted sum of features passed through a nonlinear activation function) and for optimisation via gradient descent. The weight update in gradient descent, $w := w - \alpha \nabla L(w)$, is a vector subtraction, where $\nabla L(w)$ is the gradient vector of the loss.
Neural networks are perhaps the most linear-algebra-intensive application: each layer's forward pass is essentially a multiplication of a weight matrix by an input vector (plus a bias vector), and backpropagation involves multiplications by the transpose of those weight matrices to propagate gradients. Weight updates are linear combinations of gradients. Large-scale implementations use highly optimised matrix-matrix operations. For example, training a neural network on a GPU involves multiplying huge matrices of activations and weights. Even convolutional layers (common in image processing) can be viewed as linear operations on vectorised patches of the image.
In reinforcement learning (RL), linear algebra appears in algorithms that use linear function approximation for value functions or policies. For instance, if we approximate a state's value $V(s)$ as a linear combination of features $V(s) \approx w^T \phi(s)$, then finding the best weights $w$ can reduce to a least-squares problem similar to linear regression. Policy evaluation for a fixed policy in an MDP involves solving $(I - \gamma P)v = r$ for the value function $v$, where $P$ is the state transition matrix and $r$ is the reward vector—a system of linear equations. Moreover, advanced concepts like the successor representation or eigenoption discovery use eigen-decomposition of matrices derived from the state transition graph to find useful representations or options for an agent.

This writing aims to bridge the gap between the theory of linear algebra and its practice in modern AI. I will cover the core concepts of linear algebra with both mathematical rigour and intuitive explanations, and then illustrate how each concept connects to machine learning or reinforcement learning applications. Key topics include:

Vectors and vector spaces – understanding vectors, subspaces, bases, and dimensions.
Linear combinations and span – how vectors can be combined, the idea of spanning sets, linear independence, basis, and dimension.
Linear transformations and matrices – viewing matrices as functions between vector spaces, matrix operations (addition, multiplication, transpose, inverse), and important properties like rank and nullity (including the Rank-Nullity Theorem).
Determinants – how to compute them, their properties, and their geometric interpretation as volume-scaling factors.
Eigenvalues and eigenvectors – how to find them (solving the characteristic polynomial), their meaning (directions of invariant stretching/compression), and their use in diagonalisation and understanding matrix behaviour.
Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) – tools for factorising matrices, understanding the action of non-square matrices, and reducing the dimensionality of data.
Orthogonality and orthonormal bases – the power of orthogonal vectors, orthonormal bases, the Gram-Schmidt process for orthonormalisation, and why orthonormal bases simplify computations.
Projections and least squares – how to project onto subspaces, solve least-squares problems for over-determined systems, and the connection to linear regression (approximating solutions when an exact solution doesn't exist).
Applications in AI – tying everything together with examples from machine learning (gradient descent optimisation, linear and logistic regression, neural network layers, dimensionality reduction) and reinforcement learning (state-value prediction, policy gradients, eigen-based representations like successor features).

Throughout this guide, I maintain a focus on pedagogy. Each concept is accompanied by examples (both mathematical and code examples using Python/PyTorch) to demonstrate the ideas in practice. I also provide practice problems with solutions to reinforce understanding. By the end, you should have a strong grasp of linear algebra and a clear sense of how these concepts underpin and empower techniques in machine learning and reinforcement learning.

Let us begin our journey into linear algebra, starting with the most basic objects of study: vectors and vector spaces.

1. Vectors and Vector Spaces

What Is a Vector?

The concept of a vector is one that has many definitions, depending on the context and level of abstraction. Let us start with the most concrete definition and then gradually generalise.

In the most elementary sense, a vector is a quantity with both magnitude and direction. This is the definition you may have encountered in physics, where vectors represent physical quantities like velocity, force, or acceleration. For instance, a velocity vector of 50 km/h north-east specifies both how fast an object is moving (the magnitude: 50 km/h) and in which direction (north-east).

In mathematics, we often represent vectors geometrically as arrows in space. The length of the arrow corresponds to the vector's magnitude, and the direction in which the arrow points corresponds to the vector's direction. This geometric perspective provides valuable intuition for many vector operations.

More formally, for the purposes of computation, we represent vectors as ordered lists of numbers. In an n-dimensional space, a vector is an ordered list of n numbers, often written as:

$$\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}$$

Here, $v_i$ represents the component of the vector along the i-th coordinate axis. For instance, in three-dimensional space, a vector $\mathbf{v} = \begin{pmatrix} 3 \\ 4 \\ 5 \end{pmatrix}$ has components 3, 4, and 5 along the x, y, and z axes, respectively.

However, the power of linear algebra comes from abstraction. In the most general sense, vectors are elements of a vector space (which we will define shortly). They can be functions, matrices, polynomials, or any objects that satisfy the axioms of a vector space. This level of abstraction allows us to apply the machinery of linear algebra to a wide range of problems, including many in artificial intelligence and machine learning.

Vector Operations

Two fundamental operations on vectors are vector addition and scalar multiplication.

Vector Addition

Given two vectors $\mathbf{u}$ and $\mathbf{v}$ in an n-dimensional space, their sum $\mathbf{u} + \mathbf{v}$ is obtained by adding corresponding components:

$$\mathbf{u} + \mathbf{v} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix} + \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{pmatrix}$$

Geometrically, vector addition follows the parallelogram rule: to add vectors $\mathbf{u}$ and $\mathbf{v}$, place the tail of $\mathbf{v}$ at the head of $\mathbf{u}$. The sum $\mathbf{u} + \mathbf{v}$ is the vector from the tail of $\mathbf{u}$ to the head of $\mathbf{v}$. If we draw a parallelogram with $\mathbf{u}$ and $\mathbf{v}$ as adjacent sides, then $\mathbf{u} + \mathbf{v}$ is the diagonal of this parallelogram.

Scalar Multiplication

Given a scalar (i.e., a number) $\alpha$ and a vector $\mathbf{v}$, their product $\alpha \mathbf{v}$ is obtained by multiplying each component of $\mathbf{v}$ by $\alpha$:

$$\alpha \mathbf{v} = \alpha \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} \alpha v_1 \\ \alpha v_2 \\ \vdots \\ \alpha v_n \end{pmatrix}$$

Geometrically, scalar multiplication stretches or compresses the vector, and may reverse its direction. Specifically:

If $\alpha > 1$, then $\alpha \mathbf{v}$ is longer than $\mathbf{v}$.
If $0 < \alpha < 1$, then $\alpha \mathbf{v}$ is shorter than $\mathbf{v}$.
If $\alpha < 0$, then $\alpha \mathbf{v}$ points in the opposite direction to $\mathbf{v}$.
If $\alpha = 0$, then $\alpha \mathbf{v}$ is the zero vector, denoted as $\mathbf{0}$.

Vector Subtraction

Vector subtraction is defined in terms of addition and scalar multiplication: $\mathbf{u} - \mathbf{v} = \mathbf{u} + (-1) \mathbf{v}$.

The Dot Product

The dot product (or inner product) of two vectors $\mathbf{u}$ and $\mathbf{v}$ is a scalar value obtained by multiplying corresponding components and summing the results:

$$\mathbf{u} \cdot \mathbf{v} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n = \sum_{i=1}^{n} u_i v_i$$

Geometrically, the dot product is related to the angle $\theta$ between the vectors:

$$\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos \theta$$

where $\|\mathbf{u}\|$ and $\|\mathbf{v}\|$ are the magnitudes (or norms) of $\mathbf{u}$ and $\mathbf{v}$, respectively.

The dot product has several important properties and applications:

Two non-zero vectors are orthogonal (perpendicular) if and only if their dot product is zero.
The dot product can be used to compute the length (or norm) of a vector: $\|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}}$.
The dot product can be used to compute the projection of one vector onto another.

The Cross Product (in 3D)

In three-dimensional space, the cross product of two vectors $\mathbf{u}$ and $\mathbf{v}$ is a vector $\mathbf{u} \times \mathbf{v}$ that is perpendicular to both $\mathbf{u}$ and $\mathbf{v}$. Its magnitude is given by $\|\mathbf{u} \times \mathbf{v}\| = \|\mathbf{u}\| \|\mathbf{v}\| \sin \theta$, where $\theta$ is the angle between $\mathbf{u}$ and $\mathbf{v}$.

The cross product is computed as:

$$\mathbf{u} \times \mathbf{v} = \begin{pmatrix} u_2 v_3 - u_3 v_2 \\ u_3 v_1 - u_1 v_3 \\ u_1 v_2 - u_2 v_1 \end{pmatrix}$$

The cross product has applications in physics and geometry, such as computing torque, finding a vector perpendicular to a plane, and determining the area of a parallelogram.

Vector Spaces

A vector space (or linear space) is a collection of objects called vectors, which may be added together and multiplied by scalars (i.e., numbers). More formally, a vector space over a field $F$ (e.g., the real numbers $\mathbb{R}$ or the complex numbers $\mathbb{C}$) is a set $V$ together with two operations: vector addition (which is a function $V \times V \to V$) and scalar multiplication (which is a function $F \times V \to V$), satisfying the following axioms:

Closure under vector addition: For all $\mathbf{u}, \mathbf{v} \in V$, the sum $\mathbf{u} + \mathbf{v}$ is in $V$.
Commutativity of vector addition: For all $\mathbf{u}, \mathbf{v} \in V$, $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$.
Associativity of vector addition: For all $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$, $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$.
Identity element of vector addition: There exists an element $\mathbf{0} \in V$, called the zero vector, such that $\mathbf{v} + \mathbf{0} = \mathbf{v}$ for all $\mathbf{v} \in V$.
Inverse elements of vector addition: For each $\mathbf{v} \in V$, there exists an element $-\mathbf{v} \in V$, called the additive inverse of $\mathbf{v}$, such that $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$.
Closure under scalar multiplication: For all $\alpha \in F$ and $\mathbf{v} \in V$, the product $\alpha \mathbf{v}$ is in $V$.
Distributivity of scalar multiplication with respect to vector addition: For all $\alpha \in F$ and $\mathbf{u}, \mathbf{v} \in V$, $\alpha(\mathbf{u} + \mathbf{v}) = \alpha\mathbf{u} + \alpha\mathbf{v}$.
Distributivity of scalar multiplication with respect to field addition: For all $\alpha, \beta \in F$ and $\mathbf{v} \in V$, $(\alpha + \beta)\mathbf{v} = \alpha\mathbf{v} + \beta\mathbf{v}$.
Compatibility of scalar multiplication with field multiplication: For all $\alpha, \beta \in F$ and $\mathbf{v} \in V$, $\alpha(\beta\mathbf{v}) = (\alpha\beta)\mathbf{v}$.
Identity element of scalar multiplication: For all $\mathbf{v} \in V$, $1\mathbf{v} = \mathbf{v}$, where 1 denotes the multiplicative identity in $F$.

These axioms ensure that vector spaces behave in a consistent and predictable manner, which is essential for developing a coherent theory of linear algebra.

Examples of Vector Spaces

Here are some common examples of vector spaces:

Euclidean Spaces

The most familiar vector spaces are the Euclidean spaces $\mathbb{R}^n$, which consist of n-tuples of real numbers. For instance, $\mathbb{R}^2$ is the plane, and $\mathbb{R}^3$ is the three-dimensional space. Addition and scalar multiplication are performed component-wise, as described earlier.

Function Spaces

The set of all functions from a set $X$ to a field $F$ forms a vector space, where addition and scalar multiplication are defined pointwise:

$(f + g)(x) = f(x) + g(x)$ for all $x \in X$
$(\alpha f)(x) = \alpha f(x)$ for all $x \in X$ and $\alpha \in F$

Examples of function spaces include the set of all continuous functions on an interval, the set of all differentiable functions, and the set of all square-integrable functions (which form Hilbert spaces, important in functional analysis and quantum mechanics).

Polynomial Spaces

The set of all polynomials of degree at most n forms a vector space, where addition and scalar multiplication are defined in the usual way for polynomials. For instance, the set of all quadratic polynomials $ax^2 + bx + c$ forms a vector space, with the standard operations of polynomial addition and scalar multiplication.

Matrix Spaces

The set of all $m \times n$ matrices with entries in a field $F$ forms a vector space, where addition and scalar multiplication are defined entry-wise.

Subspaces

A subspace of a vector space $V$ is a subset $W$ of $V$ that is itself a vector space under the operations inherited from $V$. Specifically, $W$ is a subspace of $V$ if and only if:

The zero vector of $V$ is in $W$.
$W$ is closed under vector addition: for all $\mathbf{u}, \mathbf{v} \in W$, the sum $\mathbf{u} + \mathbf{v}$ is in $W$.
$W$ is closed under scalar multiplication: for all $\alpha \in F$ and $\mathbf{v} \in W$, the product $\alpha \mathbf{v}$ is in $W$.

Examples of subspaces include:

The set of all vectors in $\mathbb{R}^3$ that lie in a plane through the origin is a subspace of $\mathbb{R}^3$.
The set of all vectors in $\mathbb{R}^3$ that lie on a line through the origin is a subspace of $\mathbb{R}^3$.
The set of all polynomials of degree at most 2 is a subspace of the vector space of all polynomials.

Linear Combinations and Span

A linear combination of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ from a vector space $V$ is an expression of the form:

$$c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k$$

where $c_1, c_2, \ldots, c_k$ are scalars.

The span of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ is the set of all possible linear combinations of these vectors. It is denoted as $\text{span}(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k)$.

Importantly, the span of a set of vectors is always a subspace of the vector space. This is because the span includes the zero vector (obtained when all coefficients are zero) and is closed under both vector addition and scalar multiplication.

Linear Independence

A set of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ is linearly independent if the only solution to the equation:

$$c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k = \mathbf{0}$$

is $c_1 = c_2 = \cdots = c_k = 0$. In other words, no vector in the set can be expressed as a linear combination of the others. If the set is not linearly independent, it is linearly dependent, and at least one vector in the set can be expressed as a linear combination of the others.

Linear independence is a crucial concept in linear algebra, as it helps determine whether a set of vectors forms a basis for a vector space.

Basis and Dimension

A basis for a vector space $V$ is a set of vectors that:

Spans $V$ (i.e., every vector in $V$ can be expressed as a linear combination of these vectors).
Is linearly independent (i.e., no vector in the set can be expressed as a linear combination of the others).

A standard basis for $\mathbb{R}^n$ is the set of unit vectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n$, where $\mathbf{e}_i$ has a 1 in the $i$-th position and 0s elsewhere.

The dimension of a vector space $V$ is the number of vectors in a basis for $V$. For instance, $\mathbb{R}^n$ has dimension $n$. The dimension of a vector space is well-defined, in the sense that all bases for the same vector space have the same number of vectors.

Implications for AI and Machine Learning

The concepts of vectors and vector spaces are fundamental to many aspects of AI and machine learning:

Data Representation

In machine learning, data points are typically represented as vectors. For instance, an image might be represented as a vector of pixel values, a document as a vector of word frequencies, and a user as a vector of features (age, income, etc.). The vector space is the space of all possible such vectors.

Feature Spaces

Feature spaces in machine learning are vector spaces where each feature corresponds to a dimension. For instance, in a two-feature space (e.g., with features "height" and "weight"), each data point is a vector in $\mathbb{R}^2$. Linear models, such as linear regression and logistic regression, assume that the target is a linear function (or a function of a linear function) of the features.

Dimensionality Reduction

Techniques like Principal Component Analysis (PCA) aim to find lower-dimensional subspaces that capture most of the variance in the data. These subspaces are spanned by a few vectors (the principal components), which form a basis for the subspace.

Neural Networks

In neural networks, each layer transforms vectors from one vector space to another. For instance, a linear layer performs a linear transformation, transforming an input vector of one dimension to an output vector of (potentially) another dimension. Activation functions then apply non-linear transformations to these vectors.

Word Embeddings

In natural language processing, words are often represented as vectors in a high-dimensional space (word embeddings), where similar words have similar vectors. These embeddings can be learned by models like Word2Vec or GloVe, and they capture semantic relationships between words. For example, in a well-trained embedding, vector operations can reveal semantic relationships: $\text{vector("king")} - \text{vector("man")} + \text{vector("woman")} \approx \text{vector("queen")}$.

Summary

In this chapter, we have explored the foundational concepts of vectors and vector spaces. We have seen how vectors can be represented, the operations that can be performed on them, and the structures that they form. We have also discussed the implications of these concepts for AI and machine learning.

In the next chapter, we will delve into linear transformations and matrices, which allow us to systematically manipulate vectors and spaces.

Exercises

Show that the set of all 2 × 2 matrices with entries in $\mathbb{R}$ forms a vector space over $\mathbb{R}$. What is the dimension of this vector space?
Prove that the set of all polynomials of degree exactly 2 is not a subspace of the vector space of all polynomials. What about the set of all polynomials of degree at most 2?
Given vectors $\mathbf{v}_1 = (1, 2, 3)$, $\mathbf{v}_2 = (2, 3, 4)$, and $\mathbf{v}_3 = (3, 5, 7)$ in $\mathbb{R}^3$, determine whether they are linearly independent. If not, express one of the vectors as a linear combination of the others.
Consider the vector space $\mathbb{R}^4$. Find a basis for the subspace consisting of all vectors $(x, y, z, w)$ such that $x + y + z + w = 0$. What is the dimension of this subspace?
In the context of machine learning, explain why representing data points as vectors in a feature space is useful. Give an example of how linear operations on these vectors might be used in a simple machine learning algorithm.

2. Matrices and Linear Transformations

Linear Transformations and Matrix Representation

A function between two vector spaces that preserves the linear structure is called a linear transformation. More formally, let $T: V \to W$ be a mapping from vector space $V$ to vector space $W$. We say $T$ is linear if for any vectors $\mathbf{u},\mathbf{v} \in V$ and any scalar $a \in \mathbb{R}$:

$$T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}), \qquad T(a\,\mathbf{v}) = a\,T(\mathbf{v}).$$

These two properties imply that $T$ preserves linear combinations in general:

$$T(a \mathbf{u} + b \mathbf{v}) = a\,T(\mathbf{u}) + b\,T(\mathbf{v})$$

for any scalars $a, b$ and vectors $\mathbf{u}, \mathbf{v}$. Intuitively, a linear map $T$ is one for which scaling or adding inputs results in the same scaling or adding of outputs.

Examples of linear transformations include:

Projections: $T(x,y,z) = (x,y,0)$ is a linear map from $\mathbb{R}^3$ to $\mathbb{R}^3$ that projects a 3D vector onto the $xy$-plane by discarding the $z$ component.
Rotations and reflections about the origin are linear transformations on $\mathbb{R}^2$ or $\mathbb{R}^3$ (they preserve vector addition and scaling).
Differentiation is a linear transformation from the space of polynomials (or other suitable function spaces) to itself: $D(a f + b g) = a D(f) + b D(g)$. Integration (with zero constant of integration) is also linear.
In machine learning, the function that a single layer of a neural network computes (before the nonlinearity) is linear: $T(\mathbf{x}) = W\mathbf{x} + \mathbf{b}$ is a linear map from input vector space to output vector space if we consider the bias $\mathbf{b}$ to be zero or treat it separately. Strictly speaking $T(\mathbf{x})=W\mathbf{x}+\mathbf{b}$ is an affine map, not linear, because of the constant $\mathbf{b}$ (since $T(0) = \mathbf{b} \neq 0$ in general). If $\mathbf{b}=0$, it is linear.

When working with finite-dimensional vector spaces, linear transformations are conveniently represented by matrices. If $V$ and $W$ are vector spaces of dimensions $n$ and $m$, respectively, then any linear map $T: V \to W$ can be associated with an $m \times n$ matrix $A$ such that $T(\mathbf{x}) = A \mathbf{x}$ for all $\mathbf{x} \in V$. How do we determine this matrix? It depends on the choice of bases for $V$ and $W$.

Suppose $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$ is a basis of $V$ and $\{\mathbf{f}_1, \ldots, \mathbf{f}_m\}$ is a basis of $W$. For each basis vector $\mathbf{e}_j$ of $V$, its image $T(\mathbf{e}_j)$ is some vector in $W$, which we can express in the $W$-basis as:

$$T(\mathbf{e}_j) = a_{1j} \mathbf{f}_1 + a_{2j} \mathbf{f}_2 + \cdots + a_{mj} \mathbf{f}_m.$$

The coefficients $a_{ij}$ are the entries of the matrix $A$ that represents $T$ (with respect to those bases). In other words, the $j$-th column of $A$ is $[a_{1j}, a_{2j}, \ldots, a_{mj}]^T$, which are the coordinates of $T(\mathbf{e}_j)$ in $W$.

Once we know $A$ on basis vectors, for an arbitrary vector $\mathbf{x} = x_1 \mathbf{e}_1 + \cdots + x_n \mathbf{e}_n$, linearity gives:

$$T(\mathbf{x}) = x_1 T(\mathbf{e}_1) + \cdots + x_n T(\mathbf{e}_n) = x_1 (a_{*1}) + \cdots + x_n (a_{*n}),$$

where $a_{*j}$ denotes the $j$-th column of $A$. This is exactly the matrix-vector product $A \mathbf{x}$.

In the common case where $V = \mathbb{R}^n$ and $W = \mathbb{R}^m$ and we use their standard bases, the matrix of a linear transformation is simply obtained by seeing how the transformation acts on standard unit vectors. For example, a linear $T: \mathbb{R}^3 \to \mathbb{R}^2$ can be written as:

$$T(x,y,z) = (a_{11}x + a_{12}y + a_{13}z,\; a_{21}x + a_{22}y + a_{23}z),$$

which corresponds to the $2 \times 3$ matrix

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix},$$

so that $T(\mathbf{x}) = A \mathbf{x}$.

The main advantage of representing linear maps with matrices is that it turns problems about linear transformations into problems about matrices, which can be tackled with algebraic operations. Composition of linear maps corresponds to multiplication of their matrices; the inverse of a linear map (if it exists) corresponds to the matrix inverse, and so on.

Matrix Operations

An $m \times n$ matrix $A$ is a rectangular array of numbers with $m$ rows and $n$ columns:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

We denote the entry in the $i$-th row and $j$-th column by $a_{ij}$. The set of all $m\times n$ real matrices is denoted $\mathbb{R}^{m\times n}$. As mentioned, $\mathbb{R}^{m\times n}$ is a vector space of dimension $mn$ (each matrix can be thought of as a vector of length $mn$ by listing its entries).

Basic operations on matrices include:

Matrix addition: You can add two matrices of the same size by adding their corresponding entries. If $A, B \in \mathbb{R}^{m\times n}$, then $C = A+B$ is the $m\times n$ matrix with $c_{ij} = a_{ij} + b_{ij}$. Matrix addition is commutative and associative, and has an additive identity (the zero matrix, with all entries 0).
Scalar multiplication: Like with vectors, you can multiply a matrix by a scalar $c \in \mathbb{R}$ to get $cA$, where $(cA)_{ij} = c\cdot a_{ij}$.
Matrix-vector multiplication: If $A$ is $m\times n$ and $\mathbf{x} \in \mathbb{R}^n$ (think of $\mathbf{x}$ as an $n\times 1$ matrix), then $A\mathbf{x}$ is an $m$-dimensional vector given by:
$$A\mathbf{x} = \begin{pmatrix} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n \\ \vdots \\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n \end{pmatrix}$$
This is exactly the linear combination of the columns of $A$ weighted by the components of $\mathbf{x}$ (or equivalently, dot products of the rows of $A$ with $\mathbf{x}$). This operation corresponds to applying the linear transformation represented by $A$ to the vector $\mathbf{x}$.
Matrix-matrix multiplication: This generalizes the above. If $A$ is an $m\times n$ matrix and $B$ is an $n \times p$ matrix, then the product $C = A B$ is defined to be the $m \times p$ matrix such that:
$$(AB)_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj},$$
for $1 \le i \le m$ and $1 \le j \le p$. In words, to get the entry in the $i$-th row and $j$-th column of $C$, take the $i$-th row of $A$ and the $j$-th column of $B$, multiply corresponding elements and sum them (i.e., take their dot product). For the multiplication $A B$ to be defined, the number of columns of $A$ must equal the number of rows of $B$ (here both equal $n$). The resulting matrix $AB$ has the same number of rows as $A$ and the same number of columns as $B$.

The definition of matrix multiplication might seem arbitrary, but it is designed exactly so that the matrix of a composed linear transformation is the product of the matrices. If $T: U \to V$ and $S: V \to W$ are linear maps with matrices $A$ and $B$ respectively (so $A$ is representing $T$ and $B$ representing $S$), then the composition $S \circ T: U \to W$ has matrix $B A$. Notice the order: if we first apply $T$ (matrix $A$) then $S$ (matrix $B$), the combined effect is $B A$. This matches with the rule $(B A)\mathbf{x} = B(A \mathbf{x})$, i.e., you first do $A \mathbf{x}$ then $B$ times that result.

Matrix multiplication is associative: $(AB)C = A(BC)$ when dimensions are compatible. However, it is not commutative in general: $AB$ is usually not equal to $BA$. In fact, $BA$ might not even be defined if $A$ and $B$ are different sizes (e.g., $A$ is $m\times n$ and $B$ is $n \times p$, then $BA$ would be $n \times n$ times $m \times n$, which is invalid unless $m=n$ and even then the values can differ). This non-commutativity is a major difference from multiplication of numbers.

Another important matrix operation is the transpose. The transpose of an $m \times n$ matrix $A$, denoted $A^T$, is the $n \times m$ matrix whose rows are the columns of $A$. In symbols, $(A^T)_{ij} = a_{ji}$. For example:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 1 & 2 \\ 4 & 5 \\ 7 & 8 \end{pmatrix}.$$

Taking transpose twice returns to the original: $(A^T)^T = A$. Transpose interacts with matrix multiplication by reversing order: $(AB)^T = B^T A^T$. This is analogous to the rule $(XY)^{-1} = Y^{-1} X^{-1}$ for inverses.

A matrix that is equal to its transpose ($A^T = A$) is called symmetric. Symmetric matrices (assuming real entries) have many nice properties: their eigenvalues are real and they can be diagonalized by an orthonormal basis of eigenvectors (we'll see this later in the eigenvalues section). In data science, covariance matrices are symmetric, and symmetry is related to the notion that a linear map is self-adjoint (angles and lengths are preserved in a certain sense if the matrix is orthogonal, etc.).

Example (Matrix Multiplication in Code)

Matrix multiplication using PyTorch:

import torch

A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[5., 6.], [7., 8.]])

# Matrix multiplication (using @ operator or torch.matmul)
C = A @ B
print(C)

The result of multiplying $ A = \begin{pmatrix}1 & 2\\3 & 4\end{pmatrix} $ and $ B = \begin{pmatrix}5 & 6\\7 & 8\end{pmatrix} $ is:

$$C = AB = \begin{pmatrix}1\cdot 5 + 2\cdot 7 & 1\cdot 6 + 2\cdot 8\\[6pt] 3\cdot 5 + 4\cdot 7 & 3\cdot 6 + 4\cdot 8\end{pmatrix} = \begin{pmatrix}19 & 22\\43 & 50\end{pmatrix}.$$

The code would output:

tensor([[19., 22.], [43., 50.]])

We can also verify properties like non-commutativity by computing $ B @ A $, which yields $\begin{pmatrix}23 & 34\\31 & 46\end{pmatrix}$, not equal to $AB$.

Matrices can be multiplied with vectors as well:

x = torch.tensor([1.0, 2.0])
y = A @ x
print(y)  # tensor([5., 11.])

Here $y = A x = \begin{pmatrix}1 & 2\\3 & 4\end{pmatrix}\begin{pmatrix}1\\2\end{pmatrix} = \begin{pmatrix}5\\11\end{pmatrix}$. This matches the printed result.

Invertible Matrices and Systems of Equations

A square matrix $A \in \mathbb{R}^{n\times n}$ is called invertible (or nonsingular) if there exists another $n\times n$ matrix $B$ such that

$$AB = BA = I_n,$$

where $I_n$ is the $n\times n$ identity matrix (1's on the diagonal and 0's elsewhere). The matrix $B$ is called the inverse of $A$ and denoted $A^{-1}$. If no such $B$ exists, $A$ is called singular (non-invertible).

Invertible matrices correspond to linear transformations that are bijections (one-to-one and onto) and hence have inverses as functions.

For a $2\times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, a formula for the inverse (if it exists) is:

$$A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix},$$

provided that $ad - bc \neq 0$. The quantity $ad - bc$ in this case is the determinant of $A$ (we'll discuss determinants in the next chapter). If $ad - bc = 0$, then $A$ is not invertible (its two rows or two columns are linearly dependent).

For larger matrices, finding the inverse explicitly is more complex, but conceptually a matrix is invertible if and only if its columns (or rows) form a basis of $\mathbb{R}^n$. That means $A$ has rank $n$ (full rank) and determinant $\neq 0$. The inverse transformation $A^{-1}$ exists and is also linear.

One of the most common uses of matrix inverses (or the related techniques) is solving linear systems of equations. Consider $A\mathbf{x} = \mathbf{b}$ where $A$ is an $m \times n$ matrix, $\mathbf{x}\in \mathbb{R}^n$ is unknown and $\mathbf{b}\in \mathbb{R}^m$ is known. This system is essentially $m$ equations in $n$ unknowns:

$$\begin{align*} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1, \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2, \\ \vdots \qquad\qquad & \vdots \\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m. \end{align*}$$

Solving this system is equivalent to finding $\mathbf{x} = A^{-1}\mathbf{b}$ if $A$ is invertible (note $m=n$ in that case).

When $A$ is not a square or not full rank, we cannot find a two-sided inverse, and the system might have either no solutions or infinitely many solutions. We will discuss how to handle those cases in the least squares section. For now, if $A$ is invertible, the unique solution to $A\mathbf{x}=\mathbf{b}$ is $\mathbf{x} = A^{-1}\mathbf{b}$.

In practice, one almost never computes the inverse explicitly to solve a system (doing so is usually less efficient and less numerically stable than other methods). Instead, one uses Gaussian elimination or matrix factorization techniques directly on $(A|\mathbf{b})$, the augmented matrix, to solve for $\mathbf{x}$. But conceptually, the inverse provides a formula.

Example (Solving a Linear System)

Consider the system:

$$\begin{cases} 2x + y = 4, \\ x + y = 3, \end{cases}$$

which in matrix form is $A \mathbf{x} = \mathbf{b}$ with $A = \begin{pmatrix}2 & 1\\1 & 1\end{pmatrix}$, $\mathbf{x} = \begin{pmatrix}x\\y\end{pmatrix}$, $\mathbf{b}=\begin{pmatrix}4\\3\end{pmatrix}$.

The coefficient matrix $A$ has determinant $2\cdot1 - 1\cdot1 = 1$, so $A$ is invertible. The inverse (using the formula) is:

$$A^{-1} = \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix},$$

since $1/(2\cdot1 - 1\cdot1) = 1$.

Now $\mathbf{x} = A^{-1}\mathbf{b} = \begin{pmatrix}1 & -1\\-1 & 2\end{pmatrix}\begin{pmatrix}4\\3\end{pmatrix} = \begin{pmatrix} 1\cdot4 + -1\cdot3 \\ -1\cdot4 + 2\cdot3\end{pmatrix} = \begin{pmatrix}1\\2\end{pmatrix}$.

So the solution is $x=1, y=2$. We can quickly check: $2(1)+2=4$, $1+2=3$, which is correct.

Using PyTorch (or NumPy) to solve this:

import torch

A = torch.tensor([[2.0, 1.0], [1.0, 1.0]])
b = torch.tensor([4.0, 3.0])

x = torch.linalg.solve(A, b)
print(x)  # tensor([1., 2.])

This outputs $[1., 2.]$ as expected. Under the hood, the solver likely performed elimination or an $LU$ factorization rather than explicitly computing $A^{-1}$.

Gaussian Elimination

The standard algorithm to solve arbitrary linear systems $A\mathbf{x} = \mathbf{b}$ (or to find $A^{-1}$, or to determine the rank of $A$) is Gaussian elimination. In essence, one performs a sequence of elementary row operations on the augmented matrix $[A|\mathbf{b}]$ to reduce $A$ to an upper triangular (or row-echelon) form, and then back-substitutes to find the solution.

The elementary row operations are:

Swap two rows.
Multiply a row by a nonzero scalar.
Add a scalar multiple of one row to another row.

These operations are all justifiable as they correspond to premultiplying by elementary matrices (which are invertible), thus they don't change the solution set of the system.

After elimination, one might find:

A unique solution (if one eventually reduces $A$ to the identity matrix, indicating $A$ was invertible).
No solution (if a row reduces to $[0, 0, \ldots, 0 | c]$ with $c \neq 0$, representing an impossible equation like $0=5$).
Infinitely many solutions (if there are free variables, corresponding to some columns without pivots in the row-echelon form; we'll revisit this in the Rank-Nullity section).

Gaussian elimination is essentially a systematic way of applying the linear combination row operations to eliminate variables step by step. It's also how one can compute $A^{-1}$: by augmenting $A$ with the identity and reducing $[A | I]$ to $[I | A^{-1}]$ if $A$ is invertible.

Rank, Nullity, and the Rank-Nullity Theorem

Any linear transformation $T: V \to W$ has two fundamental associated subspaces:

The image or range of $T$, $\mathrm{Im}(T) = \{ T(\mathbf{v}) : \mathbf{v} \in V \} \subseteq W$. This is the set of all vectors in $W$ that are outputs of the transformation. If $T$ is represented by a matrix $A$, the image corresponds to the column space of $A$ (the span of the columns).
The kernel of $T$ (also called the null space), $\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}\}$. This is the set of vectors that map to the zero vector in $W$. For matrix $A$, the kernel corresponds to the null space of $A$, i.e., all solutions to $A\mathbf{x} = \mathbf{0}$.

The rank of $T$ is defined as $\dim(\mathrm{Im}(T))$, the dimension of the image (number of linearly independent outputs, equivalently the number of pivot columns in $A$). The nullity of $T$ is $\dim(\ker(T))$, the dimension of the kernel (number of free variables in solving $A\mathbf{x}=0$).

In matrix terms, the rank is the number of linearly independent columns (or rows) of $A$, and the nullity is the number of independent solutions to $A\mathbf{x}=\mathbf{0}$.

Rank-Nullity Theorem: For any linear transformation $T: V \to W$ (with $V$ finite-dimensional),

$$\dim(V) = \mathrm{rank}(T) + \mathrm{nullity}(T).$$

Equivalently, for an $m \times n$ matrix $A$,

$$n = \text{rank}(A) + \text{nullity}(A),$$

i.e., number of columns = rank + nullity.

This theorem makes intuitive sense: if you solve $A\mathbf{x}=\mathbf{0}$ (which is a homogeneous system of $n$ unknowns and rank$=r$ independent equations), you'll get $n-r$ free parameters in the general solution, so nullity $= n-r$.

The rank-nullity theorem is fundamental because it tells us that constraints (rank) plus degrees of freedom in the solution (nullity) add up to the total dimension.

As consequences:

If $T$ is injective (one-to-one), then $\ker(T)=\{\mathbf{0}\}$ so nullity $=0$, hence rank $= \dim(V)$. An injective linear map from $n$-dimensional space to an $m$-dimensional space can have rank at most $m$; injectivity forces rank = $n$ (so $n \le m$ in that case).
If $T$ is surjective (onto), then $\mathrm{Im}(T) = W$ so rank $=\dim(W)$. For a map $V \to W$ to be onto, you need at least as many dimensions in $V$ as in $W$ (so $n \ge m$ with rank $=m$).
For a square matrix $A$ ($n \times n$), $A$ is invertible $\iff$ rank$(A) = n$ $\iff$ nullity$(A)=0$. Invertibility, one-to-one, and onto all coincide in that case.

The rank of a matrix can be found via row reduction (the number of nonzero rows in the reduced row echelon form is the rank, which equals the number of pivot columns). The nullity can be found by solving $A\mathbf{x}=0$ and counting parameters.

Example

Consider

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 0 & 5 & 5 \end{pmatrix}.$$

We see that the second row is just 2 times the first row for the first two columns, but we'll formally find the rank and nullity. Row reduce $A$:

$$\begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 0 & 5 & 5 \end{pmatrix} \xrightarrow{R_2 \to R_2 - 2R_1} \begin{pmatrix} 1 & 2 & 3 \\ 0 & 0 & 0 \\ 0 & 5 & 5 \end{pmatrix}$$

I'll do it systematically:

$R_2 \to R_2 - 2R_1$: gives $\begin{pmatrix}1 & 2 & 3\\ 0 & 0 & 0\\ 0 & 5 & 5\end{pmatrix}$.
$R_3$ is already 0 in first column, use it to eliminate below: well, there's no $R_4$. Instead, swap $R_2$ and $R_3$ to get a nonzero second row: $R_2 \leftrightarrow R_3$: $\begin{pmatrix}1 & 2 & 3\\ 0 & 5 & 5\\ 0 & 0 & 0\end{pmatrix}.$
Now eliminate: $R_2$ is fine (leading entry in column 2). No need to eliminate further since $R_3$ is all zeros. We get a row-echelon form.

We have pivot positions in column 1 (row1) and column 2 (row2). So rank $=2$. Nullity $= n - \text{rank} = 3 - 2 = 1$.

Indeed one can find a one-parameter family of solutions to $A\mathbf{x}=0$. Writing the equations:

$$\begin{cases} x_1 + 2x_2 + 3x_3 = 0, \\ 5x_2 + 5x_3 = 0, \\ 0 = 0, \end{cases}$$

from the reduced system (or the second original equation was redundant). The second equation gives $5x_2 + 5x_3=0 \implies x_2 = -x_3$. Plug into the first: $x_1 + 2(-x_3) + 3x_3 = x_1 + x_3 = 0$, so $x_1 = -x_3$. Let $t = x_3$ (free parameter). Then solutions are $(-t, -t, t)$ or $t(-1, -1, 1)$. A basis for the null space is $\{(-1,-1,1)\}$. The dimension of the null space is 1, as we found.

For this matrix $A$, as a transformation $\mathbb{R}^3 \to \mathbb{R}^3$, the image is 2-dimensional (a plane in $\mathbb{R}^3$ spanned by the first and third columns, for instance) and the kernel is the 1-dimensional line spanned by $(-1,-1,1)$. Rank-nullity confirms $3 = 2 + 1$.

In ML terms, if $A$ were a data matrix with 3 features (columns) and 3 samples (rows), rank 2 indicates one feature is redundant (here the second feature is a linear combo of the first and something else). Nullity 1 indicates there's one linear relationship among features. Recognizing this can avoid issues like singular matrices in computations and can guide feature reduction.

Practice Problem 2.1

Let

$$B = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 0 & 2 & 0 \\ 3 & 1 & 3 & 1 \end{pmatrix}.$$

Determine the rank of $B$ and find a basis for the null space of $B$.

Solution

First, note that the columns of $B$:

$$\mathbf{c}_1 = \begin{pmatrix}1\\2\\3\end{pmatrix},\; \mathbf{c}_2 = \begin{pmatrix}1\\0\\1\end{pmatrix},\; \mathbf{c}_3 = \begin{pmatrix}1\\2\\3\end{pmatrix},\; \mathbf{c}_4 = \begin{pmatrix}1\\0\\1\end{pmatrix}.$$

We see immediately that $\mathbf{c}_3 = \mathbf{c}_1$ (third column equals first) and $\mathbf{c}_4 = \mathbf{c}_2$ (fourth equals second). So at most 2 independent columns. Are $\mathbf{c}_1$ and $\mathbf{c}_2$ independent? They're $\begin{pmatrix}1\\2\\3\end{pmatrix}$ and $\begin{pmatrix}1\\0\\1\end{pmatrix}$, which are not multiples of each other (since e.g. first minus second yields $(0,2,2)$ which is not zero), so yes, they're independent. Thus rank$(B)=2$.

Alternatively, perform row reduction:

$$\begin{pmatrix} 1 & 1 & 1 & 1\\ 2 & 0 & 2 & 0\\ 3 & 1 & 3 & 1 \end{pmatrix} \xrightarrow{R_2 \to R_2 - 2R_1} \begin{pmatrix} 1 & 1 & 1 & 1\\ 0 & -2 & 0 & -2\\ 3 & 1 & 3 & 1 \end{pmatrix} \xrightarrow{R_3 \to R_3 - 3R_1} \begin{pmatrix} 1 & 1 & 1 & 1\\ 0 & -2 & 0 & -2\\ 0 & -2 & 0 & -2 \end{pmatrix} \xrightarrow{R_3 \to R_3 - R_2} \begin{pmatrix} 1 & 1 & 1 & 1\\ 0 & -2 & 0 & -2\\ 0 & 0 & 0 & 0 \end{pmatrix}.$$

We have two pivots (col1 and col2 pivot positions), confirming rank $=2$.

To find the null space, solve $B \mathbf{x} = \mathbf{0}$ with $\mathbf{x} = (x_1,x_2,x_3,x_4)^T$. The linear equations (from rows or the reduced form) are:

$$\begin{align*} \text{Row1: } & x_1 + x_2 + x_3 + x_4 = 0, \\ \text{Row2: } & 2x_1 + 0x_2 + 2x_3 + 0x_4 = 0 \quad\implies 2x_1 + 2x_3 = 0, \\ \text{Row3: } & 3x_1 + x_2 + 3x_3 + x_4 = 0, \text{ but this is not independent (likely sum of row1 and row2).} \end{align*}$$

From Row2: $2x_1 + 2x_3 = 0 \implies x_1 = -x_3$. Plug into Row1: $-x_3 + x_2 + x_3 + x_4 = 0 \implies x_2 + x_4 = 0 \implies x_2 = -x_4$.

Let $s = x_3$ and $t = x_4$ be free parameters. Then $x_1 = -s$, $x_2 = -t$, $x_3 = s$, $x_4 = t$. So a general solution is:

$$(x_1,x_2,x_3,x_4) = s(-1,0,1,0) + t(0,-1,0,1).$$

Thus a basis for $\ker(B)$ can be $\{(-1,0,1,0), (0,-1,0,1)\}$. We can verify those two vectors indeed satisfy $B\mathbf{x}=\mathbf{0}$ and are independent. The nullity is 2, consistent with $4 - \text{rank}(B)=4-2=2$. Each basis vector corresponds to one of the free parameters.

So rank$(B)=2$. One possible basis for $\mathcal{N}(B)$ is $\{(-1,0,1,0), (0,-1,0,1)\}$.

With the fundamentals of vector spaces, linear maps, and matrices covered, we can proceed to deeper properties of matrices, starting with the determinant, which provides a scalar summary of a square matrix's scaling effect and invertibility.

3. Determinants

The determinant is a scalar value that can be computed from a square matrix and encapsulates important information: whether the matrix is invertible, and the scaling factor of the linear transformation in terms of volume/orientation. For an $n \times n$ matrix $A = [a_{ij}]$, the determinant is denoted $\det(A)$ or $|A|$.

For small $n$, the determinant has explicit formulas:

For a $1\times1$ matrix $A=[a_{11}]$, $\det(A) = a_{11}$.
For $2\times2$: as mentioned, if $A = \begin{pmatrix}a & b\\ c & d\end{pmatrix}$, then $\det(A) = ad - bc$.
For $3\times3$: if $A = \begin{pmatrix} a & b & c\\ d & e & f\\ g & h & i \end{pmatrix}$, a formula is \[\det(A) = a \det\begin{pmatrix}e & f\\ h & i\end{pmatrix} - b \det\begin{pmatrix}d & f\\ g & i\end{pmatrix} + c \det\begin{pmatrix}d & e\\ g & h\end{pmatrix},\] which expands to $a(ei - fh) - b(di - fg) + c(dh - eg)$.

These formulas come from the general definition using permutations of $\{1,\ldots,n\}$ (the Leibniz formula):

$$\det(A) = \sum_{\sigma \in S_n} (\operatorname{sgn}\sigma)\; a_{1,\sigma(1)}\, a_{2,\sigma(2)} \cdots a_{n,\sigma(n)},$$

where the sum is over all permutations $\sigma$ of $\{1,2,\ldots,n\}$, and $\operatorname{sgn}\sigma$ is $+1$ for even permutations and $-1$ for odd permutations. For $n=3$ this yields 6 terms, which the above formula simplifies (the formula given is Laplace expansion along the first row).

While the explicit formula is complicated for large $n$ (there are $n!$ terms), conceptually the determinant has these key properties:

It is multilinear in the columns (or rows) of $A$.
It changes sign if two columns (or rows) are swapped.
$\det(I_n) = 1$ for the identity matrix.

These properties, in fact, uniquely characterize the determinant.

From these properties, one can deduce:

If any two columns (or rows) of $A$ are equal, $\det(A)=0$ (because swapping them would not change $A$ but would multiply $\det$ by -1).
If $A$ has a column (or row) of all zeros, $\det(A)=0$ (because it's linear in that column and clearly zero if you plug a zero vector).
$\det(AB) = \det(A)\det(B)$ for any $n\times n$ matrices $A, B$. Determinant of a product is the product of determinants.
$\det(A^T) = \det(A)$, i.e., the determinant is the same for a matrix and its transpose.
A matrix is invertible if and only if $\det(A) \neq 0$. (The determinant being nonzero signals the linear transformation is a volume-scaling by some nonzero factor, hence it must be bijective in finite dimensions.)

Geometric Meaning

The absolute value $|\det(A)|$ represents the scaling factor by which the linear transformation $T(\mathbf{x}) = A\mathbf{x}$ stretches or shrinks volumes (or areas in 2D, lengths in 1D). For example, if $A$ is $2\times2$, $|\det(A)|$ is the area of the parallelogram that the unit square is mapped to by $A$. If $A$ is $3\times3$, $|\det(A)|$ is the volume of the parallelepiped that the unit cube maps to.

A determinant of 1 or -1 means volume is preserved (though -1 indicates a flip in orientation). A determinant of 0 means volume collapses to zero — i.e., the image lies in a lower-dimensional subspace (the transformation is singular).

The sign of $\det(A)$ indicates orientation: If $\det(A) > 0$, the transformation preserves orientation (an oriented volume element keeps the same handedness), whereas $\det(A) < 0$ means it reverses orientation (like a reflection does). In $\mathbb{R}^2$, a $2\times2$ matrix with negative determinant flips the orientation (it turns a clockwise basis to counterclockwise or vice versa).

Example

Consider $A = \begin{pmatrix} 3 & 1\\ 2 & 4\end{pmatrix}$. Then $\det(A) = 3\cdot 4 - 1\cdot 2 = 12 - 2 = 10$. This is positive (orientation preserved) and not zero, so $A$ is invertible.

Geometrically, if you take the unit square with corners at $(0,0), (1,0), (1,1), (0,1)$, and apply $A$ to each point, you'll get a parallelogram whose area is 10 times the area of the original square (so area 10). Indeed, the column vectors $(3,2)^T$ and $(1,4)^T$ form the sides of that parallelogram, and area = $|3\cdot 4 - 1\cdot 2|=10$. We can sketch those two column vectors: one goes out to $(3,2)$, another to $(1,4)$, and the parallelogram spanned by them has area 10. If we swapped the two columns, we would get $\det = 1\cdot 2 - 3\cdot 4 = -10$, same area but opposite orientation.

One important use of determinants in theory is in formulas like Cramer's Rule for solving linear systems (which expresses each component of the solution as a ratio of two determinants), and in understanding eigenvalues (the characteristic polynomial $\det(A - \lambda I)=0$ gives eigenvalues). However, in practice, direct use of determinants for computation is often avoided for numeric stability reasons (especially for large matrices, one typically uses factorization methods).

A note on computation: The determinant of an $n\times n$ matrix can be computed by performing Gaussian elimination and multiplying the diagonal entries of the upper triangular form (taking care to account for any row swaps or scalings, as those change the determinant in known ways: swapping rows multiplies $\det$ by -1, scaling a row by $c$ multiplies $\det$ by $c$). This is $O(n^3)$ and is how most libraries do it.

Determinant in Machine Learning Context

In machine learning, you might encounter determinants in the context of probability distributions. For example, the probability density of a multivariate normal distribution involves $\frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp(-\frac{1}{2}x^T \Sigma^{-1} x)$, where $\Sigma$ is a covariance matrix. The $\det(\Sigma)$ term appears as part of the normalization factor, indicating how the volume scaling by $\Sigma$ affects density.

Determinants also show up when changing variables in integrals (Jacobian determinant of the transformation), such as in backprop for certain flows or normalization flows in deep learning (where one needs the log-determinant of a Jacobian).

In linear regression, one might look at $\det(X^T X)$ as part of diagnostic tools; if $\det(X^T X)$ is near zero, the design matrix $X$ (with columns as features) is nearly singular (collinear features).

Computing Determinants

Determinant calculation in Python using: (1) The explicit formula for 2×2 and 3×3 matrices (2) A recursive method using cofactor expansion


            import numpy as np

            def det_2x2(A):
                """Compute determinant of a 2x2 matrix."""
                return A[0,0]*A[1,1] - A[0,1]*A[1,0]

            def det_3x3(A):
                """Compute determinant of a 3x3 matrix using cofactor expansion."""
                return (A[0,0] * (A[1,1]*A[2,2] - A[1,2]*A[2,1]) - 
                        A[0,1] * (A[1,0]*A[2,2] - A[1,2]*A[2,0]) + 
                        A[0,2] * (A[1,0]*A[2,1] - A[1,1]*A[2,0]))

            def recursive_det(A):
                """Compute determinant recursively using cofactor expansion."""
                n = A.shape[0]
                
                # Base case: 1×1 matrix
                if n == 1:
                    return A[0,0]
                
                # Base case: 2×2 matrix
                if n == 2:
                    return det_2x2(A)
                
                # Recursive case: expand along first row
                det = 0
                for j in range(n):
                    # Create submatrix by removing first row and column j
                    submatrix = np.delete(np.delete(A, 0, axis=0), j, axis=1)
                    # Add term with appropriate sign
                    det += ((-1) ** j) * A[0,j] * recursive_det(submatrix)
                
                return det

            # Example
            A = np.array([[3, 1], [2, 4]])
            print(f"det(A) = {det_2x2(A)}")  # Should be 10

            B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
            print(f"det(B) = {det_3x3(B)}")  # Should be 0

            # Compare with NumPy's implementation
            print(f"NumPy det(A) = {np.linalg.det(A)}")
            print(f"NumPy det(B) = {np.linalg.det(B)}")

In practice, we rarely implement determinant calculations by hand. Instead, we use optimized libraries like NumPy that employ LU decomposition or other efficient algorithms. The recursive method shown above has $O(n!)$ complexity and is provided only for educational purposes.

Practice Problem 3.1

Compute the determinant of

$$M = \begin{pmatrix} 2 & -1 & 0 \\ 5 & 3 & 1 \\ 0 & 4 & -2 \end{pmatrix},$$

and interpret its meaning in terms of the linear transformation $T(\mathbf{x}) = M\mathbf{x}$.

Solution

I'll compute $\det(M)$ using expansion or elimination. Using the first row expansion (Laplace expansion):

$$\det(M) = 2 \cdot \det\begin{pmatrix}3 & 1\\ 4 & -2\end{pmatrix} - (-1) \cdot \det\begin{pmatrix}5 & 1\\ 0 & -2\end{pmatrix} + 0 \cdot \det(\cdots).$$

We can ignore the third term since it's $0$ times a minor. Now:

$\det\begin{pmatrix}3 & 1\\ 4 & -2\end{pmatrix} = 3(-2) - 1(4) = -6 - 4 = -10$.

$\det\begin{pmatrix}5 & 1\\ 0 & -2\end{pmatrix} = 5(-2) - 1(0) = -10$. So:

$$\det(M) = 2(-10) - (-1)(-10) = -20 - (+10) = -30.$$

Thus $\det(M) = -30$.

To double-check, we might do elimination:

$$\begin{pmatrix} 2 & -1 & 0\\ 5 & 3 & 1\\ 0 & 4 & -2 \end{pmatrix} \xrightarrow{R_2 \to R_2 - (5/2)R_1} \begin{pmatrix} 2 & -1 & 0\\ 0 & 3 + (5/2) & 1\\ 0 & 4 & -2 \end{pmatrix} = \begin{pmatrix} 2 & -1 & 0\\ 0 & 11/2 & 1\\ 0 & 4 & -2 \end{pmatrix} \xrightarrow{R_3 \to R_3 - \frac{4}{(11/2)} R_2} \begin{pmatrix} 2 & -1 & 0\\ 0 & 11/2 & 1\\ 0 & 0 & -2 - \frac{4}{(11/2)}\cdot 1 \end{pmatrix}.$$

Compute that last part: $\frac{4}{(11/2)} = \frac{8}{11}$. So:

$$-2 - \frac{8}{11} = -\frac{22 + 8}{11} = -\frac{30}{11}.$$

Oops, looks like fractional elimination got messy. Perhaps easier: do cofactor or another expansion approach.

Alternatively, do expansion by first column:

$$\det(M) = 2 \det\begin{pmatrix}3 & 1\\ 4 & -2\end{pmatrix} - 5 \det\begin{pmatrix}-1 & 0\\ 4 & -2\end{pmatrix} + 0\cdot(\cdot).$$

First minor we did is $-10$. Second minor: $\det\begin{pmatrix}-1 & 0\\ 4 & -2\end{pmatrix} = (-1)(-2) - (0)(4) = 2$. So: $\det(M) = 2(-10) - 5(2) = -20 - 10 = -30$. Matches above.

So $\det(M) = -30$. The absolute value $30$ is the volume scaling factor of $T:\mathbb{R}^3 \to \mathbb{R}^3$. That means if you take the unit cube (volume 1) and transform all its points by $M$, the resulting parallelepiped has volume $30$. The negative sign indicates an orientation reversal: the transformation $T$ flips the orientation of space (it is like a reflection or an odd number of flips composed with stretches). In practical terms, $T$ is invertible (since determinant is nonzero), and any small oriented volume element will be flipped and scaled by factor 30 under $T$.

The result $-30$ means $T$ is one-to-one and onto (invertible), increasing volumes by 30, but if you had a right-hand rule basis in the domain, its image in the codomain will follow a left-hand rule due to the sign flip. This orientation flip could be seen, for instance, if $M$ had an odd number of negative eigenvalues or something akin to a reflection.

Now with determinants understood, we move on to eigenvalues and eigenvectors, which are critical for understanding how a linear transformation can act by simply stretching or compressing along certain special directions.

4. Eigenvalues and Eigenvectors

A central concept in linear algebra is that of eigenvalues and eigenvectors. Given a square matrix $A \in \mathbb{R}^{n\times n}$ (or a linear transformation $T: V \to V$ on an $n$-dimensional vector space), an eigenvector is a nonzero vector $\mathbf{v}$ such that applying $A$ to $\mathbf{v}$ simply scales $\mathbf{v}$ by some scalar factor. That is:

$$A \mathbf{v} = \lambda \mathbf{v},$$

for some scalar $\lambda$. The scalar $\lambda$ is called the eigenvalue corresponding to eigenvector $\mathbf{v}$. In words, $\mathbf{v}$ is an eigenvector if it's in a "direction" that $A$ stretches or squashes (or flips) by the factor $\lambda$.

To find eigenvalues, we treat the equation $A\mathbf{v} = \lambda \mathbf{v}$ for unknowns $\mathbf{v}\neq 0$ and $\lambda$. Rearranged, this is:

$$(A - \lambda I)\mathbf{v} = \mathbf{0}.$$

For a nonzero solution $\mathbf{v}$ to exist, the matrix $A - \lambda I$ must be singular (non-invertible). That happens precisely when:

$$\det(A - \lambda I) = 0.$$

The left-hand side is an $n$-th degree polynomial in $\lambda$, called the characteristic polynomial of $A$. By the fundamental theorem of algebra, it has $n$ roots (counting multiplicities) in the complex numbers. However, if $A$ has all real entries, its real eigenvalues may be fewer (some roots could be complex). In this guide, we'll focus on cases with real eigenvalues for intuition, but note that complex eigenpairs are a thing too.

So the eigenvalues of $A$ are the solutions $\lambda$ to $\det(A - \lambda I) = 0$. For each eigenvalue, one can find eigenvectors by solving $(A - \lambda I)\mathbf{v} = \mathbf{0}$ (which is a null space computation).

Example 1 (2×2 matrix)

Let $A = \begin{pmatrix}4 & 2\\1 & 3\end{pmatrix}$. The characteristic polynomial is:

$$\det\begin{pmatrix}4-\lambda & 2\\1 & 3-\lambda\end{pmatrix} = (4-\lambda)(3-\lambda) - 2\cdot 1 = (4-\lambda)(3-\lambda) - 2.$$

Expanding: $(4-\lambda)(3-\lambda) = 12 - 4\lambda - 3\lambda + \lambda^2 = \lambda^2 - 7\lambda + 12$. Subtracting 2 gives $\lambda^2 - 7\lambda + 10$. Solve $\lambda^2 - 7\lambda + 10 = 0$. It factors as $(\lambda-5)(\lambda-2) = 0$. So eigenvalues are $\lambda_1 = 5$ and $\lambda_2 = 2$.

Now eigenvectors:

For $\lambda=5$: solve $(A - 5I)\mathbf{v} = 0$, i.e. $\begin{pmatrix}-1 & 2\\1 & -2\end{pmatrix}\begin{pmatrix}v_1\\v_2\end{pmatrix} = \mathbf{0}$. This gives equations: $-v_1 + 2v_2 = 0$ and $v_1 - 2v_2 = 0$ (which is the same equation twice). So $-v_1 + 2v_2 = 0 \implies v_1 = 2v_2$. Take $v_2 = 1$, then $v_1 = 2$. So an eigenvector for $\lambda=5$ is $\mathbf{v}_1 = \begin{pmatrix}2\\1\end{pmatrix}$ (any scalar multiple would also work, e.g. $(2,1)^T$ scaled).

For $\lambda=2$: solve $(A - 2I)\mathbf{v}=0$: $\begin{pmatrix}2 & 2\\1 & 1\end{pmatrix}\begin{pmatrix}v_1\\v_2\end{pmatrix} = \mathbf{0}$, which yields $2v_1 + 2v_2 = 0$ and $v_1 + v_2 = 0$ (the second is redundant given first). So $v_1 = -v_2$. Choose $v_2 = 1$, then $v_1 = -1$. An eigenvector is $\mathbf{v}_2 = (-1, 1)^T$.

Indeed, check: $A(2,1)^T = (4\cdot2+2\cdot1,\;1\cdot2+3\cdot1)^T = (10,5)^T = 5(2,1)^T$. And $A(-1,1)^T = (4\cdot(-1)+2\cdot1,\;1\cdot(-1)+3\cdot1)^T = (-2,2)^T = 2(-1,1)^T$. So it works.

Geometrically, eigenvectors $(2,1)$ and $(-1,1)$ are directions in $\mathbb{R}^2$ that $A$ simply stretches by factors 5 and 2 respectively. Along $(2,1)$, $A$ acts like $5I$; along $(-1,1)$, $A$ acts like $2I$.

Eigenvectors give insight into the "geometry" of $A$. If $A$ were thought of as transforming space, eigenvectors are those special directions that don't get "rotated" out of their span, just scaled. Not every matrix has a full set of real eigenvectors; for example, rotation matrices in 2D have complex eigenvalues or, say, an $180^\circ$ rotation has eigenvalues $-1, -1$ (which are real but degenerate orientation flip in every direction, technically every vector is an eigenvector for $\lambda=-1$ in that case).

If an $n\times n$ matrix $A$ has $n$ linearly independent eigenvectors (which is the case if it has $n$ distinct eigenvalues, or if it's symmetric, or other conditions), then one can diagonalize $A$. That means there exists an invertible matrix $P$ and a diagonal matrix $D$ such that $A = P D P^{-1}$, where the diagonal entries of $D$ are eigenvalues and columns of $P$ are corresponding eigenvectors. Diagonalization is very useful because $A^k = P D^k P^{-1}$ (so powers of $A$ are easy to compute), and it provides a kind of "modal decomposition" of the linear map into independent parts.

If $A$ is symmetric (i.e., $A = A^T$), a special and important case, not only are eigenvalues real, but one can choose eigenvectors to be orthonormal (perpendicular unit vectors). This is the spectral theorem for symmetric matrices. In that case, $A = Q \Lambda Q^T$ with $Q$ orthogonal (meaning $Q^{-1}=Q^T$). This is heavily used in PCA (covariance matrices are symmetric and positive semi-definite, their eigen decomposition yields orthogonal principal components).

Eigenvalues in Applications

In mechanical systems or any system of linear differential equations, eigenvalues of a transformation or matrix often relate to natural frequencies or modes of the system.

In Google's PageRank, the ranking vector is essentially an eigenvector of the (modified) web link matrix corresponding to eigenvalue 1.

In Markov chains, stationary distributions are eigenvectors of the transition matrix with eigenvalue 1.

In machine learning, sometimes one looks at the Hessian (matrix of second derivatives) of a loss function; its eigenvalues (which are real symmetric) tell you about the curvature in different directions (important for optimisation: large eigenvalues correspond to steep curvature along eigenvector directions).

In deep learning, the eigenvalues of weight matrices (singular values more directly, but eigen if symmetric) can indicate if gradients explode or vanish (if very large or small).

Principal Component Analysis: as we'll detail later, the principal components are eigenvectors of the covariance matrix, and eigenvalues indicate the variance explained by those components.

One can also discuss geometric multiplicity (dimension of eigenspace for an eigenvalue) and algebraic multiplicity (multiplicity of an eigenvalue as root of char poly). A matrix is diagonalizable if for each eigenvalue, geometric multiplicity equals algebraic multiplicity (and sum equals $n$). Some matrices are not diagonalizable (they are defective), but those can be put in Jordan normal form (beyond our scope here).

Practice Problem 4.1

Find the eigenvalues and eigenvectors of $C = \begin{pmatrix} 0 & 2 & 2 \\ 0 & 1 & -1 \\ 0 & 1 & -1 \end{pmatrix}$.

Solution

First find the eigenvalues by solving $\det(C - \lambda I) = 0$:

$$C - \lambda I = \begin{pmatrix} -\lambda & 2 & 2\\ 0 & 1-\lambda & -1\\ 0 & 1 & -1-\lambda \end{pmatrix}.$$

The determinant can be computed. Note the first column has only one nonzero entry (top). Expanding along the first column:

$$\det(C - \lambda I) = -\lambda \cdot \det\begin{pmatrix} 1-\lambda & -1\\ 1 & -1-\lambda \end{pmatrix} - 0 + 0.$$

So only the first term matters:

$$-\lambda [ (1-\lambda)(-1-\lambda) - (-1)(1) ] = -\lambda [ -(1-\lambda)(1+\lambda) + 1 ].$$

We used $(1-\lambda)(-1-\lambda) = -(1-\lambda)(1+\lambda)$. Simplify inside: $(1-\lambda)(1+\lambda) = 1 - \lambda^2$. So:

$$\det(C-\lambda I) = -\lambda [ - (1 - \lambda^2) + 1 ] = -\lambda [ -1 + \lambda^2 + 1 ] = -\lambda [ \lambda^2 ] = -\lambda^3.$$

Setting that equal to 0 gives $-\lambda^3 = 0 \implies \lambda^3 = 0$. So the only eigenvalue is $\lambda = 0$ (with algebraic multiplicity 3).

So 0 is a very special eigenvalue: it's the only one. Now find eigenvectors by solving $C \mathbf{v} = 0$ (the nullspace):

$$\begin{pmatrix}0 & 2 & 2\\0 & 1 & -1\\0 & 1 & -1\end{pmatrix} \begin{pmatrix}v_1\\v_2\\v_3\end{pmatrix} = \begin{pmatrix}0\\0\\0\end{pmatrix}.$$

This yields the system:

$$\begin{align*} 0\cdot v_1 + 2v_2 + 2v_3 &= 0, \quad \text{i.e. } 2v_2 + 2v_3 = 0,\\ 0\cdot v_1 + 1v_2 - 1v_3 &= 0, \quad \text{i.e. } v_2 - v_3 = 0,\\ 0\cdot v_1 + 1v_2 - 1v_3 &= 0 \text{ (same as above).} \end{align*}$$

The second equation gives $v_2 = v_3$. The first then gives $2v_2 + 2v_3 = 4v_3 = 0$, so $v_3 = 0$ and thus $v_2 = 0$. There is no restriction on $v_1$ from these equations (notice $v_1$ doesn't appear). So we have: $v_3 = 0, v_2 = 0, v_1$ free. Take $v_1 = 1$, then one eigenvector is $(1,0,0)^T$. In fact the eigenspace is all multiples of $(1,0,0)^T$. So the geometric multiplicity is 1. This matrix $C$ is not diagonalizable because we don't have 3 independent eigenvectors, we have only 1 independent eigenvector for the single eigenvalue 0.

We can still describe what $C$ does: clearly $C$ kills second and third components of any vector (since first column is zero). Any vector $(x,y,z)$ gets mapped to $(2y+2z,\;y-z,\;y-z)$. The eigenvector $(1,0,0)$ corresponds to $\lambda=0$ indeed: $C(1,0,0)^T = (0,0,0)^T$. Actually every vector in that $x$-axis is sent to zero (the entire $x$-axis is the eigenspace). Other vectors not on $x$-axis are not eigenvectors but $C$ doesn't stretch them, it kind of projects them (note the last two coordinates end up equal, etc.). Without diving into Jordan form, we can say: eigenvalue 0 with multiplicity 3, eigenvectors: all nonzero multiples of $(1,0,0)$.

So $\lambda_1 = \lambda_2 = \lambda_3 = 0$. Eigenvectors: any vector of form $(a,0,0)$ for $a \neq 0$. The dimension of eigenspace is 1.

This example shows a case where the matrix is nilpotent ($C^3=0$ as eigenvalues 0 suggests), which in RL or other contexts can happen (like certain Markov chain transitions minus I etc).

We will use eigenvalues in upcoming sections (PCA and RL e.g. successor representation analysis).

Eigenvectors thus give us "principal axes" of linear transformations. Next, we examine the Singular Value Decomposition, which is like an eigen-decomposition for arbitrary (possibly non-square or non-symmetric) matrices, and its application to principal component analysis.

5. Singular Value Decomposition and Principal Component Analysis

Every real matrix (even non-square) has a remarkable factorization called the Singular Value Decomposition (SVD). The SVD expresses an $m \times n$ matrix $A$ as $A = U \Sigma V^T$, where

$U$ is an $m\times m$ orthogonal matrix (meaning its columns are an orthonormal basis of $\mathbb{R}^m$),

$V$ is an $n\times n$ orthogonal matrix (columns an orthonormal basis of $\mathbb{R}^n$),

$\Sigma$ is an $m \times n$ diagonal matrix (strictly speaking, "diagonal" meaning $\Sigma_{ij}=0$ for $i\neq j$) with nonnegative entries on the diagonal, called singular values.

Typically, we order the singular values in descending order: $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0$, where $r = \min(m,n)$ is the maximum number of singular values (if $A$ has rank $k < r$, then only $k$ of those will be > 0, the rest will be zeros). The columns of $U$ are called left-singular vectors and the columns of $V$ are right-singular vectors.

The effect of $A$ on a vector can be understood via SVD: $A\mathbf{x} = U \Sigma V^T \mathbf{x}$. If we let $\mathbf{y} = V^T \mathbf{x}$, that is a coordinate representation of $\mathbf{x}$ in the orthonormal basis given by columns of $V$. Then $\Sigma$ acts on $\mathbf{y}$ by scaling coordinates (some $\sigma_i$, and if $m>n$ there are $m-n$ zero rows that kill extra components), then $U$ rotates to the final coordinate system. In essence, $A$ takes the orthonormal basis given by $V$ in the domain, stretches it by factors $\sigma_i$, and then outputs in directions given by $U$ in the codomain.

If $A$ is square and symmetric positive-semidefinite (like a covariance matrix), the SVD coincides with the eigen-decomposition: $U = V$ and singular values = eigenvalues (nonnegative). If $A$ is symmetric but not positive, then SVD still yields nonneg singular values (which are absolute values of eigenvalues, if any negative eigen then $U$ and $V$ differ by a sign flip to make singular value positive).

For a concrete example, consider a simple matrix: $A = \begin{pmatrix} 3 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}$ ($3\times2$). It's rank 2 (actually here rank is 2 only if bottom row isn't all zeros; in this case bottom row is $(0,0)$, so rank is 2? Wait, first row [3,0] and second [0,1] are independent, yes rank=2, but $m=3,n=2$ so $r=\min(3,2)=2$). We can guess SVD: columns of $V$ should align with principal directions in domain (which in this case might just be standard basis $e_1=(1,0), e_2=(0,1)$ because matrix $A$ is simple). $A e_1 = (3,0,0)^T$ and $A e_2 = (0,1,0)^T$. Those images are orthogonal (they are in directions of $e_1$ and $e_2$ in codomain maybe scaled). Actually $||Ae_1|| = 3, ||Ae_2||=1$. So singular values might be 3 and 1 (the lengths of images of unit vectors, but careful: if $e_1,e_2$ were already directions of max stretch, which here they are likely). So $\sigma_1=3, \sigma_2=1$. $U$ would have first column = $(1,0,0)^T$ (direction of $Ae_1$ normalized) and second column = $(0,1,0)^T$ (direction of $Ae_2$ normalized). $V$ would be identity (since we picked $e_1,e_2$ as those directions already). So guess: $U = \begin{pmatrix}1&0&0\\0&1&0\\0&0&1\end{pmatrix}$ but third column to complete orthonormal basis can be $(0,0,1)^T$, $V = I_{2}$, $\Sigma = \begin{pmatrix}3&0\\0&1\\0&0\end{pmatrix}$. Check: $U \Sigma V^T = U \Sigma = \Sigma$ (since $U$ is identity here except maybe the third basis which multiplies zero anyway). That equals original $A$. Good.

SVD existence is guaranteed and can be computed via eigendecomposition of $A^T A$ (whose eigenvalues are $\sigma_i^2$, same for $A A^T$ for left singular vectors).

The significance of SVD:

It gives the best low-rank approximations to a matrix (Eckart–Young theorem: if you take the top $k$ singular values and vectors, you get the matrix of rank $k$ that is closest to $A$ in Frobenius norm or 2-norm).

It is used in PCA: if $X$ is data matrix (say $n$ samples by $d$ features, demeaned), then $\frac{1}{\sqrt{n-1}} X = U \Sigma V^T$ (an SVD). The columns of $V$ are eigenvectors of covariance and are principal directions, $\Sigma^2$ diagonal entries related to variances.

SVD is used in recommender systems (e.g. latent factor analysis, where a user-item rating matrix is approximated by low-rank SVD capturing hidden factors).

It's used in solving least squares (the pseudoinverse of $A$ can be given by $A^+ = V \Sigma^+ U^T$, where $\Sigma^+$ just reciprocals nonzero singular values).

In modern ML, truncated SVD (PCA) can reduce dimensionality of data for efficiency or noise reduction.

Principal Component Analysis (PCA):

PCA is a technique to reduce the dimensionality of data while retaining as much variance as possible. Suppose we have a dataset of $n$ points in $\mathbb{R}^d$, represented as an $n \times d$ data matrix $X$ (each row is a data point, and we assume the data is zero-mean for PCA by subtracting the mean from each column first). We find the directions in $\mathbb{R}^d$ along which the data have the most variance. These directions are the eigenvectors of the covariance matrix $\frac{1}{n} X^T X$. Equivalently, they are given by the right singular vectors $V$ of $X$ (from SVD $X = U \Sigma V^T$). The singular values $\sigma_i$ of $X$ relate to the standard deviation of the data along those principal directions: indeed $\sigma_i^2 = $ (variance along $i$th principal component)$\times (n-1)$ if using sample covariance.

So computationally:

Form $X$ (centered data).

Compute SVD: $X = U \Sigma V^T$.

The principal components are the columns of $V$. The $i$th principal component direction is $V_{:,i}$ (the $i$th column of $V$) and the variance of data along that direction is $(\sigma_i^2/(n-1))$ if $\sigma_i$ came from SVD of $\frac{1}{\sqrt{n-1}}X$ (to tie to covariance exactly).

To reduce dimension to $k$, take the first $k$ columns of $V$ (the $k$ top singular vectors). The projected data (scores on PCs) is $X V_k$ (which equals $U_k \Sigma_k$ essentially the first $k$ columns of $U\Sigma$).

This projection captures maximum variance because any other projection to $k$ dims has less total variance than this one.

In summary, PCA chooses an orthonormal basis (principal components) aligned with data's covariance. It's used for:

Visualization of high-dimensional data (take first 2 or 3 PCs to plot).

Preprocessing to decorrelate features or reduce noise (since lower-variance components may largely be noise).

Compression (like representing data in fewer numbers).

It's essentially what tasks like eigenfaces (face recognition basis) or latent semantic analysis (document-topic analysis) do.

We should mention: the singular values (squared) are the eigenvalues of $X^T X$ (or $XX^T$). The sum of singular values squared equals the sum of squared entries of $X$ (total variance).

One can do PCA via eigendecomposition of covariance or via SVD of data matrix (which is often more stable for large data sets with fewer dimensions etc).

Example:

Consider a small dataset: Points in $\mathbb{R}^2$: $(2,0), (0,1), (-2,0), (0,-1)$. These lie roughly on a cross shape. If we do PCA: Mean is $(0,0)$ already. Covariance matrix = $\frac{1}{4} \begin{pmatrix}2^2+0^2+(-2)^2+0^2 & 20+01+(-2)0+0(-1)\ \cdots \end{pmatrix} = \frac{1}{4}\begin{pmatrix}8 & 0 \ 0 & 2 \end{pmatrix} = \begin{pmatrix}2 & 0\\0 & 0.5\end{pmatrix}$. Eigenvalues: 2 and 0.5, eigenvectors: $(1,0)^T$ and $(0,1)^T$. So first PC is x-axis, second PC is y-axis (makes sense given data extends more in x). If we reduce to 1D, we'd project onto x-axis, preserving that variance (which is 2). This trivial case shows PCA found main axis and minor axis.

In ML, if we had many features possibly redundant or correlated, PCA can sometimes improve performance or reduce complexity (though sometimes features with low variance can still be very important for prediction tasks, so one should be careful using PCA blindly for supervised learning).

We can use PyTorch or numpy to quickly find SVD/PCA:

import torch
X = torch.tensor([[2.0,0.0], [0.0,1.0], [-2.0,0.0], [0.0,-1.0]])
# Centering (already zero mean in this example)
U, S, Vh = torch.linalg.svd(X, full_matrices=False)
print(S**2 / (X.shape[0]-1))  # sample variances per component
print(Vh.T)

This would output singular values squared/(3) = [2, 0.5] presumably, and Vh.T = [[1,0],[0,1]].

An example where data is 3D but mostly lies near a 2D plane:

import numpy as np
# Generate random points near plane z = x + 2y
np.random.seed(0)
X = np.random.randn(100,2)
Z = X[:,0] + 2*X[:,1] + 0.1*np.random.randn(100)  # add small noise off the plane
data = np.column_stack((X, Z))

# Center the data
data -= data.mean(axis=0)

# Compute PCA via SVD
U, S, Vt = np.linalg.svd(data, full_matrices=False)
print("Singular values:", S)
print("Principal components (columns of V):")
print(Vt.T)

We expect 1st two singular values large, third much smaller (because plane mostly). We can try to run this in actual environment if allowed:

import numpy as np
np.random.seed(0)
X = np.random.randn(100,2)
Z = X[:,0] + 2*X[:,1] + 0.1*np.random.randn(100)
data = np.column_stack((X, Z))
data -= data.mean(axis=0)
U, S, Vt = np.linalg.svd(data, full_matrices=False)
print("Singular values:", S)
print("Variance explained:", (S**2)/(len(data)-1))
print("Principal components (V):", Vt.T)

The key results:

Two large singular values, one tiny (since data almost rank 2).

The last principal component corresponds to the vector roughly orthonormal to plane (should be like [1,2,-1/coeff] normalized, since plane normal is maybe (1,2,-1)? Actually plane eq: z - x - 2y = 0, normal = ( -(-1), -(-2), something?), we derive: plane: x + 2y - z = 0, so normal n = (1,2,-1). So we expect one principal component ~ direction (something proportional to (1,2,-1)). The other two will span plane.

So SVD/PCA strongly tie: SVD: $X = U \Sigma V^T$, Covariance: $\frac{1}{n-1}X^T X = V (\frac{\Sigma^T \Sigma}{n-1}) V^T$ meaning $V$ diagonalizes covariance with eigenvalues $\frac{\sigma_i^2}{n-1}$. So columns of $V$ (principal axes) align with covariance eigenvectors, and $\sigma_i^2/(n-1)$ are eigenvalues (variances).

Finally, we mention using PyTorch to do SVD:

import torch
A = torch.tensor([[3.0, 1.0],[1.0, 3.0]])
U, S, Vh = torch.linalg.svd(A)
print("U =\n", U)
print("S =", S)
print("V^T =\n", Vh)

This should yield something (for symmetric 2x2, U and V should be same apart from sign, S will be eigenvalues maybe sorted descending).

Practice Problem 5.1:

Consider the data points in $\mathbb{R}^3$: $(2,2,1), (3,3,2), (2,-2,-1), (3,-3,-2)$.

Compute the principal components of this dataset.
If you project the data onto the first principal component, what variance of the data is retained?
How many principal components are needed to capture all the variance in this dataset?

Solution:

First, I'll form our data matrix with each row being a data point:

\[X = \begin{pmatrix} 2 & 2 & 1\\ 3 & 3 & 2\\ 2 & -2 & -1\\ 3 & -3 & -2 \end{pmatrix}\]

1. To find the principal components, we start by centering the data. The mean of each column is $(2.5, 0, 0)$, so the centered data is:

\[X_{\text{centered}} = \begin{pmatrix} -0.5 & 2 & 1\\ 0.5 & 3 & 2\\ -0.5 & -2 & -1\\ 0.5 & -3 & -2 \end{pmatrix}\]

Next, we compute the sample covariance matrix (dividing by $n-1=3$):

\[\text{Cov}(X) = \frac{1}{3}X_{\text{centered}}^T X_{\text{centered}} = \begin{pmatrix} 0.25 & 0 & 0 \\ 0 & 6.5 & 4 \\ 0 & 4 & 2.5 \end{pmatrix}\]

Notice that the $x$ coordinate is uncorrelated with $y$ and $z$. The covariance matrix is block diagonal, with the $x$ component separated from the $y$-$z$ block. For the $y$-$z$ submatrix:

\[\begin{pmatrix} 6.5 & 4 \\ 4 & 2.5 \end{pmatrix}\]

We find eigenvalues by solving $\det(\text{Cov}_{yz} - \lambda I) = 0$:

\[(6.5-\lambda)(2.5-\lambda) - 4 \cdot 4 = 0\]

\[\lambda^2 - 9\lambda + 0.25 = 0\]

The eigenvalues are $\lambda_1 \approx 8.972$ and $\lambda_2 \approx 0.028$. Together with the $x$ component eigenvalue $0.25$, our eigenvalues in descending order are: $8.972$, $0.25$, and $0.028$.

For the first principal component, we need the eigenvector corresponding to $\lambda_1 \approx 8.972$ in the $y$-$z$ plane. Solving the equation:

\[(6.5 - 8.972)v_y + 4v_z = 0 \Rightarrow -2.472v_y + 4v_z = 0 \Rightarrow v_z \approx 0.618v_y\]

Normalizing, we get $v_y \approx 0.85$ and $v_z \approx 0.53$, so the first principal component is approximately $(0, 0.85, 0.53)$.

The second principal component corresponds to the next largest eigenvalue $0.25$, which is the $x$ component: $(1, 0, 0)$.

The third principal component is the other eigenvector in the $y$-$z$ plane, orthogonal to $(0, 0.85, 0.53)$, which is approximately $(0, 0.53, -0.85)$.

2. The variance retained by projecting onto the first principal component is $\lambda_1 \approx 8.972$ out of the total variance $\text{trace}(\text{Cov}) = 0.25 + 6.5 + 2.5 = 9.25$. As a percentage, that's approximately $8.972/9.25 \approx 97\%$ of the total variance.

3. To capture all the variance in the dataset, we need all 3 principal components. Although the third eigenvalue is small ($0.028$), it's not exactly zero, indicating that the data doesn't perfectly lie in a 2D plane.

Summary:

PC1: $(0, 0.85, 0.53)$ - captures about 97% of variance
PC2: $(1, 0, 0)$ - captures about 2.7% of variance
PC3: $(0, 0.53, -0.85)$ - captures about 0.3% of variance

Next, I'll discuss orthonormal bases and how to construct them via Gram-Schmidt, then projections and least squares.

6. Orthogonality and Orthonormal Bases

Two vectors $u$, $v$ in an inner product space (like $\mathbb{R}^n$ with the standard dot product) are orthogonal if $\langle u, v \rangle = 0$. In $\mathbb{R}^n$ with the standard inner product $\langle u, v \rangle = u^T v = \sum_i u_i v_i$, orthogonality means the vectors are perpendicular (at a $90^\circ$ angle).

A vector is normalized or a unit vector if it has length (norm) 1. An orthonormal set of vectors is a set of vectors that are mutually orthogonal and each of unit length. If $\{q_1, q_2, \ldots, q_k\}$ is orthonormal, then $q_i^T q_j = 0$ for $i\neq j$ and $q_i^T q_i = 1$ for each $i$.

Working with orthonormal bases simplifies many computations. Coordinates relative to an orthonormal basis are easy to compute: the coordinate of a vector $v$ in direction $q_i$ is just the dot product $\langle v, q_i \rangle$. Because:

\[v = \sum_i (v^T q_i)q_i\] (this is essentially projecting $v$ onto each orthonormal basis vector).

Orthonormal columns of a matrix lead to nice properties: if $Q$ has orthonormal columns, then $Q^T Q = I$ (so $Q$ is orthogonal if it's square; or in general $Q^T Q = I$ of appropriate size).

Orthogonal matrices preserve lengths and angles: $||Qx|| = ||x||$ and $(Qx)\cdot(Qy) = x\cdot y$. This is why rotations and reflections correspond to orthogonal matrices.

Gram-Schmidt Process

Given a set of linearly independent vectors $v_1, v_2, \ldots, v_n$ in an inner product space, we can produce an orthonormal set of vectors $q_1, q_2, \ldots, q_n$ that span the same subspace using the Gram-Schmidt algorithm:

\begin{align*} u_1 &= v_1, \qquad q_1 = \frac{u_1}{|u_1|}, \\ u_2 &= v_2 - \text{proj}_{q_1}(v_2) = v_2 - (v_2^T q_1) q_1, \qquad q_2 = \frac{u_2}{|u_2|}, \\ u_3 &= v_3 - (v_3^T q_1) q_1 - (v_3^T q_2) q_2, \qquad q_3 = \frac{u_3}{|u_3|}, \\ &\vdots \\ u_k &= v_k - \sum_{j=1}^{k-1} (v_k^T q_j) q_j, \qquad q_k = \frac{u_k}{|u_k|}. \end{align*}

At each step, $u_k$ is the component of $v_k$ orthogonal to all previously obtained $q_j$ ($j < k$). If the $v$'s are linearly independent, none of the $u_k$ will be zero, and this produces an orthonormal set $q_1,\ldots,q_n$.

Worked Example

Suppose we want to orthonormalize the set:
$v_1=(1, 1, 0)^T$, $v_2=(1, 0, 1)^T$, $v_3=(0, 1, 1)^T$ in $\mathbb{R}^3$.

Step 1: Start with $u_1 = (1,1,0)$; $|u_1| = \sqrt{2}$; $q_1 = \frac{1}{\sqrt{2}}(1,1,0)$

Step 2: $\text{proj}_{q_1}(v_2) = (v_2^T q_1)q_1 = \frac{1}{\sqrt{2}}q_1 = (0.5,0.5,0)$

Step 3: $u_2 = v_2 - \text{proj}_{q_1}(v_2) = (1,0,1) - (0.5,0.5,0) = (0.5, -0.5, 1)$

Step 4: $|u_2| \approx 1.225$ → $q_2 \approx (0.408, -0.408, 0.816)$

Step 5: $\text{proj}_{q_1}(v_3) = (v_3^T q_1)q_1 \approx (0.5,0.5,0)$

Step 6: $\text{proj}_{q_2}(v_3) = (v_3^T q_2)q_2 \approx (0.0833, -0.0833, 0.1667)$

Step 7: $u_3 = v_3 - \text{proj}_{q_1}(v_3) - \text{proj}_{q_2}(v_3)$
$= (0, 1, 1) - (0.5,0.5,0) - (0.0833, -0.0833, 0.1667)$
$\approx (-0.5833, 0.5833, 0.8333)$

Final result: We normalize $u_3$ to get $q_3$. The resulting set $\{q_1, q_2, q_3\}$ is orthonormal. You can verify this by checking that all dot products are 0 and all norms are 1.

Note: Gram-Schmidt is sensitive to numerical instability when vectors are nearly linearly dependent. In such cases, QR factorization is often preferred for stability.

Orthogonal Complements and Subspaces

Given a subspace $W$ of $\mathbb{R}^n$, its orthogonal complement is defined as: $W^\perp = \{ x \in \mathbb{R}^n : x^T w = 0, \forall w \in W\}$.

The dimensions satisfy: $\dim(W) + \dim(W^\perp) = n$.

For example, the solution space of $Ax = b$ (if consistent) can be described as a particular solution plus the nullspace of $A$. The row space of a matrix (span of its rows) is orthogonal to its nullspace. This leads to the fundamental subspaces:

Row space = $(\mathrm{Null}(A))^\perp$
Column space = $(\mathrm{Null}(A^T))^\perp$ (related to rank-nullity theorem)

Use in Machine Learning

Orthonormal vectors appear in PCA, where principal components are orthonormal.
Many algorithms prefer orthonormal bases for numerical stability (e.g., iterative methods).
Neural networks sometimes use orthogonal initialization to preserve variance across layers.
In optimization, conjugate gradient methods use the idea of A-orthogonal (conjugate) directions.

7. Applications in Machine Learning and Reinforcement Learning

I have introduced most of the essential linear algebra concepts. Now, we connect them explicitly to various applications in AI, machine learning (ML), and reinforcement learning (RL). Linear algebra is ubiquitous in these fields; understanding it can illuminate why algorithms work and how to implement them efficiently.

Gradient Descent and Optimization

Many machine learning algorithms boil down to optimizing a cost function $L(\theta)$ with respect to parameters $\theta$ (which could be a vector or matrix of weights). Gradient descent is a first-order iterative optimization method. Starting from an initial guess $\theta^{(0)}$, we update:

\[\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla L(\theta^{(t)}),\]

where $\alpha$ is a learning rate (step size) and $\nabla L(\theta)$ is the gradient vector of partial derivatives.

Linear algebra comes in through the gradient: for example, in linear regression with cost $L(w) = \frac{1}{2}|Xw - y|^2$, we have $\nabla L(w) = X^T(Xw - y)$, which is derived using $X^T X w - X^T y = 0$ at optimum (the normal equations). Gradient descent in that case updates:

\[w \leftarrow w - \alpha X^T(Xw - y).\]

This is a sequence of vector operations (matrix-vector multiplies). If $X$ has orthonormal columns, this process will converge in one step if $\alpha=1$ (since then it's equivalent to normal equations solution). In general, analyzing convergence uses eigenvalues of $X^T X$ (the Hessian of the quadratic cost). For instance, the largest eigenvalue of $X^T X$ (which is $\sigma_{\max}^2$, the square of the largest singular value) determines the maximum safe learning rate for convergence.

For more complex models, gradient descent is still linear algebra under the hood. Consider a simple neural network with weight matrix $W$ and output $h = f(Wx)$ (with some nonlinear activation $f$). The gradient of a loss with respect to $W$ involves terms like $(f'(Wx)\odot \text{error}) x^T$ (which is an outer product of vectors). Frameworks like PyTorch handle these linear algebra operations automatically.

Example (Gradient Descent in Code)

I'll use gradient descent to fit the line from the previous example (without directly solving normal equations):

import torch

# Data
X = torch.tensor([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = torch.tensor([1.0, 2.1, 3.9, 6.2])

# Initialize parameters [b, a] randomly
theta = torch.randn(2, requires_grad=True)

# Gradient descent loop
learning_rate = 0.1
for t in range(1000):
    # Compute predictions and loss
    pred = X.mv(theta)  # X.mv multiplies matrix X by vector theta
    loss = ((pred - y)**2).mean()  # MSE loss
    loss.backward()  # compute gradients w.rt theta
    
    with torch.no_grad():     # update parameters without tracking gradient
        theta -= learning_rate * theta.grad
    theta.grad.zero_()

# After GD, theta should be close to [0.69, 1.74]
print(theta)

After running, theta will converge to approximately $[0.69, 1.74]^T$ as expected. Each iteration performed matrix-vector products $X \cdot \theta$ and $X^T \cdot (\text{pred}-y)$. This is gradient descent in action, utilizing linear algebra every step.

In more complex deep networks, gradient descent (and its variants like SGD, Adam) still relies heavily on linear algebra: the computation of gradients is via the backpropagation algorithm, which essentially multiplies the error by weight matrices backward through the network (the transpose of forward weight multiplications).

Linear and Logistic Regression

We already saw linear regression as a least squares problem. It's an example of a linear model because the prediction is a linear combination of the input features. Linear models are easy to analyze because the loss function (usually mean squared error) is convex quadratic, and the solution can be found by linear algebra directly.

Logistic regression is another popular linear model, used for binary classification. It predicts the probability of class 1 via:

\[p = \sigma(w^T x + b),\]

where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function. Although the model $w^T x + b$ is linear in $x$, the sigmoid makes the overall model nonlinear in parameters. We cannot solve logistic regression with normal equations; we typically use gradient ascent on the log-likelihood or gradient descent on the negative log-likelihood. The gradient for logistic regression for a dataset $\{x_i, y_i\}$ (with $y_i \in \{0,1\}$) is:

\[\nabla_w L(w) = \sum_{i}( \sigma(w^T x_i + b) - y_i ) x_i,\]

which is an $n$-dimensional vector. Writing in matrix terms, if $X$ is the data matrix and $y$ the vector of labels, then $\sigma(Xw)$ denotes applying sigmoid to each component of the vector $Xw$. The gradient can be written as $X^T(\sigma(Xw) - y)$ (which is analogous to the linear regression gradient $X^T(Xw - y)$, except with $\sigma(Xw)$ in place of $Xw$). This expression shows again how linear algebra underlies even this "nonlinear" problem.

Using an appropriate library, one can fit a logistic regression similarly by iterative solvers. But understanding $X^T(\sigma(Xw)-y)=0$ as the optimality condition is insightful: it means the weighted sum of features for which prediction error is zero in each direction in the feature space at optimum – a balance of positive and negative errors. This is analogous to normal equations but with the nonlinearity making it harder to solve directly.

Neural Networks and Deep Learning

Modern AI is largely driven by deep neural networks. At their core, neural networks are composed of layers, many of which are linear transformations followed by elementwise nonlinearities:

\[h^{(\ell)} = f^{(\ell)}(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}),\]

where $W^{(\ell)}$ is a weight matrix at layer $\ell$ and $f^{(\ell)}$ is a nonlinear activation (like ReLU, sigmoid, etc.). The input $h^{(0)} = x$ is the feature vector, and the final output $h^{(L)}$ is used for prediction. All the $W^{(\ell)}$ are parameters.

The linear algebra connection:

The forward pass is a sequence of matrix-vector multiplications (or matrix-matrix if doing a batch of inputs at once) and additions.
The backward pass (backpropagation) computes gradients by propagating errors backward through transposed weight matrices (another series of matrix multiplications).
The update of weights is an addition of a scaled gradient (like $W^{(\ell)} \leftarrow W^{(\ell)} - \alpha \nabla_{W^{(\ell)}} L$).
The structure of the network can be analyzed in linear algebra terms: e.g., the first layer spans some subspace of $\mathbb{R}^{n}$, then next layer spans subspace in another space, etc. If activations were linear too, the whole network would collapse to a single linear transformation $W_{\text{eff}} = W^{(L)} \cdots W^{(1)}$, which is why we need nonlinearities to get a truly deep model. But even with nonlinearities, linear algebra provides the scaffolding.

Training a neural network is essentially solving a huge optimization problem in weight space. Second-order optimization methods would involve the Hessian (matrix of second derivatives) which is huge, but practitioners often rely on first-order methods (SGD and its variants).

Sometimes special linear algebra tricks are used: for example, in training RNNs or very deep networks, one might use orthogonal or unitary weight initializations to preserve gradients (because an orthogonal matrix has eigenvalues of magnitude 1, neither exploding nor vanishing too quickly). Convolutional layers use Toeplitz-structured matrices. Batch normalization multiplies by diagonal matrices (variances) and subtracts means. All of these can be seen through a linear algebra lens.

Dimensionality Reduction

High-dimensional data often has structure that allows it to be summarized in fewer dimensions. We already discussed PCA, which finds an orthonormal basis capturing maximum variance. PCA is widely used in preprocessing: for visualization, noise reduction, or as a pre-step for other algorithms to avoid multicollinearity.

In deep learning, autoencoders are neural networks that learn to compress and reconstruct data; the simplest linear autoencoder essentially computes the same subspace as PCA (the space spanned by top singular vectors of the data matrix).

Other techniques like factor analysis, singular value decomposition on word-document matrices (LSA in NLP), or low-rank matrix factorization for recommender systems are direct applications of linear algebra (finding low-rank approximations via SVD or alternating least squares).

Reinforcement Learning and Linear Algebra

Reinforcement learning involves an agent interacting with an environment modeled often as a Markov Decision Process (MDP). Linear algebra appears in multiple ways:

If the state space is small and we have a fixed policy, the value function $v$ (an $|S|$-dimensional vector) satisfies the Bellman equation $v = R + \gamma P v$ where $P$ is the state transition matrix under that policy (each row of $P$ is a distribution of next state given current state) and $R$ is the reward vector. This is a linear system: \[(I - \gamma P) v = R,\] which can be solved by inverting $(I - \gamma P)$ (if $|S|$ is not too large). Value iteration solves this iteratively (which is conceptually similar to solving linear equations by fixed-point iteration).
Policy evaluation with linear function approximation: Instead of representing $v$ explicitly for large state space, one might approximate $v(s) \approx \phi(s)^T w$ for some feature vector $\phi(s)$. Then finding $w$ such that $v \approx R + \gamma P v$ leads to a least squares problem known as LSTD (least-squares temporal difference learning). This again results in normal equations $\Phi^T (R + \gamma P \Phi w - \Phi w) = 0$, solving an $n \times n$ linear system (where $n$ is number of features).
The successor representation (SR) is a concept by Dayan (1993) which expresses long-term state occupancy. For a policy $\pi$, the SR matrix is $M = (I - \gamma P)^{-1}$ (assuming it converges) which we can expand as $I + \gamma P + \gamma^2 P^2 + \cdots$. The eigen-decomposition of $P$ helps analyze $M$: if $P = Q \Lambda Q^{-1}$ (diagonalizable), then $M = Q(I - \gamma \Lambda)^{-1} Q^{-1}$, so $M$ and $P$ share eigenvectors. This means the eigenvectors of $P$ (which correspond to coherent dynamics patterns in state space) form the basis for the successor features. In reinforcement learning research, people have used the eigenvectors of $P$ or related matrices (like the graph Laplacian of the state graph) as features or options: Proto-value functions are eigenfunctions of the Laplacian on the state graph, and eigenoptions are temporally extended actions derived from those eigenvectors. These ideas rely on computing eigenvalues/eigenvectors—often approximated for large problems.
Policy gradients: In policy gradient methods, we adjust policy parameters $\theta$ in the direction of $\nabla J(\theta)$ (the gradient of expected return). If the policy is parameterized linearly in some feature space (e.g., softmax policy where logits are $w^T \phi(s,a)$), then the gradient involves expectations of feature vectors. For example, REINFORCE uses an update $\Delta w \propto (R - b) \phi(s,a)$ for some baseline $b$. These are linear algebraic combinations of feature vectors weighted by returns.

In summary, RL algorithms often end up solving linear systems or eigenproblems (in planning) or doing gradient descent (in learning), all of which require a solid understanding of linear algebra.

Further Example: Eigenoptions in Gridworld

Imagine an agent in a grid world. We can construct a state graph where each state is connected to neighbors (under some policy or random walk). The structure of this graph can be captured in the adjacency or Laplacian matrix $L = D - P$ (where $P$ might be a symmetric transition matrix and $D$ degree matrix). The eigenvectors of $L$ give a harmonics basis on the grid. For instance, the second smallest eigenvector of $L$ might vary smoothly from left to right (positive on one side, negative on the other), suggesting an option that moves the agent left-right. The next eigenvector might vary top to bottom, suggesting an up-down exploratory option. This is a way linear algebra (eigen decomposition) provides insight into high-level behavior.

Why Linear Algebra for AI? – A Philosophical Note

Linear algebra is not just a tool; it is a language for expressing relationships in a world where patterns emerge from apparent chaos. To ask why linear algebra is central to artificial intelligence is to probe the deeper question: Why is linearity so foundational to our understanding of reality, and why does it persist even in nonlinear domains?

The Philosophy of Superposition: Linear Algebra and the Art of Reduction

At its heart, linear algebra is about superposition—the principle that complex structures or behaviors can be expressed as compositions of simpler, independent components. This notion is not merely a mathematical convenience; it is a philosophical posture toward the world. To embrace superposition is to adopt a reductionist epistemology—the belief that the complex can be understood by breaking it down into parts, that the sum is nothing more than the combination of its constituents. But this belief is more than just a way of modeling systems; it is how the human mind copes with the unfathomable depth of reality.

Superposition as an Ontological Stance

In physics, engineering, and machine learning alike, superposition underpins our models. A signal is treated as the sum of sine waves. An image is represented as a grid of pixel values. A vector in a space is written as a linear combination of basis vectors. Implicit in each of these representations is a belief: that truth, or at least workable approximations of it, can be teased apart and understood in layers. The world does not hand us neat equations or segmented truths; rather, we impose that structure upon it, and superposition becomes the scaffolding through which we analyze, predict, and build.

In this way, linear algebra becomes more than a tool—it becomes a worldview. We project our need for structure onto phenomena that are often chaotic, nonlinear, and irreducible in practice. We linearize not because the world is linear, but because linearity is the format we understand best. It is the lens through which we temporarily tame complexity, even if we must later acknowledge its inadequacy.

Linearization and the Human Strategy of Approximation

This brings us to the epistemological strategy that underlies much of science and artificial intelligence: when faced with complexity, we begin by simplifying. Taylor expansions reduce nonlinear functions to polynomials near a point. Jacobians represent nonlinear dynamics through linear transformations. Even in deep learning—arguably the most nonlinear domain of modern computation—we optimize networks using gradients, which are inherently local, linear approximations of change. This pattern is not arbitrary; it reflects something deep about human cognition.

We are not omniscient beings; we cannot simulate every possibility in a high-dimensional space. Instead, we construct local models and iterate. We linearize, because that's where logic can breathe. We simplify not out of laziness, but necessity. Even evolution follows this principle: rather than solving the entire environment at once, organisms develop heuristics—fast, linear-ish approximations—that work well enough in local contexts. Emotion, instinct, and perception are all, in their own way, simplifications of complex inputs.

AI and the Reinscription of Human Heuristics

Artificial intelligence inherits this tradition wholesale. Linear algebra isn't just part of AI—it is foundational. Word embeddings are vectors. Neural networks are layers of matrix multiplications. Backpropagation—the engine of learning—is a choreography of linear operations chained together. Even the most complex architectures, such as transformers, boil down to linear attention matrices, linear projections, and dot products. These are not arbitrary design choices; they reflect our inherited belief that intelligence itself can be constructed through composition—that understanding can emerge from the accumulation of simple, modular parts.

But this reliance on superposition raises deeper philosophical questions. If intelligence is built on the sum of parts, does that mean it is nothing more than the sum? Are consciousness, creativity, or moral reasoning just high-order interactions in a multidimensional vector space? Or is there a point at which the superposition principle collapses—where irreducible wholes emerge?

Reductionism works well when systems are tractable. But when we confront phenomena like consciousness, ethics, or social complexity, the assumptions of linearity begin to fray. AI may learn to model behavior linearly, but does it understand nonlinearity? Does it feel the difference between a curve and its tangent? Or are we simply forcing it to do what we do: ignore the totality, and hope that local behavior is enough?

The Limits of Superposition and the Quest Beyond Linearity

The danger in embracing superposition too religiously is the illusion of completeness. It is tempting to believe that because we can approximate the world linearly, we have captured its essence. But approximation is not identity. The map is not the territory. This is perhaps the most crucial philosophical insight linear algebra forces us to confront: that utility is not truth.

Nevertheless, we persist. And rightly so. Superposition, linearization, and decomposition are not lies—they are partial truths. They are scaffolds for insight. They are the crutches of comprehension. And until we have something better, they are our most powerful allies in navigating a world that remains, at its core, wildly nonlinear and stubbornly complex.

High-dimensional geometry and the nature of abstraction

Artificial intelligence doesn't live in the world of chairs and trees. It doesn't see the sky, touch the ground, or feel the weight of objects. It doesn't even really see at all. What it does operate in is a world of abstract spaces—mathematical spaces so high in dimension that human intuition fails almost immediately. Images become 10,000-dimensional vectors. Words become coordinates in sprawling semantic spaces. Thoughts, categories, and even emotions are mapped to latent embeddings in vectorized models. AI, in other words, exists in a geometry of abstraction—and linear algebra is the language that defines, structures, and manipulates that geometry.

This is not a minor point; it is a foundational philosophical shift. Intelligence—at least as embodied by machines—is not grounded in physical form, but in the ability to represent and manipulate structure in high-dimensional space. And this space is not arbitrary. It has its own rules, topologies, and transformations, all governed by the fundamental operations of linear algebra: vector addition, dot products, matrix multiplication, eigenvectors, and decompositions. In this landscape, geometry becomes cognition.

Dimensional Blindness: Where Intuition Fails

Human intuition is built for three dimensions. We evolved to navigate a world of cliffs, tools, predators, and social signals. Our spatial reasoning is tuned to the macroscopic world. Once we move beyond that—into 4, 40, or 40,000 dimensions—our evolved cognitive strategies collapse. Concepts like "distance," "direction," or "similarity" become mathematically coherent but intuitively alien. In high-dimensional space, most points are far from one another. Everything is nearly orthogonal. There are no simple visual metaphors. Yet AI thrives here.

This is where linear algebra steps in—not just as a technical crutch, but as an epistemological prosthetic. It defines what it means for two vectors to be "close," even when we cannot visualize them. It tells us how to rotate a 10,000-dimensional object, or project it into a lower space while preserving its essence. When we use singular value decomposition (SVD), principal component analysis (PCA), or eigenvector centrality, we are not just simplifying data—we are finding structure in places our minds cannot reach unaided. Linear algebra becomes a form of prosthetic intuition.

Vectors as Meaning: When Numbers Become Thought

A word embedding is not just a list of floating-point numbers. It is a spatial instantiation of meaning. The vector for "king" lies near the vector for "queen," and the vector offset from "man" to "king" mirrors the offset from "woman" to "queen." This is not coincidence; it is geometry capturing conceptual relationships. The same occurs in vision models, where similar faces or shapes cluster in latent space, or in generative models, where traversing a direction in vector space morphs a sketch into a photorealistic image.

This is philosophically profound: it means meaning itself can be embedded geometrically. AI is not just computing on data—it is sculpting meaning into multidimensional forms. Each matrix transformation, each dot product, is an act of structural imposition: a carving of intention into space. In this sense, linear algebra is not merely a tool for processing representations; it is a method for creating them.

And just as sculpture reveals form by subtracting from stone, so too does AI reveal patterns by decomposing matrices, pruning networks, or compressing manifolds. Dimensionality reduction becomes aesthetic as well as functional. It is a search for the essence of structure.

From Structure to Intelligence: Sculpting with a Chisel of Code

When we think of intelligence—human or machine—we often think in terms of outputs: solving problems, recognizing faces, answering questions. But these outputs are the shadows cast by much deeper operations. At its core, intelligence is the ability to model complexity, generalize patterns, and make predictions in unseen situations. And that ability, in AI, depends on structure. Not random structure, but geometry sculpted by data. Patterns captured in weight matrices. Rules embedded in loss landscapes. Constraints and affordances expressed in the algebraic interplay of tensors.

Here, linear algebra becomes the chisel. It shapes the data's structure, not arbitrarily, but meaningfully. It aligns with the gradients of the loss function. It reorients latent space to place semantically related concepts nearby. It stretches, compresses, rotates, and projects—actions that, though numeric, encode understanding. Through optimization and learning, AI constructs a cognitive geometry—one in which categories, analogies, and abstract forms are literally spatial relationships.

The Metaphysics of Manifolds

Philosophically, this redefines our understanding of thought. Intelligence, as realized in AI, does not rely on embodiment or consciousness. It resides in the capacity to encode abstraction geometrically. We might even say that AI models are manifold sculptors—entities whose entire intelligence lies in how well they learn to deform data space to align with the structure of the world.

This leads to a striking conclusion: intelligence is geometry. Or at least, geometry is the substrate through which artificial intelligence expresses itself. This doesn't negate the richness of biological intelligence, which is grounded in emotion, context, and embodiment. But it reveals something deeper—that all cognition, human or artificial, may ultimately rest on the ability to map complexity into comprehensible form. Whether through symbols or vectors, brains or matrices, intelligence is the art of structural compression.

The unreasonable effectiveness of linearity

If reality is nonlinear, chaotic, and filled with feedback loops and emergent complexity, why does linearity work so well? This is not just a technical observation; it is a philosophical puzzle. In a world that defies simplification, where systems rarely respond proportionally to inputs and where dependencies are entangled beyond easy analysis, linearity somehow remains our most powerful tool. From physics to economics, and now to artificial intelligence, linear structures form the skeleton of our most successful models. Why?

The answer is not that the world is linear. It clearly isn't. The answer is that linearity offers an intelligible approximation of reality—a scaffold for thought. It is not the whole house, but the framework around which we can begin to build. Linearity is not a mirror of nature; it is a human construct—a formalism that slices through complexity, carving the world into shapes we can compute with, reason about, and generalize from. It offers structure without demanding full fidelity to the chaos of the real. That alone makes it indispensable.

Linearity as Epistemic Strategy

We rarely understand complex systems by grasping them in their entirety. Instead, we approximate. We simplify. We reduce. This is not laziness—it is survival. Our minds cannot handle full combinatorial explosion, so we lean on frameworks that scale, that tame complexity into bite-sized generalities. Linearity is one such framework. It assumes proportionality and independence where there may be none, but it works well enough, often enough, that it has become our dominant language of reasoning.

This is epistemically significant. To study or build anything is to abstract it into a form we can manipulate. In mathematics, that form is often linear. In AI, this abstraction becomes literal: backpropagation, the algorithm that powers deep learning, is just a choreography of linear algebra operations—matrix multiplications, dot products, transpositions. Nothing about these operations is magical, yet what emerges from them is astonishing. Through them, we gain systems that can translate languages, generate art, detect cancer, and outperform humans at games of skill. Linearity becomes the gateway to complexity.

The Language of Linear Abstraction

Just as natural language helps us describe the world through nouns and verbs—even though the world contains no such divisions—linear algebra helps us describe structure through vectors and matrices. These tools don't represent reality directly; they render it tractable. They allow us to translate the fluid messiness of data into discrete transformations, structured relationships, and measurable distances. Linearity is the syntax of data interpretation.

Consider an AI model trained to recognize human faces. At its core, it doesn't "see" the image—it processes it as a high-dimensional vector, transforms it using learned weight matrices, and propagates the results through layers of linear and nonlinear operations. But the scaffolding is fundamentally linear. The model's ability to recognize a face, differentiate a smile from a frown, or infer identity from shadows and angles—all of it rests on how well the underlying linear transformations shape the data into intelligible, separable clusters.

And that's the philosophical moment: we are not just engineering systems; we are carving structure out of the inchoate. We are turning raw, unstructured information into geometric order—into shapes we can project, rotate, interpolate, and ultimately understand.

Linearity and Meaning

The implications go further. When a model generates poetry, or paints an image, or completes your sentence before you finish typing, it's easy to forget that behind these acts lie nothing more than layered compositions of linear transformations. There is no poetry module. No aesthetic sense. No intent. Just gradients computed over layers of linear operations, optimized to minimize error. Yet somehow, emergent behavior arises.

This invites a deeper philosophical interpretation: perhaps intelligence, too, is a matter of structured transformation. Not intelligence in the conscious, subjective sense, but in the functional sense—the ability to map inputs to meaningful outputs. If intelligence is structure, and structure is captured by linear operations, then linear algebra becomes more than just a mathematical tool. It becomes the scaffold of cognition—a substrate on which artificial thought is constructed.

The Bottom Line: Linear Algebra and the Architecture of Understanding

Linear algebra isn't just a tool; it's a philosophy disguised as a framework. It doesn't claim the world is linear—only that linearity is the first step toward sense-making. It's how we begin to hold reality without being drowned by it. High-dimensional geometry, vector spaces, eigenvalues—these aren't just abstract constructs. They're survival mechanisms. They're how we reduce the chaos into something graspable; something we can compress, manipulate, and ultimately act on.

When we map the world into vectors, when we rotate spaces with matrices, what we're really doing is building a geometry of meaning. AI doesn't navigate the world through objects or experiences; it floats in latent spaces—fields of numbers where distance and direction are understanding. And it works. Somehow. That's the existential tension: the world is messy, nonlinear, even absurd—but linear structures still manage to carve meaning from it. Not perfectly, but meaningfully enough.

So maybe linear algebra doesn't reflect reality as it is; maybe it reflects reality as it must be filtered—if you want to learn from it, predict it, survive it. Maybe it's not that the world is intelligible, but that intelligence arises from the ability to make the world look intelligible. In the end, linear algebra doesn't show us everything; it shows us just enough to keep going. And sometimes, that's all understanding ever was.