Singular value decomposition
Singular value decomposition (SVD) is a fundamental factorization in linear algebra that expresses any real or complex matrix A \in \mathbb{R}^{m \times n} (or \mathbb{C}^{m \times n}) as A = U \Sigma V^T, where U \in \mathbb{R}^{m \times m} and V \in \mathbb{R}^{n \times n} are orthogonal matrices (or unitary for complex cases), and \Sigma \in \mathbb{R}^{m \times n} is a rectangular diagonal matrix with non-negative real singular values \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0 on its main diagonal, where p = \min(m, n).[1] The singular values are the square roots of the eigenvalues of A^T A (or A^* A for complex matrices), and the columns of V are the corresponding eigenvectors, while the columns of U are derived from A times the columns of V normalized by the singular values.[1] This decomposition extends the eigenvalue decomposition to arbitrary matrices, not just square symmetric ones, and reveals key structural properties such as the rank of A (the number of nonzero singular values) and the condition number (the ratio of the largest to smallest nonzero singular value).[1]
The origins of SVD trace back to the 19th century, with independent discoveries by Eugenio Beltrami in 1873 and Camille Jordan in 1874, who developed it in the context of simplifying bilinear forms.[2] James Joseph Sylvester contributed related canonical forms in 1889, and Erhard Schmidt introduced the term "singular values" in 1907 while studying integral equations.[2] Hermann Weyl advanced the theoretical framework in 1912 and 1949 through work on eigenvalues of Hermitian matrices.[2] The modern computational significance emerged in the mid-20th century, particularly with Carl Eckart and Gale Young's 1936 approximation theorem, which uses SVD for low-rank matrix approximations, and Gene Golub's development of efficient algorithms in the late 1960s, including the Golub-Reinsch method based on bidiagonalization and QR iterations.[3]
SVD plays a central role in numerical linear algebra and data science due to its stability and versatility.[4] It enables the computation of the Moore-Penrose pseudoinverse A^+ = V \Sigma^+ U^T, where \Sigma^+ inverts the nonzero singular values, which is essential for solving overdetermined least-squares problems and handling ill-conditioned systems.[5] In data analysis, SVD underpins principal component analysis (PCA) by decomposing the covariance matrix or centered data matrix to identify directions of maximum variance for dimensionality reduction.[6] Practical applications include image compression, where retaining only the largest singular values approximates the original image with far fewer parameters while preserving essential features (e.g., reducing storage by keeping top k components for k \ll \min(m,n)); noise removal in signals by truncating small singular values; and recommendation systems, such as those used by Netflix, where SVD factorizes user-item rating matrices to predict preferences based on latent factors like genres.[6][4] Additionally, SVD facilitates handwritten digit classification by projecting images onto low-dimensional subspaces derived from training data, achieving high accuracy (e.g., ~92% on MNIST with limited samples), and supports geometric interpretations in teaching linear algebra, such as visualizing a matrix transformation as a rotation, a scaling along orthogonal axes, and a second rotation.[6][4]
Fundamentals
Definition
In linear algebra, the singular value decomposition (SVD) is a matrix factorization theorem that expresses any m \times n real or complex matrix A as the product of three matrices: A = U \Sigma V^*, where U is an m \times m unitary matrix, \Sigma is an m \times n rectangular diagonal matrix with non-negative real entries on the diagonal known as singular values, and V is an n \times n unitary matrix with V^* denoting its conjugate transpose.[7] This decomposition generalizes the eigenvalue decomposition to rectangular matrices and exists for every matrix A, regardless of its dimensions or rank.[7]
A unitary matrix U satisfies U^* U = I_m, where I_m is the m \times m identity matrix, meaning its columns form an orthonormal basis for \mathbb{C}^m (or \mathbb{R}^m for real matrices); similarly for V with respect to \mathbb{C}^n.[7] The matrix \Sigma has the form
\Sigma = \begin{pmatrix}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_{\min(m,n)}
\end{pmatrix},
where the singular values satisfy \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 and \sigma_{r+1} = \cdots = \sigma_{\min(m,n)} = 0, with r = \operatorname{rank}(A) being the number of positive singular values.[7] The columns of U are the left singular vectors, and the columns of V are the right singular vectors.[7]
Key initial properties follow directly from the factorization: the product A A^* = U \Sigma \Sigma^* U^* = U \Sigma^2 U^*, where \Sigma^2 is diagonal with entries \sigma_i^2, showing that the eigenvalues of the Hermitian matrix A A^* are the squares of the singular values; analogously, A^* A = V \Sigma^* \Sigma V^* = V \Sigma^2 V^*.[7] These relations highlight how SVD diagonalizes the related Gram matrices A A^* and A^* A.[7]
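These defining relations are easy to check numerically. The following NumPy sketch (an illustration, not part of the cited sources) verifies the factorization, the unitarity of U and V, and the Gram-matrix relation for a random complex matrix:

```python
import numpy as np

# Minimal numerical check of the definition A = U Σ V* (full SVD).
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

U, s, Vh = np.linalg.svd(A)            # U: m×m, s: min(m,n) singular values, Vh = V*
Sigma = np.zeros((m, n), dtype=complex)
Sigma[:len(s), :len(s)] = np.diag(s)   # embed the singular values in an m×n matrix

assert np.allclose(A, U @ Sigma @ Vh)                # A = U Σ V*
assert np.allclose(U.conj().T @ U, np.eye(m))        # U unitary
assert np.allclose(Vh @ Vh.conj().T, np.eye(n))      # V unitary

# Eigenvalues of the Gram matrix A*A are the squared singular values.
eigvals = np.sort(np.linalg.eigvalsh(A.conj().T @ A))[::-1]
assert np.allclose(eigvals, s**2)
```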
Notation and basic properties
The singular value decomposition (SVD) of an m \times n matrix A is expressed as A = U \Sigma V^H, where U is an m \times m unitary matrix whose columns u_i are the left singular vectors, V is an n \times n unitary matrix whose columns v_i are the right singular vectors, and \Sigma is an m \times n rectangular diagonal matrix with nonnegative real entries \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0 on the diagonal, where p = \min(m, n); these \sigma_i are termed the singular values of A.[8] For real matrices, the Hermitian transpose V^H reduces to the transpose V^T.[9]
The singular values \sigma_i are uniquely determined by A up to their conventional decreasing order.[10] The corresponding left and right singular vectors u_i and v_i are unique up to multiplication by a unit-modulus complex scalar (or sign flip for real matrices) when the singular values are distinct; however, when singular values are repeated, the individual vectors are not unique, though the invariant subspaces spanned by the associated singular vectors are unique.[10][11]
From the SVD, several fundamental properties follow directly. The rank of A, denoted \operatorname{rank}(A), equals the number of positive singular values, i.e., \operatorname{rank}(A) = r where \sigma_r > 0 = \sigma_{r+1}.[1] For a square n \times n matrix A, invertibility holds if and only if all singular values are positive, ensuring \Sigma is nonsingular and thus A is invertible.[12] Moreover, the SVD provides orthonormal bases for the four fundamental subspaces of A: the column space \operatorname{col}(A) is spanned by the first r left singular vectors \{u_1, \dots, u_r\}, the row space \operatorname{row}(A) by the first r right singular vectors \{v_1, \dots, v_r\}, the null space \operatorname{null}(A) by the remaining right singular vectors \{v_{r+1}, \dots, v_n\}, and the left null space \operatorname{null}(A^H) by the remaining left singular vectors \{u_{r+1}, \dots, u_m\}.[9]
The condition number of a square invertible matrix A, defined as \kappa(A) = \sigma_1 / \sigma_n where \sigma_1 is the largest singular value and \sigma_n the smallest, quantifies the relative sensitivity of the matrix to perturbations in norms such as the spectral or Frobenius norm; specifically, \kappa(A) \geq 1, with \kappa(A) = 1 if and only if A is a scalar multiple of a unitary matrix.[12][11]
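As an illustration of these properties, the short NumPy sketch below reads the numerical rank and condition number directly off the singular values; the tolerance rule shown is a common default (the one used by numpy.linalg.matrix_rank), not the only possible choice:

```python
import numpy as np

# Rank and condition number read directly off the singular values.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 3))
A = B @ np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 3.0],
                  [0.0, 0.0, 0.0]])            # rank-2 by construction (6×3)

s = np.linalg.svd(A, compute_uv=False)          # singular values in decreasing order
tol = max(A.shape) * np.finfo(float).eps * s[0] # a common default tolerance
rank = int(np.sum(s > tol))                     # number of "numerically nonzero" σ_i
print("singular values:", s, "-> rank", rank)

# For a square invertible matrix, κ(A) = σ_1 / σ_n.
C = rng.standard_normal((4, 4))
sc = np.linalg.svd(C, compute_uv=False)
print("condition number:", sc[0] / sc[-1], "vs numpy:", np.linalg.cond(C))
```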
Illustrative example
To illustrate the singular value decomposition (SVD), consider the 2×2 matrix
A = \begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix}.
The SVD process begins by computing A^T A, whose eigenvalues yield the squares of the singular values. Here, the eigenvalues of A^T A are 45 and 5, so the singular values are \sigma_1 = \sqrt{45} = 3\sqrt{5} \approx 6.708 and \sigma_2 = \sqrt{5} \approx 2.236. The normalized eigenvectors of A^T A form the columns of the right singular matrix V:
V = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}.
The columns of the left singular matrix U are then found using u_i = A v_i / \sigma_i for i=1,2:
U = \frac{1}{\sqrt{10}} \begin{pmatrix} 1 & -3 \\ 3 & 1 \end{pmatrix},
with the diagonal matrix of singular values
\Sigma = \begin{pmatrix} 3\sqrt{5} & 0 \\ 0 & \sqrt{5} \end{pmatrix}.
Thus, the SVD is A = U \Sigma V^T.[7]
To verify, multiplying U \Sigma V^T reconstructs the original matrix:
U \Sigma V^T = \begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix}.
Additionally, U and V satisfy the orthogonality properties U^T U = I and V^T V = I, confirming they are unitary matrices. This decomposition also expresses A as a sum of rank-1 matrices: A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T.[7]
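The worked example can be confirmed numerically; the following sketch checks the reconstruction, the orthogonality of the factors, and the rank-1 expansion against NumPy's own SVD:

```python
import numpy as np

# Numerical check of the worked 2×2 example (a sketch, not part of the cited derivation).
A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

sigma = np.array([3 * np.sqrt(5), np.sqrt(5)])
V = np.array([[1.0, -1.0],
              [1.0,  1.0]]) / np.sqrt(2)
U = np.array([[1.0, -3.0],
              [3.0,  1.0]]) / np.sqrt(10)

assert np.allclose(U @ np.diag(sigma) @ V.T, A)   # A = U Σ Vᵀ
assert np.allclose(U.T @ U, np.eye(2))            # orthogonality of U
assert np.allclose(V.T @ V, np.eye(2))            # orthogonality of V

# Rank-1 expansion: A = σ₁ u₁ v₁ᵀ + σ₂ u₂ v₂ᵀ
expansion = sum(sigma[i] * np.outer(U[:, i], V[:, i]) for i in range(2))
assert np.allclose(expansion, A)

# The singular values agree with NumPy's SVD (vectors agree up to sign).
assert np.allclose(np.linalg.svd(A, compute_uv=False), sigma)
```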
When a positive singular value is repeated, the SVD is non-unique: the corresponding singular vectors are determined only up to an orthogonal transformation within the associated eigenspace. For instance, different eigenvector computations for A^T A (or A A^T) at a repeated eigenvalue may return different orthonormal bases for the same eigenspace, leading to distinct but equally valid choices of U and V that still satisfy the decomposition.[10]
Interpretations
Geometric interpretation
The singular value decomposition of a matrix A geometrically interprets the associated linear transformation as a sequence of an orthogonal transformation V^T on the input space, followed by a diagonal scaling \Sigma, and then an orthogonal transformation U on the output space, expressed as A = U \Sigma V^T.[13] This decomposition reveals how A acts by first aligning the input with principal directions via V^T, which rotates or reflects the space to match the axes of \Sigma, then stretching or compressing along those axes by factors given by the diagonal entries of \Sigma, and finally reorienting the result in the output space through U.[14] The columns of U and V furnish orthonormal bases aligned with these principal directions in their respective spaces.[15]
A key visual aspect arises from the action on the unit ball: the transformation A maps the unit sphere (or circle in two dimensions) in the input space to an ellipsoid (or ellipse) in the output space, where the singular values \sigma_i (the diagonal entries of \Sigma, ordered \sigma_1 \geq \sigma_2 \geq \cdots \geq 0) determine the lengths of the semiaxes of this ellipsoid.[13] For example, in the plane, the unit circle is deformed into an ellipse whose longest and shortest axes correspond to the largest and smallest nonzero singular values, with directions pointing along the left singular vectors (columns of U).[14] This stretching reflects the varying "gain" of the transformation in different directions, with larger \sigma_i indicating greater expansion and smaller ones contraction toward zero.[15]
The orthogonal components U and V can include reflections when their determinants are -1, introducing an orientation reversal in the rotation-like action, such as flipping the input or output space across a hyperplane.[13] Despite this, the overall decomposition preserves angles and distances up to the scalings imposed by \Sigma, maintaining the Euclidean structure through the isometry of the orthogonal transformations.[14] This reflective possibility arises from sign ambiguities in the singular vectors but does not alter the fundamental scaling geometry.[13]
Relation to fundamental subspaces
The singular value decomposition (SVD) of an m \times n matrix A provides a direct connection to its four fundamental subspaces: the column space, row space, null space, and left null space. Specifically, if A = U \Sigma V^H where U is an m \times m unitary matrix, \Sigma is an m \times n diagonal matrix with nonnegative singular values \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 = \sigma_{r+1} = \cdots = \sigma_{\min(m,n)} ordered decreasingly, and V is an n \times n unitary matrix, then the column space of A, denoted \operatorname{Col}(A), is the span of the first r columns of U, i.e., \operatorname{Col}(A) = \operatorname{span}\{u_1, \dots, u_r\}, where u_i are the left singular vectors corresponding to the nonzero singular values.[7][16] This subspace has dimension r, matching the rank of A.
The row space of A, denoted \operatorname{Row}(A), is equivalently the column space of A^H and is given by the span of the first r columns of V, i.e., \operatorname{Row}(A) = \operatorname{span}\{v_1, \dots, v_r\}, where v_i are the right singular vectors associated with the nonzero singular values.[7][16] Meanwhile, the null space of A, or kernel \operatorname{Nul}(A), consists of all vectors x such that Ax = 0, and it is spanned by the remaining columns of V, specifically \operatorname{Nul}(A) = \operatorname{span}\{v_{r+1}, \dots, v_n\}, with dimension n - r.[7][16]
The left null space of A, denoted \operatorname{Nul}(A^H), comprises vectors y such that A^H y = 0 (or y^H A = 0), and it is spanned by the columns of U from r+1 to m, i.e., \operatorname{Nul}(A^H) = \operatorname{span}\{u_{r+1}, \dots, u_m\}, with dimension m - r.[7][16] A distinctive feature of the SVD is that it furnishes explicit orthonormal bases for all four fundamental subspaces, with the dimensions of the nonzero and null subspaces directly determined by the number and positions of the nonzero singular values.[17][12] This orthonormal structure arises from the unitarity of U and V, ensuring the basis vectors are orthogonal and normalized.[9]
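The subspace correspondences above translate directly into code. The sketch below (with an arbitrarily chosen rank-2 example matrix) extracts orthonormal bases for all four fundamental subspaces from a single SVD:

```python
import numpy as np

# Orthonormal bases of the four fundamental subspaces from one SVD.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])       # 4×3, rank 2 (row 2 = 2·row 1; col 3 = col 1 + col 2)

U, s, Vh = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))

col_space  = U[:, :r]      # basis of Col(A)   (dimension r)
row_space  = Vh[:r, :].T   # basis of Row(A)   (dimension r)
null_space = Vh[r:, :].T   # basis of Nul(A)   (dimension n − r)
left_null  = U[:, r:]      # basis of Nul(Aᴴ)  (dimension m − r)

assert np.allclose(A @ null_space, 0)     # A maps Nul(A) to zero
assert np.allclose(left_null.T @ A, 0)    # the left null space annihilates A
```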
Orthonormal bases and singular vectors
In the singular value decomposition of a matrix A \in \mathbb{C}^{m \times n}, the left singular vectors u_i (for i = 1, \dots, m) are defined as the eigenvectors of the Hermitian matrix AA^*, with corresponding eigenvalues \sigma_i^2, where \sigma_i \geq 0 are the singular values of A.[7] Similarly, the right singular vectors v_i (for i = 1, \dots, n) are the eigenvectors of the Hermitian matrix A^*A, also with eigenvalues \sigma_i^2.[7] These eigenvectors are typically ordered such that \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0 = \sigma_{r+1} = \dots = \sigma_{\min(m,n)}, where r is the rank of A.[7]
The matrices U = [u_1 \ \dots \ u_m] \in \mathbb{C}^{m \times m} and V = [v_1 \ \dots \ v_n] \in \mathbb{C}^{n \times n} are unitary, meaning their columns form orthonormal bases for \mathbb{C}^m and \mathbb{C}^n, respectively.[7] This orthonormality follows from the spectral properties of the Hermitian matrices AA^* and A^*A, which admit orthonormal eigenvector decompositions.[11] Consequently, the sets \{u_1, \dots, u_m\} and \{v_1, \dots, v_n\} satisfy \langle u_i, u_j \rangle = \delta_{ij} and \langle v_i, v_j \rangle = \delta_{ij} for all i, j, where \delta_{ij} is the Kronecker delta.[11]
A key pair of relations connects the singular vectors through the original matrix: for each i \leq r, A v_i = \sigma_i u_i and A^* u_i = \sigma_i v_i.[7] These equations show that A maps each right singular vector v_i to the corresponding left singular vector u_i scaled by \sigma_i, and that A^* acts in the reverse direction with the same scaling.[7] For indices i > r where \sigma_i = 0, the relations simplify to A v_i = 0 and A^* u_i = 0, implying that the corresponding singular vectors v_i (for i = r+1, \dots, n) span the orthogonal complement of the row space of A, while the u_i (for i = r+1, \dots, m) span the orthogonal complement of the column space of A.[7] This structure ensures the singular vectors provide a complete, orthonormal framework for decomposing A.[8]
Connections to Spectral Theory
Relation to eigenvalue decomposition
The singular value decomposition (SVD) of a matrix A generalizes the eigenvalue decomposition by establishing a connection through the associated symmetric matrices A^* A and A A^*, where A^* denotes the conjugate transpose. Specifically, the singular values \sigma_i of A are the nonnegative square roots of the eigenvalues of A^* A (or equivalently, of A A^*), and the right singular vectors (columns of V in the SVD A = U \Sigma V^*) are the corresponding eigenvectors of A^* A, while the left singular vectors (columns of U) are the eigenvectors of A A^*.[18][19] This relation holds for any real or complex matrix A of size m \times n, allowing SVD to apply to rectangular matrices where eigenvalue decomposition is undefined.[20]
In contrast to eigenvalue decomposition, which applies only to square matrices and requires them to be diagonalizable (i.e., A = Q \Lambda Q^{-1} for some invertible Q and diagonal \Lambda), SVD exists for every matrix, regardless of shape or symmetry, with the singular values uniquely determined.[18] For square matrices, the two decompositions are closely related when A is normal (satisfying A A^* = A^* A): the singular values are then the absolute values of the eigenvalues, and the left and right singular vectors coincide with the eigenvectors up to unit-modulus phase factors (sign flips in the real symmetric case); for positive semidefinite A the SVD reduces exactly to the eigendecomposition A = U \Sigma U^*.[18] In this case, the eigenvalue decomposition aligns closely with SVD, but its eigenvalues may be complex or negative, whereas singular values are always real and nonnegative.
SVD effectively provides an "eigen-decomposition" for rectangular matrices by facilitating generalizations like the polar decomposition and the Moore-Penrose pseudoinverse. In the polar form, any matrix A can be written as A = Q P, where Q is unitary (or orthogonal for real matrices) and P is positive semidefinite, with P derived from the SVD as V \Sigma V^*; this mirrors how eigenvalue decomposition diagonalizes symmetric matrices but extends to non-square cases.[21] Similarly, the Moore-Penrose pseudoinverse A^+, which generalizes matrix inversion, is computed directly from the SVD as A^+ = V \Sigma^+ U^*, where \Sigma^+ inverts the nonzero singular values, offering a least-squares solution framework absent in standard eigenvalue decomposition.[22]
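Both constructions can be read off a single SVD. The following sketch (the helper name pinv_and_polar is illustrative) builds the Moore-Penrose pseudoinverse and the polar factors from one call to numpy.linalg.svd; for a rectangular A, the "unitary" polar factor has orthonormal columns rather than being square:

```python
import numpy as np

# Pseudoinverse and polar factors built from one SVD (illustrative sketch).
def pinv_and_polar(A, rtol=1e-12):
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > rtol * s[0], 1.0 / s, 0.0)   # invert only the nonzero σ_i
    A_pinv = (Vh.conj().T * s_inv) @ U.conj().T       # A⁺ = V Σ⁺ U*
    Q = U @ Vh                                        # isometric polar factor
    P = (Vh.conj().T * s) @ Vh                        # P = V Σ V*, positive semidefinite
    return A_pinv, Q, P

A = np.random.default_rng(2).standard_normal((4, 3))
A_pinv, Q, P = pinv_and_polar(A)
assert np.allclose(A_pinv, np.linalg.pinv(A))
assert np.allclose(Q @ P, A)                          # polar form A = Q P
assert np.allclose(Q.T @ Q, np.eye(3))                # Q has orthonormal columns
```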
Singular values and spectral decomposition
The singular value decomposition (SVD) of an m \times n complex matrix A admits a spectral form that expresses A as a sum of rank-one matrices weighted by its singular values. Specifically, A = \sum_{i=1}^{\min(m,n)} \sigma_i \mathbf{u}_i \mathbf{v}_i^*, where \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0 are the singular values, \mathbf{u}_i are the columns of the unitary matrix U \in \mathbb{C}^{m \times m}, and \mathbf{v}_i are the columns of the unitary matrix V \in \mathbb{C}^{n \times n}.[8] If the rank r of A is less than \min(m,n), the sum can be restricted to the first r terms with \sigma_{r+1} = \cdots = \sigma_{\min(m,n)} = 0, while the full sum includes these zero contributions for completeness.[23] This decomposition generalizes the spectral theorem to arbitrary matrices, providing a canonical way to represent A through orthogonal transformations and diagonal scaling.[24]
The singular values \sigma_i can be interpreted as generalized eigenvalues, specifically the square roots of the eigenvalues of A^*A or AA^*, which capture the principal directions of variance or importance in the matrix's action.[23] Larger \sigma_i indicate greater stretching or compression along the corresponding singular vector directions, quantifying the matrix's effect on the unit sphere in a manner analogous to eigenvalues but applicable to non-square matrices.[8] For instance, in data analysis contexts like principal component analysis, the singular values order the components by their explanatory power, with the ratio of cumulative squared singular values assessing reconstruction quality.[23]
Key properties of the singular values include their non-increasing order, which ensures a natural hierarchy for truncation, and their role in norms such as the nuclear norm \|A\|_* = \sum_{i=1}^{\min(m,n)} \sigma_i, equivalent to the trace of the diagonal matrix \Sigma in the compact SVD form A = U \Sigma V^*.[25] This nuclear norm serves as a convex surrogate for rank minimization, promoting sparsity in the spectrum. The SVD achieves a unique canonical diagonalization under unitary transformations, where the diagonal entries are non-negative and ordered, distinguishing it as the optimal form for such factorizations up to phase ambiguities in the singular vectors.[24] For Hermitian matrices, this reduces to the standard spectral decomposition via eigenvalues.[23]
Proofs of Existence
Spectral theorem approach
The singular value decomposition (SVD) of a matrix A \in \mathbb{C}^{m \times n} can be established through the spectral theorem for Hermitian matrices, which guarantees the diagonalization of positive semidefinite operators. This approach leverages the fact that the matrices A^*A and AA^* are Hermitian and positive semidefinite, allowing their eigendecompositions to yield the singular vectors and values of A.[26][27]
Consider the matrix B = A^*A \in \mathbb{C}^{n \times n}. This matrix is Hermitian because (A^*A)^* = A^* (A^*)^* = A^*A. Moreover, B is positive semidefinite since for any x \in \mathbb{C}^n, x^* B x = x^* A^* A x = \|A x\|^2 \geq 0. By the spectral theorem for Hermitian matrices, there exists a unitary matrix V \in \mathbb{C}^{n \times n} and a diagonal matrix \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n) with \lambda_i \geq 0 for all i such that B = V \Lambda V^*. The columns of V, denoted v_1, \dots, v_n, form an orthonormal basis for \mathbb{C}^n consisting of eigenvectors of B, and the \lambda_i are the corresponding eigenvalues ordered non-increasingly.[7][28][29]
The singular values of A are defined as \sigma_i = \sqrt{\lambda_i} for i = 1, \dots, n, yielding non-negative real numbers ordered \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n \geq 0. The columns of V serve as the right singular vectors of A, satisfying A^* A v_i = \lambda_i v_i = \sigma_i^2 v_i. To obtain the left singular vectors, define u_i = A v_i / \sigma_i for each i where \sigma_i > 0. These u_i are orthonormal because \langle u_i, u_j \rangle = (1/\sigma_i \sigma_j) \langle A v_i, A v_j \rangle = (1/\sigma_i \sigma_j) \langle v_i, A^* A v_j \rangle = (1/\sigma_i \sigma_j) \sigma_j^2 \langle v_i, v_j \rangle = \delta_{ij}. For indices where \sigma_i = 0, extend the basis orthonormally in the null space of A^*. Similarly, the eigendecomposition of C = AA^* \in \mathbb{C}^{m \times m} yields C = U \Lambda' U^*, where U \in \mathbb{C}^{m \times m} is unitary, \Lambda' is diagonal with entries \lambda_i' (matching the \lambda_i from B for the first \min(m,n) and zeros otherwise), and the columns of U provide the left singular vectors. This establishes A = U \Sigma V^*, where \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_{\min(m,n)}, 0, \dots, 0) is the rectangular diagonal matrix of singular values.[27][30]
This construction naturally handles rectangular matrices, as A^*A is always n \times n and AA^* is m \times m, regardless of whether m > n or m < n; the zero singular values account for the difference in dimensions, leading to the reduced or compact SVD forms when restricting to nonzero singular values. The spectral theorem ensures completeness, as the eigenvectors span the full space, providing orthonormal bases for the domain and codomain. Regarding uniqueness, the singular values \sigma_i > 0 are uniquely determined, while singular vectors corresponding to distinct singular values are unique up to phase; for multiplicities (repeated \sigma_i), the corresponding eigenspaces admit any orthonormal basis therein, but the decomposition remains valid with appropriate choice.[31][32][26]
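The construction above can be carried out directly in code. The following sketch (restricted to a real matrix with full column rank, for simplicity) builds a thin SVD from the eigendecomposition of A^*A:

```python
import numpy as np

# Build an SVD of A from the eigendecomposition of AᵀA (full-rank case only).
rng = np.random.default_rng(3)
m, n = 5, 3
A = rng.standard_normal((m, n))

lam, V = np.linalg.eigh(A.T @ A)           # ascending eigenvalues of the Gram matrix
order = np.argsort(lam)[::-1]              # reorder to λ_1 ≥ ... ≥ λ_n
lam, V = lam[order], V[:, order]
sigma = np.sqrt(np.clip(lam, 0.0, None))   # σ_i = sqrt(λ_i)

U = (A @ V) / sigma                        # u_i = A v_i / σ_i  (assumes all σ_i > 0)

assert np.allclose(U.T @ U, np.eye(n))             # left singular vectors are orthonormal
assert np.allclose(A, U @ np.diag(sigma) @ V.T)    # thin SVD: A = U Σ Vᵀ
```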
Variational characterization
The singular values of a matrix A \in \mathbb{R}^{m \times n} admit a variational characterization that provides an optimization-based perspective on their existence and properties. The largest singular value is defined as
\sigma_1(A) = \max_{\|x\|_2 = 1} \|A x\|_2,
where the maximum is attained at the leading right singular vector v_1, which serves as the corresponding argmax.[33] Subsequent singular values are obtained by restricting the maximization to orthogonal complements of the preceding singular vectors: for k = 2, \dots, \min(m,n),
\sigma_k(A) = \max_{\|x\|_2 = 1, \, x \perp \operatorname{span}\{v_1, \dots, v_{k-1}\}} \|A x\|_2,
with v_k as the argmax in this subspace.[33] This process yields a nonincreasing sequence \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{\min(m,n)} \geq 0, and the right singular vectors \{v_k\} form an orthonormal basis for \mathbb{R}^n. A symmetric characterization holds for the left singular vectors via maximization of \|A^* y\|_2 over unit vectors y in orthogonal subspaces of \mathbb{R}^m.[34]
To establish this characterization, consider the dyadic form of the SVD, A = \sum_{j=1}^r \sigma_j u_j v_j^T, where r = \operatorname{rank}(A). For unit vectors u \perp \operatorname{span}\{u_1, \dots, u_{k-1}\} and v \perp \operatorname{span}\{v_1, \dots, v_{k-1}\}, the inner product satisfies |u^T A v| \leq \sigma_k \|u\|_2 \|v\|_2 by orthogonality and the definition of singular values, with equality when u = u_k and v = v_k. This implies \|A v\|_2 \leq \sigma_k in the restricted subspace, confirming the maximizers and the decreasing order without invoking the full spectral theorem.[33]
The variational characterization connects directly to the eigenvalues of the Hermitian matrix A^T A \in \mathbb{R}^{n \times n}, whose eigenvalues are precisely \sigma_i^2(A). Specifically, the Rayleigh quotient characterization for the eigenvalues of A^T A yields \sigma_k^2(A) = \max_{\|x\|_2 = 1, \, x \perp \operatorname{span}\{v_1, \dots, v_{k-1}\}} \frac{x^T A^T A x}{x^T x} = \max_{\|x\|_2 = 1, \, x \perp \operatorname{span}\{v_1, \dots, v_{k-1}\}} \|A x\|_2^2, aligning with the singular value definition.[34] This equivalence, rooted in the Courant-Fischer min-max theorem applied to A^T A, demonstrates the singular values as the square roots of the ordered eigenvalues of A^T A.[33]
This optimization framework offers algorithmic insight into iterative methods for computing SVD, such as the power method applied to A^T A, by successively maximizing the Rayleigh quotient in deflated subspaces without requiring prior knowledge of the spectral theorem.[34]
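As a concrete illustration of this algorithmic connection, the sketch below applies power iteration to A^T A to recover the leading singular triplet; the function name, seed, and iteration count are illustrative choices:

```python
import numpy as np

# Power iteration on AᵀA: repeatedly maximizing the Rayleigh quotient
# recovers the leading right singular vector and σ_1.
def leading_singular_triplet(A, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A.T @ (A @ x)              # one application of AᵀA
        x /= np.linalg.norm(x)
    sigma1 = np.linalg.norm(A @ x)     # σ_1 = ||A v_1||
    u1 = A @ x / sigma1
    return sigma1, u1, x

A = np.random.default_rng(4).standard_normal((8, 5))
sigma1, u1, v1 = leading_singular_triplet(A)
assert np.isclose(sigma1, np.linalg.svd(A, compute_uv=False)[0])
```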
Computation
Jacobi algorithms
Jacobi algorithms for computing the singular value decomposition (SVD) of a matrix A \in \mathbb{R}^{m \times n} are iterative methods that rely on successive applications of plane rotations to progressively diagonalize A. These methods extend the classical Jacobi iteration for symmetric eigenvalue problems, originally developed by Carl Gustav Jacob Jacobi in 1846, to the nonsymmetric case of SVD. Unlike direct methods such as QR iteration, Jacobi algorithms emphasize orthogonal transformations that target off-diagonal elements, making them particularly suitable for parallel implementation and high relative accuracy in singular values.[35]
The two-sided Jacobi algorithm, first proposed by Kogbetliantz in 1955, applies rotations simultaneously from both the left and right of A to zero out off-diagonal entries, akin to an RQ decomposition process that symmetrizes and diagonalizes 2×2 subblocks iteratively.[36] In each step, a pair of indices (p, q) is selected, typically the one with the largest off-diagonal magnitude or in cyclic order, and two Givens rotations—one acting on rows and one on columns—are computed to eliminate the (p,q) and (q,p) entries in the current matrix. These rotations accumulate into the left singular vector matrix U and right singular vector matrix V, while the diagonal entries converge to the singular values \Sigma. The process continues over multiple sweeps until the Frobenius norm of the off-diagonal part falls below a tolerance. This approach handles multiple singular values effectively by reducing the off-diagonal sum of squares quadratically in each sweep.[35]
The one-sided Jacobi algorithm, introduced by Hestenes in 1958, simplifies the process by applying rotations only from the right to A, implicitly diagonalizing A^T A without forming it explicitly, which yields the right singular vectors V and a bidiagonal or diagonal form for \Sigma, from which U is derived. For selected column indices (i,j), a rotation J_{i,j} is determined by solving a 2×2 eigenvalue problem on the submatrix of A^T A, zeroing the off-diagonal entry (A^T A)_{i,j} while updating the columns of A and accumulating into V. Once the columns of the updated AV are sufficiently orthogonal, the singular values are the Euclidean norms of these columns, and U is obtained by normalizing them. This method converges quadratically and is advantageous for rectangular matrices, as it avoids left-side rotations.[37]
Both variants follow a common iterative framework: initialize U = I_m and V = I_n (or just V for one-sided); repeat sweeps selecting off-diagonal pairs (e.g., by maximum norm for faster convergence or cyclic for simplicity); apply Givens rotations to target the pair, updating the matrix and accumulators; terminate when the maximum off-diagonal entry falls below a small threshold \epsilon, typically 10^{-12} times the matrix norm. A pseudocode outline for the one-sided variant, representative of the rotation-based structure, is as follows:
Input: A (m × n matrix, m ≥ n)
V ← eye(n)   // n × n identity
tolerance ← small value (e.g., machine epsilon * ||A||_F)
while max off-diagonal of A^T A > tolerance:
    for each pair (i, j) with i < j (cyclic or by largest |a_i^T a_j|):
        Compute rotation parameters c, s from the 2×2 subproblem on [||a_i||^2, a_i^T a_j; a_j^T a_i, ||a_j||^2]
        Apply right rotation to columns i, j of A: A[:, [i,j]] ← A[:, [i,j]] * [c, -s; s, c]
        Update V: V[:, [i,j]] ← V[:, [i,j]] * [c, -s; s, c]
    // End sweep
Σ ← diag([||A[:,k]||_2 for k = 1 to n])
U ← A * inv(Σ)   // Normalize columns
Output: U Σ V^T
When A is symmetric, both algorithms converge to the eigenvalue decomposition, since the SVD then coincides with the spectral decomposition up to the signs of any negative eigenvalues.[35] The individual rotations are closely related to the analytic 2×2 SVD, where exact parameters are available in closed form, but in the iterative setting approximate solutions suffice for larger matrices. Modern implementations, such as those in LAPACK, enhance these methods with block processing for better performance on parallel architectures while preserving quadratic convergence, typically requiring 5–10 sweeps for n \leq 1000.[38]
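For concreteness, the following is a simplified, runnable Python rendering of the one-sided (Hestenes) iteration outlined in the pseudocode above; it assumes a real input with full column rank, uses plain cyclic sweeps, and is a sketch rather than an optimized or fully robust implementation:

```python
import numpy as np

# One-sided (Hestenes) Jacobi SVD for a real m×n matrix with m ≥ n.
def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    A = A.astype(float).copy()
    m, n = A.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        off = 0.0
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = A[:, i] @ A[:, i]
                beta = A[:, j] @ A[:, j]
                gamma = A[:, i] @ A[:, j]
                off = max(off, abs(gamma) / np.sqrt(alpha * beta))
                if abs(gamma) < tol * np.sqrt(alpha * beta):
                    continue
                # Jacobi rotation zeroing the (i, j) entry of AᵀA.
                zeta = (beta - alpha) / (2.0 * gamma)
                if zeta != 0.0:
                    t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta**2))
                else:
                    t = 1.0
                c = 1.0 / np.sqrt(1.0 + t**2)
                s = c * t
                J = np.array([[c, s], [-s, c]])
                A[:, [i, j]] = A[:, [i, j]] @ J   # rotate the two columns of A
                V[:, [i, j]] = V[:, [i, j]] @ J   # accumulate the same rotation in V
        if off < tol:                             # columns are numerically orthogonal
            break
    sigma = np.linalg.norm(A, axis=0)
    order = np.argsort(sigma)[::-1]               # sort singular values decreasingly
    sigma, A, V = sigma[order], A[:, order], V[:, order]
    U = A / sigma                                 # normalize columns (full column rank)
    return U, sigma, V

B = np.random.default_rng(5).standard_normal((6, 4))
U, s, V = one_sided_jacobi_svd(B)
assert np.allclose(U @ np.diag(s) @ V.T, B)
assert np.allclose(s, np.linalg.svd(B, compute_uv=False))
```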
Numerical methods
One prominent class of numerical methods for computing the singular value decomposition (SVD) of a matrix involves QR-based techniques, which proceed in two main phases. First, the matrix is reduced to upper bidiagonal form through a series of Householder reflections applied alternately from the left and right, a process known as Golub-Kahan bidiagonalization.[39] This reduction preserves the singular values and is numerically stable, requiring O(m n^2) operations for an m × n matrix with m ≥ n.[39] In the second phase, the singular values of the bidiagonal matrix are computed using a variant of the QR algorithm, often with shifts to accelerate convergence, followed by back-substitution to obtain the singular vectors if needed.[40]
For larger matrices, particularly those that are sparse or structured, divide-and-conquer algorithms offer improved efficiency over traditional iterative methods. These algorithms recursively split the bidiagonal matrix into smaller subproblems, compute the SVDs of the submatrices independently, and then merge the results using rank-revealing decompositions to handle interactions between singular values near crossings.[41] Introduced by Gu and Eisenstat, this approach achieves O(n^2) complexity for the bidiagonal SVD phase and is particularly effective for parallel implementation on distributed systems, making it suitable for matrices with dimensions exceeding 10,000.[41]
SVD algorithms are designed to ensure backward stability, meaning the computed decomposition corresponds to the exact SVD of a slightly perturbed input matrix, with the perturbation bounded by machine epsilon times the matrix norm. The forward error in singular values is then controlled by the condition number κ = σ_max / σ_min, where small σ_min amplifies relative errors in ill-conditioned problems. This stability holds for both QR-based and divide-and-conquer methods under standard floating-point arithmetic, though near-zero singular values can pose challenges due to rounding errors that may distort the gap between them and the next larger value.[42]
Standard implementations of these methods are provided in the LAPACK library, with routines like DGESVD (QR-based) and DGESDD (divide-and-conquer) serving as de facto benchmarks for dense matrices up to moderate sizes. For near-zero singular values, LAPACK employs deflation techniques to isolate and refine them accurately, mitigating floating-point underflow issues. In the 2020s, advancements in parallel computing have extended these routines to machine learning-scale matrices, with libraries like PRIMME_SVDS enabling preconditioned, distributed SVD computations on clusters for problems involving billions of entries, achieving speedups of 10-50x over serial versions on GPU-accelerated nodes.[43]
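In SciPy, the choice between these two LAPACK paths is exposed through the lapack_driver argument of scipy.linalg.svd; the sketch below simply runs both drivers on the same random matrix and checks that the computed singular values agree:

```python
import numpy as np
from scipy.linalg import svd

# 'gesdd' selects the divide-and-conquer routine (DGESDD, the SciPy default);
# 'gesvd' selects the QR/bidiagonalization-based DGESVD.
A = np.random.default_rng(6).standard_normal((500, 100))

U1, s1, Vh1 = svd(A, full_matrices=False, lapack_driver="gesdd")
U2, s2, Vh2 = svd(A, full_matrices=False, lapack_driver="gesvd")

assert np.allclose(s1, s2)    # both drivers agree on the singular values
```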
Analytic 2×2 case
For a general 2×2 matrix A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, the singular values \sigma_1 \geq \sigma_2 \geq 0 are the square roots of the eigenvalues of A^* A, where A^* denotes the conjugate transpose. These eigenvalues \lambda_{1,2} solve the characteristic equation \lambda^2 - \operatorname{tr}(A^* A) \lambda + \det(A^* A) = 0, yielding the closed-form expression
\lambda_{1,2} = \frac{\|A\|_F^2 \pm \sqrt{ \|A\|_F^4 - 4 \det(A^* A) }}{2},
with the Frobenius norm \|A\|_F^2 = |a|^2 + |b|^2 + |c|^2 + |d|^2 = \operatorname{tr}(A^* A) and \det(A^* A) = |\det(A)|^2 = |ad - bc|^2. Thus, \sigma_{1,2} = \sqrt{\lambda_{1,2}}.[44]
The unitary matrices U and V in the SVD A = U \Sigma V^*, with \Sigma = \operatorname{diag}(\sigma_1, \sigma_2), consist of the left and right singular vectors, respectively. For real matrices, U and V are orthogonal matrices (rotations, possibly combined with a reflection when \det A < 0) parameterized by angles \theta_U and \theta_V, computed as \theta = \frac{1}{2} \operatorname{atan2}(2 \cdot \text{off-diagonal}, \text{diagonal difference}) applied to AA^T and A^T A, respectively, ensuring numerical stability via the two-argument arctangent function. For complex entries, additional phase factors e^{i \phi} are incorporated into the columns of U and V to render the singular values real and non-negative, providing insight into how complex phases distribute between the factors without affecting the magnitudes.[44][45]
Special cases simplify the expressions. If A is diagonal, then U = V = I_2 (up to signs/phases and a common permutation when |d| > |a|) and \sigma_1 = \max(|a|, |d|), \sigma_2 = \min(|a|, |d|). For symmetric A (real or Hermitian), the SVD coincides with the eigenvalue decomposition if A is positive semidefinite, with U = V (up to phases in the complex case). Permutation matrices, such as A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, are themselves unitary, so both singular values equal 1 and one natural choice is U = A, \Sigma = I_2, V = I_2; because the singular values coincide, the factors are far from unique, since any unitary V with U = AV also works. These closed-form solutions enable exact computation without iteration, offering efficiency in resource-limited embedded systems for tasks like signal processing, as demonstrated in FPGA implementations post-2010.[44][46]
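The closed-form expressions can be coded directly; the following sketch (the helper name singular_values_2x2 is illustrative) reproduces the singular values of the 2×2 example used earlier and compares them with NumPy:

```python
import numpy as np

# Closed-form singular values of a 2×2 matrix [[a, b], [c, d]].
def singular_values_2x2(a, b, c, d):
    frob2 = abs(a)**2 + abs(b)**2 + abs(c)**2 + abs(d)**2   # ‖A‖_F²
    det2 = abs(a * d - b * c)**2                            # det(A*A) = |det A|²
    disc = np.sqrt(max(frob2**2 - 4.0 * det2, 0.0))         # guard tiny negative round-off
    lam1 = (frob2 + disc) / 2.0
    lam2 = (frob2 - disc) / 2.0
    return np.sqrt(lam1), np.sqrt(max(lam2, 0.0))

A = np.array([[3.0, 0.0], [4.0, 5.0]])
print(singular_values_2x2(*A.ravel()))        # ≈ (6.7082, 2.2361)
print(np.linalg.svd(A, compute_uv=False))     # matches NumPy
```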
Thin SVD
The thin singular value decomposition (SVD), also known as the economy SVD, of an m \times n matrix A (with m \geq n) is A = U \Sigma V^*, where U is an m \times n matrix whose columns are the left singular vectors, \Sigma is an n \times n diagonal matrix containing the n singular values \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0 along its main diagonal (including zeros if \operatorname{rank}(A) < n), and V is an n \times n unitary matrix whose columns are the right singular vectors.[47] This form captures all possible singular values up to \min(m,n) while using rectangular U to minimize storage, with total elements m n + n + n^2, which is less than the full SVD's m^2 + m n + n^2 when m > n.[47]
The columns of U are orthonormal (U^* U = I_n) and span the leading n dimensions of the left singular subspace, while the columns of V satisfy V^* V = I_n.[47] If \operatorname{rank}(A) = r < n, the last n - r singular values are zero, and the corresponding columns of V span the null space of A, while extra columns of U contribute to the left null space basis. This preserves the structure of the range and null spaces up to the matrix dimensions, providing an efficient factorization that exactly reconstructs A without the full square U.[47] For the case m < n, the roles are reversed, with V rectangular n \times m. Unlike the full SVD, the thin SVD omits the extraneous parts of the unitary matrices beyond \min(m,n), tailored for computational efficiency.[47]
This form is advantageous for rectangular matrices, reducing computation and storage of null space components beyond the smaller dimension, enhancing efficiency in large-scale problems or memory-limited settings.[47] For example, when m \gg n, it avoids computing the full m \times m U, saving space proportional to m(m - n).[47] It serves as a basis for further reductions like the compact SVD by truncating to the rank if known.[47]
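In NumPy, the thin form is obtained by passing full_matrices=False to numpy.linalg.svd, as in the sketch below:

```python
import numpy as np

# Thin ("economy") SVD in NumPy: U is m×n instead of m×m when m ≥ n.
rng = np.random.default_rng(7)
m, n = 1000, 50
A = rng.standard_normal((m, n))

U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vh.shape)          # (1000, 50) (50,) (50, 50)

assert np.allclose(A, (U * s) @ Vh)        # exact reconstruction from the thin form
assert np.allclose(U.T @ U, np.eye(n))     # U*U = I_n (orthonormal columns)
```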
Compact SVD
The compact singular value decomposition (SVD) of an m \times n matrix A with rank r = \operatorname{rank}(A) \leq \min(m,n) is given by A = U_r \Sigma_r V_r^*, where U_r is an m \times r matrix whose columns are the left singular vectors corresponding to the r nonzero singular values, \Sigma_r is an r \times r diagonal matrix containing these positive singular values \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 along its main diagonal, and V_r is an n \times r matrix whose columns are the right singular vectors.[48] This factorization focuses exclusively on the nonzero singular triplets, minimizing storage to m r + r + n r, significantly less when r \ll \min(m,n).[49]
The columns of U_r form an orthonormal basis for the column space of A (satisfying U_r^* U_r = I_r), and the columns of V_r form an orthonormal basis for the row space of A (V_r^* V_r = I_r).[48] This minimal form exactly reconstructs A while discarding all null space contributions associated with zero singular values, providing the most reduced exact decomposition.[49]
Compared to the thin SVD, the compact SVD further reduces dimensions when the matrix is rank-deficient (r < \min(m,n)), excluding the zero singular values and their vectors; they coincide when r = \min(m,n) (full rank).[48] It balances between the full SVD's completeness and the thin SVD's dimensional efficiency, useful for low-rank matrices where highlighting the effective rank is key, and serves as the starting point for truncated approximations.[49]
Truncated SVD
The truncated singular value decomposition (TSVD) of an m \times n matrix A with rank r is a rank-k approximation A_k = \sum_{i=1}^k \sigma_i u_i v_i^*, where k \leq r, the \sigma_i (with \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0) are the singular values of A, and u_i, v_i are the corresponding left and right singular vectors.[50] This form arises by retaining only the k largest singular triplets from the compact SVD A = U_r \Sigma_r V_r^*, effectively setting the remaining singular values to zero.[50] Among all possible rank-k matrices, A_k provides the optimal approximation in both the spectral norm (operator 2-norm) and the Frobenius norm.[51]
The Eckart–Young theorem quantifies the approximation quality, stating that the minimal error in the 2-norm is \|A - A_k\|_2 = \sigma_{k+1}, while the squared Frobenius norm error is \|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2.[51] These bounds highlight the direct connection between the discarded singular values and the incurred error, making TSVD a principled method for dimensionality reduction.[50]
A key property of TSVD is the monotonic decrease in approximation error as k increases: since the singular values are ordered non-increasingly, adding more terms reduces the residual sum \sum_{i=k+1}^r \sigma_i^2 by incorporating progressively smaller but positive contributions.[50] This ensures that higher truncation levels yield strictly better approximations in the Frobenius norm, with the 2-norm error dropping to the next singular value threshold.[51]
TSVD is optimal for noise reduction, as smaller singular values often capture noise rather than signal, allowing the approximation to filter perturbations while retaining dominant structure.[52]
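The truncation and the Eckart–Young error formulas can be verified directly, as in the following sketch on a random matrix:

```python
import numpy as np

# Rank-k truncation and the Eckart–Young error bounds.
rng = np.random.default_rng(8)
A = rng.standard_normal((60, 40))
U, s, Vh = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = (U[:, :k] * s[:k]) @ Vh[:k, :]       # A_k = Σ_{i≤k} σ_i u_i v_iᵀ

err2 = np.linalg.norm(A - A_k, 2)          # spectral-norm error
errF = np.linalg.norm(A - A_k, 'fro')      # Frobenius-norm error
assert np.isclose(err2, s[k])                          # = σ_{k+1}
assert np.isclose(errF, np.sqrt(np.sum(s[k:]**2)))     # = sqrt(Σ_{i>k} σ_i²)
```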
Applications
Pseudoinverse and least squares
The Moore-Penrose pseudoinverse of a matrix A \in \mathbb{C}^{m \times n}, denoted A^+, is uniquely defined as the matrix satisfying the four Penrose equations: A A^+ A = A, A^+ A A^+ = A^+, (A A^+)^H = A A^+, and (A^+ A)^H = A^+ A, where ^H denotes the conjugate transpose.[53] These properties ensure that A^+ provides the unique solution of minimal Euclidean norm to consistent linear systems and the minimal residual solution to inconsistent ones.[53]
Given the singular value decomposition A = U \Sigma V^H, where U and V are unitary and \Sigma is diagonal with nonnegative singular values \sigma_i in decreasing order, the pseudoinverse is computed as A^+ = V \Sigma^+ U^H. Here, \Sigma^+ is the diagonal matrix that reciprocates the nonzero singular values (1/\sigma_i for \sigma_i > 0) and sets zeros for the remaining entries.[39] This formulation leverages the SVD to directly obtain A^+ in a numerically stable manner, preserving the Hermitian symmetry and idempotence properties of the pseudoinverse.[39]
In the context of least squares problems, consider the overdetermined system A \mathbf{x} = \mathbf{b} where A has full column rank. The solution \mathbf{x} = A^+ \mathbf{b} minimizes the Euclidean residual \|A \mathbf{x} - \mathbf{b}\|_2.[54] For an underdetermined consistent system, whose solution set is a translate of the null space of A, \mathbf{x} = A^+ \mathbf{b} selects the unique solution of minimal Euclidean norm.[54]
The SVD also facilitates total least squares (TLS), which addresses errors in both A and \mathbf{b} by minimizing perpendicular distances in the augmented system [A \mid \mathbf{b}]. The TLS solution is derived from the right singular vector corresponding to the smallest singular value of the augmented matrix, providing a robust alternative to ordinary least squares for noisy data.[55]
For ill-conditioned matrices, where small singular values \sigma_{\min} amplify errors, the pseudoinverse can be regularized by thresholding: set 1/\sigma_i = 0 for \sigma_i < \tau (a tolerance based on machine precision and matrix conditioning), yielding a rank-deficient approximation that stabilizes computations.[54]
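Putting these pieces together, the sketch below (the helper name lstsq_svd and the default threshold are illustrative) solves a least-squares problem via the SVD with a singular-value cutoff and compares the result with numpy.linalg.lstsq:

```python
import numpy as np

# SVD-based least squares with a singular-value threshold τ relative to σ_1.
def lstsq_svd(A, b, rcond=None):
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    if rcond is None:
        rcond = max(A.shape) * np.finfo(float).eps
    s_inv = np.where(s > rcond * s[0], 1.0 / s, 0.0)   # drop directions with σ_i < τ
    return Vh.T @ (s_inv * (U.T @ b))                  # x = V Σ⁺ Uᵀ b

rng = np.random.default_rng(9)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)
x = lstsq_svd(A, b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```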
Low-rank approximation and compression
One of the primary applications of singular value decomposition (SVD) is in low-rank matrix approximation, where a matrix A \in \mathbb{R}^{m \times n} is approximated by a lower-rank matrix A_k of rank at most k, with k \ll \min(m, n). The truncated SVD achieves this by retaining only the k largest singular values and their corresponding left and right singular vectors, yielding A_k = U_k \Sigma_k V_k^T. This approximation is optimal in the sense that it minimizes the approximation error \|A - B\|_F over all matrices B of rank at most k, where \|\cdot\|_F denotes the Frobenius norm; the same holds for the spectral norm \|\cdot\|_2. This result is established by the Eckart–Young theorem, which proves that no other rank-k matrix can provide a smaller error in these norms.[56]
The singular value spectrum of A, given by the diagonal entries of \Sigma, quantifies the information content and potential loss in such approximations. The error \|A - A_k\|_F equals the square root of the sum of the squares of the discarded singular values \sigma_{k+1}^2 + \cdots + \sigma_{\min(m,n)}^2, providing a direct measure of the information discarded. This spectrum-based assessment has informed modern compression techniques, including hybrids that combine SVD with wavelet transforms, such as wavelet difference reduction applied post-SVD for improved lossy compression ratios in the 2010s.[57]
In data compression, the truncated SVD enables efficient storage by representing A via the factorized components U_k, \Sigma_k, and V_k^T instead of the full matrix. For an m \times n matrix, full storage requires mn entries, whereas the truncated form needs approximately k(m + n + 1) entries, yielding substantial savings when k is small relative to \min(m, n). This approach is particularly effective for sparse or low-intrinsic-rank data, as the rapid decay of singular values allows high-fidelity reconstruction with minimal storage.
A classic example is grayscale image compression, where an image is represented as an m \times n pixel intensity matrix A. Computing the SVD of A and retaining the top k components reconstructs a compressed approximation A_k that preserves essential visual features while reducing file size. For instance, in a 512 × 512 image, using k = 20 can achieve over 90% compression while often preserving visual quality for natural images, as the dominant singular values capture the primary structural patterns like edges and contrasts.[58]
Principal component analysis
Singular value decomposition (SVD) provides an efficient and numerically stable method for computing principal component analysis (PCA), a technique for dimensionality reduction that identifies directions of maximum variance in a dataset.[59] For a centered data matrix X \in \mathbb{R}^{m \times n}, where m represents the number of samples and n the number of features, the SVD is given by X = U \Sigma V^*, with U \in \mathbb{R}^{m \times m} and V \in \mathbb{R}^{n \times n} orthogonal matrices, and \Sigma a diagonal matrix containing the singular values \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0. The columns of V correspond to the principal components, which are the directions of maximum variance, while the singular values quantify the scale of variance along these directions.[59]
The sample variances associated with each principal component are \frac{\sigma_i^2}{m-1} for i = 1, \dots, \min(m,n), derived from the eigenvalues of the sample covariance matrix \frac{1}{m-1} X^T X. Centering the data by subtracting the mean from each feature column is essential prior to SVD, as it ensures the principal components capture variance orthogonal to the mean direction; without centering, the first singular vector may align with the mean, distorting the results. To obtain a reduced representation, the data is projected onto the top k right singular vectors: Y = X V_k, where V_k consists of the first k columns of V, yielding a lower-dimensional matrix Y \in \mathbb{R}^{m \times k} that preserves the most significant variance.[59]
The singular values provide a direct measure of explained variance, with the proportion explained by the i-th component given by \frac{\sigma_i^2}{\sum_j \sigma_j^2}, allowing assessment of how much information each principal component retains. A scree plot, which graphs the singular values (or their squares) against component index, aids in selecting k by identifying an "elbow" where additional components contribute diminishing variance. SVD-based PCA is particularly advantageous for high-dimensional data, avoiding explicit computation of the covariance matrix, which can be ill-conditioned or infeasible when n \gg m.[59]
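A minimal SVD-based PCA, assuming a samples-by-features data matrix, can be written as follows; the cross-check against the covariance-matrix eigenvalues mirrors the relation \sigma_i^2/(m-1) described above:

```python
import numpy as np

# PCA of an m×n data matrix via SVD of the centered data.
rng = np.random.default_rng(10)
m, n, k = 500, 20, 3
X = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))   # correlated features

Xc = X - X.mean(axis=0)                    # center each feature column
U, s, Vh = np.linalg.svd(Xc, full_matrices=False)

explained_var = s**2 / (m - 1)             # variance along each principal direction
explained_ratio = s**2 / np.sum(s**2)      # proportion of variance explained
Y = Xc @ Vh[:k].T                          # scores: projection onto the top-k components

# Cross-check against the covariance-matrix route.
cov = Xc.T @ Xc / (m - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
assert np.allclose(eigvals[:k], explained_var[:k])
```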
In modern machine learning, SVD's role in PCA extends to nonlinear extensions like kernel PCA, where the eigenvalue problem in the feature space—analogous to SVD in the input space—enables capturing nonlinear structures via kernel functions. Similarly, in the 2020s, integrations with autoencoders treat linear autoencoders as equivalent to PCA via SVD, while deep variants build on this for nonlinear dimensionality reduction, often outperforming PCA in complex tasks like image classification but at higher computational cost.
Additional applications
The Kabsch algorithm employs singular value decomposition to determine the optimal rotation matrix that minimizes the root-mean-square deviation (RMSD) between two sets of atomic coordinates, facilitating the superposition of molecular structures in crystallography and structural biology. Given two point sets represented as matrices P and Q, the algorithm computes the covariance matrix H = P Q^T, performs SVD H = U \Sigma V^T, and constructs the rotation as R = V U^T, ensuring the alignment preserves handedness by checking the determinant. This method has become standard for protein structure comparison, enabling precise RMSD calculations essential for validating docking simulations and conformational analysis.
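A minimal sketch of the Kabsch step, assuming the two coordinate sets are stored row-wise as N×3 arrays and have already been centered, is shown below; the helper name kabsch_rotation is illustrative:

```python
import numpy as np

# Kabsch step: R rotates the (centered) point set P onto Q in the least-RMSD sense.
def kabsch_rotation(P, Q):
    H = P.T @ Q                              # 3×3 covariance matrix
    U, s, Vh = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vh.T @ U.T))   # +1 or −1: fix handedness (no reflections)
    D = np.diag([1.0, 1.0, d])
    return Vh.T @ D @ U.T                    # optimal proper rotation R

# Synthetic check: recover a known rotation applied to random points.
rng = np.random.default_rng(11)
P = rng.standard_normal((30, 3))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
Q = P @ R_true.T                             # rotate each row of P
R = kabsch_rotation(P, Q)
assert np.allclose(R, R_true)
assert np.allclose((R @ P.T).T, Q)
```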
In signal processing, SVD enables noise filtering by decomposing time-series data into singular modes and truncating those with small singular values corresponding to noise, thereby reconstructing a denoised signal while preserving dominant components.[60] For instance, in experimental data such as particle image velocimetry (PIV), the matrix of signal measurements undergoes SVD, allowing selective retention of low-rank approximations that capture underlying patterns, with the truncation threshold often determined by estimating the noise level from singular value spectra.[60] This approach outperforms traditional filters in non-stationary environments, as demonstrated in reweighted SVD applications for fault detection in machinery signals.[61]
The multilinear singular value decomposition (MLSVD), or higher-order SVD, extends matrix SVD to tensors, providing a low multilinear rank approximation for higher-dimensional data by factorizing a tensor into orthogonal mode matrices and a core tensor. This decomposition is particularly useful for separable models in multidimensional arrays, such as video compression or genomic data analysis, where it reveals latent structures by unfolding the tensor along each mode and applying SVD sequentially. In practice, MLSVD minimizes the Frobenius norm error for multilinear approximations, enabling efficient storage and computation in fields like neuroimaging by reducing tensor dimensions while retaining essential variability.[62]
The orthogonal Procrustes problem seeks the nearest orthogonal matrix Q to a given matrix O in the Frobenius norm, solved via SVD of O = U \Sigma V^* yielding Q = U V^*, which projects O onto the orthogonal group without altering its singular values. This formulation arises in shape analysis and sensor calibration, where aligning datasets requires orthogonal transformations, and the solution guarantees minimality due to the Eckart-Young theorem's extension to unitary constraints.
In recommender systems, SVD facilitates latent factor models by performing low-rank matrix factorization on user-item interaction matrices, extracting hidden features that predict preferences through truncated approximations. The decomposition R \approx U \Sigma V^T identifies user and item embeddings in a shared space, with regularization on singular values preventing overfitting in sparse datasets, as evidenced by its role in the Netflix Prize where it improved recommendation accuracy by capturing collaborative filtering patterns.
Singular value decomposition supports quantum state tomography by enabling low-rank approximations of density matrices reconstructed from partial measurements, aiding error mitigation in noisy quantum devices through truncation of insignificant singular values.[63] Recent advances in 2023 integrate SVD-based matrix completion with fermionic reduced density matrices, enhancing fidelity in multi-particle state estimation and filtering experimental noise for quantum machine learning applications like error-corrected simulations.[63] This approach reduces measurement overhead, achieving chemical accuracy in reconstructing two-particle correlations while suppressing decoherence effects in variational quantum eigensolvers.[63]
Norms and Variations
Ky Fan and Hilbert–Schmidt norms
The Ky Fan k-norms form a family of unitarily invariant norms on matrices, defined directly from the singular values obtained via singular value decomposition. For an m \times n matrix A with singular values \sigma_1(A) \geq \sigma_2(A) \geq \cdots \geq \sigma_{\min(m,n)}(A) \geq 0, the Ky Fan k-norm is
\|A\|_{(k)} = \sum_{i=1}^k \sigma_i(A),
where k ranges from 1 to \min(m,n). These norms satisfy the properties of a matrix norm, including submultiplicativity and the triangle inequality, and are symmetric in the sense that they depend only on the ordered singular values. When k=1, the Ky Fan 1-norm reduces to the spectral norm (or operator 2-norm), \|A\|_2 = \sigma_1(A), which measures the maximum stretch factor induced by A.[64][65]
The Ky Fan k-norms are unitarily invariant: for unitary matrices U \in \mathbb{C}^{m \times m} and V \in \mathbb{C}^{n \times n}, \|U A V\|_{(k)} = \|A\|_{(k)}, reflecting the fact that SVD singular values remain unchanged under unitary transformations. They represent a specific case within the broader class of unitarily invariant norms, akin to a truncated variant of Schatten p-norms at p = \infty; while the Schatten \infty-norm is simply the largest singular value, the Ky Fan k-norm extends this by summing the k largest, providing a measure of the combined effect of the dominant singular directions.[64][66]
A key property linking Ky Fan norms to SVD approximations is their role in low-rank optimization. The Eckart–Young–Mirsky theorem, generalized to unitarily invariant norms, states that the minimum Ky Fan k-norm distance from A to any rank-at-most-k matrix B is achieved precisely by the rank-k truncation of the SVD of A:
\min_{\operatorname{rank}(B) \leq k} \|A - B\|_{(k)} = \sum_{j=1}^{\min(k, p-k)} \sigma_{k+j}(A),
where p = \min(m,n). This optimality underscores the utility of SVD in bounding approximation errors via tail singular values.[66][67]
The Hilbert–Schmidt norm, equivalent to the Frobenius norm for finite matrices, is another unitarily invariant norm expressible through SVD singular values. It is defined as
\|A\|_{HS} = \sqrt{\sum_{i=1}^{\min(m,n)} \sigma_i(A)^2} = \sqrt{\operatorname{tr}(A^* A)},
where A^* denotes the conjugate transpose. This norm arises as the Schatten p-norm for p=2, capturing the Euclidean length of the singular value vector in \ell_2 sense, and it equals the \ell_2 norm of the entries of A. Like the Ky Fan norms, it is unitarily invariant: \|U A V\|_{HS} = \|A\|_{HS}. The Hilbert–Schmidt norm quantifies the total "energy" of A across all singular directions, making it particularly useful for assessing overall magnitude in finite-dimensional settings.[64][68]
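Both families of norms are simple functions of the singular values, as the following sketch illustrates:

```python
import numpy as np

# Ky Fan k-norms and the Hilbert–Schmidt (Frobenius) norm from singular values.
A = np.random.default_rng(12).standard_normal((7, 5))
s = np.linalg.svd(A, compute_uv=False)       # σ_1 ≥ σ_2 ≥ ...

def ky_fan(k):
    return np.sum(s[:k])                     # ‖A‖_(k) = σ_1 + ... + σ_k

assert np.isclose(ky_fan(1), np.linalg.norm(A, 2))     # k = 1: spectral norm
assert np.isclose(ky_fan(len(s)), np.sum(s))           # k = min(m,n): nuclear norm

hs = np.sqrt(np.sum(s**2))                   # ‖A‖_HS = sqrt(Σ σ_i²)
assert np.isclose(hs, np.linalg.norm(A, 'fro'))
assert np.isclose(hs, np.sqrt(np.trace(A.T @ A)))
```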
Generalizations to operators
The singular value decomposition (SVD) extends to bounded linear operators between Hilbert spaces. For a bounded operator T: H \to K where H and K are complex Hilbert spaces, the singular values \sigma_i(T) are defined as the square roots of the eigenvalues of the positive self-adjoint operator T^* T, arranged in decreasing order; these are also known as the s-numbers of T.[69][70] The SVD in this setting arises from the polar decomposition T = U |T|, where U is a partial isometry with initial space the closure of the range of |T| = \sqrt{T^* T}, and the singular values capture the "stretching" factors along principal directions.[70][71]
For compact operators, a special class of bounded operators that map bounded sets to precompact sets, the SVD takes a more explicit series form. Specifically, T admits a decomposition T = \sum_{i=1}^\infty \sigma_i \, y_i \otimes x_i, where \{x_i\} and \{y_i\} are orthonormal sets in H and K, respectively, such that T z = \sum_{i=1}^\infty \sigma_i \langle z, x_i \rangle y_i for all z \in H, and the singular values satisfy \sigma_i \to 0 as i \to \infty.[71][70] This representation allows compact operators to be approximated by finite-rank operators, with the error controlled by the tail of the singular values, enabling practical computations in infinite-dimensional settings.[71]
Key properties follow from the compactness: the singular values \sigma_i are positive, nonincreasing, and accumulate only at zero, distinguishing compact operators from general bounded ones.[70][71] Among compact operators, nuclear operators—equivalently trace-class operators in Hilbert spaces—are those for which the singular values are \ell^1-summable, i.e., \sum_i \sigma_i < \infty, providing a nuclear norm \|T\|_1 = \sum_i \sigma_i that measures their "size" in a summability sense.[71]
The existence of the SVD for compact operators is implied by the spectral theorem for compact self-adjoint operators, which decomposes T^* T into a series of projections onto eigenspaces with eigenvalues \sigma_i^2, from which the singular vectors are derived via the polar decomposition.[71][70]
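Although the operator setting is infinite-dimensional, the characteristic decay of singular values can be illustrated by discretizing a smooth integral kernel; the sketch below (our own construction, chosen purely for illustration) forms a quadrature approximation of a Gaussian-kernel integral operator on [0, 1] and shows that its singular values fall off rapidly toward zero, mirroring the \sigma_i \to 0 behavior of compact operators:

```python
import numpy as np

# Quadrature discretization of (Tf)(x) = \int_0^1 k(x, y) f(y) dy with a smooth
# Gaussian kernel; smoothness makes the operator compact with rapidly decaying
# singular values.
n = 200
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]                                      # quadrature weight
K = np.exp(-(x[:, None] - x[None, :])**2 / 0.1) * h

sigma = np.linalg.svd(K, compute_uv=False)
print(sigma[:5])                # dominant singular values
print(sigma[20] / sigma[0])     # many orders of magnitude smaller: sigma_i -> 0
```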
Scale-invariant and other variations
The scale-invariant singular value decomposition addresses the sensitivity of the standard SVD to row and column scaling by incorporating diagonal scaling matrices into the factorization. For a matrix A, this variation takes the form A = D U \Sigma V^T E, where D and E are diagonal matrices that normalize the rows and columns, respectively, ensuring the core decomposition U \Sigma V^T is invariant to multiplicative scaling factors. This adjustment is particularly useful for matrices with inherent scaling differences, such as those arising in data analysis where units or magnitudes vary across dimensions.[72]
In the context of positive matrices, scale invariance can also be achieved by tying the scaling to the geometric mean of the singular values, as in the geometric mean decomposition (GMD), which factorizes A = Q R P^T with Q and P having orthonormal columns and an upper triangular R whose diagonal entries all equal the geometric mean \bar{\sigma} = \left( \prod_{i=1}^{r} \sigma_i \right)^{1/r} of the r positive singular values, preserving the relative structure while normalizing the overall scale. This approach maintains the ratios of the singular values, allowing robust comparison of matrix structures independent of absolute magnitudes, and finds application in fields like comparative genomics where scaling variations between datasets must be mitigated.[72][73]
The generalized singular value decomposition (GSVD) extends SVD to pairs of matrices A and B sharing the same number of columns, decomposing them as A = U \Sigma_A X^{-1} and B = V \Sigma_B X^{-1}, where U and V are orthogonal (unitary in the complex case), X is nonsingular, and \Sigma_A, \Sigma_B are nonnegative diagonal matrices; the generalized singular values are the ratios \gamma_i = \alpha_i / \beta_i of the corresponding diagonal entries of \Sigma_A and \Sigma_B. This formulation provides a unified framework for comparing datasets of different dimensionalities, such as genome-scale expression profiles, by highlighting shared and differential components through the ratios, which remain invariant to common transformations. The GSVD was first applied to this kind of comparative analysis in work on genome-scale expression data and has since been widely adopted for multi-dataset analysis.[73]
For multi-way data represented as tensors, the higher-order singular value decomposition (HOSVD) generalizes SVD by computing a Tucker decomposition via successive mode-wise SVDs on unfoldings of the tensor. Specifically, for an N-way tensor \mathcal{X}, HOSVD yields \mathcal{X} = \mathcal{S} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}, where \mathcal{S} is the core tensor and each U^{(n)} collects the left singular vectors of the mode-n unfolding of \mathcal{X} (retaining only the leading vectors gives a truncated HOSVD). This multilinear approach preserves the multi-dimensional structure and enables low-rank approximations for higher-order arrays, as in signal processing and multidimensional data compression.
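The mode-wise construction can be sketched directly in NumPy (function and variable names here are ours, and the code keeps all singular vectors, so the reconstruction is exact rather than truncated): each factor comes from the SVD of a mode-n unfolding, and the core tensor is obtained by applying the transposed factors to \mathcal{X}:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, U, mode):
    """n-mode product, contracting the columns of U with axis `mode` of X."""
    return np.moveaxis(np.tensordot(U, X, axes=(1, mode)), 0, mode)

def hosvd(X):
    """Higher-order SVD: factors from mode-wise SVDs, core via multilinear products."""
    factors = [np.linalg.svd(unfold(X, mode), full_matrices=False)[0]
               for mode in range(X.ndim)]
    S = X
    for mode, U in enumerate(factors):
        S = mode_product(S, U.T, mode)   # core: S = X x_1 U1^T x_2 U2^T ...
    return S, factors

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 5, 3))
S, factors = hosvd(X)

# Reconstruction X = S x_1 U1 x_2 U2 x_3 U3 is exact when nothing is truncated.
X_rec = S
for mode, U in enumerate(factors):
    X_rec = mode_product(X_rec, U, mode)
print(np.allclose(X_rec, X))  # True
```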
Other variations include the bidiagonal SVD, which exploits the bidiagonal form produced by an initial reduction step to compute the decomposition efficiently and accurately and which remains the core of dense SVD routines; for very large matrices where only a low-rank approximation is needed, randomized methods are now often preferred. Randomized SVD variants, including streaming algorithms developed since 2014, use random projections to sketch the matrix and approximate the decomposition in one or a few passes, enabling processing of massive or continuously arriving data with controlled error bounds. Such sketching-based algorithms achieve near-optimal low-rank approximations while supporting parallel and memory-efficient implementations for big-data applications.[74]
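The basic randomized scheme can be sketched as follows (a minimal two-pass illustration with our own function names, omitting the power iterations, error estimators, and streaming refinements used in production implementations): a Gaussian test matrix sketches the range of A, and a small dense SVD of the projected matrix yields an approximate truncated SVD:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD via a Gaussian sketch of the range of A."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal basis for the sampled range
    B = Q.T @ A                                       # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(3)
A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))  # exactly rank 20

U, s, Vt = randomized_svd(A, k=20)
s_exact = np.linalg.svd(A, compute_uv=False)[:20]
print(np.allclose(s, s_exact))  # True: the sketch captures the 20-dimensional range exactly
```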
Historical development
Origins in the 19th century
The origins of the singular value decomposition trace back to the mid-19th century, when mathematicians began exploring the canonical forms of bilinear and quadratic forms in the context of linear algebra and geometry. In 1873, Italian mathematician Eugenio Beltrami published a seminal paper on bilinear functions, where he demonstrated how to decompose a general bilinear form into a sum of products of linear forms using orthogonal transformations. This work effectively introduced the concept of singular values for real square matrices by reducing the associated quadratic forms to a diagonal structure, with the diagonal entries representing the squares of what we now recognize as singular values. Beltrami's approach focused on positive definite forms, assuming the matrices involved led to positive eigenvalues, and arose from efforts to simplify expressions tied to the geometry of conics and higher-dimensional quadrics.[2][75]
Beltrami's decomposition was motivated by problems in n-dimensional geometry, including the representation of quadric surfaces and their polar relations, which connected to broader studies in elliptic integrals for computing areas and volumes in such spaces. Although not termed singular value decomposition at the time, his method laid the groundwork for understanding how linear transformations could be diagonalized under orthogonal changes of basis, particularly for nonsingular matrices with distinct singular values. This contribution emphasized the stability and geometric interpretation of the decomposition, influencing subsequent work on matrix factorizations.[2]
One year later, in 1874, French mathematician Camille Jordan independently extended these ideas in his memoir on bilinear forms, applying them to general linear transformations between vector spaces. Jordan showed that any such transformation could be represented in diagonal form via two orthogonal bases, obtained by maximizing and minimizing the bilinear form under unit-norm constraints on the vectors. His variational technique provided a more complete framework, handling the full spectrum of singular values and incorporating a deflation process to find them iteratively. Like Beltrami, Jordan focused on positive definite cases, connecting his results to the geometry of conics in higher dimensions and to transformations relevant to elliptic integrals in multivariable calculus. This work solidified the recognition of the diagonal form's universality for real matrices, though without naming it as SVD.[2]
In 1889, James Joseph Sylvester contributed further to the canonical forms of pairs of bilinear forms, providing a more general decomposition that encompassed non-symmetric cases and influenced the development of matrix theory relevant to SVD.[2]
20th-century advancements and naming
In the early 20th century, Erhard Schmidt introduced the term "singular values" in 1907 while studying integral equations, formalizing their role in the decomposition of compact operators. Hermann Weyl advanced the theoretical foundations in 1912 by developing a perturbation theory for singular values in the setting of integral equations, and he refined the framework further in 1949, establishing majorization-type inequalities relating the eigenvalues and singular values of a matrix.[2]
In the 1930s, significant progress was made in extending the singular value decomposition to rectangular matrices and establishing its approximation properties. In 1936, Carl Eckart and Gale Young proved that the best low-rank approximation to a matrix in the Frobenius norm is obtained by truncating the singular value decomposition to its largest singular values, a result now known as the Eckart-Young theorem and later extended by Mirsky to all unitarily invariant norms, including the spectral norm.[56] This built on the 19th-century foundations by formalizing the error bounds for such approximations.[76]
Early computational approaches to the singular value decomposition emerged in the mid-20th century, facilitated by the post-World War II boom in electronic computing, which made iterative numerical methods practical for the first time. John von Neumann's work in the 1930s on unitarily invariant norms and operator decompositions laid theoretical groundwork for algorithmic development, suggesting connections to eigenvalue problems that could be adapted for computation.[76] By the 1950s, methods based on Jacobi rotations—originally proposed for symmetric eigenvalue problems in the 19th century but revitalized with computers—were extended to the two-sided Jacobi algorithm for singular value decomposition, enabling iterative orthogonal transformations to diagonalize matrices efficiently. These techniques, including early proposals by Kogbetliantz in 1955, marked the shift from theoretical existence proofs to reliable numerical implementation.[76]
In the late 1960s, Gene Golub developed efficient and stable algorithms for computing the SVD, notably the Golub-Reinsch algorithm, which uses Householder bidiagonalization followed by QR iterations on the bidiagonal form. This method became a cornerstone of numerical linear algebra software.[3]
The term "singular value decomposition" was coined by G. W. Stewart in his 1973 textbook Introduction to Matrix Computations, standardizing nomenclature for what had previously been described variably as the "Schmidt decomposition" or "fundamental theorem of linear algebra."[76] This naming reflected growing recognition of its utility in numerical linear algebra amid advancing computational capabilities.
Into the late 20th and early 21st centuries, the singular value decomposition played a pivotal role in handling big data through low-rank matrix factorization. A notable example is the Netflix Prize competition (2006–2009), in which teams leveraged SVD-style matrix factorization models to predict user movie ratings from sparse rating matrices, ultimately reducing prediction error (RMSE) by roughly 10% relative to Netflix's baseline system and demonstrating the scalability of the approach for recommender applications.[77]