Tensor decomposition
Tensor decomposition is the mathematical process of factorizing a multi-dimensional array, or tensor, into a sum of simpler, lower-rank components, generalizing matrix factorization techniques to higher-order data structures beyond two dimensions.[1] This approach enables the extraction of latent structures and patterns from complex, high-dimensional datasets by representing the tensor as combinations of vectors or lower-order tensors.[2]
The origins of tensor decomposition trace back to 1927, when Frank Lauren Hitchcock introduced the concept of decomposing a three-way array into rank-1 terms, laying the groundwork for multi-linear algebra applications.[2] The field advanced significantly in the 1970s through Richard Harshman's development of the Parallel Factors (PARAFAC) model, a canonical form of decomposition, alongside independent work by Joseph Kruskal on uniqueness conditions.[2] By the late 20th century, tensor decompositions had permeated psychometrics and chemometrics before expanding into signal processing in the 1990s and machine learning in the 2000s, driven by the need to handle multi-way data in fields like neuroimaging and recommender systems.[1][2]
Among the most prominent methods, the Canonical Polyadic (CP) decomposition, also known as CANDECOMP/PARAFAC, expresses a tensor as a minimal sum of rank-1 tensors (outer products of vectors), offering strong uniqueness properties under mild conditions and facilitating interpretable latent factor estimation.[1][2] In contrast, the Tucker decomposition generalizes this by factoring the tensor into a core tensor multiplied by factor matrices along each mode, providing flexibility for dimensionality reduction and subspace analysis, though with reduced uniqueness compared to CP.[1][2] Other variants, such as the Tensor Train format, have emerged for efficient computation in very high-order tensors, particularly in numerical simulations and large-scale data modeling.[1]
Tensor decompositions find wide applications across disciplines, including signal processing for tasks like blind source separation and harmonic retrieval, where they disentangle mixed signals from multi-sensor data.[2] In machine learning, they support collaborative filtering in recommendation engines, topic modeling in text analysis, and compression of neural network parameters, enabling scalable processing of big data while preserving essential structures.[1][2] These techniques also underpin advancements in knowledge graph completion and multi-relational learning, highlighting their role in uncovering hidden relationships in heterogeneous datasets.[1]
Fundamentals
Notation and Terminology
A tensor is defined as a multidimensional array of numerical values, generalizing the concepts of scalars (0th-order tensors), vectors (1st-order tensors), and matrices (2nd-order tensors) to higher dimensions.[3] An Nth-order tensor, also referred to as an N-way tensor, possesses N dimensions or indices, where the order N indicates the number of modes or ways in the array.[3] Each mode corresponds to a specific dimension of the tensor, and a fiber is obtained by fixing all indices except one, serving as the higher-order analogue of rows or columns in a matrix.[3]
Standard notation in tensor decomposition employs bold lowercase letters for vectors (e.g., \mathbf{a}), bold uppercase letters for matrices (e.g., \mathbf{A}), and bold calligraphic uppercase letters for higher-order tensors (e.g., \mathcal{X}).[3] Scalars are denoted by italicized lowercase letters (e.g., a), and the entries of a tensor are represented using indices, such as x_{i_1 i_2 \dots i_N} for the element at position (i_1, i_2, \dots, i_N) in \mathcal{X}.[3]
Key tensor operations include the outer product, denoted by \circ, which combines vectors to form a tensor; for instance, \mathcal{X} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \dots \circ \mathbf{a}^{(N)} yields x_{i_1 i_2 \dots i_N} = a^{(1)}_{i_1} a^{(2)}_{i_2} \dots a^{(N)}_{i_N}.[3] The inner product between two same-sized tensors \mathcal{X} and \mathcal{Y} is the scalar \langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1, \dots, i_N} x_{i_1 \dots i_N} y_{i_1 \dots i_N}.[3] The mode-n product, denoted \mathcal{X} \times_n \mathbf{U}, multiplies the tensor \mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_N} by a matrix \mathbf{U} \in \mathbb{R}^{J \times I_n} along the nth mode, resulting in a tensor of size I_1 \times \dots \times I_{n-1} \times J \times I_{n+1} \times \dots \times I_N, with elements given by \sum_{i_n} x_{i_1 \dots i_N} u_{j i_n}.[3] The Frobenius norm measures the magnitude of a tensor as \|\mathcal{X}\|_F = \sqrt{\sum_{i_1, \dots, i_N} x_{i_1 \dots i_N}^2}, analogous to the matrix Frobenius norm.[3]
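For illustration, the following Python (NumPy) sketch evaluates these operations on a small third-order tensor; the helper mode_n_product is an illustrative implementation (unfold, multiply, refold) rather than a library routine.

import numpy as np

def mode_n_product(X, U, n):
    # Illustrative mode-n product X x_n U: unfold along mode n, multiply by U,
    # and fold the result back (the internal column ordering cancels out).
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    new_shape = (U.shape[0],) + tuple(np.delete(X.shape, n))
    return np.moveaxis((U @ Xn).reshape(new_shape), 0, n)

a, b, c = np.random.rand(3), np.random.rand(4), np.random.rand(5)
X = np.einsum('i,j,k->ijk', a, b, c)      # outer product a ∘ b ∘ c
Y = np.random.rand(3, 4, 5)
inner = np.sum(X * Y)                      # inner product <X, Y>
frob = np.sqrt(np.sum(X ** 2))             # Frobenius norm ||X||_F
U = np.random.rand(6, 4)                   # maps mode 2 (size 4) to size 6
Z = mode_n_product(X, U, 1)                # result has shape (3, 6, 5)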
Unfolding, or matricization, rearranges the elements of a tensor into a matrix to facilitate computations; the mode-n unfolding, denoted \mathbf{X}_{(n)}, arranges the mode-n fibers as the columns of the matrix, with rows indexed by the nth mode and columns by the combinations of the remaining modes.[3] For a 3-way tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K}, the mode-1 unfolding \mathbf{X}_{(1)} has dimensions I \times (J K) and is formed by placing the frontal slices (fixed k) side by side, so its columns are the mode-1 (column) fibers; the mode-2 unfolding \mathbf{X}_{(2)} has dimensions J \times (I K), with the mode-2 (row) fibers as columns; and the mode-3 unfolding \mathbf{X}_{(3)} has dimensions K \times (I J), whose rows are the vectorized frontal slices.[3] As an illustrative example, consider a 3-way tensor \mathcal{X} \in \mathbb{R}^{3 \times 4 \times 2} with entries filled sequentially from 1 to 24 in column-major order; its mode-1 unfolding is the 3 \times 8 matrix
\mathbf{X}_{(1)} = \begin{bmatrix}
1 & 4 & 7 & 10 & 13 & 16 & 19 & 22 \\
2 & 5 & 8 & 11 & 14 & 17 & 20 & 23 \\
3 & 6 & 9 & 12 & 15 & 18 & 21 & 24
\end{bmatrix},
where columns represent the fibers along mode 1, derived from the two frontal slices.[3]
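This example can be checked with a short NumPy snippet, assuming the column-major (Fortran-order) filling described above.

import numpy as np

# Fill a 3 x 4 x 2 tensor with 1..24 in column-major (Fortran) order,
# matching the sequential filling described in the text.
X = np.arange(1, 25).reshape(3, 4, 2, order='F')

# Mode-1 unfolding: place the two frontal slices X[:, :, 0] and X[:, :, 1]
# side by side, giving a 3 x 8 matrix whose columns are the mode-1 fibers.
X1 = np.concatenate([X[:, :, 0], X[:, :, 1]], axis=1)
print(X1)
# [[ 1  4  7 10 13 16 19 22]
#  [ 2  5  8 11 14 17 20 23]
#  [ 3  6  9 12 15 18 21 24]]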
Tensors and Their Properties
Tensors generalize vectors and matrices to higher orders, serving as the fundamental objects in multilinear algebra. A first-order tensor corresponds to a vector, a second-order tensor to a matrix, and higher-order tensors extend these concepts to multi-way arrays that capture interactions across multiple dimensions or modes.[4] This generalization enables the representation of complex, multidimensional data, such as in signal processing or psychometrics, where relationships are not adequately captured by pairwise (matrix) structures alone.[4]
A key property of tensors is their multilinearity, meaning operations along each mode—such as multiplication by a matrix—preserve linearity in that mode while fixing the others. This multilinearity underpins tensor manipulations, distinguishing them from nonlinear generalizations and facilitating decompositions that exploit low-rank structures across modes.[4] Tensor rank quantifies the complexity of these structures. The CP rank is defined as the minimal number of rank-1 tensors (outer products of vectors, one per mode) required to express the tensor as their sum. In contrast, the Tucker rank is a tuple specifying the ranks of the tensor's mode-wise unfoldings into matrices. Notably, for tensors of order greater than two, the CP rank can exceed the maximum rank of any unfolding, unlike the case for matrices where rank is uniquely determined.[4]
High-dimensional tensors pose significant challenges due to the curse of dimensionality, where storage and computational demands grow exponentially with the number of modes or dimensions. For instance, a third-order tensor of dimensions 100 \times 100 \times 100 requires storing one million entries, rendering direct analysis computationally infeasible for even moderately sized data. This exponential growth leads to ill-posed problems in direct tensor analysis, as the sheer volume obscures underlying patterns and amplifies sensitivity to noise or perturbations.[4][5]
The need for tensor decomposition arose historically in psychometrics, where early efforts to model multidimensional data structures highlighted the limitations of matrix-based factor analysis. In 1944, Raymond B. Cattell introduced the concept of parallel proportional profiles to address factor rotation in multi-way data, laying groundwork for low-rank approximations that would later formalize tensor decompositions.[4][6] To illustrate, consider a synthetic third-order tensor of CP rank R, expressed as \mathcal{X} = \sum_{r=1}^R \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r, where each term is a rank-1 component formed by the outer product of vectors \mathbf{a}_r, \mathbf{b}_r, and \mathbf{c}_r from the respective modes. This form demonstrates how tensors can be built from simpler, interpretable building blocks, motivating decomposition techniques for real-world data reduction.[4]
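As a concrete illustration of the rank-R construction above, the following NumPy sketch builds a synthetic third-order tensor from R rank-1 terms; the sizes and rank are arbitrary choices for the example.

import numpy as np

# Build a synthetic third-order tensor of CP rank (at most) R as a sum of
# R rank-1 terms a_r ∘ b_r ∘ c_r.
I, J, K, R = 30, 20, 10, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

X = np.zeros((I, J, K))
for r in range(R):
    X += np.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r])

# Equivalently, in one einsum over the shared component index r:
X_alt = np.einsum('ir,jr,kr->ijk', A, B, C)
assert np.allclose(X, X_alt)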
Core Decomposition Methods
CANDECOMP/PARAFAC (CP) Decomposition
The CANDECOMP/PARAFAC (CP) decomposition factorizes an N-way tensor \mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N} into a sum of R rank-one tensors, providing a minimal parameterization for low-rank approximations. This model expresses the tensor as
\mathcal{X} \approx \sum_{r=1}^R \mathbf{a}_r \circ \mathbf{b}_r \circ \cdots \circ \mathbf{z}_r,
where R denotes the CP rank, \circ represents the outer product, and the vectors \mathbf{a}_r \in \mathbb{R}^{I_1}, \mathbf{b}_r \in \mathbb{R}^{I_2}, ..., \mathbf{z}_r \in \mathbb{R}^{I_N} serve as columns of the factor matrices \mathbf{A} \in \mathbb{R}^{I_1 \times R}, \mathbf{B} \in \mathbb{R}^{I_2 \times R}, ..., \mathbf{Z} \in \mathbb{R}^{I_N \times R}.[3] The formulation generalizes parallel factor analysis from two-way to multi-way data, capturing latent structures through these rank-one components.[3]
Independently proposed by Carroll and Chang in 1970 under the name CANDECOMP and by Harshman in the same year as PARAFAC, the method emerged from psychometrics and phonetics to address multi-mode factor analysis.[7] A defining feature is its essential uniqueness: under mild conditions, the factor matrices are unique up to permutation of the components and scaling of the vectors within each rank-one term (with the scaling factors in each term multiplying to one). Kruskal's condition establishes a sufficient criterion for this uniqueness in the three-way case, requiring that the sum of the Kruskal ranks of the factor matrices satisfies k_{\mathbf{A}} + k_{\mathbf{B}} + k_{\mathbf{C}} \geq 2R + 2, where the Kruskal rank k_{\mathbf{U}} of a matrix \mathbf{U} \in \mathbb{R}^{I \times R} is the largest k such that every set of k columns of \mathbf{U} is linearly independent.[3]
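As a hedged illustration, the NumPy snippet below computes the Kruskal rank of a matrix by brute force over column subsets and checks the sufficient condition for randomly drawn factors; the function name kruskal_rank is illustrative rather than a library call.

import numpy as np
from itertools import combinations

def kruskal_rank(U, tol=1e-10):
    # k-rank: largest k such that EVERY set of k columns is linearly independent.
    R = U.shape[1]
    for k in range(R, 0, -1):
        if all(np.linalg.matrix_rank(U[:, list(cols)], tol=tol) == k
               for cols in combinations(range(R), k)):
            return k
    return 0

rng = np.random.default_rng(0)
R = 4
A = rng.standard_normal((6, R))
B = rng.standard_normal((5, R))
C = rng.standard_normal((4, R))
# Kruskal's sufficient condition for essential uniqueness of a rank-R CP model:
print(kruskal_rank(A) + kruskal_rank(B) + kruskal_rank(C) >= 2 * R + 2)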
The CP decomposition offers advantages in parsimony, using the fewest parameters—specifically, \sum_{n=1}^N I_n R - (N-1)R after accounting for scaling freedoms—compared to other methods like Tucker, which facilitates interpretable additive components akin to principal components in multi-way settings.[3] However, a notable disadvantage is the computational challenge in determining the CP rank R, as finding the minimal R for exact decomposition is NP-hard.
In chemometrics, the three-way CP model is widely applied to fluorescence excitation-emission matrices across samples, decomposing the tensor to identify fluorophores. For instance, the unfolding-based least-squares objective for mode-1 unfolding becomes
\min_{\mathbf{A}, \mathbf{B}, \mathbf{C}} \left\| \mathbf{X}_{(1)} - \mathbf{A} (\mathbf{C} \odot \mathbf{B})^T \right\|_F^2,
where \odot is the Khatri-Rao product, \mathbf{X}_{(1)} \in \mathbb{R}^{I_1 \times (I_2 I_3)} is the unfolded tensor, and \|\cdot\|_F denotes the Frobenius norm; this isolates pure emission and excitation profiles while linking them to concentration profiles in \mathbf{C}.
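The matricized identity underlying this objective can be verified numerically. The sketch below builds a small exact rank-R tensor, forms its mode-1 unfolding with the column ordering used above, and checks that \mathbf{X}_{(1)} = \mathbf{A} (\mathbf{C} \odot \mathbf{B})^T; the khatri_rao helper is illustrative, not a library routine.

import numpy as np

def khatri_rao(U, V):
    # Column-wise Khatri-Rao product: column r equals kron(U[:, r], V[:, r]).
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

# Synthetic rank-R tensor X = sum_r a_r ∘ b_r ∘ c_r (sizes are arbitrary).
I, J, K, R = 5, 4, 3, 2
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Mode-1 unfolding with columns ordered so that X_(1) = A (C ⊙ B)^T holds.
X1 = X.reshape(I, J * K, order='F')
assert np.allclose(X1, A @ khatri_rao(C, B).T)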
Tucker Decomposition
The Tucker decomposition, also known as the higher-order singular value decomposition (HOSVD) in its orthogonal form, represents a tensor as the mode-wise product of a smaller core tensor and a set of factor matrices, serving as a natural extension of principal component analysis to multi-way data. Originally proposed by Ledyard R. Tucker in 1966 for three-mode factor analysis, the method decomposes a third-order tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K} approximately as \mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C}, where \mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times R_3} is the core tensor capturing multi-way interactions, and \mathbf{A} \in \mathbb{R}^{I \times R_1}, \mathbf{B} \in \mathbb{R}^{J \times R_2}, \mathbf{C} \in \mathbb{R}^{K \times R_3} are the factor matrices along each mode, often constrained to be orthogonal. This formulation generalizes to N-way tensors as \mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)}, with the core \mathcal{G} typically smaller than \mathcal{X} to enable compression and reveal latent structures.
The Tucker rank of a tensor is defined as the tuple (R_1, R_2, \dots, R_N), where each R_n specifies the dimensions of the core tensor \mathcal{G} and corresponds to the multilinear rank along mode n, preserving the essential multi-way variability of the data. Unlike the CP decomposition, which lacks a core tensor and offers essential uniqueness under mild conditions, the Tucker model does not possess essential uniqueness due to rotational freedom in the core tensor and factors, allowing transformations like \mathcal{G}' = \mathcal{G} \times_1 \mathbf{Q}_1 \times_2 \mathbf{Q}_2 \times_3 \mathbf{Q}_3 and \mathbf{A}' = \mathbf{A} \mathbf{Q}_1^{-1} for orthogonal \mathbf{Q}_n without altering the approximation. However, the HOSVD variant, introduced by De Lathauwer et al. in 2000, imposes orthogonality on all factor matrices, yielding a quasi-optimal low multilinear rank approximation whose Frobenius-norm error is within a factor of \sqrt{N} of the best achievable for the specified ranks. This orthogonality ensures that the decomposition captures the principal components along each mode independently, facilitating interpretability and subspace identification.
To illustrate, consider a three-way tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K}. The HOSVD computes the factor matrices by performing successive singular value decompositions (SVDs) on the mode-n unfoldings: \mathbf{X}_{(1)} = \mathbf{U}^{(1)} \mathbf{\Sigma}^{(1)} (\mathbf{V}^{(1)})^T yields \mathbf{A} = \mathbf{U}^{(1)} (retaining the first R_1 columns), similarly for \mathbf{B} from \mathbf{X}_{(2)} and \mathbf{C} from \mathbf{X}_{(3)}. The core tensor is then obtained as \mathcal{G} = \mathcal{X} \times_1 \mathbf{A}^T \times_2 \mathbf{B}^T \times_3 \mathbf{C}^T, where the mode products project \mathcal{X} onto the orthogonal subspaces spanned by the factors, resulting in a core whose entries reflect the multi-linear correlations. This process, while not minimizing the full tensor approximation error, provides a quasi-optimal solution that is computationally efficient and widely used for initial tensor approximations.
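A minimal NumPy sketch of the truncated HOSVD described above follows: the factor matrices come from the leading left singular vectors of each mode-n unfolding, and projecting onto them yields the core. The helpers unfold, mode_n_product, and hosvd are illustrative, and the ranks are arbitrary example values.

import numpy as np

def unfold(X, n):
    # Mode-n unfolding (the internal column ordering cancels out below).
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def mode_n_product(X, U, n):
    # Mode-n product X x_n U: unfold, multiply, and refold.
    new_shape = (U.shape[0],) + tuple(np.delete(X.shape, n))
    return np.moveaxis((U @ unfold(X, n)).reshape(new_shape), 0, n)

def hosvd(X, ranks):
    # Truncated HOSVD: leading left singular vectors of each unfolding give
    # the factor matrices; projecting X onto them gives the core tensor.
    factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :Rn]
               for n, Rn in enumerate(ranks)]
    G = X
    for n, U in enumerate(factors):
        G = mode_n_product(G, U.T, n)        # G = X x_1 A^T x_2 B^T x_3 C^T
    return G, factors

# Example: multilinear rank-(5, 5, 5) approximation of a 20 x 30 x 40 tensor.
X = np.random.default_rng(2).standard_normal((20, 30, 40))
G, (A, B, C) = hosvd(X, (5, 5, 5))
X_hat = mode_n_product(mode_n_product(mode_n_product(G, A, 0), B, 1), C, 2)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)   # large for random data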
Advanced Decomposition Techniques
Tensor Train (TT) Decomposition
The tensor train (TT) decomposition represents a high-order tensor as a sequence of lower-dimensional cores connected in a linear chain, enabling efficient handling of very high-dimensional data without the exponential storage costs of the full tensor. Introduced by Ivan Oseledets in 2011, this format was motivated by the need to solve high-dimensional partial differential equations (PDEs) in fields such as quantum dynamics and stochastic modeling, where the tensor order d can reach 10 to 1000 or more. Unlike traditional decompositions that struggle with the curse of dimensionality, the TT format maintains low-rank structure through a chain of matrix multiplications, facilitating scalable computations for tensor operations like contraction and summation.[8]
Formally, for an N-way tensor \mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_N}, the TT decomposition expresses each element as
\mathcal{X}_{i_1 \dots i_N} = \mathbf{G}^{(1)}_{i_1} \mathbf{G}^{(2)}_{i_2} \cdots \mathbf{G}^{(N)}_{i_N},
where \mathbf{G}^{(k)} is the k-th core, a three-dimensional array of size r_{k-1} \times n_k \times r_k with r_0 = r_N = 1, and \mathbf{G}^{(k)}_{i_k} denotes the r_{k-1} \times r_k matrix slice for mode i_k \in \{1, \dots, n_k\}. The bond dimensions r_k (for k=1, \dots, N-1) control the rank of the decomposition, and the TT-rank is defined as the tuple (r_1, \dots, r_{N-1}). For storage efficiency, the cores are often stored in matricized (unfolded) form, where each \mathbf{G}^{(k)} is represented as a matrix of size r_{k-1} n_k \times r_k, reducing redundancy while preserving the chain structure. This formulation allows the full tensor to be reconstructed on-the-fly without explicit storage of all \prod_{k=1}^N n_k elements.[8]
A key property of the TT decomposition is its compression efficiency: assuming uniform mode sizes n_k = n and maximum bond dimension r = \max_k r_k, the storage requirement is O(N r^2 n), which is linear in the tensor order N compared to the exponential O(n^N) for the dense tensor. This makes TT particularly suitable for ultra-high-order tensors, where the bond ranks r_k remain small due to the intrinsic low-rank nature of many physical systems. Additionally, the format supports operations such as arithmetic (e.g., addition and multiplication of TT tensors) and reshaping, which can be performed directly on the cores without full reconstruction. The TT-SVD algorithm constructs the decomposition by sequentially applying singular value decomposition (SVD) to matricizations of the tensor: starting from the left, it reshapes the current factor so that its rows group the previous bond index with the current mode, computes a truncated SVD whose r_k leading singular vectors form the next core, and carries the remaining factor forward, enabling a stable low-rank approximation.[8]
The approximation quality in TT-SVD is quantified by an error bound in the Frobenius norm: for a TT approximation \mathcal{B} with bond ranks r_k, the error satisfies \|\mathcal{X} - \mathcal{B}\|_F \leq \sqrt{\sum_{k=1}^{N-1} \varepsilon_k^2}, where \varepsilon_k^2 = \sum_{j > r_k} \sigma_{j,k}^2 is the energy of the singular values discarded from the k-th sequential unfolding. This bound ensures that truncation errors accumulate conservatively, often yielding near-optimal compression for tensors with rapidly decaying singular values, as seen in applications to discretized high-dimensional operators.[8]
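The sequential TT-SVD and the element-wise reconstruction formula can be sketched in NumPy as follows; tt_svd and tt_element are illustrative helpers, and for simplicity the ranks are capped by a single max_rank rather than chosen from an error tolerance.

import numpy as np

def tt_svd(X, max_rank):
    # Sequential TT-SVD sketch: repeatedly reshape the remaining factor,
    # take a truncated SVD, and peel off one core at a time.
    dims = X.shape
    d = len(dims)
    cores = []
    r_prev = 1
    M = X.reshape(r_prev * dims[0], -1)
    for k in range(d - 1):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r_k = min(max_rank, len(s))
        U, s, Vt = U[:, :r_k], s[:r_k], Vt[:r_k, :]
        cores.append(U.reshape(r_prev, dims[k], r_k))      # core G^(k+1)
        M = (np.diag(s) @ Vt).reshape(r_k * dims[k + 1], -1)
        r_prev = r_k
    cores.append(M.reshape(r_prev, dims[-1], 1))           # last core
    return cores

def tt_element(cores, idx):
    # Reconstruct one entry as a product of matrix slices G^(k)[:, i_k, :].
    v = np.ones((1, 1))
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]
    return v[0, 0]

X = np.random.default_rng(3).standard_normal((4, 5, 6, 7))
cores = tt_svd(X, max_rank=3)
approx_0000 = tt_element(cores, (0, 0, 0, 0))   # approximates X[0, 0, 0, 0]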
Hierarchical Tucker (HT) Decomposition
The Hierarchical Tucker (HT) decomposition represents a high-order tensor using a multi-level factorization structured as a tree, where non-leaf nodes store small transfer tensors and leaf nodes store factor matrices. This format was introduced by Hackbusch and Kühn in 2009, initially motivated by applications in quantum chemistry and the approximation of integral operators.[9] Unlike linear chain formats such as the tensor train, the HT approach employs a general tree topology that clusters modes hierarchically, enabling more flexible low-rank approximations for tensors with balanced dimensions.[10]
In the HT formulation, a tensor \mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_N} is approximated by recursively factoring subspaces along the tree branches. For a binary tree example, the root-level approximation can be expressed as \mathcal{X} \approx \sum_{r=1}^{R_0} \mathbf{b}_r^{(1)} \circ \mathcal{G}_r^{(2,\dots,N)}, where \mathbf{b}_r^{(1)} \in \mathbb{R}^{d_1} are basis vectors, R_0 is the root rank, and \mathcal{G}_r^{(2,\dots,N)} are intermediate low-rank tensors further decomposed at child nodes, such as \mathcal{G}_r^{(2,\dots,N)} \approx \sum_{s=1}^{R_1} \mathbf{U}_s^{(2)} \circ \mathcal{M}_s^{(3,\dots,N)} with factor matrices \mathbf{U}^{(i)} \in \mathbb{R}^{d_i \times r_i} at leaves and transfer tensors \mathcal{M} at internal nodes. The HT-rank is defined as the tuple of ranks (r_\nu)_{\nu \in \mathcal{T}} assigned to each node \nu in the tree \mathcal{T}, quantifying the low-rank structure at every clustering level.[10]
The storage complexity of an HT decomposition for an N-th order tensor with equal mode sizes d and maximum rank R is O(N d R + N R^3) in a balanced binary tree, significantly reducing memory requirements compared to the full Tucker format's O(R^N) for high N. This efficiency arises from storing only the factor matrices at leaves (total size O(N d R)) and compact transfer tensors at non-leaves (dominated by O(R^3) per internal node across approximately N nodes).[10]
A common method for constructing the HT decomposition is the hierarchical singular value decomposition (HT-SVD), which builds the approximation recursively by computing SVDs on matricizations of tensor clusters. Starting from the leaves, the algorithm performs SVD on the unfolding of each subtree tensor, retaining the leading singular vectors up to the prescribed node rank, and proceeds bottom-up to the root while orthogonalizing bases to preserve near-optimal approximation quality. The error is controlled by singular value truncation at each level, ensuring the relative approximation error is bounded by the sum of local truncation errors propagated through the tree.[10]
For instance, consider a 7-way tensor with a balanced binary tree structure: the root node has rank R_0, its two children (clustering 3 and 4 modes, respectively) have ranks R_1 and R_2, and further subdivisions continue down to the leaves. The HT-SVD constructs factor matrices for each mode and transfer tensors at internal nodes by sequential SVDs on cluster unfoldings, with the overall error analyzed via the decay of singular values at each truncation step.[10]
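For brevity, the NumPy sketch below works with a 4-way tensor and the balanced tree {{1,2},{3,4}} rather than the 7-way example: it computes leaf bases, cluster bases, transfer tensors, and a root matrix by truncated SVDs, then reconstructs an approximation. It omits the nested orthogonalization of a full HT-SVD, and the rank choices are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
n1, n2, n3, n4 = 6, 5, 4, 3
X = rng.standard_normal((n1, n2, n3, n4))
r_leaf, r_node = 3, 3   # leaf ranks and internal-node ranks (assumed equal here)

def leading_left_singular_vectors(M, r):
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :r]

# Leaf bases: leading left singular vectors of each mode unfolding.
U1 = leading_left_singular_vectors(X.reshape(n1, -1), r_leaf)
U2 = leading_left_singular_vectors(np.moveaxis(X, 1, 0).reshape(n2, -1), r_leaf)
U3 = leading_left_singular_vectors(np.moveaxis(X, 2, 0).reshape(n3, -1), r_leaf)
U4 = leading_left_singular_vectors(np.moveaxis(X, 3, 0).reshape(n4, -1), r_leaf)

# Cluster bases for {1,2} and {3,4} from the corresponding matricization.
M = X.reshape(n1 * n2, n3 * n4)
U12 = leading_left_singular_vectors(M, r_node)      # columns in R^{n1 n2}
U34 = leading_left_singular_vectors(M.T, r_node)    # columns in R^{n3 n4}

# Transfer tensors: express each cluster basis in the leaf bases of its children.
V12 = U12.reshape(n1, n2, r_node)                   # refold cluster basis
V34 = U34.reshape(n3, n4, r_node)
B12 = np.einsum('ia,jb,ijp->abp', U1, U2, V12)      # r_leaf x r_leaf x r_node
B34 = np.einsum('kc,ld,klq->cdq', U3, U4, V34)
B_root = U12.T @ M @ U34                            # r_node x r_node

# Reconstruction from leaves, transfer tensors, and the root matrix.
X_hat = np.einsum('ia,jb,abp,kc,ld,cdq,pq->ijkl',
                  U1, U2, B12, U3, U4, B34, B_root)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)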
Computational Approaches
Alternating Least Squares (ALS)
The alternating least squares (ALS) method serves as a foundational block-coordinate optimization technique for fitting low-rank tensor decompositions by iteratively minimizing the Frobenius norm of the approximation error, \|\mathcal{X} - \hat{\mathcal{X}}\|_F^2, where \mathcal{X} is the observed tensor and \hat{\mathcal{X}} is its low-rank approximation.[7][3] Introduced as a core algorithm for the PARAFAC model by Harshman in 1970, ALS cyclically updates one factor matrix at a time while holding the others fixed, solving a linear least-squares problem for each update.[7] This approach leverages the separability of the objective function across modes, making it applicable to both CP and Tucker decompositions.[3]
In the general ALS framework, the update for a specific factor matrix, say \mathbf{A}, involves unfolding the tensor along the corresponding mode and solving \mathbf{A} \leftarrow \arg\min_{\mathbf{A}} \|\mathbf{X}_{(1)} - \mathbf{A} \mathbf{Z}^T\|_F^2, where \mathbf{Z} is a matrix constructed from the Khatri-Rao products of the other fixed factor matrices.[3] The process cycles through all modes until convergence, with the error non-increasing at each iteration due to the optimality of individual least-squares solutions.[3] Typically, ALS converges to a local minimum, as the overall problem is non-convex, though the monotonic decrease in the objective ensures practical termination.[11]
For the CP decomposition, the unfolding-based updates explicitly use the Khatri-Rao product; for an N-way tensor with factor matrices \mathbf{A}^{(n)} (n=1,\dots,N) and rank R, the update for \mathbf{A}^{(1)} sets \mathbf{Z} = \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(2)}, where \odot denotes the Khatri-Rao product, yielding \mathbf{A}^{(1)} = \mathbf{X}_{(1)} \mathbf{Z} (\mathbf{Z}^T \mathbf{Z})^{-1}.[3] In the Tucker decomposition, updates involve higher-order unfoldings and the Kronecker product of factor matrices, such as updating the core tensor \mathcal{G} via multilinear products while fixing the factor matrices, or vice versa for each mode-n matricization \mathbf{X}_{(n)} \approx \mathbf{A}^{(n)} \mathbf{G}_{(n)} (\mathbf{A}^{(N)} \otimes \cdots \otimes \mathbf{A}^{(n+1)} \otimes \mathbf{A}^{(n-1)} \otimes \cdots \otimes \mathbf{A}^{(1)})^T.[3] Initialization is crucial for avoiding poor local minima and is often performed using the higher-order singular value decomposition (HOSVD) to obtain initial factor matrices from leading singular vectors of each mode unfolding, or via random orthogonal matrices for simplicity.[12][3]
ALS offers simplicity in implementation and parallelization across modes, as each update is an independent least-squares solve, but it can exhibit slow convergence or stagnation in ill-conditioned problems due to the sensitivity to initialization and the discrete nature of block updates.[3] Common stopping criteria include a relative change in the fit error below 10^{-6} or a maximum number of iterations, such as 100, to balance computation and accuracy.[3]
The following pseudocode outlines the ALS algorithm for CP decomposition of a third-order tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K} into rank-R factors \mathbf{A} \in \mathbb{R}^{I \times R}, \mathbf{B} \in \mathbb{R}^{J \times R}, \mathbf{C} \in \mathbb{R}^{K \times R}:
Initialize A, B, C (e.g., via HOSVD or random)
fit = ||X - [A,B,C]||_F^2
tol = 1e-6, maxiter = 100
for iter = 1 to maxiter do
    // Update A with B and C fixed
    Z = C ⊙ B                          (Khatri-Rao product)
    A = X_(1) * Z * pinv(Z' * Z)
    normalize the columns of A         (optional; store the norms as component weights)
    // Update B with A and C fixed
    Z = C ⊙ A
    B = X_(2) * Z * pinv(Z' * Z)
    normalize the columns of B
    // Update C with A and B fixed
    Z = B ⊙ A
    C = X_(3) * Z * pinv(Z' * Z)
    normalize the columns of C
    newfit = ||X - [A,B,C]||_F^2
    if (fit - newfit) / fit < tol then break
    fit = newfit
end
This pseudocode assumes access to tensor unfoldings \mathbf{X}_{(n)} and the pseudoinverse; normalization steps help maintain scale invariance in CP.[3]
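A compact NumPy translation of the pseudocode is sketched below. It uses the Gram-matrix identity (C ⊙ B)^T(C ⊙ B) = (C^T C) ∗ (B^T B) (Hadamard product) in place of forming Z^T Z explicitly, omits the optional column normalization, and starts from a random initialization; cp_als, unfold, and khatri_rao are illustrative helper names.

import numpy as np

def unfold(X, n):
    # Mode-n unfolding with columns ordered so that X_(1) = A (C ⊙ B)^T holds.
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1, order='F')

def khatri_rao(U, V):
    # Column-wise Khatri-Rao product: column r equals kron(U[:, r], V[:, r]).
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def cp_als(X, R, tol=1e-6, maxiter=100, seed=0):
    # ALS for a rank-R CP decomposition of a third-order tensor (random init).
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    normX2 = np.linalg.norm(X) ** 2
    fit_old = np.inf
    for _ in range(maxiter):
        # Each update solves a linear least-squares problem with the other
        # factors fixed, using (Z'Z) = (U'U) * (V'V) for the Khatri-Rao factor Z.
        A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
        X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
        fit = np.linalg.norm(X - X_hat) ** 2 / normX2   # relative squared error
        if abs(fit_old - fit) < tol:
            break
        fit_old = fit
    return A, B, C

# Recover an exact rank-3 tensor (up to permutation and scaling of the factors).
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((n, 3)) for n in (8, 7, 6))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, R=3)
print(np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(X))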
Optimization-Based Methods
Optimization-based methods for tensor decomposition treat the problem as a nonlinear least squares minimization over the factor matrices, enabling the use of continuous optimization techniques such as gradient descent. For the CANDECOMP/PARAFAC (CP) decomposition of a third-order tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K}, the objective is to minimize \frac{1}{2} \|\mathcal{X} - \hat{\mathcal{X}}\|_F^2, where \hat{\mathcal{X}} = \sum_{r=1}^R \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r and \circ denotes the outer product. The partial derivative with respect to an element a_{ir} of the first factor matrix \mathbf{A} is given by \frac{\partial f}{\partial a_{ir}} = - \left( \mathcal{X} \times_2 \mathbf{b}_r^\top \times_3 \mathbf{c}_r^\top \right)_i + \sum_{\ell=1}^R \lambda_{r\ell}^{(1)} a_{i\ell}, where \lambda_{r\ell}^{(1)} = (\mathbf{b}_r^\top \mathbf{b}_\ell)(\mathbf{c}_r^\top \mathbf{c}_\ell); equivalently, this gradient entry is the residual-weighted sum -\sum_{j,k} (x_{ijk} - \hat{x}_{ijk}) b_{jr} c_{kr}.[13] Gradient descent updates the factors iteratively using these derivatives, offering a direct alternative to block-wise methods by performing simultaneous adjustments across all parameters.[13]
To handle large-scale tensors where full gradient computation is prohibitive due to memory and time constraints, stochastic gradient descent (SGD) approximates the gradient using random subsets of tensor elements or fibers, enabling scalable fitting of CP decompositions. In SGD variants for CP, the update for a factor matrix element leverages a mini-batch gradient, such as \nabla_{\mathbf{A}} \approx -2 \mathbf{Y}_{(1)} (\mathbf{C} \odot \mathbf{B}) for the matricized residual \mathbf{Y}_{(1)}, processed in passes over sparse or dense data without storing the full unfolded tensor.[14] This approach has demonstrated efficiency on tensors with billions of elements, achieving convergence comparable to exact methods while reducing per-iteration costs.[14]
Regularization is incorporated into the objective to promote desirable properties, such as balancing scaling ambiguities or inducing sparsity. A common L2 penalty adds terms like \frac{\lambda}{2} \sum_n \|\mathbf{A}^{(n)}\|_F^2 to the loss, modifying the gradient to \frac{\partial \hat{f}}{\partial \mathbf{A}^{(n)}} = - \mathbf{X}_{(n)} \mathbf{A}^{(\odot \neq n)} + \mathbf{A}^{(n)} \boldsymbol{\Gamma}^{(n)} + \lambda \mathbf{A}^{(n)}, where \mathbf{A}^{(\odot \neq n)} denotes the Khatri-Rao product of the other factor matrices and \boldsymbol{\Gamma}^{(n)} the Hadamard product of their Gram matrices; the added term stabilizes optimization for ill-conditioned problems.[13] For sparsity, L1 penalties on factor elements encourage zero loadings, integrated via proximal gradient steps in the optimization loop.[15] In Tucker decomposition, orthogonality constraints on factor matrices are enforced using Riemannian optimization on the Stiefel manifold, where gradients are projected onto the tangent space to maintain orthonormality during descent, improving convergence for higher-order approximations.[16]
Advanced all-at-once methods accelerate convergence beyond basic gradient descent by approximating the Hessian. The Gauss-Newton algorithm linearizes the residual around the current factors and solves least-squares subproblems iteratively, often outperforming first-order methods in fewer steps for CP fitting, with parallel implementations achieving near-linear speedup on multi-core systems.[17] Levenberg-Marquardt extends this by blending gradient descent and Gauss-Newton steps via a damping parameter, enhancing robustness to poor initializations. Bayesian formulations introduce priors on factors (e.g., Gaussian for smoothness or spike-and-slab for sparsity) to quantify uncertainty, with variational inference or MCMC sampling providing posterior distributions over decompositions, useful for noisy data in signal processing.
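As a small illustration of the gradient expressions above, the following NumPy sketch evaluates the full gradient of the L2-regularized CP objective with respect to \mathbf{A} and applies one plain gradient-descent step; grad_A, the step size, and the problem sizes are illustrative assumptions.

import numpy as np

# f = 1/2 ||X - [[A,B,C]]||_F^2 + (λ/2) ||A||_F^2 (plus analogous terms for B, C)
def grad_A(X, A, B, C, lam):
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
    residual = X - X_hat
    # Gradient entry: -sum_{j,k} residual[i,j,k] B[j,r] C[k,r] + λ A[i,r]
    return -np.einsum('ijk,jr,kr->ir', residual, B, C) + lam * A

rng = np.random.default_rng(0)
I, J, K, R, lam, step = 10, 9, 8, 2, 0.1, 1e-3
X = rng.standard_normal((I, J, K))
A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
A = A - step * grad_A(X, A, B, C, lam)   # one gradient-descent update on A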
Scalability to big data is achieved through distributed frameworks, where unfolding and Khatri-Rao products are parallelized across nodes; for instance, MapReduce pipelines compute partial gradients by partitioning tensor slices, enabling CP decomposition of terabyte-scale data with linear speedup proportional to cluster size. As an illustrative example for Tensor Train (TT) decomposition, SGD with learning rate scheduling optimizes the cores \mathcal{G}^{(k)} by minimizing \frac{1}{2} \|\mathcal{X} - \mathcal{G}\|_F^2, where updates use stochastic approximations of the TT-reconstructed gradient:
Initialize TT cores G^{(1)}, ..., G^{(d)} with small random values
η_0 = initial learning rate (e.g., 0.01)
For t = 1 to max_iter:
    Sample mini-batch indices for tensor elements
    Compute stochastic gradient ∇_t ≈ ∂/∂G^{(k)} [1/2 ∑ (x_i - TT(G)_i)^2 ] for each core k
    η_t = η_0 / (1 + γ t)                  # Learning rate decay with γ = 0.001
    G^{(k)} ← Proj(G^{(k)} - η_t ∇_t)      # Proj ensures TT orthogonality if constrained
    If convergence: break
This pseudocode, applied to TT cores, supports efficient training on high-dimensional data by processing batches sequentially, with decay preventing overshooting.[18]
Applications and Extensions
In Signal Processing and Chemometrics
In signal processing, tensor decompositions facilitate the analysis of multi-dimensional array data from sensor networks. The CANDECOMP/PARAFAC (CP) decomposition is particularly effective for blind source separation in multi-sensor environments, where signals from multiple sources are mixed and observed across spatial, temporal, and frequency modes. By modeling the received data as a third-order tensor and decomposing it into rank-1 components, CP enables the recovery of individual source signals without prior knowledge of the mixing process, as demonstrated in applications to multi-user communication systems where the number of active users corresponds to the tensor rank.[19] This approach leverages the uniqueness properties of CP under mild conditions to separate nondisjoint sources in underdetermined scenarios.[20]
Tensor-based beamforming extends these principles to array signal processing, enhancing direction-of-arrival estimation and interference suppression in multi-antenna systems. Using Tucker or CP decompositions, the covariance tensor of received signals is factored to model multidimensional channel structures, such as in multiple-input multiple-output (MIMO) radar, where it captures spatial and temporal correlations for robust adaptive beamforming.[21] This method outperforms traditional vectorized techniques by preserving the inherent multi-way geometry of the data, leading to improved resolution in noisy environments.[19]
In chemometrics, PARAFAC decomposition plays a central role in multi-way calibration for spectroscopic and chromatographic data, enabling the extraction of chemically meaningful profiles from complex mixtures. For fluorescence spectroscopy, PARAFAC decomposes excitation-emission-time data cubes into tri-linear components, each representing the pure emission spectra, excitation profiles, and concentration time courses of individual fluorophores, even in the presence of overlapping signals.[22] This trilinear structure ensures rotational invariance, allowing direct interpretation of components as chemical species without ambiguity. In chromatography-mass spectrometry, PARAFAC supports multi-way calibration by resolving elution profiles, mass spectra, and sample concentrations in comprehensive two-dimensional gas chromatography (GC × GC) data, facilitating quantification of target analytes in complex matrices with minimal sample preparation.[23] Such applications in hyphenated techniques reduce the need for complete chromatographic separation, enhancing throughput in environmental and pharmaceutical analyses.[24]
A notable case study involves Tucker decomposition for noise reduction in electroencephalography (EEG) tensors structured along time, frequency, and channel modes. By retaining a low-rank core tensor while discarding higher-order noise components, Tucker achieves significant data compression while preserving essential brain activity signals in low-SNR recordings.[25] This approach separates artifacts like eye blinks or muscle noise from neural responses, enabling cleaner feature extraction for brain-computer interfaces.[26]
Tensor decompositions offer key advantages in these domains by handling multi-modal data natively, avoiding the information loss associated with vectorization or matricization that flattens multi-way interactions into matrices. This preservation of structure supports more interpretable models and higher fidelity reconstructions, often quantified by the relative squared error metric \frac{\|\mathcal{X} - \hat{\mathcal{X}}\|_F^2}{\|\mathcal{X}\|_F^2}, where \mathcal{X} is the original tensor and \hat{\mathcal{X}} its approximation, typically achieving values below 0.05 for well-fitted models in signal and chemical data.[19] However, these methods can be sensitive to outliers, such as spurious sensor readings or contaminated samples, which distort the low-rank approximation; robust variants, like those incorporating cellwise outlier detection, mitigate this by downweighting anomalies during fitting.[27]
In Machine Learning and Data Compression
Tensor decompositions play a pivotal role in machine learning by enabling the factorization of high-dimensional data into lower-rank components, which enhances model efficiency and interpretability. In recommender systems, the CANDECOMP/PARAFAC (CP) decomposition is commonly applied to user-item-context tensors to capture multi-way interactions, such as user preferences influenced by time or location, outperforming traditional matrix-based methods in sparse data scenarios. For instance, CP factorization approximates the rating tensor \mathcal{R} \in \mathbb{R}^{I \times J \times K} as \mathcal{R} \approx \sum_{r=1}^R \mathbf{u}_r \circ \mathbf{i}_r \circ \mathbf{c}_r, where \mathbf{u}_r, \mathbf{i}_r, and \mathbf{c}_r represent latent factors for users, items, and context, respectively, allowing scalable prediction of missing entries.
In neural network compression, tensor train (TT) decomposition is widely used to represent weight tensors in a compact format, significantly reducing the number of parameters while preserving performance. By decomposing multi-dimensional weight tensors into a sequence of lower-dimensional cores, TT formats enable substantial reduction in model size for deep learning recommendation models, with improvements in accuracy for tasks like click-through rate prediction. This approach is particularly effective for convolutional and recurrent layers, where the structured low-rank approximation minimizes storage and computation overhead on resource-constrained devices.
Tensor decompositions also facilitate data compression for large-scale datasets by exploiting multi-way correlations, making them suitable for applications like video analysis. For video data modeled as space-time-color tensors, TT and hierarchical Tucker (HT) decompositions achieve compression ratios up to 3:1 while maintaining visual quality metrics such as peak signal-to-noise ratio around 30 dB.[28] Post-decomposition quantization further enhances compression by reducing the bit precision of factor matrices, as seen in hybrid TT-quantization schemes that combine low-rank approximation with 8-bit encoding to achieve substantial reductions in convolutional neural network models.[29]
A notable case study involves HT decomposition in collaborative filtering, where it models user-item interactions as higher-order tensors to capture complex multi-way dependencies beyond pairwise relations. Unlike matrix factorization, HT captures hierarchical structures in the data, leading to improved prediction accuracy.
Extensions of tensor decompositions in machine learning include tensor regression models for handling multi-way inputs, formulated as \mathbf{y} = \mathcal{X} \times_1 \mathbf{B}, where \mathcal{X} is the input tensor, \times_1 denotes the mode-1 tensor-matrix product, and \mathbf{B} is the coefficient matrix, enabling predictions from structured covariates like images or graphs.[30] Integration with deep learning is supported by libraries such as TensorLy, which provides seamless APIs for tensorized neural networks, allowing TT and HT formats to be embedded in PyTorch or TensorFlow backends for end-to-end training of compressed models.[31]
Recent advances post-2020 have incorporated tensor decompositions into federated learning frameworks to ensure privacy preservation, where local tensor factorizations on client devices prevent direct sharing of raw data. For instance, coupled TT decomposition in federated settings aggregates low-rank updates across distributed nodes, reducing communication overhead while maintaining differential privacy guarantees in tasks like personalized recommendation.[32]