
Tensor decomposition

Tensor decomposition is the mathematical process of factorizing a multi-dimensional array, or tensor, into a sum of simpler, lower-rank components, generalizing matrix factorization techniques to higher-order data structures beyond two dimensions. This approach enables the extraction of latent structures and patterns from complex, high-dimensional data sets by representing the tensor as combinations of vectors or lower-order tensors. The origins of tensor decomposition trace back to 1927, when Frank Lauren Hitchcock introduced the concept of decomposing a three-way array into rank-1 terms, laying the groundwork for multilinear algebra applications. The field advanced significantly in the 1970s through Richard Harshman's development of the Parallel Factors (PARAFAC) model, a multi-way generalization of factor analysis, alongside Joseph Kruskal's work on uniqueness conditions. By the late 1980s, tensor decompositions had permeated psychometrics and chemometrics before expanding into signal processing in the 1990s and machine learning in the 2000s, driven by the need to handle multi-way data in applications such as recommender systems. Among the most prominent methods, the Canonical Polyadic (CP) decomposition, also known as CANDECOMP/PARAFAC, expresses a tensor as a minimal sum of rank-1 tensors (outer products of vectors), offering strong uniqueness properties under mild conditions and facilitating interpretable latent factor estimation. In contrast, the Tucker decomposition generalizes this by factoring the tensor into a core tensor multiplied by factor matrices along each mode, providing flexibility for compression and subspace analysis, though with weaker uniqueness compared to CP. Other variants, such as the Tensor Train format, have emerged for efficient computation with very high-order tensors, particularly in numerical simulations and large-scale data modeling. Tensor decompositions find wide applications across disciplines, including signal processing for tasks like blind source separation and harmonic retrieval, where they disentangle mixed signals from multi-sensor data. In machine learning, they support collaborative filtering in recommendation engines, topic modeling in text analysis, and compression of neural network parameters, enabling scalable processing of big data while preserving essential structures. These techniques also underpin advancements in knowledge graph completion and multi-relational learning, highlighting their role in uncovering hidden relationships in heterogeneous datasets.

Fundamentals

Notation and Terminology

A tensor is defined as a multidimensional array of numerical values, generalizing the concepts of scalars (0th-order tensors), vectors (1st-order tensors), and matrices (2nd-order tensors) to higher orders. An Nth-order tensor, also referred to as an N-way tensor, possesses N modes or indices, where the order N indicates the number of modes or ways of the array. Each mode corresponds to a specific dimension of the tensor, and a fiber is obtained by fixing all indices except one, serving as the higher-order analogue of the rows or columns of a matrix. Standard notation in tensor decomposition employs bold lowercase letters for vectors (e.g., \mathbf{a}), bold uppercase letters for matrices (e.g., \mathbf{A}), and bold calligraphic uppercase letters for higher-order tensors (e.g., \mathcal{X}). Scalars are denoted by italicized lowercase letters (e.g., a), and the entries of a tensor are represented using indices, such as x_{i_1 i_2 \dots i_N} for the element at position (i_1, i_2, \dots, i_N) in \mathcal{X}. Key tensor operations include the outer product, denoted by \circ, which combines vectors to form a tensor; for instance, \mathcal{X} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \dots \circ \mathbf{a}^{(N)} yields x_{i_1 i_2 \dots i_N} = a^{(1)}_{i_1} a^{(2)}_{i_2} \dots a^{(N)}_{i_N}. The inner product between two same-sized tensors \mathcal{X} and \mathcal{Y} is the scalar \langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1, \dots, i_N} x_{i_1 \dots i_N} y_{i_1 \dots i_N}. The mode-n product, denoted \mathcal{X} \times_n \mathbf{U}, multiplies the tensor \mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_N} by a matrix \mathbf{U} \in \mathbb{R}^{J \times I_n} along the nth mode, resulting in a tensor of size I_1 \times \dots \times I_{n-1} \times J \times I_{n+1} \times \dots \times I_N, with elements given by \sum_{i_n} x_{i_1 \dots i_N} u_{j i_n}. The Frobenius norm measures the magnitude of a tensor as \|\mathcal{X}\|_F = \sqrt{\sum_{i_1, \dots, i_N} x_{i_1 \dots i_N}^2}, analogous to the matrix Frobenius norm. Unfolding, or matricization, rearranges the elements of a tensor into a matrix to facilitate computations; the mode-n unfolding, denoted \mathbf{X}_{(n)}, arranges the mode-n fibers as the columns of the matrix, with rows corresponding to the nth mode and columns to the combinations of the remaining modes. For a 3-way tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K}, the unfoldings can be described in terms of slices: the mode-1 unfolding \mathbf{X}_{(1)} has dimensions I \times (J K) and places the frontal slices (fixed k) side by side, so that its rows correspond to the horizontal slices (fixed i); the mode-2 unfolding \mathbf{X}_{(2)} has dimensions J \times (I K) and arranges the fibers of the lateral slices (fixed j) as columns; and the mode-3 unfolding \mathbf{X}_{(3)} has dimensions K \times (I J), with its rows corresponding to the vectorized frontal slices (fixed k). As an illustrative example, consider a 3-way tensor \mathcal{X} \in \mathbb{R}^{3 \times 4 \times 2} with entries filled sequentially from 1 to 24 down the mode-1 fibers; its mode-1 unfolding is the 3 \times 8 matrix \mathbf{X}_{(1)} = \begin{bmatrix} 1 & 4 & 7 & 10 & 13 & 16 & 19 & 22 \\ 2 & 5 & 8 & 11 & 14 & 17 & 20 & 23 \\ 3 & 6 & 9 & 12 & 15 & 18 & 21 & 24 \end{bmatrix}, where the columns are the fibers along mode 1, obtained by placing the two frontal slices side by side.
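The following NumPy sketch reproduces this example (illustrative only; Fortran-order reshapes are assumed so that the fiber ordering matches the unfolding convention above):
import numpy as np

# Build the 3 x 4 x 2 example with entries 1..24 filled down the mode-1 fibers.
X = np.arange(1, 25).reshape(3, 4, 2, order='F')

def unfold(X, mode):
    # Mode-n matricization: mode-n fibers become the columns of the result.
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

print(unfold(X, 0))                              # the 3 x 8 matrix X_(1) shown above
print(unfold(X, 1).shape, unfold(X, 2).shape)    # (4, 6) and (2, 12)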

Tensors and Their Properties

Tensors generalize vectors and matrices to higher orders, serving as the fundamental objects in multilinear algebra. A first-order tensor corresponds to a vector, a second-order tensor to a matrix, and higher-order tensors extend these concepts to multi-way arrays that capture interactions across multiple dimensions or modes. This generalization enables the representation of complex, multidimensional data, such as in neuroimaging or chemometrics, where relationships are not adequately captured by pairwise (matrix) structures alone. A key property of tensors is their multilinearity, meaning operations along each mode, such as multiplication by a matrix, preserve linearity in that mode while fixing the others. This multilinearity underpins tensor manipulations, distinguishing them from nonlinear generalizations and facilitating decompositions that exploit low-rank structure across modes. Tensor rank quantifies the complexity of these structures. The CP rank is defined as the minimal number of rank-1 tensors (outer products of vectors, one per mode) required to express the tensor as their sum. In contrast, the multilinear rank is a tuple specifying the ranks of the tensor's mode-wise unfoldings into matrices. Notably, for tensors of order greater than two, the CP rank can exceed the maximum rank of any unfolding, unlike the matrix case, where the rank is uniquely determined. High-dimensional tensors pose significant challenges due to the curse of dimensionality, where storage and computational demands grow exponentially with the number of modes or dimensions. For instance, a third-order tensor of dimensions 100 \times 100 \times 100 already requires storing one million entries, and the count grows exponentially as modes are added, quickly rendering direct storage and manipulation infeasible. This leads to ill-posed problems in direct tensor analysis, as the sheer data volume obscures underlying patterns and amplifies sensitivity to noise or perturbations. The need for tensor decomposition arose historically in psychometrics, where early efforts to model multidimensional data structures highlighted the limitations of matrix-based factor analysis. In 1944, Raymond B. Cattell introduced the concept of parallel proportional profiles to address factor rotation in multi-way data, laying groundwork for the low-rank approximations that would later formalize tensor decompositions. To illustrate, consider a synthetic third-order tensor of CP rank R, expressed as \mathcal{X} = \sum_{r=1}^R \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r, where each term is a rank-1 component formed by the outer product of vectors \mathbf{a}_r, \mathbf{b}_r, and \mathbf{c}_r from the respective modes. This form demonstrates how tensors can be built from simpler, interpretable building blocks, motivating decomposition techniques for real-world data reduction.
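As a minimal NumPy sketch (with randomly drawn, purely illustrative factors), such a synthetic CP-rank-R tensor can be assembled by summing the R outer products:
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 5, 4, 3, 2        # hypothetical mode sizes and target rank

# Draw factor vectors a_r, b_r, c_r for each of the R rank-1 components.
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# X = sum_r a_r o b_r o c_r, written as a single einsum over the shared component index r.
X = np.einsum('ir,jr,kr->ijk', A, B, C)
print(X.shape)                 # (5, 4, 3); the dense tensor has I*J*K entries, the factors only (I+J+K)*R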

Core Decomposition Methods

CANDECOMP/PARAFAC (CP) Decomposition

The CANDECOMP/PARAFAC (CP) decomposition factorizes an N-way tensor \mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N} into a sum of R rank-one tensors, providing a minimal parameterization for low-rank approximations. This model expresses the tensor as \mathcal{X} \approx \sum_{r=1}^R \mathbf{a}_r \circ \mathbf{b}_r \circ \cdots \circ \mathbf{z}_r, where R denotes the CP rank, \circ represents the outer product, and the vectors \mathbf{a}_r \in \mathbb{R}^{I_1}, \mathbf{b}_r \in \mathbb{R}^{I_2}, ..., \mathbf{z}_r \in \mathbb{R}^{I_N} serve as columns of the factor matrices \mathbf{A} \in \mathbb{R}^{I_1 \times R}, \mathbf{B} \in \mathbb{R}^{I_2 \times R}, ..., \mathbf{Z} \in \mathbb{R}^{I_N \times R}. The formulation generalizes factor analysis from two-way to multi-way data, capturing latent structures through these rank-one components. Independently proposed by Carroll and Chang in 1970 under the name CANDECOMP and by Harshman in the same year as PARAFAC, the method emerged from psychometrics and multidimensional scaling to address multi-mode data analysis. A defining feature is its essential uniqueness: under mild conditions, the factor matrices are unique up to permutation of the components and scaling of the vectors within each rank-one term (with the per-mode scale factors of each term multiplying to one). Kruskal's condition establishes a sufficient criterion for this uniqueness in the three-way case, requiring that the sum of the Kruskal ranks of the factor matrices satisfies k_{\mathbf{A}} + k_{\mathbf{B}} + k_{\mathbf{C}} \geq 2R + 2, where the Kruskal rank k_{\mathbf{U}} of a matrix \mathbf{U} \in \mathbb{R}^{I \times R} is the largest k such that every set of k columns of \mathbf{U} is linearly independent. The decomposition offers advantages in parsimony, using the fewest parameters (specifically, \sum_{n=1}^N I_n R - (N-1)R after accounting for scaling freedoms) compared to other methods such as the Tucker decomposition, and it facilitates interpretable additive components akin to principal components in multi-way settings. However, a notable disadvantage is the computational challenge of determining the CP rank R, as finding the minimal R for an exact decomposition is NP-hard. In chemometrics, the three-way CP model is widely applied to fluorescence excitation-emission matrices measured across samples, decomposing the resulting tensor to identify individual fluorophores. For instance, the least-squares objective based on the mode-1 unfolding becomes \min_{\mathbf{A}, \mathbf{B}, \mathbf{C}} \left\| \mathbf{X}_{(1)} - \mathbf{A} (\mathbf{C} \odot \mathbf{B})^T \right\|_F^2, where \odot is the Khatri-Rao product, \mathbf{X}_{(1)} \in \mathbb{R}^{I_1 \times (I_2 I_3)} is the unfolded tensor, and \|\cdot\|_F denotes the Frobenius norm; this isolates pure emission and excitation profiles while linking them to the concentration profiles in \mathbf{C}.
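To make the unfolded objective concrete, the following NumPy sketch (reusing the Fortran-order unfolding convention of the earlier example, an assumption of this illustration) verifies that \mathbf{X}_{(1)} = \mathbf{A} (\mathbf{C} \odot \mathbf{B})^T holds exactly for a noiseless rank-R tensor:
import numpy as np

def unfold(X, mode):
    # Mode-n matricization with mode-n fibers as columns (Fortran ordering).
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def khatri_rao(U, V):
    # Column-wise Kronecker product: column r is kron(U[:, r], V[:, r]).
    return np.einsum('kr,jr->kjr', U, V).reshape(-1, U.shape[1])

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 6, 3
A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# For an exact CP tensor the unfolded residual ||X_(1) - A (C ⊙ B)^T||_F is zero.
residual = np.linalg.norm(unfold(X, 0) - A @ khatri_rao(C, B).T)
print(residual)   # ~1e-14, numerical round-off only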

Tucker Decomposition

The Tucker decomposition, also known as the higher-order singular value decomposition (HOSVD) in its orthogonal form, represents a tensor as the mode-wise product of a smaller core tensor and a set of factor matrices, serving as a natural extension of principal component analysis to multi-way data. Originally proposed by Ledyard R. Tucker in 1966 for three-mode factor analysis, the method decomposes a third-order tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K} approximately as \mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C}, where \mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times R_3} is the core tensor capturing multi-way interactions, and \mathbf{A} \in \mathbb{R}^{I \times R_1}, \mathbf{B} \in \mathbb{R}^{J \times R_2}, \mathbf{C} \in \mathbb{R}^{K \times R_3} are the factor matrices along each mode, often constrained to be orthogonal. This formulation generalizes to N-way tensors as \mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)}, with the core \mathcal{G} typically smaller than \mathcal{X} to enable compression and reveal latent structures. The multilinear rank of the tensor is the tuple (R_1, R_2, \dots, R_N), where each R_n specifies the size of the core tensor \mathcal{G} along mode n and corresponds to the rank of the mode-n unfolding, preserving the essential multi-way variability of the data. Unlike the CP decomposition, which lacks a core tensor and offers essential uniqueness under mild conditions, the Tucker model does not possess essential uniqueness due to rotational freedom in the core tensor and factors, allowing transformations like \mathcal{G}' = \mathcal{G} \times_1 \mathbf{Q}_1 \times_2 \mathbf{Q}_2 \times_3 \mathbf{Q}_3 and \mathbf{A}' = \mathbf{A} \mathbf{Q}_1^{-1} for invertible (e.g., orthogonal) \mathbf{Q}_n without altering the reconstruction. However, the HOSVD variant, introduced by De Lathauwer et al. in 2000, imposes orthonormality on all factor matrices, yielding a quasi-optimal low-multilinear-rank approximation in the Frobenius-norm sense for the specified ranks. This ensures that the decomposition captures the principal components along each mode independently, facilitating interpretability and subspace identification. To illustrate, consider a three-way tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K}. The HOSVD computes the factor matrices by performing singular value decompositions (SVDs) on the mode-n unfoldings: \mathbf{X}_{(1)} = \mathbf{U}^{(1)} \mathbf{\Sigma}^{(1)} (\mathbf{V}^{(1)})^T yields \mathbf{A} = \mathbf{U}^{(1)} (retaining the first R_1 columns), and similarly \mathbf{B} is obtained from \mathbf{X}_{(2)} and \mathbf{C} from \mathbf{X}_{(3)}. The core tensor is then obtained as \mathcal{G} = \mathcal{X} \times_1 \mathbf{A}^T \times_2 \mathbf{B}^T \times_3 \mathbf{C}^T, where the mode products project \mathcal{X} onto the orthogonal subspaces spanned by the factors, resulting in a core whose entries reflect the multilinear correlations. This process, while not minimizing the full tensor approximation error, provides a quasi-optimal solution that is computationally efficient and widely used for initial tensor approximations.
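A compact NumPy sketch of the truncated HOSVD procedure just described (illustrative only; the Fortran-order unfolding convention of the earlier examples is assumed, and the chosen multilinear rank is arbitrary):
import numpy as np

def unfold(X, mode):
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def hosvd(X, ranks):
    # Truncated HOSVD: the leading left singular vectors of each unfolding give the factors.
    factors = []
    for n, Rn in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        factors.append(U[:, :Rn])
    # Core: project X onto the factor subspaces, G = X x_1 A^T x_2 B^T x_3 C^T.
    G = X
    for n, U in enumerate(factors):
        G = np.moveaxis(np.tensordot(U.T, G, axes=(1, n)), 0, n)
    return G, factors

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 5, 4))
G, (A, B, C) = hosvd(X, ranks=(3, 3, 2))
Xhat = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)        # reconstruction from core and factors
print(np.linalg.norm(X - Xhat) / np.linalg.norm(X))      # relative error for the chosen multilinear rank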

Advanced Decomposition Techniques

Tensor Train (TT) Decomposition

The tensor train (TT) decomposition represents a high-order tensor as a sequence of lower-dimensional cores connected in a linear chain, enabling efficient handling of very high-dimensional data without the exponential storage costs of the full tensor. Introduced by Ivan Oseledets in 2011, the format was motivated by the need to solve high-dimensional problems such as partial differential equations (PDEs), where the number of modes can range from around 10 to 1000 or more. Unlike traditional decompositions that struggle with the curse of dimensionality, the TT format maintains low-rank structure through a chain of matrix multiplications, facilitating scalable computation of tensor operations such as addition and contraction. Formally, for an N-way tensor \mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_N}, the TT decomposition expresses each element as \mathcal{X}_{i_1 \dots i_N} = \mathbf{G}^{(1)}_{i_1} \mathbf{G}^{(2)}_{i_2} \cdots \mathbf{G}^{(N)}_{i_N}, where \mathbf{G}^{(k)} is the k-th TT core, a three-dimensional array of size r_{k-1} \times n_k \times r_k with r_0 = r_N = 1, and \mathbf{G}^{(k)}_{i_k} denotes the r_{k-1} \times r_k slice for i_k \in \{1, \dots, n_k\}. The bond dimensions r_k (for k=1, \dots, N-1) control the rank of the decomposition, and the TT-rank is defined as the tuple (r_1, \dots, r_{N-1}). For storage efficiency, the cores are often stored in matricized (unfolded) form, where each \mathbf{G}^{(k)} is represented as a matrix of size r_{k-1} n_k \times r_k, reducing overhead while preserving the chain structure. This formulation allows the full tensor to be reconstructed on the fly without explicit storage of all \prod_{k=1}^N n_k elements. A key property of the TT decomposition is its compression efficiency: assuming uniform mode sizes n_k = n and maximum bond dimension r = \max_k r_k, the storage requirement is O(N r^2 n), which is linear in the tensor order N compared to the exponential O(n^N) for the dense tensor. This makes TT particularly suitable for ultra-high-order tensors, where the bond ranks r_k remain small due to the intrinsic low-rank nature of many physical systems. Additionally, the format supports structure-preserving operations, such as arithmetic (e.g., addition and elementwise multiplication of TT tensors) and reshaping, which can be performed directly on the cores without full reconstruction. The TT-SVD algorithm constructs the decomposition by sequentially applying singular value decompositions (SVDs) to matricized unfoldings of the tensor: starting from the left, it unfolds the remaining data along the current mode, computes the SVD, retains the leading r_k singular values, and proceeds to the next core, yielding a numerically stable construction. The approximation quality of TT-SVD is quantified by an error bound in the Frobenius norm: for a TT approximation \mathcal{B} with bond ranks r_k, the error satisfies \|\mathcal{X} - \mathcal{B}\|_F \leq \sqrt{\sum_{k=1}^{N-1} \sum_{j > r_k} \sigma_{j,k}^2}, where \sigma_{j,k} are the singular values from the k-th sequential unfolding. This bound ensures that truncation errors accumulate conservatively, often yielding near-optimal compression for tensors with rapidly decaying singular values, as seen in applications to discretized high-dimensional operators.
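The following NumPy sketch illustrates the left-to-right TT-SVD sweep and element-wise reconstruction described above (an illustrative implementation assuming C-order reshapes throughout; with max_rank chosen large enough, no truncation occurs and the reconstruction is exact up to round-off):
import numpy as np

def tt_svd(X, max_rank):
    # Sequential SVD sweep: peel off one TT core per mode, truncating each bond to max_rank.
    dims = X.shape
    N = len(dims)
    cores, r_prev = [], 1
    C = X.reshape(r_prev * dims[0], -1)
    for k in range(N - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))         # k-th TT core
        C = (s[:r, None] * Vt[:r]).reshape(r * dims[k + 1], -1)    # carry the remainder forward
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))                   # last core, r_N = 1
    return cores

def tt_element(cores, idx):
    # X[i1, ..., iN] = G1[:, i1, :] @ G2[:, i2, :] @ ... @ GN[:, iN, :]
    out = cores[0][:, idx[0], :]
    for G, i in zip(cores[1:], idx[1:]):
        out = out @ G[:, i, :]
    return out.item()

X = np.random.default_rng(3).standard_normal((4, 5, 6, 3))
cores = tt_svd(X, max_rank=20)
print([G.shape for G in cores])                         # bond dimensions of the chain
print(X[1, 2, 3, 0], tt_element(cores, (1, 2, 3, 0)))   # equal up to round-off (no truncation here)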

Hierarchical Tucker (HT) Decomposition

The Hierarchical Tucker (HT) decomposition represents a high-order tensor using a multi-level format structured as a dimension tree, where non-leaf nodes store small transfer tensors and leaf nodes store factor matrices. This format was introduced by Hackbusch and Kühn in 2009, initially motivated by applications in high-dimensional numerical analysis, such as the approximation of integral operators. Unlike linear chain formats such as the tensor train, the HT approach employs a general tree topology that clusters modes hierarchically, enabling more flexible low-rank approximations for tensors with balanced dimensions. In the HT formulation, a tensor \mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_N} is approximated by recursively factoring subspaces along the tree branches. For a binary tree example, the root-level approximation can be expressed as \mathcal{X} \approx \sum_{r=1}^{R_0} \mathbf{b}_r^{(1)} \circ \mathcal{G}_r^{(2,\dots,N)}, where \mathbf{b}_r^{(1)} \in \mathbb{R}^{d_1} are basis vectors, R_0 is the root rank, and \mathcal{G}_r^{(2,\dots,N)} are intermediate low-rank tensors further decomposed at child nodes, such as \mathcal{G}_r^{(2,\dots,N)} \approx \sum_{s=1}^{R_1} \mathbf{U}_s^{(2)} \circ \mathcal{M}_s^{(3,\dots,N)} with factor matrices \mathbf{U}^{(i)} \in \mathbb{R}^{d_i \times r_i} at the leaves and transfer tensors \mathcal{M} at the internal nodes. The HT-rank is defined as the tuple of ranks (r_\nu)_{\nu \in \mathcal{T}} assigned to each node \nu in the tree \mathcal{T}, quantifying the low-rank structure at every clustering level. The storage complexity of an HT decomposition of an Nth-order tensor with equal mode sizes d and maximum rank R is O(N d R + N R^3) for a balanced binary tree, significantly reducing memory requirements compared to the O(d^N) of the full tensor (or the O(R^N) core of a Tucker representation) for large N. This efficiency arises from storing only the factor matrices at the leaves (total size O(N d R)) and compact transfer tensors at the non-leaf nodes (dominated by O(R^3) per internal node across approximately N nodes). A common method for constructing the HT decomposition is the hierarchical singular value decomposition (HT-SVD), which builds the approximation recursively by computing SVDs on matricizations of tensor clusters. Starting from the leaves, the algorithm performs an SVD on the unfolding of each subtree cluster, retains the leading singular vectors up to the prescribed node rank, and proceeds bottom-up to the root while orthogonalizing bases to preserve near-optimal approximation quality. The error is controlled by singular value truncation at each level, and the relative approximation error is bounded by the accumulation of the local truncation errors propagated through the tree. For instance, consider a 7-way tensor with a balanced tree structure: the root node has rank R_0, its two children (clustering 3 and 4 modes, respectively) have ranks R_1 and R_2, and further subdivisions continue down to the leaves. The HT-SVD constructs factor matrices for each leaf mode and transfer tensors at the internal nodes by sequential SVDs on the cluster unfoldings, with the overall error analyzed via the decay of singular values at each truncation step.
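A quick back-of-the-envelope Python sketch of the storage estimate, using hypothetical sizes (N = 7 modes, mode size d = 50, all HT ranks equal to R = 10):
# Hypothetical sizes chosen purely for illustration.
N, d, R = 7, 50, 10

full_storage = d ** N                   # dense tensor: d^N entries
leaf_storage = N * d * R                # one d x R factor matrix per leaf
# A balanced binary dimension tree over N leaves has N - 1 internal nodes,
# each holding an (at most) R x R x R transfer tensor.
transfer_storage = (N - 1) * R ** 3
ht_storage = leaf_storage + transfer_storage

print(full_storage)                     # 781,250,000,000 entries
print(ht_storage)                       # 9,500 entries
print(full_storage // ht_storage)       # compression factor of roughly 8 * 10^7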

Computational Approaches

Alternating Least Squares (ALS)

The alternating least squares (ALS) method serves as a foundational block-coordinate optimization technique for fitting low-rank tensor decompositions by iteratively minimizing the squared Frobenius norm of the approximation error, \|\mathcal{X} - \hat{\mathcal{X}}\|_F^2, where \mathcal{X} is the observed tensor and \hat{\mathcal{X}} is its low-rank reconstruction. Introduced as a core algorithm for the PARAFAC model by Harshman in 1970, ALS cyclically updates one factor matrix at a time while holding the others fixed, solving a linear least-squares problem for each update. This approach leverages the separability of the objective function across modes, making it applicable to both CP and Tucker decompositions. In the general ALS framework, the update for a specific factor matrix, say \mathbf{A}, involves unfolding the tensor along the corresponding mode and solving \mathbf{A} \leftarrow \arg\min_{\mathbf{A}} \|\mathbf{X}_{(1)} - \mathbf{A} \mathbf{Z}^T\|_F^2, where \mathbf{Z} is a matrix constructed from the Khatri-Rao product of the other, fixed factor matrices. The process cycles through all modes until convergence, with the error non-increasing at each iteration due to the optimality of the individual least-squares solutions. Typically, ALS converges to a local minimum or stationary point, as the overall problem is non-convex, though the monotonic decrease in the objective ensures practical termination. For the CP decomposition, the unfolding-based updates explicitly use the Khatri-Rao product; for an N-way tensor with factor matrices \mathbf{A}^{(n)} (n=1,\dots,N) and rank R, the update for \mathbf{A}^{(1)} sets \mathbf{Z} = \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(2)}, where \odot denotes the Khatri-Rao product, yielding \mathbf{A}^{(1)} = \mathbf{X}_{(1)} \mathbf{Z} (\mathbf{Z}^T \mathbf{Z})^{-1}. In the Tucker decomposition, updates involve higher-order unfoldings and the Kronecker product of factor matrices, such as updating the core tensor \mathcal{G} via multilinear products while fixing the factor matrices, or vice versa for each mode-n matricization \mathbf{X}_{(n)} \approx \mathbf{A}^{(n)} \mathbf{G}_{(n)} (\mathbf{A}^{(N)} \otimes \cdots \otimes \mathbf{A}^{(n+1)} \otimes \mathbf{A}^{(n-1)} \otimes \cdots \otimes \mathbf{A}^{(1)})^T. Initialization is crucial for avoiding poor local minima and is often performed using the higher-order singular value decomposition (HOSVD) to obtain initial factor matrices from the leading singular vectors of each mode unfolding, or via random matrices for simplicity. ALS offers simplicity of implementation and parallelization across modes, as each update is an independent least-squares solve, but it can exhibit slow convergence or stagnation in ill-conditioned problems due to sensitivity to initialization and the blockwise nature of the updates. Common stopping criteria include a relative change in the fit error below 10^{-6} or a maximum number of iterations, such as 100, to balance computation and accuracy. The following NumPy sketch implements CP-ALS for a third-order tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K} with rank-R factors \mathbf{A} \in \mathbb{R}^{I \times R}, \mathbf{B} \in \mathbb{R}^{J \times R}, \mathbf{C} \in \mathbb{R}^{K \times R} (Fortran-order unfoldings are assumed so that \mathbf{X}_{(1)} = \mathbf{A} (\mathbf{C} \odot \mathbf{B})^T holds with the Khatri-Rao helper below):
import numpy as np

def unfold(X, mode):
    # Mode-n matricization X_(n): mode-n fibers become columns.
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def khatri_rao(U, V):
    # Column-wise Kronecker product: column r is kron(U[:, r], V[:, r]).
    return np.einsum('kr,jr->kjr', U, V).reshape(-1, U.shape[1])

def cp_als(X, R, tol=1e-6, maxiter=100, seed=0):
    # Random initialization of A, B, C (an HOSVD-based initialization is a common alternative).
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))
    fit = np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) ** 2
    for _ in range(maxiter):
        # Update A: solve min_A ||X_(1) - A (C ⊙ B)^T||_F^2
        Z = khatri_rao(C, B)
        A = unfold(X, 0) @ Z @ np.linalg.pinv(Z.T @ Z)
        A = A / np.linalg.norm(A, axis=0)        # normalize columns; the scale moves into later factors
        # Update B
        Z = khatri_rao(C, A)
        B = unfold(X, 1) @ Z @ np.linalg.pinv(Z.T @ Z)
        B = B / np.linalg.norm(B, axis=0)
        # Update C (left unnormalized so it carries the component magnitudes)
        Z = khatri_rao(B, A)
        C = unfold(X, 2) @ Z @ np.linalg.pinv(Z.T @ Z)
        # Relative decrease of the squared reconstruction error as the stopping criterion
        newfit = np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) ** 2
        if fit > 0 and (fit - newfit) / fit < tol:
            break
        fit = newfit
    return A, B, C
The sketch computes the unfoldings \mathbf{X}_{(n)} explicitly and uses the pseudoinverse for each least-squares solve; normalizing the columns of \mathbf{A} and \mathbf{B} helps maintain numerical stability, with the component scales absorbed into \mathbf{C}.
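A brief usage sketch on a synthetic rank-3 tensor, reusing np and the cp_als helper defined in the listing above (the near-zero error is expected here only because the data are exactly rank 3):
rng = np.random.default_rng(4)
A0, B0, C0 = (rng.standard_normal((d, 3)) for d in (20, 15, 10))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)            # exact rank-3 tensor of size 20 x 15 x 10

A, B, C = cp_als(X, R=3)
Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(X - Xhat) / np.linalg.norm(X))   # relative error, close to zero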

Optimization-Based Methods

Optimization-based methods for tensor decomposition treat the problem as a minimization over the factor matrices, enabling the use of continuous optimization techniques such as gradient descent. For the CANDECOMP/PARAFAC (CP) decomposition of a third-order tensor \mathcal{X} \in \mathbb{R}^{I \times J \times K}, the objective is to minimize f = \frac{1}{2} \|\mathcal{X} - \hat{\mathcal{X}}\|_F^2, where \hat{\mathcal{X}} = \sum_{r=1}^R \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r and \circ denotes the outer product. The partial derivative with respect to an element a_{ir} of the first factor matrix \mathbf{A} is given by \frac{\partial f}{\partial a_{ir}} = - \left( \mathcal{X} \times_2 \mathbf{b}_r^\top \times_3 \mathbf{c}_r^\top \right)_i + \sum_{\ell=1}^R \lambda_{r\ell}^{(1)} a_{i\ell}, where \lambda_{r\ell}^{(1)} = (\mathbf{b}_r^\top \mathbf{b}_\ell)(\mathbf{c}_r^\top \mathbf{c}_\ell); equivalently, this is the residual-weighted sum -\sum_{j,k} (x_{ijk} - \hat{x}_{ijk}) b_{jr} c_{kr} in the unregularized case. Gradient descent updates the factors iteratively using these derivatives, offering a direct alternative to block-wise methods by performing simultaneous adjustments across all parameters. To handle large-scale tensors where full gradient computation is prohibitive due to memory and time constraints, stochastic gradient descent (SGD) approximates the gradient using random subsets of tensor elements or fibers, enabling scalable fitting of CP decompositions. In SGD variants for CP, the update for a factor matrix leverages a mini-batch gradient, such as \nabla_{\mathbf{A}} \approx -\mathbf{Y}_{(1)} (\mathbf{C} \odot \mathbf{B}) for the matricized residual \mathbf{Y}_{(1)}, processed in passes over sparse or dense data without storing the full unfolded tensor. This approach has demonstrated efficiency on tensors with billions of elements, achieving convergence comparable to exact methods while reducing per-iteration costs. Regularization is incorporated into the objective to promote desirable properties, such as balancing scaling ambiguities or inducing sparsity. A common L2 penalty adds terms like \frac{\lambda}{2} \sum_n \|\mathbf{A}^{(n)}\|_F^2 to the loss, modifying the gradient to \frac{\partial \hat{f}}{\partial \mathbf{A}^{(n)}} = - \mathbf{Z}_{(n)} \mathbf{A}_{(-n)} + \mathbf{A}^{(n)} \boldsymbol{\Lambda}^{(n)} + \lambda \mathbf{A}^{(n)}, where \mathbf{Z}_{(n)} is the mode-n unfolding of the data, \mathbf{A}_{(-n)} is the Khatri-Rao product of the remaining factors, and \boldsymbol{\Lambda}^{(n)} is the Hadamard product of their Gram matrices; this stabilizes optimization for ill-conditioned problems. For sparsity, L1 penalties on factor elements encourage zero loadings, integrated via proximal gradient steps in the optimization loop. In Tucker decomposition, orthogonality constraints on factor matrices are enforced using Riemannian optimization on the Stiefel manifold, where gradients are projected onto the tangent space to maintain orthonormality during descent, improving convergence for higher-order approximations. Advanced all-at-once methods accelerate convergence beyond basic gradient descent by exploiting approximate second-order information. The Gauss-Newton method linearizes the residual around the current factors and solves least-squares subproblems iteratively, often outperforming first-order methods in fewer steps for CP fitting, with parallel implementations achieving near-linear speedup on multi-core systems. Levenberg-Marquardt extends this by blending gradient-descent and Gauss-Newton steps via a damping parameter, enhancing robustness to poor initializations.
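As a minimal NumPy sketch of these gradient expressions (using the same Fortran-order unfolding and Khatri-Rao conventions as the earlier listings, with an assumed L2 weight lam), the full gradient with respect to \mathbf{A} and a single descent step can be written as:
import numpy as np

def unfold(X, mode):
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def khatri_rao(U, V):
    return np.einsum('kr,jr->kjr', U, V).reshape(-1, U.shape[1])

def grad_A(X, A, B, C, lam=0.0):
    # d/dA of 1/2 ||X - [[A, B, C]]||_F^2 + lam/2 ||A||_F^2
    # = -X_(1) (C ⊙ B) + A [(B^T B) * (C^T C)] + lam A, with * the Hadamard product.
    Z = khatri_rao(C, B)
    return -unfold(X, 0) @ Z + A @ ((B.T @ B) * (C.T @ C)) + lam * A

rng = np.random.default_rng(5)
I, J, K, R = 6, 5, 4, 2
X = rng.standard_normal((I, J, K))
A, B, C = (0.1 * rng.standard_normal((d, R)) for d in (I, J, K))
A = A - 0.01 * grad_A(X, A, B, C, lam=1e-3)   # one full-batch gradient-descent step on A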
Bayesian formulations introduce priors on the factors (e.g., Gaussian priors on factor entries, or spike-and-slab priors for sparsity) to quantify uncertainty, with variational inference or MCMC sampling providing posterior distributions over decompositions, which is useful for noisy data. Scalability to massive tensors is achieved through distributed frameworks, where unfoldings and Khatri-Rao products are parallelized across nodes; for instance, distributed pipelines compute partial gradients by partitioning tensor slices, enabling decomposition of terabyte-scale data with cost roughly proportional to its size. As an illustrative example for the Tensor Train (TT) decomposition, SGD with learning-rate scheduling optimizes the cores \mathcal{G}^{(k)} by minimizing \frac{1}{2} \|\mathcal{X} - \mathrm{TT}(\mathcal{G})\|_F^2, where \mathrm{TT}(\mathcal{G}) denotes the tensor reconstructed from the cores and the updates use stochastic approximations of the gradient:
Initialize TT cores G^{(1)}, ..., G^{(d)} with small random values
η_0 = initial learning rate (e.g., 0.01)
For t = 1 to max_iter:
    Sample mini-batch indices for tensor elements
    Compute stochastic gradient ∇_t ≈ ∂/∂G^{(k)} [1/2 ∑ (x_i - TT(G)_i)^2 ] for each core k
    η_t = η_0 / (1 + γ t)  # Learning rate decay with γ = 0.001
    G^{(k)} ← Proj(G^{(k)} - η_t ∇_t)  # Proj ensures TT orthogonality if constrained
    If convergence: break
This pseudocode, applied to the TT cores, supports efficient training on high-dimensional data by processing batches sequentially, with the learning-rate decay preventing overshooting.

Applications and Extensions

In Signal Processing and Chemometrics

In signal processing, tensor decompositions facilitate the analysis of multi-dimensional array data from sensor networks. The CANDECOMP/PARAFAC (CP) decomposition is particularly effective for blind source separation in multi-sensor environments, where signals from multiple sources are mixed and observed across spatial, temporal, and frequency modes. By modeling the received data as a third-order tensor and decomposing it into rank-1 components, CP enables the recovery of individual source signals without prior knowledge of the mixing process, as demonstrated in applications to multi-user communication systems where the number of active users corresponds to the tensor rank. This approach leverages the uniqueness properties of CP under mild conditions to separate nondisjoint sources in underdetermined scenarios. Tensor-based beamforming extends these principles to array signal processing, enhancing direction-of-arrival estimation and interference suppression in multi-antenna systems. Using CP or Tucker decompositions, the tensor of received signals is factored to model multidimensional structures, such as in multiple-input multiple-output (MIMO) radar, where it captures spatial and temporal correlations for robust adaptive beamforming. This method outperforms traditional vectorized techniques by preserving the inherent multi-way structure of the data, leading to improved robustness in noisy environments. In chemometrics, PARAFAC decomposition plays a central role in multi-way calibration for spectroscopic and chromatographic data, enabling the extraction of chemically meaningful profiles from complex mixtures. For fluorescence measurements, PARAFAC decomposes excitation-emission-time data cubes into trilinear components, each representing the pure emission spectra, excitation profiles, and concentration time courses of individual fluorophores, even in the presence of overlapping signals. This trilinear structure removes the rotational ambiguity of bilinear models, allowing direct interpretation of the components as chemical species. In chromatography-mass spectrometry, PARAFAC supports multi-way calibration by resolving elution profiles, mass spectra, and sample concentrations in comprehensive two-dimensional gas chromatography (GC × GC) data, facilitating quantification of target analytes in complex matrices with minimal sample preparation. Such applications in hyphenated techniques reduce the need for complete chromatographic separation, enhancing throughput in environmental and pharmaceutical analyses. A notable case study involves Tucker decomposition for artifact removal and dimensionality reduction in electroencephalography (EEG) tensors structured along time, frequency, and channel modes. By retaining a low-rank core tensor while discarding higher-order components, the decomposition achieves significant data compression while preserving essential brain-activity signals in low-SNR recordings. This approach separates artifacts like eye blinks or muscle activity from neural responses, enabling cleaner feature extraction for brain-computer interfaces. Tensor decompositions offer key advantages in these domains by handling multi-modal data natively, avoiding the information loss associated with vectorization or matricization that flattens multi-way interactions into matrices. This preservation of structure supports more interpretable models and higher-fidelity reconstructions, often quantified by the relative squared error \frac{\|\mathcal{X} - \hat{\mathcal{X}}\|_F^2}{\|\mathcal{X}\|_F^2}, where \mathcal{X} is the original tensor and \hat{\mathcal{X}} its approximation, typically achieving values below 0.05 for well-fitted models in signal and chemical data.
However, these methods can be sensitive to outliers, such as spurious sensor readings or contaminated samples, which distort the low-rank approximation; robust variants, like those incorporating cellwise outlier detection, mitigate this by downweighting anomalies during fitting.
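The relative squared error quoted above is straightforward to compute once a reconstruction is available; the short NumPy sketch below uses a hypothetical noisy rank-1 tensor purely for illustration:
import numpy as np

def relative_squared_error(X, Xhat):
    # ||X - Xhat||_F^2 / ||X||_F^2, the fit metric used for signal and chemical data
    return np.linalg.norm(X - Xhat) ** 2 / np.linalg.norm(X) ** 2

rng = np.random.default_rng(6)
a, b, c = rng.standard_normal(8), rng.standard_normal(6), rng.standard_normal(4)
signal = np.einsum('i,j,k->ijk', a, b, c)              # noise-free rank-1 "source"
X = signal + 0.05 * rng.standard_normal(signal.shape)  # observed, noisy tensor
print(relative_squared_error(X, signal))               # small value, on the order of the relative noise power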

In Machine Learning and Data Compression

Tensor decompositions play a pivotal role in machine learning by enabling the compression of high-dimensional data into lower-rank components, which enhances model efficiency and interpretability. In recommender systems, the CANDECOMP/PARAFAC (CP) decomposition is commonly applied to user-item-context tensors to capture multi-way interactions, such as user preferences influenced by time or location, outperforming traditional matrix-based methods in sparse data scenarios. For instance, CP approximates the rating tensor \mathcal{R} \in \mathbb{R}^{I \times J \times K} as \mathcal{R} \approx \sum_{r=1}^R \mathbf{u}_r \circ \mathbf{i}_r \circ \mathbf{c}_r, where \mathbf{u}_r, \mathbf{i}_r, and \mathbf{c}_r represent latent factors for users, items, and context, respectively, allowing scalable imputation of missing entries. In neural network compression, tensor train (TT) decomposition is widely used to represent weight tensors in a compact format, significantly reducing the number of parameters while preserving performance. By decomposing multi-dimensional weight tensors into a sequence of lower-dimensional cores, TT formats enable substantial reductions in model size for recommendation models, with maintained or even improved accuracy on prediction tasks. This approach is particularly effective for convolutional and recurrent layers, where the structured format minimizes storage and computation overhead on resource-constrained devices. Tensor decompositions also facilitate data compression for large-scale datasets by exploiting multi-way correlations, making them suitable for applications like video analysis. For video data modeled as space-time-color tensors, TT and hierarchical Tucker (HT) decompositions achieve compression ratios up to 3:1 while maintaining visual quality metrics such as a peak signal-to-noise ratio (PSNR) around 30 dB. Post-decomposition quantization further enhances compression by reducing the bit precision of the factor matrices, as seen in hybrid TT-quantization schemes that combine decomposition with 8-bit encoding to achieve substantial reductions in deep CNN models. A notable application involves HT decomposition in recommender systems, where it models user-item interactions as higher-order tensors to capture complex multi-way dependencies beyond pairwise relations. Unlike matrix factorization, HT captures hierarchical structures in the data, leading to improved prediction accuracy. Extensions of tensor decompositions in machine learning include tensor regression models for handling multi-way inputs, formulated as \mathbf{y} = \mathcal{X} \times_1 \mathbf{B}, where \mathcal{X} is the input tensor, \times_1 denotes the mode-1 tensor-matrix product, and \mathbf{B} is the coefficient matrix, enabling predictions from structured covariates like images or graphs. Integration with deep learning is supported by libraries such as TensorLy, which provides seamless APIs for tensorized neural networks, allowing TT and HT formats to be embedded in PyTorch or TensorFlow backends for end-to-end training of compressed models. Recent advances post-2020 have incorporated tensor decompositions into federated learning frameworks to ensure privacy preservation, where local tensor factorizations on client devices prevent direct sharing of raw data. For instance, coupled decomposition in federated settings aggregates low-rank updates across distributed nodes, reducing communication overhead while maintaining privacy guarantees in tasks like personalized recommendation.
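To make the parameter-count argument concrete, the following back-of-the-envelope sketch (with purely illustrative sizes and an assumed uniform TT-rank) compares a dense 1024 x 1024 weight matrix against a plain TT representation of the same values viewed as a 10-way tensor with mode size 4:
dense_params = 1024 * 1024                          # 1,048,576 weights in the dense layer

# View the weight matrix as a 10-way tensor with mode size 4 (4^10 = 1024^2 entries).
n, d, r = 4, 10, 8                                  # mode size, order, uniform TT-rank (assumed)
tt_params = (1 * n * r) + (d - 2) * (r * n * r) + (r * n * 1)   # first, middle, and last cores
print(dense_params, tt_params)                      # 1048576 versus 2112 parameters
print(round(dense_params / tt_params))              # compression factor of roughly 500
Practical TT layers usually pair input and output mode factors in a TT-matrix format rather than flattening the weights as done here, but the linear-in-order scaling of the parameter count is the same.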
