Partial least squares regression
Partial least squares regression (PLSR), also known as projection onto latent structures, is a multivariate statistical technique that models the linear relationship between a predictor matrix X (containing independent variables) and a response matrix Y (containing dependent variables) by extracting latent variables that maximize the covariance between X and Y, thereby addressing issues like multicollinearity, noise, and high dimensionality in the data.[1] Originating from econometric path modeling developed by Herman Wold in the 1960s, PLSR was adapted and popularized in chemometrics by Svante Wold and Harald Martens in the late 1970s and 1980s, evolving into a versatile tool for predictive modeling across various fields.[1]
PLSR generalizes both principal component analysis (PCA), which decomposes X to explain its variance, and multiple linear regression (MLR), which directly relates X to Y but struggles with collinear predictors; in contrast, PLSR constructs latent variables that prioritize predictive relevance for Y over mere explanation of X. The method employs the nonlinear iterative partial least squares (NIPALS) algorithm, which iteratively deflates X and Y to extract successive orthogonal components, or "latent vectors," until the desired number of components is reached or model criteria are met.[1] This process yields regression coefficients for prediction, variable importance scores (e.g., VIP), and biplots for interpretation, making it robust for scenarios with more variables than observations.[1]
One of the primary advantages of PLSR is its ability to handle ill-conditioned data matrices common in high-dimensional settings, such as near-infrared spectroscopy or genomics, where traditional MLR fails due to overfitting or instability.[1] It also supports multiple responses in Y, enabling simultaneous modeling of several outcomes, and provides diagnostic tools like cross-validation to assess model quality and avoid overparameterization. Applications span chemometrics (e.g., quantitative structure-activity relationships in drug design), sensory analysis, environmental monitoring,[1] and neuroimaging, where PLSR variants like behavioral PLS are used to correlate brain activity patterns with cognitive tasks.[2] Despite its strengths, PLSR assumes linearity and can be sensitive to outliers, often requiring preprocessing like centering and scaling of variables.
Introduction and Core Concepts
Definition and Motivation
Partial least squares regression (PLSR), also known as partial least squares (PLS), is a multivariate statistical technique that models the relationships between a matrix of predictor variables \mathbf{X} and a matrix of response variables \mathbf{Y} by extracting latent variables that maximize the covariance between projections of \mathbf{X} and \mathbf{Y}.[3] This approach integrates features of principal component analysis, which focuses on variance in \mathbf{X}, and multiple regression, which predicts \mathbf{Y} from \mathbf{X}, to create components relevant to both sets of variables.[3] Unlike traditional methods, PLSR constructs these latent variables iteratively, ensuring they capture the strongest linear relationships while reducing the impact of irrelevant variation.[4]
The primary motivation for PLSR arises from the limitations of ordinary least squares (OLS) regression in handling complex datasets, particularly those exhibiting multicollinearity among predictors, high dimensionality where the number of variables greatly exceeds the number of observations (p \gg n), and substantial noise.[3] In such cases, OLS often fails due to unstable coefficient estimates and overfitting, as the covariance matrix of \mathbf{X} becomes singular or ill-conditioned.[3] PLSR addresses these issues by performing dimension reduction that prioritizes directions of maximum covariance with \mathbf{Y}, thereby providing robust predictions even in noisy environments.[4] PLSR was introduced by Herman Wold in the 1960s as an econometric method and adapted for chemometrics in the 1970s as a solution to calibration challenges in analytical chemistry, where relating instrumental responses to chemical concentrations is complicated by correlated measurements.
A key application of PLSR lies in predictive modeling for domains with inherently collinear predictors, such as spectroscopy, where adjacent wavelengths in spectral data are highly correlated, leading to redundant information that confounds standard regression.[3] By deflating the data through successive latent components, PLSR simplifies these high-dimensional inputs while preserving predictive power, making it ideal for quantitative analysis in fields like near-infrared spectrometry for quality control.[4] This method extends the exploratory capabilities of principal component analysis into a supervised framework suited for regression tasks.[3]
Core Idea
Partial least squares (PLS) regression operates on a geometric foundation that seeks to identify directions in the predictor space (X) and response space (Y) which capture the strongest linear relationships between the two datasets. At its core, PLS constructs successive pairs of weight vectors, denoted as \mathbf{w} for X and \mathbf{c} for Y, such that the resulting score vectors \mathbf{t} = X\mathbf{w} and \mathbf{u} = Y\mathbf{c} exhibit maximum covariance. This maximization occurs under the constraint that the weight vectors have unit length (\|\mathbf{w}\| = 1, \|\mathbf{c}\| = 1), ensuring that the projections represent directions of highest correlation rather than mere scaling. Geometrically, this process aligns the data by finding hyperplanes in the high-dimensional spaces of X and Y that are as close as possible, prioritizing the shared variance that links predictors to responses.[5][6]
The objective of this search is formally expressed as maximizing the covariance between the scores: \max_{\mathbf{w}, \mathbf{c}} \operatorname{Cov}(\mathbf{t}, \mathbf{u}) = \max_{\mathbf{w}, \mathbf{c}} \mathbf{w}^T X^T Y \mathbf{c}, subject to the unit norm constraints. This formulation, rooted in the NIPALS algorithm developed by Herman Wold, transforms the original variables into latent components that successively explain the covariance structure, starting with the pair that captures the largest shared variation. Unlike scalar products in simpler regressions, this eigenvector-like problem identifies orthogonal directions iteratively, with each new pair deflating the residual matrices to remove the influence of prior components.[7][5]
To extract these components, PLS proceeds through deflation steps: after computing the first score pair (\mathbf{t}_1, \mathbf{u}_1), the loadings are calculated as \mathbf{p}_1 = X^T \mathbf{t}_1 / (\mathbf{t}_1^T \mathbf{t}_1) and \mathbf{q}_1 = Y^T \mathbf{u}_1 / (\mathbf{u}_1^T \mathbf{u}_1), and the matrices X and Y are updated by subtracting the rank-one approximations \mathbf{t}_1 \mathbf{p}_1^T and \mathbf{u}_1 \mathbf{q}_1^T, respectively, yielding residuals that are orthogonal to the previous scores. This verbal description illustrates a stepwise orthogonalization, where each iteration uncovers deeper layers of covariance, building a decomposition akin to a tailored singular value decomposition focused on inter-block relationships rather than intra-block variance. The process continues until the desired number of components is reached or the residuals are negligible, providing a geometric ladder that ascends from dominant to subtle covariances.[6][7]
In contrast to principal component analysis (PCA), which identifies directions in X that maximize only the variance within X itself, PLS explicitly incorporates information from Y to guide the projection directions, ensuring that the latent variables are predictive and relevant to the responses. This Y-aware maximization addresses scenarios where X exhibits multicollinearity but strong predictive signals toward Y, making PLS particularly suited for high-dimensional data with noisy or correlated predictors.[5][6]
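The following NumPy sketch illustrates this objective on simulated data: the weight pair maximizing \mathbf{w}^T X^T Y \mathbf{c} under unit-norm constraints is given by the leading left and right singular vectors of the cross-product matrix X^T Y, so the covariance of the resulting score pair equals the largest singular value scaled by n - 1. The data and variable names are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100, 8, 3
X = rng.standard_normal((n, p))
Y = X[:, :2] @ rng.standard_normal((2, m)) + 0.1 * rng.standard_normal((n, m))
X -= X.mean(axis=0)   # center both blocks, as PLS assumes
Y -= Y.mean(axis=0)

# The (w, c) pair maximizing w^T X^T Y c with ||w|| = ||c|| = 1 is given by
# the leading singular vectors of the cross-product matrix X^T Y.
U, s, Vt = np.linalg.svd(X.T @ Y)
w, c = U[:, 0], Vt[0, :]

t = X @ w   # first X-score vector
u = Y @ c   # first Y-score vector
print("covariance of first score pair:", t @ u / (n - 1))
print("largest singular value / (n-1):", s[0] / (n - 1))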
Mathematical Foundation
Underlying Model
Partial least squares (PLS) regression approximates a linear latent variable model for relating predictor variables to response variables, particularly in scenarios with multicollinearity or high dimensionality. The model decomposes the centered data matrices X (of dimensions n \times p, where n is the number of observations and p the number of predictors) and Y (of dimensions n \times m, where m is the number of responses) into latent components that capture the shared structure between them.[5]
The underlying equations are:
\begin{align*}
X &= T P^T + E, \\
Y &= U Q^T + F,
\end{align*}
where T and U are the score matrices (both n \times A, with A denoting the number of latent components), P is the X-loading matrix (p \times A), Q is the Y-loading matrix (m \times A), and E and F are the residual matrices capturing unexplained variation in X and Y, respectively.[5] The scores in T represent projections of the predictors onto the latent space, while those in U do the same for the responses.[8]
The scores are related through a linear approximation U \approx T B, where B is a diagonal matrix containing the correlations (or regression coefficients) between successive pairs of score vectors t_a and u_a for each component a = 1, \dots, A.[5] This relationship links the latent structures of X and Y, allowing PLS to model predictive associations.[8]
The model assumes linear relationships between the original variables and the latent components, with the score vectors in T being mutually orthogonal (so that T^T T is diagonal, or the identity if the scores are normalized) and those in U approximately orthogonal.[5] After extracting each component, the residuals E and F are deflated by subtracting the contribution of the current latent variables, enabling sequential extraction of subsequent components.[8]
In this framework, the latent variables embodied in the scores T and U capture the covariance or shared variance between X and Y, facilitating dimensionality reduction while preserving predictive information, whereas the residuals E and F represent noise or structure orthogonal to the model.[5]
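As a concrete check of this decomposition, the following sketch fits a three-component model with scikit-learn's PLSRegression (scale=False, so only centering is applied) on simulated low-rank data and verifies that the centered X is well approximated by T P^T, with mutually orthogonal score vectors; the simulated data and variable names are illustrative only.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
n, p, m, A = 60, 10, 2, 3
L = rng.standard_normal((n, A))                                  # shared latent structure
X = L @ rng.standard_normal((A, p)) + 0.05 * rng.standard_normal((n, p))
Y = L @ rng.standard_normal((A, m)) + 0.05 * rng.standard_normal((n, m))

pls = PLSRegression(n_components=A, scale=False).fit(X, Y)
T = pls.transform(X)           # score matrix T (n x A); equals the training scores
P = pls.x_loadings_            # loading matrix P (p x A)

Xc = X - X.mean(axis=0)        # the model centers X internally when scale=False
E = Xc - T @ P.T               # residual matrix of the X-decomposition
print("relative size of residual E:", np.linalg.norm(E) / np.linalg.norm(Xc))
off_diag = T.T @ T - np.diag(np.diag(T.T @ T))
print("max off-diagonal of T^T T:", np.abs(off_diag).max())      # scores are orthogonal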
Objective Function
The objective function of partial least squares (PLS) regression seeks to extract successive latent components that capture the covariance structure between the predictor matrix X and the response matrix Y. For the a-th component, the goal is to maximize the covariance between the score vectors t_a = X w_a and u_a = Y c_a, which is given by \operatorname{cov}(t_a, u_a) = w_a^T X^T Y c_a / (n-1), where n is the number of observations, subject to the normalization constraints \|w_a\| = 1 and \|c_a\| = 1.[9] This criterion identifies weight vectors w_a and c_a that align the directions of maximum variance explanation in X and Y.[9]
To extend to multiple components, PLS proceeds iteratively: after extracting the pair (t_a, u_a), the matrices are deflated to remove the contribution of this component, ensuring subsequent components capture residual covariance. Specifically, the predictor matrix is updated as X \leftarrow X - t_a p_a^T, where the loading vector p_a = X^T t_a / t_a^T t_a, and similarly for the response matrix, Y \leftarrow Y - u_a q_a^T, with q_a = Y^T u_a / u_a^T u_a.[9] This deflation process enforces orthogonality among the score vectors \{t_a\} and \{u_a\} across components, preventing redundancy in the latent space.[9]
The regression aspect of PLS combines the extracted components to form predictions. Let W = [w_1, \dots, w_A], P = [p_1, \dots, p_A], and Q = [q_1, \dots, q_A] denote the matrices of weights and loadings for A components; the coefficient matrix is then B = W (P^T W)^{-1} Q^T, yielding the predicted responses \hat{Y} = X B.[9] The number of components A is typically selected via cross-validation to balance model complexity and predictive accuracy, minimizing error on held-out data.[9]
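The following sketch reproduces this coefficient construction from the weight and loading matrices of a fitted scikit-learn PLSRegression model and confirms that it yields the same predictions as the library's predict method; the simulated data are illustrative, while x_weights_, x_loadings_, and y_loadings_ are the library's attributes for W, P, and Q.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
X = rng.standard_normal((80, 12))
Y = X @ rng.standard_normal((12, 2)) + 0.1 * rng.standard_normal((80, 2))

A = 4
pls = PLSRegression(n_components=A, scale=False).fit(X, Y)
W = pls.x_weights_     # W (p x A)
P = pls.x_loadings_    # P (p x A)
Q = pls.y_loadings_    # Q (m x A)

B = W @ np.linalg.inv(P.T @ W) @ Q.T            # B = W (P^T W)^{-1} Q^T
Y_hat = (X - X.mean(axis=0)) @ B + Y.mean(axis=0)
print(np.allclose(Y_hat, pls.predict(X)))       # True: same predictions as the library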
Computation
Algorithms Overview
Partial least squares (PLS) regression employs a variety of computational algorithms to extract latent variables that maximize the covariance between predictor and response matrices, addressing challenges like multicollinearity and high dimensionality. Iterative methods, such as the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm, offer flexibility in handling diverse data structures by sequentially deflating the matrices after each component extraction, making them suitable for exploratory analyses.[10] In contrast, direct methods based on singular value decomposition (SVD), like the SIMPLS algorithm, provide numerical stability and efficiency by solving the optimization problem through eigendecomposition of cross-covariance matrices, avoiding iterative approximations and reducing computational overhead for large datasets. These approaches differ in their balance of interpretability and speed, with iterative methods being more transparent but potentially slower, while direct methods prioritize precision in factor estimation.[11]
PLS algorithms are categorized into modes based on the response structure: PLS1 for univariate responses (single Y variable), which simplifies the deflation step by focusing on scalar projections, and PLS2 for multivariate responses (multiple Y variables), which extends the covariance maximization to joint response modeling.[12] Kernel algorithms, used in nonlinear extensions such as kernel PLS, represent the data through kernel matrices, enabling nonlinear modeling via the kernel trick without explicit feature mapping, whereas linear variants such as standard NIPALS and SIMPLS operate directly on the original matrices; both families retain the core PLS objective of predictive relevance.
Convergence in iterative algorithms like NIPALS is achieved by repeating projections until the scores and weights reach a fixed point, typically monitored via a small change threshold (e.g., 10^{-12}) or a maximum iteration limit (e.g., 200), ensuring deflation of residuals across components.[12] However, NIPALS can exhibit slow convergence in ill-conditioned data and non-uniqueness in weight vectors due to scaling and sign ambiguities, which may affect reproducibility without normalization constraints.[11] Direct SVD-based methods circumvent these issues by providing unique decompositions, though they require more memory for matrix operations.
The number of components in a PLS model is selected using criteria that balance fit and prediction. R^2 measures the proportion of explained variance in the response (or predictors) and favors models that capture cumulative variation without overfitting, while Q^2 assesses predictive relevance through cross-validation (e.g., via the predicted residual sum of squares, PRESS); Q^2 values greater than 0 indicate some predictive ability, higher values indicate better prediction, and formal criteria such as the van der Voet test can guide the choice. These metrics support practical implementation by identifying the "knee" in validation plots, prioritizing generalization over mere goodness-of-fit.
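A minimal cross-validation loop of this kind is sketched below, computing PRESS and the corresponding Q^2 = 1 - PRESS/SS_tot for an increasing number of components with scikit-learn; the simulated data are illustrative, and in practice the smallest A near the PRESS minimum (or the Q^2 plateau) would be retained.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 20))
y = X[:, :5] @ rng.standard_normal(5) + 0.5 * rng.standard_normal(100)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
ss_tot = np.sum((y - y.mean()) ** 2)

for a in range(1, 11):
    y_cv = cross_val_predict(PLSRegression(n_components=a), X, y, cv=cv).ravel()
    press = np.sum((y - y_cv) ** 2)    # predicted residual sum of squares
    q2 = 1.0 - press / ss_tot          # cross-validated R^2 (Q^2)
    print(f"A={a:2d}  PRESS={press:9.2f}  Q2={q2:.3f}")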
NIPALS Algorithm
The Nonlinear Iterative Partial Least Squares (NIPALS) algorithm is the foundational iterative method for computing PLS components, particularly suited for the general multivariate case (PLS2) where the response matrix Y has multiple columns. It extracts successive latent variables by maximizing the covariance between X-scores (t) and Y-scores (u) through iterative projections and deflations.[10]
The algorithm begins by centering and scaling the predictor matrix X (n × p) and response matrix Y (n × q). For each latent component h (up to A components), the process initializes the Y-score vector u_h (typically to a column of Y, or to a random vector), then iterates: X-weights w_h = X^T u_h / ||X^T u_h||_2 (unit norm); X-scores t_h = X w_h; Y-weights c_h = Y^T t_h / t_h^T t_h; updated Y-scores u_h = Y c_h; repeating until convergence (e.g., change in t_h below 10^{-12}). Once converged, the Y-loadings q_h = Y^T t_h / t_h^T t_h (or taken from c_h in some variants) and X-loadings p_h = X^T t_h / t_h^T t_h are computed. The residuals are then deflated: X ← X - t_h p_h^T; Y ← Y - u_h q_h^T (or t_h c_h^T in simplified forms), ensuring orthogonality of successive components.[12][10]
After A components, the regression coefficients are B = W (P^T W)^{-1} Q^T, where W and P are the matrices of weights and X-loadings and Q is the matrix of Y-loadings. Predictions are Ŷ = X B + B_0, where the intercept term B_0 is computed from the variable means to undo the centering. For a single component, the pseudocode is:
Initialize: Center and scale X and Y; select initial u (e.g., first column of Y)
Loop until convergence of t:
    w = (X^T u) / ||X^T u||_2
    t = X w
    c = (Y^T t) / (t^T t)
    u = Y c
After convergence:
    q = (Y^T t) / (t^T t)
    p = (X^T t) / (t^T t)
Deflate: X = X - t p^T; Y = Y - t c^T (or u q^T)
Store: w, t, c, p, q
Repeat for multiple components on deflated data. This method's iterations ensure covariance maximization but can be computationally intensive for large matrices.[10]
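The sketch below is one possible NumPy implementation of the pseudocode above for pre-centered (and scaled) X and Y; it follows the simplified deflation Y ← Y - t c^T, in which the Y-weights c also play the role of the Y-loadings, and is meant as an illustration rather than a production implementation.

import numpy as np

def nipals_pls2(X, Y, n_components, tol=1e-12, max_iter=200):
    # X (n x p) and Y (n x q) are assumed already centered (and scaled).
    X, Y = X.copy(), Y.copy()
    W, P, C = [], [], []
    for _ in range(n_components):
        u = Y[:, [0]]                        # initial Y-scores: first column of Y
        t_old = np.zeros((X.shape[0], 1))
        for _ in range(max_iter):
            w = X.T @ u
            w /= np.linalg.norm(w)           # X-weights, unit norm
            t = X @ w                        # X-scores
            c = Y.T @ t / (t.T @ t)          # Y-weights (here also the Y-loadings)
            u = Y @ c                        # updated Y-scores
            if np.linalg.norm(t - t_old) < tol:
                break
            t_old = t
        p = X.T @ t / (t.T @ t)              # X-loadings
        X = X - t @ p.T                      # deflate X
        Y = Y - t @ c.T                      # deflate Y (simplified form)
        W.append(w); P.append(p); C.append(c)
    W, P, C = np.hstack(W), np.hstack(P), np.hstack(C)
    return W @ np.linalg.inv(P.T @ W) @ C.T  # B, so that centered Y_hat = X_centered @ B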
PLS1 Variant
The PLS1 variant of partial least squares regression is specifically designed for scenarios where the response is univariate, consisting of a single response vector y of length n (the number of observations). This adaptation simplifies the general NIPALS algorithm by eliminating the need for iterative updates to the Y scores within each component extraction, as the covariance maximization between X and Y can be computed directly.[10] Unlike the multivariate case, PLS1 treats Y as a single column, allowing for efficient deflation and loading computations tailored to univariate prediction tasks.[10]
The algorithm begins by centering and scaling the predictor matrix X (n × p, where p is the number of variables) and the response vector y to ensure zero mean and unit variance. Initialization sets the Y scores u to y itself. For each latent component h (up to a maximum A components), the process extracts weights and scores as follows: the X weights vector w_h is computed as the normalized projection w_h = X^T u / ||X^T u||_2, ensuring ||w_h||_2 = 1; the X scores t_h are then t_h = X w_h; and the Y loading q_h is directly q_h = y^T t_h / t_h^T t_h, with no inner iteration over Y weights required because y has a single column. This direct computation of q_h leverages the univariate nature of y, avoiding the matrix operations needed for multivariate Y.[10] The X loadings p_h are p_h = X^T t_h / t_h^T t_h. Deflation then updates the residuals: X ← X - t_h p_h^T and y ← y - t_h q_h, removing the contribution of the current component while preserving orthogonality.
After extracting A components, the regression coefficients are derived from the stored matrices: let W be the p × A matrix of weights [w_1, ..., w_A], P the p × A matrix of loadings [p_1, ..., p_A], and q the A × 1 vector of Y loadings [q_1, ..., q_A]. The coefficient vector b (p × 1) is given by b = W (P^T W)^{-1} q, which provides a least-squares fit within the latent space. The predicted response is then ŷ = X b + b_0, where b_0 is the intercept computed from the means of X and y to account for centering. This formulation ensures that the model minimizes the reconstruction error for both X and y while maximizing their covariance.[10]
PLS1 is particularly common in calibration problems, such as spectroscopic analysis where a single property (e.g., concentration) is predicted from multivariate spectra, due to its computational efficiency and robustness to collinearity in high-dimensional X. For a single component (A=1), the pseudocode is:
Initialize: Center and scale X and y; u = y
w = (X^T u) / ||X^T u||_2
t = X w
q = (y^T t) / (t^T t)
p = (X^T t) / (t^T t)
Deflate: X = X - t p^T; y = y - t q
Store: w, t, p, q
To extend to multiple components, repeat the process A times on the deflated residuals, accumulating W, P, and q, then compute b as above. This iterative deflation ensures successive components capture orthogonal variance, with the number A often selected via cross-validation to balance fit and prediction error.
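A compact NumPy sketch of this multi-component PLS1 procedure is given below for pre-centered (and scaled) X and y; the function name and data handling are illustrative, not a reference implementation.

import numpy as np

def pls1(X, y, n_components):
    # X (n x p) and y (n,) are assumed already centered (and scaled).
    X, y = X.copy(), np.asarray(y, dtype=float).copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)             # X-weights, unit norm; no inner iteration needed
        t = X @ w                          # X-scores
        tt = t @ t
        q_a = (y @ t) / tt                 # scalar Y-loading
        p = X.T @ t / tt                   # X-loadings
        X = X - np.outer(t, p)             # deflate X
        y = y - q_a * t                    # deflate y
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    return W @ np.linalg.inv(P.T @ W) @ q  # coefficient vector b (p,)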
Variants and Extensions
Orthogonal PLS (OPLS)
Orthogonal partial least squares (OPLS) is a variant of PLS that partitions the systematic variation in the predictor matrix X into predictive variation (correlated with the response Y) and orthogonal variation (uncorrelated with Y). This separation enhances model interpretability by isolating noise and irrelevant systematic variation into orthogonal components, improving diagnostics such as R² and Q² metrics without affecting predictive performance. OPLS is particularly useful in applications like spectroscopy and metabolomics, where distinguishing relevant signals from confounding factors is crucial.[13]
The OPLS algorithm extends the standard PLS decomposition by incorporating orthogonal signal correction (OSC) or direct orthogonalization in the latent variable extraction. In the NIPALS-based implementation, after computing the predictive weights and scores as in PLS, an orthogonal filter removes the uncorrelated part from the X residuals before proceeding to the next component. This results in a model where each predictive component maximizes covariance with Y, while orthogonal components capture structured variation in X that is uncorrelated with Y. The method was introduced by Johan Trygg and Svante Wold in 2002.[13]
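A minimal sketch of the orthogonal filtering step for a single centered response y is shown below; it removes one Y-orthogonal component from a centered X, and real OPLS implementations repeat this step and then fit an ordinary PLS model on the filtered matrix. The function name and details such as the normalization choices are illustrative assumptions.

import numpy as np

def remove_orthogonal_component(X, y):
    # X (n x p) and y (n,) are assumed centered.
    w = X.T @ y
    w /= np.linalg.norm(w)                   # predictive weight direction
    t = X @ w
    p = X.T @ t / (t @ t)                    # X-loading of the predictive component
    w_orth = p - (w @ p) * w                 # part of the loading orthogonal to w
    w_orth /= np.linalg.norm(w_orth)
    t_orth = X @ w_orth                      # orthogonal scores: zero covariance with y
    p_orth = X.T @ t_orth / (t_orth @ t_orth)
    return X - np.outer(t_orth, p_orth)      # X with this Y-orthogonal variation removed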
Regularized PLS (L-PLS)
Regularized partial least squares (L-PLS) extends the standard PLS framework by incorporating L1 (Lasso-like) regularization to promote sparsity in the model coefficients, making it particularly suitable for high-dimensional datasets where variable selection is crucial. This regularization is applied to the weights or loadings within the PLS latent structure, addressing issues like multicollinearity and overfitting that are common in scenarios with many predictors relative to observations. The approach balances prediction accuracy with interpretability by shrinking irrelevant coefficients to zero, thereby identifying key features that drive the relationship between predictors X and responses Y.[14]
The core formulation of L-PLS modifies the PLS objective to include an L1 penalty on the regression coefficients B, formulated as minimizing \|Y - X B\|^2 + \lambda \|B\|_1 within the latent variable framework of PLS, where \lambda > 0 controls the sparsity level. This penalty encourages many elements of B to be exactly zero, facilitating automatic variable selection while preserving the deflation and iterative component extraction characteristic of PLS. Unlike standard PLS, which uses all variables in each latent component, L-PLS enforces sparsity directly on the weight vectors, leading to sparser linear combinations of predictors.[14][15]
The algorithm for L-PLS adapts the NIPALS procedure through iterative soft-thresholding applied to the weight updates during each component extraction step. Specifically, for the (a+1)-th component, the weight vector is computed as
\mathbf{w}_{a+1} = S_\lambda\!\left( \frac{\mathbf{X}^T \mathbf{u}_a}{\|\mathbf{X}^T \mathbf{u}_a\|} \right),
where \mathbf{u}_a is the score vector from the previous step and the soft-thresholding operator S_\lambda(z) = \operatorname{sign}(z) \max(|z| - \lambda, 0) is applied elementwise. The thresholding acts on the covariance direction before the weight vector is re-normalized to unit length, ensuring sparsity while maintaining the deflation of residuals in subsequent iterations. The process repeats for a specified number of components, with \lambda typically tuned via cross-validation.[14]
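The soft-thresholding step can be written compactly as below; this is a generic sketch of the operator and of a sparse weight update of the kind described above (the function names are illustrative), with the surviving weights re-normalized to unit length.

import numpy as np

def soft_threshold(z, lam):
    # Elementwise S_lambda(z) = sign(z) * max(|z| - lambda, 0)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_weight(X, u, lam):
    z = X.T @ u                              # covariance direction X^T u
    w = soft_threshold(z / np.linalg.norm(z), lam)
    nrm = np.linalg.norm(w)
    return w / nrm if nrm > 0 else w         # re-normalize the surviving weights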
L-PLS induces sparsity in the latent components, which propagates to the final regression coefficients, making it advantageous for feature selection in high-dimensional settings such as genomics, where thousands of genes must be sifted to identify those associated with phenotypes. By embedding regularization within the NIPALS-like iterations, it avoids the need for post-hoc thresholding and improves model stability and interpretability compared to unregularized PLS. This method was introduced in the late 2000s, with foundational work appearing around 2009.[14][15]
Advanced Extensions
One advanced extension of partial least squares (PLS) regression is the three-pass regression filter (3PRF), introduced in 2015 as a method for forecasting with a large number of predictors. The 3PRF operates as a constrained least squares estimator that directly identifies relevant forecast factors from high-dimensional data, reducing to standard PLS as a special case when the factors are orthogonal. This approach enhances efficiency in multi-block or high-dimensional settings by avoiding iterative deflation steps common in traditional PLS algorithms, making it suitable for applications involving numerous variables such as economic forecasting.[16]
Another direct computational variant is PLS-SVD, which computes PLS components non-iteratively via a singular value decomposition (SVD) of the cross-covariance matrix \mathbf{X}^T \mathbf{Y} = \mathbf{U} \mathbf{S} \mathbf{V}^T. The first pair of weight vectors is given by the leading singular vectors, \mathbf{w} = \mathbf{u}_1 (the first column of \mathbf{U}) for the predictors and \mathbf{c} = \mathbf{v}_1 (the first column of \mathbf{V}) for the responses, followed by normalization and projection to obtain scores and loadings. This method is particularly advantageous for large-scale datasets, as it leverages efficient SVD implementations to bypass convergence issues in iterative algorithms like NIPALS, thereby improving computational speed without sacrificing the maximization of covariance between latent variables.
PLS correlation, also known as partial least squares correlation (PLSC), modifies the objective to maximize the correlation between latent variable scores rather than their covariance, assuming standardized input variables. This variant projects both predictor and response blocks onto common latent spaces where the correlation is optimized, making it robust to scale differences and suitable for exploratory analyses of symmetric relationships. In practice, PLSC extracts successive pairs of singular vectors from the correlation matrix of standardized \mathbf{X} and \mathbf{Y}, providing a framework for identifying shared patterns without assuming a directional predictor-response distinction.
These extensions find niche applications in complex data environments; for instance, 3PRF has been applied in neuroimaging to handle multivariate brain imaging data with many features, enabling robust factor extraction for predictive modeling. Similarly, PLS-SVD facilitates efficient analysis of large-scale datasets, such as genomic or spectroscopic arrays, where iterative methods would be prohibitive due to computational demands.[16][17]
History and Development
Origins and Key Contributors
Partial least squares (PLS) regression originated in the mid-1960s as a response to challenges in handling multicollinearity within econometric path models and principal component estimation. Herman O. A. Wold, a Swedish statistician, introduced key precursors through iterative least squares procedures, initially termed nonlinear iterative least squares (NILES), which evolved into the nonlinear iterative partial least squares (NIPALS) algorithm. This foundational work appeared in Wold's 1966 chapter, where he described methods for estimating principal components and related models using iterative techniques, laying the groundwork for PLS in functional modeling contexts within econometrics.[18]
By the mid-1970s, Wold refined these ideas for more complex systems involving latent variables, transitioning toward what would become known as "functional PLS." In his 1975 publication, he presented the NIPALS approach for path models with latent variables, emphasizing soft modeling techniques that prioritized predictive utility over strict distributional assumptions, particularly to address multicollinearity in regression-based path analyses. This work marked a pivotal shift, introducing concepts like Mode A (centroid scheme for functional relations) in subsequent elaborations, and was published in a volume on quantitative sociology. Herman Wold's contributions established PLS as a flexible tool for econometric applications, with the NIPALS precursor enabling iterative deflation of covariance structures.[19]
The evolution to "inverse PLS" occurred in the late 1970s and early 1980s, adapting the method for predictive modeling in natural sciences, particularly chemometrics. Svante Wold, Herman's son and a chemist, played a crucial role in this transition, applying and modifying PLS for handling high-dimensional data with collinearity, such as in spectroscopy. In 1983, Svante Wold, along with Harald Martens and Herman Wold, demonstrated PLS's practical utility in a seminal paper on multivariate calibration problems in analytical chemistry, effectively bridging theoretical econometrics to spectroscopic applications by inverting the modeling focus from path structures to direct prediction of responses from predictors.[20]
Milestones and Evolution
During the 1990s, the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm for PLS regression achieved greater standardization within the chemometrics community, facilitating its routine application in multivariate data analysis through integration into established software platforms like SIMCA, which enhanced accessibility for spectroscopic and process modeling tasks.[5] This period marked a shift toward computational refinement rather than theoretical innovation, solidifying PLS as a practical tool for handling collinear data in industrial settings.[18]
The 2000s brought key theoretical advancements, including the introduction of Orthogonal PLS (OPLS) in 2002 by Trygg and Wold, which partitioned variance into predictive and orthogonal components to improve model interpretability and discrimination in complex datasets. Concurrently, kernel PLS emerged around 2001, extending the method to nonlinear relationships by mapping data into higher-dimensional reproducing kernel Hilbert spaces, thus bridging PLS with kernel-based machine learning techniques.[21]
In the 2010s, focus turned to handling high-dimensional and noisy data, with sparse variants like sparse PLS (SPLS) proposed in 2010 by Chun and Keleş, incorporating L1 penalization for simultaneous dimension reduction and variable selection in genomics and beyond.[22] Robust PLS methods also gained traction, addressing outliers through modifications like partial robust M-regression, as detailed in works from 2010 onward.[23] A notable refinement was the three-pass regression filter (3PRF) in 2015 by Kelly and Pruitt, which generalized PLS for forecasting with many predictors while maintaining constrained least-squares efficiency.[16]
Post-2020 developments have integrated PLS with modern computational paradigms, such as neural PLS hybrids exemplified by deep partial least squares models in 2023, combining latent variable extraction with deep learning for instrumental variable regression in causal inference.[24] Bayesian PLS approaches have also advanced, with Urbas et al. (2024) introducing a probabilistic framework that quantifies parameter uncertainty in predictions, particularly useful for spectral data analysis.[25] In 2025, further extensions include PLS models accommodating skew-normal and skew-t distributions for handling asymmetric data, and hybrid approaches combining PLS with machine learning for enhanced predictive performance in complex datasets.[26][27]
Overall, PLS has evolved from its econometric origins, through its adaptation in chemometrics by Svante Wold and others, to a versatile component of machine learning pipelines, as evidenced by exponential citation growth post-2000 reflecting its adoption across bioinformatics, econometrics, and predictive modeling.[28][29]
Applications
In Chemometrics and Spectroscopy
Partial least squares regression (PLSR) finds primary application in chemometrics and spectroscopy for quantitative predictions from spectral data, particularly in near-infrared (NIR) spectroscopy where it models complex relationships between absorbance spectra and chemical properties. A classic example is the prediction of gasoline octane number, where PLSR calibrates NIR spectra (typically in the 660-1215 nm range) against reference octane values, achieving accurate forecasts despite overlapping spectral features from hydrocarbon mixtures. This approach has been instrumental in rapid, non-destructive quality control in fuel analysis.[30][31]
In quantitative analysis of multicomponent mixtures, PLSR excels by calibrating predictor matrices X (spectral intensities across wavelengths) to response matrices Y (component concentrations), effectively resolving overlapping signals in techniques like Fourier-transform infrared (FTIR) or NIR spectroscopy. The method's strength lies in handling severe multicollinearity among wavelengths—common in spectroscopic data due to correlated absorbances—by projecting data into a lower-dimensional latent space that maximizes covariance between X and Y, outperforming univariate or multiple linear regression in noisy, collinear environments. Validation routinely employs RMSEP from cross-validation or independent test sets to assess predictive reliability, ensuring model robustness against spectral variations like instrument drift or sample heterogeneity.[32]
A prominent case study involves pharmaceutical tablet composition analysis, where PLSR integrated with NIR spectroscopy has been applied since the 1980s to quantify active pharmaceutical ingredients (APIs) and excipients non-destructively. Early implementations in the late 1980s and 1990s focused on uniformity testing for tablets, using PLSR to correlate diffuse reflectance spectra with API content, achieving RMSEP below 2% w/w for low-dose formulations. This evolved into process analytical technology (PAT) frameworks, where PLSR models are combined with design of experiments (DOE) such as factorial or response surface designs to optimize calibration sets, enhancing model transferability across manufacturing batches and reducing validation time. For example, in continuous tablet manufacturing, DOE-guided PLSR controls pressing pressure via real-time NIR feedback, maintaining API uniformity with prediction errors under 1.5%.[33][34][35]
In Other Fields
In bioinformatics, partial least squares (PLS) regression has been widely applied to high-dimensional gene expression data from microarrays for disease classification tasks. A seminal approach uses PLS to reduce the dimensionality of thousands of gene features while predicting tumor types, such as distinguishing between leukemia subtypes or breast cancer classes, by extracting latent variables that maximize covariance between predictors and response outcomes. This method outperforms traditional classifiers in scenarios with multicollinear genes and small sample sizes, and has been applied to benchmark datasets like the Golub leukemia data.[36] For instance, discriminant PLS variants have successfully classified multi-class tumors from colon and prostate cancer microarray profiles, highlighting its utility in identifying biologically relevant gene patterns for personalized medicine.[37]
In the social sciences, PLS regression, particularly through partial least squares structural equation modeling (PLS-SEM), serves as a robust tool for psychometrics and factor analysis of survey data with latent structures. Unlike covariance-based SEM, PLS-SEM estimates composite models by iteratively minimizing residuals in path relationships, making it suitable for exploratory analysis of complex behavioral constructs like attitudes or satisfaction from questionnaire responses. It has been employed to model latent variables in psychometric scales, such as validating multi-item surveys for psychological traits, where it handles non-normal data and smaller samples effectively compared to maximum likelihood methods.[38] This approach equates to factor-score regression in traditional psychometrics, enabling the assessment of measurement invariance and predictive validity in fields like education and sociology.[39]
An illustrative application in finance involves using PLS for stock market forecasting from multivariate time series data. By projecting numerous economic indicators onto latent components correlated with future returns, PLS has demonstrated superior out-of-sample predictability over univariate models.[40] This supervised dimension reduction captures dependencies in financial variables, aiding portfolio optimization.
Emerging uses of PLS in neuroimaging, such as functional magnetic resonance imaging (fMRI), leverage its ability to perform supervised dimensionality reduction on spatiotemporal brain data. PLS identifies task-related patterns by maximizing covariance between voxel activations and behavioral outcomes, enabling discrimination between healthy and diseased states like Alzheimer's, with reported cross-validated sensitivities up to 84.6%.[41] Recent extensions, including oriented PLS, enhance this by partitioning variance into task-relevant components, outperforming principal component analysis (PCA) in supervised settings due to its focus on predictive relevance rather than mere data variance.[42] These advances, building on early applications in event-related potentials, support PLS as a key method for uncovering brain-behavior relationships in high-dimensional neuroimaging studies, with ongoing developments as of 2023 integrating PLS with deep learning for improved accuracy in genomics and spectroscopy applications.[43][44]
Comparison with Related Methods
To Principal Component Regression
Principal component regression (PCR) applies principal component analysis (PCA) solely to the predictor matrix X to derive orthogonal components that capture the maximum variance in X, after which the response Y is regressed onto these components (scores) using ordinary least squares, effectively ignoring Y during the decomposition process.[8] In contrast, partial least squares (PLS) regression constructs latent components by simultaneously considering both X and Y, weighting the directions in X by their relevance to Y through maximization of the covariance between the component scores of X and Y.[8] This key distinction arises because PCR prioritizes directions of highest variance within X, potentially including noise irrelevant to Y, whereas PLS emphasizes components that explain variation in Y, making it particularly advantageous for predictive modeling when the response drives the relevant structure in the predictors.[45]
In scenarios involving multicollinearity among predictors, PLS typically outperforms PCR in prediction accuracy, as demonstrated in simulation studies where PLS achieves lower mean squared error (MSE) using fewer components than PCR.[45] For instance, on datasets with high collinearity, such as near-infrared spectra, PLS often achieves comparable or better prediction performance with fewer components than PCR.[46]
The regression coefficients highlight this integration of Y: in PCR, the coefficient vector is given by B = V (V^T X^T X V)^{-1} V^T X^T Y, where V comprises the loadings from PCA on X alone, reflecting only X's structure.[45] By comparison, PLS coefficients incorporate Y-integrated weights W, derived to maximize X-Y covariance, resulting in B = W (P^T W)^{-1} Q^T (or equivalent forms), where P and Q are the X- and Y-loadings, so that the components are attuned to Y.[8]
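This difference can be seen in a small simulation where the response depends on a low-variance direction of X: with few components, PCR (which picks directions by X-variance alone) tends to miss it, whereas PLS, being guided by Y, captures it earlier. The sketch below uses scikit-learn with simulated, illustrative data.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
n, p = 200, 30
g1 = rng.standard_normal(n)                      # dominant latent factor of X
g2 = rng.standard_normal(n)                      # minor latent factor of X
X = (np.outer(g1, rng.standard_normal(p)) * 5.0
     + np.outer(g2, rng.standard_normal(p)) * 1.0
     + 0.1 * rng.standard_normal((n, p)))
y = g2 + 0.1 * rng.standard_normal(n)            # response driven by the minor factor

for a in (1, 2, 3):
    pcr = make_pipeline(PCA(n_components=a), LinearRegression())
    r2_pcr = cross_val_score(pcr, X, y, cv=5).mean()
    r2_pls = cross_val_score(PLSRegression(n_components=a), X, y, cv=5).mean()
    print(f"A={a}: PCR R2={r2_pcr:.3f}  PLS R2={r2_pls:.3f}")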
To Ridge Regression and Lasso
Ridge regression addresses multicollinearity in ordinary least squares (OLS) by incorporating an L2 penalty term, which shrinks the coefficients toward zero without setting any to exactly zero, thereby stabilizing estimates but retaining all predictors without explicit dimension reduction. In contrast, Lasso regression employs an L1 penalty to induce sparsity by driving some coefficients to exactly zero, enabling automatic variable selection; however, it can be unstable in the presence of highly correlated predictors, often selecting only one from a group of correlated variables.
Partial least squares (PLS) regression differs fundamentally by constructing latent variables that maximize covariance between predictors and the response, integrating dimension reduction with predictive modeling to handle both multicollinearity and high-dimensional settings (high p/n ratios) more effectively than Ridge, which focuses solely on penalization without reducing variables. Unlike Lasso, PLS preserves information from correlated predictors by incorporating response-guided components, avoiding the instability of sparse selection in such scenarios.[45]
Simulation studies have shown that PLS can yield better predictive performance than Ridge regression in multicollinear settings, often requiring fewer effective parameters for similar MSE.[45] In chemometric applications with correlated spectroscopic data, such as laser-induced breakdown spectroscopy, PLS has shown competitive MSE performance compared to both Ridge and Lasso.[47]
Implementations
Software Packages
Partial least squares (PLS) regression is implemented in various open-source and commercial software packages, enabling its application across statistical computing environments. In the R programming language, the pls package provides core functionality for PLS regression, utilizing the NIPALS algorithm as the default method while also supporting alternatives like kernel PLS and SIMPLS, with specific handling for univariate (PLS1) and multivariate (PLS2) response variables.[48][49] Complementing this, the mixOmics package extends PLS to multi-block data integration and sparse variants, such as sparse PLS (sPLS) for variable selection via lasso penalization and multi-block PLS for analyzing multiple datasets measured on the same samples.[50][51]
In Python, the scikit-learn library offers the PLSRegression class within its cross_decomposition module, implementing PLS for regression tasks with support for multiple targets (PLS1/PLS2) and an option for SVD-based computation through the related PLSSVD transformer, which performs singular value decomposition on the cross-covariance matrix.[52][53] For advanced extensions, the pypls package specializes in PLS-DA and OPLS-DA, tailored for high-dimensional data in fields like metabolomics and mass spectrometry analysis.[54]
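A typical scikit-learn workflow is sketched below on simulated data: fit PLSRegression (which centers and, by default, scales internally), predict held-out responses, and report the RMSEP used for validation; the dataset and split are illustrative.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.standard_normal((120, 15))
Y = X[:, :4] @ rng.standard_normal((4, 2)) + 0.1 * rng.standard_normal((120, 2))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)
pls = PLSRegression(n_components=3).fit(X_tr, Y_tr)

Y_pred = pls.predict(X_te)
rmsep = np.sqrt(np.mean((Y_te - Y_pred) ** 2, axis=0))   # per-response RMSEP
print("RMSEP per response:", rmsep)
print("latent scores of the test set:", pls.transform(X_te).shape)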
Commercial tools include SIMCA from Sartorius (formerly Umetrics), which provides comprehensive PLS and orthogonal PLS (OPLS) modeling for multivariate data analysis, including tools for model building, validation, and prediction in process industries.[55] Similarly, Unscrambler by CAMO Software focuses on spectroscopy applications, supporting PLS regression alongside PCA for calibration and predictive modeling in near-infrared and other spectral data contexts.[56]
For MATLAB, the Statistics and Machine Learning Toolbox includes the plsregress function for PLS regression, computing predictor and response loadings with options for cross-validation and handling multicollinear data.[57] A prominent third-party extension is PLS_Toolbox by Eigenvector Research, offering an extensive suite of chemometric tools including advanced PLS variants (e.g., OPLS, SIMPLS) and machine learning integrations for multivariate analysis within MATLAB; as of November 2025, it supports MATLAB versions up to R2024b but is incompatible with R2025a due to Java deprecation.[58][59]
Practical Usage Guidelines
In partial least squares (PLS) regression, preprocessing the data is a foundational step to ensure model reliability and interpretability. Both the predictor matrix \mathbf{X} and response matrix \mathbf{Y} should be centered by subtracting their column means and scaled by dividing by their standard deviations, transforming them into z-scores; this equalizes the influence of variables with different units or scales, preventing dominance by high-variance predictors and aiding in the identification of relevant latent structures. Outliers must be addressed early, as they can distort component loadings and inflate prediction errors—techniques such as robust scalers (e.g., using medians and interquartile ranges) or diagnostic plots like Hotelling's T^2 and distance-to-model plots are recommended to detect and either remove or downweight them before fitting the model.[9][60][61]
Selecting the optimal number of latent components A requires careful validation to balance explanatory power and generalizability. Cross-validation, particularly leave-one-out or k-fold variants, is the standard method: fit models with increasing A and choose the value that minimizes the predicted residual sum of squares (PRESS), which quantifies prediction error on held-out data; this guards against overfitting, where excessive components capture noise rather than signal. As a rule, stop adding components once PRESS begins to increase or when the reduction in error plateaus, ensuring the model retains parsimony without sacrificing predictive accuracy.[62][12][63]
Model validation emphasizes metrics that assess both fit and predictive utility. The cross-validated coefficient of determination, Q^2, measures out-of-sample performance, with values exceeding 0.05 indicating at least minimal predictive relevance, though Q^2 > 0.5 is desirable for robust models; negative Q^2 signals poor generalization and warrants model revision. Variable importance in projection (VIP) scores provide insight into predictor relevance, calculated as the weighted sum of squared loadings across components—variables with VIP > 1 are deemed influential and retained, while those below 0.8 can often be excluded to simplify the model without substantial loss in performance.[9][64][65]
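VIP scores can be computed from a fitted model's weights, scores, and Y-loadings; the sketch below implements one common formulation for a scikit-learn PLSRegression fitted on training data X (the helper name is illustrative).

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls, X):
    # One common VIP formulation; X is the training matrix the model was fitted on.
    T = pls.transform(X)                     # scores (n x A)
    W = pls.x_weights_                       # weights (p x A)
    Q = pls.y_loadings_                      # Y-loadings (m x A)
    p, A = W.shape
    ss = np.sum(Q ** 2, axis=0) * np.sum(T ** 2, axis=0)   # Y sum of squares per component
    w_norm = W / np.linalg.norm(W, axis=0)                  # column-normalized weights
    return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())       # VIP for each predictor

# Predictors with VIP > 1 are usually treated as influential (see above).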
Practical tips enhance PLS application efficiency. For univariate responses (single \mathbf{Y}), begin with PLS1, which optimizes a single set of weights for the response and often yields simpler, more stable models than PLS2 for multivariate \mathbf{Y}; PLS2 is preferable only when responses are highly correlated. For improved interpretability, especially in noisy datasets, orthogonal PLS (OPLS) variants separate predictive and orthogonal variation, clarifying which predictors directly relate to \mathbf{Y} versus unrelated noise. Recent guidelines from 2023 emphasize integrating PLS with deep learning for nonlinear extensions, such as deep PLS layers that learn hierarchical latent representations while preserving interpretability, particularly useful in high-dimensional settings like instrumental variable regression.[50][66][67]