Partial least squares path modeling
Partial least squares path modeling (PLS-PM), also referred to as partial least squares structural equation modeling (PLS-SEM), is a variance-based multivariate statistical technique used to estimate and test causal relationships in structural equation models by maximizing the explained variance of endogenous latent variables.[1] Developed as a "soft modeling" approach, it integrates path analysis with latent variable modeling to handle complex predictive models, particularly in scenarios involving small sample sizes, non-normal data distributions, and both reflective and formative measurement constructs.[2] The method originated from the work of Swedish statistician Herman Wold in the mid-1960s, with key foundational contributions in his 1975 and 1982 publications that extended principal component analysis to path models with latent variables.[3] Wold's NIPALS (nonlinear iterative partial least squares) algorithm formed the basis for estimating model parameters through iterative ordinary least squares regressions, emphasizing prediction over strict parameter accuracy.[4] Subsequent advancements in the 1980s and 1990s, including software developments like Lohmöller's LVPLS (1987), broadened its application in social sciences, marketing, and management research.[1] Unlike covariance-based structural equation modeling (CB-SEM), which focuses on reproducing the covariance matrix to assess overall model fit and assumes multivariate normality, PLS-PM employs a component-based approach using surrogate composites for latent variables, making it more flexible and robust to data violations.[1] This nonparametric method avoids global goodness-of-fit measures, instead prioritizing metrics like R² for structural paths and loadings or composite reliability for measurement models, which supports its use in exploratory theory building and complex models with many indicators or constructs.[2] PLS-PM's advantages include higher statistical power for detecting relationships in 
smaller samples (often as few as 10 times the number of paths), tolerance for multicollinearity, and the ability to incorporate both types of measurement models without convergence issues common in CB-SEM.[1] Recent methodological enhancements, such as consistent PLS (PLSc) for improved parameter consistency and bootstrapping for significance testing, have addressed earlier criticisms regarding bias in factor model estimation.[5] Widely applied in fields like business, psychology, and information systems, PLS-PM facilitates the analysis of mediation, moderation, and multigroup comparisons, with popular software tools including SmartPLS and ADANCO enabling user-friendly implementation.[6] Its growing adoption, evidenced by thousands of annual publications since the 2010s, underscores its role in predictive analytics and theory development where traditional assumptions cannot be met.[1]
Introduction
Definition and Purpose
Partial least squares path modeling (PLS-PM), also known as partial least squares structural equation modeling (PLS-SEM), is a variance-based statistical technique for estimating causal path models that involve latent constructs measured by multiple indicators.[7] It integrates principal component analysis to approximate latent variables and multiple regression to model their interrelationships, enabling the analysis of complex dependencies among observed and unobserved variables.[8] This approach differs from covariance-based SEM by prioritizing the maximization of explained variance in a predictive context rather than exact covariance reproduction.[5] The primary purpose of PLS-PM is to support prediction-oriented research, particularly in exploratory settings where theory development is ongoing or data limitations exist.[7] It is well-suited for handling small sample sizes (as few as 100 cases for complex models), non-normal data distributions, and intricate models featuring many latent constructs or indicators per construct, making it popular in fields like marketing, management, and social sciences.[7] By focusing on predictive accuracy and model simplicity, PLS-PM facilitates theory testing and extension without stringent parametric assumptions.[5] At its core, PLS-PM comprises two interconnected components: the structural model, which defines the directional relationships (paths) among latent variables to capture theoretical constructs' causal influences, and the measurement model, which specifies how observed indicators relate to their respective latent variables, either reflectively (indicators as effects of the construct) or formatively (construct as a composite of indicators).[8] PLS-PM originated within the soft modeling paradigm pioneered by Herman Wold in the 1970s and 1980s, which emphasizes flexible, iterative estimation for exploratory and predictive goals over rigid confirmatory analysis under strict distributional requirements.[8] This paradigm 
contrasts with traditional "hard" modeling by accommodating uncertainty in model specification and data quality.[7]
Historical Development
The foundations of partial least squares path modeling (PLS-PM) were laid by Swedish statistician Herman Wold in the 1960s, who developed the Non-linear Iterative Partial Least Squares (NIPALS) algorithm for factor analysis and regression tasks, providing an iterative approach to handle multicollinearity and latent structures in data.[9] In the 1970s, Wold extended these methods toward "soft modeling," a flexible framework for estimating path models with latent variables, notably in his 1975 paper on soft modeling using the NIPALS approach, and applied it to econometric systems as an alternative to rigid covariance-based techniques.[3] This period also saw parallel developments by Karl G. Jöreskog in the LISREL approach for structural equation modeling, highlighting the growing interest in latent variable methods during the decade.[10]
The formalization of PLS-PM occurred in 1981 with Jan-Bernd Lohmöller's dissertation, Latent Variable Path Modeling with Partial Least Squares, which outlined the first complete algorithm for estimating complex path models and implemented it in the LVPLS software, enabling practical computation of PLS-based estimations.[11] In the same year, Fornell and Larcker's seminal work on evaluating structural equation models with unobservable variables and measurement error popularized PLS-PM in marketing research by proposing key assessment criteria, such as composite reliability and average variance extracted, which became standards for model validation.[12]
The 2000s marked a resurgence in PLS-PM's adoption, driven by the release of SmartPLS software in 2005 by Ringle, Wende, and Will, which offered an accessible graphical interface for variance-based structural equation modeling and broadened its use across disciplines like business and social sciences.[13] This resurgence built on Wold's earlier works on systems analysis under indirect observation (1982) and on partial least squares in encyclopedic entries (1985), which had refined the theoretical underpinnings of soft modeling for predictive applications.[14][15] Additionally, Tenenhaus et al. (2005) provided an influential overview of PLS path modeling that emphasized its role in predictive modeling and discussed goodness-of-fit indices such as the GoF metric; the label PLS structural equation modeling (PLS-SEM) became widespread in the subsequent literature.[9]
More recent advancements addressed PLS-PM's inconsistencies in parameter recovery, with Dijkstra and Henseler introducing consistent PLS (PLSc) in 2015, an adjustment that yields consistent and asymptotically normal estimates of path coefficients in reflective models, building on earlier critiques of bias in traditional PLS.[16] Since 2015, PLS-SEM has continued to evolve with advancements including strategies for handling missing data, holistic exploratory-confirmatory frameworks, and applications in AI-driven business research, as documented in recent literature (Hair et al., 2024; Special Issue on Advanced PLS-SEM, 2024).[17][18]
Conceptual Framework
Latent Variables and Path Models
In partial least squares path modeling (PLS-PM), latent variables represent unobserved theoretical constructs that are inferred from a set of observed indicators, capturing abstract concepts such as attitudes, intentions, or complex behaviors that cannot be directly measured.[1] These variables serve as the core building blocks of the model, allowing researchers to model relationships among multifaceted phenomena in fields like marketing, psychology, and management.[19] Latent variables are classified into two types based on their role in the model: exogenous latent variables, denoted by ξ, which function as independent predictors and are not explained by other constructs within the model; and endogenous latent variables, denoted by η, which act as dependent outcomes influenced by exogenous variables or other endogenous constructs.[19] Exogenous variables typically appear on the left side of the model with outgoing arrows, while endogenous variables receive incoming arrows and may include error terms to account for unexplained variance.[1] This distinction enables the modeling of causal chains, where exogenous variables drive endogenous ones, facilitating the analysis of predictive relationships without requiring strict distributional assumptions.[19] Path models in PLS-PM are visualized as directed graphs that illustrate the hypothesized causal pathways among latent variables, providing a structural representation of the theory under investigation.[1] The inner model, or structural model, defines the relationships between latent variables through path coefficients (e.g., γ for paths from ξ to η), emphasizing prediction by maximizing the explained variance in endogenous variables.[19] Complementing this, the outer model, or measurement model, connects each latent variable to its observed indicators, forming the basis for empirical approximation of the constructs (detailed further in reflective and formative specifications).[1] Standard conventions in PLS-PM path 
diagrams use circles or ovals to depict latent variables (ξ and η), rectangles for observed indicators, and single-headed arrows to indicate directional relationships, with paths generally flowing from left to right to reflect the progression from predictors to outcomes.[19] Double-headed arrows may occasionally denote correlations between exogenous variables, though they are less common in predictive-oriented PLS-PM designs.[1]
A simple example of a PLS-PM structure involves two latent variables: an exogenous variable ξ₁ (e.g., product quality) linked to indicators x₁ and x₂, which predicts an endogenous variable η₁ (e.g., customer satisfaction) connected to indicators y₁ and y₂. In the path diagram, this appears as a circle for ξ₁ with arrows to x₁ and x₂ (outer model links), a circle for η₁ with arrows to y₁ and y₂, and a single-headed arrow from ξ₁ to η₁ (inner model path), illustrating a basic causal hypothesis without mediation or moderation.[1]
Reflective vs. Formative Measurement Models
In partial least squares path modeling (PLS-PM), measurement models link latent variables to their observed indicators and are specified as either reflective or formative based on the underlying theoretical direction of causality. Reflective models treat the latent variable as the exogenous cause influencing its indicators, which serve as error-prone manifestations or effects of the construct. This specification assumes that indicators are interchangeable, share a common theme, and exhibit positive correlations due to their common cause, allowing the omission of any single indicator without fundamentally altering the construct's meaning.[20][21]
In PLS-PM, reflective measurement models are estimated using Mode A, a correlation-based approach that derives outer weights through simple linear regressions of each indicator on the latent variable score to minimize residual variance in the indicators. The estimation equation for the latent variable score in Mode A is typically $\mathbf{Y}_j = \sum_k w_{jk} \mathbf{X}_{jk}$, where the weights are obtained via $w_{jk} = (\mathbf{Y}_j^T \mathbf{Y}_j)^{-1} \mathbf{Y}_j^T \mathbf{X}_{jk}$, assuming linear relationships, uncorrelated errors, and high internal consistency among indicators. This mode aligns with exploratory research contexts where confirming the unidimensionality and reliability of indicators is paramount.[21]
Formative measurement models reverse the causality, positing that the indicators are exogenous causes that collectively form the latent variable, with no error term assumed at the construct level. Indicators in formative models are not interchangeable (omitting one can change the construct's conceptual domain), and they may lack correlation, though multicollinearity poses a risk that can bias weights and inflate variances.
PLS-PM estimates formative models via Mode B, using multiple regression to compute outer weights, as in $\mathbf{Y}_j = \mathbf{X}_j \boldsymbol{\beta}_j + \epsilon_j$, where the weights are $\boldsymbol{\beta}_j = (\mathbf{X}_j^T \mathbf{X}_j)^{-1} \mathbf{X}_j^T \mathbf{Y}_j$, emphasizing the indicators' unique contributions to the construct.[21][22]
The selection of reflective versus formative models hinges on theoretical criteria, including the direction of causality (construct-to-indicator for reflective, indicator-to-construct for formative), indicator interchangeability, and implications for the construct's antecedents and consequences. Reflective specifications suit effect indicators like attitudes toward a brand or customer satisfaction, where multiple items reflect an underlying psychological state. Formative specifications are appropriate for causal indicators, such as socioeconomic status (formed by income, education, and occupation) or marketing constructs like price consciousness (driven by sensitivity to discounts and price comparisons). Hybrid models, combining reflective and formative elements (e.g., via repeated indicators in hierarchical structures), are feasible in PLS-PM to capture complex relationships.[20][22]
Reflective models facilitate straightforward assessment of convergent and discriminant validity, making them ideal for theory-building in exploratory studies, whereas formative models provide flexibility for modeling diverse causal inputs but demand collinearity diagnostics, such as variance inflation factors (VIF) below 5, to ensure stable estimates. Misspecification, such as modeling a formative construct like product attributes (e.g., features forming perceived quality) as reflective, remains prevalent in marketing research, potentially leading to erroneous path coefficients and invalid inferences; conversely, treating satisfaction as formative overlooks its reflective nature as a manifested outcome.
Researchers must ground the choice in theory to mitigate these risks and enhance model validity.[20][23]
Methodology
Model Specification
Model specification in partial least squares path modeling (PLS-PM) involves a systematic process to define the relationships between latent constructs and their indicators, as well as the paths among the constructs themselves. The first step is to identify the key theoretical constructs based on the research objectives, followed by hypothesizing directional relationships among these constructs to form the structural model. A path diagram is then drawn to visually represent these hypothesized paths, with exogenous constructs (independent variables) influencing endogenous constructs (dependent variables), ensuring that the model is recursive (it contains no causal loops) and that every construct has measured indicators. Next, observed indicators are assigned to each latent construct, specifying whether the measurement model is reflective (indicators caused by the construct) or formative (construct caused by indicators), as determined by theoretical considerations.
The structural model captures the hypothesized relationships among latent variables and is expressed in matrix notation as $\eta = B\eta + \Gamma\xi + \zeta$, where $\eta$ is the vector of endogenous latent variables, $\xi$ is the vector of exogenous latent variables, $B$ is the matrix of path coefficients among endogenous variables, $\Gamma$ is the matrix of path coefficients from exogenous to endogenous variables, and $\zeta$ is the vector of disturbances or errors. This formulation allows for the modeling of complex interdependent relationships, with the goal of maximizing the explained variance in the endogenous constructs.[24]
The measurement models link the latent variables to their observed indicators. For reflective models, the equations are $\mathbf{y} = \Lambda_y \eta + \epsilon$ for endogenous constructs and $\mathbf{x} = \Lambda_x \xi + \delta$ for exogenous constructs, where $\mathbf{y}$ and $\mathbf{x}$ are vectors of indicators, $\Lambda_y$ and $\Lambda_x$ are matrices of factor loadings, and $\epsilon$ and $\delta$ are error terms.
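The structural and reflective measurement equations above can be made concrete by simulating data from them. The following is a minimal sketch, assuming a single exogenous and a single endogenous construct with standardized variables; the function name `simulate_reflective_model`, the path value 0.6, and the loading values are illustrative assumptions, not values taken from the literature cited here.

```python
import numpy as np


def simulate_reflective_model(n=2000, gamma=0.6,
                              loadings_x=(0.8, 0.7),
                              loadings_y=(0.9, 0.75), seed=0):
    """Draw data from eta = gamma*xi + zeta, x = lambda_x*xi + delta,
    y = lambda_y*eta + epsilon (all variables scaled to unit variance)."""
    rng = np.random.default_rng(seed)
    lam_x = np.asarray(loadings_x, dtype=float)
    lam_y = np.asarray(loadings_y, dtype=float)

    # Structural model: one exogenous and one endogenous latent variable.
    xi = rng.standard_normal(n)
    zeta = rng.standard_normal(n) * np.sqrt(1.0 - gamma**2)  # Var(eta) = 1
    eta = gamma * xi + zeta

    # Reflective measurement models: each indicator is loading * construct
    # plus an error whose variance keeps the indicator at unit variance.
    delta = rng.standard_normal((n, lam_x.size)) * np.sqrt(1.0 - lam_x**2)
    epsilon = rng.standard_normal((n, lam_y.size)) * np.sqrt(1.0 - lam_y**2)
    X = xi[:, None] * lam_x + delta
    Y = eta[:, None] * lam_y + epsilon
    return xi, eta, X, Y


xi, eta, X, Y = simulate_reflective_model()
print("corr(xi, eta):", np.corrcoef(xi, eta)[0, 1])
print("corr(x1, xi): ", np.corrcoef(X[:, 0], xi)[0, 1])
```

Under these assumptions the correlation between the two constructs approximates the path coefficient, and the correlation between an indicator and its construct approximates the loading, which is the sense in which reflective indicators are "effects" of the construct.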
In formative models, the equations reverse the causality: $\eta = B_y \mathbf{y} + \zeta$ for endogenous and $\xi = B_x \mathbf{x} + \theta$ for exogenous constructs, where $B_y$ and $B_x$ are matrices of indicator weights (not to be confused with the structural coefficient matrix $B$), and $\zeta$ and $\theta$ are disturbances at the measurement level. The overall data is represented by the matrix $\mathbf{X}$ of dimensions $n \times m$, where $n$ is the number of observations and $m$ is the total number of indicators; latent variable scores are approximated as $\mathbf{Y} = \mathbf{X} \mathbf{W}$, with $\mathbf{W}$ as the weight matrix, and loadings are collected in the matrix $\mathbf{P}$ such that $\mathbf{X} \approx \mathbf{Y} \mathbf{P}^T$.[25]
To handle model complexity, multi-group analysis is specified by defining the same path diagram and measurement models separately for each subgroup (e.g., based on demographic categories), enabling subsequent comparison of path coefficients across groups without altering the core specification.[26] Higher-order models extend the specification by treating lower-order constructs as indicators for a superordinate latent variable, particularly in formative second-order setups where multiple first-order reflective or formative constructs form a higher-level abstract construct, which reduces model complexity and improves parsimony.[27]
Estimation Procedure
The estimation procedure in partial least squares path modeling (PLS-PM) relies on an iterative algorithm that approximates latent variable scores and path coefficients through alternating least squares regressions, originally developed by Herman Wold and formalized for path models by Jan-Bernd Lohmöller.[28][9] This component-based approach estimates the outer measurement model (linking indicators to latent variables) and the inner structural model (linking latent variables) in a two-stage process of outer and inner approximations, without requiring distributional assumptions like multivariate normality.[29] The algorithm proceeds by iteratively updating weights and scores until convergence, maximizing the explained variance in a predictive sense rather than reproducing a covariance matrix.[2]
The PLS algorithm distinguishes between two modes for outer estimation, depending on the measurement model type. Mode A, used for reflective models, employs a correlation-based approach where indicators are assumed to reflect the latent variable, and outer weights are derived to maximize the correlation between indicators and the latent variable score.[30] In contrast, Mode B, applied to formative models, uses a regression-based approach where the latent variable is formed by its indicators, and weights are estimated to minimize the least squares error in regressing the latent score onto the indicators.[30] These modes allow flexibility in handling different causal directions in measurement specifications.[31]
The estimation follows a step-by-step iterative process.
First, latent variable scores are initialized, typically by setting initial weights to equal values (e.g., 1 / number of indicators) or using simple averages of centered indicators.[30] Next, inner estimates approximate the structural relationships by regressing endogenous latent variables (η) on exogenous ones (ξ) using ordinary least squares (OLS), yielding path coefficients and inner weights that propagate influences across the model.[2] Then, outer estimates update weights and scores for each latent variable block based on the selected mode: for Mode A, weights reflect covariances between indicators and the current latent score; for Mode B, weights come from OLS regression of the latent score on indicators.[30] Latent scores are rescaled and updated as weighted sums of indicators using these new weights. This outer-inner cycle repeats until the change in latent variable scores or weights falls below a small threshold, such as $10^{-7}$.[32]
Key equations underpin these steps. For reflective models (Mode A), outer weights $\mathbf{w}$ for a block of indicators $\mathbf{X}$ are computed as $\mathbf{w} = \frac{\mathbf{X}^T \boldsymbol{\eta}}{\boldsymbol{\eta}^T \boldsymbol{\eta}}$, where $\boldsymbol{\eta}$ is the current latent variable score vector, equivalent to the covariance normalized by the variance of $\boldsymbol{\eta}$.[2] The updated latent score is then $\boldsymbol{\eta}^* = \mathbf{X} \mathbf{w} / \sqrt{\operatorname{Var}(\mathbf{X} \mathbf{w})}$, that is, the weighted sum divided by its own standard deviation, which ensures unit variance.
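The outer-inner cycle and the Mode A and Mode B weight updates described above can be sketched for the smallest possible case, one exogenous block predicting one endogenous block. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`pls_two_blocks`, `outer_weights`, `unit_variance`), the centroid-style inner step (each construct's proxy is its signed neighbour score), and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np


def standardize(a):
    """Center each column and scale it to unit variance."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0)


def outer_weights(block, proxy, mode):
    """Mode A: one simple regression per indicator on the proxy.
    Mode B: one multiple regression of the proxy on all indicators."""
    if mode == "A":
        return block.T @ proxy / (proxy @ proxy)
    return np.linalg.lstsq(block, proxy, rcond=None)[0]


def unit_variance(block, w):
    """Rescale weights so that the score block @ w has unit variance."""
    return w / (block @ w).std()


def pls_two_blocks(X, Y, mode_x="A", mode_y="A", tol=1e-7, max_iter=300):
    X, Y = standardize(X), standardize(Y)
    w_x = unit_variance(X, np.ones(X.shape[1]))  # equal starting weights
    w_y = unit_variance(Y, np.ones(Y.shape[1]))
    for _ in range(max_iter):
        xi, eta = X @ w_x, Y @ w_y
        # Inner step (assumed centroid scheme): the proxy for a construct
        # is its neighbour's score, signed by the current correlation.
        sign = 1.0 if xi @ eta >= 0 else -1.0
        new_w_x = unit_variance(X, outer_weights(X, sign * eta, mode_x))
        new_w_y = unit_variance(Y, outer_weights(Y, sign * xi, mode_y))
        change = max(np.abs(new_w_x - w_x).max(),
                     np.abs(new_w_y - w_y).max())
        w_x, w_y = new_w_x, new_w_y
        if change < tol:  # stop once the weights have stabilized
            break
    xi, eta = X @ w_x, Y @ w_y
    path = (xi @ eta) / (xi @ xi)  # OLS slope of eta on xi
    return w_x, w_y, path


# Synthetic illustration: true latent path 0.6, three indicators per block.
rng = np.random.default_rng(0)
n = 1000
xi_true = rng.standard_normal(n)
eta_true = 0.6 * xi_true + 0.8 * rng.standard_normal(n)
X_demo = 0.8 * xi_true[:, None] + 0.6 * rng.standard_normal((n, 3))
Y_demo = 0.8 * eta_true[:, None] + 0.6 * rng.standard_normal((n, 3))
w_x, w_y, path = pls_two_blocks(X_demo, Y_demo)
print("outer weights:", w_x, w_y)
print("path estimate:", path)
```

In this sketch the estimated path falls below the latent value of 0.6 because the composites contain measurement error, which is the attenuation that consistent PLS (PLSc) is designed to correct. Passing `mode_x="B", mode_y="B"` switches both blocks to the regression-based update.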
For the inner model, structural estimates solve the system $\boldsymbol{\eta} = \mathbf{B} \boldsymbol{\eta} + \boldsymbol{\Gamma} \boldsymbol{\xi} + \boldsymbol{\zeta}$ via OLS regression of each endogenous score on the scores of its predictor constructs, where $\mathbf{B}$ captures inner paths among endogenous latent variables and $\boldsymbol{\Gamma}$ links exogenous to endogenous latent variables.[31] These regressions use current approximations of scores as inputs.[30]
To handle non-linear relationships, extensions incorporate non-linear kernels or transformations within the PLS framework, such as quadratic terms or kernel-based mappings that project data into higher-dimensional spaces for capturing non-linear effects in either outer or inner estimations.[33] These options modify the weight calculations to use non-linear functions, preserving the iterative structure while accommodating complex dependencies.[34]
Convergence is assessed by monitoring the relative change in latent variable scores or weights across iterations, stopping when it is less than a predefined criterion like $10^{-7}$ or after a maximum of 300 iterations to prevent excessive computation.[35] Unlike global optimization methods, the PLS algorithm seeks local optima through this fixed-point iteration, which may depend on initial values but typically converges rapidly for well-specified models.[29]
Model Assessment and Validation
Model assessment in partial least squares path modeling (PLS-PM), also known as PLS-SEM, involves a two-stage process: first evaluating the measurement models to ensure the reliability and validity of the constructs, and then assessing the structural model to examine the relationships between constructs. This sequential approach ensures that the outer model (indicators to constructs) is sound before interpreting the inner model (constructs to constructs).
Measurement Model Assessment
For reflective measurement models, which are the most common in PLS-PM, reliability is evaluated using composite reliability (CR) and Cronbach's alpha, with values greater than 0.7 indicating satisfactory internal consistency. Convergent validity is assessed by examining the average variance extracted (AVE), which should exceed 0.5, meaning the construct explains more than half of the variance in its indicators; additionally, outer loadings should be above 0.7 to confirm that indicators reliably capture their construct. Discriminant validity ensures that constructs are distinct from each other and is tested using the Fornell-Larcker criterion, where the square root of the AVE for each construct must be greater than its correlations with other constructs, and the heterotrait-monotrait ratio (HTMT), where values below 0.85 (the more conservative threshold) or below 0.90 (a more lenient threshold, typically used for conceptually similar constructs) indicate sufficient discrimination. Cross-loadings can also be inspected, with indicators loading higher on their assigned construct than on others.
For formative measurement models, assessment focuses on indicator collinearity (VIF < 5) and on indicator significance and relevance (statistical significance of outer weights via bootstrapping, with outer loadings >= 0.5 as a further criterion), with redundancy analysis used to check convergent validity.[36] If the measurement models meet these criteria, researchers proceed to the structural model; otherwise, indicators may need revision or removal.
Structural Model Assessment
The structural model is evaluated starting with the coefficient of determination (R²), which measures the proportion of variance explained in endogenous constructs; thresholds are 0.1 for small, 0.25 for medium, and 0.5 for large explanatory power, though context-specific interpretations apply. The effect size (f²) quantifies the impact of an exogenous construct on R², with values of 0.02 (small), 0.15 (medium), and 0.35 (large) guiding interpretation. Predictive relevance is assessed via the Stone-Geisser Q², obtained through blindfolding (e.g., with an omission distance of 7), where Q² > 0 indicates predictive power, and sizes are classified as 0.02 (small), 0.15 (medium), and 0.35 (large). Collinearity among exogenous constructs should also be checked using variance inflation factors (VIF < 5).
Significance Testing
Path coefficients, outer loadings, and other estimates are tested for significance using nonparametric bootstrapping, typically with 5,000 resamples, to generate confidence intervals or p-values; paths with absolute t-values above 1.96 (two-tailed p < 0.05) are considered significant. For formative models, bootstrapping assesses indicator weights. Parametric alternatives, such as standard errors from the PLS algorithm, are less common but can be used when assumptions hold.
Advanced Metrics
To enhance consistency with covariance-based SEM, the consistent PLS (PLSc) algorithm corrects for attenuation bias in reflective constructs, yielding consistent estimates of path coefficients and construct correlations. Model fit can be approximated using the standardized root mean square residual (SRMR), with values below 0.08 indicating good overall fit, though PLS-PM lacks a comprehensive global fit measure like chi-square. The HTMT inference test via bootstrapping provides a more robust discriminant validity check than traditional methods.
Reporting Standards
PLS-PM results should be reported with a path diagram showing standardized path coefficients, significance levels, and R² values, accompanied by tables summarizing measurement metrics (e.g., loadings, CR, AVE, HTMT) and structural metrics (e.g., path coefficients, f², Q²). Bootstrapping results, including confidence intervals, enhance transparency, and software outputs (e.g., from SmartPLS) should be clearly documented.

| Criterion | Threshold | Purpose |
|---|---|---|
| Composite Reliability (CR) | > 0.7 | Internal consistency reliability |
| Average Variance Extracted (AVE) | > 0.5 | Convergent validity |
| HTMT | < 0.85 | Discriminant validity |
| R² | > 0.25 (medium) | Explained variance |
| f² | > 0.15 (medium) | Effect size |
| Q² | > 0 | Predictive relevance |
| SRMR | < 0.08 | Overall model fit |
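
Several of the criteria in the table reduce to simple arithmetic on standardized outer loadings, construct correlations, and R² values. The following is a minimal sketch of those calculations using the standard textbook formulas, assuming standardized loadings; the function names and the example numbers are illustrative and do not come from any particular software package or study.

```python
import numpy as np


def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances),
    with error variance 1 - loading^2 for standardized indicators."""
    lam = np.asarray(loadings, dtype=float)
    squared_sum = lam.sum() ** 2
    return squared_sum / (squared_sum + (1.0 - lam**2).sum())


def average_variance_extracted(loadings):
    """AVE = mean of the squared standardized loadings."""
    lam = np.asarray(loadings, dtype=float)
    return float(np.mean(lam**2))


def fornell_larcker_ok(ave_by_construct, construct_corr):
    """True if sqrt(AVE) of every construct exceeds its largest absolute
    correlation with any other construct."""
    root_ave = np.sqrt(np.asarray(ave_by_construct, dtype=float))
    corr = np.abs(np.asarray(construct_corr, dtype=float))  # new array
    np.fill_diagonal(corr, 0.0)  # ignore each construct's self-correlation
    return bool(np.all(root_ave > corr.max(axis=1)))


def f_squared(r2_included, r2_excluded):
    """Effect size of one predictor: change in R^2 relative to unexplained variance."""
    return (r2_included - r2_excluded) / (1.0 - r2_included)


# Hypothetical example values.
loadings = [0.80, 0.75, 0.70]
cr = composite_reliability(loadings)
ave = average_variance_extracted(loadings)
distinct = fornell_larcker_ok([ave, 0.60], [[1.0, 0.55], [0.55, 1.0]])
f2 = f_squared(r2_included=0.50, r2_excluded=0.40)
print(f"CR={cr:.3f} (>0.7?), AVE={ave:.3f} (>0.5?), "
      f"Fornell-Larcker satisfied={distinct}, f2={f2:.2f}")
```

HTMT, Q², and SRMR are omitted from the sketch because they require indicator-level correlations, blindfolding, or the model-implied correlation matrix rather than summary values alone.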