Partial correlation
Partial correlation is a statistical measure that quantifies the degree and direction of the linear association between two continuous random variables while adjusting for the potential confounding effects of one or more additional continuous variables.[1][2] Introduced by Karl Pearson in his 1896 work on regression and heredity, it extends the Pearson correlation coefficient to multivariate settings by isolating the unique relationship between the variables of interest.[3] The coefficient ranges from -1, indicating a perfect negative linear relationship after adjustment, to +1 for a perfect positive one, with 0 signifying no such relationship.[1][2]

The partial correlation coefficient between two variables X and Y, controlling for a third variable Z, is computed as

r_{xy \cdot z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}},

where r_{xy}, r_{xz}, and r_{yz} are the standard Pearson correlation coefficients among the respective pairs.[1] This formula derives from the residuals of linear regressions of X and Y on Z, effectively removing the linear influence of Z before assessing the correlation.[2] For multiple controlling variables, the computation generalizes through matrix algebra involving the inverse of the correlation matrix, though the principle remains the same: partialling out shared variance.

In practice, partial correlation is essential for discerning direct associations in complex datasets, for example in epidemiology, where it is used to evaluate relationships between exposures and outcomes while adjusting for covariates such as age or socioeconomic status.[2] It differs from the related semipartial correlation, which controls for the effect of the additional variables on only one of the primary variables and thereby measures unique predictive contributions in regression models.[2] Statistical significance of partial correlations can be tested with t- or F-statistics, whose degrees of freedom depend on the sample size and are reduced by the number of controlling variables.[1] Applications span fields such as psychology, economics, and biology, where partial correlation helps avoid spurious inferences from unadjusted bivariate correlations.[4]
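Both the three-variable formula and its matrix generalization are straightforward to evaluate numerically. The following Python sketch is a minimal illustration using NumPy on simulated data (the generating coefficients and variable names are hypothetical, not taken from any source dataset): it computes the partial correlation from the pairwise Pearson correlations and, equivalently, reads it off the inverse of the correlation matrix.

```python
import numpy as np

def partial_corr_formula(r_xy, r_xz, r_yz):
    # Three-variable partial correlation r_{xy.z} from pairwise Pearson correlations
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def partial_corr_matrix(data):
    # Partial correlation of every pair of columns, controlling for all remaining
    # columns, obtained from the inverse of the correlation matrix
    corr = np.corrcoef(data, rowvar=False)
    precision = np.linalg.inv(corr)
    d = np.sqrt(np.diag(precision))
    partial = -precision / np.outer(d, d)
    np.fill_diagonal(partial, 1.0)
    return partial

# Simulated data: columns are X, Y, Z, with Z influencing both X and Y
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
x = 0.7 * z + rng.normal(size=1000)
y = 0.5 * z + rng.normal(size=1000)
data = np.column_stack([x, y, z])

r = np.corrcoef(data, rowvar=False)
print(partial_corr_formula(r[0, 1], r[0, 2], r[1, 2]))  # formula version
print(partial_corr_matrix(data)[0, 1])                  # matrix version, same value
```

With three variables the two approaches return the same number, since inverting the correlation matrix controls for the single remaining variable; with more columns, the matrix version automatically partials out all of the other variables at once.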
Fundamentals

Definition
Partial correlation is a measure of the strength and direction of the linear association between two random variables while accounting for the influence of one or more additional controlling variables. It achieves this by computing the correlation between the residuals of the two primary variables after each has been regressed linearly on the set of controlling variables, effectively removing the shared variance attributable to those controls.[1]

This approach builds on the foundational concept of simple correlation, in which the Pearson correlation coefficient \rho_{XY} quantifies the linear relationship between two variables X and Y as the covariance divided by the product of their standard deviations, \rho_{XY} = \frac{\cov(X,Y)}{\sigma_X \sigma_Y}, with values ranging from -1 (perfect negative linear association) to +1 (perfect positive linear association) and 0 indicating no linear association.[5]

For two variables X and Y controlling for a third variable Z, the partial correlation coefficient is formally defined as

\rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ} \rho_{YZ}}{\sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}},

where \rho_{XY}, \rho_{XZ}, and \rho_{YZ} are the respective Pearson correlation coefficients; the formula isolates the unique association between X and Y beyond the effects of Z.[1] The concept of partial correlation was introduced by Karl Pearson in 1896 as an extension of simple correlation to handle multivariate relationships, particularly in distinguishing genuine associations from spurious ones arising from confounding factors.[3]
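As a brief worked example with hypothetical values, suppose \rho_{XY} = 0.6, \rho_{XZ} = 0.5, and \rho_{YZ} = 0.4. Substituting into the definition gives

\rho_{XY \cdot Z} = \frac{0.6 - (0.5)(0.4)}{\sqrt{(1 - 0.5^2)(1 - 0.4^2)}} = \frac{0.4}{\sqrt{0.75 \times 0.84}} \approx \frac{0.4}{0.794} \approx 0.50,

so an unadjusted correlation of 0.6 shrinks to roughly 0.50 once the linear dependence of both variables on Z is taken into account.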
Basic Properties

The partial correlation coefficient possesses key mathematical and statistical properties that align it closely with the simple Pearson correlation while accounting for confounding variables. It is symmetric in the variables of interest, such that \rho_{XY \cdot Z} = \rho_{YX \cdot Z}, reflecting the bidirectional nature of the conditional linear association after controlling for Z.[6] Like the simple correlation, the partial correlation is bounded between -1 and 1, with 0 indicating no remaining linear relationship between the variables after adjustment, positive values denoting direct associations, and negative values indicating inverse associations; this bound follows from its definition as the correlation of residuals, which inherits the Cauchy-Schwarz bound satisfied by standard correlations.[7][6] The coefficient is also invariant under separate affine transformations of the individual variables (shifts and positive rescalings), as these transformations leave the standardized residuals used in its computation unchanged, ensuring the measure remains consistent across equivalent scales.[6]

Assuming the variables follow a multivariate normal distribution, the sample partial correlation provides a consistent estimator of the population parameter and is asymptotically unbiased in large samples, with a sampling distribution that approaches normality under transformations such as Fisher's z.[8] The squared partial correlation \rho^2_{XY \cdot Z} represents the proportion of the variance in X (or Y) left unexplained by Z that is uniquely accounted for by Y (or X); equivalently, it equals the increase in the squared multiple correlation coefficient when that predictor is added to a regression already containing the controls, expressed as a fraction of the variance the reduced model leaves unexplained (the raw increment in R^2 corresponds instead to the squared semipartial correlation).[9]
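This last relationship is easy to verify numerically. The short Python sketch below, assuming simulated data with arbitrary coefficients (NumPy only; the variable names are placeholders), computes the squared partial correlation from regression residuals and compares it with the incremental R^2 rescaled by the variance the reduced model leaves unexplained.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                        # control variable
x = 0.6 * z + rng.normal(size=n)              # predictor of interest
y = 0.5 * z + 0.4 * x + rng.normal(size=n)    # outcome

def ols_residuals(target, predictors):
    # Residuals of an ordinary least-squares fit of target on predictors (with intercept)
    X = np.column_stack([np.ones(len(target))] + predictors)
    beta = np.linalg.lstsq(X, target, rcond=None)[0]
    return target - X @ beta

def r_squared(target, predictors):
    # R^2 of an ordinary least-squares fit of target on predictors (with intercept)
    return 1 - ols_residuals(target, predictors).var() / target.var()

# Squared partial correlation of X and Y given Z, via correlation of residuals
pr = np.corrcoef(ols_residuals(x, [z]), ols_residuals(y, [z]))[0, 1]

# The same quantity from the gain in R^2 when X joins a model that already contains Z
r2_reduced = r_squared(y, [z])
r2_full = r_squared(y, [z, x])

print(pr**2)                                      # squared partial correlation
print((r2_full - r2_reduced) / (1 - r2_reduced))  # agrees up to floating-point error
```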
Computation Methods

Linear Regression Approach
One method for computing the partial correlation coefficient between two variables X and Y while controlling for a set of variables Z relies on linear regression to isolate their unique linear association by removing the effects of Z. This approach treats the partial correlation as the correlation between the residuals obtained after regressing X and Y separately on Z.[9][10] The procedure follows these steps. First, fit a linear regression of X on Z to obtain the predictions \hat{X} = \beta_{X \cdot Z} Z (where \beta_{X \cdot Z} is the vector of regression coefficients, assuming the variables are appropriately centered or an intercept is included), and compute the residuals e_X = X - \hat{X}. Similarly, regress Y on Z to obtain \hat{Y} = \beta_{Y \cdot Z} Z and the residuals e_Y = Y - \hat{Y}. The partial correlation is then

\rho_{XY \cdot Z} = \corr(e_X, e_Y) = \frac{\cov(e_X, e_Y)}{\sqrt{\var(e_X) \var(e_Y)}},

which quantifies the linear relationship between X and Y after adjusting for the linear influence of Z.[9][10]

This method offers an intuitive view of confounding effects, since the residuals represent the portions of X and Y unexplained by Z, allowing direct assessment of the residual association. It is also computationally straightforward and easily implemented in statistical software: in R, the lm() function can generate the residuals, which are then passed to cor(), while in Python the statsmodels library provides the equivalent regression tools and packages such as pingouin offer dedicated functions for direct computation.[9][11][12]
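The three steps translate directly into code. The sketch below is a minimal Python illustration on simulated data (the column names x, y, z and the generating coefficients are placeholders); it uses statsmodels for the two auxiliary regressions and then correlates the residuals, with the equivalent one-line pingouin call shown in the comments.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in data; column names and coefficients are illustrative only
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({"z": rng.normal(size=n)})
df["x"] = 0.8 * df["z"] + rng.normal(size=n)
df["y"] = 0.5 * df["z"] + 0.3 * df["x"] + rng.normal(size=n)

# Step 1: regress x on z (with intercept) and keep the residuals e_x
e_x = sm.OLS(df["x"], sm.add_constant(df["z"])).fit().resid
# Step 2: regress y on z and keep the residuals e_y
e_y = sm.OLS(df["y"], sm.add_constant(df["z"])).fit().resid
# Step 3: the partial correlation is the Pearson correlation of the two residual series
print(np.corrcoef(e_x, e_y)[0, 1])

# Cross-check with pingouin's dedicated function, if the package is available:
# import pingouin as pg
# print(pg.partial_corr(data=df, x="x", y="y", covar="z"))
```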
As a numerical example, consider a hypothetical dataset of n = 50 individuals with measurements of height (X, in cm), weight (Y, in kg), and age (Z, in years). Regressing height on age yields an estimated slope \beta_{X \cdot Z} \approx 0.8 (height increases by about 0.8 cm per year of age in this sample), producing residuals e_X; regressing weight on age gives \beta_{Y \cdot Z} \approx 0.4 (about 0.4 kg per year), with residuals e_Y. The correlation between these residuals is \rho_{XY \cdot Z} \approx 0.65, indicating a moderately strong partial association between height and weight once the linear effect of age has been removed.
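To judge whether a partial correlation of this size could plausibly arise by chance, the usual test statistic is t = r \sqrt{(n - 2 - k)/(1 - r^2)} with n - 2 - k degrees of freedom, where k is the number of controlling variables. The sketch below, assuming the approximate figures from the example above (r \approx 0.65, n = 50, k = 1), uses SciPy to obtain the two-sided p-value.

```python
import numpy as np
from scipy import stats

r = 0.65   # partial correlation from the example above
n = 50     # sample size
k = 1      # number of controlling variables (age)

dof = n - 2 - k                       # degrees of freedom
t = r * np.sqrt(dof / (1 - r**2))     # t-statistic for H0: partial correlation is zero
p = 2 * stats.t.sf(abs(t), dof)       # two-sided p-value

print(f"t = {t:.2f}, df = {dof}, p = {p:.1e}")
```

Here t is roughly 5.9 on 47 degrees of freedom, so under these assumptions the partial association would be judged highly significant at conventional levels.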