
Pearson correlation coefficient

The Pearson correlation coefficient, also known as Pearson's product-moment correlation coefficient and denoted as r, is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables, with values ranging from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation) and 0 indicating no linear correlation. Developed by the British statistician Karl Pearson in 1895 as an extension of earlier ideas on correlation and regression by Francis Galton, it provides a dimensionless index invariant under positive linear transformations of the variables, making it widely applicable across the natural and social sciences for assessing associations in data. The formula for the Pearson correlation coefficient is given by
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}},
where x_i and y_i are individual data points, and \bar{x} and \bar{y} are the sample means; this expression normalizes the covariance by the product of the standard deviations, ensuring |r| \leq 1. Computationally, r equals the slope of the least-squares line fitted between the standardized variables, sharing the same sign as that slope (positive for upward trends, negative for downward). The square of r, known as the coefficient of determination r^2, represents the proportion of variance in one variable predictable from the other under a simple linear regression model.
Valid use of the Pearson coefficient assumes a linear relationship between the variables, continuous quantitative data without extreme outliers, and homoscedasticity (constant variance of residuals); for inferential purposes like hypothesis testing, bivariate normality is also required to ensure the sampling distribution of r follows known properties. Violations, such as nonlinearity or influential outliers, can lead to underestimation or overestimation of the true association, prompting alternatives like Spearman's rank correlation for monotonic but nonlinear relationships. Despite these limitations, the coefficient remains a foundational tool in statistical analysis due to its interpretability and connection to regression, influencing modern methods in machine learning and data science.

History and Naming

Origins and Development

The ideas underlying the Pearson correlation coefficient emerged from earlier statistical explorations of relationships between variables. In the mid-19th century, the Belgian statistician Adolphe Quetelet laid foundational concepts in his 1846 work Lettres à S.A.R. le Duc régnant de Saxe-Cobourg et Gotha, sur la théorie des probabilités, appliquée aux sciences morales et politiques, where he applied probability theory to social phenomena and examined interdependent social variables, emphasizing systematic "rapports" or relations among them. The mathematical formula underlying the coefficient was first derived by the French mathematician and astronomer Auguste Bravais in 1844 in his work on probabilities, though it was Pearson who independently developed and applied it extensively in biometrics. These notions influenced subsequent biometric studies, particularly in quantifying deviations and associations in heritable traits. Building on Quetelet's framework, Francis Galton advanced the study of variable interdependence in the 1880s through his investigations into heredity. In his 1886 paper "Regression Towards Mediocrity in Hereditary Stature," published in the Journal of the Anthropological Institute of Great Britain and Ireland, Galton described the phenomenon of offspring measurements regressing toward the population mean relative to parental extremes, introducing the term "regression" to capture this tendency and highlighting proportional relationships in familial traits. Galton's empirical work on sweet peas and human heights provided a practical basis for measuring linear dependencies, directly inspiring further mathematical formalization. In 1895, Pearson contributed to the understanding of regression in his paper "Note on Regression and Inheritance in the Case of Two Parents," published in the Proceedings of the Royal Society of London. He formalized the correlation coefficient the following year in "Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity, and Panmixia," published in the Philosophical Transactions of the Royal Society, where he derived its properties for evolutionary and hereditary analysis, integrating it into the emerging field of biometrics alongside his collaborator W. F. R. Weldon. To disseminate these ideas, Pearson co-founded the journal Biometrika in 1901 with Weldon and Galton, establishing it as a dedicated outlet for statistical applications in biology that prominently featured correlation analyses. The coefficient gained broader adoption in the early 20th century through Ronald A. Fisher's extensions, particularly his 1915 paper "Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population," published in Biometrika, which derived the sampling distribution of r essential for inference and testing in biometric and genetic studies. Fisher's contributions bridged Pearson's measure with modern statistical methods, facilitating its integration into experimental design and analysis by the 1920s.

Notation and Terminology

The standard notation for the Pearson correlation coefficient designates \rho (the Greek letter rho) as the population parameter, representing the true linear correlation between two variables in the entire population, while r (the Roman letter) denotes the sample statistic, which estimates \rho from observed data. This measure is commonly known by several alternative terms, including the product-moment correlation coefficient, reflecting its computation as a normalized product of deviations from the means, and the bivariate normal correlation, as it parameterizes the linear dependence in a bivariate normal distribution. Historically, the terminology evolved from the simpler "coefficient of correlation," as introduced by Karl Pearson in his seminal 1896 paper in which he formalized the measure for studies of regression and heredity, to the more precise modern designation "Pearson product-moment correlation coefficient," which distinguishes its specific formula and origins. The Pearson coefficient must be distinguished from other measures, such as Spearman's rank correlation coefficient, which assesses monotonic rather than strictly linear relationships and is nonparametric, applicable to ordinal data without assuming bivariate normality.

Motivation and Definition

Conceptual Motivation

The Pearson correlation coefficient quantifies the extent to which two variables exhibit a linear relationship, a concept that gains intuition from visualizing paired data via scatterplots. In such plots, a perfect positive linear relationship appears as points aligned precisely along an upward-sloping straight line, corresponding to a coefficient of +1; a perfect negative relationship shows points along a downward-sloping line, yielding -1. When points form a random cloud with no directional trend, the coefficient is 0, indicating the absence of linear structure, even if curved or nonlinear patterns might exist in the data. This metric extends the idea of covariance, which captures the joint variability of two variables but is sensitive to their measurement units and scales. By standardizing covariance through division by the product of the variables' standard deviations, the Pearson coefficient becomes dimensionless and scale-invariant, producing values between -1 and +1 that solely reflect the strength and direction of linear dependence. Consider the relationship between height and weight as a straightforward illustration: in a dataset of pre-teen girls, the coefficient reaches about 0.694, signaling a moderate to strong positive linear link in which greater height tends to accompany higher weight, though this association does not imply causation, as confounding factors such as age and activity levels play key roles. At a high level, the coefficient emerges from the principles of least squares, where one minimizes the sum of squared vertical distances between data points and a fitted straight line to best predict one variable from another. This process yields a slope that, when normalized by the predictor's variability and related to the response's scale, equates to the covariance divided by the product of the standard deviations, thus providing a unified, proportional gauge of linear relatedness.

Population Definition

The population Pearson correlation coefficient, denoted \rho_{X,Y}, quantifies the strength and direction of the linear relationship between two random variables X and Y in a population. It is defined as \rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}, where \operatorname{Cov}(X,Y) denotes the covariance between X and Y, and \sigma_X and \sigma_Y are the standard deviations of X and Y, respectively. This definition requires that X and Y have finite, nonzero variances, so that \sigma_X > 0 and \sigma_Y > 0 and the denominator is well-defined. The covariance \operatorname{Cov}(X,Y) expands to the expected value \operatorname{E}[(X - \mu_X)(Y - \mu_Y)], with \mu_X = \operatorname{E}[X] and \mu_Y = \operatorname{E}[Y] as the population means, so the coefficient can equivalently be written as \rho_{X,Y} = \frac{\operatorname{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}. While the definition applies to any joint distribution satisfying the finite variance condition, the coefficient's role as the linear correlation parameter is most directly interpretable under the assumption of a bivariate normal joint distribution for X and Y; in that setting, \rho_{X,Y} fully parameterizes the dependence in the joint distribution.
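
As an illustrative sketch (not from the original source), the following Python snippet simulates a bivariate normal population with a specified \rho and checks that the ratio \operatorname{Cov}(X,Y)/(\sigma_X \sigma_Y) estimated from a large sample approaches that parameter; the value \rho = 0.7 and the sample size are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.7                        # assumed population correlation, for illustration only
cov = [[1.0, rho], [rho, 1.0]]   # unit variances, so Cov(X, Y) = rho

# Draw a large sample from the bivariate normal distribution parameterized by rho
xy = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
x, y = xy[:, 0], xy[:, 1]

# Empirical Cov(X, Y) / (sigma_X * sigma_Y) should be close to rho
rho_hat = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(rho_hat)   # approximately 0.7
```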

Sample Definition

The sample Pearson correlation coefficient, denoted r, quantifies the strength and direction of the linear association between two continuous variables based on a sample of n paired observations (x_i, y_i) for i = 1 to n. It provides an estimate of the corresponding population parameter \rho, adapting the theoretical definition to empirical data. The formula for r is given by r = \frac{ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2 } }, where \bar{x} and \bar{y} are the sample means of the x and y values, respectively. This expression equals the sample covariance divided by the product of the sample standard deviations; using the unbiased versions (dividing by n-1 for the covariance and variances) yields the same result, as the factors of n-1 cancel in the ratio. To compute r in practice, first calculate the sample means \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i and \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i. Next, center the data by finding the deviations x_i - \bar{x} and y_i - \bar{y} for each pair. Then, sum the products of these deviations to obtain \sum (x_i - \bar{x})(y_i - \bar{y}), and separately sum the squared deviations \sum (x_i - \bar{x})^2 and \sum (y_i - \bar{y})^2. Finally, divide the sum of products by the square root of the product of the two sums of squared deviations. Practical computation requires attention to potential issues, such as missing values or zero variance. If any observations are missing, pairwise deletion, which uses only complete pairs for the calculation, is a common method to retain as much data as possible while ensuring valid pairs, though it can result in varying sample sizes across variable pairs. Division by zero occurs if \sum (x_i - \bar{x})^2 = 0 or \sum (y_i - \bar{y})^2 = 0, rendering r undefined; this happens when one variable is constant across the sample, indicating zero variability and thus no possible linear relationship.
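
The step-by-step computation described above can be sketched in Python as follows; the data values are invented for illustration, and the comparison against scipy.stats.pearsonr simply confirms that the hand-computed formula agrees with a standard library implementation.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.5])

# Step 1: sample means
x_bar, y_bar = x.mean(), y.mean()

# Step 2: centered deviations
dx, dy = x - x_bar, y - y_bar

# Step 3: sum of cross-products and sums of squared deviations
sxy = np.sum(dx * dy)
sxx = np.sum(dx ** 2)
syy = np.sum(dy ** 2)

# Step 4: the ratio defining r (undefined if either variable is constant)
r = sxy / np.sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")

print(f"manual r       = {r:.6f}")
print(f"scipy pearsonr = {stats.pearsonr(x, y)[0]:.6f}")  # should match
```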

Mathematical Properties

Basic Properties

The Pearson correlation coefficient, denoted as \rho for the population parameter and r for the sample statistic, is bounded between -1 and 1, inclusive. This range follows from applying the Cauchy-Schwarz inequality to the covariance, ensuring that the coefficient cannot exceed these limits in magnitude. Equality holds at \rho = 1 or \rho = -1 precisely when the variables exhibit a perfect linear relationship, meaning all points lie exactly on a straight line with positive or negative slope, respectively. The coefficient is symmetric, \rho_{X,Y} = \rho_{Y,X}, reflecting that the linear association between two variables is mutual and does not depend on which variable is considered first. Additionally, \rho remains invariant under positive linear transformations of the variables, specifically affine transformations of the form aX + b and cY + d where a > 0 and c > 0; such shifts in location (adding constants b or d) or positive scaling (multiplying by positive constants a or c) do not alter the value of \rho. The sign of \rho indicates the direction of the linear relationship: a positive value signifies a direct association in which increases in one variable tend to coincide with increases in the other, while a negative value denotes an inverse association in which increases in one correspond to decreases in the other. The coefficient is degenerate and undefined when the variance of either variable is zero, as this places a zero in the denominator of the defining formula, rendering the measure inapplicable for constant variables.
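
A brief numerical check of these properties (symmetry, invariance under positive affine transformations, and sign reversal under negative scaling) might look like the following Python sketch; the simulated data and the helper pearson_r are illustrative assumptions rather than part of the original text.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation computed from centered deviations."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)

r = pearson_r(x, y)
print(np.isclose(r, pearson_r(y, x)))                           # symmetry: rho_XY = rho_YX
print(np.isclose(r, pearson_r(3.0 * x + 7.0, 0.5 * y - 2.0)))   # invariant under positive affine maps
print(np.isclose(r, -pearson_r(-x, y)))                         # negative scaling flips the sign
```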

Geometric Interpretation

The Pearson correlation coefficient admits a natural geometric interpretation in terms of vectors in \mathbb{R}^n, where n is the number of observations. Consider two variables X and Y, with centered data vectors \vec{x} = (x_1 - \bar{x}, \dots, x_n - \bar{x}) and \vec{y} = (y_1 - \bar{y}, \dots, y_n - \bar{y}), where \bar{x} and \bar{y} are the sample means. The correlation coefficient r equals the cosine of the angle \theta between these vectors: r = \cos \theta = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \|\vec{y}\|}, where \vec{x} \cdot \vec{y} is the dot product and \|\cdot\| denotes the Euclidean norm. This formulation arises because centering removes the mean, projecting the data orthogonal to the all-ones vector, so that the angle captures linear association directly. This perspective visualizes the correlation through the relative orientation of \vec{x} and \vec{y}. If the vectors are orthogonal (\theta = 90^\circ), then r = 0, indicating no linear relationship, as the directions are perpendicular with zero dot product. Perfect positive correlation occurs when the vectors align (\theta = 0^\circ), yielding r = 1, while perfect negative correlation corresponds to opposition (\theta = 180^\circ), giving r = -1. Intermediate angles reflect partial associations, with |r| measuring the strength via how closely the vectors point in the same or opposite directions. The geometric view also links to ordinary least squares regression, where the method minimizes the sum of squared vertical residuals by orthogonally projecting the response vector onto the subspace spanned by the predictors. In simple linear regression, the correlation r quantifies the alignment between the centered predictor and response vectors, determining the proportion of variance explained (R^2 = r^2) and ensuring that the residual vector is orthogonal to the fitted values in this vector space. This projection underscores how r reflects the "fit" of one variable to another without scaling issues, as the cosine normalizes for vector lengths. For illustration, consider a scatterplot of n points (x_i, y_i), where the centered vectors \vec{x} and \vec{y} can be drawn as arrows from the origin in a vector diagram accompanying the plot. If the points form a tight upward line, the arrows align closely, yielding \theta \approx 0^\circ and r \approx 1; scattered points with no trend give arrows at roughly 90^\circ, so r \approx 0. This highlights how the angle intuitively conveys both the direction and strength of linear dependence in the point cloud.
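
The cosine identity can be verified directly on simulated data, as in the hedged Python sketch below; the data-generating model is an arbitrary assumption chosen only to produce a strong positive correlation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)

# Center both vectors, then compute the cosine of the angle between them
cx, cy = x - x.mean(), y - y.mean()
cos_theta = np.dot(cx, cy) / (np.linalg.norm(cx) * np.linalg.norm(cy))

r = stats.pearsonr(x, y)[0]
theta_deg = np.degrees(np.arccos(cos_theta))
print(np.isclose(cos_theta, r), theta_deg)  # cos(theta) equals r; a small angle for strong positive r
```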

Interpretation and Practical Use

Magnitude and Sign Interpretation

The sign of the Pearson correlation coefficient (r) indicates the direction of the linear relationship between two variables. A positive r means that as the value of one variable increases, the value of the other variable tends to increase as well, reflecting a direct association. Conversely, a negative r signifies that an increase in one variable is generally accompanied by a decrease in the other, indicating an inverse relationship. The magnitude of r, expressed as its absolute value |r|, quantifies the strength of this linear association, ranging from 0 (no linear relationship) to 1 (perfect linear relationship). Common interpretive guidelines, from Cohen's conventions for effect sizes in the behavioral sciences, classify |r| \approx 0.10 as small or weak, \approx 0.30 as medium or moderate, and \approx 0.50 as large or strong; however, these thresholds are inherently subjective and serve as rough benchmarks rather than rigid rules. Interpretations of magnitude are highly context-dependent, varying across disciplines due to differences in variability, measurement precision, and theoretical expectations. For instance, in the social sciences, correlations as low as |r| = 0.1 can be practically meaningful given the multifaceted nature of behaviors and large sample sizes, whereas in physics, values below |r| = 0.9 are often considered weak owing to the expectation of near-perfect linear relationships in controlled systems. As an illustrative example, in psychology, a Pearson correlation of r = 0.6 between measures of anxiety and performance on cognitive tasks is commonly viewed as a strong positive association, suggesting a substantial linear relationship in behavioral contexts.

Common Pitfalls in Interpretation

One common pitfall in interpreting the Pearson correlation coefficient is the assumption that a significant correlation implies causation. The coefficient measures only the strength and direction of a linear relationship between two variables, without establishing any directional or causal mechanism; for instance, a high positive correlation between variables X and Y does not indicate whether X causes Y, Y causes X, or both are influenced by a third factor. This error is particularly prevalent in observational studies where variables are not controlled. Spurious correlations represent another frequent misinterpretation, where an apparent linear relationship arises from coincidence or unaccounted-for factors rather than any meaningful connection. A classic example is the positive correlation between ice cream sales and shark attacks, which stems from seasonal confounding (both increase during warmer months due to higher beach attendance) rather than any direct influence of ice cream consumption on marine incidents. Such spurious links can mislead if not scrutinized for underlying causes. The Pearson correlation is highly sensitive to outliers, which can dramatically inflate or deflate the coefficient's magnitude, leading to erroneous conclusions about the overall relationship. A single extreme point can pull the correlation toward an apparent strong linear trend even if the bulk of the data shows little association; for example, in a sample of heights and weights mostly clustered around typical values, one unusually tall individual with a correspondingly high weight could yield a spuriously high |r| value. This vulnerability underscores the need to inspect scatterplots and consider robust alternatives before interpretation. Finally, the coefficient only captures linear relationships and may fail to detect strong nonlinear associations, resulting in a near-zero value despite a clear pattern. In U-shaped relationships, where one variable first increases and then decreases with the other, forming a parabolic pattern, the positive and negative deviations cancel out, yielding r ≈ 0 and suggesting no association even though a strong, non-monotonic relationship exists. This limitation highlights the importance of visualizing the data to avoid overlooking curved dependencies.
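
The two numerical pitfalls, cancellation under a U-shaped relationship and inflation by a single extreme point, can be reproduced with simulated data as in the following sketch; all data-generating choices here are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Nonlinearity: a strong U-shaped (quadratic) relationship yields r near 0
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.2, size=x.size)
print("quadratic relationship: r =", round(stats.pearsonr(x, y)[0], 3))   # near 0

# Outlier sensitivity: one extreme point inflates r in otherwise uncorrelated data
x0 = rng.normal(size=30)
y0 = rng.normal(size=30)
print("no outlier:   r =", round(stats.pearsonr(x0, y0)[0], 3))
x1 = np.append(x0, 10.0)   # single extreme point on both axes
y1 = np.append(y0, 10.0)
print("with outlier: r =", round(stats.pearsonr(x1, y1)[0], 3))           # much larger |r|
```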

Statistical Inference

Hypothesis Testing Overview

Hypothesis testing for the Pearson correlation coefficient assesses whether the sample statistic r provides evidence of a nonzero population coefficient \rho, or whether the observed association could plausibly arise from random sampling variation. The null hypothesis is typically formulated as H_0: \rho = 0, asserting no linear correlation in the population, against an alternative such as H_a: \rho \neq 0 for a two-sided test, or one-sided variants depending on the research question. This framework allows researchers to infer the presence of a linear relationship while controlling the Type I error rate, the probability of incorrectly rejecting a true null hypothesis. Various methods exist to test this null hypothesis, broadly categorized into parametric, nonparametric, and exact approaches. Parametric tests, pioneered by Fisher, assume bivariate normality of the data and leverage the sampling distribution of r under H_0 to evaluate significance. Nonparametric alternatives, such as permutation tests, estimate the null distribution by randomly re-pairing observations from the two variables while maintaining the marginal distributions, offering robustness to distributional assumptions. Bootstrap methods resample the observed pairs with replacement to approximate the variability of r, providing a flexible way to conduct tests without strong assumptions. For small sample sizes, exact tests utilize the precise sampling distribution of r under H_0, computed from the exact density under bivariate normality or via numerical methods. The p-value from these tests represents the probability of observing a sample with |r| at least as extreme as the one obtained, given that H_0: \rho = 0 holds true in the population. A low p-value indicates that such an extreme result is unlikely under the null hypothesis, supporting rejection of H_0 and inference of a linear association. Power considerations are crucial for test design, as the ability to detect a true nonzero \rho (1 - β, where β is the Type II error rate) increases with larger sample sizes and larger effect sizes. For instance, achieving 80% power to detect a moderate correlation of |\rho| = 0.3 at a 5% significance level requires approximately 85 observations. Inadequate sample sizes can lead to low power, increasing the risk of failing to detect meaningful correlations.
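
A minimal sketch of the parametric test and a permutation alternative is shown below; the simulated effect of 0.4 and n = 40 are assumptions chosen only for illustration. Here scipy.stats.pearsonr supplies the parametric p-value, and the permutation loop re-pairs the observations to approximate the null distribution of r.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)

r, p_param = stats.pearsonr(x, y)            # parametric test of H0: rho = 0
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # the equivalent t statistic with n - 2 df
print(f"r = {r:.3f}, t = {t:.2f}, parametric p = {p_param:.4f}")

# Permutation test: randomly re-pair y with x to build the null distribution of r
n_perm = 5_000
perm_r = np.array([stats.pearsonr(x, rng.permutation(y))[0] for _ in range(n_perm)])
p_perm = np.mean(np.abs(perm_r) >= abs(r))
print(f"permutation p = {p_perm:.4f}")
```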

Standard Error and Confidence Intervals

The standard error of the sample Pearson correlation coefficient r, denoted SE(r), measures the variability of r as an estimate of the population correlation \rho. For large sample sizes n, an approximation for the standard error is given by \text{SE}(r) \approx \sqrt{\frac{1 - r^2}{n - 2}}, which arises from the asymptotic behavior of the sampling distribution of r under the assumption of bivariate normality. This formula, derived by considering the variance of r for small deviations from \rho, is particularly useful for assessing the precision of r when n is sufficiently large (typically n > 30). Confidence intervals for \rho based on r are often constructed using this standard error, but the sampling distribution of r is skewed, especially for values of r near \pm 1 or for small n, calling for asymmetric intervals on the correlation scale. To address this skewness, Fisher's z-transformation is applied, where z = \frac{1}{2} \ln \left( \frac{1 + r}{1 - r} \right) has an approximately normal distribution with standard error \text{SE}(z) \approx 1 / \sqrt{n - 3}; the resulting interval for z is then back-transformed to the r-scale using the hyperbolic tangent function to obtain an asymmetric confidence interval for \rho. Details of the transformation and its properties are covered in the following subsection. An alternative nonparametric approach to estimating the standard error and confidence intervals for r is the bootstrap, which resamples pairs (x_i, y_i) with replacement from the original sample to generate an empirical sampling distribution of r. The bootstrap standard error is the standard deviation of the resampled coefficients across B replications (typically B \geq 1000), while confidence intervals are obtained from the quantiles of this empirical distribution, providing robust estimates without relying on normality assumptions. This method is especially valuable for small samples or non-normal data, as it captures the empirical variability directly. For illustration, consider a sample of size n = 30 with r = 0.5. The approximate standard error is \text{SE}(r) \approx \sqrt{(1 - 0.5^2)/(30 - 2)} \approx 0.164. Using Fisher's z-transformation, the 95% confidence interval for \rho is approximately (0.17, 0.73), reflecting the asymmetry of the sampling distribution, with the interval extending further below the point estimate than above it. In contrast, a naive symmetric interval r \pm 1.96 \times \text{SE}(r) would be (0.18, 0.82), which overstates both bounds because it ignores the skewness of the sampling distribution of r.
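
The worked example above can be reproduced with a short Python sketch; the fisher_ci and bootstrap_ci helpers are illustrative names written for this article, not functions from an existing library.

```python
import numpy as np
from scipy import stats

def fisher_ci(r, n, conf=0.95):
    """Confidence interval for rho via Fisher's z-transformation."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    zcrit = stats.norm.ppf(0.5 + conf / 2.0)
    return np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)

# Reproduce the worked example: n = 30, r = 0.5
print(fisher_ci(0.5, 30))            # approximately (0.17, 0.73)
print(np.sqrt((1 - 0.5**2) / 28))    # large-sample SE(r), approximately 0.164

def bootstrap_ci(x, y, n_boot=2000, conf=0.95, seed=0):
    """Percentile bootstrap interval for r from paired data."""
    rng = np.random.default_rng(seed)
    n = len(x)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                  # resample pairs with replacement
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return tuple(np.quantile(rs, [(1 - conf) / 2, (1 + conf) / 2]))
```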

Transformation Methods

The Fisher z-transformation, introduced by Ronald A. Fisher, provides a method to normalize the sampling distribution of the Pearson correlation coefficient r, transforming it into a variable whose distribution is approximately normal, particularly for moderate to large sample sizes. The transformation is defined as z = \frac{1}{2} \ln \left( \frac{1 + r}{1 - r} \right), which is equivalent to the inverse hyperbolic tangent function, z = \tanh^{-1}(r). Under the assumption of bivariate normality, the variance of z is approximately \frac{1}{n - 3}, where n is the sample size; this approximation holds well for |r| not too close to 1 and n > 10. This transformation facilitates hypothesis testing for the population correlation \rho. For testing the null hypothesis H_0: \rho = 0, an equivalent approach uses the test statistic t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}, which follows a Student's t-distribution with n - 2 degrees of freedom under H_0 and the bivariate normality assumption. This t-test is widely used due to its simplicity and exact distribution under the null, avoiding the need for the z-transformation in basic significance testing. For small sample sizes, the exact sampling distribution of r under bivariate normality can be written in closed form; the density is f(r) = \frac{(n-2)\,\Gamma(n-1)\,(1 - \rho^2)^{(n-1)/2}\,(1 - r^2)^{(n-4)/2}}{\sqrt{2\pi}\,\Gamma\left(n - \tfrac{1}{2}\right)\,(1 - \rho r)^{n - 3/2}}\, {}_2F_1\left(\tfrac{1}{2}, \tfrac{1}{2}; \tfrac{2n-1}{2}; \tfrac{1 + \rho r}{2}\right), where \Gamma is the gamma function and {}_2F_1 is the Gaussian hypergeometric function; under H_0: \rho = 0, it simplifies to f(r) = \frac{(1 - r^2)^{(n-4)/2}}{\mathrm{B}\left( \tfrac{1}{2}, \tfrac{n-2}{2} \right)}, where \mathrm{B} is the beta function. However, computing the exact distribution is cumbersome in practice, so approximations such as the Fisher z-transformation or the t-test are preferred even for smaller n, with simulations or tables used when necessary. The Fisher z-transformation is particularly valuable in meta-analysis of correlations, as it stabilizes the variance across studies with varying true correlations and sample sizes, allowing for more reliable weighted averaging of effect sizes; the transformed values are combined assuming approximate normality with known variance, then back-transformed if needed.
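
As a hedged sketch of the meta-analytic use of the transformation, the snippet below pools several hypothetical study correlations with inverse-variance weights (n_i - 3) on the z scale and also computes the equivalent t statistic for a single study; the study values are invented solely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical study results: sample correlations and sample sizes
r_vals = np.array([0.30, 0.45, 0.25, 0.38])
n_vals = np.array([50, 120, 80, 200])

# Transform to z and combine with inverse-variance weights (variance ~ 1 / (n - 3))
z_vals = np.arctanh(r_vals)
weights = n_vals - 3
z_bar = np.sum(weights * z_vals) / np.sum(weights)
se_bar = 1.0 / np.sqrt(np.sum(weights))

# Back-transform the pooled estimate and its 95% interval to the r scale
lo, hi = np.tanh(z_bar - 1.96 * se_bar), np.tanh(z_bar + 1.96 * se_bar)
print(f"pooled r = {np.tanh(z_bar):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")

# The equivalent t statistic for a single study's H0: rho = 0
r, n = 0.30, 50
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, p = {p:.4f}")
```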

Applications in Analysis

Role in Regression

In simple linear regression, the Pearson correlation coefficient r directly relates to the slope \beta of the regression line, providing a standardized measure of the linear association between the predictor X and the response Y. Specifically, the population slope is given by \beta = r \frac{\sigma_Y}{\sigma_X}, where \sigma_Y and \sigma_X are the standard deviations of Y and X, respectively; the analogous relationship holds for the sample estimates, b_1 = r \frac{s_Y}{s_X}. The sign of r matches that of \beta (or b_1), indicating the direction of the relationship, while the magnitude of r scales the slope relative to the variability in the two variables. The coefficient of determination R^2, which quantifies the proportion of variance in Y explained by X in the model, equals the square of the Pearson correlation coefficient: R^2 = r^2. This equivalence arises because, in simple linear regression, the multiple correlation coefficient R (between observed and predicted Y) equals the absolute value of r, making R^2 a direct measure of the model's explanatory power tied to the strength of the bivariate correlation. Thus, |r| close to 1 implies a strong fit, with nearly all variance accounted for, while r near 0 suggests minimal linear explanatory value from the predictor. In regression diagnostics, the Pearson correlation between the residuals and the predictor should be near zero, consistent with the linearity assumption and the absence of systematic structure in the residuals. Apparent patterns in the residuals may indicate nonlinearity or model misspecification, prompting refinement; scatterplots of residuals versus X visually assess this, with random scatter supporting model adequacy. For example, in predicting student exam scores (Y) from study hours (X), if r = 0.7, the slope would be b_1 = 0.7 \frac{s_Y}{s_X}, and R^2 = 0.49, meaning 49% of the score variance is explained by hours studied, with the positive sign of r signaling a moderately strong positive fit.
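
The identities b_1 = r s_Y / s_X and R^2 = r^2 can be confirmed numerically, as in the sketch below; the simulated study-hours data only loosely mirror the worked example and are an assumption of this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
hours = rng.uniform(0, 10, 100)                             # study hours (X), simulated
scores = 50 + 3.0 * hours + rng.normal(scale=8, size=100)   # exam scores (Y), simulated

r = stats.pearsonr(hours, scores)[0]
b1 = r * scores.std(ddof=1) / hours.std(ddof=1)             # slope from r and the two SDs

fit = stats.linregress(hours, scores)                       # ordinary least squares fit
print(np.isclose(b1, fit.slope))                            # b1 = r * s_Y / s_X
print(np.isclose(r**2, fit.rvalue**2))                      # R^2 = r^2

# Residuals from the fit are (numerically) uncorrelated with the predictor
resid = scores - (fit.intercept + fit.slope * hours)
print(round(stats.pearsonr(hours, resid)[0], 10))           # essentially 0 by construction
```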

Sensitivity to Distributions

The Pearson correlation coefficient is defined only when the variables involved have finite second moments, meaning their means and variances must exist and be finite; otherwise, the covariance and standard deviations used in its computation are undefined. The coefficient requires a minimum sample size of at least three observations to compute variances meaningfully, as fewer pairs yield indeterminate or trivial results (e.g., a perfect correlation of ±1 for n = 2). However, small sample sizes lead to unstable estimates, with sample correlations often inaccurate and highly variable; for instance, a true population correlation of 0.40 with n = 25 may yield a 90% interval spanning 0.07 to 0.65. Stability improves with larger samples, typically approaching n = 250 for reliable estimates in common scenarios, while approximations in hypothesis testing (e.g., the t-test) rely on large-sample behavior, with n ≥ 30 commonly recommended for reasonable normality assumptions. The Pearson coefficient is highly sensitive to outliers, which can disproportionately influence the covariance term and dramatically alter the result; analytical derivations and simulations show that even a single coincidental outlier (affecting both variables) can cause substantial distortions in the estimated coefficient, deviating far from the true value. For example, in uncorrelated data, one such outlier might shift the estimated correlation from near zero to a high positive or negative value, misleading interpretations of strength. Non-normal distributions exacerbate robustness issues, biasing the coefficient itself (inflating it by up to +0.14 in heavy-tailed cases) and distorting inference; significance tests such as the t-test on Pearson's r inflate Type I error rates and lose power under nonnormality, leading to unreliable p-values and confidence intervals. In modern big data and machine learning contexts, the Pearson coefficient faces critique for its exclusive focus on linear relationships, potentially yielding near-zero values despite strong nonlinear associations and thus overlooking complex patterns in high-dimensional datasets; alternatives like Spearman's rank correlation are often preferred for capturing monotonic nonlinearity without assuming linearity.
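
The small-sample variability and the advantage of a rank-based alternative can both be illustrated with simulation, as in the following sketch; the distributional choices (a bivariate normal with \rho = 0.40, and a contaminated sample with one discordant outlier) are assumptions made for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Sampling variability: true rho = 0.40, n = 25, many simulated replications
rho, n, reps = 0.40, 25, 10_000
cov = [[1.0, rho], [rho, 1.0]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))   # shape (reps, n, 2)
xc = xy - xy.mean(axis=1, keepdims=True)                        # center within each replicate
num = np.sum(xc[:, :, 0] * xc[:, :, 1], axis=1)
den = np.sqrt(np.sum(xc[:, :, 0]**2, axis=1) * np.sum(xc[:, :, 1]**2, axis=1))
rs = num / den
print(np.quantile(rs, [0.05, 0.95]))   # wide spread, roughly (0.07, 0.65)

# Robustness: one discordant outlier distorts Pearson far more than Spearman
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=50)
x = np.append(xy[:, 0], 8.0)
y = np.append(xy[:, 1], -8.0)          # a single extreme, discordant pair
print("Pearson :", round(stats.pearsonr(x, y)[0], 2))    # pulled far below 0.8
print("Spearman:", round(stats.spearmanr(x, y)[0], 2))   # remains strongly positive
```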

Variants and Extensions

Partial and Weighted Variants

The partial correlation coefficient measures the degree of association between two variables while controlling for the influence of one or more additional variables, often referred to as confounders or covariates. This extension of the standard Pearson correlation allows researchers to isolate the direct relationship between the primary variables by removing the linear effects of the controlling variable(s). For two variables X and Y controlling for a third variable Z, the population partial correlation coefficient \rho_{XY.Z} is given by \rho_{XY.Z} = \frac{\rho_{XY} - \rho_{XZ} \rho_{YZ}}{\sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}}, where \rho_{XY}, \rho_{XZ}, and \rho_{YZ} are the standard Pearson correlation coefficients among the respective pairs. This formula derives from the residuals of linear regressions of X and Y on Z, effectively computing the Pearson correlation on those residuals. Partial correlations are widely applied in multivariate analysis to discern genuine associations amid confounding factors, such as in psychological studies examining relationships between cognitive traits while adjusting for socioeconomic status, or in epidemiology to assess links between exposures and outcomes independent of age or sex. For instance, the partial correlation between income and education levels might be calculated while controlling for age; if the unadjusted correlation is 0.45 but drops to 0.32 after adjustment, this suggests that age accounts for some of the observed association. The weighted Pearson correlation coefficient extends the standard form to account for unequal importance or reliability of observations, and is particularly useful when data exhibit heteroscedasticity or arise from designs with varying sampling probabilities. It incorporates weights w_i for each pair (x_i, y_i), yielding the sample estimate r_w: r_w = \frac{\sum w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum w_i (x_i - \bar{x}_w)^2 \sum w_i (y_i - \bar{y}_w)^2}}, where \bar{x}_w = \sum w_i x_i / \sum w_i and similarly for \bar{y}_w. This adjustment ensures that more reliable or representative points contribute proportionally more to the overall measure. Weighted correlations find prominent use in survey analysis, where weights correct for unequal selection probabilities or nonresponse, enabling unbiased estimates of associations in fields like polling or educational assessments. For example, in analyzing survey data on health behaviors, weights based on demographic representativeness can yield a correlation between exercise frequency and health outcomes that better reflects the broader population.
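
A compact Python sketch of both variants follows; partial_corr and weighted_corr are illustrative helper names written for this article, the confounded data are simulated, and the final check compares the closed-form partial-correlation formula against the residual-based computation.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Pearson correlation of x and y controlling for z (residual method)."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)   # residuals of x regressed on z
    ry = y - np.polyval(np.polyfit(z, y, 1), z)   # residuals of y regressed on z
    return stats.pearsonr(rx, ry)[0]

def weighted_corr(x, y, w):
    """Weighted Pearson correlation with observation weights w."""
    xm, ym = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.sum(w * (x - xm) * (y - ym))
    return cov / np.sqrt(np.sum(w * (x - xm) ** 2) * np.sum(w * (y - ym) ** 2))

rng = np.random.default_rng(7)
z = rng.normal(size=200)               # confounder (e.g., age)
x = z + rng.normal(size=200)           # both variables depend on the confounder
y = z + rng.normal(size=200)

print("simple r :", round(stats.pearsonr(x, y)[0], 3))   # inflated by the confounder
print("partial r:", round(partial_corr(x, y, z), 3))     # near 0 after controlling for z

# Closed-form partial correlation from the three pairwise coefficients
rxy = stats.pearsonr(x, y)[0]
rxz = stats.pearsonr(x, z)[0]
ryz = stats.pearsonr(y, z)[0]
print(round((rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2)), 3))

# Weighted correlation with arbitrary illustrative weights
w = rng.uniform(0.5, 2.0, size=200)
print("weighted r:", round(weighted_corr(x, y, w), 3))
```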

Specialized Forms

The disattenuated Pearson correlation, also known as the correction for attenuation, adjusts the standard coefficient for measurement error in the variables, providing an estimate of the correlation between the true underlying constructs. Developed by Charles Spearman, this specialized form uses the formula \rho^* = \frac{r}{\sqrt{r_{xx} r_{yy}}}, where r is the observed Pearson correlation, and r_{xx} and r_{yy} are the reliability coefficients (e.g., test-retest or internal-consistency reliabilities) of the respective measures. This correction assumes classical test theory, in which observed scores are the sum of true scores and uncorrelated errors, and it can yield values exceeding 1 in magnitude if the reliabilities are low, though practical bounds are typically imposed. It is particularly useful in psychometrics and educational research to infer associations that would otherwise be attenuated by imperfect measurement. For angular or circular data, where variables represent directions or periodic phenomena (e.g., wind directions or clock times), the standard Pearson correlation fails due to the wrap-around nature of the circle, potentially underestimating associations. A specialized circular correlation coefficient addresses this by projecting the angles onto the unit circle and computing a sine-based analog: r_T = \frac{\sum_{i=1}^n \sin(\theta_i - \bar{\theta}) \sin(\phi_i - \bar{\phi})}{\sqrt{\sum_{i=1}^n \sin^2(\theta_i - \bar{\theta}) \sum_{i=1}^n \sin^2(\phi_i - \bar{\phi})}}, where \theta_i and \phi_i are the angular observations, and \bar{\theta}, \bar{\phi} are their circular means. This measure ranges from -1 to 1, is invariant to the choice of origin and to reflection, and its asymptotic distribution under the null hypothesis of independence follows a standard normal after suitable scaling, enabling significance testing. It has been widely adopted for analyzing directional data in fields such as meteorology and biology. Pearson's distance transforms the correlation coefficient into a dissimilarity measure for clustering or multidimensional scaling, commonly defined as d = 1 - r, yielding values from 0 (perfect positive correlation) to 2 (perfect negative correlation), or sometimes normalized as d = (1 - r)/2, ranging from 0 to 1. This form emphasizes distance-based analyses, such as hierarchical clustering of profiles, where it penalizes deviations from linear agreement rather than differences in absolute magnitude. It is scale-invariant like the original Pearson coefficient but serves only as a pseudo-metric in dissimilarity matrices, and it is commonly implemented in statistical software for cluster analysis. In quantum information theory, the Pearson coefficient has been adapted to quantify correlations in quantum systems, particularly for detecting entanglement beyond classical limits. By applying the classical formula to measurement outcomes on entangled states, researchers derive entanglement witnesses; for instance, if the absolute value of the Pearson correlation exceeds 1/\sqrt{2} for certain two-qubit states, the system is entangled. Recent extensions use it to measure total correlations, including quantum discord, via traces over density matrices or expectation values of Pauli operators, as in r = \frac{\operatorname{Tr}(\rho_{AB} \sigma_A \otimes \sigma_B) - \langle \sigma_A \rangle \langle \sigma_B \rangle}{\sqrt{(1 - \langle \sigma_A \rangle^2)(1 - \langle \sigma_B \rangle^2)}}. These applications, growing since 2020, highlight quantum correlations surpassing classical Pearson bounds in Bell tests and multipartite systems.
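
For the disattenuation formula and Pearson's distance, a small hedged sketch is given below; disattenuate and pearson_distance are illustrative names, and the reliabilities and toy vectors are invented for demonstration.

```python
import numpy as np

def disattenuate(r_obs, rel_x, rel_y):
    """Spearman's correction for attenuation: estimated correlation between true scores."""
    return r_obs / np.sqrt(rel_x * rel_y)

print(disattenuate(0.42, 0.80, 0.70))    # 0.42 / sqrt(0.56) ~= 0.561

def pearson_distance(x, y):
    """Dissimilarity d = 1 - r, ranging from 0 (r = 1) to 2 (r = -1)."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])       # perfectly correlated with a
c = np.array([4.0, 3.0, 2.0, 1.0])       # perfectly anti-correlated with a
print(pearson_distance(a, b))             # 0.0
print(pearson_distance(a, c))             # 2.0
```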
