Pearson correlation coefficient
The Pearson correlation coefficient, also known as Pearson's product-moment correlation coefficient and denoted r, is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables, with values ranging from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation) and 0 indicating no linear correlation.[1][2] Developed by the British statistician Karl Pearson in 1895 as an extension of earlier ideas on regression by Francis Galton, it provides a dimensionless index invariant to linear transformations of the variables, making it widely applicable in fields such as biology, economics, and the social sciences for assessing associations in bivariate data.[3][4] The formula for the Pearson correlation coefficient is given by

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}},

where x_i and y_i are the individual data points and \bar{x} and \bar{y} are the sample means.
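As a concrete check on the formula, the following Python sketch computes r directly from the defining sums and compares the result against scipy.stats.pearsonr; the sample data and the helper name pearson_r are illustrative, not taken from the cited sources.

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """Compute Pearson's r directly from the defining formula."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()      # deviations from the sample means
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Small illustrative bivariate sample with an upward linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(x, y)
r_scipy, _ = stats.pearsonr(x, y)            # library implementation for comparison
print(f"r (manual) = {r:.6f}, r (scipy) = {r_scipy:.6f}")
```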
This expression normalizes the covariance of the two variables by the product of their standard deviations, which guarantees |r| \leq 1.[1][2] Computationally, r equals the slope of the simple linear regression line fitted between the standardized variables, and it shares the sign of the ordinary regression slope (positive for upward trends, negative for downward).[1] The square of r, known as the coefficient of determination r^2, represents the proportion of variance in one variable that is predictable from the other under a linear model.[1] Both properties are checked numerically in the first sketch below.

Valid use of the Pearson coefficient assumes a linear relationship between the variables, continuous quantitative data without extreme outliers, and homoscedasticity (constant variance of residuals); for inferential purposes such as hypothesis testing, bivariate normality is also required so that the sampling distribution of r has known properties.[2][5] Violations such as nonlinearity or influential outliers can lead to underestimation or overestimation of the true association, prompting alternatives like Spearman's rank correlation for monotonic but nonlinear relationships, as contrasted in the second sketch below.[2] Despite these limitations, the coefficient remains a foundational tool in statistical analysis due to its interpretability and connection to regression, and it underpins many modern methods in machine learning and data science.[6]
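To make the regression connection concrete, this first sketch (using synthetic data; all names and parameters are illustrative) verifies that r equals the slope of the least-squares line between the z-scored variables, and that r^2 matches the proportion of variance explained by the linear fit:

```python
import numpy as np

rng = np.random.default_rng(0)                   # synthetic data for illustration
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.5, size=200)    # linear trend plus noise

r = np.corrcoef(x, y)[0, 1]

# The slope of the regression line between standardized (z-scored) variables equals r.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
slope, intercept = np.polyfit(zx, zy, 1)

# r^2 equals the proportion of variance in y explained by the linear fit on x.
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b1 * x + b0)
var_explained = 1 - resid.var() / y.var()

print(f"r = {r:.4f}, standardized slope = {slope:.4f}")
print(f"r^2 = {r**2:.4f}, variance explained = {var_explained:.4f}")
```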
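The second sketch illustrates the limitation noted above by contrasting Pearson's r with Spearman's rank correlation on a monotonic but strongly nonlinear relationship; the exponential example is an illustrative choice, not drawn from the cited sources.

```python
import numpy as np
from scipy import stats

# Monotonic but nonlinear relationship: y grows exponentially with x.
x = np.linspace(1, 10, 50)
y = np.exp(x)                         # strictly increasing, far from linear

r_pearson, _ = stats.pearsonr(x, y)   # understates the (perfect) monotonic association
rho, _ = stats.spearmanr(x, y)        # rank-based, equals 1 for any monotonic relation

print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {rho:.3f}")
```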