Correlation coefficient
The correlation coefficient is a statistical measure that quantifies the strength and direction of the linear association between two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.[1] It is widely used in fields such as psychology, economics, and the natural sciences to assess how changes in one variable correspond to changes in another, without implying causation.[2]

The most common form, known as Pearson's product-moment correlation coefficient (denoted as r), was developed by Karl Pearson in 1895 as part of his work on the mathematical theory of evolution, building on earlier ideas from Francis Galton about regression and heredity.[3] Pearson's r is calculated using the formula r = cov(X,Y) / (σ_X σ_Y), where cov(X,Y) is the covariance between variables X and Y, and σ_X and σ_Y are their standard deviations.[4] For Pearson's r to be reliable, the relationship must be linear and free from significant outliers, as violations can lead to misleading interpretations; bivariate normality is assumed for statistical inference.[5]

Other notable types of correlation coefficients address limitations of Pearson's r for non-linear or non-parametric data. Spearman's rank correlation coefficient (ρ or r_s), introduced by Charles Spearman in 1904, evaluates the monotonic relationship between ranked variables rather than raw values, making it suitable for ordinal data or when normality assumptions fail. It is computed as the Pearson correlation on ranked data, yielding values from -1 to +1, and is particularly robust to outliers. Kendall's tau (τ), developed by Maurice Kendall in 1938, measures the ordinal association based on concordant and discordant pairs in rankings, offering another non-parametric alternative.[6] These coefficients, like Pearson's, do not distinguish correlation from causation and require careful consideration of sample size for significance testing.[7]

In practice, correlation coefficients facilitate hypothesis testing about associations, with statistical significance determined via t-tests or p-values, and their squared values (r²) indicating the proportion of variance explained (the coefficient of determination).[8] Guidelines for interpretation classify |r| < 0.3 as weak, 0.3–0.7 as moderate, and > 0.7 as strong, though these thresholds vary by context.[9] Overall, correlation coefficients remain foundational tools in statistical analysis, enabling researchers to explore relationships while underscoring the need for complementary methods like regression to model dependencies.[10]
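As a concrete comparison of these measures, the following Python sketch (an illustration assuming NumPy and SciPy are available; the data are invented) computes Pearson's r, Spearman's ρ, and Kendall's τ for a small dataset whose relationship is perfectly monotonic but not linear.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# Invented data: y increases strictly with x, but exponentially rather than linearly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)

r, r_p = pearsonr(x, y)        # linear association
rho, rho_p = spearmanr(x, y)   # monotonic association via ranks
tau, tau_p = kendalltau(x, y)  # concordant vs. discordant pairs

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```

Because the relationship is perfectly monotonic, the rank-based coefficients equal 1, while Pearson's r falls below 1, reflecting the departure from linearity.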
Fundamentals
Definition
In statistics, correlation refers to a measure of statistical dependence between two random variables, indicating how they tend to vary together without implying causation, as a relationship may arise from confounding factors or coincidence rather than one variable directly influencing the other.[11] This dependence can manifest as linear or monotonic associations, where changes in one variable are systematically accompanied by changes in the other, either in the same direction (positive) or opposite direction (negative). Correlation coefficients standardize this relationship to provide a dimensionless quantity that facilitates comparison across different datasets or scales.

To understand correlation, it is essential to first consider prerequisite concepts such as random variables, which are variables whose values are determined by outcomes of a random process, and covariance, an unnormalized measure of the joint variability between two such variables that quantifies how they deviate from their expected values in tandem.[12] Covariance captures the direction and magnitude of this co-variation but is sensitive to the units of measurement, making it less comparable across contexts; correlation coefficients address this by normalizing covariance relative to the individual variabilities of the variables involved.[13]

The correlation coefficient typically ranges from -1 to +1, where a value of +1 signifies perfect positive association (both variables increase together), -1 indicates perfect negative association (one increases as the other decreases), and 0 suggests no linear association, though non-linear dependencies may still exist.[14] This bounded scale allows for intuitive interpretation of the strength and direction of the relationship.

The concept was introduced by Francis Galton in the late 1880s as part of his work on regression and heredity, with Karl Pearson providing a formal mathematical definition in the 1890s, establishing it as a cornerstone of statistical analysis.[15][16] The Pearson correlation coefficient serves as the most common example of this measure in practice.[17]
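To make the effect of this normalization concrete, the brief Python sketch below (an illustration assuming NumPy; the paired measurements are invented) shows that rescaling one variable's units changes the covariance but leaves the correlation unchanged.

```python
import numpy as np

# Invented paired measurements: heights in metres and weights in kilograms.
heights_m = np.array([1.60, 1.65, 1.70, 1.75, 1.80])
weights_kg = np.array([55.0, 60.0, 63.0, 70.0, 74.0])

# Covariance depends on units: expressing height in centimetres scales it by 100.
cov_m = np.cov(heights_m, weights_kg)[0, 1]
cov_cm = np.cov(heights_m * 100, weights_kg)[0, 1]

# Correlation is dimensionless: the same rescaling leaves it unchanged
# (up to floating-point precision).
corr_m = np.corrcoef(heights_m, weights_kg)[0, 1]
corr_cm = np.corrcoef(heights_m * 100, weights_kg)[0, 1]

print(cov_m, cov_cm)    # the second is 100 times the first
print(corr_m, corr_cm)  # both report the same value
```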
General Properties
Correlation coefficients exhibit several fundamental mathematical properties that make them useful for measuring associations between variables. The population correlation coefficient, denoted by the Greek letter ρ, quantifies the true linear relationship between two random variables in the entire population, while the sample correlation coefficient, denoted r, serves as an estimate of ρ based on observed data from a finite sample.[18] This distinction is crucial because r is subject to sampling variability and converges to ρ as the sample size increases.[8]

A key property is the expression of the correlation coefficient in terms of covariance and standard deviations. Specifically, the population correlation is given by \rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}, where \operatorname{Cov}(X,Y) is the covariance between X and Y, and \sigma_X and \sigma_Y are their respective standard deviations.[19] This relation standardizes the covariance, rendering the correlation coefficient dimensionless and independent of the units of measurement for the variables. The sample analog follows the same form, replacing population parameters with sample estimates.[20]

Due to this standardization, correlation coefficients are bounded between -1 and +1, with values of ±1 indicating perfect positive or negative linear relationships, 0 indicating no linear association, and intermediate values reflecting the strength and direction of the linear dependence.[19] Additionally, the coefficient is symmetric, such that \rho_{X,Y} = \rho_{Y,X}, and invariant under linear transformations of the variables, meaning that affine shifts (adding constants) or scalings (multiplying by positive constants) do not alter its value.[21][22] These properties hold for standardized measures like the Pearson correlation coefficient.[23]

However, these properties come with limitations: correlation coefficients are designed to detect linear associations and may produce low values even for strong nonlinear relationships, failing to capture dependencies that deviate from linearity.[24] For instance, a symmetric quadratic relationship can yield a correlation near zero despite a clear pattern, and Pearson's r will understate the strength of other nonlinear (for example, exponential) relationships.[25]
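The sketch below (illustrative, assuming NumPy) checks two of these points numerically: Pearson's r is unchanged by a positive affine transformation of one variable, and a deterministic but non-monotonic quadratic relationship produces a correlation near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.5, size=10_000)

# Invariance: corr(a*x + b, y) equals corr(x, y) for any a > 0 and any b.
r_xy = np.corrcoef(x, y)[0, 1]
r_affine = np.corrcoef(3.0 * x + 7.0, y)[0, 1]
print(round(r_xy, 6), round(r_affine, 6))  # identical up to floating-point rounding

# Limitation: z is a deterministic function of x, yet the quadratic (non-monotonic)
# dependence yields a Pearson correlation close to zero.
z = x ** 2
print(round(np.corrcoef(x, z)[0, 1], 4))
```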
Pearson Correlation Coefficient
Formula and Computation
The Pearson correlation coefficient, denoted as \rho_{XY} for a population, measures the linear relationship between two random variables X and Y. It is defined as \rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}, where \mathrm{Cov}(X,Y) is the covariance, \sigma_X and \sigma_Y are the standard deviations, \mu_X and \mu_Y are the means, and E[\cdot] denotes the expected value.[26][27]

For a sample of n paired observations (x_i, y_i), the sample Pearson correlation coefficient r estimates \rho_{XY} using r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}, where \bar{x} and \bar{y} are the sample means. This formula arises from dividing the sample covariance by the product of the sample standard deviations; the sample covariance is typically computed as \frac{1}{n-1} \sum (x_i - \bar{x})(y_i - \bar{y}) to provide an unbiased estimate of the population covariance, and the sample variances use the same n-1 denominator. In the ratio, the n-1 terms cancel, yielding the expression above, which is a consistent but slightly biased estimator of \rho_{XY}.[4]

To compute r, first calculate the sample means \bar{x} = \frac{1}{n} \sum x_i and \bar{y} = \frac{1}{n} \sum y_i. Next, center the data by subtracting these means to obtain the deviations (x_i - \bar{x}) and (y_i - \bar{y}). Then compute the numerator as the sum of the products of these deviations, which equals the sample covariance scaled by n-1. Finally, compute the denominator as the square root of the product of the sums of squared deviations, which are proportional to the sample variances. Dividing the numerator by the denominator yields r, which ranges from -1 to 1.[4]

Consider a small dataset of n = 4 paired observations of height (in inches) and weight (in pounds) for illustration: (60, 120), (62, 125), (65, 130), (68, 135).

| i | Height x_i (in) | Weight y_i (lb) | x_i - \bar{x} | y_i - \bar{y} | (x_i - \bar{x})(y_i - \bar{y}) | (x_i - \bar{x})^2 | (y_i - \bar{y})^2 |
|---|---|---|---|---|---|---|---|
| 1 | 60 | 120 | -3.75 | -7.5 | 28.125 | 14.0625 | 56.25 |
| 2 | 62 | 125 | -1.75 | -2.5 | 4.375 | 3.0625 | 6.25 |
| 3 | 65 | 130 | 1.25 | 2.5 | 3.125 | 1.5625 | 6.25 |
| 4 | 68 | 135 | 4.25 | 7.5 | 31.875 | 18.0625 | 56.25 |
| Sum | 255 | 510 | 0 | 0 | 67.5 | 36.75 | 125 |
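Substituting the column sums into the formula gives r = \frac{67.5}{\sqrt{36.75 \times 125}} \approx 0.996, indicating a very strong positive linear association in this small sample. The Python sketch below (variable names and the use of NumPy are illustrative choices) reproduces the hand computation step by step.

```python
import numpy as np

heights = np.array([60.0, 62.0, 65.0, 68.0])      # x_i, inches
weights = np.array([120.0, 125.0, 130.0, 135.0])  # y_i, pounds

# Step 1: sample means (63.75 and 127.5).
x_bar = heights.mean()
y_bar = weights.mean()

# Step 2: deviations from the means.
dx = heights - x_bar
dy = weights - y_bar

# Step 3: numerator, the sum of products of deviations (67.5).
numerator = np.sum(dx * dy)

# Step 4: denominator, the square root of the product of the sums of
# squared deviations, i.e. sqrt(36.75 * 125).
denominator = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

r = numerator / denominator
print(round(r, 4))  # about 0.996
```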
In R, the cor() function from the base stats package calculates r for numeric vectors x and y using the formula above. Similarly, in Python, the scipy.stats.pearsonr(x, y) function from SciPy returns r together with its p-value.
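For instance, a minimal usage sketch with the height and weight data from the table (assuming SciPy is installed) reproduces the hand-computed value:

```python
from scipy.stats import pearsonr

heights = [60, 62, 65, 68]
weights = [120, 125, 130, 135]

r, p_value = pearsonr(heights, weights)
print(round(r, 4))  # matches the manual computation, about 0.996
print(p_value)      # two-sided p-value for the null hypothesis of zero correlation
```

The p-value tests the null hypothesis that the underlying correlation is zero; with only n = 4 observations it should be interpreted cautiously.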