Correlation
In statistics, correlation is a measure of the strength and direction of the linear relationship between two continuous variables, quantified by a correlation coefficient that ranges from -1 to +1, where values near +1 indicate a strong positive association, values near -1 a strong negative association, and values near 0 no linear association.[1] The concept originated in the late 19th century through the work of Francis Galton, who developed the idea of the correlation coefficient to quantify consistent linear relationships between numeric variables, such as the relationship between the heights of parents and their children in his studies of heredity.[2] Karl Pearson later formalized the mathematical formula for the Pearson product-moment correlation coefficient in 1895, establishing it as a cornerstone of modern statistical analysis.[3]

The most common form, Pearson's correlation coefficient (denoted r for samples and \rho for populations), assumes normally distributed data and measures linear relationships, with positive values indicating that as one variable increases the other tends to increase, and negative values indicating the opposite.[1] For non-normal or ordinal data, alternatives such as Spearman's rank correlation coefficient (\rho_s) are used; these assess monotonic relationships by ranking the variables and are more robust to outliers.[1] Other variants, such as Kendall's tau, evaluate ordinal associations based on concordant and discordant pairs, providing another measure of rank correlation strength.[4]

Key properties of correlation coefficients include their dimensionless nature, symmetry (the correlation between X and Y equals that between Y and X), and invariance under rescaling of the variables, making them versatile for comparing relationships across datasets.[1] However, correlation does not imply causation, as associations may arise from confounding factors, chance, or indirect influences, a limitation emphasized since the measure's early development to prevent misinterpretation in fields like medicine and the social sciences.[1] Correlation also captures only linear or monotonic patterns, potentially underestimating nonlinear relationships, and Pearson's method is sensitive to outliers.[5]

Applications of correlation span numerous disciplines, including assessing variable associations in psychology, economics, biology, and environmental science, often visualized through scatterplots to illustrate patterns before formal computation.[6] In research, it serves as a preliminary tool for hypothesis generation, informing regression analysis or experimental design, but it requires cautious interpretation alongside significance testing (e.g., p-values) to evaluate reliability.[7]
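The coefficients mentioned above are implemented in standard statistical libraries. The following minimal sketch, assuming NumPy and SciPy are available, computes Pearson's r, Spearman's \rho_s, and Kendall's \tau on the same simulated data; the variable names and simulation settings are illustrative only, not part of any particular study.

```python
# Minimal sketch: compare Pearson, Spearman, and Kendall coefficients
# on a simulated noisy linear relationship.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.0, size=200)   # noisy linear relationship

r, p_r = stats.pearsonr(x, y)        # linear association
rho, p_rho = stats.spearmanr(x, y)   # monotonic (rank-based) association
tau, p_tau = stats.kendalltau(x, y)  # rank association via concordant/discordant pairs

print(f"Pearson r  = {r:.3f} (p = {p_r:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3g})")
print(f"Kendall tau  = {tau:.3f} (p = {p_tau:.3g})")
```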
Fundamentals of Correlation
Definition and Interpretation
Correlation is a statistical measure that quantifies the strength and direction of the linear relationship between two variables, standardized to range from -1 to +1. A coefficient of +1 represents a perfect positive linear association, where one variable increases proportionally with the other; 0 indicates no linear association; and -1 signifies a perfect negative linear association, where one variable decreases as the other increases.[8] This measure focuses exclusively on linear dependencies and does not capture nonlinear relationships or imply causation.[9]

The term "correlation" was coined by the British scientist Francis Galton in 1888, during his studies on regression and biological inheritance, to describe the tendency of traits to vary together. Galton's ideas were expanded by the statistician Karl Pearson in 1895, who developed a mathematical framework for quantifying this association, laying the foundation for modern correlational analysis.[10][3]

Interpreting the correlation coefficient involves assessing both its sign (positive or negative direction) and its magnitude (strength of the linear link). Values close to 0 suggest a weak association, while common guidelines classify |r| < 0.3 as weak, 0.3–0.7 as moderate, and > 0.7 as strong; these thresholds are subjective and context-dependent, however, varying across fields such as psychology or economics.[9] For instance, a correlation of 0.8 might indicate a robust linear relationship in the social sciences but require cautious interpretation in physics, where expectations for effect sizes differ.[8]

Scatterplots provide an essential visual aid for interpreting correlation, plotting paired observations as points on a coordinate plane to reveal patterns. A high positive correlation appears as points tightly clustered along an upward-sloping line, a negative correlation as points along a downward-sloping line, and a low correlation as a diffuse cloud with no clear linear trend, enabling intuitive assessment of both strength and potential outliers.[11]
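As a rough numerical illustration of how magnitude corresponds to the tightness of a scatterplot, the sketch below (assuming NumPy; the chosen population correlations and sample size are arbitrary) draws bivariate normal samples at several population correlations and reports the sample r for each.

```python
# Minimal sketch: simulate bivariate normal data at several population
# correlations and report the sample Pearson r for each.
import numpy as np

rng = np.random.default_rng(42)
n = 500
for rho in (0.1, 0.5, 0.9, -0.7):
    cov = [[1.0, rho], [rho, 1.0]]        # unit variances, correlation rho
    x, y = rng.multivariate_normal([0, 0], cov, size=n).T
    r = np.corrcoef(x, y)[0, 1]           # sample correlation coefficient
    print(f"population rho = {rho:+.1f}  ->  sample r = {r:+.3f}")
```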
Correlation and Independence
In probability theory, two random variables X and Y are said to be uncorrelated if their covariance is zero, that is, \operatorname{Cov}(X, Y) = 0, or equivalently E[(X - \mu_X)(Y - \mu_Y)] = 0, where \mu_X = E[X] and \mu_Y = E[Y].[12] This condition implies that there is no linear relationship between the deviations of X and Y from their respective means.[13]

Independence of X and Y always implies that they are uncorrelated, since the joint expectation factors under independence: E[XY] = E[X]E[Y], which gives \operatorname{Cov}(X, Y) = 0.[13] However, the converse does not hold in general: zero correlation does not imply statistical independence.[14] A classic counterexample takes X uniformly distributed on [-1, 1] and Y = X^2. Here E[X] = 0 and E[XY] = E[X^3] = 0 (the integrand is an odd function over a symmetric interval), so \operatorname{Cov}(X, Y) = 0, confirming uncorrelatedness.[15] Yet X and Y are dependent: conditional on X = 0, Y is degenerate at 0, whereas the marginal distribution of Y has density 1/(2\sqrt{y}) on (0, 1].[15]

An important exception occurs for jointly normal distributions. If X and Y follow a bivariate normal distribution, then zero correlation (\rho_{X,Y} = 0) is equivalent to independence.[16] This equivalence arises because the joint density factors into the product of the marginal normal densities precisely when the off-diagonal covariance term vanishes.[17] Full details on this property are discussed in the context of bivariate normal distributions.

In practice, tests of zero correlation, such as those based on the Pearson correlation coefficient, can assess independence only when the bivariate normality assumption holds; otherwise they merely detect the absence of linear dependence and may miss nonlinear relationships.[18]
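The uncorrelated-but-dependent counterexample above is easy to check numerically. The sketch below (assuming NumPy; the sample size and conditioning thresholds are arbitrary) shows the sample correlation of X and Y = X^2 sitting near zero while the conditional mean of Y changes sharply with |X|, revealing the dependence.

```python
# Minimal sketch: X ~ Uniform[-1, 1] and Y = X^2 are uncorrelated but dependent.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2

# Sample correlation is close to zero (no linear association).
print("sample corr(X, Y):", round(np.corrcoef(x, y)[0, 1], 4))

# But Y is completely determined by X: its conditional mean depends on |X|.
print("mean of Y given |X| < 0.2:", round(y[np.abs(x) < 0.2].mean(), 4))   # about 0.013
print("mean of Y given |X| > 0.8:", round(y[np.abs(x) > 0.8].mean(), 4))   # about 0.813
```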
Pearson's Product-Moment Correlation
Mathematical Definition
The Pearson product-moment correlation coefficient of two random variables X and Y, denoted \rho_{X,Y}, is defined as the covariance of X and Y divided by the product of their standard deviations: \rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}, where \operatorname{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)], \mu_X = E[X] and \mu_Y = E[Y] are the expected values, \sigma_X = \sqrt{\operatorname{Var}(X)}, and \sigma_Y = \sqrt{\operatorname{Var}(Y)}.[19][20] This formulation, introduced by Karl Pearson in 1895, quantifies the strength and direction of the linear relationship between the variables, assuming finite variances.[21]

The coefficient can be derived from the covariance of standardized variables. Let Z_X = (X - \mu_X)/\sigma_X and Z_Y = (Y - \mu_Y)/\sigma_Y be the standardized versions of X and Y, each with mean zero and variance one. Then \rho_{X,Y} = E[Z_X Z_Y] = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}, which normalizes the covariance to lie within a bounded range, facilitating comparison across different scales.[3]

Geometrically, \rho_{X,Y} is the cosine of the angle between the centered random variables X - \mu_X and Y - \mu_Y viewed as vectors in the L^2 space of square-integrable random variables, with the expectation of the product as inner product: \rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{E[(X - \mu_X)^2]\, E[(Y - \mu_Y)^2]}} = \cos \theta. This interpretation highlights the coefficient as a measure of directional alignment in a vector-space framework.[3]

The value of \rho_{X,Y} satisfies -1 \leq \rho_{X,Y} \leq 1, a consequence of the Cauchy–Schwarz inequality applied to the inner product E[(X - \mu_X)(Y - \mu_Y)]. Equality \rho_{X,Y} = 1 holds if and only if Y = aX + b for some a > 0 and constant b (a perfect positive linear relationship), and \rho_{X,Y} = -1 holds if and only if a < 0 (a perfect negative linear relationship).[20][3]
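These equivalent forms are easy to check numerically on a finite sample, with population moments replaced by sample moments. The sketch below, assuming NumPy, compares the covariance ratio, the mean product of standardized values, and the cosine of the angle between the centered data vectors; all three agree with np.corrcoef because any common normalization factor cancels in each ratio.

```python
# Minimal sketch: three equivalent computations of the correlation coefficient.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1_000)
y = 0.6 * x + rng.normal(scale=0.8, size=1_000)

xc, yc = x - x.mean(), y - y.mean()            # centered data vectors

# 1) Covariance divided by the product of standard deviations.
r_cov = np.mean(xc * yc) / (x.std() * y.std())

# 2) Mean product of standardized variables, E[Z_X Z_Y].
r_std = np.mean((xc / x.std()) * (yc / y.std()))

# 3) Cosine of the angle between the centered vectors.
r_cos = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(r_cov, r_std, r_cos, np.corrcoef(x, y)[0, 1])  # all essentially equal
```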
Sample Correlation Coefficient
The sample correlation coefficient r, also known as Pearson's r, estimates the population correlation \rho from a finite sample of n paired observations (x_i, y_i) for i = 1, \dots, n. It is calculated as r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}, where \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i and \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i are the sample means.[21] This expression, originally formulated by Karl Pearson, normalizes the sample covariance by the product of the sample standard deviations, yielding a dimensionless measure bounded between -1 and 1.[21]

Although r is consistent for \rho as n \to \infty, it is a biased estimator for finite n, systematically underestimating |\rho| when |\rho| > 0, with expected value approximately E(r) \approx \rho \left(1 - \frac{1 - \rho^2}{2n}\right).[22] The magnitude of this downward bias increases with |\rho| and decreases with larger n, but it can distort inferences in small samples. To mitigate this bias and stabilize the variance for inference, Ronald Fisher introduced the z-transformation, z = \frac{1}{2} \ln \left( \frac{1 + r}{1 - r} \right) = \operatorname{artanh}(r), which is approximately normally distributed with mean \operatorname{artanh}(\rho) and variance 1/(n-3) for n > 3.[22] This transformation is particularly useful for confidence intervals and meta-analyses of correlations, as the near-normality holds even for moderate n.[22]

Computationally, the formula for r is built from the deviations from the means, d_{x_i} = x_i - \bar{x} and d_{y_i} = y_i - \bar{y}, which center the data. Unlike the unbiased sample covariance, which divides the sum of cross-products by n-1 to account for degrees of freedom, the correlation coefficient requires no such adjustment in its core sums: any common n-1 (or n) factor in the numerator's covariance and the denominator's standard deviations cancels, preserving the scale-invariant result.[21] This simplifies implementation in software and by hand, since raw sums of deviations suffice without Bessel's correction at the correlation stage.

For hypothesis testing, particularly under the null hypothesis H_0: \rho = 0 (no linear association in the population), the sample r can be assessed with the t-statistic t = r \sqrt{\frac{n-2}{1 - r^2}}, which follows a Student's t-distribution with n-2 degrees of freedom when the data are bivariate normal.[22] This test, derived from the sampling distribution of r under H_0, provides an exact p-value for small to moderate n, outperforming normal approximations in finite samples.[22] Rejection of H_0 at a chosen significance level indicates evidence of linear dependence, with the test's power increasing with n and |\rho|.
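A minimal sketch of these formulas (assuming NumPy and SciPy; the simulated data and the 95% confidence level are arbitrary choices) computes r from the deviation sums, a Fisher-z confidence interval, and the t-test of H_0: \rho = 0, then compares the result with scipy.stats.pearsonr.

```python
# Minimal sketch: sample r, Fisher z confidence interval, and t-test for rho = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)

dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Fisher z-transformation: approximate 95% confidence interval for rho.
z = np.arctanh(r)
half_width = 1.96 / np.sqrt(n - 3)
ci = np.tanh([z - half_width, z + half_width])

# t-test of H0: rho = 0 with n - 2 degrees of freedom.
t = r * np.sqrt((n - 2) / (1 - r**2))
p = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"r = {r:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}], t = {t:.2f}, p = {p:.4f}")
print(stats.pearsonr(x, y))  # should agree with r and p above
```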
Properties and Assumptions
The Pearson product-moment correlation coefficient exhibits several key invariance properties that make it a robust measure of linear association under certain transformations. Specifically, it remains unchanged under separate positive affine transformations of the variables: if X and Y are replaced by aX + b and cY + d respectively, where a > 0, c > 0, and b, d are constants, the population correlation \rho and the sample correlation r are unchanged. This scale and location invariance ensures that the coefficient reflects only the relative positioning of the data points, independent of units or shifts.

Regarding sampling properties, the sample correlation coefficient r is a consistent estimator of the population correlation \rho, converging in probability to \rho as the sample size n increases, provided the variables have finite variances.[23] For large n, the sampling distribution of r is approximately normal after applying Fisher's z-transformation, z = \frac{1}{2} \ln \left( \frac{1 + r}{1 - r} \right), which stabilizes the variance and facilitates inference such as confidence intervals and hypothesis tests.[19] This asymptotic normality holds under the assumption of finite fourth moments, though the raw distribution of r is skewed for small to moderate n.[19]

The coefficient relies on several fundamental assumptions for its definition and meaningful interpretation. It requires that both variables have finite second moments, i.e., E[X^2] < \infty and E[Y^2] < \infty, so that the variances \sigma_X^2 and \sigma_Y^2 are well defined and positive. Additionally, for \rho (or r) to accurately quantify the strength of association, the relationship between X and Y must be linear; the coefficient measures only linear dependence. If these assumptions are violated, for example when \sigma_X = 0 or \sigma_Y = 0 (a constant variable), the coefficient is undefined because of division by zero in its formula.[24]

A notable limitation follows from this focus on linearity: the Pearson correlation is insensitive to nonlinear relationships, even strong ones. For instance, if Y = X^2 with X uniformly distributed over [-1, 1], the variables are perfectly dependent, yet \rho = 0 because the association is quadratic rather than linear.[25] A near-zero value therefore does not imply independence, only the absence of linear correlation.
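The affine-invariance property can be verified directly: rescaling and shifting either variable with a positive scale factor leaves the sample correlation unchanged. The sketch below assumes NumPy; the particular constants a, b, c, d are arbitrary.

```python
# Minimal sketch: the sample correlation is invariant under positive affine
# transformations of either variable.
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=1_000)
y = 0.5 * x + rng.normal(size=1_000)

a, b, c, d = 2.5, -7.0, 0.1, 100.0        # arbitrary constants with a > 0 and c > 0
r_original = np.corrcoef(x, y)[0, 1]
r_transformed = np.corrcoef(a * x + b, c * y + d)[0, 1]

print(r_original, r_transformed)           # identical up to floating-point error
```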
Illustrative Example
To illustrate the computation of Pearson's product-moment correlation coefficient, consider a hypothetical dataset of heights (in cm) and weights (in kg) for five adults: the heights are 160, 165, 170, 175, 180, and the corresponding weights are 50, 55, 60, 65, 70.[26] This dataset exhibits a perfect linear relationship, as each increase of 5 cm in height corresponds to an increase of 5 kg in weight.

The sample correlation coefficient r is calculated using the formula r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}, where x_i are the heights, y_i are the weights, and \bar{x}, \bar{y} are their respective means.[26] First, compute the means: \bar{x} = 170 cm and \bar{y} = 60 kg. The deviations from the means, their products, and the squared deviations are shown in the table below:

| Height (x_i) | Weight (y_i) | x_i - \bar{x} | y_i - \bar{y} | Product | (x_i - \bar{x})^2 | (y_i - \bar{y})^2 |
|---|---|---|---|---|---|---|
| 160 | 50 | -10 | -10 | 100 | 100 | 100 |
| 165 | 55 | -5 | -5 | 25 | 25 | 25 |
| 170 | 60 | 0 | 0 | 0 | 0 | 0 |
| 175 | 65 | 5 | 5 | 25 | 25 | 25 |
| 180 | 70 | 10 | 10 | 100 | 100 | 100 |
| Sums | | 0 | 0 | 250 | 250 | 250 |

The sums of the product and squared-deviation columns are all 250, so r = \frac{250}{\sqrt{250 \times 250}} = \frac{250}{250} = 1, confirming the perfect positive linear relationship between height and weight in this dataset.
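As a check on the arithmetic above, the following sketch (assuming NumPy) reproduces the deviation sums and the final coefficient for this dataset.

```python
# Minimal sketch: verify the worked height/weight example numerically.
import numpy as np

heights = np.array([160, 165, 170, 175, 180], dtype=float)   # cm
weights = np.array([50, 55, 60, 65, 70], dtype=float)         # kg

dx = heights - heights.mean()    # deviations from mean height (mean = 170)
dy = weights - weights.mean()    # deviations from mean weight (mean = 60)

print(np.sum(dx * dy), np.sum(dx**2), np.sum(dy**2))   # 250.0 250.0 250.0
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
print(r)                                               # 1.0, a perfect positive linear relationship
```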