Covariance and correlation
Covariance and correlation are statistical measures used to describe the joint variability and linear relationship between two random variables.[1] Covariance quantifies the extent to which the variables deviate from their means in the same direction, with positive values indicating that they tend to increase or decrease together, negative values showing an inverse relationship, and zero suggesting no linear association.[2] Correlation, often referring to the Pearson correlation coefficient, standardizes covariance by dividing it by the product of the variables' standard deviations, yielding a dimensionless value between -1 and +1 that assesses both the strength and direction of the linear relationship.[3]

The concept of correlation originated in the late 19th century through the work of Francis Galton, who introduced the term in the context of biological inheritance and regression toward the mean, and was formalized by Karl Pearson in his 1896 paper on mathematical contributions to evolution.[4][5] Covariance, as a more general measure from probability theory, gained prominence alongside correlation in multivariate analysis; its population formula is \operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])], equivalent to E[XY] - E[X]E[Y], while the sample estimator uses division by n-1 for unbiasedness.[1][2] Key properties include symmetry (\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)), bilinearity, and the fact that \operatorname{Cov}(X, X) = \operatorname{Var}(X); the correlation \rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} equals \pm 1 for perfect linear relationships and 0 for uncorrelated variables, though zero correlation does not imply independence.[2][3]

These measures are foundational in fields like finance, where they inform portfolio diversification through covariance matrices, and in data science for identifying patterns in datasets.[1] Unlike covariance, which depends on the units of measurement and can range from -\infty to +\infty, correlation's bounded scale makes it more interpretable for comparing relationships across different scales.[3] Extensions include partial correlation for controlling confounding variables and rank-based alternatives like Spearman's rho for non-linear monotonic relationships.[2]

Fundamental Concepts
Definition of Covariance
Covariance is a statistical measure that quantifies the extent to which two random variables, X and Y, vary together, capturing the direction and degree of their linear relationship.[6] For a pair of random variables defined over a probability space, the population covariance, denoted \operatorname{Cov}(X, Y), is formally defined as the expected value of the product of their deviations from their respective means:

\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]

Expanding the product and using the linearity of the expectation operator yields E[XY - X E[Y] - Y E[X] + E[X] E[Y]] = E[XY] - E[X] E[Y], providing the equivalent formulation \operatorname{Cov}(X, Y) = E[XY] - E[X] E[Y].[6][7] The population covariance represents a theoretical parameter for the entire distribution of the variables, whereas the sample covariance serves as an empirical estimate derived from observed data points, with details on its computation addressed separately.[8]

The sign of the covariance indicates the nature of the linear co-movement: a positive value signifies that X and Y tend to increase or decrease in tandem, a negative value implies they move in opposite directions, and a value of zero suggests no linear association, though the variables may still be dependent in nonlinear ways.[9][10] Covariance carries units equal to the product of the units of X and Y, rendering it scale-dependent; for instance, measuring one variable in different units alters the covariance's magnitude without changing the underlying relationship.[11]
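To make the equivalence of the two formulations concrete, the following minimal Python sketch (standard library only; the joint distribution is made up purely for illustration) evaluates both forms of the population covariance for a small discrete joint distribution and checks that they agree.

```python
# Illustrative check that E[(X - E[X])(Y - E[Y])] equals E[XY] - E[X]E[Y]
# for a small, made-up joint distribution of two discrete random variables.

# Joint probability mass function p(x, y) -- values chosen only for illustration.
joint_pmf = {
    (0, 0): 0.2,
    (0, 1): 0.1,
    (1, 0): 0.1,
    (1, 1): 0.3,
    (2, 1): 0.3,
}

E_X  = sum(p * x     for (x, y), p in joint_pmf.items())
E_Y  = sum(p * y     for (x, y), p in joint_pmf.items())
E_XY = sum(p * x * y for (x, y), p in joint_pmf.items())

# Definition: expected product of deviations from the means.
cov_deviation_form = sum(p * (x - E_X) * (y - E_Y) for (x, y), p in joint_pmf.items())

# Equivalent form obtained by expanding the product and using linearity of expectation.
cov_shortcut_form = E_XY - E_X * E_Y

print(cov_deviation_form, cov_shortcut_form)
assert abs(cov_deviation_form - cov_shortcut_form) < 1e-12
```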
Definition of Correlation

The Pearson product-moment correlation coefficient, denoted \rho_{X,Y}, measures the strength and direction of the linear association between two random variables X and Y. It is defined as the covariance between X and Y divided by the product of their standard deviations:

\rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y},

where \operatorname{Cov}(X,Y) is the covariance, and \sigma_X and \sigma_Y are the standard deviations of X and Y, respectively. This standardization normalizes the covariance, which serves as the numerator, to produce a bounded measure.

The coefficient \rho_{X,Y} ranges from -1 to +1. A value of +1 indicates a perfect positive linear relationship, where increases in one variable correspond exactly to proportional increases in the other; -1 signifies a perfect negative linear relationship, with increases in one corresponding to proportional decreases in the other; and 0 implies no linear association between the variables.[12] These interpretations hold specifically for linear dependencies, as the coefficient does not capture nonlinear relationships.[13] Valid application of \rho_{X,Y} requires that the relationship between the variables is linear and that both variables have finite, nonzero variances, ensuring the standard deviations are well-defined.[13]

In contrast to covariance, which is scale-dependent and retains the units of the variables' product, the correlation coefficient is dimensionless and invariant to changes in scale or location of the variables.[12] The term "correlation" was coined by Francis Galton in 1888 to describe interdependent relations.[14]
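As a quick illustration of the normalization, the following Python sketch (standard library only; the joint distribution is made up for illustration) computes \rho_{X,Y} as the covariance divided by the product of the standard deviations and confirms that the result is a dimensionless value in [-1, 1].

```python
import math

# Minimal sketch: Pearson correlation as covariance divided by the product of
# standard deviations, for a small illustrative joint distribution (made-up values).
joint_pmf = {(-1, -2): 0.25, (0, 0): 0.25, (1, 2): 0.25, (2, 1): 0.25}

E_X = sum(p * x for (x, y), p in joint_pmf.items())
E_Y = sum(p * y for (x, y), p in joint_pmf.items())

cov   = sum(p * (x - E_X) * (y - E_Y) for (x, y), p in joint_pmf.items())
var_x = sum(p * (x - E_X) ** 2 for (x, y), p in joint_pmf.items())
var_y = sum(p * (y - E_Y) ** 2 for (x, y), p in joint_pmf.items())

rho = cov / (math.sqrt(var_x) * math.sqrt(var_y))
print(rho)            # dimensionless, always in [-1, 1]
assert -1.0 <= rho <= 1.0
```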
Mathematical Properties

Properties of Covariance
Covariance possesses several key algebraic properties that arise from the linearity of the expectation operator, making it a useful tool for deriving expressions involving sums and linear combinations of random variables. Specifically, covariance is bilinear: for scalar constants a and c, and random variables X, Y, Z, \operatorname{Cov}(aX + b, Y) = a \operatorname{Cov}(X, Y), where b is any constant (since adding a constant to the first argument does not affect the centered product in the covariance definition), and \operatorname{Cov}(X, Y + cZ) = \operatorname{Cov}(X, Y) + c \operatorname{Cov}(X, Z). These follow directly from the linearity of expectation: \mathbb{E}[(aX + b - \mathbb{E}[aX + b])(Y - \mathbb{E}[Y])] = a \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] for the first, and similarly for the second by expanding the expectation of the product.[15]

For a random vector \mathbf{X} = (X_1, \dots, X_n)^\top, the covariance matrix \Sigma has entries \Sigma_{ij} = \operatorname{Cov}(X_i, X_j). This matrix is symmetric because \operatorname{Cov}(X_i, X_j) = \operatorname{Cov}(X_j, X_i), and the diagonal entries are the variances \operatorname{Var}(X_i). Moreover, \Sigma is positive semi-definite: for any vector \mathbf{a} \in \mathbb{R}^n, \mathbf{a}^\top \Sigma \mathbf{a} = \operatorname{Var}(\mathbf{a}^\top \mathbf{X}) \geq 0, with equality if \mathbf{a}^\top \mathbf{X} is constant almost surely.

A direct consequence of bilinearity is the decomposition of the variance of a sum: for random variables X and Y, \operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2 \operatorname{Cov}(X, Y). This extends to the general case of multiple variables, facilitating the analysis of aggregate variability in linear combinations.[16]

The Cauchy-Schwarz inequality provides a bound on the magnitude of covariance: for random variables X and Y with finite variances, |\operatorname{Cov}(X, Y)| \leq \sqrt{\operatorname{Var}(X)} \sqrt{\operatorname{Var}(Y)}, with equality if and only if X and Y are linearly dependent almost surely (i.e., one is an affine function of the other). This follows from applying the standard Cauchy-Schwarz inequality to the expectation inner product: \left( \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] \right)^2 \leq \mathbb{E}[(X - \mathbb{E}[X])^2] \, \mathbb{E}[(Y - \mathbb{E}[Y])^2].[17]

If \operatorname{Cov}(X, Y) = 0, then X and Y are said to be uncorrelated, meaning their deviations from their means do not systematically co-vary. However, uncorrelated random variables are not necessarily independent. A classic counterexample is X \sim \operatorname{Uniform}[-1, 1] and Y = X^2: here, \mathbb{E}[X] = 0, \mathbb{E}[Y] = \int_{-1}^1 x^2 \cdot \frac{1}{2} \, dx = \frac{1}{3}, and \mathbb{E}[XY] = \mathbb{E}[X^3] = 0 by odd symmetry, so \operatorname{Cov}(X, Y) = 0. Yet X and Y are dependent, as the distribution of Y given X = x is degenerate at x^2, not matching the marginal distribution of Y.[18]
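The counterexample above can also be checked numerically. The following Python sketch (standard library only; Monte Carlo with an arbitrary seed, so the result is approximate) draws X from Uniform[-1, 1], sets Y = X^2, and shows that the sample covariance is near zero even though Y is a deterministic function of X.

```python
import random

# Sketch of the classic counterexample: X ~ Uniform[-1, 1] and Y = X^2 are
# uncorrelated (covariance near zero) yet clearly dependent.
random.seed(0)
n = 200_000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

print(cov_xy)   # close to 0: the variables are (approximately) uncorrelated
# Dependence is nonetheless obvious: knowing X = x determines Y exactly (Y = x^2),
# so zero covariance does not imply independence.
```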
Properties of Correlation

The Pearson correlation coefficient, denoted as \rho_{X,Y}, exhibits scale invariance under affine transformations of the variables. Specifically, for constants a \neq 0 and c \neq 0, the correlation satisfies \rho_{aX + b, cY + d} = \operatorname{sign}(ac) \, \rho_{X,Y}, meaning it remains unchanged in magnitude but may flip sign depending on the signs of the scalings.[19] This property arises from the normalization by standard deviations in its definition, distinguishing it from the scale-sensitive covariance.[20]

Another key property involves the product of correlations in multivariate settings. When the partial correlation between X and Y given Z is zero, indicating conditional independence in a linear sense, the correlation \rho_{X,Y} equals the product \rho_{X,Z} \rho_{Z,Y}. This follows directly from the definition of partial correlation and reflects how linear dependencies propagate through an intermediary Z, such as in a simple chain model without direct links; for jointly normal variables, a zero partial correlation further implies conditional independence.[21]

The correlation coefficient is bounded by |\rho_{X,Y}| \leq 1, a consequence of the Cauchy-Schwarz inequality applied to the covariance: |\operatorname{Cov}(X,Y)| \leq \sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}.[22] Equality occurs if and only if Y = aX + b almost surely for some constants a and b, corresponding to perfect linear dependence.[22] Covariance thus forms the unnormalized foundation for this bounded measure.

A zero correlation \rho_{X,Y} = 0 means that X and Y are uncorrelated, and for any pair of random variables with finite variances, independence entails uncorrelatedness. However, the converse does not generally hold; uncorrelated variables can still exhibit dependence, as in mixtures of bivariate normals with nonlinear relationships. In the special case of jointly bivariate normal distributions, uncorrelatedness does imply full independence.[23]

Regarding inference, the sampling distribution of the sample correlation r under the null hypothesis \rho = 0 is approximately normal for large sample sizes n, with mean 0 and variance 1/(n-1); equivalently, \sqrt{n-1}\, r is approximately distributed as \mathcal{N}(0, 1). This asymptotic normality facilitates hypothesis testing for the absence of linear association.
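The affine-invariance property can be illustrated empirically. The Python sketch below (standard library only; the data and the transformation constants are made up for illustration) computes the sample correlation before and after affine transformations with ac < 0 and checks that the magnitude is preserved while the sign flips.

```python
import math
import random

def sample_corr(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up correlated data: y is a noisy linear function of x.
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]

r = sample_corr(xs, ys)

# Affine transformations: a = 3, b = -5 on X; c = -0.5, d = 7 on Y (so sign(ac) = -1).
xs_t = [3.0 * x - 5.0 for x in xs]
ys_t = [-0.5 * y + 7.0 for y in ys]
r_t = sample_corr(xs_t, ys_t)

print(r, r_t)                # same magnitude, opposite sign
assert abs(r_t + r) < 1e-9   # sign flips because a * c < 0
```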
Estimation from Data

Sample Covariance
The sample covariance provides an estimate of the covariance between two variables based on a finite set of paired observations from a population. For a sample of size n consisting of paired values (x_1, y_1), \dots, (x_n, y_n), the sample covariance s_{XY} is computed as

s_{XY} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}),

where \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i and \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i are the sample means of the x_i and y_i, respectively.[1] This formula measures the average product of deviations from the respective sample means, scaled by n-1 to account for the estimation process.[24]

A related but biased estimator uses division by n instead of n-1, corresponding to the maximum likelihood estimator under the assumption of independent and identically distributed jointly normal observations. The version with n-1 in the denominator, however, yields an unbiased estimator of the population covariance, meaning its expected value equals the true population covariance \sigma_{XY} for any distribution with finite second moments. This unbiasedness holds generally for independent samples, though the normality assumption simplifies proofs of related properties, such as the Wishart distribution of the sample covariance matrix in the multivariate case.[24] The adjustment to n-1, known as Bessel's correction, addresses the degrees of freedom lost when estimating the population means with sample means. Since the deviations (x_i - \bar{x}) and (y_i - \bar{y}) are calculated relative to values derived from the same data, the sum of products of deviations, when divided by n, is biased toward zero relative to the true population covariance; dividing by n-1 rather than n removes this bias.[24]

In the multivariate setting, the sample covariance extends to a symmetric positive semi-definite matrix S of order p \times p for p variables, where the diagonal elements are sample variances and the off-diagonal elements are sample covariances between pairs of variables. The (j,k)-th entry of S is s_{jk} = \frac{1}{n-1} \sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k), with \bar{x}_j denoting the sample mean of the j-th variable; this matrix serves as an unbiased estimator of the population covariance matrix \Sigma.[1]

For illustration, consider a sample of n=5 paired observations on heights (in inches) and weights (in pounds): (60, 120), (62, 125), (64, 130), (66, 135), (68, 140). The sample mean height is \bar{x} = 64 and the sample mean weight is \bar{y} = 130. The deviations for height are -4, -2, 0, 2, 4 and for weight are -10, -5, 0, 5, 10, yielding products of 40, 10, 0, 10, 40 with sum 100. Thus, the sample covariance is s_{XY} = 100 / 4 = 25, indicating a positive linear association on the scale of the variables' units.[1]
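The hand calculation above can be reproduced with a short Python sketch (standard library only), using the same five height-weight pairs and the n-1 (Bessel-corrected) denominator.

```python
# Reproduces the worked example above: five (height, weight) pairs,
# sample covariance with the n - 1 (Bessel-corrected) denominator.
pairs = [(60, 120), (62, 125), (64, 130), (66, 135), (68, 140)]
n = len(pairs)

mean_x = sum(x for x, _ in pairs) / n   # 64
mean_y = sum(y for _, y in pairs) / n   # 130

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)
print(s_xy)   # 25.0, matching the hand calculation
```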
Sample Correlation Coefficient

The sample correlation coefficient, denoted r, serves as the point estimator for the population correlation coefficient \rho. It is computed by normalizing the sample covariance with the product of the sample standard deviations:

r = \frac{s_{XY}}{s_X s_Y},
where s_{XY} is the sample covariance, and s_X and s_Y are the sample standard deviations of the variables X and Y, respectively.[25] This yields a dimensionless measure bounded between -1 and 1, with values near 1 or -1 indicating strong positive or negative linear relationships, respectively. An equivalent form, written directly in terms of the deviations from the sample means rather than the separately computed covariance and standard deviations, is
r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}},
where \bar{x} and \bar{y} are the sample means.[25] The sample correlation coefficient is slightly biased as an estimator of \rho, tending to underestimate the absolute value of \rho (i.e., it is biased toward zero) in finite samples from normal populations, with the bias magnitude ranging from about 0.01 to 0.04 depending on the sample size n and on \rho.[26] To stabilize the variance of r for inference, particularly when |r| is close to 1, Fisher's z-transformation is applied:
z = \frac{1}{2} \ln \left( \frac{1 + r}{1 - r} \right),
which approximately follows a normal distribution with mean \frac{1}{2} \ln \left( \frac{1 + \rho}{1 - \rho} \right) and variance 1/(n-3).[27] As a consistent estimator, the sample correlation coefficient converges in probability to the population value \rho as the sample size n \to \infty, by the law of large numbers applied to the underlying sample moments.[28] For illustration, consider a dataset of heights (in cm) and pulmonary anatomical dead spaces (in ml) for 15 children, shown in the table below; a short computational sketch follows the table:
| Height, x (cm) | Dead space, y (ml) |
|---|---|
| 110 | 44 |
| 116 | 31 |
| 120 | 50 |
| 124 | 54 |
| 128 | 56 |
| 132 | 60 |
| 136 | 62 |
| 140 | 66 |
| 144 | 70 |
| 148 | 74 |
| 152 | 78 |
| 156 | 82 |
| 160 | 86 |
| 164 | 90 |
| 170 | 94 |
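The following Python sketch (standard library only) computes the sample correlation coefficient r for the tabulated data, along with Fisher's z-transform and its approximate standard error 1/\sqrt{n-3}.

```python
import math

# Sketch: sample correlation coefficient and Fisher z-transformation for the
# height / dead-space data tabulated above (15 children).
heights    = [110, 116, 120, 124, 128, 132, 136, 140, 144, 148, 152, 156, 160, 164, 170]
dead_space = [ 44,  31,  50,  54,  56,  60,  62,  66,  70,  74,  78,  82,  86,  90,  94]

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(dead_space) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, dead_space))
sxx = sum((x - mean_x) ** 2 for x in heights)
syy = sum((y - mean_y) ** 2 for y in dead_space)

r = sxy / math.sqrt(sxx * syy)

# Fisher's variance-stabilizing transformation; its approximate standard
# error under bivariate normality is 1 / sqrt(n - 3).
z = 0.5 * math.log((1 + r) / (1 - r))
se_z = 1 / math.sqrt(n - 3)

print(f"r = {r:.3f}, z = {z:.3f}, se(z) = {se_z:.3f}")
```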