Correlation ratio
The correlation ratio, denoted as η (eta), is a statistical coefficient that measures the strength of the association, whether linear or nonlinear, between a categorical independent variable and a continuous dependent variable, representing the proportion of variance in the dependent variable explained by membership in the categories of the independent variable.[1] Introduced by Karl Pearson in a 1903 paper presented to the Royal Society, it was further developed in his 1905 memoir on skew correlation and nonlinear regression, enabling the quantification of relationships beyond linear assumptions.[2][3] The measure ranges from 0, indicating no association, to 1, indicating perfect prediction of the dependent variable from the categorical predictor.[4] Unlike the Pearson product-moment correlation coefficient, which assumes linearity and symmetry, the correlation ratio is asymmetric—its value depends on which variable is treated as dependent—and it excels at capturing curvilinear dependencies, making it valuable in analysis of variance (ANOVA) contexts, where eta squared (η²) serves as an effect size metric.[4]

The formula for η is the square root of the ratio of the between-group sum of squares to the total sum of squares: \eta = \sqrt{\frac{\sum_i N_i (\bar{y}_i - \bar{y})^2}{\sum_i \sum_\alpha (y_{i\alpha} - \bar{y})^2}}, where N_i is the sample size in category i, \bar{y}_i is the mean of the dependent variable in category i, and \bar{y} is the overall mean.[1] This formulation aligns with ANOVA principles, as Pearson originally integrated it into early developments of variance analysis.[3]

In practice, the correlation ratio requires an interval- or ratio-level dependent variable and a nominal- or ordinal-level independent variable with sufficient observations per category to ensure reliability; it assumes no specific causal direction, and it lacks a sign to indicate positive or negative association.[4] It has been widely applied in fields such as psychology, biology, and the social sciences for assessing nonlinear effects in experimental and observational data, often as a complement to parametric tests, and unbiased estimators such as epsilon squared have been proposed to correct for sampling bias in small samples.[5]

Overview and Definition
Introduction
The correlation ratio, denoted by η, is a statistical measure that quantifies the strength of the association between a discrete categorical independent variable and a continuous dependent variable.[6] It serves as a coefficient of nonlinear correlation, enabling the detection of dependencies that may not follow a straight-line pattern.[7] Originating from the need to evaluate curvilinear or nonlinear relationships, the correlation ratio provides a more versatile tool than linear-only measures like the Pearson correlation coefficient. In analysis of variance (ANOVA) contexts, it assesses the extent to which variance in the continuous outcome is explained by membership in the categorical groups.[8] The notation η represents the correlation ratio itself, while its squared form, η², denotes the proportion of variance in the dependent variable accounted for by the categorical predictor.[6]

Formal Definition
The correlation ratio, denoted \eta, quantifies the degree of association between a categorical predictor variable X with k categories and a continuous dependent variable Y. It is defined as the square root of \eta^2, where \eta^2 (eta squared) represents the proportion of the total variance in Y explained by the categorical differences in X, and \eta is taken to be non-negative.[9] The primary formula for \eta^2 is \eta^2 = \frac{\sum_{x=1}^k n_x (\bar{y}_x - \bar{y})^2}{\sum_{x=1}^k \sum_{i=1}^{n_x} (y_{x i} - \bar{y})^2}, where n_x is the number of observations in category x, \bar{y}_x is the mean of Y for category x, \bar{y} is the overall mean of Y, and y_{x i} is the i-th observation of Y in category x.[9] An equivalent expression for \eta^2 is the ratio of the weighted variance of the category means of Y to the total variance of Y: \eta^2 = \frac{\sigma^2(\bar{y})}{\sigma^2(y)}, where \sigma^2(\bar{y}) is the variance among the category means \bar{y}_x (weighted by group sizes) and \sigma^2(y) is the total variance of the observations y_{x i}.[10]
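This definition translates directly into code. The following minimal sketch computes \eta^2 and \eta from the two sums in the formula above; the helper name correlation_ratio and the use of NumPy are illustrative choices, not a standard library API.

```python
import numpy as np

def correlation_ratio(categories, values):
    """Eta squared and eta for a categorical predictor and a continuous outcome."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    # Numerator: between-category sum of squares, n_x * (category mean - grand mean)^2
    ss_between = sum(
        values[categories == c].size
        * (values[categories == c].mean() - grand_mean) ** 2
        for c in np.unique(categories)
    )
    # Denominator: total sum of squares around the grand mean
    ss_total = ((values - grand_mean) ** 2).sum()
    eta_squared = ss_between / ss_total
    return eta_squared, np.sqrt(eta_squared)
```

Mathematical Properties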
Range and Interpretation
The correlation ratio, denoted as \eta, and its square \eta^2 both range from 0 to 1, inclusive.[11] A value of \eta = 0 signifies no association between the categorical predictor and the continuous dependent variable, occurring when the means of all categories are equal to the overall mean of the dependent variable.[12] Conversely, \eta = 1 indicates perfect prediction, where there is no variance within any category (i.e., all observations within each category are identical).[12] The squared correlation ratio \eta^2 is interpreted as the proportion of the total variance in the dependent variable that is explained by the categorical predictor, with higher values reflecting a stronger association.[12] The correlation ratio \eta itself is undefined when the total variance of the dependent variable is zero, as this would involve division by zero in its computation.[13] In scenarios involving nonlinear relationships, \eta can exceed the absolute value of Pearson's linear correlation coefficient |r|, highlighting the former's ability to capture curvilinear associations that the latter misses.[4] Common interpretive guidelines for the strength of \eta^2 classify values of approximately 0.01 as small, 0.06 as medium, and 0.14 as large effects (Cohen, 1988).[14] These thresholds emphasize conceptual magnitude rather than strict cutoffs, as the practical significance depends on context.[14]
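The two endpoints can be checked numerically with tiny constructed datasets, reusing the hypothetical correlation_ratio helper sketched in the formal definition above:

```python
# Equal category means: eta = 0, no association.
cats = ["a", "a", "b", "b"]
_, eta = correlation_ratio(cats, [1.0, 3.0, 1.0, 3.0])  # both group means equal 2
print(eta)  # 0.0

# Zero variance within each category: eta = 1, perfect prediction.
_, eta = correlation_ratio(cats, [2.0, 2.0, 5.0, 5.0])
print(eta)  # 1.0
```

Relation to Variance Components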
The correlation ratio, denoted as \eta, quantifies the strength of association between a categorical predictor variable X and a continuous outcome variable Y through its square \eta^2, which represents the proportion of the total variance in Y attributable to differences across categories of X. This measure originates from Karl Pearson's foundational work on non-linear regression and skew correlations, where it was introduced as a way to capture the variability explained by grouped data without assuming linearity. In essence, \eta^2 emerges directly from the partitioning of the total sum of squares (SS) in a dataset into a component explained by the categorical factor and unexplained residuals, providing a mechanistic link to variance analysis.[15]

The variance decomposition underlying \eta^2 follows the fundamental identity in one-way analysis of variance (ANOVA): the total sum of squares equals the between-group sum of squares plus the within-group sum of squares, SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}. Here, SS_{\text{between}} = \sum_x n_x (\bar{y}_x - \bar{y})^2 captures the variance due to differences in the means \bar{y}_x of Y across categories x of X, weighted by the group sizes n_x, while SS_{\text{within}} = \sum_x \sum_{i \in x} (y_i - \bar{y}_x)^2 reflects the residual variance within each category around its respective mean. Thus \eta^2 = SS_{\text{between}} / SS_{\text{total}} gives the fraction of total variability in Y explained by membership in the categories of X. This decomposition highlights how \eta^2 isolates the contribution of the categorical variable to the overall spread in Y, independent of within-group fluctuations.[16][15]

In the context of one-way ANOVA, \eta^2 serves as a key effect size measure for evaluating the magnitude of group differences on the continuous variable, analogous to the coefficient of determination R^2 in linear regression: it quantifies the practical significance of the categorical factor beyond mere statistical testing. This connection positions \eta^2 as an essential tool for interpreting ANOVA results, emphasizing the proportion of variance systematically accounted for by the predictor rather than random error.[15]

Notably, \eta^2 possesses properties relevant to variance partitioning: it is invariant to linear transformations of the scale of Y, so rescaling the outcome does not alter the measure, and the sample value cannot decrease when the categories of X are subdivided more finely, since a finer grouping can only reduce the within-group sum of squares. The first property makes \eta^2 comparable across datasets with different measurement units; the second means that values computed with different numbers of categories are not directly comparable.[16]
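A short numerical check of this identity on a small synthetic dataset may help; the variable names and data below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
groups = np.repeat(["a", "b", "c"], [5, 4, 6])  # three categories of X
y = rng.normal(size=groups.size)                # continuous outcome Y

grand_mean = y.mean()
ss_total = ((y - grand_mean) ** 2).sum()
ss_between = sum(
    y[groups == g].size * (y[groups == g].mean() - grand_mean) ** 2
    for g in np.unique(groups)
)
ss_within = sum(
    ((y[groups == g] - y[groups == g].mean()) ** 2).sum()
    for g in np.unique(groups)
)

# One-way ANOVA identity: SS_total = SS_between + SS_within
assert np.isclose(ss_total, ss_between + ss_within)
eta_squared = ss_between / ss_total  # fraction of variance explained by the grouping
```

Relationships to Other Measures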
Comparison with Pearson Correlation
The correlation ratio, denoted as η, quantifies the strength of any functional relationship—linear or nonlinear—between a categorical predictor variable and a continuous outcome variable, whereas Pearson's correlation coefficient, r, specifically measures the degree of linear association between two continuous variables.[4][17] This distinction in applicability arises because η is derived from the analysis of variance (ANOVA) framework, partitioning the variance explained by categorical groups, while r relies on covariance standardized by product-moment calculations, assuming interval-level data for both variables.[4][17] When the underlying relationship is strictly linear and the predictor is binary, the correlation ratio equals the absolute value of Pearson's r, such that η = |r|; for polytomous categories, η generally surpasses |r| even in linear scenarios due to its sensitivity to group dispersions, and in the presence of nonlinearity or curvilinearity, η exceeds |r|, providing a more comprehensive indicator of association strength.[4][17] A primary advantage of η over r is its ability to capture curvilinear relationships without assuming linearity, making it suitable for scenarios where predictors are nominal or ordinal categories, such as treatment groups or demographic classifications affecting a continuous response.[4][17] It also integrates naturally with variance decomposition in ANOVA, offering interpretable effect sizes like η² as the proportion of variance accounted for by the predictor.[4] Relative to r, η has limitations, including its inherently positive and asymmetric nature—η values depend on which variable is treated as categorical—preventing assessment of relationship directionality akin to r's sign.[4][17] Additionally, η requires explicit categorical grouping of the predictor, which may not apply directly to purely continuous pairs where r remains the standard, and it is less routinely implemented in statistical software outside ANOVA contexts.[4]
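The contrast is easy to demonstrate on a symmetric, U-shaped relationship, where the linear correlation vanishes while η stays large. The data below are made up for illustration:

```python
import numpy as np

# An ordered three-level factor with a U-shaped response:
# the middle level has a much lower mean than the two extremes.
x = np.repeat([0, 1, 2], 4)  # numeric codes for the three categories
y = np.array([5.0, 6, 5, 6,  1, 2, 1, 2,  5, 6, 5, 6])

r = np.corrcoef(x, y)[0, 1]  # Pearson's r on the numeric codes: ~0 for this symmetric pattern
grand = y.mean()
ss_b = sum(y[x == c].size * (y[x == c].mean() - grand) ** 2 for c in np.unique(x))
ss_t = ((y - grand) ** 2).sum()
eta = np.sqrt(ss_b / ss_t)

print(abs(r), eta)  # |r| ~ 0, eta ~ 0.97: only eta registers the curvilinear pattern
```

Historical Context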
The correlation ratio, denoted η, was introduced by Karl Pearson in the early 1900s as a generalization of the correlation coefficient to accommodate nonlinear relationships and categorical predictors. In his 1905 memoir, Pearson formalized η to quantify the extent to which variation in a continuous variable is explained by discrete groupings of another variable, addressing limitations of linear measures in biological and evolutionary data analysis.[3]

During the 1920s and 1930s, Ronald A. Fisher offered a pointed critique of the correlation ratio's practical utility, highlighting its dependence on the arbitrary number of categories, which affects its sampling distribution and interpretability. Fisher advocated for analysis of variance F-tests as superior for inferential purposes, dismissing η as redundant since it essentially restates variance components already captured by ANOVA without adding unique inferential power.[18] Egon Pearson, Karl Pearson's son, countered this in a 1926 review of Fisher's Statistical Methods for Research Workers, defending η as a valuable descriptive tool for gauging association strength independently of hypothesis testing. He argued that the measure warranted clearer exposition in educational contexts to help students appreciate its scope beyond mere redundancy.[19]

Subsequently, η and its square η² evolved into a standard effect size metric in analysis of variance, endorsed for reporting practical significance in experimental designs. While its prominence has waned in favor of linear alternatives for straightforward associations, η persists in contexts requiring assessment of nonlinear or categorical effects.[20]

Practical Usage
Numerical Example
To illustrate the computation of the correlation ratio, consider a hypothetical dataset of test scores from 15 students across three subjects: Algebra (5 scores: 45, 70, 29, 15, 21), Geometry (4 scores: 40, 20, 30, 42), and Statistics (6 scores: 65, 95, 80, 70, 85, 73).[21] The first step is to calculate the mean score for each category. For Algebra, the mean is (45 + 70 + 29 + 15 + 21) / 5 = 36. For Geometry, the mean is (40 + 20 + 30 + 42) / 4 = 33. For Statistics, the mean is (65 + 95 + 80 + 70 + 85 + 73) / 6 = 78. The overall mean across all scores is the total sum (780) divided by the total number of observations (15), yielding 52.[21] Next, compute the between-category sum of squares (SS_b), which measures the variation due to differences between category means: \text{SS}_b = \sum n_k (\bar{y}_k - \bar{y})^2, where n_k is the sample size in category k, \bar{y}_k is the category mean, and \bar{y} is the overall mean. Substituting the values: 5(36 - 52)^2 + 4(33 - 52)^2 + 6(78 - 52)^2 = 5(256) + 4(361) + 6(676) = 1280 + 1444 + 4056 = 6780.
The total sum of squares (SS_t) is then found by summing the squared deviations of all individual scores from the overall mean, resulting in 9640. The squared correlation ratio is \eta^2 = \text{SS}_b / \text{SS}_t = 6780 / 9640 \approx 0.7033, so \eta \approx \sqrt{0.7033} \approx 0.8386.[21] In this context, the value of \eta^2 \approx 0.70 indicates that approximately 70% of the total variance in test scores is explained by differences between the subject categories, with the remaining 30% attributable to within-category variation.[21] This manual computation can also be performed with statistical software: in R, the eta_squared function from the effectsize package computes \eta^2 directly from an ANOVA model object,[22] and in Python, the anova function from the pingouin library returns eta-squared as part of its output for categorical predictors.[23] However, understanding the underlying steps as shown here is essential for verifying results and grasping the measure's basis in variance decomposition.
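For readers who want to reproduce the arithmetic, the following short sketch carries out the same computation on the hypothetical scores above:

```python
import numpy as np

scores = {
    "Algebra":    [45, 70, 29, 15, 21],
    "Geometry":   [40, 20, 30, 42],
    "Statistics": [65, 95, 80, 70, 85, 73],
}

all_scores = np.concatenate([np.asarray(v, dtype=float) for v in scores.values()])
grand_mean = all_scores.mean()  # 780 / 15 = 52.0

# Between-category sum of squares: n_k * (category mean - grand mean)^2
ss_b = sum(len(v) * (np.mean(v) - grand_mean) ** 2 for v in scores.values())  # 6780.0

# Total sum of squares: squared deviations of every score from the grand mean
ss_t = ((all_scores - grand_mean) ** 2).sum()  # 9640.0

eta_squared = ss_b / ss_t   # ~0.7033
eta = np.sqrt(eta_squared)  # ~0.8386
print(eta_squared, eta)
```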