Coefficient of determination

The coefficient of determination, often denoted R^2, is a statistical measure in regression analysis that quantifies the proportion of the total variance in the dependent variable that can be explained by the independent variable(s) in a model. It ranges from 0 to 1, where a value of 0 indicates that the model explains none of the variability, a value of 1 indicates a perfect fit, and intermediate values represent the proportion of variance accounted for by the model (e.g., an R^2 of 0.75 means 75% of the variance is explained). Introduced by Sewall Wright in his 1921 paper "Correlation and Causation," the concept emerged in the context of path analysis to assess relationships in complex systems, such as agricultural and biological data. In simple linear regression, R^2 is equivalent to the square of the Pearson correlation coefficient (r) between the observed and predicted values, providing a direct link to measures of linear association. For multiple linear regression models of the form Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i, where Y_i is the dependent variable, X_{ij} are predictors, \beta_j are coefficients, and \epsilon_i is the error term, R^2 is calculated as the ratio of the regression sum of squares (SSR) to the total sum of squares (SST): R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, with SSE denoting the sum of squared residuals (unexplained variance). This decomposition, SST = SSR + SSE, highlights how R^2 partitions total variability into explained and residual components, making it a key goodness-of-fit statistic.

While widely used across fields such as economics, the natural and social sciences, and machine learning to evaluate model performance, R^2 has limitations: it does not imply causation, it can increase when irrelevant predictors are added in multiple regression (a problem that led to the development of the adjusted R^2), and what counts as a "good" value depends on context; for instance, values above 0.8 may be excellent in the physical sciences but modest in behavioral studies. Despite these caveats, R^2 remains a foundational tool for interpreting the explanatory power of models in statistical analysis.

Definitions

Proportion of explained variance

The coefficient of determination, denoted R^2, quantifies the proportion of the total variance in the dependent variable that is explained by the independent variables in a regression model. It is formally defined as R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, where SS_{res} is the residual sum of squares, representing the unexplained variance, and SS_{tot} is the total sum of squares, capturing the overall variability in the data. This measure ranges from 0 to 1, where R^2 = 0 indicates that the model explains none of the variance (equivalent to using the mean of the response as the predictor), and R^2 = 1 signifies a perfect fit with no residual variance.

The total sum of squares, SS_{tot} = \sum (y_i - \bar{y})^2, measures the total variability of the observed values y_i around their mean \bar{y}, serving as a baseline for the dispersion in the dependent variable before any modeling. After fitting the regression model, the residual sum of squares, SS_{res} = \sum (y_i - \hat{y}_i)^2, quantifies the remaining unexplained variability between the observed values and the predicted values \hat{y}_i. Thus, R^2 directly reflects the fraction of SS_{tot} that the model accounts for, highlighting its effectiveness in capturing patterns in the data.

From the perspective of variance reduction in prediction, R^2 arises as the complement of the proportion of variance left unexplained by the model. In predictive terms, the variance of the prediction error is proportional to SS_{res}, while the fitted model reduces the expected error variance from the total level SS_{tot} by the amount attributable to the predictors. This decomposition underscores R^2 as a metric of how much the regression improves predictions over a naive mean-based approach, with higher values indicating a greater reduction in prediction error.

To illustrate, consider a simple dataset with four observations of an independent variable x (e.g., dosage levels: 1, 2, 3, 4) and dependent variable y (e.g., response rates: 2, 4, 5, 4). The mean of y is \bar{y} = 3.75, so SS_{tot} = (2-3.75)^2 + (4-3.75)^2 + (5-3.75)^2 + (4-3.75)^2 = 4.75. Fitting a least-squares line yields predicted values \hat{y} = 2.7, 3.4, 4.1, 4.8, with residuals leading to SS_{res} = (2-2.7)^2 + (4-3.4)^2 + (5-4.1)^2 + (4-4.8)^2 = 2.3. Thus, R^2 = 1 - \frac{2.3}{4.75} \approx 0.516, meaning approximately 51.6% of the variance in y is explained by x.
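The arithmetic above can be reproduced in a few lines of code. The following is a minimal sketch, assuming NumPy as the tooling (not something prescribed by the text), that fits the least-squares line and recovers SS_tot, SS_res, and R^2 for the toy data:

```python
import numpy as np

# Toy data from the worked example above
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 4.0])

# Fit a least-squares line y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)      # slope, intercept
y_hat = b0 + b1 * x                   # predicted values: 2.7, 3.4, 4.1, 4.8

# Decompose the variability
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares = 4.75
ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares = 2.3
r_squared = 1 - ss_res / ss_tot       # approximately 0.516

print(f"SS_tot = {ss_tot:.2f}, SS_res = {ss_res:.2f}, R^2 = {r_squared:.3f}")
```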

Relation to unexplained variance

The complement of the coefficient of determination, 1 - R^2, quantifies the proportion of the total variance in the dependent variable that remains unexplained by the model. This value, sometimes called the coefficient of non-determination, directly measures the model's failure to account for variability in the response variable. The unexplained variance is formally computed as the ratio of the residual sum of squares (SSres) to the total sum of squares (SStot): 1 - R^2 = \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} = \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}, where y_i are the observed values, \hat{y}_i are the predicted values from the model, and \bar{y} is the mean of the observed values. This component captures irreducible error (inherent stochastic variability in the data that no model can eliminate) as well as variance attributable to omitted variables, model misspecification, or unmodeled interactions. A high value of 1 - R^2 signals inadequate model performance, as it reflects a large portion of the data's variability left unaccounted for, potentially indicating the need for additional predictors or a different modeling approach. For instance, in an analysis relating observed rates to latitude, an R^2 of approximately 0.68 implies 1 - R^2 \approx 0.32, meaning 32% of the variance in the rates is unexplained by latitude alone, possibly reflecting omitted factors. In contrast, a dataset with a strong linear relationship might yield R^2 = 0.80, so 1 - R^2 = 0.20, where the residuals e_i = y_i - \hat{y}_i sum to squared values representing only 20% of total variability (e.g., SSres = 200 when SStot = 1000), highlighting better but still imperfect model adequacy.

Squared correlation coefficient

In simple linear regression, the coefficient of determination R^2 is equal to the square of the sample Pearson correlation coefficient r between the observed response values y and the predicted values \hat{y}. This relationship holds specifically for the bivariate case with one predictor variable. The mathematical equivalence arises because R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}}, where SSR is the regression sum of squares and SST is the total sum of squares, and this ratio simplifies to the squared correlation. To see this, note that the Pearson correlation is r = \frac{\mathrm{cov}(x, y)}{s_x s_y}, where s_x and s_y are the standard deviations of the predictor x and response y. In simple linear regression, the slope is \hat{\beta}_1 = r \frac{s_y}{s_x}, and substituting into the expression for SSR yields R^2 = \left( \frac{\mathrm{cov}(x, y)}{s_x s_y} \right)^2 = r^2. Equivalently, since the predicted values \hat{y} are a linear transformation of x, r also equals the correlation between y and \hat{y}, confirming R^2 = [\mathrm{cor}(y, \hat{y})]^2. This equivalence is valid under the assumptions of simple linear regression, particularly that the relationship between the predictor and response is linear, and the analysis is limited to two variables without additional predictors.

For example, consider data on college GPA (colgpa) and high school GPA (hsgpa) for n = 141 students. The Pearson correlation r between colgpa and hsgpa is 0.4146. Squaring this gives r^2 = 0.4146^2 = 0.1719. Fitting the regression model yields SSR = 3.335 and SST = 19.406, so R^2 = \frac{3.335}{19.406} = 0.1719, matching the squared correlation.
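The identity can also be verified numerically. The sketch below, using simulated data and NumPy (both assumptions for illustration), compares the squared Pearson correlation with the R^2 of the corresponding least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated bivariate data with a linear trend plus noise
x = rng.normal(size=200)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=200)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# R^2 from the fitted simple linear regression
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"r^2 = {r**2:.6f}, R^2 = {r_squared:.6f}")  # identical up to rounding
```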

Interpretation

In simple linear regression

In simple linear regression, the coefficient of determination, denoted R^2, represents the proportion of the total variance in the response variable Y that is explained by the predictor variable X. For instance, an R^2 value of 0.75 indicates that 75% of the variability in Y can be attributed to its linear relationship with X, while the remaining 25% is due to other factors or random error. This measure provides a straightforward way to assess how well the fitted line captures the underlying pattern in the data.

The magnitude of R^2 in this context reflects the degree to which the model's predictions align with the actual observed values along the fitted straight line. Higher values suggest that the data points cluster closely around the line, implying more reliable predictions for new observations within the range of X. Conversely, a low R^2 indicates greater scatter, meaning the linear fit offers limited insight into Y's behavior. In simple linear regression, R^2 is equivalent to the square of the Pearson correlation coefficient between X and Y, reinforcing its role as a measure of linear association strength.

To illustrate intuitively, consider a scatterplot of points representing height (X) and weight (Y) for a group of individuals, with a straight line fitted through them. The total deviation of points from the mean weight (a horizontal line) decomposes into explained deviations (vertical distances from the fitted line to the mean) and residual deviations (vertical distances from the points to the line). An R^2 of 0.80 here would mean 80% of the spread in weights is accounted for by the linear trend with height, visualized by the line passing near most points, while the residuals show the unexplained scatter.

The value of R^2 ranges from 0 to 1, where 0 signifies no linear relationship (the line explains none of the variance, as points are randomly scattered) and 1 indicates a perfect linear fit (all points lie exactly on the line). However, this range applies specifically to linear associations; a strong nonlinear relationship may yield a low R^2 despite a clear pattern, as the metric does not capture curvature or other non-straight forms.
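The caveat about nonlinear association is easy to demonstrate. In the sketch below (simulated data, NumPy assumed), a perfectly deterministic quadratic pattern yields an R^2 of essentially zero for a straight-line fit:

```python
import numpy as np

# A deterministic, strongly nonlinear relationship: y = x^2 on a symmetric grid
x = np.linspace(-3, 3, 101)
y = x ** 2

# Straight-line fit and its R^2
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"R^2 of the straight-line fit: {r_squared:.3f}")  # ~0.0 despite a perfect pattern
```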

In multiple linear regression

In multiple linear regression, the coefficient of determination, denoted R^2, quantifies the collective explanatory power of all predictor variables in accounting for the variability in the response variable. It represents the fraction of the total variance in the response that the model captures through the combined effects of multiple predictors, providing a measure of overall model fit. This value always ranges between 0 and 1, where a higher R^2 indicates that a larger proportion of the response variance is explained by the predictors together, though the interpretation emphasizes the model's performance relative to a baseline intercept-only model that explains none of the variance beyond the mean.

As additional predictors are incorporated into the model, R^2 will not decrease and typically increases, reflecting the added variables' contribution to reducing unexplained variance; however, this rise does not necessarily signify a substantial or meaningful enhancement in understanding, particularly if the new predictors overlap substantially with existing ones. For instance, in a stepwise regression approach where predictors are added sequentially based on statistical criteria, each step can show an incremental increase in R^2, with the marginal contribution of a new predictor interpreted as the change in R^2 attributable to its inclusion, highlighting how the model's explanatory power accumulates but requires caution against overinterpretation.

Multicollinearity, arising when predictors are moderately or highly correlated, can result in a high overall R^2 while complicating the attribution of explanatory effects to individual predictors, as it increases the variance of coefficient estimates and leads to less reliable assessments of their unique roles despite the strong combined fit. This extension from simple linear regression, where R^2 reflects the squared correlation between one predictor and the response, underscores the cumulative nature of explanation in multivariate settings.
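The non-decreasing behavior can be illustrated with a short simulation. The sketch below (NumPy assumed; the data and helper function are illustrative, not from any cited source) refits an ordinary least-squares model as predictors are appended, including one that is correlated with an existing predictor and one that is pure noise:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Response driven by two predictors; a third is pure noise
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)   # correlated with x1 (multicollinearity)
x3 = rng.normal(size=n)                          # irrelevant
y = 2.0 + 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

for cols, label in [((x1,), "x1"), ((x1, x2), "x1+x2"), ((x1, x2, x3), "x1+x2+x3")]:
    print(f"{label:10s} R^2 = {r_squared(np.column_stack(cols), y):.4f}")
# R^2 is non-decreasing as predictors are added, even for the noise variable x3.
```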

Limitations and inflation effects

One key limitation of the coefficient of determination, R^2, is that adding more predictor variables to a regression model, even variables that are irrelevant or purely noisy, will always increase (or at least not decrease) the value of R^2 when fitted to the sample data. This inflation occurs because the model gains flexibility to fit the specific quirks and noise in the training sample, rather than capturing true underlying patterns, which promotes overfitting and reduces the model's generalizability.

Several caveats further underscore the risks of over-relying on R^2. A high R^2 does not imply causation between predictors and the response variable; it only measures association, and spurious correlations can yield misleadingly strong fits. Similarly, R^2 can appear elevated in misspecified models, such as those omitting key variables or assuming incorrect functional forms, masking structural flaws in the specification. Moreover, R^2 is computed solely from in-sample data and provides no insight into out-of-sample error, potentially overestimating a model's predictive accuracy for new observations.

To illustrate the inflation effect, consider a simulated dataset with 50 observations and an initial simple model using one relevant predictor, yielding an R^2 of around 0.3; upon adding nine irrelevant variables (randomly generated), the R^2 inflates, and with enough noise predictors relative to the sample size it can approach 0.9 or higher, as the model increasingly interpolates the noise rather than the signal, though this apparent fit fails to hold on unseen data. To mitigate these issues, R^2 should be interpreted alongside other diagnostics, such as p-values to assess predictor significance and cross-validation techniques to evaluate out-of-sample performance and detect overfitting.
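The inflation mechanism itself is straightforward to reproduce. The following sketch (simulated data, NumPy assumed) adds growing blocks of pure-noise predictors to a weak one-predictor model with n = 50 and reports the in-sample R^2, which only climbs, most dramatically as the number of predictors approaches the sample size:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50

# One genuinely relevant predictor with modest signal (in-sample R^2 around 0.3)
x_true = rng.normal(size=n)
y = 0.7 * x_true + rng.normal(size=n)

def in_sample_r2(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

for k in [0, 9, 30, 45]:
    noise = rng.normal(size=(n, k))                      # irrelevant predictors
    X = np.column_stack([x_true, noise]) if k else x_true
    print(f"{1 + k:2d} predictors: in-sample R^2 = {in_sample_r2(X, y):.3f}")
# R^2 only climbs as noise predictors are added; it says nothing about out-of-sample fit.
```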

Extensions

Adjusted coefficient of determination

The adjusted coefficient of determination, denoted \bar{R}^2, modifies the ordinary coefficient of determination R^2 by incorporating a penalty for the number of predictors in the model, yielding a less biased estimate of the proportion of explained variance. Unlike R^2, which monotonically increases or stays the same when additional predictors are included regardless of their relevance, \bar{R}^2 decreases if the added predictors do not sufficiently improve the model fit, thereby discouraging overfitting.

The formula for the adjusted coefficient of determination is \bar{R}^2 = 1 - (1 - R^2) \frac{n-1}{n - k - 1}, where n is the sample size and k is the number of predictors (excluding the intercept). This adjustment arises from a derivation that accounts for degrees of freedom in variance estimation: the total sum of squares (TSS) is divided by its degrees of freedom n-1 to obtain an unbiased estimate of the total variance, while the residual sum of squares (RSS) is divided by n - k - 1 for an unbiased estimate of the error variance; \bar{R}^2 then equals 1 minus the ratio of the unbiased error variance to the unbiased total variance.

To illustrate, consider a regression with n = 30 observations where the unadjusted R^2 = 0.60. For a model with k = 1 predictor, \bar{R}^2 = 1 - (1 - 0.60) \frac{29}{28} \approx 0.586, indicating a slight downward adjustment. If the same R^2 = 0.60 holds for a model with k = 5 predictors, \bar{R}^2 = 1 - (1 - 0.60) \frac{29}{24} \approx 0.517, demonstrating how the penalty grows with model complexity even without improvement in fit.
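The formula is simple enough to compute directly; the small helper below (illustrative code, not from any particular package) reproduces the two worked values:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with n observations and k predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Worked values from the text: n = 30, unadjusted R^2 = 0.60
print(adjusted_r2(0.60, n=30, k=1))  # approximately 0.586
print(adjusted_r2(0.60, n=30, k=5))  # approximately 0.517
```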

Partial coefficient of determination

The partial coefficient of determination, often denoted R^2_{Y_j | \mathbf{X}_{-j}}, quantifies the marginal contribution of a specific predictor variable X_j to explaining the variance in the response variable Y in a multiple linear regression model, after controlling for the effects of all other predictors \mathbf{X}_{-j}. It is defined as the proportional reduction in the residual sum of squares (SSE) when X_j is added to the model containing the other predictors: R^2_{Y_j | \mathbf{X}_{-j}} = 1 - \frac{\text{SSE}(\mathbf{X}_{-j}, X_j)}{\text{SSE}(\mathbf{X}_{-j})} = \frac{\text{SSR}(X_j | \mathbf{X}_{-j})}{\text{SSE}(\mathbf{X}_{-j})}, where \text{SSR}(X_j | \mathbf{X}_{-j}) is the extra sum of squares due to X_j, \text{SSE}(\mathbf{X}_{-j}) is the error sum of squares for the reduced model excluding X_j, and \text{SSE}(\mathbf{X}_{-j}, X_j) is the error sum of squares for the full model. This measure isolates the unique explanatory power of X_j, ranging from 0 (no additional contribution) to 1 (complete explanation of the remaining variance).

In interpretation, the partial R^2 represents the proportion of the variance in Y that remains unexplained by the other predictors and is subsequently accounted for by adding X_j. Unlike the overall coefficient of determination, which assesses the full model's fit, the partial version highlights the incremental benefit of an individual predictor, making it valuable for identifying redundant variables or cases where predictors overlap in their explanations. For instance, a partial R^2 near 0 indicates that X_j adds little unique information beyond the variables already in the model.

The partial coefficient of determination can also be expressed in terms of correlations. It equals the squared partial correlation pr^2_{Y_j \cdot \mathbf{X}_{-j}} and relates to the squared semi-partial correlation sr^2_{Y_j (\mathbf{X}_{-j})} via R^2_{Y_j | \mathbf{X}_{-j}} = pr^2_{Y_j \cdot \mathbf{X}_{-j}} = \frac{sr^2_{Y_j (\mathbf{X}_{-j})}}{1 - R^2_{Y | \mathbf{X}_{-j}}}, where sr^2_{Y_j (\mathbf{X}_{-j})} = R^2_{Y | \mathbf{X}_{-j}, X_j} - R^2_{Y | \mathbf{X}_{-j}} is the semi-partial squared correlation, measuring the unique contribution to the total variance, while the denominator adjusts for the variance already explained by the reduced model. This formulation underscores how the partial R^2 normalizes the semi-partial contribution to the unexplained variance.

Consider an example from a multiple regression of body fat (Y) predicted by triceps skinfold thickness (X_1) and thigh circumference (X_2). The reduced model with only X_1 yields R^2_{Y | X_1} = 0.71 and SSE = 143.12, while the full model gives SSE = 109.95. The partial R^2 for X_2 given X_1 is then R^2_{Y_2 | X_1} = (143.12 - 109.95)/143.12 = 0.232, indicating that X_2 explains an additional 23.2% of the variance in body fat not accounted for by X_1 alone. This value is modest compared to the overall R^2 \approx 0.78 for the full model, illustrating how predictor overlap can diminish an individual variable's partial contribution despite a strong total fit.
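The worked example reduces to one line of arithmetic on the two error sums of squares; a brief snippet (illustrative only) confirms the value:

```python
# Partial R^2 of X2 given X1, from the error sums of squares quoted in the text
sse_reduced = 143.12   # SSE for the model with X1 only
sse_full = 109.95      # SSE for the model with X1 and X2

partial_r2 = (sse_reduced - sse_full) / sse_reduced
print(f"Partial R^2 of X2 given X1: {partial_r2:.3f}")  # approximately 0.232
```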

Generalizations and decompositions

In models with orthogonal predictors, the coefficient of determination decomposes additively into the sum of the individual R^2 values (or squared partial correlations) contributed by each predictor, reflecting their separate contributions to the explained variance. This follows from the geometry of the least-squares projection, where the projection of the centered response onto the fitted values is the sum of orthogonal projections onto each predictor's space, yielding R^2 = \sum_{j=1}^p R_j^2, with R_j^2 = \frac{\| P_j y \|^2}{\| y - \bar{y} \|^2} for the orthogonal projection P_j onto the j-th predictor.

When predictors are correlated (non-orthogonal cases), such additive decomposition no longer holds directly, but hierarchical partitioning addresses this by evaluating all possible subsets of predictors and allocating variance based on the average independent contribution of each across models, thus providing a measure of relative importance while accounting for correlation among predictors. Alternatively, the Shapley value method from cooperative game theory decomposes R^2 by computing the average marginal contribution of each predictor (or group) over all possible combinations, ensuring an equitable partition that sums to the total R^2 and handles shared variance. For instance, in a multiple regression with environmental and socioeconomic predictors, this approach might attribute 0.15 of an overall R^2 = 0.45 to climate variables and 0.20 to income factors, after averaging marginal gains across coalitions.

A broader geometric generalization interprets R^2 within the vector space of centered observations, where it equals the squared cosine of the angle \theta^* between the observed response y - \bar{y} and the fitted values \hat{y} - \bar{y}, i.e., R^2 = \cos^2(\theta^*) = \left( \frac{(\hat{y} - \bar{y})^\top (y - \bar{y})}{\| y - \bar{y} \| \cdot \| \hat{y} - \bar{y} \|} \right)^2, emphasizing the directional alignment between actual and predicted data.

Extensions to nonlinear models introduce pseudo-R^2 forms to approximate goodness of fit. McFadden's pseudo-R^2, commonly used for models such as logistic regression, is given by \rho^2 = 1 - \frac{\ln L(M)}{\ln L(M_0)}, where L(M) is the likelihood of the full model and L(M_0) that of the intercept-only null model; values near 0.2–0.4 often indicate reasonable fit, though McFadden's measure typically takes smaller values than the linear R^2 for comparably good models.
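The geometric identity can be checked numerically for an ordinary least-squares fit with an intercept, where the mean of the fitted values equals the mean of the response. The sketch below (simulated data, NumPy assumed) compares the conventional R^2 with the squared cosine of the angle between the centered observed and fitted vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

# Ordinary least-squares fit with intercept
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta

# Conventional R^2
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Squared cosine of the angle between the centered observed and fitted vectors
u, v = y - y.mean(), y_hat - y_hat.mean()
cos2 = (u @ v) ** 2 / ((u @ u) * (v @ v))

print(f"R^2 = {r2:.6f}, cos^2(theta*) = {cos2:.6f}")  # equal for an OLS fit with intercept
```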

Application in logistic regression

In logistic regression, which models binary outcomes, the coefficient of determination cannot be applied directly as in linear regression because of the nonlinear link function and the absence of a straightforward variance decomposition. Instead, pseudo-R^2 measures are used to assess model fit by quantifying the improvement in the likelihood over a null model. These measures are derived from the likelihood function and provide a way to evaluate how well the predictors explain the observed data relative to an intercept-only model.

One common pseudo-R^2 variant is the Cox and Snell measure, defined as R^2_{CS} = 1 - \left( \frac{L_0}{L_1} \right)^{2/n}, where L_0 is the likelihood of the null (intercept-only) model, L_1 is the likelihood of the fitted model, and n is the sample size. This measure, proposed by Cox and Snell, ranges between 0 and a maximum strictly less than 1, reflecting the proportional improvement in likelihood but bounded by the likelihood of the null model. To address the limitation that Cox and Snell's R^2 cannot reach 1 even for a perfect model, Nagelkerke introduced a scaled version: R^2_N = \frac{R^2_{CS}}{1 - L_0^{2/n}}. This adjustment normalizes the measure so its maximum value is 1, making it more intuitive for comparing fit across models while still being based on likelihood ratios. Nagelkerke's formulation is widely adopted in statistical software for binary logistic regression.

Interpreting these pseudo-R^2 values presents challenges distinct from linear regression. Unlike the ordinary R^2, which represents the proportion of total variance explained by the model, pseudo-R^2 measures indicate the relative improvement in predictive likelihood rather than variance reduction. For instance, a value of 0.10 does not mean 10% of the "variance" is explained but rather reflects a relative improvement in likelihood of roughly that magnitude, adjusted for sample size; values are typically lower than in linear models for similar data. These measures are most useful for comparing nested models rather than assessing absolute explanatory power.

Consider a logistic regression example predicting binary income (1 for above-median, 0 otherwise) from years of education, with a sample of n = 500. The null model's log-likelihood is -346.574, while the fitted model's is -322.489. The resulting Cox and Snell R^2 is 0.092, and Nagelkerke's R^2 is 0.122. In a corresponding linear regression on the same data, the ordinary R^2 is 0.11, showing rough concordance in scale but highlighting that pseudo-R^2 values remain modest and emphasize likelihood gains over variance fit.

Despite their utility, pseudo-R^2 measures in logistic regression have limitations: they are not directly comparable to the linear R^2 due to differing underlying assumptions about error distributions and cannot be interpreted as proportions of variance explained in the outcome. Instead, they serve primarily for relative model comparison on the same data, such as evaluating whether adding predictors meaningfully improves fit beyond the null model. Over-reliance on a single pseudo-R^2 can mislead, so complementary diagnostics such as AIC or the Hosmer-Lemeshow test are recommended.
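Both pseudo-R^2 values in the example follow directly from the two log-likelihoods; the computation below (plain Python, working on the log scale to avoid underflow) reproduces them:

```python
import math

# Log-likelihoods quoted in the text's example (n = 500 binary outcomes)
n = 500
ll_null = -346.574   # intercept-only model
ll_full = -322.489   # model with years of education

# Cox and Snell pseudo-R^2: 1 - (L0 / L1)^(2/n), computed on the log scale
r2_cs = 1 - math.exp((2 / n) * (ll_null - ll_full))

# Nagelkerke rescales so the maximum attainable value is 1
r2_max = 1 - math.exp((2 / n) * ll_null)
r2_nagelkerke = r2_cs / r2_max

print(f"Cox-Snell R^2 = {r2_cs:.3f}, Nagelkerke R^2 = {r2_nagelkerke:.3f}")  # 0.092, 0.122
```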

Comparisons

With other goodness-of-fit measures

The coefficient of determination, R^2, quantifies the proportion of variance in the response variable explained by the model in linear regression, but it does not penalize model complexity and tends to increase with additional predictors, potentially leading to overfitting. In contrast, information criteria such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC) provide alternative goodness-of-fit measures that balance explanatory power against model complexity, making them suitable for model selection. These criteria are particularly useful when comparing models for predictive accuracy rather than the in-sample fit that R^2 emphasizes.

The AIC is defined as \text{AIC} = -2 \log L + 2k, where L is the maximized likelihood of the model and k is the number of estimated parameters, imposing a fixed penalty of 2 per parameter when gauging relative predictive performance across candidate models. Unlike the adjusted R^2, which penalizes complexity through a degrees-of-freedom correction to the unexplained variance, AIC derives from information theory and asymptotically approximates the expected Kullback-Leibler divergence, favoring models with lower values for out-of-sample prediction. The BIC, formulated as \text{BIC} = -2 \log L + k \log n with n as the sample size, applies a stronger penalty that grows with n, making it more conservative in selecting parsimonious models, especially in large datasets. This logarithmic penalty in BIC contrasts with AIC's constant one, leading BIC to favor simpler models more aggressively than AIC or the adjusted R^2.

In generalized linear models (GLMs), the deviance serves as a goodness-of-fit measure analogous to the residual sum of squares in linear regression, defined as D = -2 (\log L_m - \log L_s), where L_m is the likelihood of the fitted model and L_s is the likelihood of the saturated model. Lower deviance indicates better fit, and reductions in deviance can be used to test model improvements, much like changes in 1 - R^2. For instance, in logistic regression, a common GLM application, deviance assesses fit similarly to pseudo-R^2 measures, though it focuses on likelihood rather than variance explained.

R^2 is preferred for interpreting explanatory power within the training data, particularly in simple linear contexts, while AIC and BIC are favored for model selection aimed at prediction, as they incorporate penalties to avoid overfitting. Deviance is ideal for GLMs where likelihood-based inference is central, offering a direct parallel to R^2's role in ordinary least squares. As an illustration with n = 100 observations, a parsimonious model might yield R^2 = 0.70, while adding two extraneous predictors nudges R^2 up only slightly (say, to 0.705) yet increases AIC and BIC, because the complexity penalties outweigh the small gain in likelihood; the criteria therefore select the simpler model despite the modest improvement in fit.
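For a linear model with Gaussian errors, AIC and BIC can be computed from the residual sum of squares. The sketch below uses hypothetical SSE values chosen so that adding two predictors improves the fit only marginally (roughly the R^2 = 0.70 versus 0.705 scenario sketched above), in which case both criteria prefer the simpler model:

```python
import math

def gaussian_aic_bic(sse: float, n: int, k: int):
    """AIC and BIC for a linear model with Gaussian errors.

    k counts all estimated parameters (coefficients including the intercept,
    plus the error variance). Constants are included so values are comparable
    across models fitted to the same data.
    """
    log_lik = -0.5 * n * (math.log(2 * math.pi * sse / n) + 1)
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + k * math.log(n)
    return aic, bic

# Hypothetical comparison: adding two noise predictors barely reduces SSE
n = 100
for label, sse, k in [("1 predictor ", 30.0, 3), ("3 predictors", 29.5, 5)]:
    aic, bic = gaussian_aic_bic(sse, n, k)
    print(f"{label}: AIC = {aic:.1f}, BIC = {bic:.1f}")
# Both AIC and BIC come out lower for the simpler model: the penalties
# outweigh the small likelihood gain from the extra predictors.
```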

Relation to residual statistics

The coefficient of determination, R^2, relates directly to residual statistics: writing MSE for the mean squared residual and MSTot for the mean total variation, R^2 = 1 - \frac{\text{MSE}}{\text{MSTot}} whenever the two mean squares use the same divisor (for example, SS_res/n and SS_tot/n). When the degrees-of-freedom-corrected mean squares MSE = SS_res/(n - k - 1) and MSTot = SS_tot/(n - 1) are used instead, the same expression yields the adjusted R^2. Either way, the connection highlights how R^2 measures error reduction: a higher R^2 corresponds to a smaller residual mean square relative to the total variability, implying that the model's predictions deviate less from the actual outcomes.

The standard error of the estimate, defined as s = \sqrt{\text{MSE}}, serves as the typical prediction error and is inversely related to R^2; as R^2 approaches 1, s decreases, reflecting tighter fits around the regression line. In simple linear regression, s \approx \sqrt{1 - R^2} \times \text{SD}(y), where SD(y) is the standard deviation of the observed values, underscoring the link between explained variance and residual dispersion.

In the context of hypothesis testing, R^2 integrates with the ANOVA F-statistic to assess overall significance. The F-statistic is computed as F = \frac{\text{MSR}}{\text{MSE}}, where MSR (the mean square for regression) derives from the explained sum of squares, and this ratio can be expressed in terms of R^2 as F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}, with k as the number of predictors and n as the sample size; a significant F-statistic supports the conclusion that R^2 exceeds what would occur by chance.

To illustrate, consider a simple linear regression with n = 10 observations where the total sum of squares is SS_tot = 100 and the residual sum of squares is SS_res = 40. Then R^2 = 1 - 40/100 = 0.60, a 60% reduction in squared error relative to a model that predicts only the mean. The corresponding mean squares are MSE = 40/(10-2) = 5 and MSTot = 100/(10-1) \approx 11.11, and the degrees-of-freedom-adjusted ratio 1 - 5/11.11 \approx 0.55 is the adjusted R^2 for the same fit. This calculation from residuals directly yields R^2 and its adjusted counterpart, emphasizing their role in evaluating predictive accuracy.
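All of the quantities in this example follow from SS_res and SS_tot; a short snippet (illustrative only) ties them together:

```python
# Residual-based summary statistics for the n = 10 example in the text
n, k = 10, 1            # observations and predictors (simple linear regression)
ss_tot, ss_res = 100.0, 40.0

r2 = 1 - ss_res / ss_tot                                  # 0.60
adj_r2 = 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))  # approximately 0.55
mse = ss_res / (n - k - 1)                                # 5.0
s = mse ** 0.5                                            # standard error of the estimate
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))              # ANOVA F-statistic = 12.0

print(r2, round(adj_r2, 3), round(s, 3), f_stat)
```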

Historical development

Origins in early statistics

The concept of the coefficient of determination emerged in the early 20th century, building on earlier statistical developments. The idea of partitioning total variance into explained and unexplained components was advanced through the analysis of variance (ANOVA) developed by Ronald A. Fisher during the 1910s and 1920s. Fisher's work on ANOVA, beginning with his 1918 paper "The Correlation Between Relatives on the Supposition of Mendelian Inheritance" and expanding through subsequent publications, partitioned total variance into components attributable to different sources, providing a framework for quantifying the proportion of explained variation in experimental data. This variance decomposition laid important groundwork for measures like the coefficient of determination, which expresses the ratio of explained variance to total variance as a summary of model fit.

The specific term "coefficient of determination," denoted R^2, was introduced by Sewall Wright in his 1921 paper "Correlation and Causation," in the context of path analysis to assess relationships in complex systems such as biological and agricultural data. The formulation built directly on the method of least squares pioneered by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809, who minimized the sum of squared residuals to estimate parameters in linear models. Their approach quantified the discrepancy between observed and predicted values but did not explicitly frame it as a proportion of explained variation; instead, it emphasized optimal fitting for astronomical and geodetic data. Fisher's innovation extended this by integrating least squares into the modern regression model, transforming the residual-based metric into a standardized measure of explained variation.

Preceding Fisher's contributions, Francis Galton introduced the idea of regression in the 1880s through studies on hereditary traits, such as stature, where he observed that the offspring of exceptional parents tended to regress toward the population mean. Galton's work established the linear relationship between variables but lacked an explicit coefficient of determination, focusing instead on the slope of regression lines without quantifying the proportion of variance explained.

In simple linear regression, the coefficient of determination is equivalent to the square of the Pearson correlation coefficient (r), an interpretation that Fisher elaborated in his seminal 1925 book Statistical Methods for Research Workers. There, Fisher provided rigorous statistical grounding through tables and tests for significance, linking bivariate correlation with the analysis of variance and making the metric accessible for biological and agricultural research. An early equivalent, the squared correlation coefficient, had been noted in prior work but gained practical application through Fisher's contributions.

Evolution and modern usage

In the decades after its introduction, the coefficient of determination saw significant refinements to address limitations in multiple regression settings. Although Mordecai Ezekiel introduced the adjusted R^2 in 1930 as a penalty for additional predictors to mitigate its inflation, widespread adoption came later, amid growing computational capabilities and the rise of multivariate statistical analysis. The adjustment had become a standard tool in econometric and applied statistical research by the 1970s, as evidenced in influential textbooks that emphasized its role in model comparison. Concurrently, the partial R^2 emerged as a key extension in multiple regression, quantifying the unique contribution of individual predictors while controlling for others; its formalization and application gained traction through standard works on linear models, facilitating hierarchical testing of predictors in applied research.

The standardization of R^2 and its variants accelerated with the proliferation of statistical software in the late 20th century. SAS, first released in 1976, incorporated R^2 and adjusted R^2 as default outputs in procedures like PROC REG, enabling routine computation in large-scale data analysis while including caveats in its documentation about interpreting the statistic as explanatory power rather than predictive accuracy. Similarly, the R programming language, developed in the early 1990s and first publicly announced in 1993, with source code released under the GPL in 1995 and version 1.0.0 in 2000, reports these metrics in the summary output of its base lm() function, promoting open-source accessibility while warning against overreliance on in-sample R^2 for causal inference. These implementations democratized the use of R^2 across disciplines but also highlighted risks of misuse, such as data dredging to inflate values without theoretical justification.

In the 1980s, econometric debates intensified around R^2's vulnerabilities to specification bias and overfitting, prompting a shift toward more robust validation methods. Edward Leamer's 1983 critique underscored how extreme-bounds sensitivity analyses could reveal fragility in models with high R^2, influencing the field to prioritize out-of-sample testing for generalizability. Today, R^2 remains integral to machine learning workflows for assessing regression models, including non-linear ones, as implemented in libraries such as scikit-learn's r2_score function, which can evaluate fit on held-out data. In the era of large, high-dimensional datasets, however, critiques emphasize its limitations, since a high in-sample value may mask poor generalization; practitioners now routinely pair it with cross-validation to avoid overoptimism.
