Residual sum of squares
The residual sum of squares (RSS), also known as the sum of squared residuals or error sum of squares (SSE), is a fundamental statistical measure in regression analysis that quantifies the total discrepancy between observed data points and the values predicted by a fitted model. It is defined as the sum of the squared differences between each observed value y_i and its corresponding predicted value \hat{y}_i, expressed by the formula RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where n is the number of observations.[1] In linear regression, the ordinary least squares method minimizes the RSS to estimate the model parameters, ensuring the best linear fit to the data by reducing unexplained variation.[1]

RSS plays a central role in the analysis of variance (ANOVA) framework for regression models, where the total sum of squares (TSS), which measures the overall variation in the response variable around its mean, is decomposed into the regression sum of squares (SSR, also called the explained sum of squares, ESS), which captures variation explained by the model, and the RSS, which represents the unexplained or residual variation: TSS = SSR + RSS.[2] This decomposition allows for the calculation of the coefficient of determination R^2 = 1 - \frac{RSS}{TSS}, which indicates the proportion of variance in the dependent variable explained by the independent variables, with values closer to 1 signifying a better fit.[3] The degrees of freedom associated with RSS are typically n - p - 1, where p is the number of predictors, enabling the computation of the mean squared error (MSE = RSS / (n - p - 1)) for assessing model performance and conducting hypothesis tests, such as the F-test for overall model significance.[4]

Beyond simple linear regression, RSS is integral to more complex models, including multiple linear regression and nonlinear regression,[5] where it serves as an objective function for parameter estimation and model comparison via criteria like the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC), which penalize excessive complexity while favoring lower RSS values.[6] In predictive modeling, a lower RSS on validation data indicates better generalization, while overfitting can result in a deceptively low RSS on training data but higher RSS on unseen data, highlighting the need for regularization techniques like ridge regression that modify the RSS objective by adding a penalty term to balance fit and complexity.[7] Overall, RSS provides a quantifiable basis for evaluating regression model adequacy, influencing applications in fields such as economics, biology, and engineering for data-driven decision-making.[2]
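As a brief numerical illustration of the quantities introduced above, the following Python sketch (using NumPy; the toy data and the names y_hat, beta, and lam are illustrative assumptions, not taken from a particular source) computes RSS, TSS, R^2, and a ridge-style penalized objective.

```python
import numpy as np

# Toy observed responses and predictions from some fitted model.
y = np.array([2.0, 4.0, 5.0, 7.0])
y_hat = np.array([2.1, 3.7, 5.3, 6.9])

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares, approx. 0.2 here
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares around the mean
r_squared = 1 - rss / tss             # coefficient of determination

# Ridge regression modifies the least-squares objective by adding an L2
# penalty on the coefficients (beta and lam are illustrative values only).
beta = np.array([0.5, 1.6])
lam = 0.1
ridge_objective = rss + lam * np.sum(beta ** 2)

print(rss, tss, r_squared, ridge_objective)
```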
Fundamentals

Definition
The residual sum of squares (RSS), also known as the sum of squared errors (SSE), quantifies the discrepancy between observed data and values predicted by a statistical model. It serves as a key measure of model fit, with smaller values indicating that the model's predictions more closely align with the actual observations.[8] Residuals are defined as the differences between the observed values and the corresponding fitted values from the model; RSS is then computed by squaring these residuals and summing them across all data points. Mathematically, for a dataset comprising n observations, the RSS is given by \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where y_i denotes the i-th observed value and \hat{y}_i denotes the i-th predicted value.[8][9]

This concept originated in the development of the least squares method, first published by Adrien-Marie Legendre in his 1805 work Nouvelles méthodes pour la détermination des orbites des comètes, where it was presented as an algebraic procedure for minimizing errors in orbital calculations.[10] Independently, Carl Friedrich Gauss elaborated on the method in his 1809 treatise Theoria motus corporum coelestium in sectionibus conicis solem ambientium, providing a probabilistic justification based on the assumption of normally distributed errors.[11][12]

Beyond its foundational role in regression analysis, the RSS finds broad application in areas such as nonlinear curve fitting, where it evaluates how well a parametric function approximates scattered data points, and in time series modeling, where it assesses the accuracy of forecasts against historical patterns.[13][14] In ordinary least squares estimation, model parameters are selected precisely to minimize the RSS.[15]
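The definition translates directly into a short computation. The sketch below (NumPy-based; the exponential model, the candidate parameters a and b, and the data points are illustrative, not from the text) shows RSS used to score a candidate parameter set in a nonlinear curve-fitting setting.

```python
import numpy as np

def rss(y_obs, y_pred):
    """Residual sum of squares: sum of squared observed-minus-predicted differences."""
    residuals = y_obs - y_pred
    return np.sum(residuals ** 2)

# Scoring a candidate nonlinear model y = a * exp(b * x) against scattered data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 8.2, 19.8])
a, b = 1.0, 1.0                      # candidate parameter values
print(rss(y, a * np.exp(b * x)))     # smaller values indicate a closer fit
```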
Role in Regression Analysis

The residual sum of squares (RSS) serves as a key measure of the unexplained variation in the dependent variable after fitting a regression model, quantifying the total squared differences between observed values and the model's predictions.[4][16] This metric directly reflects the portion of variability not captured by the predictors, providing an initial assessment of how well the model accounts for the data's patterns.[4] A lower RSS generally indicates a better-fitting model, as it suggests reduced discrepancies between predictions and actual outcomes; however, this must be interpreted in context with the degrees of freedom to prevent favoring overly complex models that artificially reduce RSS by incorporating additional parameters without improving explanatory power.[16] In practice, analysts often divide RSS by the appropriate degrees of freedom to obtain the mean squared error, which offers a more balanced evaluation of fit across models of varying complexity.[4]

RSS plays a central role in hypothesis testing for assessing the overall significance of a regression model through the F-statistic, where it contributes to the denominator as the mean squared error, allowing comparison against the explained variation to determine if the model as a whole explains more than mere chance.[4][17] A significant F-test, derived in part from RSS, supports rejecting the null hypothesis that all regression coefficients (except the intercept) are zero, affirming the model's utility.[4]

Despite its utility, RSS has notable limitations: it does not inherently penalize model complexity, as adding predictors invariably decreases or maintains RSS without necessarily enhancing generalizability, which can lead to overfitting.[16][18] Additionally, because residuals are squared, RSS is particularly sensitive to outliers, which can disproportionately inflate the value and distort the assessment of model accuracy.[19][20]
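The F-statistic described above can be assembled directly from the sums of squares and their degrees of freedom. The following sketch (plain Python arithmetic plus SciPy for the tail probability; the values of n, p, TSS, and RSS are hypothetical) illustrates the mean squared error and the overall F-test.

```python
from scipy import stats

# Hypothetical regression output: n observations, p predictors, and the
# total and residual sums of squares (all values are illustrative).
n, p = 50, 3
tss, rss = 120.0, 40.0
ess = tss - rss

mse = rss / (n - p - 1)                      # mean squared error (denominator)
f_stat = (ess / p) / mse                     # overall F-statistic
p_value = stats.f.sf(f_stat, p, n - p - 1)   # upper-tail probability of the F distribution
print(mse, f_stat, p_value)
```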
Mathematical Formulations

Simple Linear Regression
In simple linear regression, the model posits a linear relationship between a response variable y and a single explanatory variable x, expressed as y_i = \beta_0 + \beta_1 x_i + \varepsilon_i for i = 1, \dots, n, where \beta_0 and \beta_1 are the intercept and slope parameters, respectively, and \varepsilon_i represents the random error term assumed to have mean zero and constant variance.[15] The ordinary least squares (OLS) method estimates these parameters by minimizing the residual sum of squares (RSS), yielding fitted values \hat{y}_i = b_0 + b_1 x_i, where b_0 and b_1 are the OLS estimators.[21] The RSS is then given explicitly by \text{RSS} = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2.[21]

To relate RSS to the sample means \bar{y} and \bar{x}, the residuals can be rewritten as e_i = y_i - \hat{y}_i = (y_i - \bar{y}) - b_1 (x_i - \bar{x}). Substituting into the RSS formula and expanding yields \text{RSS} = \sum_{i=1}^n (y_i - \bar{y})^2 - b_1^2 \sum_{i=1}^n (x_i - \bar{x})^2, where the first term is the total sum of squares (TSS) measuring total variation in y, and the second term captures variation explained by the regression. Since b_1 = r \frac{s_y}{s_x}, with r denoting the sample correlation coefficient between x and y and with s_y and s_x the sample standard deviations, this simplifies to \text{RSS} = (1 - r^2) \cdot \text{TSS}.[22]

For illustration, consider a hypothetical dataset with n = 4 observations: (x_i, y_i) = (1,2), (2,4), (3,5), (4,7). The sample means are \bar{x} = 2.5 and \bar{y} = 4.5. The OLS estimates are b_1 = 1.6 and b_0 = 0.5, producing fitted values \hat{y}_i = 2.1, 3.7, 5.3, 6.9 and residuals e_i = -0.1, 0.3, -0.3, 0.1. The RSS is then (-0.1)^2 + (0.3)^2 + (-0.3)^2 + (0.1)^2 = 0.2. Here, \text{TSS} = 13 and r^2 \approx 0.985, confirming \text{RSS} = (1 - 0.985) \times 13 \approx 0.2.[15][22]
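The worked example above can be reproduced in a few lines. The sketch below (NumPy, using the same four data points) computes the OLS estimates and the RSS, and checks the identity RSS = (1 - r^2) \cdot TSS.

```python
import numpy as np

# The four-point example from the text: (1, 2), (2, 4), (3, 5), (4, 7).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 7.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()                  # OLS slope (1.6) and intercept (0.5)
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)                 # 0.2
tss = np.sum((y - y.mean()) ** 2)              # 13.0
r = np.corrcoef(x, y)[0, 1]                    # sample correlation coefficient
print(b0, b1, rss, (1 - r ** 2) * tss)         # the last two values agree
```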
Multiple Linear Regression

In multiple linear regression, the model is expressed in matrix form as \mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \mathbf{y} is an n \times 1 response vector, X is an n \times (p+1) design matrix incorporating the intercept and p predictors, \boldsymbol{\beta} is a (p+1) \times 1 parameter vector, and \boldsymbol{\varepsilon} is an n \times 1 error vector with independent components assumed to have mean zero and constant variance \sigma^2.[23] The ordinary least squares (OLS) estimator minimizes the residual sum of squares and is given by \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}, assuming X^T X is invertible, which requires the columns of X to be linearly independent.[23] The fitted values are then \hat{\mathbf{y}} = X \hat{\boldsymbol{\beta}}, and the residuals are defined as \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}.[23]

Substituting the OLS estimator yields the residual sum of squares as \text{RSS} = \mathbf{e}^T \mathbf{e} = (\mathbf{y} - \hat{\mathbf{y}})^T (\mathbf{y} - \hat{\mathbf{y}}).[23] This can be rewritten in a more compact matrix form as \text{RSS} = \mathbf{y}^T (I - X (X^T X)^{-1} X^T) \mathbf{y}, where I is the n \times n identity matrix.[24] The term H = X (X^T X)^{-1} X^T is known as the hat or projection matrix, which projects \mathbf{y} onto the column space of X; thus \text{RSS} = \mathbf{y}^T (I - H) \mathbf{y}, and I - H is the residual maker matrix, which is idempotent and symmetric.[24]

The RSS is a quadratic form in \mathbf{y}, specifically \mathbf{y}^T M \mathbf{y} with M = I - H, and this structure facilitates its use in distribution theory and inference under normality assumptions.[25] A key property is that RSS is invariant under invertible linear transformations of the predictor columns, since such transformations leave the column space of X, and hence the projection matrix H, unchanged; in particular, centering the predictors about their means does not alter this measure of unexplained variation.[25] Computationally, the quadratic form \mathbf{y}^T (I - H) \mathbf{y} allows RSS to be obtained directly from \mathbf{y} without explicitly forming the residual vector, a representation that is convenient both in theoretical derivations and in implementations that exploit optimized matrix operations.[24]
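The matrix expressions above can be checked numerically. The following sketch (NumPy, with simulated data and an arbitrary random seed) verifies that the residual-based and quadratic-form computations of RSS coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)                # simulated responses

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                # OLS estimate (X^T X)^{-1} X^T y
residuals = y - X @ beta_hat
rss_from_residuals = residuals @ residuals                  # e^T e

H = X @ np.linalg.inv(X.T @ X) @ X.T                        # hat (projection) matrix
rss_quadratic_form = y @ (np.eye(n) - H) @ y                # y^T (I - H) y

print(np.allclose(rss_from_residuals, rss_quadratic_form))  # True
```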
Connections to Other Statistics

Decomposition of Variance
In linear regression, the total variation in the response variable is decomposed into components attributable to the model and to unexplained error, providing a framework for assessing model fit. The total sum of squares (TSS) quantifies the overall variability around the mean response, defined as \sum_{i=1}^n (y_i - \bar{y})^2, where \bar{y} is the sample mean of the observed responses y_i. This decomposes additively into the explained sum of squares (ESS), which captures variation explained by the fitted model as \sum_{i=1}^n (\hat{y}_i - \bar{y})^2, and the residual sum of squares (RSS), which measures unexplained variation as \sum_{i=1}^n (y_i - \hat{y}_i)^2, such that \mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}.[26]

This additivity, known as the partition theorem, holds under ordinary least squares (OLS) estimation due to the orthogonality of the residuals and the centered fitted values. Specifically, the residual vector \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} satisfies \mathbf{e}^T (\hat{\mathbf{y}} - \bar{y} \mathbf{1}) = 0, where \mathbf{1} is a vector of ones, ensuring the two vectors are perpendicular in Euclidean space. This orthogonality follows from the projection properties of OLS, where the hat matrix \mathbf{H} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T projects \mathbf{y} onto the column space of the design matrix \mathbf{X} and the residuals lie in the orthogonal complement, so the Pythagorean theorem guarantees the decomposition.[26]

In matrix notation, the TSS can be expressed as \mathbf{y}^T \left( \mathbf{I}_n - \frac{1}{n} \mathbf{J} \right) \mathbf{y}, where \mathbf{I}_n is the n \times n identity matrix and \mathbf{J} = \mathbf{1}\mathbf{1}^T is the all-ones matrix. The decomposition is an algebraic identity for any OLS fit whose design matrix includes an intercept (so that the constant vector lies in its column space) and does not depend on distributional assumptions such as homoscedasticity. The associated degrees of freedom reflect the dimensionality: df_{\mathrm{TSS}} = n-1 for the total, df_{\mathrm{ESS}} = p for the explained (with p predictors), and df_{\mathrm{RSS}} = n-p-1 for the residuals.[27][28]

This decomposition forms the basis for the analysis of variance (ANOVA) table in regression output, which summarizes the sums of squares, degrees of freedom, mean squares (obtained by dividing sums of squares by degrees of freedom), and the F-statistic for testing overall model significance. The table typically includes rows for regression (ESS), residual (RSS), and total (TSS), enabling inference on whether the predictors collectively explain a significant portion of the variance.[28]
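A small helper makes the partition concrete. The sketch below (NumPy; the anova_table function name is an illustrative invention, and the data reuse the four-point simple-regression example with p = 1 predictor) tabulates the sums of squares, degrees of freedom, and mean squares, and checks that TSS = ESS + RSS.

```python
import numpy as np

def anova_table(y, y_hat, p):
    """Sums of squares, degrees of freedom, and mean squares for an OLS fit
    with p predictors (illustrative helper, not from any particular library)."""
    n = len(y)
    ess = np.sum((y_hat - y.mean()) ** 2)       # explained (regression) sum of squares
    rss = np.sum((y - y_hat) ** 2)              # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)           # total sum of squares
    assert np.isclose(tss, ess + rss)           # the partition TSS = ESS + RSS
    return [("regression", ess, p, ess / p),
            ("residual", rss, n - p - 1, rss / (n - p - 1)),
            ("total", tss, n - 1, None)]

y = np.array([2.0, 4.0, 5.0, 7.0])
y_hat = np.array([2.1, 3.7, 5.3, 6.9])
for source, ss, df, ms in anova_table(y, y_hat, 1):
    print(source, round(ss, 3), df, ms)
```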
Coefficient of Determination

The coefficient of determination, denoted R^2, quantifies the proportion of the total variation in the dependent variable that is explained by the regression model. It is defined as

R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = \frac{\text{ESS}}{\text{TSS}},
where RSS is the residual sum of squares, TSS is the total sum of squares measuring the total variation around the mean of the dependent variable, and ESS is the explained sum of squares capturing the variation accounted for by the model.[29][30] In interpretation, R^2 represents the fraction of the variance in the observed values of the dependent variable y that is attributable to the variation in the predictors through the fitted model, with values ranging from 0 to 1. A value of 0 indicates that the model does not explain any of the variability in y beyond the mean, while a value of 1 implies perfect explanation of all variability.[31] To address the tendency of R^2 to increase with the addition of more predictors regardless of their relevance, the adjusted coefficient of determination, denoted \bar{R}^2, incorporates a penalty for the number of parameters:
\bar{R}^2 = 1 - \left(1 - R^2\right) \frac{n-1}{n - p - 1},
where n is the sample size and p is the number of predictors. This adjustment ensures that \bar{R}^2 only increases if the added predictor improves the model beyond what would be expected by chance.[32][33] Despite its utility, R^2 has limitations: it can artificially increase when irrelevant variables are added to the model, making it less reliable for comparing models with different numbers of predictors without adjustment. Additionally, R^2 is not appropriate for assessing nonlinear models, as it assumes linearity in the relationship between predictors and the dependent variable, and it is undefined when TSS = 0, which occurs if all observed values of y are identical.[33][34] For illustration in simple linear regression, consider an example where the total sum of squares TSS = 1827.6 and the regression (explained) sum of squares is 119.1, yielding RSS = TSS - ESS = 1708.5; thus, R^2 = 1 - \frac{1708.5}{1827.6} \approx 0.065, indicating that approximately 6.5% of the variance in the dependent variable is explained by the model.[31]
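The figures in the example above can be verified directly. The short sketch below (plain Python; the sample size n and predictor count p used for the adjusted value are illustrative assumptions, since the example does not report them) computes R^2 and the corresponding adjusted \bar{R}^2.

```python
# Worked-example values: TSS = 1827.6 and ESS = 119.1, so RSS = TSS - ESS.
tss, ess = 1827.6, 119.1
rss = tss - ess
r2 = 1 - rss / tss                              # approx. 0.065

# Adjusted R^2 with an assumed sample size and predictor count
# (n and p are illustrative; the original example does not state them).
n, p = 25, 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 3), round(adj_r2, 3))
```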