Residual sum of squares

The residual sum of squares (RSS), also known as the sum of squared residuals or error sum of squares (SSE), is a fundamental statistical measure in regression analysis that quantifies the total discrepancy between observed data points and the values predicted by a fitted model. It is defined as the sum of the squared differences between each observed value y_i and its corresponding predicted value \hat{y}_i, expressed by the formula RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where n is the number of observations. In linear regression, the ordinary least squares method minimizes the RSS to estimate the model parameters, ensuring the best linear fit to the data by reducing unexplained variation.

RSS plays a central role in the analysis of variance (ANOVA) framework for regression models, where the total sum of squares (TSS)—measuring the overall variation in the response variable around its mean—is decomposed into the regression sum of squares (SSR), which captures variation explained by the model, and the RSS, which represents the unexplained or residual variation: TSS = SSR + RSS. This decomposition allows for the calculation of the coefficient of determination R^2 = 1 - \frac{RSS}{TSS}, which indicates the proportion of variance in the dependent variable explained by the independent variables, with values closer to 1 signifying a better fit. The degrees of freedom associated with RSS are typically n - p - 1, where p is the number of predictors, enabling the computation of the mean squared error (MSE = RSS / (n - p - 1)) for assessing model performance and conducting hypothesis tests, such as the F-test for overall model significance.

Beyond simple linear regression, RSS is integral to more complex models, including multiple linear regression and related models, where it serves as an objective function for parameter estimation and model comparison via criteria such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), which penalize excessive complexity while favoring lower RSS values. In predictive modeling, a lower RSS on validation data indicates better generalization, while overfitting can result in a deceptively low RSS on training data but a higher RSS on unseen data, highlighting the need for regularization techniques, such as ridge regression or the lasso, that modify the RSS objective by adding a penalty term to balance fit and complexity. Overall, RSS provides a quantifiable basis for evaluating regression model adequacy, informing data-driven decision-making across a wide range of applied fields.

Fundamentals

Definition

The residual sum of squares (RSS), also known as the sum of squared errors (SSE), quantifies the discrepancy between observed data and values predicted by a fitted statistical model. It serves as a key measure of model fit, with smaller values indicating that the model's predictions more closely align with the actual observations. Residuals are defined as the differences between the observed values and the corresponding fitted values from the model; RSS is then computed by squaring these residuals and summing them across all data points. Mathematically, for a dataset comprising n observations, the RSS is given by \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where y_i denotes the i-th observed value and \hat{y}_i denotes the i-th predicted value. This concept originated in the development of the method of least squares, first published by Adrien-Marie Legendre in his 1805 work Nouvelles méthodes pour la détermination des orbites des comètes, where it was presented as an algebraic procedure for minimizing errors in orbital calculations. Independently, Carl Friedrich Gauss elaborated on the method in his 1809 treatise Theoria motus corporum coelestium in sectionibus conicis solem ambientium, providing a probabilistic justification based on the assumption of normally distributed errors. Beyond its foundational role in linear regression, the RSS finds broad application in areas such as nonlinear curve fitting, where it evaluates how well a fitted curve approximates scattered data points, and in time series modeling, where it assesses the accuracy of forecasts against historical patterns. In ordinary least squares estimation, model parameters are selected precisely to minimize the RSS.
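As a minimal sketch of the definition, the following Python snippet computes the RSS from a set of observed values and the predictions of an already-fitted model; the observations and predictions here are invented purely for illustration.

```python
# Minimal sketch: RSS as the sum of squared differences between observed and
# predicted values. The observations and predictions below are hypothetical.

def residual_sum_of_squares(observed, predicted):
    """Return the sum of (y_i - y_hat_i)^2 over all observations."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

y = [3.1, 4.9, 7.2]          # hypothetical observed values
y_hat = [3.0, 5.0, 7.0]      # hypothetical fitted values from some model

print(residual_sum_of_squares(y, y_hat))  # residuals 0.1, -0.1, 0.2 -> RSS ≈ 0.06
```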

Role in Regression Analysis

The residual sum of squares (RSS) serves as a key measure of the unexplained variation in the dependent variable after fitting a model, quantifying the total squared differences between observed values and the model's predictions. This metric directly reflects the portion of variability not captured by the predictors, providing an initial assessment of how well the model accounts for the data's patterns. A lower RSS generally indicates a better-fitting model, as it suggests reduced discrepancies between predictions and actual outcomes; however, this must be interpreted in light of the number of model parameters, to avoid favoring overly complex models that artificially reduce RSS by incorporating additional parameters without improving explanatory power. In practice, analysts often divide RSS by the appropriate degrees of freedom to obtain the mean squared error, which offers a more balanced evaluation of fit across models of varying complexity. RSS also plays a central role in hypothesis testing for assessing the overall significance of a model through the F-statistic, where it contributes to the denominator as the residual mean square, allowing comparison against the explained variation to determine whether the model as a whole explains more than mere chance. A significant F-statistic, derived in part from RSS, supports rejecting the null hypothesis that all coefficients (except the intercept) are zero, affirming the model's utility. Despite its utility, RSS has notable limitations: it does not inherently penalize model complexity, as adding predictors invariably decreases or maintains RSS without necessarily enhancing generalizability, which can lead to overfitting. Additionally, because residuals are squared, RSS is particularly sensitive to outliers, which can disproportionately inflate the value and distort the assessment of model accuracy.
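To make the role of RSS in these summary statistics concrete, the short sketch below computes the residual mean square (MSE) and the overall F-statistic from hypothetical sums of squares; the sample size, predictor count, and sums of squares are placeholders rather than values from any real analysis.

```python
# Hedged sketch: how RSS enters the mean squared error and the overall F-test.
# All numbers below are hypothetical placeholders.

n, p = 50, 3          # hypothetical sample size and number of predictors
tss = 400.0           # hypothetical total sum of squares
rss = 120.0           # hypothetical residual sum of squares
ess = tss - rss       # explained sum of squares under the OLS decomposition

mse = rss / (n - p - 1)        # residual mean square: RSS per residual degree of freedom
f_stat = (ess / p) / mse       # F-statistic comparing explained to unexplained variation

print(f"MSE = {mse:.3f}, F = {f_stat:.2f}")
# The p-value would come from the F distribution with (p, n - p - 1) degrees of freedom.
```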

Mathematical Formulations

Simple Linear Regression

In simple linear regression, the model posits a linear relationship between a response variable y and a single explanatory variable x, expressed as y_i = \beta_0 + \beta_1 x_i + \varepsilon_i for i = 1, \dots, n, where \beta_0 and \beta_1 are the intercept and slope parameters, respectively, and \varepsilon_i represents the random error term assumed to have mean zero and constant variance. The ordinary least squares (OLS) method estimates these parameters by minimizing the residual sum of squares (RSS), yielding fitted values \hat{y}_i = b_0 + b_1 x_i, where b_0 and b_1 are the OLS estimators. The RSS is then given explicitly by \text{RSS} = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2. To relate RSS to the sample means \bar{y} and \bar{x}, the residuals can be rewritten as e_i = y_i - \hat{y}_i = (y_i - \bar{y}) - b_1 (x_i - \bar{x}). Substituting into the RSS formula and expanding yields \text{RSS} = \sum_{i=1}^n (y_i - \bar{y})^2 - b_1^2 \sum_{i=1}^n (x_i - \bar{x})^2, where the first term is the total sum of squares (TSS) measuring total variation in y, and the second term captures the variation explained by the regression. Since b_1 = r \frac{s_y}{s_x}, with r denoting the sample correlation coefficient between x and y and s_y and s_x the sample standard deviations, this simplifies to \text{RSS} = (1 - r^2) \cdot \text{TSS}.

For illustration, consider a hypothetical dataset with n=4 observations: (x_i, y_i) = (1,2), (2,4), (3,5), (4,7). The sample means are \bar{x} = 2.5 and \bar{y} = 4.5. The OLS estimates are b_1 = 1.6 and b_0 = 0.5, producing fitted values \hat{y}_i = 2.1, 3.7, 5.3, 6.9 and residuals e_i = -0.1, 0.3, -0.3, 0.1. The RSS is then (-0.1)^2 + (0.3)^2 + (-0.3)^2 + (0.1)^2 = 0.2. Here, TSS = 13 and r^2 \approx 0.985, confirming \text{RSS} = (1 - 0.985) \times 13 \approx 0.2.
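The following Python sketch reproduces the worked example above, computing the OLS estimates, RSS, TSS, and r^2 directly from the four observations.

```python
# Reproducing the worked example: OLS fit and RSS for (1,2), (2,4), (3,5), (4,7).

x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# OLS slope and intercept for simple linear regression
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx                     # 1.6
b0 = y_bar - b1 * x_bar            # 0.5

y_hat = [b0 + b1 * xi for xi in x]
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
tss = sum((yi - y_bar) ** 2 for yi in y)

print(rss)            # ≈ 0.2
print(1 - rss / tss)  # r^2 ≈ 0.985
```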

Multiple Linear Regression

In multiple linear regression, the model is expressed in matrix form as \mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \mathbf{y} is an n \times 1 response vector, X is an n \times (p+1) design matrix incorporating the intercept and p predictors, \boldsymbol{\beta} is a (p+1) \times 1 parameter vector, and \boldsymbol{\varepsilon} is an n \times 1 error vector with independent components assumed to have mean zero and constant variance \sigma^2. The ordinary least squares (OLS) estimator minimizes the residual sum of squares and is given by \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}, assuming X^T X is invertible, which requires the columns of X to be linearly independent. The fitted values are then \hat{\mathbf{y}} = X \hat{\boldsymbol{\beta}}, and the residuals are defined as \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}. Substituting the OLS estimator yields the residual sum of squares as \text{RSS} = \mathbf{e}^T \mathbf{e} = (\mathbf{y} - \hat{\mathbf{y}})^T (\mathbf{y} - \hat{\mathbf{y}}). This can be rewritten in a more compact matrix form as \text{RSS} = \mathbf{y}^T (I - X (X^T X)^{-1} X^T) \mathbf{y}, where I is the n \times n identity matrix. The term H = X (X^T X)^{-1} X^T is known as the hat or projection matrix, which projects \mathbf{y} onto the column space of X; thus, \text{RSS} = \mathbf{y}^T (I - H) \mathbf{y}, and I - H is the residual maker matrix, which is idempotent and symmetric. The RSS is therefore a quadratic form in \mathbf{y}, specifically \mathbf{y}^T M \mathbf{y} where M = I - H, a structure that facilitates its use in distribution theory and inference under normality assumptions. A key property is that RSS is invariant under invertible linear transformations of the predictor columns, such as centering them to have zero mean: these transformations leave the column space of X, and hence the projection H, unchanged, with the intercept adjusting accordingly, so the measure of unexplained variation is unaffected. Computationally, the quadratic-form expression is convenient for theoretical derivations, but for large datasets the n \times n projection matrix is rarely formed explicitly; RSS is instead obtained from the residuals of a numerically stable fit, for example one based on a QR decomposition of X.
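The following NumPy sketch illustrates the matrix formulation on a small synthetic dataset (the design matrix and coefficients are invented for illustration), checking that the quadratic form \mathbf{y}^T (I - H) \mathbf{y} matches the RSS computed from the residual vector.

```python
import numpy as np

# Matrix-form RSS on a synthetic dataset: direct residuals vs. y^T (I - H) y.
rng = np.random.default_rng(0)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate (X^T X assumed invertible)
residuals = y - X @ beta_hat
rss_direct = residuals @ residuals             # e^T e

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat (projection) matrix
rss_quadratic = y @ (np.eye(n) - H) @ y        # quadratic form in y

print(np.isclose(rss_direct, rss_quadratic))   # True: both expressions agree
```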

Connections to Other Statistics

Decomposition of Variance

In regression analysis, the total variation in the response variable is decomposed into components attributable to the model and to unexplained error, providing a framework for assessing model fit. The total sum of squares (TSS) quantifies the overall variability of the response around its mean, defined as \sum_{i=1}^n (y_i - \bar{y})^2, where \bar{y} is the sample mean of the observed responses y_i. This decomposes additively into the explained sum of squares (ESS), which captures variation explained by the fitted model as \sum_{i=1}^n (\hat{y}_i - \bar{y})^2, and the residual sum of squares (RSS), which measures unexplained variation as \sum_{i=1}^n (y_i - \hat{y}_i)^2, such that \mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}. This additivity, known as the partition of the total sum of squares, holds under ordinary least squares (OLS) estimation due to the orthogonality of the residuals and the centered fitted values. Specifically, the residuals \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} satisfy \mathbf{e}^T (\hat{\mathbf{y}} - \bar{y} \mathbf{1}) = 0, where \mathbf{1} is a vector of ones, ensuring the two vectors are orthogonal. This follows from the properties of OLS, where the hat matrix \mathbf{H} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T projects \mathbf{y} onto the column space of the design matrix \mathbf{X} while the residuals lie in its orthogonal complement, so the Pythagorean theorem guarantees the decomposition. In matrix notation, the TSS can be expressed as \mathbf{y}^T \left( \mathbf{I}_n - \frac{1}{n} \mathbf{J} \right) \mathbf{y}, where \mathbf{I}_n is the n \times n identity matrix and \mathbf{J} = \mathbf{1}\mathbf{1}^T is the all-ones matrix; the decomposition itself is algebraic and holds for any OLS fit that includes an intercept term, which ensures the residuals sum to zero and are orthogonal to the fitted values.

The associated degrees of freedom reflect the dimensionality: df_{\mathrm{TSS}} = n-1 for the total, df_{\mathrm{ESS}} = p for the explained part (with p predictors), and df_{\mathrm{RSS}} = n-p-1 for the residuals. This decomposition forms the basis for the analysis of variance (ANOVA) table in regression output, which summarizes the sums of squares, their degrees of freedom, the mean squares (obtained by dividing each sum of squares by its degrees of freedom), and the F-statistic for testing overall model significance. The table typically includes rows for the regression (ESS), residual (RSS), and total (TSS) sums of squares, enabling inference on whether the predictors collectively explain a significant portion of the variance.
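A quick numerical check of the partition, using an invented dataset fitted by OLS with an intercept, might look as follows; it verifies TSS = ESS + RSS and assembles the mean squares and F-statistic that would populate an ANOVA table.

```python
import numpy as np

# Verifying TSS = ESS + RSS for an OLS fit with an intercept (synthetic data).
rng = np.random.default_rng(1)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
y_bar = y.mean()

tss = np.sum((y - y_bar) ** 2)
ess = np.sum((y_hat - y_bar) ** 2)
rss = np.sum((y - y_hat) ** 2)
print(np.isclose(tss, ess + rss))              # True: the partition holds

# ANOVA-style quantities: mean squares and the F-statistic
df_ess, df_rss = p, n - p - 1
f_stat = (ess / df_ess) / (rss / df_rss)
print(f"ESS/df = {ess / df_ess:.2f}, RSS/df = {rss / df_rss:.2f}, F = {f_stat:.2f}")
```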

Coefficient of Determination

The coefficient of determination, denoted R^2, quantifies the proportion of the total variation in the dependent variable that is explained by the regression model. It is defined as
R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = \frac{\text{ESS}}{\text{TSS}},
where RSS is the residual sum of squares, TSS is the total sum of squares measuring the variation around the mean of the dependent variable, and ESS is the explained sum of squares capturing the variation accounted for by the model.
In interpretation, R^2 represents the fraction of the variance in the observed values of the dependent variable y that is attributable to the variation in the predictors through the fitted model, with values ranging from 0 to 1. A value of 0 indicates that the model does not explain any of the variability in y beyond its mean, while a value of 1 implies perfect explanation of all variability. To address the tendency of R^2 to increase with the addition of more predictors regardless of their relevance, the adjusted coefficient of determination, denoted \bar{R}^2, incorporates a penalty for the number of parameters:
\bar{R}^2 = 1 - \left(1 - R^2\right) \frac{n-1}{n - p - 1},
where n is the sample size and p is the number of predictors. This adjustment ensures that \bar{R}^2 only increases if the added predictor improves the model beyond what would be expected by chance.
Despite its utility, R^2 has limitations: it can increase artificially when irrelevant variables are added to the model, making it unreliable for comparing models with different numbers of predictors without adjustment. Additionally, R^2 is not appropriate for assessing nonlinear models, as it presumes a linear relationship between the predictors and the dependent variable, and it is undefined when TSS = 0, which occurs if all observed values of y are identical. For illustration in simple linear regression, consider an example where the total sum of squares is TSS = 1827.6 and the regression (explained) sum of squares is ESS = 119.1, yielding RSS = TSS - ESS = 1708.5; thus, R^2 = 1 - \frac{1708.5}{1827.6} \approx 0.065, indicating that approximately 6.5% of the variance in the dependent variable is explained by the model.
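The example above can be reproduced in a few lines of Python; the sample size and predictor count used for the adjusted statistic are hypothetical additions, since the example specifies only the sums of squares.

```python
# R^2 and adjusted R^2 from the sums of squares in the example above.
tss = 1827.6
ess = 119.1
rss = tss - ess                    # 1708.5

r_squared = 1 - rss / tss          # ≈ 0.065

n, p = 50, 1                       # hypothetical sample size and predictor count
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(r_squared, 3), round(adj_r_squared, 3))  # ≈ 0.065 and a slightly smaller value
```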

Pearson Correlation

In simple linear regression, the square of the Pearson product-moment correlation coefficient, denoted r^2, provides a direct measure of the proportion of variance in the response variable explained by the predictor, linking explicitly to the residual sum of squares (RSS) and total sum of squares (TSS) via the relation r^2 = 1 - \frac{\text{RSS}}{\text{TSS}}. This equation demonstrates how a stronger linear association between the variables reduces the RSS relative to the TSS, thereby increasing r^2. The Pearson correlation coefficient r itself quantifies the strength and direction of this linear association and is computed as r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}, where \bar{x} and \bar{y} are the sample means of the predictor x and response y, respectively. In regression with a single predictor, r^2 is equivalent to the coefficient of determination R^2, both representing the fraction of total variance accounted for by the model. This equivalence underscores r^2 as a standardized measure of linear fit, independent of the units of measurement for x and y. Karl Pearson formalized the product-moment correlation coefficient in his seminal 1895 paper, as part of broader contributions to evolutionary theory and biometric analysis; his work highlighted how the residuals from fitting a line reveal the degree of linear dependence, laying the groundwork for modern regression diagnostics. In multiple linear regression, R^2 generalizes this concept through the multiple correlation coefficient R, defined as the correlation between the observed response and the response predicted from all the predictors; R^2 is the square of this multiple R, but R does not reduce to the simple Pearson r for any individual predictor. A key caution is that the equality r^2 = R^2 holds exclusively in the univariate case with one predictor; in multiple-regression settings, squaring the correlation between the response and a single variable does not yield the overall R^2. Moreover, since r assesses only linear associations, non-linear relationships can produce misleadingly low r^2 values despite potentially strong overall dependencies, and Pearson himself later warned of "spurious correlation" in contexts where apparent linear associations are artifactual.
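A short numerical check of the identity r^2 = 1 - RSS/TSS, using an invented bivariate dataset, is sketched below.

```python
import math

# Checking r^2 = 1 - RSS/TSS for simple linear regression (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)     # this is the TSS for y

r = sxy / math.sqrt(sxx * syy)               # Pearson correlation coefficient

b1 = sxy / sxx                               # OLS slope
b0 = y_bar - b1 * x_bar                      # OLS intercept
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(abs(r ** 2 - (1 - rss / syy)) < 1e-12)  # True: r^2 equals 1 - RSS/TSS
```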