
Mean squared error

The mean squared error (MSE), also known as mean squared deviation (MSD), is a fundamental statistical measure that quantifies the average of the squares of the differences between estimated values and actual observed values, providing an indication of the accuracy of an estimator or predictive model. Introduced in the early 19th century as part of the method of least squares developed by Carl Friedrich Gauss to handle random errors in astronomical observations, MSE serves as a key criterion for optimizing parameter estimates by minimizing the expected squared deviation. Mathematically, for an estimator \hat{\theta} of a parameter \theta, the MSE is defined as \text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2], which decomposes into the variance of the estimator plus the square of its bias: \text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2. This decomposition highlights MSE's role in balancing precision (low variance) and accuracy (low bias) in estimation, with unbiased estimators having MSE equal to their variance alone. In practice, for a sample of n observations, the empirical MSE is calculated as \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where y_i are actual values and \hat{y}_i are predictions, emphasizing larger errors through squaring while ensuring non-negativity.

MSE is widely applied in regression analysis to evaluate model fit, where it forms the basis for ordinary least squares estimation by minimizing the sum of squared residuals, and in machine learning as a loss function for training algorithms such as linear regression and neural networks due to its differentiability and interpretability. Its sensitivity to outliers, stemming from the quadratic penalty, makes it particularly suitable for normally distributed errors, though alternatives such as the mean absolute error may be preferred otherwise. Overall, MSE remains a cornerstone metric for assessing predictive performance across statistics, machine learning, and related fields, often complemented by its square root, the root mean squared error (RMSE), for interpretation in original units.

Core Concepts

Definition

The mean squared error (MSE) of an estimator \hat{\theta} of a parameter \theta is defined as the expected value of the squared difference between the estimator and the true parameter value, where the expectation is taken over the sampling distribution of the random sample used to compute \hat{\theta}.

\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]

This population MSE quantifies the average squared deviation of the estimator from the true parameter across all possible samples from the underlying distribution. In the sample context, the empirical MSE serves as an estimate of the population MSE and is calculated as the average of the squared differences between observed values and their estimates or predictions for a finite sample of size n.

\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2

The expectation in the population MSE definition requires two basic concepts from probability: a random variable (here, the estimator \hat{\theta}) and its expected value (the integral or sum representing the long-run average under the sampling distribution). MSE applies both to estimators, which infer fixed parameters like means or variances, and to predictors, which forecast realizations of random variables such as future observations; in the predictor case, the target is random rather than fixed, but the MSE remains analogous as E[(T(\mathbf{Y}) - U)^2], where T(\mathbf{Y}) is the predictor and U is the random quantity being predicted.
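The empirical formula translates directly into code. The following minimal Python sketch, using NumPy with illustrative (made-up) observation and prediction arrays, computes the sample MSE from its definition:

```python
import numpy as np

# Hypothetical observed values and corresponding model predictions
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Empirical MSE: the average of the squared differences
mse = np.mean((y - y_hat) ** 2)
print(mse)  # 0.375
```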

Basic Properties

The mean squared error (MSE) is inherently non-negative, as it is computed as the average of squared differences, which are always greater than or equal to zero. Equality holds only when the predictions or estimates match the true values exactly for all observations, resulting in zero error. As a quadratic measure, MSE amplifies larger deviations from the true values due to the squaring operation, making it particularly sensitive to outliers compared to linear error metrics. This emphasis on large errors arises because the squared term grows quadratically, whereas absolute deviations grow only linearly, leading MSE to penalize substantial discrepancies more heavily than metrics like the mean absolute error (MAE). The units of MSE are the square of the units of the original data, which introduces scale dependence and can complicate direct interpretability, as the magnitude of the MSE does not align intuitively with the scale of the variable being estimated.
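To make the outlier sensitivity concrete, the following short sketch (with arbitrary illustrative numbers) compares MSE and MAE on the same predictions before and after one prediction is corrupted by a large error:

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def mae(a, b):
    return np.mean(np.abs(a - b))

y = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.1, 1.9, 3.2, 4.0])
print(mse(y, pred), mae(y, pred))          # small errors: 0.015 and 0.1

# Corrupt one prediction so that its error is 10 units
pred_out = pred.copy()
pred_out[-1] = 14.0
print(mse(y, pred_out), mae(y, pred_out))  # MSE jumps to ~25.0, MAE only to 2.6
```

The single large error dominates the MSE because it enters as 10^2 = 100, whereas it contributes only 10 to the sum of absolute deviations.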

Statistical Roles

In Estimation

In statistical decision theory, the mean squared error (MSE) serves as a fundamental risk function for evaluating and selecting optimal estimators of parameters under squared error loss. The risk associated with an estimator \hat{\theta} of a parameter \theta is defined as the expected value of the squared difference (\hat{\theta} - \theta)^2, which quantifies the average squared deviation and balances both bias and variability in the estimation process. This framework, formalized in the mid-20th century by Abraham Wald, allows for the comparison of decision rules by minimizing the overall risk, leading to Bayes or minimax optimal rules depending on prior information or worst-case considerations.

When comparing estimators under MSE, the sample mean \bar{X} is the minimum MSE estimator among unbiased estimators for the population mean \mu in the case of independent and identically distributed observations, and it is the minimax estimator under squared error loss for the normal distribution. This optimality holds because, in the Bayesian framework with a flat prior, it corresponds to the posterior mean, which minimizes the expected squared error among all possible functions of the data. In broader estimation problems, MSE facilitates the selection of estimators that trade off bias and variance to achieve lower overall risk, such as in shrinkage methods where a slightly biased estimator reduces variance sufficiently to lower MSE compared to unbiased alternatives.

MSE-optimal estimators often exhibit desirable asymptotic properties, including mean-square consistency, where the MSE converges to zero as the sample size increases, implying that the estimator \hat{\theta} converges in probability (and in mean square) to the true \theta. This arises because, for large samples, the variance term diminishes while the bias remains controlled, ensuring reliable inference in well-specified models. Mean-square consistency, in particular, provides a stronger guarantee than mere convergence in probability, as it ties consistency directly to the vanishing of the MSE.

The use of MSE in estimation traces its roots to Carl Friedrich Gauss's development of the least squares method in 1809, which minimizes the sum of squared residuals and laid the groundwork for MSE as a criterion in parameter fitting. Its prominence grew in frequentist statistics during the 20th century, with the MSE of an estimator \hat{\theta} expressed as \text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2, capturing how variance and systematic error contribute to overall estimation inaccuracy.
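As a hedged illustration of the shrinkage point, the following Monte Carlo sketch (with arbitrarily chosen parameters) approximates the risk of the unbiased sample mean and of a simple estimator shrunk toward zero; when the true mean is close to the shrinkage target, the biased estimator attains the lower MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, reps = 0.2, 1.0, 5, 100_000   # true mean chosen near the shrinkage target

X = rng.normal(theta, sigma, size=(reps, n))
mean_est = X.mean(axis=1)        # unbiased sample mean
shrunk_est = 0.9 * mean_est      # slightly biased shrinkage toward zero

print(np.mean((mean_est - theta) ** 2))    # ≈ sigma^2 / n = 0.2
print(np.mean((shrunk_est - theta) ** 2))  # ≈ 0.81 * 0.2 + (0.1 * 0.2)^2 ≈ 0.16
```

The advantage disappears (and reverses) when the true mean lies far from the shrinkage target, which is exactly the bias-variance trade-off that the risk comparison captures.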

Bias-Variance Relationship

The mean squared error (MSE) of an estimator \hat{\theta} for a parameter \theta can be decomposed into the squared bias and the variance of the estimator, providing insight into the sources of estimation error. This decomposition highlights how MSE quantifies both the systematic deviation of the estimator from the true value (bias) and the variability due to sampling (variance).

To derive this, consider the MSE defined as \mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2], where the expectation is taken over the randomness in \hat{\theta}. Rewrite the error term as (\hat{\theta} - \theta) = (\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta), so (\hat{\theta} - \theta)^2 = (\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2. Expanding the square gives (\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2 = (\hat{\theta} - E[\hat{\theta}])^2 + 2(\hat{\theta} - E[\hat{\theta}])(E[\hat{\theta}] - \theta) + (E[\hat{\theta}] - \theta)^2. Taking the expectation and using the linearity of expectation yields E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E[\hat{\theta}])^2] + 2E[(\hat{\theta} - E[\hat{\theta}])(E[\hat{\theta}] - \theta)] + E[(E[\hat{\theta}] - \theta)^2]. The cross term simplifies to zero because E[\hat{\theta} - E[\hat{\theta}]] = 0, and E[\hat{\theta}] - \theta is constant with respect to the expectation over \hat{\theta}. The first term is the variance \mathrm{Var}(\hat{\theta}), and the third term is the squared bias [E[\hat{\theta}] - \theta]^2. Thus, \mathrm{MSE}(\hat{\theta}) = [E[\hat{\theta}] - \theta]^2 + \mathrm{Var}(\hat{\theta}). This derivation relies only on the properties of expectation and variance and applies to biased and unbiased estimators alike.

The term [E[\hat{\theta}] - \theta]^2 represents the systematic error, capturing how far the average value of the estimator deviates from the true parameter, while the variance \mathrm{Var}(\hat{\theta}) measures the random scatter or spread around that average. Together, they explain why MSE serves as a comprehensive measure of estimation accuracy, balancing bias against variance in complex models. In model selection, this decomposition reveals trade-offs, particularly in high-dimensional settings where increasing model flexibility reduces bias but often inflates variance due to limited data relative to the number of parameters. Optimal estimators minimize the sum, favoring simpler models in sparse regimes to avoid excessive variance dominance.

For vector-valued estimators \hat{\boldsymbol{\theta}} estimating \boldsymbol{\theta} \in \mathbb{R}^p, the MSE generalizes to the expected squared Euclidean norm E[\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2], which decomposes as \|\boldsymbol{b}\|^2 + \mathrm{trace}(\boldsymbol{\Sigma}), where \boldsymbol{b} = E[\hat{\boldsymbol{\theta}}] - \boldsymbol{\theta} is the bias vector and \boldsymbol{\Sigma} = \mathrm{Cov}(\hat{\boldsymbol{\theta}}) is the covariance matrix. The trace term aggregates the variances along each dimension, extending the scalar case to multivariate precision assessment.
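The decomposition can also be checked numerically. A minimal Monte Carlo sketch (using the divide-by-n variance estimator of a standard normal sample purely as an example of a biased estimator) verifies that the simulated MSE matches squared bias plus variance:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 1.0, 10, 200_000

# Many repeated samples; estimate sigma^2 with the biased divide-by-n estimator
X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
est = X.var(axis=1, ddof=0)

mse = np.mean((est - sigma2) ** 2)
bias_sq = (est.mean() - sigma2) ** 2
var = est.var()
print(mse, bias_sq + var)   # the two values agree up to Monte Carlo noise (~0.19 each)
```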

Applications in Modeling

In Regression

In linear regression, the mean squared error (MSE) serves as a key measure of model fit, defined as the average of the squared residuals, where residuals are the differences between observed values y_i and predicted values \hat{y}_i. Specifically, for a dataset of size n, MSE is given by \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{\text{RSS}}{n}, with RSS denoting the residual sum of squares. This formulation treats MSE as an estimate of the population variance of the errors, assuming the model is correctly specified.

The ordinary least squares (OLS) estimator minimizes this MSE to find the best-fitting predictions \hat{y} = X \hat{\beta}. Starting from the model y = X \beta + \epsilon, the residual sum of squares is \text{RSS} = (y - X \beta)^T (y - X \beta). Taking the derivative with respect to \beta and setting it to zero yields \frac{\partial \text{RSS}}{\partial \beta} = -2 X^T (y - X \beta) = 0, which solves to the normal equations X^T X \beta = X^T y, and thus the OLS estimator \hat{\beta} = (X^T X)^{-1} X^T y, assuming X^T X is invertible. This minimization ensures the parameters balance the fit across all observations, leading to unbiased and minimum-variance estimates among linear unbiased estimators under the Gauss-Markov assumptions.

For prediction in linear regression, the expected MSE at a new point is approximately \sigma^2 (1 + p/n), where \sigma^2 is the error variance and p is the number of parameters (including the intercept). This accounts for the irreducible error \sigma^2 plus an additional variance term (p/n) \sigma^2 due to estimation uncertainty, which increases with model complexity relative to sample size.

MSE also informs model selection in regression by penalizing overly complex models. The adjusted R^2, defined as R^2_{\text{adj}} = 1 - \frac{(n-1)}{n-p-1} \cdot \frac{\text{MSE}}{s_y^2}, where s_y^2 is the total variance of the response, increases only if adding a predictor sufficiently reduces MSE beyond the degrees-of-freedom penalty; thus, maximizing adjusted R^2 equates to minimizing MSE in comparative assessments.

While MSE extends to nonlinear regression, where it is similarly computed as \text{MSE} = \text{SSE} / (n - p) with SSE the sum of squared residuals from the nonlinear fit, optimization requires iterative methods due to the non-quadratic loss surface, and assumptions like homoscedasticity may not hold. In generalized linear models (GLMs), MSE is less suitable as the primary criterion because responses follow non-Gaussian distributions (e.g., binomial or Poisson), necessitating deviance or likelihood-based measures instead of squared errors on the raw scale to properly account for the variance structure and link functions.
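The normal-equations solution and the resulting MSE are straightforward to compute directly. The sketch below (on synthetic data with arbitrary coefficients and noise level) fits OLS via the normal equations and reports the in-sample MSE as RSS/n:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Design matrix with an intercept column plus two synthetic predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])            # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS via the normal equations: (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
mse = np.mean(residuals ** 2)                     # RSS / n
print(beta_hat, mse)                              # beta_hat ≈ beta_true, mse ≈ sigma^2 = 0.25
```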

As a Loss Function

In machine learning, the mean squared error (MSE) serves as a common loss function to quantify the discrepancy between predicted outputs \hat{y} and true targets y, guiding the optimization of model parameters through gradient-based methods. This loss is defined as the average of squared differences over a training set or mini-batch, providing a differentiable objective that penalizes larger errors more heavily due to the quadratic term.

A key advantage of MSE in optimization lies in its simplicity for computing gradients, which are essential for algorithms like gradient descent. For a single prediction, the derivative of the squared error (y - \hat{y})^2 with respect to \hat{y} is 2(\hat{y} - y), enabling efficient updates to model weights by propagating errors backward. This derivative facilitates the chain rule application in multi-layer models.

In neural networks, MSE plays a central role in training via backpropagation, where the algorithm computes gradients of the total loss with respect to each weight and adjusts them iteratively to minimize the overall loss. The procedure treats the network's output as \hat{y} and backpropagates the error signal using the MSE derivative to update hidden layer weights. This approach was popularized in early neural network research, such as the work by Rumelhart, Hinton, and Williams, who demonstrated its effectiveness for learning internal representations through error minimization.

Under the framework of empirical risk minimization, training with MSE approximates the population risk by minimizing the average squared error on a finite sample, assuming the sample MSE converges to the expected MSE as the data size increases. This principle underpins supervised learning paradigms, where the goal is to find parameters that generalize beyond the training set.

Computationally, MSE supports variants of gradient descent, including batch gradient descent, which computes the exact gradient over the full dataset for stable but resource-intensive updates, and stochastic gradient descent (SGD), which uses gradients from single examples or mini-batches for faster, noisier convergence. In practice, SGD with MSE often accelerates training in large-scale neural networks by introducing beneficial noise that aids in escaping local minima, though it requires careful tuning to manage the variance in gradient estimates.
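To illustrate the gradient formula in the simplest setting, the following sketch (synthetic data, arbitrary learning rate and iteration count) fits a one-feature linear model by full-batch gradient descent on the MSE loss:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-1, 1, n)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=n)   # synthetic data with known slope/intercept

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    y_hat = w * x + b
    err = y_hat - y
    # Gradients of MSE = mean((y_hat - y)^2) with respect to w and b
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # converges close to 3.0 and 1.0
```

Replacing the full-batch means with averages over randomly sampled mini-batches turns this loop into stochastic gradient descent.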

Examples

Population Mean Estimation

In the context of estimating the population mean \mu from a random sample of n independent and identically distributed observations X_1, \dots, X_n with finite mean \mu and variance \sigma^2 > 0, the sample mean \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i is a standard estimator. The mean squared error of \bar{X} is defined as the expected value of (\bar{X} - \mu)^2, which simplifies to \text{MSE}(\bar{X}) = \text{Var}(\bar{X}) = \frac{\sigma^2}{n} due to the unbiasedness of the estimator. The sample mean is unbiased because its expected value equals the true parameter: E[\bar{X}] = \mu, implying a bias of zero, so the MSE equals the variance alone without a bias term. This property holds under the assumption of finite second moments for the observations. To compute the sample mean and an estimate of its MSE for a simple dataset, follow these steps for n=5 observations, such as \{1, 3, 2, 4, 0\} (a short code sketch at the end of this subsection reproduces these computations):
  1. Calculate the sample mean: \bar{X} = \frac{1+3+2+4+0}{5} = 2.
  2. Compute the deviations from the mean: 1-2 = -1, 3-2 = 1, 2-2 = 0, 4-2 = 2, 0-2 = -2.
  3. Square the deviations: (-1)^2 = 1, 1^2 = 1, 0^2 = 0, 2^2 = 4, (-2)^2 = 4.
  4. Sum the squared deviations: 1 + 1 + 0 + 4 + 4 = 10.
  5. Divide by n-1 = 4 to get the sample variance s^2 = \frac{10}{4} = 2.5.
  6. Estimate the variance of \bar{X} as \frac{s^2}{n} = \frac{2.5}{5} = 0.5, which approximates the MSE since the estimator is unbiased.
For a numerical example with the dataset \{1, 2, 3\} where n=3 and \bar{X} = 2:
  • Deviations: 1-2 = -1, 2-2 = 0, 3-2 = 1.
  • Squared deviations: 1, 0, 1; sum = 2.
  • Sample variance s^2 = \frac{2}{2} = 1.
  • Estimated MSE \approx \frac{1}{3} \approx 0.333.
The sample mean is MSE-optimal among all unbiased estimators of \mu in regular models such as the normal distribution, where it attains the Cramér-Rao lower bound \frac{\sigma^2}{n} on the variance, establishing it as the minimum variance unbiased estimator.
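Both worked examples above can be reproduced with a few lines of code. This minimal sketch computes the sample mean and the plug-in estimate s^2/n of the MSE of the mean:

```python
import numpy as np

def mean_and_estimated_mse(data):
    """Return the sample mean and s^2 / n, an estimate of MSE of the (unbiased) sample mean."""
    data = np.asarray(data, dtype=float)
    s2 = data.var(ddof=1)          # sample variance with the n-1 divisor
    return data.mean(), s2 / data.size

print(mean_and_estimated_mse([1, 3, 2, 4, 0]))  # (2.0, 0.5)
print(mean_and_estimated_mse([1, 2, 3]))        # (2.0, 0.333...)
```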

Variance Estimation

The estimation of the population variance \sigma^2 using the mean squared error (MSE) criterion highlights the trade-offs between bias and variance in estimators derived from a random sample X_1, \dots, X_n of independent and identically distributed random variables with finite second moment. Unlike the sample mean, which is unbiased for the population mean without correction, variance estimation introduces bias when using the naive divisor n, necessitating adjustments to achieve desirable properties under MSE.

The standard unbiased estimator of \sigma^2 is the sample variance S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, where \bar{X} = n^{-1} \sum_{i=1}^n X_i is the sample mean; its expected value is E[S^2] = \sigma^2, making it unbiased. In contrast, the biased estimator \bar{S}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 has expected value E[\bar{S}^2] = \frac{n-1}{n} \sigma^2, yielding a bias of -\frac{\sigma^2}{n}. This negative bias arises because the sample mean \bar{X} minimizes the sum of squared deviations within the sample, underestimating the population spread relative to the true mean \mu. The n-1 correction in S^2 accounts for this degrees-of-freedom loss, ensuring unbiasedness.

Since S^2 is unbiased, its MSE equals its variance: \mathrm{MSE}(S^2) = \mathrm{Var}(S^2). For large n, this is approximately \frac{2\sigma^4}{n}, reflecting the quadratic scaling typical of variance estimators. More precisely, \mathrm{Var}(S^2) depends on higher moments such as the kurtosis, but the approximation holds under mild tail conditions. For the biased \bar{S}^2, the MSE is \mathrm{Var}(\bar{S}^2) + \left(\frac{\sigma^2}{n}\right)^2 \approx \frac{2\sigma^4 (n-1)}{n^2} + \frac{\sigma^4}{n^2} = \frac{(2n-1)\sigma^4}{n^2}, which is slightly lower than that of S^2 for finite n but converges to the same rate. The n-1 correction thus buys unbiasedness rather than minimum MSE: among estimators of the form c \sum_{i=1}^n (X_i - \bar{X})^2, the divisor n+1 minimizes the MSE under normality, another instance of accepting a small bias in exchange for lower variance.

To illustrate, consider a sample dataset \{1, 3, 4, 5, 7\} with n=5, \bar{X}=4, and deviations yielding \sum (X_i - \bar{X})^2 = 20. The unbiased sample variance is S^2 = 20 / 4 = 5. If the true \sigma^2 = 5, the approximate MSE of this estimator is \frac{2\sigma^4}{n} = \frac{2 \cdot 25}{5} = 10, quantifying the expected squared deviation over repeated samples of size 5. This contrasts with mean estimation, where no such divisor correction is needed for unbiasedness, emphasizing the additional complexity in variance estimation due to estimating the location parameter simultaneously.
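A quick simulation (standard normal data, arbitrary sample size) makes these MSE formulas concrete by comparing the unbiased and divide-by-n variance estimators over repeated samples:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n, reps = 1.0, 10, 200_000

X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = X.var(axis=1, ddof=1)   # divisor n-1
s2_biased = X.var(axis=1, ddof=0)     # divisor n

print(np.mean((s2_unbiased - sigma2) ** 2))  # ≈ 2*sigma^4/(n-1) ≈ 0.222 under normality
print(np.mean((s2_biased - sigma2) ** 2))    # ≈ (2n-1)*sigma^4/n^2 = 0.19
```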

Gaussian Distribution Case

In the case of a random sample from a Gaussian distribution N(\mu, \sigma^2), the maximum likelihood estimator (MLE) for the mean \mu is the sample mean \bar{X}, which is unbiased and has mean squared error (MSE) equal to \sigma^2 / n. This result holds whether \sigma^2 is known or unknown, as the estimators for \mu and \sigma^2 are independent under normality.

For joint estimation of \mu and \sigma^2, the MLE for the variance is \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2, which is biased downward with expectation (n-1)\sigma^2 / n. The MSE of this estimator is \sigma^4 (2n - 1) / n^2, derived from its distribution: n \hat{\sigma}^2 / \sigma^2 \sim \chi^2_{n-1}, a special case of the Wishart distribution in the univariate setting, where the variance of the chi-squared variable is 2(n-1).

To illustrate these results empirically, consider simulating 1000 independent samples of size n=10 from N(0,1). For each sample, compute \bar{X} and average (\bar{X} - 0)^2 across simulations to estimate the MSE for \mu, yielding a value close to the theoretical \sigma^2 / n = 0.1. Similarly, averaging (\hat{\sigma}^2 - 1)^2 across simulations approximates the theoretical MSE of (2 \cdot 10 - 1)/10^2 = 0.19. This simulation demonstrates how repeated sampling reveals the bias and variability in the variance estimator under normality.

Under Gaussian assumptions, the alignment of MSE minimization with maximum likelihood is evident in the Gauss-Markov theorem for linear models, where the ordinary least squares (OLS) estimator minimizes the MSE among linear unbiased estimators and coincides with the MLE when the errors are Gaussian, ensuring efficiency. For predictive purposes, the closed-form MSE of using \bar{X} to predict a new observation X_{n+1} \sim N(\mu, \sigma^2) is \sigma^2 (1 + 1/n), accounting for both the inherent variance \sigma^2 and the estimation uncertainty \sigma^2 / n; this quantifies the prediction error in Gaussian predictive intervals.
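The simulation described above takes only a few lines; this sketch follows the stated setup (1000 samples of size 10 from N(0,1)) and should print values near 0.1 and 0.19:

```python
import numpy as np

rng = np.random.default_rng(5)
reps, n = 1000, 10

X = rng.normal(0.0, 1.0, size=(reps, n))
xbar = X.mean(axis=1)
sig2_mle = X.var(axis=1, ddof=0)      # MLE of the variance (divisor n)

print(np.mean((xbar - 0.0) ** 2))     # ≈ sigma^2 / n = 0.1
print(np.mean((sig2_mle - 1.0) ** 2)) # ≈ (2n - 1) / n^2 = 0.19
```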

Interpretation and Limitations

Statistical Meaning

The mean squared error (MSE) quantifies the average squared deviation between predicted values and actual observations, serving as a measure of the overall inaccuracy of a statistical or predictive model. This metric corresponds to the second moment of the error distribution about zero, capturing both the variability and any systematic bias in predictions. Low MSE values indicate that predictions are, on average, close to the true values, reflecting high model fidelity to the underlying data-generating process, while high values signal substantial discrepancies that undermine reliability.

A key interpretive tool derived from MSE is the root mean squared error (RMSE), obtained by taking the square root of the MSE. Unlike MSE, which is expressed in squared units of the target variable, RMSE returns to the original units of the data, making it more directly comparable to the magnitude of the observations and easier to contextualize in practical terms. For instance, an RMSE of 5 units implies that predictions deviate from reality by about 5 units on average, providing a tangible sense of error size that MSE's squared form obscures. This property makes RMSE the preferred quantity for interpretive purposes in statistical reporting.

In terms of predictive confidence, MSE underpins the construction of prediction intervals, particularly when errors are assumed to follow a normal distribution. An approximate 95% prediction interval for future observations can be formed as the predicted value ± 2 × RMSE, encompassing the typical range within which most new data points are expected to fall based on historical error patterns. This interval reflects the uncertainty inherent in individual predictions, with narrower bands (lower RMSE) denoting greater precision and broader bands highlighting areas of higher variability or model inadequacy.

For probabilistic forecasting, where models output full probability distributions rather than point estimates, MSE evaluates only the point predictions under squared error loss and falls short compared to logarithmic loss (log-loss). Log-loss, a strictly proper scoring rule, assesses the calibration and sharpness of the entire predictive distribution by penalizing deviations in probability assignments, whereas MSE focuses solely on mean accuracy and may not reward well-calibrated probabilities. Thus, while MSE excels in point prediction scenarios, log-loss is more suitable for settings requiring probabilistic reliability, such as risk assessment or decision-making under uncertainty.
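A small helper function (a sketch under the normal-error assumption stated above, with illustrative numbers) shows how the RMSE and the rough ± 2 × RMSE interval follow from a computed MSE:

```python
import numpy as np

def rmse_and_interval(mse, y_pred):
    """Return the RMSE and an approximate 95% prediction interval (normal errors assumed)."""
    rmse = np.sqrt(mse)
    return rmse, (y_pred - 2 * rmse, y_pred + 2 * rmse)

print(rmse_and_interval(4.0, 10.0))   # RMSE = 2.0, interval ≈ (6.0, 14.0)
```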

Practical Criticisms

One significant practical criticism of the mean squared error (MSE) is its high sensitivity to outliers, as the squaring of errors disproportionately amplifies the influence of large deviations, potentially leading to suboptimal model performance on datasets with anomalous values. This issue has prompted the development of robust alternatives, such as the Huber loss function, which applies a quadratic penalty to small errors similar to MSE but switches to a linear penalty for larger errors to mitigate outlier impact.

Another limitation arises from MSE's scale dependency, where the metric's value changes with linear transformations of the data, making it unsuitable for comparing models across datasets with differing units or magnitudes without normalization. In contrast, relative error measures remain invariant under such transformations, providing a more consistent basis for evaluation in heterogeneous applications.

Theoretically, minimizing MSE implicitly assumes that errors are independent and identically distributed, often overlooking correlation structures such as autocorrelation in time series data, which can invalidate inferences and lead to biased estimates if present. This oversight is particularly problematic in sequential data, where ignoring serial dependence inflates the apparent accuracy of models fitted under MSE minimization.

Empirical studies, including the Makridakis forecasting competitions from the 1980s through the 2020s, have demonstrated that MSE can yield misleading rankings of forecasting methods due to its scale sensitivity and outlier emphasis, with the mean absolute error (MAE) often outperforming MSE in practical accuracy assessments across diverse time series. For instance, analyses of M-competition results highlight MSE's unreliability for method comparisons, as it favors models excelling on a few series while underpenalizing errors on others.

In response to these shortcomings, modern alternatives like the mean absolute percentage error (MAPE) address scale issues by normalizing errors relative to observed values, offering better interpretability in forecasting without the squaring effect. For scenarios involving asymmetric error distributions, quantile loss (also known as pinball loss) provides a targeted approach by penalizing over- and under-predictions differently based on the desired quantile, enabling more tailored evaluation than the symmetric MSE.
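For reference, the two alternatives named above can be written in a few lines each; this sketch (with arbitrary illustrative numbers) contrasts their penalties with the quadratic penalty of squared error:

```python
import numpy as np

def huber_loss(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond: large errors are not squared."""
    abs_err = np.abs(error)
    return np.where(abs_err <= delta,
                    0.5 * error ** 2,
                    delta * (abs_err - 0.5 * delta))

def pinball_loss(y, y_pred, tau=0.9):
    """Quantile (pinball) loss: penalizes under- and over-prediction asymmetrically."""
    diff = y - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

errors = np.array([0.2, -0.5, 3.0])     # includes one outlying error
print(errors ** 2)                      # squared-error penalties: 0.04, 0.25, 9.0
print(huber_loss(errors))               # Huber penalties: 0.02, 0.125, 2.5
print(pinball_loss(np.array([1.0, 2.0]), np.array([0.5, 2.5]), tau=0.9))  # 0.25
```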
