Mean squared error
The mean squared error (MSE), also known as mean squared deviation (MSD), is a fundamental statistical measure that quantifies the average of the squares of the differences between estimated values and actual observed values, providing an indication of the accuracy of an estimator or predictive model. Introduced in the early 19th century as part of the method of least squares by Carl Friedrich Gauss to handle random errors in astronomical observations, MSE serves as a key criterion for optimizing parameter estimates by minimizing the expected squared deviation.

Mathematically, for an estimator \hat{\theta} of a parameter \theta, the MSE is defined as the expected value \text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2], which decomposes into the variance of the estimator plus the square of its bias: \text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2. This decomposition highlights MSE's role in balancing precision (low variance) and accuracy (low bias) in statistical inference, with unbiased estimators having MSE equal to their variance alone. In practice, for a sample of n observations, the empirical MSE is calculated as \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where y_i are actual values and \hat{y}_i are predictions, emphasizing larger errors through squaring while ensuring non-negativity.[1]

MSE is widely applied in regression analysis to evaluate model fit, where it forms the basis for ordinary least squares estimation by minimizing the sum of squared residuals, and in machine learning as a loss function for training algorithms such as linear regression and neural networks, owing to its differentiability and interpretability.[1][2] Its sensitivity to outliers, which stems from the quadratic penalty, makes it particularly suitable for normally distributed errors, though alternatives such as mean absolute error may be preferred otherwise.[2] Overall, MSE remains a cornerstone metric for assessing predictive performance across fields including statistics, engineering, and data science, often complemented by its square root, the root mean squared error (RMSE), for interpretation in the original units.[1]
Core Concepts
Definition
The mean squared error (MSE) of an estimator \hat{\theta} of a parameter \theta is defined as the expected value of the squared difference between the estimator and the true parameter value, where the expectation is taken over the distribution of the random sample used to compute \hat{\theta}:[3]

\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]

This population MSE quantifies the average squared deviation of the estimator from the true parameter across all possible samples from the underlying distribution.[3]

In the sample context, the empirical MSE serves as an estimate of the population MSE and is calculated as the average of the squared differences between observed values and their estimates or predictions for a finite dataset of size n:[4]

\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2

The expected value in the population MSE definition requires basic concepts from probability: a random variable (here, the estimator \hat{\theta}) and its expectation (the integral or sum representing the long-run average under the probability distribution).[5] MSE applies both to estimators, which infer fixed population parameters such as means or variances, and to predictors, which forecast realizations of random variables such as future observations; in the predictor case, the target is random rather than fixed, but the MSE formula remains analogous as E[(T(\mathbf{Y}) - U)^2], where U is the random target.[4]
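As a concrete illustration of the empirical formula, the following Python sketch (assuming NumPy is available; the data values are made up) computes the sample MSE directly from its definition.

```python
import numpy as np

def mse(y_true, y_pred):
    """Empirical MSE: the average of squared differences between
    observed values y_i and their predictions y_hat_i."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Illustrative values: the squared errors are 0.25, 0.25, 0.0, 1.0,
# so the MSE is 1.5 / 4 = 0.375.
print(mse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))
```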
Basic Properties
The mean squared error (MSE) is inherently non-negative, as it is computed as the expected value of squared differences, which are always greater than or equal to zero. Equality holds if and only if the predictions or estimates match the true values exactly for all observations, resulting in zero error.

As a quadratic measure, MSE amplifies larger deviations from the true values due to the squaring operation, making it particularly sensitive to outliers compared to linear error metrics. This emphasis on large errors arises because the squared term grows quadratically, whereas absolute deviations grow only linearly, leading MSE to penalize substantial discrepancies more heavily than metrics like mean absolute error (MAE).

The units of MSE are the square of the units of the original data, which introduces scale dependence and can complicate direct interpretability, as the magnitude does not align intuitively with the measurement scale of the variable being estimated.
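The outlier sensitivity can be seen in a small numerical sketch (Python; the data are invented for illustration), comparing MSE against MAE when a single prediction is badly wrong.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 15.0])  # last prediction is a gross outlier

errors = y_true - y_pred
mse = np.mean(errors ** 2)     # quadratic penalty: the outlier dominates
mae = np.mean(np.abs(errors))  # linear penalty: the outlier adds proportionally

print(f"MSE = {mse:.2f}")  # 20.02 -- almost all of it from the single outlier
print(f"MAE = {mae:.2f}")  # 2.12
```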
Statistical Roles
In Estimation
In statistical decision theory, the mean squared error (MSE) serves as a fundamental risk function for evaluating and selecting optimal estimators of population parameters under squared error loss. The risk associated with an estimator \hat{\theta} of a parameter \theta is defined as the expected value of the squared difference (\hat{\theta} - \theta)^2, which quantifies the average squared deviation and balances both bias and variability in the estimation process. This framework, formalized in the mid-20th century by Abraham Wald, allows for the comparison of decision rules by minimizing the overall risk, leading to Bayes or minimax optimal estimators depending on prior information or worst-case considerations.[3][6]

When comparing estimators under MSE, the sample mean \bar{X} attains the minimum MSE among linear unbiased estimators of the population mean \mu for independent and identically distributed observations, and it is the minimax estimator under squared error loss for the normal distribution. This optimality holds because, in the Bayesian framework with a flat prior, it corresponds to the posterior mean, which minimizes the expected squared error among all possible functions of the data. In broader estimation problems, MSE facilitates the selection of estimators that trade off bias and variance to achieve lower overall risk, as in shrinkage methods, where a slightly biased estimator reduces variance sufficiently to lower MSE compared to unbiased alternatives.[7]

MSE-optimal estimators often exhibit desirable asymptotic properties, including consistency, where the MSE converges to zero as the sample size increases, implying that the estimator \hat{\theta} converges in probability (and in mean square) to the true \theta. This consistency arises because, for large samples, the variance term diminishes while the bias remains controlled, ensuring reliable inference in parametric models. Mean square consistency, in particular, provides a stronger guarantee than mere convergence in probability, as it directly ties to the vanishing of the MSE.[8][9]

The use of MSE in estimation traces its roots to Carl Friedrich Gauss's development of the least squares method in 1809, which minimizes the sum of squared residuals and laid the groundwork for MSE as a criterion in parameter fitting. Its prominence grew in frequentist estimation theory during the 20th century, with the MSE of an estimator \hat{\theta} expressed as \text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2, capturing how variance and systematic error contribute to overall estimation inaccuracy.[10][7]
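As a numerical complement to the shrinkage discussion above, the following Monte Carlo sketch (Python; the distribution, sample size, and shrinkage factor 0.9 are arbitrary illustrative choices) shows a slightly biased estimator achieving lower simulated MSE than the unbiased sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.5, 2.0, 10, 200_000  # true mean near the shrinkage target 0

# Draw many samples and compare two estimators of mu:
# the unbiased sample mean, and a version shrunk toward zero.
samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
shrunk = 0.9 * xbar  # biased, but with 0.81x the variance

mse_xbar = np.mean((xbar - mu) ** 2)      # ~ sigma^2 / n = 0.400
mse_shrunk = np.mean((shrunk - mu) ** 2)  # ~ 0.81*0.400 + (0.1*0.5)^2 = 0.327
print(f"MSE(sample mean) = {mse_xbar:.3f}, MSE(shrunk) = {mse_shrunk:.3f}")
```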
The mean squared error (MSE) of an estimator \hat{\theta} for a parameter \theta can be decomposed into the squared bias and the variance of the estimator, providing insight into the sources of estimation error. This decomposition highlights how MSE quantifies both the systematic deviation of the estimator from the true value (bias) and the variability due to sampling (variance).

To derive this, consider the MSE defined as \mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2], where the expectation is taken over the randomness in \hat{\theta}. Rewrite the error term as

(\hat{\theta} - \theta) = (\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta).

Expanding the square gives

(\hat{\theta} - \theta)^2 = (\hat{\theta} - E[\hat{\theta}])^2 + 2(\hat{\theta} - E[\hat{\theta}])(E[\hat{\theta}] - \theta) + (E[\hat{\theta}] - \theta)^2.

Taking the expectation and using its linearity yields

E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E[\hat{\theta}])^2] + 2E[(\hat{\theta} - E[\hat{\theta}])(E[\hat{\theta}] - \theta)] + E[(E[\hat{\theta}] - \theta)^2].

The cross term vanishes because E[\hat{\theta} - E[\hat{\theta}]] = 0 and E[\hat{\theta}] - \theta is constant with respect to the expectation over \hat{\theta}. The first term is the variance \mathrm{Var}(\hat{\theta}), and the third term is the squared bias [E[\hat{\theta}] - \theta]^2. Thus,

\mathrm{MSE}(\hat{\theta}) = [E[\hat{\theta}] - \theta]^2 + \mathrm{Var}(\hat{\theta}).

This derivation relies only on the basic properties of expectation and variance and holds for biased and unbiased estimators alike. The bias term [E[\hat{\theta}] - \theta]^2 represents the systematic error, capturing how far the average value of the estimator deviates from the true parameter, while the variance \mathrm{Var}(\hat{\theta}) measures the random error, or spread, around that average. Together, they explain why MSE serves as a comprehensive measure of estimator accuracy and precision, balancing consistency against potential overfitting in complex models.

In estimator selection, this decomposition reveals trade-offs, particularly in high-dimensional settings where increasing model flexibility reduces bias but often inflates variance due to limited data relative to parameters. Optimal estimators minimize the sum, favoring simpler models in sparse regimes to avoid excessive variance dominance.

For vector-valued estimators \hat{\boldsymbol{\theta}} estimating \boldsymbol{\theta} \in \mathbb{R}^p, the MSE generalizes to the expected squared Euclidean norm E[\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2], which decomposes as \|\boldsymbol{b}\|^2 + \mathrm{trace}(\boldsymbol{\Sigma}), where \boldsymbol{b} = E[\hat{\boldsymbol{\theta}}] - \boldsymbol{\theta} is the bias vector and \boldsymbol{\Sigma} = \mathrm{Cov}(\hat{\boldsymbol{\theta}}) is the covariance matrix. The trace term aggregates the variances along each dimension, extending the scalar case to multivariate precision assessment.
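The scalar decomposition can be checked by simulation; the sketch below (Python, using an arbitrarily chosen biased estimator of a normal mean) estimates the MSE, squared bias, and variance separately and confirms that the last two sum to the first up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 3.0, 1.5, 8, 200_000

# A deliberately biased estimator of mu: the sample mean scaled by 0.8.
samples = rng.normal(mu, sigma, size=(reps, n))
theta_hat = 0.8 * samples.mean(axis=1)

mse = np.mean((theta_hat - mu) ** 2)
bias_sq = (theta_hat.mean() - mu) ** 2  # theoretical value (0.2*3)^2 = 0.36
var = theta_hat.var()                   # theoretical value 0.64*(1.5^2/8) = 0.18

# Up to Monte Carlo noise, both printed numbers agree (about 0.54).
print(f"MSE          = {mse:.4f}")
print(f"Bias^2 + Var = {bias_sq + var:.4f}")
```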
Applications in Modeling
In Regression
In linear regression, the mean squared error (MSE) serves as a key measure of model fit, defined as the average of the squared residuals, where residuals are the differences between observed values y_i and predicted values \hat{y}_i. Specifically, for a dataset of size n, MSE is given by \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{\text{RSS}}{n}, with RSS denoting the residual sum of squares.[11] This formulation treats MSE as an estimate of the population variance of the errors, assuming the model is correctly specified.[12]

The ordinary least squares (OLS) estimator minimizes this MSE to find the best-fitting linear model \hat{y} = X \hat{\beta}. Starting from the model y = X \beta + \epsilon, the RSS is \text{RSS} = (y - X \beta)^T (y - X \beta). Taking the derivative with respect to \beta and setting it to zero yields \frac{\partial \text{RSS}}{\partial \beta} = -2 X^T (y - X \beta) = 0, which leads to the normal equations X^T X \beta = X^T y and thus the OLS estimator \hat{\beta} = (X^T X)^{-1} X^T y, assuming X^T X is invertible.[11] This minimization ensures the parameters balance the fit across all observations, leading to unbiased and minimum-variance estimates under the Gauss-Markov assumptions.[13]

For prediction error in linear regression, the expected MSE at a new point is approximately \sigma^2 (1 + p/n), where \sigma^2 is the error variance and p is the number of parameters (including the intercept). This accounts for the irreducible error \sigma^2 plus an additional variance term (p/n) \sigma^2 due to parameter estimation uncertainty, which increases with model complexity relative to sample size.[14]

MSE also informs model selection in regression by penalizing overly complex models. The adjusted R^2, defined as R^2_{\text{adj}} = 1 - \frac{(n-1)}{n-p-1} \cdot \frac{\text{MSE}}{s_y^2}, where \text{MSE} = \text{RSS}/n and s_y^2 is the total variance of the response computed with the same divisor n, increases only if adding a predictor sufficiently reduces MSE beyond the degrees-of-freedom penalty; thus, maximizing adjusted R^2 equates to minimizing MSE in comparative assessments.[15]

While MSE extends to nonlinear regression, where it is similarly computed as \text{MSE} = \text{SSE} / (n - p) with SSE the sum of squared residuals from the nonlinear fit, optimization requires iterative methods due to the non-quadratic loss surface, and assumptions such as homoscedasticity may not hold.[16] In generalized linear models (GLMs), MSE is less suitable as the primary criterion because responses follow non-Gaussian distributions (e.g., binomial or Poisson), so deviance or quasi-likelihood measures are used instead of squared errors on the raw scale to properly account for the variance structure and link functions.[17]
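A minimal sketch of the normal-equations solution on simulated data (Python; the design matrix, coefficients, and noise level are illustrative assumptions, not from the cited sources) ties the OLS derivation to the empirical MSE.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Simulated data from y = X beta + eps, with an intercept column in X.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X beta = X^T y directly
# (numerically preferable to forming an explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
mse = np.mean(residuals ** 2)  # RSS / n, as defined in the text
print(beta_hat)  # close to [1.0, 2.0, -0.5]
print(mse)       # close to the error variance 0.3^2 = 0.09
```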
As a Loss Function
In machine learning, the mean squared error (MSE) serves as a common loss function to quantify the discrepancy between predicted outputs \hat{y} and true targets y, guiding the optimization of model parameters through gradient-based methods. This loss is defined as the average of squared differences over a dataset, providing a differentiable objective that penalizes larger errors more heavily due to the quadratic term.[18]

A key advantage of MSE in optimization lies in its simplicity for computing gradients, which are essential for algorithms like gradient descent. For a single prediction, the partial derivative of the squared error (y - \hat{y})^2 with respect to \hat{y} is 2(\hat{y} - y), enabling efficient updates to model weights by propagating errors backward. This derivative facilitates the application of the chain rule in multi-layer models.[19]

In neural networks, MSE plays a central role in training via backpropagation, where the algorithm computes gradients of the total loss with respect to each weight and adjusts them iteratively to minimize the overall error. The procedure treats the network's output as \hat{y} and backpropagates the error signal using the MSE derivative to update hidden layer connections. This approach was popularized in early neural network research, such as the work by Rumelhart, Hinton, and Williams, who demonstrated its effectiveness for learning internal representations through error minimization.[20]

Under the framework of empirical risk minimization, training with MSE approximates the population risk by minimizing the average squared error on a finite sample, assuming the sample MSE converges to the expected MSE as the data size increases. This principle underpins supervised learning paradigms, where the goal is to find parameters that generalize beyond the training set.[21]

Computationally, MSE supports variants of gradient descent, including batch gradient descent, which computes the exact gradient over the full dataset for stable but resource-intensive updates, and stochastic gradient descent (SGD), which uses gradients from single examples or mini-batches for faster, noisier convergence. In practice, SGD with MSE often accelerates training in large-scale neural networks by introducing beneficial noise that aids escaping local minima, though it requires careful learning-rate tuning to manage the variance of the gradient estimates.[22]
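To make the gradient computation concrete, here is a hedged sketch (Python; the one-dimensional model, learning rate, and iteration count are illustrative choices) of full-batch gradient descent on the MSE loss for a simple linear fit.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=n)  # true slope 3, intercept 1

# Fit y ~ w*x + b by full-batch gradient descent on the MSE loss.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    err = (w * x + b) - y
    # Gradients of mean(err^2); the factor 2 comes from the quadratic term.
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # approaches the true values 3.0 and 1.0
```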
Examples
Population Mean Estimation
In the context of estimating the population mean \mu from a random sample of n independent and identically distributed observations X_1, \dots, X_n with finite mean \mu and variance \sigma^2 > 0, the sample mean \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i is a standard estimator. The mean squared error of \bar{X} is defined as the expected value of (\bar{X} - \mu)^2, which simplifies to \text{MSE}(\bar{X}) = \text{Var}(\bar{X}) = \frac{\sigma^2}{n} due to the unbiasedness of the estimator.[23][24]

The sample mean is unbiased because its expected value equals the true parameter: E[\bar{X}] = \mu, implying a bias of zero, so the MSE equals the variance alone, without a bias term.[23] This property holds under the assumption of finite second moments for the observations.

To compute the sample mean and an estimate of its MSE for a simple dataset, follow these steps for n=5 observations, such as \{1, 3, 2, 4, 0\} (both worked examples are reproduced in code after the lists below):

- Calculate the sample mean: \bar{X} = \frac{1+3+2+4+0}{5} = 2.
- Compute the deviations from the mean: 1-2 = -1, 3-2 = 1, 2-2 = 0, 4-2 = 2, 0-2 = -2.
- Square the deviations: (-1)^2 = 1, 1^2 = 1, 0^2 = 0, 2^2 = 4, (-2)^2 = 4.
- Sum the squared deviations: 1 + 1 + 0 + 4 + 4 = 10.
- Divide by n-1 = 4 to get the sample variance s^2 = \frac{10}{4} = 2.5.
- Estimate the variance of \bar{X} as \frac{s^2}{n} = \frac{2.5}{5} = 0.5, which approximates the MSE since the estimator is unbiased.[8]
Repeating the procedure for a smaller dataset of n=3 observations, \{1, 2, 3\}, with sample mean \bar{X} = 2:

- Deviations: 1-2 = -1, 2-2 = 0, 3-2 = 1.
- Squared deviations: 1, 0, 1; sum = 2.
- Sample variance s^2 = \frac{2}{2} = 1.
- Estimated MSE: \frac{s^2}{n} = \frac{1}{3} \approx 0.333.[8]
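Both hand computations can be reproduced in a few lines; the following Python sketch (the helper name is ours, chosen for illustration) implements the s^2/n estimate used above.

```python
import numpy as np

def estimated_mse_of_mean(data):
    """Estimate Var(xbar) = s^2 / n, which approximates MSE(xbar)
    because the sample mean is unbiased (zero bias term)."""
    data = np.asarray(data, dtype=float)
    s2 = data.var(ddof=1)  # sample variance with divisor n - 1
    return s2 / data.size

print(estimated_mse_of_mean([1, 3, 2, 4, 0]))  # 2.5 / 5 = 0.5
print(estimated_mse_of_mean([1, 2, 3]))        # 1 / 3 ≈ 0.333
```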