
Bias of an estimator

In statistics, the bias of an estimator \hat{\theta} for a parameter \theta is defined as the difference between the expected value of the estimator and the true parameter value, mathematically expressed as \operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta. An estimator is termed unbiased if this bias equals zero, meaning its expected value matches the parameter exactly, regardless of the sample size. This property is fundamental in evaluating the reliability of estimators, as bias quantifies systematic over- or underestimation in repeated sampling. While unbiasedness is a desirable quality for estimators, it is not the sole criterion for performance, as biased estimators can sometimes outperform unbiased ones in terms of overall accuracy. The mean squared error (MSE) provides a more comprehensive measure, decomposing into the squared bias and the variance of the estimator: \operatorname{MSE}(\hat{\theta}) = [\operatorname{Bias}(\hat{\theta})]^2 + \operatorname{Var}(\hat{\theta}). This trade-off is particularly relevant in finite samples, where introducing a small bias may reduce variance and thus lower the MSE, as seen in techniques like shrinkage estimation. For instance, the sample mean \bar{X} is an unbiased estimator of the population mean \mu under standard assumptions, but the sample variance computed with denominator n, \frac{1}{n} \sum (X_i - \bar{X})^2, is biased downward for the population variance \sigma^2, with an unbiased correction obtained by using n-1 in the denominator. Assessing and mitigating bias is crucial in statistical inference, as persistent bias can lead to incorrect conclusions in hypothesis testing, confidence intervals, and predictive modeling. Properties such as consistency, for which it suffices that bias and variance both approach zero as the sample size increases, further extend the analysis, ensuring estimators improve with more data even if they are initially biased. In practice, statisticians prioritize estimators that balance low bias with controlled variance, often guided by asymptotic analysis or simulation studies to verify performance across distributions.
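
As a quick illustration of this trade-off, the following sketch (with illustrative values of \mu, \sigma, n, and the shrinkage factor, not taken from any source) compares the unbiased sample mean with a deliberately shrunken version by Monte Carlo; the shrunken estimator is biased yet attains a lower MSE.

    # A minimal Monte Carlo sketch (illustrative assumptions throughout) of the
    # bias-variance trade-off: a shrunken sample mean c * xbar is biased for mu,
    # yet can have lower MSE than the unbiased sample mean when mu is small
    # relative to the noise level sigma/sqrt(n).
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, c = 1.0, 2.0, 4, 0.8
    reps = 200_000

    samples = rng.normal(mu, sigma, size=(reps, n))
    xbar = samples.mean(axis=1)          # unbiased estimator of mu
    shrunk = c * xbar                    # biased (shrinkage) estimator

    for name, est in [("sample mean", xbar), ("0.8 * sample mean", shrunk)]:
        bias = est.mean() - mu
        var = est.var()
        mse = np.mean((est - mu) ** 2)   # approximately bias^2 + variance
        print(f"{name:>18}: bias={bias:+.3f}, var={var:.3f}, MSE={mse:.3f}")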

Fundamentals

Definition

In statistics, the bias of an estimator quantifies the extent to which its expected value deviates systematically from the true value of the parameter being estimated. This measure captures the average tendency of the estimator to over- or underestimate the parameter across repeated samples from the underlying distribution. Formally, for an estimator \hat{\theta} of a parameter \theta, the bias is defined as \operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta, where the expectation E[\cdot] is taken with respect to the sampling distribution of the data under the true value \theta. A positive bias indicates that the estimator overestimates \theta on average, while a negative bias signifies underestimation. An estimator with zero bias is termed unbiased, meaning its expected value equals the true parameter value. Bias represents systematic error inherent in the estimator's design or the sampling process, distinct from variance, which measures the random variability or spread of the estimator around its own expected value. While bias addresses the accuracy of the estimator's expected value relative to \theta, variance quantifies its dispersion across samples. The expectation in the bias formula must specifically reflect the true distribution governed by \theta, ensuring the assessment pertains to the estimator's long-run average performance under the correct model.
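
The definition lends itself to a direct Monte Carlo check: simulate many samples, average the estimator, and subtract the true parameter. The sketch below uses an assumed exponential model and the maximum likelihood estimator of its rate, whose exact bias \theta/(n-1) is known, purely for illustration.

    # Illustrative sketch (not from the source): approximating Bias = E[theta_hat] - theta
    # by Monte Carlo. Here theta is the rate of an exponential distribution and
    # theta_hat = 1 / xbar is its MLE, with exact expectation n*theta/(n-1).
    import numpy as np

    rng = np.random.default_rng(1)
    theta, n, reps = 2.0, 10, 500_000

    x = rng.exponential(scale=1.0 / theta, size=(reps, n))
    theta_hat = 1.0 / x.mean(axis=1)

    print("Monte Carlo bias:", theta_hat.mean() - theta)   # ~ theta/(n-1) = 0.222
    print("Exact bias      :", theta / (n - 1))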

Mathematical Properties

The bias of an estimator possesses several key algebraic and probabilistic properties that facilitate its analysis in statistical inference. One fundamental property is its linearity. For constants a and b, and estimators \hat{\theta} and \hat{\phi} of parameters \theta and \phi, respectively, the bias of the linear combination a\hat{\theta} + b\hat{\phi} equals a \cdot \operatorname{Bias}(\hat{\theta}) + b \cdot \operatorname{Bias}(\hat{\phi}). This follows directly from the linearity of expectation, as \operatorname{Bias}(a\hat{\theta} + b\hat{\phi}) = a E[\hat{\theta}] + b E[\hat{\phi}] - (a\theta + b\phi). Another important characteristic is the behavior of bias under parameter transformations. If \psi is an affine function, such as \psi(\hat{\theta}) = a\hat{\theta} + c for constants a and c, then \operatorname{Bias}(\psi(\hat{\theta})) = \psi(E[\hat{\theta}]) - \psi(\theta) = a \cdot \operatorname{Bias}(\hat{\theta}). Affine transformations therefore simply rescale the bias, so linear reparameterizations do not alter the relative bias structure. For nonlinear \psi, however, E[\psi(\hat{\theta})] generally differs from \psi(E[\hat{\theta}]), so \psi(\hat{\theta}) can be biased for \psi(\theta) even when \hat{\theta} is unbiased, with additional terms arising via Jensen's inequality. Bias is inherently a function of the true parameter \theta, denoted \operatorname{Bias}(\hat{\theta}; \theta) = E_\theta[\hat{\theta}] - \theta. This dependence implies that the magnitude and direction of bias can vary across the parameter space, complicating uniform assessments of estimator performance. Unbiasedness requires \operatorname{Bias}(\hat{\theta}; \theta) = 0 for all \theta in the parameter space, a stringent condition that not all estimators satisfy globally. In large-sample settings, the asymptotic bias distinguishes finite-sample deviations from long-run behavior. For a sequence of estimators \hat{\theta}_n based on n observations, the asymptotic bias is often examined through \lim_{n \to \infty} n \cdot \operatorname{Bias}(\hat{\theta}_n; \theta), which captures higher-order terms of order O(1/n) even if the estimator is consistent (i.e., \operatorname{Bias}(\hat{\theta}_n; \theta) \to 0). This limit, when finite and nonzero, quantifies the persistent O(1/n) bias of the estimator, as seen in higher-order expansions for maximum likelihood estimators.
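
A small numerical confirmation of the linearity property, under an assumed normal model with arbitrary constants a and b, is sketched below.

    # Illustrative check (assumed setup, not from the source) of linearity of bias:
    # Bias(a*theta_hat + b*phi_hat) = a*Bias(theta_hat) + b*Bias(phi_hat).
    # theta_hat is the 1/n sample variance (biased for sigma^2); phi_hat is the
    # sample mean (unbiased for mu); a, b, mu, sigma, n are arbitrary choices.
    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma, n, reps = 3.0, 2.0, 8, 400_000
    a, b = 2.0, -0.5

    x = rng.normal(mu, sigma, size=(reps, n))
    theta_hat = x.var(axis=1)            # ddof=0: biased estimator of sigma^2
    phi_hat = x.mean(axis=1)             # unbiased estimator of mu

    bias_theta = theta_hat.mean() - sigma**2      # ~ -sigma^2/n = -0.5
    bias_phi = phi_hat.mean() - mu                # ~ 0
    combo = a * theta_hat + b * phi_hat
    bias_combo = combo.mean() - (a * sigma**2 + b * mu)

    print("a*Bias(theta) + b*Bias(phi):", a * bias_theta + b * bias_phi)
    print("Bias of linear combination :", bias_combo)   # agrees, by linearity of expectation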

Unbiased Estimation

Median-Unbiased Estimators

A median-unbiased estimator \hat{\theta} of a parameter \theta is defined such that the median of its sampling distribution equals \theta, meaning P(\hat{\theta} \leq \theta) = P(\hat{\theta} \geq \theta) = 1/2 for estimators with continuous distributions. This property ensures equal probabilities of under- and over-estimation, providing a form of unbiasedness centered on the median rather than the mean. Unlike mean-unbiased estimators, where E[\hat{\theta}] = \theta, median-unbiasedness does not imply mean-unbiasedness, and the converse also fails, particularly for asymmetric distributions where the mean may be pulled by outliers. This distinction makes median-unbiased estimators valuable in scenarios with skewed or heavy-tailed data, as they avoid the influence of extreme values that distort the mean. Median-unbiased estimators can be constructed using order statistics, such as by estimating quantiles from the ordered sample and inverting the distribution function to solve for the parameter, ensuring the median property holds uniformly. Alternatively, in adaptive or sequential designs, they may be derived via sign tests, which leverage the signs of differences to balance over- and under-estimation probabilities. These estimators offer advantages in reducing the typical (median) estimation error for heavy-tailed or asymmetric data, often outperforming mean-unbiased alternatives by mitigating the impact of outliers.
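
As a hedged illustration of the quantile-inversion construction (the exponential model and all numbers below are assumptions, not from the source), the sum of i.i.d. exponential observations can be rescaled by the median of a Gamma(n, 1) distribution to give an exactly median-unbiased estimator of the mean.

    # Sketch: for i.i.d. Exponential data with mean theta, S_n = sum(X_i) is
    # Gamma(n, scale=theta), so S_n divided by the median of Gamma(n, scale=1)
    # is exactly median-unbiased for theta, while the sample mean is mean-unbiased.
    import numpy as np
    from scipy.stats import gamma

    rng = np.random.default_rng(3)
    theta, n, reps = 5.0, 7, 200_000

    m_n = gamma(a=n, scale=1.0).median()           # median of Gamma(n, 1)
    s = rng.exponential(scale=theta, size=(reps, n)).sum(axis=1)

    theta_med = s / m_n                            # median-unbiased estimator
    theta_mean = s / n                             # mean-unbiased estimator (sample mean)

    print("P(median-unbiased <= theta):", np.mean(theta_med <= theta))   # ~ 0.5
    print("P(sample mean     <= theta):", np.mean(theta_mean <= theta))  # > 0.5 (skewed)
    print("E[median-unbiased] - theta :", theta_med.mean() - theta)      # nonzero mean bias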

Implications of Unbiasedness

Unbiased estimators exist for a wide range of models under regularity conditions, such as the availability of complete sufficient statistics for the parameter of interest. The Lehmann-Scheffé theorem establishes that, in such cases, conditioning any unbiased estimator on a complete sufficient statistic yields the unique minimum-variance unbiased estimator (MVUE) of the parameter, ensuring the existence of an optimal unbiased estimator within the class of functions of that statistic. However, unbiased estimators are generally not unique, as multiple functions of the data can satisfy the unbiasedness condition E[\hat{\theta}(X)] = \theta for a given \theta. The Rao-Blackwell theorem addresses this non-uniqueness by providing a refinement procedure: starting from any unbiased estimator, its conditional expectation given a sufficient statistic produces another unbiased estimator with variance no larger than the original, often strictly smaller unless the original was already a function of the sufficient statistic. When the sufficient statistic is complete, this Rao-Blackwellized estimator coincides with the MVUE guaranteed by the Lehmann-Scheffé theorem, highlighting a pathway to efficiency among unbiased estimators. Despite these theoretical advantages, unbiasedness has notable limitations and does not imply other key properties like consistency or efficiency. An unbiased estimator may fail to be consistent, meaning it does not converge in probability to the true parameter as the sample size increases, if its variance remains bounded away from zero. For instance, certain unbiased estimators based on fixed subsets of data exhibit persistent variance regardless of sample size, precluding consistency. Unbiasedness also does not ensure asymptotic efficiency, as the estimator's variance may not achieve the Cramér-Rao lower bound even in large samples without additional structure on the model. In finite samples, pursuing unbiasedness often involves trade-offs with variance, where unbiased estimators can exhibit substantially higher variability than mildly biased alternatives, leading to poorer performance in terms of mean squared error. This variance inflation arises because the constraint of zero bias restricts the class of admissible estimators, sometimes forcing reliance on less stable data summaries. A common pitfall is overemphasizing unbiasedness at the expense of overall accuracy, resulting in high-variance estimators that are practically unreliable despite theoretical appeal.
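
The classical Poisson zero-probability example sketches how Rao-Blackwellization works in practice; the parameter values below are illustrative assumptions.

    # Rao-Blackwell sketch (classical Poisson example; parameter values assumed):
    # the crude unbiased estimator of p = exp(-lambda) is the indicator 1{X_1 = 0};
    # conditioning on the complete sufficient statistic T = sum(X_i) gives
    # ((n-1)/n)**T, which remains unbiased but has much smaller variance.
    import numpy as np

    rng = np.random.default_rng(4)
    lam, n, reps = 1.5, 10, 400_000
    p_true = np.exp(-lam)

    x = rng.poisson(lam, size=(reps, n))
    crude = (x[:, 0] == 0).astype(float)             # unbiased, high variance
    t = x.sum(axis=1)
    rao_blackwell = ((n - 1) / n) ** t               # E[crude | T], still unbiased

    for name, est in [("indicator 1{X1=0}", crude), ("Rao-Blackwellized", rao_blackwell)]:
        print(f"{name:>18}: mean={est.mean():.4f} (true p={p_true:.4f}), var={est.var():.5f}")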

Illustrative Examples

Sample Variance

The sample variance serves as a classic example of a biased estimator in statistics, illustrating how the use of the sample mean in place of the population mean introduces systematic underestimation of the population variance. Consider a random sample X_1, X_2, \dots, X_n drawn from a distribution with population mean \mu and finite population variance \sigma^2 > 0. The biased sample variance estimator is defined as S^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2, where \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i is the sample mean. Under the assumption of independent and identically distributed (i.i.d.) samples, the expected value of this estimator is E[S^2] = \frac{n-1}{n} \sigma^2, which is less than \sigma^2 for finite n > 1. This results in a negative bias of \text{Bias}(S^2) = E[S^2] - \sigma^2 = -\frac{\sigma^2}{n}, meaning the estimator systematically underestimates the true variance. To derive this expected value, expand the sum of squared deviations: \sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n (X_i - \mu + \mu - \bar{X})^2 = \sum_{i=1}^n (X_i - \mu)^2 - 2(\bar{X} - \mu) \sum_{i=1}^n (X_i - \mu) + n (\bar{X} - \mu)^2 = \sum_{i=1}^n (X_i - \mu)^2 - n (\bar{X} - \mu)^2, since \sum_{i=1}^n (X_i - \mu) = n (\bar{X} - \mu). Taking expectations and dividing by n, with E[(X_i - \mu)^2] = \sigma^2 and E[(\bar{X} - \mu)^2] = \text{Var}(\bar{X}) = \frac{\sigma^2}{n}, yields E[S^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n} \sigma^2. This derivation holds for i.i.d. samples from any distribution with finite second moment, though the normality assumption simplifies distributional properties like the chi-squared scaling of the sum. An unbiased alternative corrects for this bias by adjusting the denominator to account for the degree of freedom lost in estimating \mu by \bar{X}: s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, with E[s^2] = \sigma^2 and thus \text{Bias}(s^2) = 0. This adjustment, known as Bessel's correction, ensures unbiasedness under the same i.i.d. and finite variance assumptions.
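
A brief simulation, with arbitrary illustrative parameters, confirms the (n-1)/n factor and the effect of Bessel's correction.

    # Numerical check (illustrative parameters assumed) of the sample-variance bias:
    # the 1/n estimator has expectation (n-1)/n * sigma^2, while Bessel's 1/(n-1)
    # correction is unbiased.
    import numpy as np

    rng = np.random.default_rng(5)
    mu, sigma, n, reps = 0.0, 3.0, 5, 500_000

    x = rng.normal(mu, sigma, size=(reps, n))
    s2_biased = x.var(axis=1, ddof=0)       # divides by n
    s2_unbiased = x.var(axis=1, ddof=1)     # divides by n - 1 (Bessel's correction)

    print("sigma^2              :", sigma**2)                       # 9
    print("E[1/n estimator]     :", s2_biased.mean())               # ~ (n-1)/n * 9 = 7.2
    print("(n-1)/n * sigma^2    :", (n - 1) / n * sigma**2)
    print("E[1/(n-1) estimator] :", s2_unbiased.mean())             # ~ 9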

Poisson Probability Estimation

In the context of the Poisson distribution with rate parameter \lambda > 0, the probability of observing zero events is given by p = e^{-\lambda}. The method-of-moments estimator for \lambda is the sample mean \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i, where X_1, \dots, X_n are i.i.d. Poisson(\lambda) random variables, and \bar{X} is unbiased for \lambda since E[\bar{X}] = \lambda. Applying the same transformation to obtain an estimator for p, the method of moments yields \hat{p} = e^{-\bar{X}}. Because the map x \mapsto e^{-x} is convex, Jensen's inequality implies that this estimator exhibits positive bias, overestimating p on average for finite sample sizes n. The exact expected value of the estimator is E[\hat{p}] = \exp\left( n \lambda (e^{-1/n} - 1) \right), which follows from the moment generating function of the Poisson distribution: the sum \sum X_i \sim \text{Poisson}(n\lambda), so E[e^{t \bar{X}}] = \exp\left( n \lambda (e^{t/n} - 1) \right), evaluated at t = -1. Using the Taylor expansion e^{-1/n} = 1 - \frac{1}{n} + \frac{1}{2n^2} + O\left(\frac{1}{n^3}\right), it follows that n (e^{-1/n} - 1) = -1 + \frac{1}{2n} + O\left(\frac{1}{n^2}\right), so E[\hat{p}] = e^{-\lambda} \exp\left( \frac{\lambda}{2n} + O\left(\frac{1}{n^2}\right) \right) \approx e^{-\lambda} \left(1 + \frac{\lambda}{2n}\right). This approximation reveals a positive bias of approximately e^{-\lambda} \frac{\lambda}{2n} for large n, confirming that \hat{p} tends to overestimate the zero probability, particularly in small samples where the variability of \bar{X} around \lambda amplifies the effect of Jensen's inequality. This bias has practical implications in applications such as rare event modeling, where overestimating the probability of no occurrences can understate the true event risk. For instance, in quality control or insurance claims analysis modeled via Poisson processes, small-sample estimates may inflate the perceived stability. Johnson (1951) compares such method-of-moments estimators with alternatives for the zero-class probability, highlighting their finite-sample shortcomings. Asymptotically, the bias vanishes: \lim_{n \to \infty} E[\hat{p}] = e^{-\lambda}, since e^{-1/n} \to 1 implies n (e^{-1/n} - 1) \to -1, rendering \hat{p} consistent and asymptotically unbiased for p. This aligns with the general property that method-of-moments estimators of smooth transformations are asymptotically unbiased under standard regularity conditions.
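
The exact and approximate bias expressions can be checked numerically; the values of \lambda and n below are illustrative assumptions.

    # Sketch verifying the exact expectation E[exp(-xbar)] = exp(n*lambda*(exp(-1/n)-1))
    # and the resulting positive bias of p_hat = exp(-xbar) for p = exp(-lambda).
    import numpy as np

    rng = np.random.default_rng(6)
    lam, n, reps = 2.0, 5, 1_000_000

    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    p_hat = np.exp(-xbar)

    p_true = np.exp(-lam)
    exact_mean = np.exp(n * lam * np.expm1(-1.0 / n))      # closed-form E[p_hat]
    approx_bias = p_true * lam / (2 * n)                   # leading-order bias term

    print("true p                :", p_true)
    print("Monte Carlo E[p_hat]  :", p_hat.mean())
    print("exact E[p_hat]        :", exact_mean)
    print("bias (exact)          :", exact_mean - p_true)
    print("bias (approx, lam/2n) :", approx_bias)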

Discrete Uniform Maximum

In the discrete uniform distribution on the set \{1, 2, \dots, m\}, where m is the unknown upper bound parameter, each integer from 1 to m has equal probability 1/m. Consider an independent random sample X_1, X_2, \dots, X_n drawn from this distribution. The maximum likelihood estimator for m is the sample maximum \hat{m} = \max(X_1, \dots, X_n), as it is the smallest value consistent with the observed data under the uniform model. The cumulative distribution function of the sample maximum \hat{m} is given by P(\hat{m} \leq k) = (k/m)^n for integer k = 1, 2, \dots, m, since all observations must fall at or below k. The probability mass function follows as P(\hat{m} = k) = (k/m)^n - ((k-1)/m)^n. The expected value of this estimator is E[\hat{m}] = \sum_{k=1}^m k \left[ (k/m)^n - ((k-1)/m)^n \right], or equivalently E[\hat{m}] = \sum_{k=1}^m \left[1 - ((k-1)/m)^n \right]. This expectation is less than m, revealing a systematic downward bias in \hat{m}. The bias arises because the sample maximum never exceeds m and falls strictly below it whenever no observation attains the upper bound, an event with probability ((m-1)/m)^n. For fixed n, the magnitude of the bias increases with m, highlighting the estimator's tendency to underestimate the true upper bound, particularly in small samples. For large m, the bias is approximately -m/(n+1). To obtain an approximately unbiased estimator, apply the correction \tilde{m} = \frac{n+1}{n} \hat{m} - 1, which mirrors the exactly unbiased estimator \frac{n+1}{n} \hat{m} for the continuous uniform distribution on (0, m) and provides a good adjustment for large m in the i.i.d. discrete setting. Such corrections exemplify methods to mitigate bias while preserving desirable properties like the sufficiency of the sample maximum.
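
A short simulation with assumed values of m and n illustrates both the downward bias of the sample maximum and the effect of the correction.

    # Sketch (illustrative m and n assumed) of the downward bias of the sample
    # maximum for the discrete uniform {1, ..., m}, and of the approximate
    # correction m_tilde = (n+1)/n * max - 1.
    import numpy as np

    rng = np.random.default_rng(7)
    m, n, reps = 100, 5, 400_000

    x = rng.integers(1, m + 1, size=(reps, n))      # uniform on {1, ..., m}
    m_hat = x.max(axis=1)                           # MLE: sample maximum
    m_tilde = (n + 1) / n * m_hat - 1               # approximately unbiased correction

    print("true m                 :", m)
    print("E[sample maximum]      :", m_hat.mean())       # noticeably below m
    print("approx bias -m/(n+1)   :", -m / (n + 1))
    print("E[corrected estimator] :", m_tilde.mean())      # close to m; small residual remains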

Extensions and Variations

Bias Under Alternative Loss Functions

In decision theory, the bias of an estimator can be generalized beyond squared error loss to arbitrary loss functions L(\hat{\theta}, \theta), where the optimal point estimate is defined as the value \theta^* that minimizes the expected loss E[L(\hat{\theta}, \theta)]. This generalization shifts the notion of bias from deviation relative to the mean (the optimal point under squared loss) to deviation from \theta^*, allowing for asymmetric or robust criteria that better suit specific applications. For instance, under the absolute loss L(\hat{\theta}, \theta) = |\hat{\theta} - \theta|, the minimizer \theta^* is the median of the distribution, leading to the concept of median bias, defined as the difference between the median of the estimator's distribution and the true \theta. Median-unbiased estimators, which have zero median bias, are particularly valuable in scenarios with heavy-tailed distributions where the mean may not exist or may be sensitive to outliers. A prominent example of an asymmetric loss function is the linear-exponential (LINEX) loss, defined as L(\hat{\theta}, \theta) = e^{b(\hat{\theta} - \theta)} - b(\hat{\theta} - \theta) - 1, where b \neq 0 controls the degree and direction of asymmetry, penalizing overestimation exponentially and underestimation only linearly when b > 0 (and the reverse when b < 0). Introduced by Varian for real estate valuation and further analyzed by Zellner in a Bayesian framework, the LINEX loss leads to an optimal Bayes estimate of the form \theta^* = -\frac{1}{b} \ln E[e^{-b \theta} \mid \text{data}]. This formulation is useful in forecasting and economic applications where errors in one direction (e.g., underpredicting demand) have disproportionately higher costs. Another approach for comparing estimators under non-quadratic losses is Pitman closeness, which evaluates the probability that one estimator is closer to the true parameter than another under a specified metric, such as absolute deviation: P(|\hat{\theta}_1 - \theta| < |\hat{\theta}_2 - \theta|). Developed by Pitman as a comparison criterion, it complements traditional bias measures by focusing on relative performance rather than expected deviation, proving especially insightful when mean bias fails to rank estimators effectively due to asymmetry or heavy tails. In robust statistics, such alternatives to mean bias are essential for non-quadratic losses, as they mitigate sensitivity to model misspecification or contamination, prioritizing median- or mode-based measures of deviation.
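
As a hedged sketch of how the LINEX-optimal point estimate is computed in practice (the normal posterior and all numbers below are assumptions for illustration), one can evaluate \theta^* = -\frac{1}{b} \ln E[e^{-b\theta} \mid \text{data}] from posterior draws and compare it with the posterior mean.

    # LINEX sketch: with b > 0 the optimal estimate sits below the posterior mean,
    # guarding against overestimation; the mock posterior N(10, 2^2) is assumed.
    import numpy as np

    rng = np.random.default_rng(8)
    b = 1.5
    theta_draws = rng.normal(loc=10.0, scale=2.0, size=1_000_000)   # mock posterior sample

    linex_estimate = -np.log(np.mean(np.exp(-b * theta_draws))) / b
    # For a normal posterior N(m, s^2) the closed form is m - b*s^2/2.
    closed_form = 10.0 - b * 2.0**2 / 2

    print("posterior mean         :", theta_draws.mean())   # ~ 10.0
    print("LINEX-optimal estimate :", linex_estimate)        # ~ 7.0
    print("closed form m - b*s^2/2:", closed_form)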

Impact of Parameter Transformations

When an estimator \hat{\theta} is unbiased for a parameter \theta, meaning E[\hat{\theta}] = \theta, applying a nonlinear transformation g to the estimator generally introduces bias in the estimate of g(\theta). Specifically, the bias of g(\hat{\theta}) is given by E[g(\hat{\theta})] - g(\theta), which is guaranteed to vanish only when g is linear (or affine), due to the linearity of expectation. For nonlinear g, Jensen's inequality implies that the direction of the bias depends on the curvature of g: if g is convex, E[g(\hat{\theta})] \geq g(E[\hat{\theta}]) = g(\theta), resulting in positive bias, and the reverse holds for concave g. This property highlights that unbiasedness is not preserved under nonlinear transformations, a fundamental limitation in estimation. To quantify this transformation-induced bias, a second-order Taylor expansion around \theta provides an approximation. Assuming \hat{\theta} has mean \theta and variance \sigma^2 = \mathrm{Var}(\hat{\theta}), the expansion yields: g(\hat{\theta}) \approx g(\theta) + g'(\theta)(\hat{\theta} - \theta) + \frac{1}{2} g''(\theta) (\hat{\theta} - \theta)^2. Taking expectations, the first-order term vanishes due to unbiasedness, leaving E[g(\hat{\theta})] \approx g(\theta) + \frac{1}{2} g''(\theta) \sigma^2, so the bias is approximately \frac{1}{2} g''(\theta) \sigma^2. This second-order approximation, derived from the delta method extended to higher orders, reveals how the curvature of g (via g'') interacts with the estimator's variability to produce bias. A concrete illustration occurs with the logarithmic transformation, common for scale parameters or rates. For g(x) = \log x, the second derivative is g''(x) = -1/x^2, which is negative, indicating concavity. Thus, \mathrm{Bias}(\log \hat{\theta}) \approx \frac{1}{2} \left( -\frac{1}{\theta^2} \right) \sigma^2 = -\frac{1}{2} \frac{\sigma^2}{\theta^2}. This negative bias means \log \hat{\theta} systematically underestimates \log \theta, with the magnitude proportional to the relative variance \sigma^2 / \theta^2. The approximation holds asymptotically as \sigma^2 \to 0, as is typical in large samples. The dependence of bias on parameterization underscores a key issue: unbiasedness is not invariant under reparameterization of the model. For instance, the sample variance s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 is an unbiased estimator of the population variance \sigma^2 for independent identically distributed samples with finite variance, satisfying E[s^2] = \sigma^2. However, the sample standard deviation s = \sqrt{s^2} is a biased estimator of \sigma, with E[s] < \sigma, because the square-root function is concave and induces negative bias via the above approximation (where g(x) = \sqrt{x} has g''(x) = -1/(4 x^{3/2}) < 0). This example demonstrates how switching from estimating \sigma^2 to \sigma alters the bias properties, emphasizing the need to consider the target parameter's scale. In practice, to mitigate transformation-induced bias, bias-corrected estimators can be constructed from the second-order approximation. For the logarithmic transformation, a corrected estimator is \log \hat{\theta} + \frac{1}{2} \frac{\hat{\sigma}^2}{\hat{\theta}^2}, where \hat{\sigma}^2 estimates \sigma^2; this adds a positive term to offset the negative bias. More generally, since the delta method gives \mathrm{Var}(\log \hat{\theta}) \approx \sigma^2 / \theta^2, the correction can be expressed as \log \hat{\theta} + \frac{1}{2} \widehat{\mathrm{Var}}(\log \hat{\theta}). Such adjustments improve finite-sample performance, though they introduce additional variability and require reliable variance estimates.
These methods are widely applied in fields where logarithmic reparameterizations are routine.
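
The following sketch, under an assumed exponential model chosen only for illustration, shows the negative bias of \log \hat{\theta} and the effect of the second-order correction.

    # Sketch of the negative bias of log(theta_hat) when theta_hat is unbiased,
    # approximated by -(1/2)*sigma^2/theta^2, plus the second-order correction
    # log(theta_hat) + (1/2)*Var_hat/theta_hat^2.
    import numpy as np

    rng = np.random.default_rng(9)
    theta, n, reps = 4.0, 10, 500_000

    x = rng.exponential(scale=theta, size=(reps, n))
    theta_hat = x.mean(axis=1)                     # unbiased for theta, variance theta^2/n
    var_hat = x.var(axis=1, ddof=1) / n            # estimated variance of theta_hat

    log_naive = np.log(theta_hat)
    log_corrected = log_naive + 0.5 * var_hat / theta_hat**2

    sigma2 = theta**2 / n                          # true Var(theta_hat)
    print("approx bias -sigma^2/(2 theta^2):", -0.5 * sigma2 / theta**2)   # -0.05
    print("observed bias, naive            :", log_naive.mean() - np.log(theta))
    print("observed bias, corrected        :", log_corrected.mean() - np.log(theta))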

Error Metrics and Relationships

Connection to Variance and Mean Squared Error

The mean squared error (MSE) of an estimator \hat{\theta} for a parameter \theta provides a comprehensive measure of its accuracy, decomposing into the variance of the estimator and the square of its bias. Specifically, the MSE is defined as \operatorname{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2], and it satisfies the decomposition \operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + [\operatorname{Bias}(\hat{\theta})]^2, where \operatorname{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2] and \operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta. To derive this decomposition, begin by expanding the squared error term: \mathbb{E}[(\hat{\theta} - \theta)^2] = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2]. Let U = \hat{\theta} - \mathbb{E}[\hat{\theta}] and B = \mathbb{E}[\hat{\theta}] - \theta, so the expression becomes \mathbb{E}[(U + B)^2] = \mathbb{E}[U^2 + 2UB + B^2]. Since B is constant with respect to the expectation over the randomness in \hat{\theta}, this simplifies to \mathbb{E}[U^2] + 2B \mathbb{E}[U] + B^2. The cross term vanishes because \mathbb{E}[U] = 0, yielding \mathbb{E}[U^2] + B^2 = \operatorname{Var}(\hat{\theta}) + [\operatorname{Bias}(\hat{\theta})]^2. This holds under the assumption that the second moment of \hat{\theta} exists, as is standard in estimation theory. The decomposition reveals that bias contributes to the overall error quadratically, so its impact on the MSE grows rapidly as the bias moves away from zero. For an unbiased estimator, where \operatorname{Bias}(\hat{\theta}) = 0, the MSE reduces exactly to the variance, implying that unbiasedness alone does not guarantee minimal MSE unless the variance is also minimized. In practice, estimators often involve a trade-off: techniques that reduce bias, such as using more flexible models, can increase variance by overfitting to sample noise, potentially raising the total MSE. Conversely, simpler models with low variance may introduce substantial bias, again elevating MSE. Optimal estimator selection thus requires balancing these components to minimize MSE for the given problem.
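
The identity can be verified numerically for any estimator; the sketch below uses the 1/n sample variance under assumed parameters.

    # Quick numerical check of the decomposition MSE = Var + Bias^2, here for the
    # 1/n sample variance under an assumed normal model.
    import numpy as np

    rng = np.random.default_rng(10)
    sigma, n, reps = 2.0, 6, 500_000

    x = rng.normal(0.0, sigma, size=(reps, n))
    est = x.var(axis=1, ddof=0)                     # biased estimator of sigma^2

    mse = np.mean((est - sigma**2) ** 2)
    bias = est.mean() - sigma**2
    var = est.var()

    print("MSE          :", mse)
    print("Var + Bias^2 :", var + bias**2)          # matches the MSE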

Population Variance Estimation Example

Consider a random sample X_1, X_2, \dots, X_n drawn from a normal distribution with mean \mu and variance \sigma^2. The biased estimator of the variance is defined as S^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2, where \bar{X} is the sample mean. This estimator has E[S^2] = \frac{n-1}{n} \sigma^2, resulting in a bias of -\frac{\sigma^2}{n}. Under the normality assumption, the variance of this biased estimator is \operatorname{Var}(S^2) = \frac{2\sigma^4}{n} \left(1 - \frac{1}{n}\right). Applying the mean squared error decomposition \operatorname{MSE} = \operatorname{Var} + (\operatorname{Bias})^2, the MSE of S^2 is \frac{2\sigma^4}{n} - \frac{\sigma^4}{n^2}. The unbiased estimator is S_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, which has bias 0 by construction. Its variance is \operatorname{Var}(S_{n-1}^2) = \frac{2\sigma^4}{n-1}, so its MSE equals \frac{2\sigma^4}{n-1}. For every finite n \geq 2, the MSE of the biased estimator S^2 is lower than that of the unbiased estimator S_{n-1}^2, despite the nonzero bias of S^2, with the gap most pronounced in small samples; for example, when n=2, \operatorname{MSE}(S^2) = 0.75 \sigma^4 < 2 \sigma^4 = \operatorname{MSE}(S_{n-1}^2). This illustrates the bias-variance trade-off, where the reduced variance of the biased estimator outweighs its squared bias in terms of overall MSE. Asymptotically, as n \to \infty, the MSE of both estimators approaches 0, since their biases and variances diminish.
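
Because the bias and variance formulas above are exact under normality, the comparison can be tabulated directly; the snippet below evaluates both MSEs for a few sample sizes with \sigma = 1 chosen for illustration.

    # Exact comparison (no simulation; formulas from the text, sigma = 1 for
    # illustration) of the MSE of the 1/n and 1/(n-1) variance estimators.
    for n in (2, 5, 10, 50):
        sigma4 = 1.0                                       # sigma^4 with sigma = 1
        mse_biased = (2 * n - 1) / n**2 * sigma4           # = 2/n - 1/n^2
        mse_unbiased = 2.0 / (n - 1) * sigma4
        print(f"n={n:3d}: MSE(1/n)={mse_biased:.4f}  MSE(1/(n-1))={mse_unbiased:.4f}")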

Bayesian Interpretation

Prior Influence on Bias

In Bayesian estimation, the concept of bias differs fundamentally from the frequentist view, where an estimator is unbiased if its expected value equals the true \theta across repeated sampling. Bayesian estimators, such as the posterior mean, incorporate prior information and are thus typically biased in the frequentist sense, as their expected value under the sampling distribution generally does not equal \theta unless the prior is non-informative. With non-informative priors, however, Bayesian procedures achieve asymptotic unbiasedness in large samples, where the influence of the prior diminishes relative to the data. For conjugate priors, the prior-induced bias arises explicitly through the structure of the posterior. Consider the normal-normal conjugate case, where observations are i.i.d. from N(\theta, \sigma^2) with known \sigma^2, and the prior is N(\mu_0, \tau^2). The posterior mean is given by \hat{\theta} = w \mu_0 + (1 - w) \bar{x}, where w = \frac{n_0}{n_0 + n}, n_0 = \sigma^2/\tau^2 is the effective prior sample size, and n is the sample size. The expected value under the sampling distribution is then E[\hat{\theta}] = w \mu_0 + (1 - w) \theta, yielding a bias of w (\mu_0 - \theta). This bias shrinks to zero as n \to \infty since w \to 0, but for finite n it reflects the prior's pull toward \mu_0. Informative priors deliberately introduce such bias to achieve shrinkage, trading off increased bias for reduced variance and overall lower mean squared error (MSE), a principle analogous to the James-Stein estimator. The James-Stein estimator, interpretable as an empirical Bayes procedure, shrinks multiple normal means toward a grand mean, with risk E[\|\hat{\mu}_{JS} - \mu\|^2] = N B + 3(1 - B) for dimension N and shrinkage factor B < 1, outperforming the unbiased maximum likelihood estimator (risk N) by reducing MSE through controlled bias. This shrinkage effect mirrors how Bayesian priors with positive weight on informative beliefs improve predictive performance, especially in high dimensions or small samples. Non-informative priors like the Jeffreys prior, proportional to the square root of the Fisher information, aim to minimize prior influence and achieve approximate frequentist unbiasedness asymptotically. Under the Jeffreys prior, the posterior mean coincides with the maximum likelihood estimator up to higher-order terms, ensuring the prior-induced bias vanishes at rate o(1/\sqrt{n}). This invariance property makes it a standard choice for objective Bayesian analysis seeking frequentist-like behavior in the limit.
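
A short simulation under assumed prior and data settings confirms the closed-form bias w(\mu_0 - \theta).

    # Sketch (all prior and data settings assumed) of the frequentist bias
    # w*(mu0 - theta) of the conjugate normal posterior mean, where
    # w = n0/(n0 + n) and n0 = sigma^2/tau^2 is the effective prior sample size.
    import numpy as np

    rng = np.random.default_rng(11)
    theta_true, sigma = 2.0, 1.0          # true mean and known sampling sd
    mu0, tau = 0.0, 0.5                   # prior N(mu0, tau^2)
    n, reps = 10, 300_000

    n0 = sigma**2 / tau**2                # effective prior sample size (= 4)
    w = n0 / (n0 + n)                     # weight on the prior mean

    xbar = rng.normal(theta_true, sigma / np.sqrt(n), size=reps)
    post_mean = w * mu0 + (1 - w) * xbar

    print("theoretical bias w*(mu0 - theta):", w * (mu0 - theta_true))   # ~ -0.571
    print("Monte Carlo bias                :", post_mean.mean() - theta_true)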

Posterior Mean as Estimator

In Bayesian inference, the posterior mean serves as a point estimator for the parameter of interest, defined as the expected value of the parameter under the posterior distribution: \hat{\theta} = \int \theta \, \pi(\theta \mid \text{data}) \, d\theta, where \pi(\theta \mid \text{data}) is the posterior density, proportional to the product of the prior \pi(\theta) and the likelihood f(\text{data} \mid \theta). This estimator minimizes the expected posterior loss under squared error loss, making it the optimal Bayes action in that decision-theoretic framework. From a frequentist perspective, the bias of the posterior mean is assessed by its expected value over the sampling distribution of the data given the true parameter \theta_{\text{true}}: \text{Bias}(\hat{\theta}) = \int \left[ \int \theta \, \pi(\theta \mid \text{data}) \, d\theta - \theta_{\text{true}} \right] f(\text{data} \mid \theta_{\text{true}}) \, d\text{data}. This bias is generally nonzero when the prior \pi(\theta) is informative, as the posterior mean shrinks the likelihood-based estimate toward values favored by the prior, introducing a systematic deviation from \theta_{\text{true}}. In cases with noninformative or improper priors, the posterior mean may coincide with the maximum likelihood estimator, which can itself exhibit bias in finite samples. Regarding admissibility under squared error loss, the posterior mean is a Bayes estimator with respect to the chosen prior and is often admissible when the prior is proper. However, in high-dimensional settings (dimension p \geq 3), the posterior mean under an improper flat prior, which reduces to the sample mean for multivariate normal data, is inadmissible, as demonstrated by the Stein phenomenon, where shrinkage estimators dominate it in terms of frequentist risk. A concrete illustration occurs in the conjugate normal-normal model, where data X_1, \dots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\theta, \sigma^2) with known \sigma^2 and \theta \sim \mathcal{N}(\mu_0, \tau^2). The posterior is \mathcal{N}\left( \frac{n \bar{X} / \sigma^2 + \mu_0 / \tau^2}{n / \sigma^2 + 1 / \tau^2}, \left( n / \sigma^2 + 1 / \tau^2 \right)^{-1} \right), so the posterior mean is \hat{\theta} = \frac{n \bar{X} / \sigma^2 + \mu_0 / \tau^2}{n / \sigma^2 + 1 / \tau^2}. This weighted average pulls the estimate toward the prior mean \mu_0, with the shrinkage intensity decreasing as the sample size n increases or the prior precision 1/\tau^2 weakens relative to the data precision n / \sigma^2.
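
As a hedged sketch of this frequentist assessment (the model settings below are assumptions), one can simulate full datasets from the true \theta and average the posterior mean over them, showing the prior-induced bias shrinking as n grows.

    # Sketch of how the frequentist bias of the normal-normal posterior mean toward
    # the prior mean mu0 diminishes with the sample size n, estimated by simulating
    # full datasets from the true theta.
    import numpy as np

    rng = np.random.default_rng(12)
    theta_true, sigma = 1.0, 2.0
    mu0, tau = 0.0, 1.0

    for n in (2, 10, 50, 200):
        reps = 100_000
        x = rng.normal(theta_true, sigma, size=(reps, n))
        xbar = x.mean(axis=1)
        prec_data, prec_prior = n / sigma**2, 1 / tau**2
        post_mean = (prec_data * xbar + prec_prior * mu0) / (prec_data + prec_prior)
        print(f"n={n:4d}: frequentist bias of posterior mean = "
              f"{post_mean.mean() - theta_true:+.4f}")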
