
Estimator

In statistics, an estimator is a rule or function that computes an estimate of an unknown parameter from observed sample data, serving as a key tool in inferential statistics to draw conclusions about a larger population based on a sample of observations. Estimators are typically expressed as functions of random variables from the sample, producing a value that approximates the parameter of interest, such as a mean or variance. The quality of an estimator is assessed through several desirable properties, including unbiasedness, where the expected value of the estimator equals the true parameter value, ensuring no systematic over- or underestimation on average. Consistency requires that as the sample size grows, the estimator converges in probability to the true parameter, providing reliability with more data. Efficiency measures how well the estimator utilizes the sample by having the smallest possible variance among unbiased estimators, often linked to the Cramér–Rao lower bound for optimal performance. Additional properties like sufficiency, where the estimator captures all relevant information from the sample without redundancy, further guide the selection of estimators in practice. Robustness, which indicates resistance to deviations from assumptions, is another important consideration. Estimators form the foundation of point estimation, which yields a single value for the parameter, and interval estimation, which provides a range of plausible values with a confidence level, both essential in hypothesis testing and decision-making across many scientific and applied fields. Common examples include the sample mean as an unbiased and consistent estimator of the population mean, and the sample variance as an estimator adjusted for degrees of freedom to achieve unbiasedness. The development of estimators often balances trade-offs, such as between bias and variance, to minimize mean squared error, a comprehensive measure of accuracy defined as the expected squared difference between the estimator and the true parameter.

Background and History

Origins in Probability and Statistics

The concept of estimation in statistics traces its roots to early probability theory, particularly through Jacob Bernoulli's formulation of the law of large numbers in 1713. In his posthumously published work Ars Conjectandi, Bernoulli demonstrated that the relative frequency of an event in repeated trials converges to its true probability as the number of trials increases, providing a foundational idea for inferring unknown parameters from sample data. This weak law of large numbers served as a precursor to statistical estimation by establishing the reliability of sample proportions as approximations of population probabilities, influencing later developments in consistent estimation techniques. Building on this probabilistic foundation, Pierre-Simon Laplace advanced the field in the late 18th century with his development of inverse probability, a method now recognized as the basis of Bayesian inference. In works from the 1770s, including his 1774 memoir, Laplace introduced the idea of calculating the probability of causes from observed effects, enabling point estimates of unknown parameters by combining prior beliefs with data. This approach marked a significant step toward systematic statistical estimation, applying probability to reconcile inconsistent observations in fields like astronomy. Carl Friedrich Gauss further solidified estimation principles in 1809 with his least squares method, presented in Theoria Motus Corporum Coelestium. Gauss proposed minimizing the sum of squared residuals to estimate parameters in linear models, assuming normally distributed errors, which equated to maximum likelihood estimation under those conditions. This technique represented an early formal estimator for regression parameters, bridging observational data and probabilistic models, and laid groundwork for modern mathematical statistics. During the 19th century, statistical inference transitioned from these probabilistic origins toward more structured frequentist frameworks, emphasizing long-run frequencies over subjective priors. This shift promoted objective methods for parameter estimation based on repeated sampling, setting the stage for 20th-century developments in hypothesis testing and confidence intervals.

Key Milestones and Contributors

The development of estimator theory in the 20th century was profoundly shaped by Ronald A. Fisher's seminal 1922 paper, "On the Mathematical Foundations of Theoretical Statistics," which introduced maximum likelihood estimation as a method for obtaining estimators that maximize the probability of observing the given data under the assumed model. In this work, Fisher also introduced the concept of estimator efficiency and derived an early lower bound for the variance of unbiased estimators based on the information content of the sample (using the second derivative of the log-likelihood, now known as Fisher information), providing a benchmark for comparing estimator performance. The Cramér–Rao lower bound, which formalizes this variance bound more generally, was later derived independently by Harald Cramér and C. Radhakrishna Rao in the 1940s. Building on these foundations, Jerzy Neyman and Egon S. Pearson advanced the field in the 1930s through their collaborative efforts on hypothesis testing and estimation criteria. Their 1933 paper, "On the Problem of the Most Efficient Tests of Statistical Hypotheses," established the Neyman-Pearson lemma, which identifies uniformly most powerful tests and has implications for the selection of unbiased estimators that minimize error rates in testing contexts. This framework emphasized the role of unbiasedness in estimators, influencing subsequent evaluations of estimator reliability under finite samples. Asymptotic theory emerged as a cornerstone of modern estimator analysis in the mid-1940s, with key contributions from Harald Cramér and C. Radhakrishna Rao. Cramér's 1946 book, Mathematical Methods of Statistics, systematically developed asymptotic properties such as consistency and asymptotic normality of estimators, deriving the Cramér-Rao lower bound for the asymptotic variance of unbiased estimators under regularity conditions. Independently, Rao's 1945 paper, "Information and the Accuracy Attainable in the Estimation of Statistical Parameters," introduced the Fisher information matrix and its role in bounding estimator precision, laying groundwork for multiparameter asymptotic theory. Peter J. Huber's 1964 paper, "Robust Estimation of a Location Parameter," marked a pivotal shift toward robustness in estimator theory, proposing M-estimators that minimize the impact of outliers by using a convex loss function instead of squared error. This approach demonstrated that robust estimators achieve near-efficiency under the normal distribution while maintaining stability under contamination, influencing the design of outlier-resistant methods in applied statistics. In the late 1970s, computational innovations transformed estimator evaluation, exemplified by Bradley Efron's 1979 introduction of the bootstrap method in "Bootstrap Methods: Another Look at the Jackknife." This resampling technique allows empirical estimation of an estimator's sampling distribution without parametric assumptions, enabling bias and variance assessment for complex statistics and extending accessibility to non-asymptotic properties.

Core Concepts

Definition of an Estimator

In statistics, particularly within the framework of parametric inference, an estimator \hat{\theta} is formally defined as a function of a random sample X_1, \dots, X_n drawn from a population, mapping the sample values to an estimate of an unknown parameter \theta in the parameter space; that is, \hat{\theta} = g(X_1, \dots, X_n) for some function g. This definition positions the estimator as a statistic specifically designed to infer the value of \theta, where the sample is assumed to follow a distribution parameterized by \theta. A key distinction exists between the estimator itself, which is the general rule or procedure (often a formula), and the estimate, which is the concrete numerical value realized when the estimator is applied to a specific observed sample. For instance, the sample mean \bar{X} serves as an estimator for the population mean \mu, but a particular value like \bar{X} = 5.2 computed from observed data constitutes the estimate. Since the sample observations X_1, \dots, X_n are random variables, the estimator \hat{\theta} inherits this randomness and is thus a random variable, subject to sampling variability that depends on the underlying distribution. This random nature underscores the estimator's role in providing a probabilistic approximation to \theta across repeated sampling from the parametric model.

Estimand, Statistic, and Estimate

In statistical inference, the estimand is the target quantity of interest that an analysis seeks to quantify, typically an unknown parameter \theta of the population distribution or a functional g(\theta) thereof, such as the mean or variance. This concept anchors the analysis by specifying precisely what aspect of the underlying population is being investigated, independent of the data collected. A statistic is any function of the observable random sample drawn from the population, serving as a summary measure derived directly from the data. Estimators form a subset of statistics, specifically those selected to approximate the estimand; for instance, the sample mean \bar{X} is an estimator targeting the population mean \mu as the estimand when the sample arises from distributions sharing a common expected value. In contrast, the estimate is the concrete numerical value produced by applying the estimator to a specific observed sample, such as computing \bar{x} = 5.2 from data points x_1, \dots, x_n. These terms highlight the progression from theoretical target (estimand) to data-driven approximation (statistic and estimator) to realized output (estimate), ensuring precise communication in inference. Point estimators like the sample mean yield a single value, whereas related constructs such as confidence intervals provide a range of plausible estimand values to account for sampling variability, though they differ by quantifying uncertainty rather than pinpointing a single approximation.
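As a minimal illustration of these three terms, the following Python sketch (an assumption-laden toy example: a normal population with mean 10, a sample of size 25, and NumPy for random number generation) treats the population mean as the estimand, the sample-mean rule as the estimator, and the number that rule returns on one observed sample as the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimand: the population mean mu of the data-generating distribution
# (unknown in practice; fixed here so the example can be checked).
mu = 10.0

# Estimator: a rule, i.e. a function mapping any sample to a number.
def sample_mean(sample):
    return np.mean(sample)

# One observed sample of size 25 from the population.
x_obs = rng.normal(loc=mu, scale=2.0, size=25)

# Estimate: the realized value of the estimator on this particular sample.
estimate = sample_mean(x_obs)
print(f"estimand mu = {mu}, estimate = {estimate:.3f}")  # varies from sample to sample
```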

Finite-Sample Properties

Bias and Unbiasedness

In statistics, the bias of an estimator \hat{\theta} for a parameter \theta is defined as the difference between its expected value and the true parameter value: B(\hat{\theta}) = E[\hat{\theta}] - \theta, where the expectation E is taken over the sampling distribution of the data. This measures the systematic tendency of the estimator to over- or underestimate the parameter on average across repeated samples from the population. A positive bias indicates overestimation, while a negative bias indicates underestimation. An estimator \hat{\theta} is unbiased if its expected value equals the true parameter for all possible values of \theta: E[\hat{\theta}] = \theta. Unbiasedness is preserved under linear combinations; if \hat{\theta}_1 and \hat{\theta}_2 are unbiased for \theta_1 and \theta_2, respectively, then a\hat{\theta}_1 + b\hat{\theta}_2 is unbiased for a\theta_1 + b\theta_2 for any constants a and b, due to the linearity of expectation. A classic example is the sample mean \bar{X}, which is an unbiased estimator of the population mean \mu for independent and identically distributed samples from any distribution with finite mean, as E[\bar{X}] = \mu. Unbiasedness ensures that the estimator is correct in the long run, meaning that over many repeated samples, the average value of \hat{\theta} will equal \theta, providing reliability in terms of systematic accuracy. However, it does not guarantee precision in individual samples, as the spread of estimates around the true value—measured by variance—can still be large.
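These properties can be checked by simulation. The sketch below (illustrative choices of distribution, sample size, and replication count) confirms that the sample mean shows no systematic deviation from \mu, while, for contrast, a variance estimator that divides by n rather than n-1 exhibits a negative bias of roughly -\sigma^2/n.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 5.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)             # sample mean of each replication
var_n = samples.var(axis=1, ddof=0)     # variance estimator dividing by n

print("empirical bias of sample mean:  ", xbar.mean() - mu)           # close to 0
print("empirical bias of 1/n variance: ", var_n.mean() - sigma**2)    # close to -sigma^2/n = -0.4
```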

Variance and Sampling Deviation

The variance of an estimator \hat{\theta} quantifies the expected squared deviation of the estimator from its own expected value, providing a measure of its dispersion across repeated samples from the same population. Formally, it is defined as \operatorname{Var}(\hat{\theta}) = E\left[(\hat{\theta} - E[\hat{\theta}])^2\right], where the expectation is taken over the sampling distribution of \hat{\theta}. This definition parallels the variance of any random variable and highlights the inherent variability in \hat{\theta} due to sampling randomness, independent of any systematic offset from the true parameter value. The sampling deviation of an estimator is captured by its standard deviation, \sqrt{\operatorname{Var}(\hat{\theta})}, which represents the typical scale of fluctuations in \hat{\theta} around E[\hat{\theta}]. In estimation contexts, this quantity is commonly termed the standard error of the estimator, serving as a practical indicator of precision in inferential procedures such as confidence intervals. For instance, larger standard errors imply greater uncertainty in the estimate, often arising from limited data or inherent population variability. Several factors influence the variance of an estimator. Primarily, it decreases with increasing sample size n, typically scaling as 1/n for many common estimators like the sample mean, thereby improving precision as more observations are collected. Additionally, the shape of the underlying population distribution affects the variance; for example, distributions with heavier tails or higher kurtosis tend to yield estimators with larger variance due to greater spread in the sampling distribution. A concrete example is the sample variance estimator S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 for a population variance \sigma^2, assuming independent and identically distributed observations X_1, \dots, X_n. Under normality (X_i \sim N(\mu, \sigma^2)), the variance of S^2 is given by \operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1}, illustrating both the roughly 1/n scaling with sample size and the dependence on the population variance itself. This underscores how the estimator's variability diminishes as n grows, while remaining sensitive to \sigma^2.
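The formulas above lend themselves to a quick numerical check. The following sketch (with arbitrary parameter values) compares the Monte Carlo variance of S^2 under normality against 2\sigma^4/(n-1) and shows the roughly 1/n decay of the sample mean's variance.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, reps = 2.0, 200_000

for n in (5, 20, 80):
    x = rng.normal(0.0, sigma, size=(reps, n))
    s2 = x.var(axis=1, ddof=1)                 # unbiased sample variance S^2
    print(f"n={n:3d}  Var(S^2): empirical={s2.var():.4f}  "
          f"theory={2 * sigma**4 / (n - 1):.4f}   "
          f"Var(mean): empirical={x.mean(axis=1).var():.4f}  theory={sigma**2 / n:.4f}")
```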

Mean Squared Error

The mean squared error (MSE) of an estimator \hat{\theta} for a parameter \theta is defined as the expected value of the squared difference between the estimator and the true parameter value: \operatorname{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]. This metric quantifies the average squared deviation of the estimator from \theta over repeated samples, serving as a comprehensive measure of accuracy in decision-theoretic frameworks. The MSE decomposes into the variance of the estimator plus the square of its bias: \operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + [B(\hat{\theta})]^2, where B(\hat{\theta}) = E[\hat{\theta}] - \theta denotes the bias. This decomposition highlights the trade-off between bias, which reflects systematic over- or underestimation, and variance, which captures random fluctuations around the estimator's expected value; thus, MSE penalizes both sources of error equally in squared terms. For unbiased estimators, where B(\hat{\theta}) = 0, the MSE simplifies to the variance, emphasizing the role of variability in such cases. To compare estimators operating on different scales or under varying parameter ranges, the relative MSE is employed, typically computed as the ratio of one estimator's MSE to that of a benchmark, such as the sample mean. This facilitates scale-invariant assessments of relative accuracy. In minimax estimation criteria, MSE functions as the primary loss measure, with the objective of selecting an estimator that minimizes the supremum of the MSE over the possible values of \theta, thereby ensuring robust performance against the worst-case scenario.
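The decomposition quoted above follows from adding and subtracting E[\hat{\theta}] inside the square; a short derivation, requiring only finite second moments, is:

```latex
\begin{aligned}
\operatorname{MSE}(\hat{\theta})
  &= E\!\left[(\hat{\theta} - \theta)^2\right]
   = E\!\left[\big(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta\big)^2\right] \\
  &= E\!\left[(\hat{\theta} - E[\hat{\theta}])^2\right]
     + 2\big(E[\hat{\theta}] - \theta\big)\,E\!\left[\hat{\theta} - E[\hat{\theta}]\right]
     + \big(E[\hat{\theta}] - \theta\big)^2 \\
  &= \operatorname{Var}(\hat{\theta}) + \big[B(\hat{\theta})\big]^2 .
\end{aligned}
```

The middle term vanishes because E[\hat{\theta} - E[\hat{\theta}]] = 0, leaving the variance plus the squared bias.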

Relationships Among Properties

The mean squared error (MSE) of an estimator serves as a comprehensive measure of its total error, which decomposes into the squared bias and the variance, illustrating how systematic deviation from the true parameter combines with random sampling variability to determine overall accuracy. This decomposition reveals that the total error is not merely additive but reflects the interplay between these components, where minimizing one can influence the other. A key relationship among these properties is the inherent trade-off between bias and variance: efforts to eliminate bias entirely, such as through unbiased estimators, can inflate variance, leading to higher MSE in finite samples, while introducing controlled bias can reduce variance and yield a superior MSE. Shrinkage estimators exemplify this trade-off, as they deliberately pull estimates toward a central value to curb excessive variability; the James-Stein estimator, for instance, shrinks sample means toward a common central point in multivariate normal settings, dominating the maximum likelihood estimator in MSE despite its bias. The Cramér-Rao lower bound establishes a theoretical interconnection by imposing a minimum on the variance of any unbiased estimator, equal to the reciprocal of the Fisher information, thereby linking unbiasedness directly to achievable precision limits and highlighting why biased alternatives may sometimes achieve lower MSE. This bound unifies the properties by showing that variance cannot be arbitrarily reduced without introducing bias or additional assumptions. In estimator selection, these relationships favor MSE minimization over strict unbiasedness in many applications, as biased estimators often provide better finite-sample performance, though maximum likelihood estimators achieve asymptotic efficiency under regularity conditions.
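The shrinkage claim can be made concrete with a brief simulation. The sketch below uses a standard textbook setup that is not taken from this article: p = 10 normal means with identity covariance, one observation vector per replication, and the James-Stein estimator shrinking toward the origin; its total squared-error risk comes out below that of the maximum likelihood estimator X.

```python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 10, 100_000
theta = rng.normal(0.0, 1.0, size=p)           # arbitrary true mean vector, p >= 3

# X ~ N(theta, I_p): one observation vector per replication.
x = theta + rng.normal(0.0, 1.0, size=(reps, p))
norm2 = np.sum(x**2, axis=1, keepdims=True)
js = (1.0 - (p - 2) / norm2) * x               # James-Stein shrinkage toward the origin

risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))   # close to p = 10
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))   # strictly smaller than p
print(f"risk of MLE X: {risk_mle:.3f}   risk of James-Stein: {risk_js:.3f}")
```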

Illustrative Example

To illustrate the finite-sample properties of estimators, consider estimating the population variance \sigma^2 from an independent random sample X_1, \dots, X_n drawn from a normal distribution with mean \mu and variance \sigma^2. A commonly used estimator is the sample variance s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, where \bar{X} = n^{-1} \sum_{i=1}^n X_i is the sample mean. This estimator is unbiased, meaning its expected value equals the true parameter: \mathbb{E}[s^2] = \sigma^2, so the bias is zero. The variance of s^2 is given by \operatorname{Var}(s^2) = \frac{2\sigma^4}{n-1}, which follows from the fact that (n-1)s^2 / \sigma^2 follows a chi-squared distribution with n-1 degrees of freedom, whose variance is 2(n-1). Consequently, the mean squared error is \operatorname{MSE}(s^2) = \operatorname{Var}(s^2) + [\operatorname{Bias}(s^2)]^2 = \frac{2\sigma^4}{n-1}, since the bias term vanishes. An alternative estimator is the biased version \tilde{s}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2. Its expected value is \mathbb{E}[\tilde{s}^2] = \frac{n-1}{n} \sigma^2, yielding a bias of \operatorname{Bias}(\tilde{s}^2) = -\frac{\sigma^2}{n}. The variance is \operatorname{Var}(\tilde{s}^2) = \frac{2\sigma^4 (n-1)}{n^2}, obtained by scaling the variance of the unbiased estimator by \left(\frac{n-1}{n}\right)^2, since \tilde{s}^2 = \frac{n-1}{n} s^2. The mean squared error is then \operatorname{MSE}(\tilde{s}^2) = \left( \frac{\sigma^2}{n} \right)^2 + \frac{2\sigma^4 (n-1)}{n^2} = \frac{(2n-1) \sigma^4}{n^2}. This demonstrates the bias-variance trade-off: although \tilde{s}^2 introduces negative bias, its lower variance results in a smaller MSE than that of s^2 for all finite n. To visualize this trade-off, assume \sigma^2 = 1 and compute the MSE for small sample sizes. The following table shows the values:
Sample size n | MSE of s^2 (unbiased) | MSE of \tilde{s}^2 (biased)
2  | 2.000 | 0.750
5  | 0.500 | 0.360
10 | 0.222 | 0.190
20 | 0.105 | 0.098
These theoretical MSE values highlight how the biased estimator outperforms the unbiased one in terms of MSE, particularly for smaller n, though both approach zero as n grows large. In practice, one could verify these properties through simulations by generating many samples and computing the empirical bias, variance, and MSE, as sketched below.
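Such a simulation might look like the following sketch (normal data with \sigma^2 = 1 and an arbitrary number of replications); the empirical MSE values should land close to the theoretical entries in the table above.

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 500_000

for n in (2, 5, 10, 20):
    x = rng.normal(0.0, 1.0, size=(reps, n))    # sigma^2 = 1
    s2 = x.var(axis=1, ddof=1)                  # unbiased estimator s^2
    s2_tilde = x.var(axis=1, ddof=0)            # biased estimator dividing by n
    mse_s2 = np.mean((s2 - 1.0) ** 2)           # theory: 2/(n-1)
    mse_tilde = np.mean((s2_tilde - 1.0) ** 2)  # theory: (2n-1)/n^2
    print(f"n={n:2d}  MSE(s^2)={mse_s2:.3f}  MSE(s_tilde^2)={mse_tilde:.3f}")
```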

Asymptotic and Behavioral Properties

Consistency

In statistics, consistency refers to a large-sample property of an estimator \hat{\theta}_n, based on a sample of size n, that guarantees convergence to the true parameter \theta as n \to \infty. This property addresses limitations of finite-sample performance, such as bias, by ensuring the estimator becomes arbitrarily close to the true value with high probability in sufficiently large samples. Weak consistency, the standard definition, requires that \hat{\theta}_n converges in probability to \theta, formally \hat{\theta}_n \xrightarrow{p} \theta, meaning for every \epsilon > 0, \lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \epsilon) = 0. A stronger form, strong consistency, demands almost sure convergence, \hat{\theta}_n \xrightarrow{a.s.} \theta, so that with probability one the estimator eventually remains arbitrarily close to \theta for all sufficiently large n. Mean-square consistency, another variant, holds if the mean squared error approaches zero, \lim_{n \to \infty} \mathbb{E}[(\hat{\theta}_n - \theta)^2] = 0, which implies weak consistency under finite second moments. Consistency typically requires model identifiability, where distinct parameter values \theta_1 \neq \theta_2 induce distinct probability distributions P_{\theta_1} \neq P_{\theta_2}, along with assumptions such as independent and identically distributed (i.i.d.) observations and finite moments. Without identifiability, no consistent estimator can exist, as multiple parameters may yield indistinguishable data-generating processes. A classic example is the sample mean \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i for i.i.d. random variables X_i with finite mean \mu; by the weak law of large numbers, \bar{X}_n \xrightarrow{p} \mu, establishing its weak consistency. If the X_i also have finite variance, the strong law of large numbers yields strong consistency.
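Weak consistency of the sample mean can be visualized by tracking how often it strays more than \epsilon from \mu as n grows. In the sketch below (exponential data with mean 1 and \epsilon = 0.1 are arbitrary choices), the probability appearing in the definition shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, eps, reps = 1.0, 0.1, 20_000            # exponential(1) has mean mu = 1

for n in (10, 100, 1_000):
    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) > eps)  # Monte Carlo estimate of P(|Xbar_n - mu| > eps)
    print(f"n={n:5d}  P(|Xbar_n - mu| > {eps}) ~ {prob:.4f}")
```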

Asymptotic Normality

Asymptotic normality refers to the property of certain estimators where, for sufficiently large sample sizes n, the sampling distribution of the estimator \hat{\theta}_n approximates a normal distribution centered at the true parameter \theta. This property facilitates statistical inference by allowing the use of normal-based approximations for quantities like standard errors and confidence intervals. Specifically, under appropriate regularity conditions, the normalized estimator \sqrt{n} (\hat{\theta}_n - \theta) converges in distribution to a normal random variable with mean 0 and variance \sigma^2, denoted as \sqrt{n} (\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma^2). This result stems from the central limit theorem (CLT) applied to the score function or estimating equations underlying the estimator. For maximum likelihood estimators (MLEs), the CLT applies to the sum of score functions, which have mean zero and finite variance, leading to the asymptotic normality after normalization by \sqrt{n}. The asymptotic variance \sigma^2 depends on the estimator; for MLEs, it equals the inverse of the Fisher information I(\theta), so \sigma^2 = 1/I(\theta), where I(\theta) = -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log f(X|\theta) \right]. This connection highlights how the information content in the data determines the precision of the estimator asymptotically. The implications of asymptotic normality are profound for inference: it enables the construction of approximate confidence intervals via \hat{\theta}_n \pm z_{\alpha/2} \cdot \hat{\sigma}/\sqrt{n}, where \hat{\sigma}^2 estimates the asymptotic variance, and hypothesis tests such as the Wald test, which compares \hat{\theta}_n to a null value using the normal approximation. These tools are reliable for large n even if exact distributions are intractable. However, the property requires regularity conditions, including that the parameter space is open, the likelihood is twice differentiable with respect to \theta almost everywhere, the support of the density does not depend on \theta, and the third derivatives exist and are integrable to ensure the CLT applies without bias in the expansion. For moment-based estimators, similar differentiability of moment conditions suffices.
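The normal approximation and the resulting Wald-type interval can be checked empirically. In the sketch below (exponential data, a nominal 95% level, and an arbitrary sample size), the coverage of \bar{X} \pm 1.96 \hat{\sigma}/\sqrt{n} should come out close to 0.95.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, n, reps, z = 2.0, 200, 100_000, 1.96         # exponential mean mu; 1.96 is the 97.5% normal quantile

x = rng.exponential(scale=mu, size=(reps, n))
xbar = x.mean(axis=1)
se = x.std(axis=1, ddof=1) / np.sqrt(n)          # estimated standard error of the mean
covered = np.mean((xbar - z * se <= mu) & (mu <= xbar + z * se))

print(f"empirical coverage of the nominal 95% Wald interval: {covered:.3f}")
```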

Efficiency

In asymptotic theory, the efficiency of an estimator refers to its ability to achieve the minimal possible asymptotic variance among consistent estimators for a given parameter. This concept builds on the asymptotic normality of such estimators, where the variance of the limiting normal distribution determines the efficiency. Asymptotic relative efficiency (ARE) quantifies this by taking the ratio of the asymptotic variances of two estimators; when the ARE of one estimator relative to another is defined as the competitor's asymptotic variance divided by its own, a value of 1 indicates equal efficiency and values greater than 1 favor the first estimator. The Cramér–Rao lower bound provides the theoretical minimum for this asymptotic variance under regularity conditions. For an unbiased estimator \hat{\theta}_n based on n independent observations from a regular parametric family, the bound states \operatorname{Var}(\hat{\theta}_n) \geq \frac{1}{n I(\theta)}, where I(\theta) denotes the Fisher information of the distribution at parameter \theta. This bound, derived independently by Cramér and Rao, establishes a lower limit on the variance that can be attained only when equality holds. An estimator is asymptotically efficient if its asymptotic variance equals 1/(n I(\theta)), saturating the bound. The maximum likelihood estimator (MLE) achieves this under standard regularity conditions, such as differentiability of the log-likelihood and identifiability of the parameters, making it a benchmark for efficiency in large samples. Relative efficiency comparisons highlight how performance varies across distributions. For instance, in estimating the mean of a normal distribution, the sample mean attains the bound and has an ARE of \pi/2 \approx 1.57 relative to the sample median. In contrast, for a uniform distribution on [0, 1], the sample mean remains efficient with asymptotic variance 1/(12n), while the sample median has asymptotic variance 1/(4n) and thus an ARE of 1/3 relative to the mean.
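Both ARE values quoted above can be reproduced by simulation. The sketch below (arbitrary sample size and replication count) compares the sampling variances of the mean and the median under normal and uniform data.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 100_000

# Normal data: Var(median)/Var(mean) should approach pi/2 ~ 1.571.
x = rng.normal(0.0, 1.0, size=(reps, n))
print("normal data,  ARE of mean relative to median:",
      np.median(x, axis=1).var() / x.mean(axis=1).var())

# Uniform[0, 1] data: Var(mean)/Var(median) should approach 1/3.
u = rng.uniform(0.0, 1.0, size=(reps, n))
print("uniform data, ARE of median relative to mean:",
      u.mean(axis=1).var() / np.median(u, axis=1).var())
```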

Robustness

In statistics, robustness refers to the capacity of an estimator to retain its statistical properties, including controlled bias and variance, under deviations from ideal assumptions, such as the presence of outliers or misspecification of the underlying distribution. This property addresses vulnerabilities in classical estimators like the sample mean, which can exhibit substantial bias when even a small fraction of data points are contaminated. Robust estimators are designed to mitigate such sensitivities, ensuring reliable inference in real-world scenarios where data imperfections are common. Key measures of robustness include the influence function and the breakdown point. The influence function, introduced by Hampel, quantifies the asymptotic effect of an infinitesimal contamination at a specific point on the value of the estimator; a bounded influence function ensures that no single observation can disproportionately affect the result, providing a local measure of stability. Complementing this, the breakdown point, also formalized by Hampel, represents the minimal proportion of arbitrarily corrupted observations required to make the estimator arbitrarily large or undefined, serving as a global robustness metric; estimators with breakdown points approaching 50% are considered highly resistant, as this is the theoretical maximum for location estimators. A classic example of a robust estimator is the sample median for estimating a location parameter, which remains stable even if up to nearly half the observations are outliers, unlike the sample mean, whose breakdown point is zero. For instance, in a symmetric distribution contaminated by gross errors, the median's influence function is bounded by a constant, limiting the impact of extreme values and preserving its utility as a location measure. Huber's M-estimators extend this principle by solving estimating equations derived from a loss function that behaves quadratically for small residuals to retain efficiency under the assumed model, while transitioning to linear behavior for large residuals to cap their influence; the tuning constant in Huber's proposal balances these trade-offs, achieving near-maximum efficiency at the normal model while maintaining a positive breakdown point.
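The contrast between the mean, the median, and a Huber-type M-estimator shows up clearly on contaminated data. The sketch below computes the Huber location estimate by a simple iteratively reweighted scheme with a fixed MAD-based scale; the tuning constant 1.345 and the 10% contamination level are illustrative choices rather than prescriptions from this article.

```python
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=200):
    """Huber M-estimate of location via iteratively reweighted averaging,
    using a fixed MAD-based scale estimate."""
    mu = np.median(x)
    scale = 1.4826 * np.median(np.abs(x - mu))   # MAD, rescaled for consistency at the normal
    for _ in range(max_iter):
        r = np.abs(x - mu) / scale               # absolute standardized residuals
        w = np.where(r <= k, 1.0, k / np.maximum(r, 1e-12))   # Huber weights
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(0.0, 1.0, size=90),    # bulk of the data near 0
                    rng.normal(50.0, 1.0, size=10)])  # 10% gross outliers

print("mean  :", x.mean())                 # dragged far toward the outliers
print("median:", np.median(x))             # stays near 0
print("Huber :", huber_location(x))        # stays near 0
```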

Fisher Consistency

Fisher consistency is a property of an estimator \hat{\theta} that ensures it recovers the true parameter \theta when applied to the entire population distribution rather than to a finite sample. Formally, an estimator is Fisher consistent if the corresponding statistical functional T satisfies T(F_\theta) = \theta, where F_\theta is the true distribution function parameterized by \theta. Viewed as a functional of the empirical distribution, such an estimator therefore returns the true parameter in the limit where the empirical distribution coincides with the true one. Unlike standard unbiasedness, which requires E_\theta[\hat{\theta}] = \theta for finite samples from the correctly specified model, Fisher consistency is a population-level condition and can hold even under model misspecification. It guarantees that the estimator is asymptotically unbiased in the sense that biases vanish as the sample size grows, provided the model is correctly specified in the limit, but it does not preclude finite-sample bias. This distinction is particularly relevant in scenarios where exact unbiasedness is unattainable or impractical. Fisher consistency is closely related to generalized estimating equations (GEEs), which provide a framework for estimation in correlated or clustered data without full likelihood specification. In GEEs, the estimating equations are designed to yield consistent estimates of regression parameters even if the working correlation structure or link function is misspecified, as long as the mean model is correct. This property stems from the Fisher consistency of the estimating-equation approach, which ensures the solution to the estimating equations matches the true parameter in the population. The concept finds key applications in robust statistics, where estimators are constructed to resist outliers or contamination while maintaining Fisher consistency across a range of model distributions. For instance, robust nonparametric generalized linear models achieve this by minimizing losses whose population minimizer equals the true parameter. In semi-parametric inference, Fisher consistency underpins Z-estimators for moment condition models, allowing efficient estimation without fully specifying the nuisance components, as seen in robust alternatives for elliptical or high-dimensional settings.

Estimation Methods

Method of Moments

The method of moments is a technique for estimating population parameters by equating the first k sample moments to the corresponding population moments, where k is the number of unknown parameters, and solving the resulting system of equations for the parameter vector \theta. Introduced by Karl Pearson in 1894, this approach provides a straightforward, distribution-free method applicable whenever the necessary moments exist. The procedure begins by computing the sample moments from the data, typically using raw moments m_r = \frac{1}{n} \sum_{i=1}^n X_i^r for r = 1, 2, \dots, k, and setting them equal to the theoretical population moments \mu_r(\theta), which are functions of the parameters. This yields k equations in k unknowns, solved explicitly or numerically for \hat{\theta}. For instance, in estimating the mean \mu and variance \sigma^2 of a normal distribution N(\mu, \sigma^2), the first population moment is \mu_1(\mu, \sigma^2) = \mu and the second is \mu_2(\mu, \sigma^2) = \mu^2 + \sigma^2. Equating these to the sample moments gives \hat{\mu} = \bar{X} (the sample mean) and \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 = m_2 - m_1^2, the second raw sample moment minus the square of the first. The method offers advantages in simplicity and ease of computation, as it avoids iterative optimization and relies only on moment calculations, making it computationally convenient for small samples or when explicit solutions exist. However, method of moments estimators may lack efficiency, often exhibiting higher variance compared to alternatives like maximum likelihood estimators, particularly for distributions where moments do not fully capture the likelihood structure. Under mild conditions, such as the existence of the relevant population moments and the identifiability of \theta from those moments, method of moments estimators are consistent, meaning they converge in probability to the true parameter values as the sample size n \to \infty. This follows from the law of large numbers applied to the sample moments, which converge to the population moments, combined with the continuity of the solving function. Method of moments estimators are not guaranteed to be unbiased, though they may be in specific cases like the normal mean.
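A direct implementation of the normal-distribution example is sketched below (with illustrative parameter values): the first two raw sample moments are equated to \mu and \mu^2 + \sigma^2 and solved for the estimates.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(loc=3.0, scale=2.0, size=1_000)   # data with mu = 3, sigma^2 = 4

m1 = np.mean(x)           # first raw sample moment
m2 = np.mean(x**2)        # second raw sample moment

mu_hat = m1               # from  m1 = mu
sigma2_hat = m2 - m1**2   # from  m2 = mu^2 + sigma^2

print(f"method-of-moments estimates: mu ~ {mu_hat:.3f}, sigma^2 ~ {sigma2_hat:.3f}")
```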

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) seeks the parameter value that makes the observed data most probable under the assumed model. For independent and identically distributed observations X_1, \dots, X_n from a density f(x \mid \theta), the likelihood function is L(\theta) = \prod_{i=1}^n f(X_i \mid \theta), and the MLE \hat{\theta} is defined as \hat{\theta} = \arg\max_\theta L(\theta). Since the product form can be numerically unstable, maximization is typically performed on the log-likelihood \ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(X_i \mid \theta), which yields the same maximizer. This approach was formally introduced by Ronald A. Fisher in 1922 as a general method for estimation in theoretical statistics. A distinctive feature of MLE is its invariance property: if \hat{\theta} is the MLE of the parameter \theta, then for any function g(\cdot), g(\hat{\theta}) is the MLE of g(\theta). This holds because the transformation preserves the argmax operation on the likelihood, making MLE adaptable to reparameterizations without altering the estimation principle. The property ensures that estimates of derived quantities, such as functions of the original parameters, can be obtained directly from the primary MLE. Under standard regularity conditions, such as the existence of moments, differentiability of the log-likelihood, and identifiability of the parameters, the MLE possesses desirable asymptotic properties. Specifically, \hat{\theta} is consistent, meaning \hat{\theta} \xrightarrow{p} \theta as n \to \infty; asymptotically normal, with \sqrt{n} (\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1}), where I(\theta) is the Fisher information; and asymptotically efficient, achieving the Cramér-Rao lower bound on the variance. These properties establish MLE as a benchmark for optimality in large samples, provided the model is correctly specified. Computing the MLE often requires numerical optimization, especially for complex models where closed-form solutions are unavailable. The Newton-Raphson method iteratively updates the parameter estimate using the score function and Hessian: \theta^{(k+1)} = \theta^{(k)} - H(\theta^{(k)})^{-1} S(\theta^{(k)}), where S(\theta) = \partial \ell(\theta)/\partial \theta is the score and H(\theta) = \partial^2 \ell(\theta)/\partial \theta \partial \theta^\top is the Hessian, converging quadratically under suitable conditions. For models with latent variables or incomplete data, the expectation-maximization (EM) algorithm provides an alternative, alternating between an expectation step that computes the expected complete-data log-likelihood and a maximization step that updates the parameters, monotonically increasing the observed log-likelihood until convergence. The EM algorithm, introduced by Dempster, Laird, and Rubin in 1977, is particularly useful in mixture models and hidden Markov models.
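As a small concrete sketch, consider an exponential model with rate \lambda, chosen here because the score and Hessian are simple and the closed-form MLE 1/\bar{x} is available for checking; the Newton-Raphson iteration described above then converges to that value in a few steps.

```python
import numpy as np

rng = np.random.default_rng(10)
true_rate = 0.5
x = rng.exponential(scale=1.0 / true_rate, size=500)   # exponential data with rate 0.5
n, sx = x.size, x.sum()

def score(lam):            # derivative of log-likelihood  n*log(lam) - lam*sum(x)
    return n / lam - sx

def hessian(lam):          # second derivative of the log-likelihood
    return -n / lam**2

lam = 1.0 / np.median(x)   # crude starting value inside the convergence region
for _ in range(25):        # Newton-Raphson: lam <- lam - H(lam)^{-1} S(lam)
    step = score(lam) / hessian(lam)
    lam -= step
    if abs(step) < 1e-10:
        break

print(f"Newton-Raphson MLE: {lam:.4f}   closed form 1/xbar: {1.0 / x.mean():.4f}")
```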
