Maximum likelihood estimation (MLE) is a fundamental statistical method for estimating the parameters of a probabilistic model by selecting the parameter values that maximize the likelihood function, which quantifies the probability of observing the given data under those parameters.[1] This approach assumes that the observed data are a representative sample from the underlying population and seeks to make the data "most likely" by optimizing the joint probability density (or mass) function evaluated at the observed values.[2]

The concept of MLE was formally introduced by the British statistician Ronald A. Fisher in his seminal 1922 paper, "On the Mathematical Foundations of Theoretical Statistics," where he developed it as a principled alternative to earlier estimation techniques such as the method of moments, emphasizing its role in inference and hypothesis testing.[3] Fisher's work built on his earlier 1912 numerical procedure for parameter estimation, evolving through the 1910s to incorporate ideas of sufficiency, efficiency, and information, which justified MLE's optimality properties.[4] Although precursors to likelihood-based methods existed in the work of earlier mathematicians such as Edgeworth, Fisher is credited with coining the term "maximum likelihood" in his 1922 paper and with rigorously promoting its theoretical foundations in his 1925 paper "Theory of Statistical Estimation."[3]

Under standard regularity conditions (such as correct model specification and differentiability of the likelihood), maximum likelihood estimators are consistent (converging to the true parameters as the sample size increases), asymptotically unbiased, asymptotically normal, and asymptotically efficient (achieving the lowest possible variance among consistent estimators).[2] Computationally, MLE usually proceeds by maximizing the log-likelihood function for tractability, using techniques such as Newton-Raphson iteration or the expectation-maximization algorithm for complex models.[1] Its versatility has made MLE a cornerstone of modern statistics, underpinning applications in fields such as epidemiology (regression modeling), econometrics (discrete choice analysis), and machine learning (parameter optimization in probabilistic models).[2][5]
Fundamentals
Definition and Motivation
Maximum likelihood estimation (MLE) is a fundamental statistical method for inferring the parameters of a probabilistic model from observed data. Introduced by Ronald A. Fisher in his seminal 1922 paper, MLE seeks to estimate the unknown parameters by identifying those values that maximize the probability of the observed data under the assumed model.[3] This approach arose from Fisher's work on experimental design in agriculture and genetics, where traditional estimation techniques struggled with small sample sizes and required more precise tools for parameter recovery.[6] Fisher's motivation was to develop an estimation principle grounded in the likelihood of the data, providing a systematic way to select parameters that make the observations as plausible as possible given the model.[7]

Formally, suppose a random sample X = (X_1, \dots, X_n) is drawn from a distribution with probability density or mass function f(x; \theta), where \theta represents the unknown parameter(s). The maximum likelihood estimator \hat{\theta} is then defined as the value of \theta that maximizes the joint probability (or density) of the observed data, i.e., \hat{\theta} = \arg\max_{\theta} \prod_{i=1}^n f(x_i; \theta).[3] This criterion directly incorporates the assumed distributional form, ensuring that the estimator aligns the model closely with the empirical evidence.[8]

In contrast to earlier methods such as the method of moments, which estimate parameters by equating sample moments (such as the mean and variance) to their theoretical population counterparts and thereby rely on a limited set of summary statistics, MLE uses the full information contained in the likelihood across the entire sample distribution.[8] This comprehensive use of distributional detail typically yields estimators with superior statistical properties, such as lower variance, particularly in finite samples.[7] By establishing the likelihood as the cornerstone of inference, MLE provides a principled foundation for subsequent developments in estimation theory.
Likelihood Function
The likelihood function, introduced by Ronald Fisher, quantifies the plausibility of different parameter values given observed data by treating the joint probability of the data as a function of the parameter. For a sample of independent observations x = (x_1, x_2, \dots, x_n) drawn from a discrete probability distribution p(x | \theta), where \theta denotes the parameter(s) of interest, the likelihood function is defined as

L(\theta | x) = \prod_{i=1}^n p(x_i | \theta).

This formulation views the data as fixed and assesses how well \theta explains the observations.[3]

For continuous distributions, the probability mass function p is replaced by the probability density function f(x | \theta), yielding the explicit form for the independent and identically distributed (i.i.d.) case:

L(\theta | x) = \prod_{i=1}^n f(x_i | \theta).

Here, the likelihood represents the joint density of the sample evaluated at the observed data points, serving as the mathematical foundation for parameter estimation. The joint likelihood arises from the independence assumption, being equivalent to the product of the individual likelihood contributions p(x_i | \theta) or densities f(x_i | \theta).[3]

To facilitate optimization, the log-likelihood function is commonly employed:

\ell(\theta | x) = \log L(\theta | x) = \sum_{i=1}^n \log p(x_i | \theta)

(or using \log f(x_i | \theta) in the continuous case). As a monotonically increasing transformation of the positive-valued likelihood, the log-likelihood preserves the location of maxima, ensuring that the value of \theta that maximizes \ell(\theta | x) also maximizes L(\theta | x). This property simplifies differentiation and numerical computation without altering the estimation objective.[3][9]

In the context of inference, the likelihood function underpins hypothesis testing by providing a basis for comparing models, such as through ratios of likelihoods under alternative parameter values.[3]
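The relationship between the likelihood, the log-likelihood, and their common maximizer can be checked numerically. The following sketch (illustrative Python, with a made-up Bernoulli sample; it is not taken from the cited sources) evaluates both functions on a grid and confirms that they peak at the same parameter value.

```python
import numpy as np

# Illustrative i.i.d. Bernoulli sample (7 successes out of 10 trials).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Grid of candidate parameter values, avoiding the endpoints 0 and 1.
theta = np.linspace(0.01, 0.99, 999)

# Likelihood L(theta | x) = prod_i p(x_i | theta) and its logarithm.
likelihood = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in theta])
log_likelihood = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in theta])

# Both curves are maximized at the same theta (here near the sample proportion 0.7),
# illustrating that the log transform preserves the location of the maximum.
print(theta[np.argmax(likelihood)], theta[np.argmax(log_likelihood)], x.mean())
```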
Principles of Estimation
Maximizing the Likelihood
The maximum likelihood estimator (MLE), denoted \hat{\theta}, is the value of the parameter \theta that maximizes the likelihood function L(\theta \mid \mathbf{x}) for observed data \mathbf{x}, formally defined as \hat{\theta} = \arg\max_{\theta} L(\theta \mid \mathbf{x}).[3] This principle, introduced by Ronald Fisher, selects the parameter values that render the observed data most probable under the assumed model.[3] Since the logarithm is a monotonically increasing function, maximizing the likelihood is equivalent to maximizing the log-likelihood \ell(\theta \mid \mathbf{x}) = \log L(\theta \mid \mathbf{x}), which often simplifies computations because products become sums.[3]

To obtain the MLE in practice, one solves the first-order condition known as the score equation: \frac{\partial \ell(\theta \mid \mathbf{x})}{\partial \theta} = 0.[3] The score function, S(\theta \mid \mathbf{x}) = \frac{\partial \ell(\theta \mid \mathbf{x})}{\partial \theta}, measures the sensitivity of the log-likelihood to changes in \theta and is given explicitly by

S(\theta \mid \mathbf{x}) = \sum_{i=1}^n \frac{\partial \log f(x_i \mid \theta)}{\partial \theta},

where f(x_i \mid \theta) is the probability density or mass function of the i-th observation.[3] Under typical regularity conditions—such as differentiability of the log-density and the true parameter lying in the interior of the parameter space—this equation identifies a critical point corresponding to the maximum.

The second derivative of the log-likelihood plays a key role in assessing the curvature at this critical point and quantifying the information content of the data. The Fisher information I(\theta) is defined as

I(\theta) = -\mathbb{E}\left[ \frac{\partial^2 \ell(\theta \mid \mathbf{x})}{\partial \theta^2} \right] = \mathbb{E}\left[ S(\theta \mid \mathbf{x})^2 \right],

where the expectation is taken with respect to the true distribution of \mathbf{x}.[10] This non-negative quantity serves as a measure of the amount of information that the data provide about \theta; higher values indicate greater precision in estimation.[10] Equivalently, I(\theta) equals the variance of the score function, since \mathbb{E}[S(\theta \mid \mathbf{x})] = 0 under the same regularity conditions.[10]

For a single parameter, the observed information is the negative Hessian -\frac{\partial^2 \ell(\hat{\theta} \mid \mathbf{x})}{\partial \theta^2} evaluated at the MLE, which approximates n I(\theta) for large samples, where n is the sample size.[10] In the multivariate case, these concepts generalize to the score vector and the Fisher information matrix, and the score equation becomes \mathbf{S}(\theta \mid \mathbf{x}) = \mathbf{0}.

Under regularity conditions ensuring strict concavity of the log-likelihood function—such as twice differentiability and positive definiteness of the information matrix—the solution to the score equation is unique and corresponds to the global maximum. This uniqueness guarantees a well-defined MLE whenever it exists within the parameter space.
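These definitions can be verified concretely. The sketch below is a hypothetical Poisson example (not drawn from the cited references): the score vanishes at the MLE \hat{\lambda} = \bar{x}, and for this model the observed information equals n I(\hat{\lambda}) exactly, since I(\lambda) = 1/\lambda.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 4.0
x = rng.poisson(lam_true, size=500)          # i.i.d. Poisson sample
n, lam_hat = x.size, x.mean()                # MLE of lambda is the sample mean

# Score function S(lambda) = sum_i x_i / lambda - n; it vanishes at the MLE.
score = x.sum() / lam_hat - n

# Observed information: negative second derivative of the log-likelihood at the MLE.
observed_info = x.sum() / lam_hat**2

# Per-observation Fisher information for the Poisson model is I(lambda) = 1/lambda,
# so n * I(lambda_hat) should agree with the observed information.
print(score)                        # ~0 up to floating-point error
print(observed_info, n / lam_hat)   # identical for this model
```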
Restricted Parameter Spaces
In maximum likelihood estimation (MLE), the parameter space is often restricted by inherent constraints such as non-negativity of variances or positivity of scale parameters, which alters the maximization problem relative to the unconstrained case.

When the true parameter lies on the boundary of the restricted space, the MLE can attain its maximum there rather than in the interior, producing estimators that may not satisfy the standard regularity conditions. For instance, in estimating the variance \sigma^2 of a normal distribution from a sample of identical observations, the likelihood increases without bound as \sigma^2 \to 0^+, so the MLE is driven to the boundary value \hat{\sigma}^2 = 0. Similarly, for a uniform distribution on [0, \theta] with \theta > 0, the likelihood function for a sample x_1, \dots, x_n is

L(\theta) = \begin{cases} \theta^{-n} & \text{if } \theta \geq \max(x_i), \\ 0 & \text{otherwise}. \end{cases}

This is maximized at the boundary \hat{\theta} = \max(x_i): any smaller \theta makes L(\theta) = 0, while any larger \theta decreases \theta^{-n}.

For constrained optimization in MLE, equality constraints are typically handled with Lagrange multipliers, which incorporate the restrictions into the log-likelihood maximization. Specifically, to maximize \ell(\theta) subject to g(\theta) = 0, one solves \nabla \ell(\theta) + \lambda \nabla g(\theta) = 0 for \theta and \lambda.[11] Inequality constraints, such as \theta \geq 0, often require direct evaluation of the likelihood on the feasible region, potentially projecting interior solutions onto the boundary if they violate the constraints.

These restrictions have important implications for inference: standard score tests, which rely on interior maxima and quadratic approximations, become invalid when the MLE is on the boundary, as the score may not have mean zero. Instead, profile likelihood methods are recommended, in which the likelihood is profiled over nuisance parameters within the restricted space to construct confidence regions or perform tests.[12]
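Returning to the uniform-distribution example above, the following illustrative Python sketch (with arbitrary simulated data) evaluates the likelihood on a grid and confirms that the maximizer sits at the data-determined boundary \max(x_i).

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 3.0
x = rng.uniform(0.0, theta_true, size=50)

def likelihood(theta, x):
    """Uniform(0, theta) likelihood: theta^{-n} if theta >= max(x_i), else 0."""
    return theta ** (-x.size) if theta >= x.max() else 0.0

grid = np.linspace(0.5, 5.0, 2000)
values = np.array([likelihood(t, x) for t in grid])

# The grid maximizer coincides (up to grid resolution) with the boundary max(x_i).
print(grid[np.argmax(values)], x.max())
```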
Nonparametric Maximum Likelihood Estimation
Nonparametric maximum likelihood estimation (NPMLE) maximizes the likelihood function over an infinite-dimensional space of probability distributions, eschewing parametric assumptions about the form of the underlying distribution. This approach is applicable in settings where the data-generating process is unspecified beyond basic structural constraints, allowing the estimator to adapt flexibly to the observed data. A foundational result establishes the consistency of such estimators under mild identifiability and compactness conditions on the support.[13]

In survival analysis with right-censored observations, the Kaplan-Meier estimator provides a canonical example of NPMLE for the distribution of survival times.[14] It places point masses at the observed event times, assigning probabilities that maximize the likelihood while accounting for censoring through a product-limit construction. This estimator arises naturally as the unique maximizer within the nonparametric class of distributions with jumps only at uncensored failure times.

For mixture models, in which observations are draws from a convex combination of unknown component distributions, the NPMLE typically concentrates on a discrete mixing distribution with at most as many support points as data observations.[13] An expectation-maximization (EM)-like iterative procedure identifies these support points and the corresponding weights by alternately updating conditional probabilities and solving restricted maximization problems, converging to the global likelihood maximizer under standard conditions. This method has been pivotal in applications such as empirical Bayes estimation and deconvolution problems.

When estimating monotone densities, such as non-increasing failure rates in reliability analysis, the NPMLE is equivalent to an isotonic regression of the empirical distribution function, enforcing the monotonicity constraint while maximizing the likelihood.[15] This yields a step-function estimator that jumps at data points, providing a consistent approximation to the true density under weak assumptions.

Overall, NPMLE converges to the true underlying distribution in appropriate senses, for example uniform convergence of the estimated distribution function as in Glivenko-Cantelli-type results, provided the observations are i.i.d. and the true distribution can be approximated within the nonparametric class. Such convergence can hold even in censored or otherwise complex settings, underscoring the robustness of the approach.
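A minimal product-limit sketch, assuming right-censored data supplied as (time, event) pairs, shows how the Kaplan-Meier NPMLE places its multiplicative updates only at observed event times. The data and function name are illustrative, not taken from the cited sources.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate of the survival function S(t).

    times  : observed times (event or censoring)
    events : 1 if the time is an observed event, 0 if right-censored
    Returns the distinct event times and the estimated survival just after each.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                  # still under observation just before t
        d = np.sum((times == t) & (events == 1))      # events occurring at t
        s *= 1.0 - d / at_risk                        # multiplicative (product-limit) update
        surv.append(s)
    return event_times, np.array(surv)

# Small illustrative sample; censoring is encoded in the event indicator.
t, s = kaplan_meier([2, 3, 3, 5, 7, 8, 8, 9], [1, 1, 0, 1, 0, 1, 1, 0])
print(dict(zip(t, np.round(s, 3))))
```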
Statistical Properties
Consistency
In statistics, the maximum likelihood estimator (MLE) \hat{\theta}_n, obtained by maximizing the log-likelihood function \ell_n(\theta) = \sum_{i=1}^n \log f(X_i \mid \theta) based on an independent and identically distributed (IID) sample X_1, \dots, X_n from a parametric family with density or mass function f(\cdot \mid \theta), is consistent for the true parameter \theta_0 if \hat{\theta}_n \xrightarrow{p} \theta_0 as the sample size n \to \infty, where \xrightarrow{p} denotes convergence in probability. This property ensures that, with probability approaching 1, the MLE becomes arbitrarily close to the true value as more data are observed. The foundational result establishing this convergence under general conditions is due to Wald.[16]

The proof of consistency relies on the law of large numbers (LLN) applied to the normalized log-likelihood, which converges pointwise to its population counterpart, combined with uniform convergence over the parameter space to preserve the location of the maximum. Specifically, for each fixed \theta,

\frac{1}{n} \ell_n(\theta) \xrightarrow{p} E_{\theta_0} [\log f(X \mid \theta)],

where the expectation is taken under the true parameter \theta_0. The right-hand side, known as the expected log-likelihood, is strictly maximized at \theta = \theta_0 under model identifiability, meaning that distinct parameters yield distinct distributions. Uniform convergence of the normalized log-likelihood to this expectation—ensured when the parameter space is compact—implies that the probability of the MLE deviating significantly from \theta_0 vanishes as n \to \infty. This argument traces back to Wald's theorem, which leverages continuity of the likelihood and compactness to achieve the result.[16][17]

Consistency holds under a set of regularity conditions that ensure the LLN applies and the maximizer behaves well. These include: the data being IID from a distribution in the parametric family; the support of f(x \mid \theta) not depending on \theta; the log-likelihood \log f(x \mid \theta) being differentiable in \theta for almost all x; the true parameter \theta_0 lying in the interior of a compact parameter space \Theta; and identifiability of the model, so that E_{\theta_0} [\log f(X \mid \theta)] < E_{\theta_0} [\log f(X \mid \theta_0)] for all \theta \neq \theta_0. In addition, the likelihood must be continuous in \theta and the expected log-likelihood must be integrable. These conditions, formalized in Wald's work, rule out pathologies such as non-convergence due to boundary issues or non-identifiable parameters.[16][17]
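The qualitative content of consistency is easy to see by simulation. The sketch below uses an illustrative exponential model with true rate 2.0 (for which the MLE is 1/\bar{x}) and shows the estimates concentrating around the true value as n grows; all values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(13)
theta0 = 2.0   # true rate of an exponential distribution; the MLE is 1 / sample mean

for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(scale=1.0 / theta0, size=n)
    theta_hat = 1.0 / x.mean()
    print(n, theta_hat)   # estimates concentrate around theta0 = 2.0 as n grows
```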
Asymptotic Efficiency and Normality
Under regularity conditions that ensure the existence and consistency of the maximum likelihood estimator (MLE) \hat{\theta}_n, the asymptotic distribution of the MLE is

\sqrt{n} (\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1}),

where \theta_0 is the true parameter value, n is the sample size, and I(\theta_0) denotes the Fisher information matrix evaluated at \theta_0.[18] This result establishes that, for large n, the MLE is approximately normally distributed with mean \theta_0 and variance I(\theta_0)^{-1}/n, providing a foundation for inference procedures such as confidence intervals and hypothesis tests based on the MLE.[19]

The proof outline relies on a first-order Taylor expansion of the score function around \theta_0. The score equation at the MLE is \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(X_i; \hat{\theta}_n) = 0, which expands to approximately

\frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(X_i; \theta_0) + \left( \frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial \theta \partial \theta^T} \log f(X_i; \theta_0) \right) (\hat{\theta}_n - \theta_0) = 0.

Solving and scaling by \sqrt{n} yields

\sqrt{n} (\hat{\theta}_n - \theta_0) \approx - \left[ \frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial \theta \partial \theta^T} \log f(X_i; \theta_0) \right]^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(X_i; \theta_0).

Under the regularity conditions, the normalized score \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(X_i; \theta_0) converges in distribution to \mathcal{N}(0, I(\theta_0)) by the central limit theorem for independent identically distributed random variables, and the normalized Hessian converges in probability to -I(\theta_0); combining the two limits (via Slutsky's theorem) gives the stated asymptotic normality.[18]

This asymptotic variance I(\theta_0)^{-1}/n implies that the MLE attains the Cramér-Rao lower bound asymptotically, achieving the minimal possible variance among asymptotically unbiased estimators under the same conditions.[20][21] Specifically, for large n, the variance of \hat{\theta}_n is approximated by \operatorname{Var}(\hat{\theta}_n) \approx I(\hat{\theta}_n)^{-1}/n, where the information evaluated at the MLE, I(\hat{\theta}_n), serves as a consistent estimator of I(\theta_0).[18] Thus, the MLE is asymptotically efficient, saturating the information-theoretic limit for parameter estimation.[20]
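The limiting normal approximation can be illustrated by simulation. The sketch below assumes an exponential model with rate \theta, for which the MLE is 1/\bar{x} and I(\theta) = 1/\theta^2, so \sqrt{n}(\hat{\theta}_n - \theta_0) should have standard deviation close to \theta_0; the sample sizes and replication counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 400, 5000

# Exponential(rate=theta0) samples; the MLE of the rate is 1 / sample mean.
samples = rng.exponential(scale=1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / samples.mean(axis=1)

z = np.sqrt(n) * (theta_hat - theta0)

# Fisher information for the exponential rate is I(theta) = 1/theta^2,
# so the limiting standard deviation of z is theta0.
print(z.mean())                       # near 0 (a small finite-sample bias remains)
print(z.std(), theta0)                # empirical vs. theoretical sqrt(I(theta0)^{-1})
```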
Invariance Properties
One key property of maximum likelihood estimators (MLEs) is functional invariance: if \hat{\theta} is the MLE of a parameter \theta, then for a one-to-one (invertible) function g, the estimator g(\hat{\theta}) is the MLE of the transformed parameter g(\theta).[22] This ensures that the estimation procedure remains consistent under reparameterization of the model.

The proof relies on the definition of the MLE as the value that maximizes the likelihood function L(\theta \mid x). Consider the reparameterized likelihood L^*(\phi \mid x) = L(g^{-1}(\phi) \mid x), where \phi = g(\theta) and g^{-1} is the inverse. Since \hat{\theta} maximizes L(\theta \mid x), substituting \theta = g^{-1}(\phi) shows that L^*(\phi \mid x) is maximized at \phi = g(\hat{\theta}), as L(g^{-1}(g(\hat{\theta})) \mid x) = L(\hat{\theta} \mid x) \geq L(\theta \mid x) for all \theta.[23] Thus, the location of the maximum remains unchanged under the transformation.[22]

For instance, suppose \hat{\mu} is the MLE of the mean \mu in a normal distribution N(\mu, \sigma^2); then \hat{\mu}^2 serves as the MLE for \mu^2, with the caveat that the squaring function is not one-to-one over \mu \in \mathbb{R}, requiring the use of an induced likelihood defined as the supremum over all \theta mapping to the same value.[24] In contrast, for the standard deviation \sigma > 0, if \hat{\sigma}^2 is the MLE of the variance, then \sqrt{\hat{\sigma}^2} is the MLE of \sigma, since the square root is one-to-one on the positive reals.[25]

This invariance extends naturally to vector-valued parameters \theta \in \mathbb{R}^k. If g: \mathbb{R}^k \to \mathbb{R}^k is a one-to-one transformation, then g(\hat{\theta}) is the MLE of g(\theta), with the proof proceeding by analogous reparameterization of the joint likelihood, ensuring the maximum occurs at the transformed estimator.[22] For non-one-to-one vector transformations, similar caveats apply, often involving maximization over preimages in the parameter space.[24]
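The standard-deviation example can be checked directly: maximizing the normal likelihood in \sigma gives the same answer as taking the square root of the variance MLE. The sketch below is illustrative Python with arbitrary simulated data.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(14)
x = rng.normal(loc=1.0, scale=3.0, size=200)

# MLE of sigma^2 in the normal model (mu estimated by the sample mean).
sigma2_hat = np.mean((x - x.mean()) ** 2)

# Direct MLE of sigma, obtained by maximizing the profile log-likelihood in sigma.
def neg_profile_loglik(sigma):
    return 0.5 * x.size * np.log(2 * np.pi * sigma**2) + np.sum((x - x.mean()) ** 2) / (2 * sigma**2)

res = minimize_scalar(neg_profile_loglik, bounds=(1e-3, 100.0), method="bounded")

# Invariance: the direct maximizer in sigma equals the square root of sigma2_hat.
print(res.x, np.sqrt(sigma2_hat))
```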
Relation to Bayesian Inference
Maximum likelihood estimation (MLE) emerges as a special case within Bayesian inference, corresponding to the maximum a posteriori (MAP) estimate under a uniform prior distribution over the parameter space.[26] This equivalence arises because, with a flat prior, the posterior density is proportional to the likelihood function alone, so maximizing the posterior is identical to maximizing the likelihood.[27]

In certain scenarios, such as when the Jeffreys prior—a non-informative prior invariant to parameter transformations—reduces to a uniform distribution (e.g., for location parameters), the MAP estimate again coincides with the MLE.[28] However, the fundamental distinction between MLE and full Bayesian inference is the role of the prior: MLE conditions exclusively on the observed data via the likelihood, ignoring prior information, while Bayesian approaches integrate the prior, which exerts greater influence in analyses with limited data.[27]

Within Bayesian decision theory, the MAP estimate (and hence, under a flat prior, the MLE) can be motivated as the limiting Bayes estimator under a 0-1 loss function, providing a frequentist-like point estimate embedded in a Bayesian framework.[29] For predictive purposes, MLE yields point predictions by plugging the estimated parameter value into the model, whereas Bayesian methods use the entire posterior distribution to generate predictive distributions that account for parameter uncertainty.[27]

Asymptotically, the Bernstein-von Mises theorem establishes that the posterior distribution approximates a normal distribution centered at the MLE with variance matching the inverse Fisher information, bridging frequentist and Bayesian perspectives in large samples.[30]
Connection to Kullback-Leibler Divergence
The Kullback-Leibler divergence, often denoted D(P \| Q), between two probability distributions P and Q with densities p and q (with respect to a common measure) is given by

D(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx = \mathbb{E}_{P} \left[ \log \frac{dP}{dQ} \right].

This quantity is non-negative, equals zero if and only if P = Q almost everywhere, and is otherwise positive. It measures how much additional information is required to represent samples from P when using a coding scheme optimized for Q, thereby quantifying the mismatch or inefficiency in using Q to approximate P.

In statistical estimation, consider a true (unknown) data-generating distribution P and a parametric model family \{Q_\theta : \theta \in \Theta\}, where Q_\theta has density q_\theta. The maximum likelihood estimator \hat{\theta} maximizes the average log-likelihood \frac{1}{n} \sum_{i=1}^n \log q_\theta(x_i) over an i.i.d. sample x_1, \dots, x_n from P. By the law of large numbers, this empirical average converges almost surely to the population expectation \mathbb{E}_P [\log q_\theta(X)] as n \to \infty. Maximizing this expectation is equivalent to minimizing the cross-entropy H(P, Q_\theta) = -\mathbb{E}_P [\log q_\theta(X)], which decomposes as

H(P, Q_\theta) = H(P) + D(P \| Q_\theta),

where H(P) = -\mathbb{E}_P [\log p(X)] is the entropy of P and is constant with respect to \theta. Thus, under correct model specification (where P = Q_{\theta_0} for some \theta_0), the population maximizer of the expected log-likelihood is \theta_0, which uniquely minimizes D(P \| Q_\theta) over \theta \in \Theta.

This equivalence positions maximum likelihood estimation as an asymptotic minimizer of the KL divergence between the true distribution and the fitted model, providing an information-theoretic justification for why MLE selects parameters that best align the model with the underlying data-generating process in terms of distributional fidelity. The empirical negative log-likelihood acts as a Monte Carlo approximation to the cross-entropy, so the optimization implicitly targets reduced model mismatch as measured by KL divergence. This perspective has implications for model selection, where extensions such as the Akaike information criterion approximate relative KL divergences to compare competing models.
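The decomposition H(P, Q_\theta) = H(P) + D(P \| Q_\theta) can be verified numerically for discrete distributions. The sketch below is a hypothetical check with a fixed Bernoulli P and a one-parameter Bernoulli family Q_\theta; it shows that the same \theta maximizes the expected log-likelihood and minimizes the KL divergence.

```python
import numpy as np

p = 0.3                               # true Bernoulli success probability (distribution P)
theta = np.linspace(0.01, 0.99, 981)  # candidate model parameters (family Q_theta)

# Expected log-likelihood under P, cross-entropy, entropy and KL divergence.
expected_loglik = p * np.log(theta) + (1 - p) * np.log(1 - theta)
cross_entropy = -expected_loglik
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
kl = cross_entropy - entropy          # D(P || Q_theta) = H(P, Q_theta) - H(P)

# The same theta (namely theta = p) maximizes the expected log-likelihood
# and minimizes the KL divergence, illustrating the equivalence in the text.
print(theta[np.argmax(expected_loglik)], theta[np.argmin(kl)])
```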
Computational Methods
Analytical Solutions
Analytical solutions for the maximum likelihood estimator (MLE) arise when the score equations, derived by setting the derivative of the log-likelihood to zero, can be solved explicitly without iterative methods.[31] This is particularly feasible in models belonging to the exponential family, where the structure of the density facilitates closed-form expressions.[32] In such cases, the MLE equates the expected value of the sufficient statistic to its observed sample average, leveraging the family's canonical parameterization.[31]

The exponential family encompasses a broad class of distributions with density (or mass function) of the form

f(x \mid \theta) = h(x) \exp\left( \eta(\theta)^\top T(x) - A(\theta) \right),

where h(x) is the base measure, T(x) is the sufficient statistic, \eta(\theta) is the natural parameter, and A(\theta) is the log-partition function ensuring normalization.[31] For an independent and identically distributed sample x_1, \dots, x_n, the log-likelihood is \ell(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta), and the score equation \partial \ell / \partial \theta = 0 simplifies due to the exponential structure.[32] Specifically, in the natural parameterization, the MLE \hat{\eta} satisfies \mathbb{E}_\eta[T(X)] = \bar{T}, where \bar{T} = n^{-1} \sum_{i=1}^n T(x_i) is the sample average of the sufficient statistic; the mean parameter \mu = \nabla_\eta A(\eta) then yields \hat{\mu} = \bar{T}.[31] This equating of moments provides the closed-form solution when the mapping from \eta to \mu is invertible, which holds under standard regularity conditions for full exponential families.[32]

Illustrative examples highlight this tractability. For the normal distribution with known variance \sigma^2 = 1 and unknown mean \mu, the family is exponential with T(x) = x and natural parameter \eta = \mu, yielding the MLE \hat{\mu} = \bar{x}, the sample mean.[32] Similarly, for the Poisson distribution with rate \lambda, where T(x) = x and \eta = \log \lambda, the MLE is \hat{\lambda} = \bar{x}.[32] These solutions emerge directly from the moment-matching property inherent to the exponential form.[31]

Closed-form MLEs exist in exponential families when the score equation is linear in the sufficient statistic, a direct consequence of the canonical form, allowing explicit inversion for the parameter.[31] Existence requires the observed sufficient statistic to lie in the interior of the convex support of its distribution, ensuring a unique solution within the open parameter space; boundary cases, common in discrete models, may lead to non-existence or infinities.[31] However, such analytical solutions are rare outside exponential families, as the score equations generally lack the linearity needed for explicit resolution, necessitating numerical optimization.[32]
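The moment-matching characterization can be checked directly. The sketch below, assuming a Poisson model in its natural parameterization \eta = \log \lambda, solves \mathbb{E}_\eta[T(X)] = \bar{T} numerically with a bracketed root-finder and recovers the closed-form \hat{\lambda} = \bar{x}; it is an illustration, not a general-purpose routine.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
x = rng.poisson(2.5, size=1000)      # i.i.d. Poisson sample, T(x) = x
t_bar = x.mean()                     # observed average of the sufficient statistic

# For the Poisson family, A(eta) = exp(eta) and E_eta[T(X)] = A'(eta) = exp(eta).
# The MLE solves the moment-matching equation exp(eta) = t_bar.
eta_hat = brentq(lambda eta: np.exp(eta) - t_bar, -10.0, 10.0)

# Mapping back to the mean parameterization gives lambda_hat = exp(eta_hat) = x_bar.
print(np.exp(eta_hat), t_bar)
```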
Gradient-Based Methods
Gradient-based methods are employed for maximum likelihood estimation (MLE) when the likelihood lacks a closed-form maximizer, relying on iterative optimization that ascends the log-likelihood surface using first-order derivatives. The gradient of the log-likelihood, known as the score function, gives the direction of steepest ascent and guides parameter updates toward regions of higher likelihood.

The fundamental algorithm is gradient ascent, with the parameter update rule

\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \alpha_k \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}_k; \mathbf{x}),

where \ell(\boldsymbol{\theta}; \mathbf{x}) denotes the log-likelihood function, \nabla_{\boldsymbol{\theta}} \ell is its gradient with respect to the parameters \boldsymbol{\theta}, \alpha_k > 0 is the step size at iteration k, and \mathbf{x} represents the observed data. The step size \alpha_k can be fixed or determined via line search to ensure a sufficient increase in the log-likelihood at each step, balancing progress and stability. This approach is particularly useful for complex models in which the score function can be computed analytically or approximated numerically.[33]

For large datasets, full-batch gradient ascent becomes computationally prohibitive, prompting the use of stochastic gradient ascent, which approximates the true gradient using mini-batches of data. In this variant, the update replaces the full gradient with an unbiased estimate from a random subset of observations, enabling scalable optimization while introducing noise that can help escape poor local maxima in non-convex landscapes. This stochastic approximation traces back to foundational work on iterative methods for root-finding in expectation.[33][34][35]

For strictly log-concave likelihoods (equivalently, a convex negative log-likelihood), gradient-based methods converge to the global maximum, with an O(1/k) rate in the objective for deterministic updates and comparable guarantees under appropriate step-size schedules for stochastic variants. In multimodal likelihoods, common in high-dimensional or misspecified models, convergence is guaranteed only to a local maximum, potentially yielding suboptimal estimates depending on the initialization. Adaptive methods, such as the Adam optimizer, adjust step sizes dynamically based on historical gradient moments, often improving convergence in practice for stochastic MLE.[36][37]
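A minimal gradient-ascent sketch, assuming a logistic-regression log-likelihood whose score takes the form X^T(y - p); the data, fixed step size, and iteration count are illustrative choices (a line search would adapt the step automatically).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def log_likelihood(beta):
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def score(beta):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)              # gradient (score) of the log-likelihood

beta = np.zeros(d)
step = 0.002                          # fixed step size; must be small relative to the curvature
for _ in range(5000):
    beta = beta + step * score(beta)  # gradient-ascent update theta_{k+1} = theta_k + alpha * grad

print(beta, log_likelihood(beta))     # estimates approach beta_true as iterations proceed
```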
Newton-Type Methods
Newton-type methods use second-order information from the log-likelihood function to achieve faster convergence in maximum likelihood estimation than first-order approaches. These methods approximate the maximum of the log-likelihood \ell(\theta) by iteratively updating the parameter estimate \theta using both the score function S(\theta) = \partial \ell / \partial \theta and the Hessian matrix H(\theta) = \partial^2 \ell / \partial \theta \partial \theta^T, or approximations thereof. The curvature captured by the Hessian allows for quadratic convergence near the optimum, making these methods particularly effective when the log-likelihood is twice differentiable and its Hessian is negative definite at the maximum.

The Newton-Raphson method, a foundational second-order technique, updates the parameter estimate as \theta_{k+1} = \theta_k - H(\theta_k)^{-1} S(\theta_k), or equivalently \theta_{k+1} = \theta_k + J(\theta_k)^{-1} S(\theta_k), where J(\theta) = -H(\theta) is the observed information matrix. The update solves a quadratic approximation of the log-likelihood at each step, effectively finding a root of the score equation S(\theta) = 0. The method originates from general root-finding algorithms but has been widely applied to MLE since the early 20th century.[38]

Fisher's scoring method modifies the Newton-Raphson update by replacing the observed information with the expected Fisher information matrix, yielding \theta_{k+1} = \theta_k + I(\theta_k)^{-1} S(\theta_k), where I(\theta) = -E[H(\theta)]. Introduced by Ronald A. Fisher in 1925, this approach ensures the approximating matrix is always positive semi-definite under regularity conditions, improving stability for likelihood functions whose observed Hessian may be indefinite or computationally expensive to evaluate exactly. The Fisher information matrix, central to asymptotic MLE theory, quantifies the amount of information the data provide about \theta.[10]

Quasi-Newton methods, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, address the computational cost of forming and inverting the full Hessian at each iteration by maintaining a low-rank approximation to the inverse Hessian, updated using gradient differences from successive iterations. In the context of MLE, BFGS iteratively refines this approximation to mimic the Newton update without requiring second derivatives, making it suitable for high-dimensional problems. Developed independently in 1970 by Broyden, Fletcher, Goldfarb, and Shanno, BFGS has become a standard for unconstrained optimization in statistical software due to its balance of efficiency and accuracy.

Under standard regularity conditions—such as the log-likelihood being twice continuously differentiable, the Hessian being negative definite near the maximum, and a suitable initial guess—Newton-type methods exhibit quadratic convergence, meaning the error shrinks quadratically as the iterates approach the maximum likelihood estimate. However, global convergence is not guaranteed, and the methods can fail if started far from the optimum or if the Hessian becomes singular. To mitigate these issues, safeguards such as damping (e.g., adding a Levenberg-Marquardt regularization term \lambda I) or line searches are incorporated, ensuring ascent along the update direction while controlling step sizes. These modifications maintain superlinear convergence in practice for many MLE problems.
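The following sketch applies the Newton update to an illustrative logistic-regression log-likelihood (data and dimensions are made up). For the canonical logit link the observed and expected information coincide, so each step is also a Fisher-scoring step; a damped or line-search variant would add robustness far from the optimum.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):                       # Newton iterations
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    score = X.T @ (y - p)                 # gradient of the log-likelihood
    # For the canonical (logit) link the observed and expected information agree:
    info = X.T @ (X * (p * (1 - p))[:, None])
    beta = beta + np.linalg.solve(info, score)   # theta_{k+1} = theta_k + I^{-1} S

print(beta)                               # close to beta_true for moderate n
```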
Expectation-Maximization Algorithm
The Expectation-Maximization (EM) algorithm provides an iterative approach to computing maximum likelihood estimates when the observed data are incomplete, such as in models involving latent variables or missing observations. Introduced by Dempster, Laird, and Rubin, it addresses the challenge of directly maximizing the observed-data log-likelihood by instead working with an auxiliary function derived from the complete-data log-likelihood.[39] The method is particularly useful when the likelihood involves intractable integrals over unobserved components, and its two-step optimization guarantees non-decreasing values of the observed log-likelihood at each iteration.[39]

Consider a statistical model in which the complete data consist of observed variables x and latent variables z, with joint density f(x, z \mid \theta). The complete-data log-likelihood is

\ell_c(\theta; x, z) = \log f(x, z \mid \theta).

The observed-data log-likelihood, which must be maximized for MLE, is the marginal

\ell_o(\theta; x) = \log \int f(x, z \mid \theta) \, dz,

but direct maximization is often infeasible because of the integral. Instead, EM works with the conditional expectation

Q(\theta \mid \theta'; x) = \mathbb{E}_{z \mid x, \theta'} [\ell_c(\theta; x, z)],

which, up to an additive term that does not depend on \theta, provides a lower bound on \ell_o(\theta; x) via Jensen's inequality, with equality at \theta = \theta'.[39]

The algorithm proceeds in two steps, initialized with some \theta^{(0)}. In the E-step (expectation), compute the expected complete-data log-likelihood given the current parameter estimate:

Q(\theta \mid \theta^{(k)}; x) = \mathbb{E}_{z \mid x, \theta^{(k)}} [\ell_c(\theta; x, z)].

This step imputes the missing or latent data by averaging over their conditional distribution under \theta^{(k)}. In the M-step (maximization), update the parameters by maximizing this auxiliary function:

\theta^{(k+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(k)}; x).

The algorithm alternates between these steps until convergence; each iteration yields \ell_o(\theta^{(k+1)}; x) \geq \ell_o(\theta^{(k)}; x), ensuring a monotonic increase in the observed log-likelihood.[39]

Under standard regularity conditions, the sequence \{\theta^{(k)}\} converges to a stationary point of \ell_o(\theta; x), where the gradient of \ell_o vanishes, although the specific limit depends on the initialization and may be only a local maximum.[39] The rate of convergence is typically linear, with the asymptotic rate governed by the fraction of missing information relative to the complete-data information.[39]

A classic application of EM arises in fitting Gaussian mixture models, where the observed data x = \{x_1, \dots, x_n\} are assumed to arise from a mixture of K Gaussians and the latent variables z_i indicate the component assignment of each x_i. The complete-data log-likelihood treats the assignments as known, while the E-step computes the posterior probabilities (responsibilities) \gamma_{ij}^{(k)} = P(z_i = j \mid x_i, \theta^{(k)}) for component j and observation i. The M-step then updates the mixing proportions \pi_j, means \mu_j, and covariances \Sigma_j as weighted averages using these \gamma_{ij}^{(k)}, for example

\pi_j^{(k+1)} = \frac{1}{n} \sum_{i=1}^n \gamma_{ij}^{(k)}, \quad \mu_j^{(k+1)} = \frac{\sum_{i=1}^n \gamma_{ij}^{(k)} x_i}{\sum_{i=1}^n \gamma_{ij}^{(k)}}.

This iterative weighting effectively soft-clusters the data and converges to a local MLE for the mixture parameters.[40]
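A compact EM sketch for a one-dimensional two-component Gaussian mixture, following the responsibility/update structure described above; the synthetic data, initialization, and fixed iteration count are illustrative choices, and a convergence check on the log-likelihood would normally replace the fixed loop.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
# Synthetic data from a two-component univariate Gaussian mixture.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# Initial parameter guesses.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(200):
    # E-step: responsibilities gamma_ij = P(z_i = j | x_i, current parameters).
    dens = np.stack([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted updates of mixing proportions, means, and standard deviations.
    nk = gamma.sum(axis=0)
    pi = nk / x.size
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)   # approximate the mixture used to generate the data
```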
Applications and Examples
Bernoulli and Binomial Distributions
A prominent application of maximum likelihood estimation arises with binary data modeled by the Bernoulli distribution, which describes a single trial with success probability p: the outcome is success (1) with probability p or failure (0) with probability 1-p. For a sample of n independent and identically distributed Bernoulli random variables x_1, \dots, x_n, the likelihood function is L(p) = p^{\sum x_i} (1-p)^{n - \sum x_i}. The maximum likelihood estimator of p is \hat{p} = \frac{1}{n} \sum_{i=1}^n x_i, the sample proportion of successes.[3]

The binomial distribution extends the Bernoulli model to n fixed trials, where the number of successes k follows a Binomial(n, p) distribution with probability mass function P(K = k) = \binom{n}{k} p^k (1-p)^{n-k}. For an observed k, the likelihood is L(p) = p^k (1-p)^{n-k}, disregarding the binomial coefficient because it does not depend on p. Since the total number of successes is a sufficient statistic for p, the maximum likelihood estimator is again \hat{p} = \frac{k}{n}, matching the Bernoulli-sample result.[3]

To derive this estimator, consider the log-likelihood \ell(p) = k \log p + (n - k) \log (1 - p). Differentiating with respect to p yields \frac{d\ell}{dp} = \frac{k}{p} - \frac{n - k}{1 - p}; setting this equal to zero and solving gives \frac{k}{p} = \frac{n - k}{1 - p}, so k(1 - p) = (n - k)p, which simplifies to k - kp = np - kp, or k = np, hence \hat{p} = \frac{k}{n}. This solution maximizes the likelihood, as confirmed by the second derivative being negative for 0 < p < 1.[3]

The estimator \hat{p} = \frac{k}{n} has an intuitive interpretation: it uses the observed proportion of successes to estimate the underlying probability p, thereby selecting the parameter value that renders the observed data most probable under the model. Ronald Fisher used this binomial example in his foundational work to illustrate the principles of maximum likelihood, contrasting the approach with Bayesian alternatives.[3]
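As a numerical sanity check on the derivation, the sketch below maximizes the binomial log-likelihood with a generic bounded optimizer and compares the result with the closed-form k/n; the trial and success counts are made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, k = 40, 13                        # hypothetical number of trials and successes

def neg_log_likelihood(p):
    # Binomial log-likelihood up to the constant log C(n, k).
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / n)                  # numerical maximizer vs. closed-form k/n
```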
Normal Distribution
The maximum likelihood estimation (MLE) of the parameters of the univariate normal distribution \mathcal{N}(\mu, \sigma^2), based on an independent and identically distributed sample x_1, \dots, x_n of size n, yields closed-form solutions that are both intuitive and efficient. This approach was formalized by Ronald A. Fisher in his seminal work on statistical estimation.[41]

The likelihood function is

L(\mu, \sigma^2) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right).[42]

To derive the MLEs, consider the profile likelihood method. First, for fixed \sigma^2, maximize the likelihood with respect to \mu by setting the partial derivative of the log-likelihood to zero, which reduces to solving \sum_{i=1}^n (x_i - \mu) = 0 and yields the profile estimator \hat{\mu} = \bar{x}, the sample mean.[43] Substituting \hat{\mu} back into the likelihood produces the profile likelihood for \sigma^2, which is then maximized by setting the corresponding partial derivative of the log-likelihood to zero, giving \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2.[42]

The MLEs are thus \hat{\mu} = \bar{x} and \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2. While \hat{\mu} is unbiased, the variance estimator \hat{\sigma}^2 is biased, satisfying E[\hat{\sigma}^2] = \frac{n-1}{n} \sigma^2.[43] In contrast, the unbiased sample variance is s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2, which corrects for this bias by using the divisor n-1.[42] The normal distribution belongs to the exponential family, a property that underlies the existence of these explicit closed-form MLEs.[44]
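The closed-form expressions and the n versus n-1 distinction can be checked on simulated data; the sketch below, with arbitrary true parameters, compares the MLE of \sigma^2 with the unbiased sample variance.

```python
import numpy as np

rng = np.random.default_rng(7)
mu_true, sigma_true, n = 5.0, 2.0, 30
x = rng.normal(mu_true, sigma_true, size=n)

mu_hat = x.mean()                               # MLE of mu: the sample mean
sigma2_mle = np.mean((x - mu_hat) ** 2)         # MLE of sigma^2 (divisor n)
sigma2_unbiased = x.var(ddof=1)                 # unbiased estimator (divisor n-1)

# The two variance estimates differ exactly by the factor (n-1)/n.
print(mu_hat, sigma2_mle, sigma2_unbiased, sigma2_mle / sigma2_unbiased, (n - 1) / n)
```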
Exponential Family Distributions
The exponential family of distributions provides a unifying framework for many common probability distributions, including those for count and rate data, and maximum likelihood estimation (MLE) within this family leverages the structure of sufficient statistics to simplify parameter estimation. In the canonical form, a distribution belongs to the exponential family if its probability density or mass function can be expressed as p(x \mid \eta) = h(x) \exp\left( \eta T(x) - A(\eta) \right), where \eta is the natural (canonical) parameter, T(x) is the sufficient statistic, h(x) is the base measure, and A(\eta) is the log-partition function ensuring normalization. For independent and identically distributed observations x_1, \dots, x_n, the MLE \hat{\eta} satisfies the equation \sum_{i=1}^n T(x_i) = n \mathbb{E}[T(X) \mid \eta], which equates the observed sufficient statistic to its expected value under the model; this equation can be solved analytically or numerically depending on the form of A(\eta).

A prominent example is the Poisson distribution, which models count data and belongs to the exponential family with canonical parameter \eta = \log \lambda, sufficient statistic T(x) = x, and log-partition function A(\eta) = e^\eta. The likelihood function for a sample x_1, \dots, x_n is L(\lambda) = e^{-n\lambda} \lambda^{\sum x_i} / \prod x_i!, and maximizing it yields the closed-form MLE \hat{\lambda} = \bar{x}, the sample mean, which directly matches the observed mean count to the model's expected value \mathbb{E}[X] = \lambda. The estimator is derived analytically by setting the score function to zero and confirming a maximum via the second derivative test.[45]

The gamma distribution, used for modeling positive continuous data such as waiting times or sizes, is a two-parameter exponential family with shape \alpha and scale \beta (or rate 1/\beta), whose sufficient statistics are \sum x_i and \sum \log x_i. Unlike the Poisson case, the MLE of (\alpha, \beta) has no closed-form solution and requires iterative numerical methods, such as Newton-Raphson, to solve the coupled score equations \psi(\hat{\alpha}) + \ln \hat{\beta} = \overline{\ln x} and \hat{\beta} = \bar{x} / \hat{\alpha}, where \psi is the digamma function and \overline{\ln x} = n^{-1} \sum_{i=1}^n \ln x_i. These equations follow from the exponential family structure but demand numerical computation because of the non-linear dependence on \alpha. Efficient fixed-point iterations or expectation-maximization variants converge quickly for this purpose.[46]

In general, for exponential families, MLE reduces to moment matching via sufficient statistics, providing efficient estimators that achieve the Cramér-Rao lower bound under regularity conditions, as the observed information matrix relates directly to the variance of T(X). This approach extends the analytical solutions discussed earlier to broader families handling diverse data types such as counts (Poisson) or durations (gamma).
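Solving the gamma score equations requires iteration. The sketch below, assuming the shape/scale parameterization above, profiles the scale out (\hat{\beta} = \bar{x}/\hat{\alpha}) and solves the remaining one-dimensional equation; for robustness it uses a bracketed root-finder rather than the Newton iteration mentioned in the text, and the true parameters are arbitrary.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

rng = np.random.default_rng(8)
alpha_true, scale_true = 3.0, 2.0
x = rng.gamma(alpha_true, scale_true, size=2000)

# Profile the scale out of the score equations: beta = x_bar / alpha, so alpha solves
#   log(alpha) - digamma(alpha) = log(x_bar) - mean(log x).
s = np.log(x.mean()) - np.mean(np.log(x))
alpha_hat = brentq(lambda a: np.log(a) - digamma(a) - s, 1e-6, 1e6)
beta_hat = x.mean() / alpha_hat     # MLE of the scale given alpha_hat

print(alpha_hat, beta_hat)          # close to (3.0, 2.0) for this sample size
```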
Multivariate and Regression Models
Maximum likelihood estimation extends naturally to multivariate settings in which observations are vectors of random variables, allowing the estimation of vector-valued parameters such as mean vectors and covariance matrices under independence across observations. In these models, the likelihood function is the product of the joint densities of the individual observations, and its maximization yields estimators that account for correlations among the variables.[47]

A prominent example is the multivariate normal distribution, where independent observations \mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^p are drawn from \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}). The maximum likelihood estimator of the mean vector is the sample mean \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i, and that of the covariance matrix is \hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top. These estimators are derived by writing the log-likelihood, differentiating with respect to \boldsymbol{\mu} and \boldsymbol{\Sigma}, and solving the resulting score equations.[47]

In regression models, MLE is applied to estimate parameters relating covariates to responses, assuming independence across data points. For ordinary linear regression with normal errors, the model is y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i with \epsilon_i \sim \mathcal{N}(0, \sigma^2) independently. The maximum likelihood estimator of the coefficient vector is \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}, obtained by maximizing the Gaussian likelihood, and it coincides with the ordinary least squares estimator under this error assumption.[48] This equivalence underscores how MLE provides a probabilistic foundation for least squares when normality holds.[49]

For binary response data, logistic regression models the probability of success via the logit link: P(y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\mathbf{x}_i^\top \boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i^\top \boldsymbol{\beta})}. The parameters \boldsymbol{\beta} are estimated iteratively by maximum likelihood, since no closed-form solution exists; this involves solving the score equations \sum_{i=1}^n (y_i - \hat{p}_i) \mathbf{x}_i = \mathbf{0}, where \hat{p}_i is the fitted probability, typically via Newton-Raphson or iteratively reweighted least squares.[50] This approach, introduced for binary sequences, ensures consistent estimation of regression effects in classification tasks.[51]

Generalized linear models (GLMs) unify these regression frameworks by specifying the response distribution from an exponential family, a linear predictor \eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}, and a link function g(\mu_i) = \eta_i. The likelihood takes the form

L(\boldsymbol{\beta}) = \exp\left\{ \sum_{i=1}^n \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right] \right\},

where \theta_i is the natural parameter, b the cumulant function, a(\phi) a dispersion term, and c a normalization term.[52] Parameter estimation in GLMs proceeds by iteratively reweighted least squares (equivalently, Fisher scoring applied to this likelihood), accommodating diverse responses such as Gaussian for linear regression and binomial for logistic regression.[52]
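Referring to the linear-regression case above, the sketch below maximizes the Gaussian log-likelihood of a linear model numerically and compares the coefficients with the closed-form least-squares solution; the data, dimensions, and optimizer settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

def neg_log_likelihood(params):
    beta, log_sigma = params[:d], params[d]
    sigma2 = np.exp(2 * log_sigma)                 # parameterize sigma on the log scale
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

res = minimize(neg_log_likelihood, x0=np.zeros(d + 1), method="BFGS")
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)       # closed-form least-squares solution

print(res.x[:d])    # coefficients from the numerically maximized Gaussian likelihood
print(beta_ols)     # identical (up to optimizer tolerance) to the OLS coefficients
```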
Extensions and Limitations
Dependent Observations
In cases where observations are not independent and identically distributed (iid), the standard maximum likelihood framework must be adapted to account for the dependence structure specified by the model.[53]

For Markov chains, the likelihood function exploits the Markov property to factorize the joint distribution into the initial state probability and a product of conditional transition probabilities. Specifically, for a sequence of observations x_1, x_2, \dots, x_n from a discrete-time Markov chain with parameter \theta, the likelihood is

L(\theta) = p(x_1 \mid \theta) \prod_{i=2}^n p(x_i \mid x_{i-1}, \theta),

where p(x_1 \mid \theta) is the marginal probability of the first observation and p(x_i \mid x_{i-1}, \theta) denotes the transition probability.[54] The maximum likelihood estimator \hat{\theta} is obtained by maximizing this likelihood, often via direct counting of transitions for finite-state chains, yielding consistent estimates under appropriate regularity conditions.[55]

A prominent example is the autoregressive model of order 1 (AR(1)), where x_t = \phi x_{t-1} + \epsilon_t with \epsilon_t \sim \mathcal{N}(0, \sigma^2) independent errors and |\phi| < 1 for stationarity. The exact likelihood involves the joint distribution of all observations, but because of the more complicated marginal for x_1, a conditional likelihood is commonly maximized instead, conditioning on the initial observation x_1:

L_c(\phi, \sigma^2 \mid x_1) = \prod_{t=2}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_t - \phi x_{t-1})^2}{2\sigma^2} \right).

Maximizing L_c leads to the estimators \hat{\phi} = \sum_{t=2}^n x_t x_{t-1} / \sum_{t=2}^n x_{t-1}^2 and \hat{\sigma}^2 = \sum_{t=2}^n (x_t - \hat{\phi} x_{t-1})^2 / (n-1), which coincide with least squares estimates and are asymptotically efficient for large n.[56][57]

For more general dependent data, such as spatial or complex temporal processes, computing the full joint likelihood can be intractable because of the high-dimensional integration required. In such scenarios, approximate methods like composite likelihood are employed, which multiply marginal or conditional likelihoods for subsets of the data (e.g., pairwise or higher-order blocks) to form a tractable objective function. For instance, a pairwise composite likelihood combines the likelihood contributions from all pairs (x_i, x_j) with i < j, providing a consistent estimator that approximates full maximum likelihood under suitable dependence structures.[58]

Adapting MLE to dependent observations introduces challenges, including substantially higher computational demands for evaluating the likelihood or its gradients, which scale poorly with data size and dependence range. Additionally, asymptotic consistency of the MLE requires conditions such as \alpha-mixing or \beta-mixing, ensuring that the dependence decays quickly enough for information to accumulate much as in the iid case.[53][59]
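The conditional-likelihood estimators for the AR(1) model can be computed directly from the closed forms above; the sketch simulates a stationary series with illustrative parameter values and compares \hat{\phi} and \hat{\sigma}^2 with the truth.

```python
import numpy as np

rng = np.random.default_rng(10)
phi_true, sigma_true, n = 0.7, 1.0, 2000

# Simulate a stationary AR(1) series x_t = phi * x_{t-1} + eps_t.
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal(scale=sigma_true)

# Conditional MLE (conditioning on x_1): least-squares-type closed forms.
phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
sigma2_hat = np.sum((x[1:] - phi_hat * x[:-1]) ** 2) / (n - 1)

print(phi_hat, sigma2_hat)   # close to (0.7, 1.0)
```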
Model Misspecification and Robustness
In maximum likelihood estimation (MLE), model misspecification occurs when the assumed parametric family does not contain the true data-generating distribution, in which case the MLE \hat{\theta} converges in probability to a pseudo-true parameter \theta^* that minimizes the Kullback-Leibler divergence from the true distribution to the misspecified model.[60] This pseudo-true value \theta^* represents the best approximation within the incorrect model family, and the estimator remains consistent for this target despite the misspecification.[60] The resulting procedure is termed quasi-maximum likelihood estimation (QMLE), which retains desirable properties such as consistency and asymptotic normality under mild regularity conditions, even when the likelihood is incorrect.[60]

For inference under misspecification, the standard asymptotic variance formula based on the inverse Fisher information matrix fails because it assumes the model is correctly specified.[61] Instead, the sandwich estimator provides a robust covariance matrix for \hat{\theta}, approximated as

I(\hat{\theta})^{-1} J(\hat{\theta}) I(\hat{\theta})^{-1},

where I(\theta) is the expected negative Hessian of the log-likelihood (information matrix) and J(\theta) is the covariance matrix of the score function.[60] This estimator accounts for the discrepancy between the model's assumed variance structure and the true one, yielding valid standard errors and confidence intervals.[60] Under correct model specification, the sandwich estimator reduces to the conventional inverse information matrix.[61]

To enhance robustness against misspecification or outliers, M-estimators generalize MLE by minimizing an objective function built from a robust \rho-function rather than the negative log-likelihood, which bounds the influence of aberrant observations.[62] These estimators solve \sum \psi(y_i; \theta) = 0, where \psi is the derivative of \rho and is designed to have a redescending influence function, thereby limiting the leverage of data points far from the model.[62] In the context of misspecified models, M-estimators maintain consistency for \theta^* while improving efficiency relative to standard QMLE in contaminated settings.[62]

The implications of model misspecification in QMLE include potential bias in predictions or interpretations that rely on the true parameter, since \theta^* may deviate systematically from the actual value.[60] Nonetheless, QMLE and its robust variants prioritize consistency for the pseudo-true parameter over asymptotic efficiency, making them valuable when the goal is approximate inference rather than optimal performance under an exact model.[60] This approach is widely applied in econometrics and statistics where full model correctness is unattainable, trading some efficiency for reliability.[60]
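A minimal sandwich-covariance sketch, assuming a deliberately misspecified setting: the errors are heteroskedastic but the fit uses an ordinary homoskedastic Gaussian likelihood (so the coefficient estimates are just OLS). The robust bread-meat-bread covariance is compared with the naive model-based one; this follows the standard heteroskedasticity-robust construction and is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

# Heteroskedastic errors: the homoskedastic Gaussian likelihood is misspecified.
errors = rng.normal(scale=0.5 + np.abs(X[:, 1]), size=n)
y = X @ beta_true + errors

# Quasi-MLE of the regression coefficients (identical to OLS for this model).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Naive model-based covariance (inverse information scaled by the residual variance).
sigma2_hat = resid @ resid / n
bread = np.linalg.inv(X.T @ X)
naive_cov = sigma2_hat * bread

# Sandwich (robust) covariance: bread * meat * bread, with the meat built from
# the per-observation score contributions x_i * e_i.
meat = (X * resid[:, None]).T @ (X * resid[:, None])
sandwich_cov = bread @ meat @ bread

print(np.sqrt(np.diag(naive_cov)))      # standard errors that ignore the misspecification
print(np.sqrt(np.diag(sandwich_cov)))   # heteroskedasticity-robust standard errors
```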
Bias Correction and Second-Order Properties
In finite samples, maximum likelihood estimators (MLEs) exhibit bias of order O(1/n), where n is the sample size, arising from the asymmetry of the estimator's sampling distribution. This finite-sample bias can be approximated using higher-order asymptotic expansions, such as \mathbb{E}[\hat{\theta} - \theta_0] \approx \frac{1}{2n} \operatorname{tr}(I(\theta_0)^{-1} K(\theta_0)), where I(\theta_0) is the Fisher information matrix and K(\theta_0) is a tensor capturing third-order moments of the score function.[63]

Several methods exist to correct this bias. Analytical approaches provide explicit formulas: for single-parameter models, Bartlett derived a bias-correction term involving the third derivative of the log-likelihood, while Cox and Snell extended this to multiparameter cases using contractions of the inverse information matrix with third- and mixed second/third-order derivatives of the log-likelihood.[64][63] Resampling techniques offer model-free alternatives: the jackknife estimates the bias by systematically omitting observations and averaging the resulting estimators, while the bootstrap approximates the bias by resampling with replacement from the data and comparing the resampled MLEs with the original estimate. Jackknife and bootstrap variance estimates complement these bias adjustments by providing more accurate uncertainty quantification in finite samples.

Applying a bias correction yields second-order properties for the MLE: the corrected estimator attains mean squared error optimality up to O(1/n^2) under regularity conditions, improving upon first-order asymptotic efficiency.[65] This higher-order refinement gives better finite-sample performance, particularly in small or moderately sized datasets.[63]

Bias in parameter estimation must be distinguished from prediction bias, which arises when the estimated model is used for forecasting and incorporates additional variability from the estimation error; the corrections above target the parameter bias directly without altering predictive distributions.[63]
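Bootstrap bias estimation can be sketched concretely for the normal-variance MLE, whose bias is known to be -\sigma^2/n; the correction subtracts the average bootstrap discrepancy from the original estimate. The sample size and replication count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(12)
sigma_true, n, B = 2.0, 20, 2000
x = rng.normal(0.0, sigma_true, size=n)

def sigma2_mle(sample):
    return np.mean((sample - sample.mean()) ** 2)   # biased MLE of the variance

theta_hat = sigma2_mle(x)

# Bootstrap bias estimate: average of resampled estimates minus the original estimate.
boot = np.array([sigma2_mle(rng.choice(x, size=n, replace=True)) for _ in range(B)])
bias_hat = boot.mean() - theta_hat

theta_corrected = theta_hat - bias_hat              # bias-corrected estimator
# For reference, the unbiased sample variance uses the divisor n-1.
print(theta_hat, theta_corrected, x.var(ddof=1))
```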
Historical Development
Early Foundations
In the 18th century, the foundations of probabilistic inference were laid through the development of inverse probability by Thomas Bayes and Pierre-Simon Laplace, which primarily emphasized the estimation of posterior distributions rather than direct maximization of likelihoods. Bayes's posthumously published work in 1763 introduced the concept of updating probabilities based on observed evidence, framing estimation in terms of conditional probabilities that would later influence likelihood-based methods.[66] Laplace extended this in his 1774 memoir, applying inverse probability to problems of causation from events and seeking posterior modes or medians under uniform priors, though his approach remained rooted in Bayesian updating without isolating the likelihood function as a standalone criterion.[67] These efforts established a probabilistic framework for parameter estimation but focused on full posterior inference rather than the data-dependent likelihood alone.[66]

The 19th century saw further precursors in Carl Friedrich Gauss's work on least squares estimation, which served as a prototype for maximum likelihood under normal error assumptions. In his 1809 treatise Theoria Motus Corporum Coelestium, Gauss derived the least squares method by maximizing the probability of observations assuming Gaussian errors and a uniform prior on the parameters, effectively computing the posterior mode that aligns with the maximum likelihood estimator for the normal distribution.[67] This approach treated the arithmetic mean as the most probable value for the location parameter, bridging astronomical data fitting with probabilistic principles, though Gauss did not generalize it beyond normals or explicitly separate likelihood from priors.[66] Laplace built on similar ideas around 1812, using empirical moments for confidence limits, but the emphasis remained on inverse methods rather than pure likelihood maximization.[67]

Francis Ysidro Edgeworth advanced these ideas in his 1884 essay "The Philosophy of Chance," where he explored the determination of most probable values through probabilistic reasoning, distinguishing between objective observational errors and subjective priors in estimation.[68] Edgeworth's analysis highlighted the role of joint probability maximization for parameters like means and variances, prefiguring likelihood principles without formalizing them as a general estimation method, and he critiqued overly speculative Bayesian elements in favor of data-driven probable values.[66] Despite these developments, no unified maximum likelihood framework emerged in the 19th century; instead, Karl Pearson's introduction of the method of moments in 1894 provided an alternative estimation technique, equating population moments to sample moments for parameter recovery, which gained prominence for its simplicity in fitting non-normal distributions like those in evolutionary biology. This method contrasted with probabilistic maximization by prioritizing empirical moments over probability densities, reflecting the era's preference for moment-based over likelihood-oriented approaches.[67]
Key Advancements in the 20th Century
Ronald A. Fisher developed the foundations of maximum likelihood estimation through a series of works in the 1910s, beginning with a 1912 numerical procedure for parameter estimation in biometric contexts. Over the subsequent decade, Fisher refined these ideas, introducing the concept of "likelihood" in 1921 and incorporating principles of sufficiency and efficiency, which addressed limitations in earlier methods and laid the groundwork for a general estimation theory.[4]

The term "maximum likelihood" was formally introduced by Fisher in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics," where he presented it as a general principle for parameter estimation that maximizes the probability of observing the given data under the assumed model.[3] Fisher argued that this method yields estimators with desirable properties, such as sufficiency, laying the groundwork for modern statistical inference.[3]

In the 1930s, Jerzy Neyman and Egon Pearson advanced the theoretical properties of maximum likelihood estimation through their work on hypothesis testing, particularly via the likelihood ratio test, which provided a framework for evaluating the efficiency of MLE-based procedures.[69] Abraham Wald further linked MLE to decision theory in the 1940s, proving its asymptotic sufficiency in 1943 and consistency in 1949, which addressed earlier concerns about its reliability under certain conditions and integrated it into broader frameworks for statistical decision-making.

The 1970s saw significant computational advancements with the introduction of the expectation-maximization (EM) algorithm by Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin in 1977, which enabled practical maximum likelihood estimation in models with missing or latent data by iteratively maximizing a lower bound on the likelihood function.[39]

Post-1980 developments included Halbert White's 1982 analysis of MLE under model misspecification, demonstrating that while estimators may converge to pseudo-true values, their asymptotic covariance can be consistently estimated using sandwich estimators, enhancing robustness in econometric applications.[60] Concurrently, the integration of MLE routines into statistical software packages, such as SAS's PROC LOGIST in the early 1980s, facilitated widespread adoption by automating numerical optimization for complex models.[70]