Gauss–Markov theorem
The Gauss–Markov theorem is a cornerstone result in mathematical statistics which asserts that, under specified assumptions in a linear regression model, the ordinary least squares (OLS) estimator of the regression coefficients is the best linear unbiased estimator (BLUE), possessing the minimum variance among all linear unbiased estimators.[1] The theorem thus establishes the optimality of OLS within the class of linear unbiased estimators, making it a foundational justification for the use of OLS in estimating linear models.[2]

The theorem is named after Carl Friedrich Gauss and Andrey Markov, though its origins predate the formal naming. Gauss first proved the key result in his 1821 work Theoria combinationis observationum erroribus minimis obnoxiæ, demonstrating that least squares yields unbiased estimates with minimum variance for linear models of astronomical observations.[3][4] Markov independently rediscovered and generalized the theorem in his 1900 textbook on probability theory, extending it to broader statistical contexts.[3] By the 1930s the result was widely recognized in the statistical literature, initially often attributed solely to Markov before being credited jointly as the Gauss–Markov theorem.

Central to the theorem are four classical assumptions that ensure OLS achieves BLUE status: (1) the model is linear in the parameters, expressed as y = X\beta + \epsilon; (2) the errors \epsilon have zero conditional mean (strict exogeneity); (3) the errors are homoskedastic with constant variance \sigma^2 and uncorrelated (i.e., \text{Var}(\epsilon) = \sigma^2 I); and (4) the design matrix X has full column rank, which ensures identifiability.[1] The proof proceeds by showing that the OLS estimator \hat{\beta} = (X^T X)^{-1} X^T y is linear and unbiased, and that its covariance matrix \sigma^2 (X^T X)^{-1} is smaller, in the positive semi-definite sense, than that of any other linear unbiased estimator.[2] Notably, the theorem does not require normality of the errors for the BLUE property, though normality is often assumed for additional inference procedures.[5]

The Gauss–Markov theorem underpins much of modern regression analysis, particularly in econometrics, where it validates OLS as an efficient estimator for policy evaluation and forecasting under the classical linear model assumptions.[5] Violations of these assumptions, such as heteroskedasticity or autocorrelation, necessitate alternative estimators such as generalized least squares, but the theorem remains a benchmark for assessing estimator performance.[6] Its enduring influence highlights the balance between bias, variance, and linearity in statistical estimation.[7]
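The BLUE property can be checked numerically. The following sketch is a minimal illustration rather than part of the theorem itself: the design matrix, coefficients, sample size, and error distribution are arbitrary choices, and the errors are deliberately drawn from a non-normal distribution to underline that normality is not required. It simulates the model y = X\beta + \epsilon under the classical assumptions and compares the Monte Carlo mean and covariance of the OLS estimator with \beta and \sigma^2 (X^T X)^{-1}.

```python
import numpy as np

# Simulation sketch of the classical linear model y = X beta + eps with
# E[eps] = 0 and Var(eps) = sigma^2 I; all numerical values are arbitrary.
rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # full column rank
beta = np.array([1.0, -0.5, 2.0])

reps = 5000
estimates = np.empty((reps, p))
for r in range(reps):
    # Non-normal (uniform) errors with variance sigma^2: normality is not needed for BLUE.
    eps = rng.uniform(-1.0, 1.0, size=n) * sigma * np.sqrt(3.0)
    y = X @ beta + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)            # OLS: (X'X)^{-1} X'y

print(estimates.mean(axis=0))              # approximately beta (unbiasedness)
print(np.cov(estimates, rowvar=False))     # approximately sigma^2 (X'X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```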
History

Origins
The origins of the Gauss–Markov theorem trace back to the development of the method of least squares in the early 19th century, primarily in the context of astronomical observations and error minimization. Adrien-Marie Legendre first formally published the least squares method in 1805 as a practical technique for determining comet orbits by minimizing the sum of squared residuals when the number of equations exceeds the number of unknowns. This approach treated errors as deterministic deviations rather than random variables, providing an algorithmic solution without a probabilistic foundation.[8]

Carl Friedrich Gauss, who claimed to have conceived the method as early as 1795 at age 18, expanded it into a statistical framework starting in 1809. In Theoria Motus Corporum Coelestium, Gauss introduced a probabilistic justification, assuming that errors follow a normal distribution and deriving the least squares estimates as maximum likelihood estimators under this assumption. He refined these ideas in 1821 with the first part of Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, where he demonstrated, using finite-sample arguments based only on first and second moments, that the least squares estimator has minimum variance among all linear unbiased estimators, provided the errors have zero mean and constant variance (homoscedasticity), without requiring normality. The second part of the work appeared in 1823. These contributions established the core result of what would later be formalized as the Gauss–Markov theorem, emphasizing linearity, unbiasedness, and minimum variance.[8]

Andrey Markov independently rediscovered and generalized aspects of the theorem around 1900, focusing explicitly on the role of unbiasedness in linear estimation. In his 1912 textbook Wahrscheinlichkeitsrechnung, Markov gave a clear treatment of the theorem's assumptions and proved that, under linearity and uncorrelated errors with equal variance, the ordinary least squares estimator achieves minimum variance among linear unbiased alternatives. This work clarified and synthesized the unbiasedness condition that Gauss had assumed implicitly, bridging the gap between deterministic and stochastic interpretations of least squares. Markov's contributions, though building on earlier ideas, highlighted the theorem's applicability in probability theory and helped propagate it in the Russian and wider European statistical literature.[8][9]

Formalization and Naming
The Gauss–Markov theorem was initially formalized by Carl Friedrich Gauss in his seminal 1821 publication, the first part of Theoria combinationis observationum erroribus minimis obnoxiae, where he proved that the arithmetic mean of observations minimizes the expected squared error under the assumption of independent errors with zero mean and constant variance, establishing it as the best linear unbiased estimator (BLUE) in that context. Gauss's derivation used probabilistic arguments based on expectations and variances, without relying on the normal distribution (in contrast to his earlier 1809 justification under normality), and provided the first rigorous argument linking the method to variance minimization among linear unbiased estimators. This work built on his earlier unpublished development of least squares around 1795 and its initial application in astronomy. The second part followed in 1823.[10]

Andrey Andreyevich Markov independently rediscovered and generalized the result in 1900, presenting a proof that extended beyond normality by assuming only that the errors are uncorrelated with constant (but unknown) variance, thus establishing the BLUE property of ordinary least squares in the linear model without distributional assumptions on the errors. Markov's contribution, detailed in a lecture that year and later incorporated into his 1912 textbook Wahrscheinlichkeitsrechnung (a German translation of his Russian work on probability), emphasized the theorem's validity under weaker conditions, focusing on the covariance structure of the errors rather than their specific distribution. This generalization aligned the theorem more closely with modern statistical inference and influenced its application in regression analysis.[11]

The theorem's naming reflects these dual origins, with "Gauss–Markov" becoming the conventional appellation by the mid-20th century to honor both contributors, even though Gauss's foundational role predates Markov's by nearly a century. In the 1930s, Jerzy Neyman referred to the result as the "Markov theorem" in recognition of Markov's general proof, as noted in his 1934 discussion of least squares in sampling theory, and this variant persisted briefly in some of the econometric literature. The combined name was standard in statistical texts by the 1950s, and later historical reviews, such as Hilary Seal's 1967 analysis tracing the theorem's evolution from Gauss through intermediate developments by figures such as Friedrich Robert Helmert (1872) to Markov, documented this lineage and reinforced the usage.[10]

Statement
Scalar Case
In the scalar case, the Gauss–Markov theorem addresses the simple linear regression model, where the response variable Y for each observation i = 1, \dots, n is expressed as

Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
with \beta_0 and \beta_1 as unknown scalar parameters, x_i as fixed regressors, and \epsilon_i as error terms.[12] The theorem requires the following assumptions on the errors: the expected value of each error is zero, E(\epsilon_i) = 0; the errors have constant variance, \text{Var}(\epsilon_i) = \sigma^2 > 0; and the errors are uncorrelated across observations, \text{Cov}(\epsilon_i, \epsilon_j) = 0 for all i \neq j. Together with the linearity of the model in its parameters, these conditions require the errors to be mean-zero, homoskedastic, and free of correlation across observations.[12][11] Under these assumptions, the ordinary least squares (OLS) estimators of the parameters,
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2},
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x},
where \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i and \bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i, are the best linear unbiased estimators (BLUE). This means \hat{\beta}_0 and \hat{\beta}_1 are unbiased, E(\hat{\beta}_j) = \beta_j for j = 0, 1, and possess the minimum variance among all linear unbiased estimators of the form \sum_{i=1}^n a_i Y_i.[12][11] Specifically, the variance of the slope estimator is
\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2},
and the variance of the intercept estimator is
\text{Var}(\hat{\beta}_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right).
For any other linear unbiased estimator b_1 = \sum_{i=1}^n a_i Y_i of \beta_1, satisfying the unbiasedness conditions \sum_{i=1}^n a_i = 0 and \sum_{i=1}^n a_i x_i = 1, it holds that \text{Var}(b_1) \geq \text{Var}(\hat{\beta}_1), with equality only if a_i = k_i, where k_i = (x_i - \bar{x}) / \sum_{j=1}^n (x_j - \bar{x})^2 are the OLS weights satisfying \hat{\beta}_1 = \sum_{i=1}^n k_i Y_i. A similar result applies to estimators of \beta_0. This optimality underscores the efficiency of OLS in the scalar setting without requiring normality of the errors.[12][11]
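To illustrate this comparison, the following sketch (an illustration only; the regressor values, coefficients, and error distribution are arbitrary choices) simulates the scalar model and contrasts the OLS slope \hat{\beta}_1 = \sum_i k_i Y_i with another linear unbiased slope estimator, the two-point estimator (Y_n - Y_1)/(x_n - x_1), whose weights satisfy \sum_i a_i = 0 and \sum_i a_i x_i = 1.

```python
import numpy as np

# Scalar-case sketch: compare the OLS slope with another linear unbiased
# slope estimator; all numerical values below are arbitrary.
rng = np.random.default_rng(1)
n, sigma = 30, 1.5
x = np.linspace(0.0, 10.0, n)                      # fixed regressors
beta0, beta1 = 2.0, 0.7
k = (x - x.mean()) / np.sum((x - x.mean()) ** 2)   # OLS weights k_i

reps = 20000
ols_slope = np.empty(reps)
two_point = np.empty(reps)
for r in range(reps):
    eps = rng.uniform(-1.0, 1.0, size=n) * sigma * np.sqrt(3.0)  # non-normal, variance sigma^2
    y = beta0 + beta1 * x + eps
    ols_slope[r] = np.sum(k * y)                       # hat(beta)_1 = sum_i k_i Y_i
    two_point[r] = (y[-1] - y[0]) / (x[-1] - x[0])     # also linear and unbiased

print(ols_slope.mean(), two_point.mean())      # both approximately beta1 (unbiasedness)
print(ols_slope.var(), two_point.var())        # OLS variance is the smaller of the two
print(sigma**2 / np.sum((x - x.mean()) ** 2))  # theoretical Var(hat(beta)_1)
```

In this setup the two-point estimator is unbiased but discards the information in the interior observations, so its simulated variance exceeds both the simulated and the theoretical variance of the OLS slope, as the theorem guarantees.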
Matrix Case
The Gauss–Markov theorem in its matrix formulation applies to the classical linear regression model, expressed as

\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon},
where \mathbf{Y} is an n \times 1 vector of observed responses, \mathbf{X} is an n \times p design matrix with p < n, \boldsymbol{\beta} is a p \times 1 vector of unknown parameters, and \boldsymbol{\epsilon} is an n \times 1 vector of random errors.[13][14] This setup generalizes the scalar case to multiple regressors, allowing for the estimation of a parameter vector rather than a single coefficient. The theorem requires the following assumptions: (i) the model is linear in the parameters; (ii) the errors have zero mean, \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}; (iii) the errors are uncorrelated with constant variance, \text{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}_n, where \sigma^2 > 0 is unknown and \mathbf{I}_n is the n \times n identity matrix; and (iv) the design matrix \mathbf{X} has full column rank, so that \mathbf{X}^\top \mathbf{X} is invertible and \boldsymbol{\beta} is uniquely estimable.[13][14] Under these conditions, the ordinary least squares (OLS) estimator
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}
is unbiased, \mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}, and possesses the covariance matrix
\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}.[14]

The Gauss–Markov theorem states that \hat{\boldsymbol{\beta}} is the best linear unbiased estimator (BLUE) of \boldsymbol{\beta}, meaning it has the smallest variance-covariance matrix (in the sense of positive semi-definiteness) among all linear unbiased estimators of the form \tilde{\boldsymbol{\beta}} = \mathbf{C} \mathbf{Y} for some p \times n matrix \mathbf{C}, where unbiasedness requires \mathbf{C}\mathbf{X} = \mathbf{I}_p.[15][16] This optimality implies that for any other linear unbiased estimator \tilde{\boldsymbol{\beta}}, the difference \text{Cov}(\tilde{\boldsymbol{\beta}}) - \text{Cov}(\hat{\boldsymbol{\beta}}) is positive semi-definite, ensuring minimal estimation uncertainty within the linear unbiased class. The result, formalized in matrix terms by C.R. Rao, underscores the efficiency of OLS in homoscedastic linear models without requiring normality of errors.
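A short numerical sketch can make the positive semi-definiteness statement concrete. In the example below (illustrative only; the design matrix and weight matrix are arbitrary choices), the alternative linear unbiased estimator is a weighted least squares estimator \tilde{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{Y} with \mathbf{W} \neq \mathbf{I}, which satisfies \mathbf{C}\mathbf{X} = \mathbf{I}_p; under \text{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}_n its covariance is \sigma^2 \mathbf{C}\mathbf{C}^\top, and the eigenvalues of its difference from the OLS covariance are nonnegative.

```python
import numpy as np

# Matrix-case sketch: the covariance of an arbitrary linear unbiased estimator
# C @ Y (here weighted least squares with W != I, so that C @ X = I) dominates
# the OLS covariance in the positive semi-definite sense. Values are arbitrary.
rng = np.random.default_rng(2)
n, p, sigma2 = 50, 4, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # full column rank

cov_ols = sigma2 * np.linalg.inv(X.T @ X)                       # sigma^2 (X'X)^{-1}

W = np.diag(rng.uniform(0.5, 2.0, size=n))                      # arbitrary weights != I
C = np.linalg.solve(X.T @ W @ X, X.T @ W)                       # C X = I  =>  unbiased
cov_alt = sigma2 * C @ C.T                                      # Cov(C Y) when Var(eps) = sigma^2 I

print(np.linalg.eigvalsh(cov_alt - cov_ols))  # all eigenvalues >= 0 (up to rounding)
```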