Gauss–Markov theorem
The Gauss–Markov theorem is a cornerstone result in mathematical statistics which asserts that, under specified assumptions in a linear regression model, the ordinary least squares (OLS) estimator of the regression coefficients is the best linear unbiased estimator (BLUE), possessing the minimum variance among all linear unbiased estimators.[1] The theorem thus establishes the optimality of OLS within the class of linear unbiased estimators, making it a foundational justification for the use of OLS in estimating linear models.[2]

The theorem is named after Carl Friedrich Gauss and Andrey Markov, though its origins predate the formal naming. Gauss first proved the key result in his 1821 work Theoria combinationis observationum erroribus minimis obnoxiæ, demonstrating that least squares yields unbiased estimates with minimum variance for linear models of astronomical observations.[3][4] Markov independently rediscovered and generalized the theorem in his 1900 textbook on probability theory, extending it to broader statistical contexts.[3] By the 1930s the result was widely recognized in the statistical literature, initially often attributed solely to Markov before being credited jointly as the Gauss–Markov theorem.

Central to the theorem are four classical assumptions that ensure OLS achieves BLUE status: (1) the model is linear in the parameters, expressed as y = X\beta + \epsilon; (2) the errors \epsilon have zero conditional mean (strict exogeneity); (3) the errors are homoskedastic with constant variance \sigma^2 and uncorrelated (i.e., \text{Var}(\epsilon) = \sigma^2 I); and (4) the design matrix X has full column rank, which ensures identifiability.[1] The proof proceeds by showing that the OLS estimator \hat{\beta} = (X^T X)^{-1} X^T y is linear and unbiased, and that its covariance matrix \sigma^2 (X^T X)^{-1} is smaller, in the positive semi-definite sense, than that of any other linear unbiased estimator.[2] Notably, the theorem does not require normality of the errors for the BLUE property, though normality is often assumed for additional inference procedures.[5]

The Gauss–Markov theorem underpins much of modern regression analysis, particularly in econometrics, where it validates OLS as an efficient estimator for policy evaluation and forecasting under the classical linear model assumptions.[5] Violations of these assumptions, such as heteroskedasticity or autocorrelation, necessitate alternative estimators such as generalized least squares, but the theorem remains a benchmark for assessing estimator performance.[6] Its enduring influence highlights the balance between bias, variance, and linearity in statistical estimation.[7]
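The BLUE property can be checked numerically. The following sketch is a minimal illustration rather than part of the theorem itself: the design matrix, coefficients, sample size, and error distribution are arbitrary choices, and the errors are deliberately drawn from a non-normal distribution to underline that normality is not required. It simulates the model y = X\beta + \epsilon under the classical assumptions and compares the Monte Carlo mean and covariance of the OLS estimator with \beta and \sigma^2 (X^T X)^{-1}.

```python
import numpy as np

# Simulation sketch of the classical linear model y = X beta + eps with
# E[eps] = 0 and Var(eps) = sigma^2 I; all numerical values are arbitrary.
rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # full column rank
beta = np.array([1.0, -0.5, 2.0])

reps = 5000
estimates = np.empty((reps, p))
for r in range(reps):
    # Non-normal (uniform) errors with variance sigma^2: normality is not needed for BLUE.
    eps = rng.uniform(-1.0, 1.0, size=n) * sigma * np.sqrt(3.0)
    y = X @ beta + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)            # OLS: (X'X)^{-1} X'y

print(estimates.mean(axis=0))              # approximately beta (unbiasedness)
print(np.cov(estimates, rowvar=False))     # approximately sigma^2 (X'X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```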
History

Origins
The origins of the Gauss–Markov theorem trace back to the development of the method of least squares in the early 19th century, primarily in the context of astronomical observations and error minimization. Adrien-Marie Legendre first formally published the least squares method in 1805 as a practical technique for determining comet orbits by minimizing the sum of squared residuals when the number of equations exceeds the number of unknowns. This approach treated errors as deterministic deviations rather than random variables, providing an algorithmic solution without a probabilistic foundation.[8]

Carl Friedrich Gauss, who claimed to have conceived the method as early as 1795 at age 18, expanded it into a statistical framework starting in 1809. In Theoria Motus Corporum Coelestium, Gauss introduced a probabilistic justification, assuming that errors follow a normal distribution and deriving the least squares estimates as maximum likelihood estimators under this assumption. He refined these ideas in 1821 with the first part of Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, where he demonstrated, using finite-sample arguments based only on first and second moments, that the least squares estimator has minimum variance among all linear unbiased estimators, provided the errors have zero mean and constant variance (homoscedasticity), without requiring normality. The second part of the work appeared in 1823. These contributions established the core result of what would later be formalized as the Gauss–Markov theorem, emphasizing linearity, unbiasedness, and minimum variance.[8]

Andrey Markov independently rediscovered and generalized aspects of the theorem around 1900, focusing explicitly on the role of unbiasedness in linear estimation. In his 1912 textbook Wahrscheinlichkeitsrechnung, Markov gave a clear treatment of the theorem's assumptions and proved that, under linearity and uncorrelated errors with equal variance, the ordinary least squares estimator achieves minimum variance among linear unbiased alternatives. This work clarified and synthesized the unbiasedness condition that Gauss had assumed implicitly, bridging the gap between deterministic and stochastic interpretations of least squares. Markov's contributions, though building on earlier ideas, highlighted the theorem's applicability in probability theory and helped propagate it in the Russian and wider European statistical literature.[8][9]

Formalization and Naming
The Gauss–Markov theorem was initially formalized by Carl Friedrich Gauss in his seminal 1821 publication, the first part of Theoria combinationis observationum erroribus minimis obnoxiae, where he proved that the arithmetic mean of observations minimizes the expected squared error under the assumption of independent errors with zero mean and constant variance, establishing it as the best linear unbiased estimator (BLUE) in that context. Gauss's derivation used probabilistic arguments based on expectations and variances, without relying on the normal distribution (in contrast to his earlier 1809 justification under normality), and provided the first rigorous argument linking the method to variance minimization among linear unbiased estimators. This work built on his earlier unpublished development of least squares around 1795 and its initial application in astronomy. The second part followed in 1823.[10]

Andrey Andreyevich Markov independently rediscovered and generalized the result in 1900, presenting a proof that extended beyond normality by assuming only that the errors are uncorrelated with constant (but unknown) variance, thus establishing the BLUE property of ordinary least squares in the linear model without distributional assumptions on the errors. Markov's contribution, detailed in a lecture that year and later incorporated into his 1912 textbook Wahrscheinlichkeitsrechnung (a German translation of his Russian work on probability), emphasized the theorem's validity under weaker conditions, focusing on the covariance structure of the errors rather than their specific distribution. This generalization aligned the theorem more closely with modern statistical inference and influenced its application in regression analysis.[11]

The theorem's naming reflects these dual origins, with "Gauss–Markov" becoming the conventional appellation by the mid-20th century to honor both contributors, even though Gauss's foundational role predates Markov's by nearly a century. In the 1930s, Jerzy Neyman referred to the result as the "Markov theorem" in recognition of Markov's general proof, as noted in his 1934 discussion of least squares in sampling theory, and this variant persisted briefly in some of the econometric literature. The combined name was standard in statistical texts by the 1950s, and later historical reviews, such as Hilary Seal's 1967 analysis tracing the theorem's evolution from Gauss through intermediate developments by figures such as Friedrich Robert Helmert (1872) to Markov, documented this lineage and reinforced the usage.[10]

Statement
Scalar Case
In the scalar case, the Gauss–Markov theorem addresses the simple linear regression model, where the response variable Y for each observation i = 1, \dots, n is expressed as

Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
with \beta_0 and \beta_1 as unknown scalar parameters, x_i as fixed regressors, and \epsilon_i as error terms.[12] The theorem requires the following assumptions on the errors: the expected value of each error is zero, E(\epsilon_i) = 0; the errors have constant variance, \text{Var}(\epsilon_i) = \sigma^2 > 0; and the errors are uncorrelated across observations, \text{Cov}(\epsilon_i, \epsilon_j) = 0 for all i \neq j. Together with the linearity of the model in its parameters, these conditions require the errors to be mean-zero, homoskedastic, and free of correlation across observations.[12][11] Under these assumptions, the ordinary least squares (OLS) estimators of the parameters,
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2},
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x},
where \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i and \bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i, are the best linear unbiased estimators (BLUE). This means \hat{\beta}_0 and \hat{\beta}_1 are unbiased, E(\hat{\beta}_j) = \beta_j for j = 0, 1, and possess the minimum variance among all linear unbiased estimators of the form \sum_{i=1}^n a_i Y_i.[12][11] Specifically, the variance of the slope estimator is
\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2},
and the variance of the intercept estimator is
\text{Var}(\hat{\beta}_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right).
For any other linear unbiased estimator b_1 = \sum_{i=1}^n a_i Y_i of \beta_1, satisfying the unbiasedness conditions \sum_{i=1}^n a_i = 0 and \sum_{i=1}^n a_i x_i = 1, it holds that \text{Var}(b_1) \geq \text{Var}(\hat{\beta}_1), with equality only if a_i = k_i, where k_i = (x_i - \bar{x}) / \sum_{j=1}^n (x_j - \bar{x})^2 are the OLS weights satisfying \hat{\beta}_1 = \sum_{i=1}^n k_i Y_i. A similar result applies to estimators of \beta_0. This optimality underscores the efficiency of OLS in the scalar setting without requiring normality of the errors.[12][11]
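To illustrate this comparison, the following sketch (an illustration only; the regressor values, coefficients, and error distribution are arbitrary choices) simulates the scalar model and contrasts the OLS slope \hat{\beta}_1 = \sum_i k_i Y_i with another linear unbiased slope estimator, the two-point estimator (Y_n - Y_1)/(x_n - x_1), whose weights satisfy \sum_i a_i = 0 and \sum_i a_i x_i = 1.

```python
import numpy as np

# Scalar-case sketch: compare the OLS slope with another linear unbiased
# slope estimator; all numerical values below are arbitrary.
rng = np.random.default_rng(1)
n, sigma = 30, 1.5
x = np.linspace(0.0, 10.0, n)                      # fixed regressors
beta0, beta1 = 2.0, 0.7
k = (x - x.mean()) / np.sum((x - x.mean()) ** 2)   # OLS weights k_i

reps = 20000
ols_slope = np.empty(reps)
two_point = np.empty(reps)
for r in range(reps):
    eps = rng.uniform(-1.0, 1.0, size=n) * sigma * np.sqrt(3.0)  # non-normal, variance sigma^2
    y = beta0 + beta1 * x + eps
    ols_slope[r] = np.sum(k * y)                       # hat(beta)_1 = sum_i k_i Y_i
    two_point[r] = (y[-1] - y[0]) / (x[-1] - x[0])     # also linear and unbiased

print(ols_slope.mean(), two_point.mean())      # both approximately beta1 (unbiasedness)
print(ols_slope.var(), two_point.var())        # OLS variance is the smaller of the two
print(sigma**2 / np.sum((x - x.mean()) ** 2))  # theoretical Var(hat(beta)_1)
```

In this setup the two-point estimator is unbiased but discards the information in the interior observations, so its simulated variance exceeds both the simulated and the theoretical variance of the OLS slope, as the theorem guarantees.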
Matrix Case
The Gauss–Markov theorem in its matrix formulation applies to the classical linear regression model, expressed as

\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon},
where \mathbf{Y} is an n \times 1 vector of observed responses, \mathbf{X} is an n \times p design matrix with p < n, \boldsymbol{\beta} is a p \times 1 vector of unknown parameters, and \boldsymbol{\epsilon} is an n \times 1 vector of random errors.[13][14] This setup generalizes the scalar case to multiple regressors, allowing for the estimation of a parameter vector rather than a single coefficient. The theorem requires the following assumptions: (i) the model is linear in the parameters; (ii) the errors have zero mean, \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}; (iii) the errors are uncorrelated with constant variance, \text{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}_n, where \sigma^2 > 0 is unknown and \mathbf{I}_n is the n \times n identity matrix; and (iv) the design matrix \mathbf{X} has full column rank, so that \mathbf{X}^\top \mathbf{X} is invertible and \boldsymbol{\beta} is uniquely estimable.[13][14] Under these conditions, the ordinary least squares (OLS) estimator
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}
is unbiased, \mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}, and possesses the covariance matrix
\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}.[14]

The Gauss–Markov theorem states that \hat{\boldsymbol{\beta}} is the best linear unbiased estimator (BLUE) of \boldsymbol{\beta}, meaning it has the smallest variance-covariance matrix (in the sense of positive semi-definiteness) among all linear unbiased estimators of the form \tilde{\boldsymbol{\beta}} = \mathbf{C} \mathbf{Y} for some p \times n matrix \mathbf{C}, where unbiasedness requires \mathbf{C}\mathbf{X} = \mathbf{I}_p.[15][16] This optimality implies that for any other linear unbiased estimator \tilde{\boldsymbol{\beta}}, the difference \text{Cov}(\tilde{\boldsymbol{\beta}}) - \text{Cov}(\hat{\boldsymbol{\beta}}) is positive semi-definite, ensuring minimal estimation uncertainty within the linear unbiased class. The result, formalized in matrix terms by C.R. Rao, underscores the efficiency of OLS in homoscedastic linear models without requiring normality of errors.
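A short numerical sketch can make the positive semi-definiteness statement concrete. In the example below (illustrative only; the design matrix and weight matrix are arbitrary choices), the alternative linear unbiased estimator is a weighted least squares estimator \tilde{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{Y} with \mathbf{W} \neq \mathbf{I}, which satisfies \mathbf{C}\mathbf{X} = \mathbf{I}_p; under \text{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}_n its covariance is \sigma^2 \mathbf{C}\mathbf{C}^\top, and the eigenvalues of its difference from the OLS covariance are nonnegative.

```python
import numpy as np

# Matrix-case sketch: the covariance of an arbitrary linear unbiased estimator
# C @ Y (here weighted least squares with W != I, so that C @ X = I) dominates
# the OLS covariance in the positive semi-definite sense. Values are arbitrary.
rng = np.random.default_rng(2)
n, p, sigma2 = 50, 4, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # full column rank

cov_ols = sigma2 * np.linalg.inv(X.T @ X)                       # sigma^2 (X'X)^{-1}

W = np.diag(rng.uniform(0.5, 2.0, size=n))                      # arbitrary weights != I
C = np.linalg.solve(X.T @ W @ X, X.T @ W)                       # C X = I  =>  unbiased
cov_alt = sigma2 * C @ C.T                                      # Cov(C Y) when Var(eps) = sigma^2 I

print(np.linalg.eigvalsh(cov_alt - cov_ols))  # all eigenvalues >= 0 (up to rounding)
```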