
Gauss–Markov theorem

The Gauss–Markov theorem is a cornerstone result in statistics that asserts, under specified assumptions in a linear regression model, the ordinary least squares (OLS) estimator of the regression coefficients is the best linear unbiased estimator (BLUE), possessing the minimum variance among all linear unbiased estimators. This theorem establishes the optimality of OLS within the class of linear estimators that are unbiased, making it a foundational justification for its use in estimating linear models. The theorem derives its name from Carl Friedrich Gauss and Andrey Markov, though its origins predate the formal naming. Gauss first proved the key result in his 1821 work Theoria combinationis observationum erroribus minimis obnoxiæ, demonstrating that least squares yields unbiased estimates with minimum variance for linear models of astronomical observations. Markov independently rediscovered and generalized the theorem in his 1900 textbook on probability, extending it to broader statistical contexts. By the 1930s, the result was widely recognized in statistical literature, often initially attributed solely to Markov before being credited jointly as the Gauss–Markov theorem. Central to the theorem are four classical assumptions that ensure OLS achieves BLUE status: (1) the model is linear in the parameters, expressed as y = X\beta + \epsilon; (2) the errors \epsilon have zero conditional mean (strict exogeneity); (3) the errors are homoskedastic with constant variance \sigma^2 and uncorrelated (i.e., \text{Var}(\epsilon) = \sigma^2 I); and (4) the design matrix X has full column rank to ensure identifiability. The proof proceeds by showing that the OLS estimator \hat{\beta} = (X^T X)^{-1} X^T y is linear and unbiased, and that its covariance matrix \sigma^2 (X^T X)^{-1} is smaller in the positive semi-definite sense than that of any other linear unbiased estimator. Notably, the theorem does not require normality of the errors for the BLUE property, though normality is often assumed for additional inference procedures. The Gauss–Markov theorem underpins much of modern statistics, particularly econometrics, where it validates OLS as an efficient estimator for policy evaluation and forecasting under the classical assumptions. Violations of these assumptions, such as heteroskedasticity or autocorrelation, necessitate alternative estimators like generalized least squares, but the theorem remains a benchmark for assessing estimator performance. Its enduring influence highlights the balance between bias, variance, and linearity in statistical estimation.

History

Origins

The origins of the Gauss–Markov theorem trace back to the development of the method of least squares in the early 19th century, primarily in the context of astronomical observations and error minimization. Adrien-Marie Legendre first formally published the least squares method in 1805 as a practical technique for determining the orbits of comets by minimizing the sum of squared residuals when the number of equations exceeds the unknowns. This approach treated errors as deterministic deviations rather than random variables, providing an algorithmic solution without a probabilistic foundation. Carl Friedrich Gauss, who claimed to have conceived the method as early as 1795 at age 18, expanded it into a statistical framework starting in 1809. In his work Theoria Motus Corporum Coelestium, Gauss introduced a probabilistic justification, assuming errors follow a normal distribution and deriving the least squares estimates as maximum likelihood estimators under this assumption. He further refined these ideas in 1821 with the first part of Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, where he demonstrated, using finite-sample arguments based on first and second moments, that the least squares estimator has minimum variance among all linear unbiased estimators, provided the errors have zero mean and constant variance (homoscedasticity), without requiring normality. These contributions by Gauss established the core result of what would later be formalized as the Gauss–Markov theorem, emphasizing linearity, unbiasedness, and minimum variance. The second part of the work appeared in 1823. Andrey Markov independently rediscovered and generalized aspects of the theorem around 1900, focusing explicitly on the role of unbiasedness in linear estimation. In his 1912 textbook Wahrscheinlichkeitsrechnung, Markov provided a clear treatment of the theorem's assumptions and proved that, under linearity and uncorrelated errors with equal variance, the ordinary least squares estimator achieves minimum variance among linear unbiased alternatives. This work clarified and synthesized the unbiasedness condition that Gauss had implicitly assumed, bridging the gap between deterministic and probabilistic interpretations of least squares. Markov's contributions, though building on earlier ideas, highlighted the theorem's applicability in probability theory and helped propagate it in the probabilistic and statistical literature.

Formalization and Naming

The Gauss–Markov theorem was initially formalized by Carl Friedrich Gauss in his seminal 1821 publication, the first part of Theoria combinationis observationum erroribus minimis obnoxiae, where he proved that the least squares combination of observations minimizes the expected squared error under the assumption of independent errors with zero mean and constant variance, establishing it as the best linear unbiased estimator (BLUE) in that context. Gauss's derivation in this work used probabilistic arguments based on expectations and variances, without relying on the normal distribution (in contrast to his earlier 1809 justification under normality), and provided the first rigorous justification linking the method of least squares to variance minimization among linear unbiased estimators. This work built on his earlier unpublished development of least squares around 1795 and its initial application in astronomy. The second part followed in 1823. Andrey Andreyevich Markov independently rediscovered and generalized the result in 1900, presenting a proof that extended beyond Gauss's formulation by assuming only that errors are uncorrelated with constant (but unknown) variance, thus establishing the BLUE property for ordinary least squares in the linear model without distributional assumptions on the errors. Markov's contribution, detailed in his probability textbook of that year and later incorporated into his 1912 Wahrscheinlichkeitsrechnung (a German translation of his Russian work on probability), emphasized the theorem's validity under weaker conditions, focusing on the structure of the errors rather than their specific distribution. This generalization aligned the theorem more closely with modern statistical theory, influencing its application in econometrics. The theorem's naming reflects these dual origins, with "Gauss–Markov" becoming the conventional appellation by the mid-20th century to honor both contributors, though Gauss's foundational role predates Markov's by nearly a century. Earlier, in the 1930s, Jerzy Neyman referred to it as the "Markov theorem" in recognition of Markov's general proof, as noted in his 1934 discussion of sampling theory, and this variant persisted briefly in some econometric literature. The combined name gained prominence following historical reviews, such as Hilary Seal's analysis tracing the theorem's evolution from Gauss through intermediate developments by figures like Friedrich Robert Helmert to Markov, solidifying its standard usage in statistical texts by the 1950s.

Statement

Scalar Case

In the scalar case, the Gauss–Markov theorem addresses the simple linear regression model, where the response variable Y for each observation i = 1, \dots, n is expressed as
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
with \beta_0 and \beta_1 as unknown scalar parameters, x_i as fixed regressors, and \epsilon_i as error terms.
The theorem requires the following assumptions on the errors: the expected value of each error is zero, E(\epsilon_i) = 0; the errors have constant variance, \text{Var}(\epsilon_i) = \sigma^2 > 0; and the errors are uncorrelated across observations, \text{Cov}(\epsilon_i, \epsilon_j) = 0 for all i \neq j. These conditions ensure the model is linear in parameters, the errors are mean-zero and homoskedastic, and there is no serial correlation. Under these assumptions, the ordinary least squares (OLS) estimators of the parameters,
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2},
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x},
where \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i and \bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i, are the best linear unbiased estimators (BLUE). This means \hat{\beta}_0 and \hat{\beta}_1 are unbiased, E(\hat{\beta}_j) = \beta_j for j = 0, 1, and possess the minimum variance among all linear unbiased estimators of the form \sum_{i=1}^n a_i Y_i.
Specifically, the variance of the slope is
\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2},
and the variance of the intercept is
\text{Var}(\hat{\beta}_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right).
For any other linear unbiased estimator b_1 = \sum_{i=1}^n a_i Y_i of \beta_1, satisfying the unbiasedness conditions \sum_{i=1}^n a_i = 0 and \sum_{i=1}^n a_i x_i = 1, it holds that \text{Var}(b_1) \geq \text{Var}(\hat{\beta}_1), with equality only if a_i = k_i for the OLS weights k_i = (x_i - \bar{x}) / \sum_{j=1}^n (x_j - \bar{x})^2. A similar result applies to estimators of \beta_0. This optimality underscores the efficiency of OLS in the scalar setting without requiring normality of the errors.
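
As an illustration, the following minimal Python sketch (the values n = 30, \beta_0 = 1, \beta_1 = 2, \sigma = 1, the equally spaced fixed regressors, and the random seed are assumptions chosen for the demonstration, not specifics from the text) compares the OLS slope with another linear unbiased estimator of \beta_1, the endpoint estimator (Y_n - Y_1)/(x_n - x_1), whose weights also satisfy \sum a_i = 0 and \sum a_i x_i = 1. Both come out approximately unbiased, but the OLS slope has the smaller sampling variance, as the theorem predicts.

```python
import numpy as np

# Simulation sketch: compare the OLS slope estimator with the "endpoint"
# estimator (Y_n - Y_1)/(x_n - x_1), another linear unbiased estimator of beta_1.

rng = np.random.default_rng(0)
n, beta0, beta1, sigma = 30, 1.0, 2.0, 1.0
x = np.linspace(0.0, 10.0, n)                      # fixed regressors

ols_slopes, endpoint_slopes = [], []
for _ in range(20_000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
    xbar, ybar = x.mean(), y.mean()
    ols_slopes.append(((x - xbar) @ (y - ybar)) / ((x - xbar) @ (x - xbar)))
    endpoint_slopes.append((y[-1] - y[0]) / (x[-1] - x[0]))

print("OLS slope:      mean %.3f, variance %.5f" % (np.mean(ols_slopes), np.var(ols_slopes)))
print("endpoint slope: mean %.3f, variance %.5f" % (np.mean(endpoint_slopes), np.var(endpoint_slopes)))
# Theoretical OLS slope variance: sigma^2 / sum (x_i - xbar)^2
print("theoretical OLS slope variance: %.5f" % (sigma**2 / ((x - x.mean()) @ (x - x.mean()))))
```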

Matrix Case

The Gauss–Markov theorem in its matrix formulation applies to the classical linear regression model, expressed as
\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon},
where \mathbf{Y} is an n \times 1 vector of observed responses, \mathbf{X} is an n \times p full-rank design matrix with p < n, \boldsymbol{\beta} is a p \times 1 vector of unknown parameters, and \boldsymbol{\epsilon} is an n \times 1 vector of random errors. This setup generalizes the scalar case to multiple regressors, allowing for the estimation of a parameter vector rather than a single coefficient.
The theorem requires the following assumptions: (i) the model is linear in parameters; (ii) the errors have zero mean, \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}; (iii) the errors are uncorrelated with constant variance, \text{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}_n, where \sigma^2 > 0 is unknown and \mathbf{I}_n is the n \times n identity matrix; and (iv) the design matrix \mathbf{X} has full column rank to ensure uniqueness of the estimator. Under these conditions, the ordinary least squares (OLS) estimator
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}
is unbiased, \mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}, and possesses the covariance matrix
\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}.
The Gauss–Markov theorem states that \hat{\boldsymbol{\beta}} is the best linear unbiased estimator (BLUE) of \boldsymbol{\beta}, meaning it has the smallest covariance matrix (in the sense of positive semi-definiteness) among all linear unbiased estimators of the form \tilde{\boldsymbol{\beta}} = \mathbf{C} \mathbf{Y} for some p \times n matrix \mathbf{C}. This optimality implies that for any other linear unbiased estimator \tilde{\boldsymbol{\beta}}, the difference \text{Cov}(\tilde{\boldsymbol{\beta}}) - \text{Cov}(\hat{\boldsymbol{\beta}}) is positive semi-definite, ensuring minimal variance in the linear class. The result, formalized in matrix terms, underscores the efficiency of OLS in homoscedastic linear models without requiring normality of the errors.
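
The following short Python sketch (the choices n = 100, p = 3, \sigma = 0.5, the randomly drawn fixed design, and the seed are illustrative assumptions, not values from the text) simulates the model repeatedly and checks numerically that the OLS estimator is unbiased and that its Monte Carlo covariance approaches \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}.

```python
import numpy as np

# Monte Carlo check of unbiasedness and of Cov(beta_hat) = sigma^2 (X^T X)^{-1}.

rng = np.random.default_rng(1)
n, sigma = 100, 0.5
beta = np.array([1.0, -2.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # fixed design

XtX_inv = np.linalg.inv(X.T @ X)
estimates = []
for _ in range(5_000):
    y = X @ beta + rng.normal(0.0, sigma, n)       # homoskedastic, uncorrelated errors
    estimates.append(XtX_inv @ X.T @ y)            # OLS estimator
estimates = np.array(estimates)

print("mean of OLS estimates (close to beta):", estimates.mean(axis=0))
print("empirical covariance:\n", np.cov(estimates.T))
print("theoretical sigma^2 (X^T X)^{-1}:\n", sigma**2 * XtX_inv)
```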

Remarks

The Gauss–Markov theorem specifies that the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE) among all linear unbiased estimators of the regression coefficients, where "best" means possessing the covariance matrix that is minimal in the Loewner order (i.e., the difference between the covariance matrix of any other such estimator and that of OLS is positive semi-definite). This optimality holds for the entire vector of estimators simultaneously, implying that the variance of any linear combination of the coefficients is also minimized. A key clarification is that the theorem applies exclusively to unbiased linear estimators and does not claim superiority over biased or nonlinear unbiased estimators, which may achieve lower mean squared error in some cases. Under the theorem's assumptions, however, recent work has shown that the OLS estimator is actually the minimum variance unbiased estimator (MVUE) without restricting to linearity, as nonlinear unbiased estimators form a narrow class that coincides with linear ones in this setting. The theorem does not require normality of the errors; it holds solely on the basis of the first two moments (zero mean and constant variance with no correlation). Equality in the variance bound occurs if and only if the alternative estimator coincides with the OLS estimator. While the scalar case focuses on individual coefficient variances, the matrix formulation extends this to multivariate optimality, with the vector version implying the scalar results and vice versa through variance comparisons of arbitrary linear combinations. An unsatisfying aspect of the classical statement is its restriction to linear estimators, which motivates extensions such as generalized least squares for heteroskedastic errors, though these fall outside the theorem's direct scope. The theorem's finite-sample guarantees provide a foundational justification for OLS in econometric and statistical applications, emphasizing its efficiency under ideal conditions without invoking asymptotic approximations.

Assumptions

Core Assumptions

The Gauss–Markov theorem applies to the classical linear regression model, where the ordinary least squares (OLS) estimator is proven to be the best linear unbiased estimator (BLUE) under a set of core assumptions. These assumptions ensure that the OLS estimator has the minimum variance among all linear unbiased estimators. They focus on the structural form of the model and the properties of the error term, without requiring normality of the errors. The model is specified as linear in the parameters: y = X\beta + \epsilon, where y is an n \times 1 vector of observations on the dependent variable, X is an n \times k matrix of regressors (including a column of ones for the intercept), \beta is a k \times 1 vector of unknown parameters, and \epsilon is an n \times 1 vector of error terms. This linearity assumption allows the parameters to be estimated via a linear combination of the observations. A key assumption is strict exogeneity, or zero conditional mean of the errors: E(\epsilon \mid X) = 0. This implies that the regressors are uncorrelated with the error term, ensuring the unbiasedness of the OLS estimator. Without this, systematic biases could arise from omitted variables or endogeneity. The error terms must also satisfy homoskedasticity and no serial correlation: \text{Var}(\epsilon \mid X) = \sigma^2 I_n, where \sigma^2 > 0 is a constant and I_n is the n \times n identity matrix. This means the errors have constant variance and are uncorrelated across observations, which is crucial for the efficiency property of OLS. Violations, such as heteroskedasticity, would make OLS inefficient but still unbiased. Finally, the regressor matrix X must have full column rank, ensuring no perfect multicollinearity among the explanatory variables. This guarantees that X^T X is invertible, allowing unique determination of the OLS estimator. Perfect collinearity would prevent identification of individual parameters.
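
A minimal sketch of the rank condition follows (the data, dimensions, and seed are made-up illustrative assumptions): when one regressor is an exact linear combination of the intercept and another regressor, X loses full column rank, X^T X becomes numerically singular, and the OLS coefficients are no longer uniquely determined.

```python
import numpy as np

# Illustration of the full-column-rank (no perfect collinearity) assumption.

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = 3.0 * x1 - 1.0                                   # exact linear combination of x1 and the intercept
X_collinear = np.column_stack([np.ones(n), x1, x2])
X_full_rank = np.column_stack([np.ones(n), x1, rng.normal(size=n)])

print("rank (collinear design):", np.linalg.matrix_rank(X_collinear), "of", X_collinear.shape[1])
print("rank (full-rank design):", np.linalg.matrix_rank(X_full_rank), "of", X_full_rank.shape[1])

# X^T X is effectively singular in the collinear case: its condition number blows up.
print("cond(X^T X), collinear:", np.linalg.cond(X_collinear.T @ X_collinear))
print("cond(X^T X), full rank:", np.linalg.cond(X_full_rank.T @ X_full_rank))
```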

Econometric-Specific Assumptions

In econometrics, the Gauss–Markov theorem is invoked within the framework of the classical linear regression model, where assumptions are formulated to accommodate random regressors and data structures common in economic datasets, such as cross-sections or panels. These assumptions extend the core statistical conditions by emphasizing conditional moments and random sampling, ensuring that the ordinary least squares (OLS) estimator remains the best linear unbiased estimator (BLUE) even when explanatory variables are stochastic. A key econometric-specific assumption is random sampling, which posits that the observations are independently and identically distributed (i.i.d.) draws from the underlying population. This ensures that the sample reflects the population without systematic biases from non-random selection, a critical consideration in empirical economic studies where data often arise from surveys or administrative records. Without this, the finite-sample properties of OLS may not hold reliably. Another essential assumption is the absence of perfect collinearity among the regressors, meaning the explanatory variables X are not linearly dependent, so the design matrix has full column rank. In econometric applications, this prevents indeterminate parameter estimates and is particularly relevant when including multiple economic indicators, such as GDP components or related macroeconomic variables, that may exhibit high but not perfect correlations. Violation leads to unstable variance-covariance matrices for the OLS estimator. The strict exogeneity assumption, E(u_i \mid X_i) = 0 for each i, is central to econometric formulations, as it accounts for potentially endogenous regressors by conditioning on the full set of explanatory variables. This differs from unconditional exogeneity in fixed-regressor settings and is vital for valid inference in observational data, where omitted variables or simultaneity might otherwise bias the results. It guarantees the unbiasedness of OLS conditionally on the observed X. Finally, homoskedasticity conditional on the regressors, \mathrm{Var}(u_i \mid X_i) = \sigma^2 for all i, along with no serial correlation (\mathrm{Cov}(u_i, u_j \mid X_i, X_j) = 0 for i \neq j), ensures the error covariance matrix is spherical (\sigma^2 I_n). In econometric contexts, this assumption is tested rigorously due to common issues like heteroskedasticity in cross-sectional data or autocorrelation in time series, and its validity underpins the minimum variance property of OLS among linear unbiased estimators.

Proof

Outline of the Proof

The proof of the Gauss–Markov theorem establishes that, under the specified assumptions, the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE) of the regression coefficients in the linear model Y = X\beta + \epsilon, where Y is an n \times 1 vector of observations, X is an n \times p full-rank design matrix, \beta is a p \times 1 parameter vector, and \epsilon is an n \times 1 error vector with E(\epsilon) = 0 and \mathrm{Cov}(\epsilon) = \sigma^2 I_n. The OLS estimator is given by \hat{\beta} = (X^\top X)^{-1} X^\top Y. To outline the proof, begin by confirming the unbiasedness of the OLS estimator. Substituting the model into the estimator yields E(\hat{\beta}) = (X^\top X)^{-1} X^\top E(Y) = (X^\top X)^{-1} X^\top X \beta = \beta, which holds for all \beta under the linearity and zero-mean error assumptions. Next, consider an arbitrary linear unbiased estimator of \beta, denoted \tilde{\beta} = C Y for some p \times n matrix C. Unbiasedness requires E(\tilde{\beta}) = C X \beta = \beta for all \beta, implying the constraint C X = I_p. The variance-covariance matrix of \tilde{\beta} is then \mathrm{Var}(\tilde{\beta}) = C \cdot \mathrm{Cov}(Y) \cdot C^\top = \sigma^2 C C^\top, since \mathrm{Cov}(Y) = \mathrm{Cov}(\epsilon) = \sigma^2 I_n. Let A = (X^\top X)^{-1} X^\top, so \hat{\beta} = A Y. Decompose C = A + D, where D X = 0 (i.e., the rows of D are orthogonal to the column space of X). Substituting gives \mathrm{Var}(\tilde{\beta}) = \sigma^2 (A + D)(A + D)^\top = \sigma^2 A A^\top + \sigma^2 D D^\top + \sigma^2 (A D^\top + D A^\top). The cross terms vanish because X^\top D^\top = (D X)^\top = 0 implies A D^\top = (X^\top X)^{-1} (X^\top D^\top) = 0 and D A^\top = 0 similarly, yielding \mathrm{Var}(\tilde{\beta}) = \sigma^2 A A^\top + \sigma^2 D D^\top. Note that \mathrm{Var}(\hat{\beta}) = \sigma^2 A A^\top = \sigma^2 (X^\top X)^{-1}. The matrix D D^\top is positive semi-definite, so \sigma^2 D D^\top \succeq 0, which implies \mathrm{Var}(\tilde{\beta}) \succeq \mathrm{Var}(\hat{\beta}) in the sense of positive semi-definiteness (i.e., a^\top [\mathrm{Var}(\tilde{\beta}) - \mathrm{Var}(\hat{\beta})] a \geq 0 for every vector a). Equality holds if and only if D = 0, meaning \tilde{\beta} = \hat{\beta}. Thus, no other linear unbiased estimator can have a smaller variance-covariance matrix than the OLS estimator. For the scalar case (simple linear regression), the proof follows analogously by restricting to a one-dimensional \beta and using summation constraints on the coefficients, leading to a variance decomposition in which the extra terms are non-negative.
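
The central algebraic step can be checked numerically. In the sketch below (the dimensions, seed, and randomly chosen perturbation matrix M are illustrative assumptions, not part of the proof), an alternative linear unbiased estimator is built as C = A + D with D X = 0, and the resulting covariance gap \sigma^2 D D^\top is confirmed to be positive semi-definite.

```python
import numpy as np

# Numerical check of the decomposition C = A + D with D X = 0 and of the
# positive semi-definite covariance gap sigma^2 D D^T.

rng = np.random.default_rng(3)
n, p, sigma2 = 40, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

A = np.linalg.inv(X.T @ X) @ X.T                  # OLS map: beta_hat = A y
H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat (projection) matrix
M = rng.normal(size=(p, n))
D = M @ (np.eye(n) - H)                           # rows orthogonal to col(X), so D X = 0
C = A + D                                         # another linear unbiased estimator

print("D X  ~ 0:", np.allclose(D @ X, 0))
print("C X  = I:", np.allclose(C @ X, np.eye(p)))         # unbiasedness constraint

gap = sigma2 * (C @ C.T - A @ A.T)                        # Cov(tilde) - Cov(hat)
print("gap equals sigma^2 D D^T:", np.allclose(gap, sigma2 * D @ D.T))
print("smallest eigenvalue of gap (non-negative up to rounding):", np.linalg.eigvalsh(gap).min())
```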

Detailed Derivation

Consider the classical linear regression model given by \mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}, where \mathbf{y} is an n \times 1 vector of observations, X is an n \times p design matrix of full column rank p (with n > p), \boldsymbol{\beta} is a p \times 1 vector of unknown parameters, and \boldsymbol{\epsilon} is an n \times 1 error vector satisfying the assumptions E[\boldsymbol{\epsilon}] = \mathbf{0} and \text{Cov}(\boldsymbol{\epsilon}) = \sigma^2 I_n for some \sigma^2 > 0. The ordinary least squares (OLS) estimator is \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}. This estimator is linear in \mathbf{y}. To verify unbiasedness, substitute the model into the estimator: E[\hat{\boldsymbol{\beta}}] = (X^T X)^{-1} X^T E[\mathbf{y}] = (X^T X)^{-1} X^T X \boldsymbol{\beta} = \boldsymbol{\beta}, using the zero mean of the errors. The covariance matrix of the OLS estimator is then \text{Cov}(\hat{\boldsymbol{\beta}}) = (X^T X)^{-1} X^T \text{Cov}(\mathbf{y}) X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}, since \text{Cov}(\mathbf{y}) = \text{Cov}(\boldsymbol{\epsilon}) = \sigma^2 I_n. Now consider a general linear estimator of the form \tilde{\boldsymbol{\beta}} = C \mathbf{y}, where C is a p \times n matrix. For unbiasedness, E[\tilde{\boldsymbol{\beta}}] = C X \boldsymbol{\beta} = \boldsymbol{\beta} must hold for all \boldsymbol{\beta}, implying C X = I_p. The covariance matrix of this estimator is \text{Cov}(\tilde{\boldsymbol{\beta}}) = \sigma^2 C C^T. To show that the OLS estimator is best (minimum variance) among such estimators, compare the covariance matrices. Define D = C - (X^T X)^{-1} X^T, so C = D + (X^T X)^{-1} X^T. The unbiasedness condition C X = I_p yields D X = \mathbf{0}. Then, C C^T = [D + (X^T X)^{-1} X^T] [D + (X^T X)^{-1} X^T]^T = D D^T + (X^T X)^{-1} + (X^T X)^{-1} X^T D^T + D X (X^T X)^{-1}. The cross terms vanish because D X = \mathbf{0} implies X^T D^T = (D X)^T = \mathbf{0}, so C C^T = D D^T + (X^T X)^{-1}. Thus, \text{Cov}(\tilde{\boldsymbol{\beta}}) - \text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 [C C^T - (X^T X)^{-1}] = \sigma^2 D D^T, which is positive semi-definite (as \sigma^2 > 0 and D D^T has non-negative eigenvalues). Equality holds if and only if D = \mathbf{0}, i.e., \tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}. This establishes that the OLS estimator has the smallest covariance matrix (in the positive semi-definite order) among all linear unbiased estimators. An alternative derivation uses the projection interpretation. The OLS estimator projects \mathbf{y} onto the column space of X via the hat matrix H = X (X^T X)^{-1} X^T. For estimating a linear combination c^T \boldsymbol{\beta} (with c a p \times 1 vector), the OLS estimate is \hat{\gamma} = c^T \hat{\boldsymbol{\beta}} = k^T \mathbf{y}, where k^T = c^T (X^T X)^{-1} X^T. Any other unbiased linear estimator \tilde{\gamma} = d^T \mathbf{y} satisfies d^T X = c^T. The variance of \tilde{\gamma} is \sigma^2 \|d\|^2, while that of \hat{\gamma} is \sigma^2 \|k\|^2. Decompose d = H d + (I - H) d; since H is the orthogonal projection onto the column space of X and d^T X = c^T implies H d = k, orthogonality gives \|d\|^2 = \|H d\|^2 + \|(I - H) d\|^2 \geq \|H d\|^2 = \|k\|^2, with equality if and only if (I - H) d = 0. Thus, the OLS estimate minimizes the variance for any such linear combination of the parameters.
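
The projection argument for a single contrast can likewise be illustrated numerically. The following sketch (the design matrix, contrast vector c, and seed are illustrative assumptions) builds a competing unbiased weight vector d, verifies that its projection onto the column space of X equals the OLS weight vector k, and confirms \|d\|^2 \geq \|k\|^2.

```python
import numpy as np

# Projection view of the single-contrast version: Var(d^T y) >= Var(k^T y)
# because k = H d is the projection of any unbiased weight vector d onto col(X).

rng = np.random.default_rng(4)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
c = np.array([0.0, 1.0, -1.0])                    # contrast of interest, c^T beta

H = X @ np.linalg.inv(X.T @ X) @ X.T              # orthogonal projection onto col(X)
k = X @ np.linalg.inv(X.T @ X) @ c                # OLS weights: c^T beta_hat = k^T y

w = rng.normal(size=n)
d = k + (np.eye(n) - H) @ w                       # another unbiased weight vector (X^T d = c)

print("X^T d = c:", np.allclose(X.T @ d, c))
print("H d = k  :", np.allclose(H @ d, k))
print("||k||^2 = %.4f  <=  ||d||^2 = %.4f" % (k @ k, d @ d))
```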

Remarks on the Proof

The proof of the Gauss–Markov theorem relies on demonstrating that the covariance matrix of the ordinary least squares (OLS) estimator is minimal in the positive semi-definite sense among all linear unbiased estimators. Specifically, for any alternative linear unbiased estimator \tilde{\beta} = C y satisfying C X = I, the difference \operatorname{Var}(\tilde{\beta}) - \operatorname{Var}(\hat{\beta}_{\text{OLS}}) = \sigma^2 (C - (X^T X)^{-1} X^T) (C - (X^T X)^{-1} X^T)^T is positive semi-definite, ensuring that the variances satisfy \operatorname{Var}(\tilde{\beta}_j) \geq \operatorname{Var}(\hat{\beta}_{j,\text{OLS}}) for each component j, with equality for all components only if \tilde{\beta} = \hat{\beta}_{\text{OLS}}. In the scalar case, the proof simplifies to a direct decomposition of the estimator weights: for an unbiased linear estimator \hat{\beta}_1 = \sum c_i y_i with constraints \sum c_i = 0 and \sum c_i x_i = 1, one sets c_i = k_i + d_i, where k_i are the OLS weights and d_i are deviations satisfying \sum d_i = 0 and \sum d_i x_i = 0; the variance then decomposes as \sigma^2 \sum c_i^2 = \sigma^2 \sum k_i^2 + \sigma^2 \sum d_i^2, minimized uniquely when d_i = 0 for all i. This approach highlights the geometric intuition that OLS projects orthogonally onto the column space of X, minimizing the residual variance without bias. A key insight is that the theorem's validity hinges on the spherical error covariance \sigma^2 I, as violations (e.g., heteroskedasticity or autocorrelation) render OLS inefficient, though still unbiased; in such cases, the generalized least squares estimator recovers BLUE status by pre-whitening the model. The proof does not require error normality, which is only needed for exact finite-sample inference such as t-tests. Additionally, full column rank of X is essential; for rank-deficient designs, the theorem extends to estimable linear combinations l^T \beta via generalized inverses. The proof's elegance lies in its reliance on basic linear algebra (projection matrices and quadratic forms) without invoking probability distributions beyond second moments, underscoring the theorem's robustness in classical linear models.
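
The scalar decomposition described above can be verified directly. In the sketch below (using the illustrative endpoint estimator of the slope as the competing linear unbiased estimator; the data are made up), the deviations d_i = c_i - k_i satisfy \sum d_i = 0 and \sum d_i x_i = 0, so the cross term \sum k_i d_i vanishes and the variances add.

```python
import numpy as np

# Scalar-case weight decomposition c_i = k_i + d_i and additivity of variances.

n = 20
x = np.linspace(0.0, 5.0, n)
Sxx = ((x - x.mean()) ** 2).sum()
k = (x - x.mean()) / Sxx                          # OLS weights for the slope

c = np.zeros(n)                                   # endpoint-estimator weights
c[0], c[-1] = -1.0 / (x[-1] - x[0]), 1.0 / (x[-1] - x[0])

d = c - k
print("sum d_i              :", d.sum())          # ~ 0
print("sum d_i x_i          :", (d * x).sum())    # ~ 0
print("sum k_i d_i          :", (k * d).sum())    # ~ 0, so the variances add
print("sum c^2 vs sum k^2 + sum d^2:", (c**2).sum(), (k**2).sum() + (d**2).sum())
```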

Generalizations

Generalized Least Squares Estimator

In the generalized linear regression model y = X\beta + \varepsilon, where \mathbb{E}[\varepsilon] = 0 and \mathrm{Var}(\varepsilon) = \sigma^2 \Omega with \Omega a known positive definite matrix, the ordinary least squares estimator is no longer the best linear unbiased estimator (BLUE). Instead, the generalized least squares (GLS) estimator addresses the non-spherical error covariance by weighting observations according to the inverse of \Omega. The GLS estimator is given by \hat{\beta}_{\mathrm{GLS}} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y, assuming X^\top \Omega^{-1} X is invertible, which holds under the standard full column rank condition on X. This estimator is unbiased, as \mathbb{E}[\hat{\beta}_{\mathrm{GLS}}] = \beta, and its covariance matrix is \mathrm{Var}(\hat{\beta}_{\mathrm{GLS}}) = \sigma^2 (X^\top \Omega^{-1} X)^{-1}, which is the minimal possible among all linear unbiased estimators. Aitken's theorem establishes that \hat{\beta}_{\mathrm{GLS}} is the BLUE in this setting, generalizing the classical Gauss–Markov result to arbitrary known error covariances. To see why GLS achieves this efficiency, consider a linear transformation of the model: let \Omega = V V^\top for some invertible V, and premultiply the equation by V^{-1} to obtain V^{-1} y = V^{-1} X \beta + V^{-1} \varepsilon. The transformed errors satisfy \mathbb{E}[V^{-1} \varepsilon] = 0 and \mathrm{Var}(V^{-1} \varepsilon) = \sigma^2 I, reducing the problem to the standard Gauss–Markov assumptions. Applying ordinary least squares to this transformed model yields \hat{\beta}_{\mathrm{GLS}}, which is BLUE by the classical theorem. This transformation preserves unbiasedness and minimizes variance without requiring normality of the errors. The GLS estimator minimizes the weighted sum of squared residuals (y - X\beta)^\top \Omega^{-1} (y - X\beta), providing a quadratic form that accounts for the error structure and ensures a unique global minimum due to the positive definiteness of \Omega^{-1}. In practice, if \Omega is unknown, feasible GLS uses a consistent estimate, but the theorem applies directly only when \Omega is known.
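
As a numerical illustration of Aitken's theorem (the diagonal \Omega, sample size, coefficient values, and seed below are assumptions chosen for the demonstration), the following sketch simulates a heteroskedastic model with known \Omega and compares the sampling variance of the OLS and GLS slope estimates; both are unbiased, but GLS attains the smaller variance \sigma^2 (X^\top \Omega^{-1} X)^{-1}.

```python
import numpy as np

# OLS vs GLS under a known diagonal Omega (heteroskedastic, uncorrelated errors).

rng = np.random.default_rng(5)
n, sigma = 200, 1.0
beta = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0.2, 5.0, n)                      # known variance weights: Omega = diag(w)
Omega_inv = np.diag(1.0 / w)

ols, gls = [], []
for _ in range(3_000):
    eps = rng.normal(0.0, sigma * np.sqrt(w))     # heteroskedastic errors
    y = X @ beta + eps
    ols.append(np.linalg.solve(X.T @ X, X.T @ y))
    gls.append(np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y))

ols, gls = np.array(ols), np.array(gls)
print("OLS slope: mean %.3f, variance %.5f" % (ols[:, 1].mean(), ols[:, 1].var()))
print("GLS slope: mean %.3f, variance %.5f" % (gls[:, 1].mean(), gls[:, 1].var()))
print("theoretical GLS covariance:\n", sigma**2 * np.linalg.inv(X.T @ Omega_inv @ X))
```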

Modern Extensions

One prominent modern extension of the Gauss–Markov theorem relaxes the classical restriction to linear unbiased estimators, demonstrating that the ordinary least squares (OLS) estimator achieves minimum variance among all unbiased estimators, linear or nonlinear, under the standard assumptions of linearity, exogeneity, and homoskedasticity. This restatement, proposed by Hansen, establishes finite-sample efficiency bounds for estimating linear regression coefficients without distributional assumptions beyond finite variance, showing that the variance of any unbiased estimator is at least as large as that of OLS, specifically \sigma^2 (X^\top X)^{-1}. The implications extend to generalized least squares (GLS) in heteroskedastic cases, affirming its status as the minimum variance unbiased estimator (MVUE) without the linearity qualifier. Another key development incorporates bounded bias into the framework, extending the theorem to scenarios where estimators may introduce controlled bias to reduce overall mean squared error, particularly relevant in regularized regression. In this formulation, the bias operator B = LX - I is constrained by a norm bound \|B\|_p \leq C for p \geq 1, leading to optimal estimators that generalize ridge-type shrinkage (for p=2) and other shrinkage methods. For instance, under nuclear norm constraints (p=1), the optimal estimator thresholds small eigenvalues of X^\top X, outperforming OLS in simulations with cross-validation. This extension maintains the core Gauss–Markov setup but allows non-zero bias, yielding explicit forms like \hat{G} = U \operatorname{diag}(\max(\sigma, \alpha)) U^\top for the spectral adjustment. In high-dimensional settings where the number of parameters p exceeds the sample size n (i.e., p/n \to \gamma > 1), the classical theorem fails due to the singularity of X^\top X, prompting extensions via ridgeless least squares, which solves the minimum \ell_2-norm interpolation problem. These approaches reveal "double descent" phenomena, where prediction risk decreases again after the interpolation threshold, with asymptotic bias and variance characterized using random matrix theory and the Marchenko–Pastur law: bias \sim r^2 \lambda^2 s'(-\lambda) and variance \sigma^2 \gamma \int \frac{x}{(x+\lambda)^2} \, d\mu_{MP}(x). Under assumptions of i.i.d. features with covariance \Sigma and sub-Gaussian noise, the min-norm solution achieves consistent estimation in overparameterized regimes, bridging linear models to modern machine learning practice. This has implications for applications such as neural networks, where overparameterization can mitigate the curse of dimensionality. Further extensions address partial regularization in high dimensions, adapting the theorem for hybrid models that regularize only a subset of parameters to handle high-dimensional nuisance components in average treatment effect estimation or spiked covariance structures. Here, variance estimators such as leave-one-out cross-validation provide conservative bounds under the Gauss–Markov model, with simulations showing low bias in geometric or spiked covariance models. These developments, building on ridgeless regression, enable estimation when full-rank assumptions break down, prioritizing predictive accuracy over unbiasedness.
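
A minimal sketch of the overparameterized regime discussed above follows (the dimensions, noise level, signal scaling, and seed are illustrative assumptions, not values from the cited work): with p > n the matrix X^\top X is singular, so the classical OLS formula is unavailable, and the minimum \ell_2-norm interpolator can be computed via the Moore–Penrose pseudoinverse.

```python
import numpy as np

# Ridgeless / minimum-l2-norm least squares when p > n.

rng = np.random.default_rng(6)
n, p = 50, 200                                    # more parameters than observations
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p) / np.sqrt(p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# X^T X is rank-deficient (rank n < p), so (X^T X)^{-1} does not exist.
print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X), "of", p)

beta_min_norm = np.linalg.pinv(X) @ y             # min-norm solution among all interpolators
print("interpolates the training data:", np.allclose(X @ beta_min_norm, y))
print("l2 norm of min-norm solution:", np.linalg.norm(beta_min_norm))
```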
