Weighted least squares (WLS) is a generalization of ordinary least squares (OLS) regression that assigns different weights to observations based on their estimated variances, allowing for the analysis of data with heteroscedastic errors, where the variance of the residuals is not constant across observations.[1] The method minimizes the weighted sum of squared residuals, \sum w_i (y_i - \hat{y}_i)^2, where the weights w_i are typically set to the inverse of the error variance for each data point, w_i = 1/\sigma_i^2, thereby giving greater influence to more precise observations and reducing the impact of those with higher variability.[1]

In contrast to OLS, which assumes homoscedasticity and treats all observations equally, WLS addresses violations of this assumption by incorporating a diagonal weight matrix W into the estimation process, yielding the parameter estimator \hat{\beta}_{WLS} = (X^T W X)^{-1} X^T W y.[1] Under the conditions of the Gauss-Markov theorem as generalized by Aitken—linearity in the parameters, errors with zero mean, and a known error covariance structure—the WLS estimator is the best linear unbiased estimator (BLUE), achieving minimum variance among all linear unbiased estimators.[2]

WLS is particularly valuable in applications such as econometrics, where error variances may vary with the scale of the predictors, and in biostatistics, for analyzing grouped or clustered data with differing sample sizes.[1] When the true variances are unknown, they can be estimated from residuals or auxiliary models, and the approach extends to feasible generalized least squares for correlated errors.[3] This flexibility makes WLS a practical tool for improving the efficiency and reliability of regression estimates in real-world datasets exhibiting non-constant error structures.
Fundamentals
Definition and Relation to Ordinary Least Squares
Weighted least squares (WLS) is a generalization of ordinary least squares (OLS) estimation in linear regression, designed to minimize a weighted sum of squared residuals when observations have unequal variances or differing levels of importance.[1] In this approach, each data point is assigned a positive weight w_i, typically proportional to the inverse of its error variance, to give greater influence to more reliable observations.[4]

The core distinction from OLS lies in the objective function: OLS minimizes the unweighted sum \sum_{i=1}^n (y_i - \hat{y}_i)^2, assuming homoscedastic errors with equal weight 1 for all residuals, whereas WLS minimizes \sum_{i=1}^n w_i (y_i - \hat{y}_i)^2 to address heteroscedasticity or unequal precision.[1] This weighting ensures that the estimator accounts for varying reliability across data points, leading to more efficient parameter estimates under non-constant variance conditions.[5]

WLS operates within the framework of the linear model y = X\beta + \epsilon, where y is the n \times 1 response vector, X is the n \times p design matrix, \beta is the p \times 1 parameter vector, and \epsilon is the error vector with mean zero and diagonal covariance matrix \Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2), reflecting independent but heteroscedastic errors.[1]

The method has roots in 19th-century astronomy, where Carl Friedrich Gauss introduced weighted least squares in 1821–1823 to refine estimates from observations of varying precision in orbit determination and geodesy.[6] It was later formalized in statistical contexts during the early 20th century, notably by William G. Cochran in 1937, who addressed weighting for combining results from similar experiments with heterogeneous variances.[7]
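The contrast between the two objective functions can be made concrete with a short numerical sketch. The example below uses simulated heteroscedastic data (all variable names and values are illustrative assumptions, not from any referenced dataset) and evaluates the weighted sum of squared residuals; setting every weight to 1 recovers the OLS criterion.

```python
# Minimal sketch on hypothetical simulated data: the WLS objective reduces to
# the OLS objective when every weight equals 1.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
sigma = 0.5 * x                               # error SD grows with x (heteroscedastic)
y = 2.0 + 0.3 * x + rng.normal(0, sigma)      # y = X beta + eps

def weighted_sse(beta, X, y, w):
    """Weighted sum of squared residuals, sum_i w_i (y_i - x_i' beta)^2."""
    r = y - X @ beta
    return np.sum(w * r**2)

beta_try = np.array([2.0, 0.3])
print(weighted_sse(beta_try, X, y, np.ones(n)))      # OLS objective (all weights 1)
print(weighted_sse(beta_try, X, y, 1.0 / sigma**2))  # WLS objective, w_i = 1/sigma_i^2
```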
Assumptions and Prerequisites
To apply weighted least squares (WLS) effectively, readers should possess foundational knowledge of linear regression models, including matrix notation such as the design matrix \mathbf{X} (an n \times p matrix of predictors) and the response vector \mathbf{y} (an n \times 1 vector of observations).[8] This prerequisite enables understanding of how WLS extends ordinary least squares (OLS) by incorporating a weight matrix into the estimation process.[1]

The core assumptions of WLS mirror those of OLS in several respects but adapt the error variance structure to account for heteroscedasticity. Specifically, the model assumes linearity in the parameters, meaning the conditional expectation E(\mathbf{y} | \mathbf{X}) = \mathbf{X} \boldsymbol{\beta}, where \boldsymbol{\beta} is the parameter vector.[8] It also requires no perfect multicollinearity among the columns of \mathbf{X}, ensuring the weighted design matrix is invertible and the estimates are uniquely defined.[9] The errors are assumed to be independent across observations, with E(\boldsymbol{\epsilon}) = \mathbf{0}, though extensions such as generalized least squares can accommodate specified correlation structures.[1] Crucially, the variance of the errors is heteroscedastic, modeled as \text{Var}(\mathbf{y} | \mathbf{X}) = \boldsymbol{\Sigma}, a diagonal matrix whose i-th diagonal element is \sigma_i^2; the weights w_i are then proportional to the inverse variances, w_i = 1/\sigma_i^2, reflecting the relative precision of each observation.[8] These weights must be known or reliably estimated in advance.[1]

In contrast to OLS, which assumes homoscedasticity (\sigma_i^2 = \sigma^2 for all i), WLS relaxes this by allowing varying error variances while retaining unbiasedness under the correct linear model.[1] However, if homoscedasticity actually holds, applying WLS with non-uniform weights yields estimates that are less efficient than OLS, although the point estimates remain unbiased.[10] Misspecifying the weights—such as using incorrect inverse variances—does not introduce bias in the parameter estimates, provided the linearity assumption is met, but it results in a loss of efficiency and invalid standard errors, compromising inference.[10] Violations of linearity or the presence of perfect multicollinearity can produce biased or unstable estimates, underscoring the need to verify these conditions before applying WLS.[9]
Motivation
Handling Heteroscedasticity
Heteroscedasticity refers to the situation in linear regression models where the variance of the error terms, denoted \sigma_i^2, is not constant across observations but varies systematically, often depending on the values of the predictors or other factors.[11] This violates a key assumption of ordinary least squares (OLS), which requires homoscedasticity for its optimal properties. Under heteroscedasticity, observations with larger error variances contribute disproportionately to the estimation, and OLS estimates become inefficient even though they remain unbiased.[12]

The presence of heteroscedasticity has significant consequences for statistical inference in OLS regression. While the point estimates of the parameters are unbiased, their variances are larger than necessary, making OLS inefficient compared to alternatives that account for varying error variances.[13] More critically, the conventional standard errors computed under OLS are biased, often underestimated, which invalidates t-tests, F-tests, and confidence intervals and can lead to incorrect conclusions about parameter significance.[12] Weighted least squares (WLS) mitigates this by assigning weights inversely proportional to the error variances, transforming the model to restore homoscedasticity and yielding the best linear unbiased estimator (BLUE) when the weights are correctly specified, as established by Aitken's generalization of the Gauss-Markov theorem.[14]

Detecting heteroscedasticity is essential before applying WLS. One common approach is the Breusch-Pagan test, which involves regressing the squared OLS residuals on the independent variables and testing whether the coefficients are jointly zero using a chi-squared statistic; a significant result indicates that the variance depends on the predictors.[11] Another is White's test, which extends this by including squared and cross-product terms of the predictors in the auxiliary regression to capture more general, unspecified forms of heteroscedasticity, also evaluated via a chi-squared test. These tests help confirm the violation without assuming a specific variance structure; a short sketch of both tests follows the example below.

A classic real-world illustration of heteroscedasticity appears in regressions modeling household food expenditure as a function of income. For lower-income households, food spending tends to cluster closely around predicted values due to budget constraints, resulting in low error variance. In contrast, higher-income households exhibit greater variability in expenditures, influenced by diverse choices in dining and grocery habits, so \sigma_i^2 increases with income.[15] Applying WLS with weights based on this pattern, such as weights inversely related to income, can then produce more efficient estimates and reliable inference for such data.
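As a rough illustration of the detection step, the sketch below applies the Breusch-Pagan and White tests from statsmodels to simulated expenditure-style data in which the error standard deviation grows with income; the data and variable names are assumptions made for the example, not values from the cited studies.

```python
# Hedged sketch: detecting heteroscedasticity on OLS residuals with the
# Breusch-Pagan and White tests (simulated, expenditure-style data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(1)
income = rng.uniform(20, 120, 200)                       # hypothetical income data
food = 5 + 0.2 * income + rng.normal(0, 0.05 * income)   # error SD grows with income

X = sm.add_constant(income)
ols_res = sm.OLS(food, X).fit()

bp_stat, bp_pval, _, _ = het_breuschpagan(ols_res.resid, X)
w_stat, w_pval, _, _ = het_white(ols_res.resid, X)
print(f"Breusch-Pagan LM p-value: {bp_pval:.4f}")   # small p-value -> heteroscedasticity
print(f"White test p-value:       {w_pval:.4f}")
```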
Incorporating Prior Knowledge via Weights
In weighted least squares (WLS), weights can extend beyond correcting for heteroscedasticity to incorporate non-variance-based prior knowledge, such as the relative reliability of observations, effective sample sizes in aggregated data, or subjective assessments from domain experts. For instance, in analyses of grouped data where individual observations are aggregated into categories (e.g., income brackets or age groups), weights are often set proportional to the number of underlying observations in each group, reflecting the greater informational content or stability of larger aggregates. This approach ensures that estimates prioritize groups with more robust empirical support, drawing on the understanding that variance decreases with sample size under standard assumptions. Similarly, weights may be chosen inversely proportional to the cost of obtaining an observation or the estimated error in auxiliary data sources, allowing practitioners to emphasize more economical or accurate inputs in resource-constrained settings.[16]

A prominent application of such non-variance weighting occurs in meta-analysis, where WLS combines effect estimates from multiple studies by assigning weights based on study-specific sample sizes rather than solely on inverse variances. When individual study variances are similar or difficult to estimate precisely, weighting by sample size provides a practical proxy that favors larger, more precise studies, leading to an overall estimate that better reflects the accumulated evidence across the literature. This method has been shown to involve a trade-off compared to inverse-variance weighting, being less biased but less efficient, particularly when sample sizes serve as reliable indicators of precision.[17]

The primary advantage of incorporating prior knowledge through these weights lies in its ability to enhance estimator efficiency by integrating domain-specific insights, even in scenarios of homoscedasticity where ordinary least squares would suffice. By leveraging expert judgment or structural information—such as incomplete priors on parameter relations—WLS can yield more informed predictions than unweighted methods, as demonstrated in mixed estimation frameworks that blend sample data with subjective beliefs. However, a key limitation is that subjective or auxiliary-based weights, if misspecified, can distort the estimator's properties; while unbiasedness is generally preserved under correct model specification, suboptimal weights may inflate variance, reducing efficiency. This underscores the need for careful validation of weight choices to avoid compromising inferential reliability.[18]
Mathematical Formulation
Model Specification
The weighted least squares (WLS) model is formulated within the framework of linear regression as y = X\beta + \epsilon, where y is an n \times 1 vector of observed responses, X is an n \times p full-rank design matrix with rows corresponding to the covariates for each observation, \beta is a p \times 1 vector of unknown regression coefficients, and \epsilon is an n \times 1 error vector satisfying E(\epsilon) = 0 and \text{Var}(\epsilon) = \Sigma, with \Sigma a positive definite n \times n covariance matrix.[1] This setup generalizes ordinary least squares by allowing for non-constant error variances or correlations, where \Sigma captures heteroscedasticity or dependence in the errors.[14]

The core of WLS estimation involves minimizing the weighted sum of squared residuals, expressed as the objective function

(y - X\beta)^T W (y - X\beta),

where W = \Sigma^{-1} is the n \times n weight matrix that inversely scales the residuals according to their error variances; when errors are uncorrelated, W is diagonal with entries w_i = 1/\sigma_i^2 for the i-th observation's variance \sigma_i^2.[19] This weighting ensures that observations with lower variance contribute more to the parameter estimates, aligning with the Gauss-Markov theorem for efficiency under known \Sigma.[14]

Equivalently, the WLS problem can be transformed into an ordinary least squares (OLS) estimation on adjusted variables by defining y^* = W^{1/2} y and X^* = W^{1/2} X, where W^{1/2} is the symmetric positive definite square root of W; the objective then reduces to minimizing \| y^* - X^* \beta \|^2, which preserves the linear structure while accounting for the error covariance.[20] This transformation highlights how WLS reweights the data to achieve homoscedasticity in the transformed space.[1]
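A brief numerical check of this equivalence, under the assumption of known, uncorrelated error variances, is sketched below: the solution of the weighted normal equations matches the OLS solution computed on the W^{1/2}-transformed variables. The simulated data are illustrative only.

```python
# Sketch of the equivalence between WLS and OLS on transformed data, assuming
# uncorrelated errors so that W is diagonal with w_i = 1/sigma_i^2.
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x])
sigma = 0.3 * (1 + x)
y = 1.0 + 2.0 * x + rng.normal(0, sigma)

w = 1.0 / sigma**2                      # ideal weights (variances assumed known here)
sw = np.sqrt(w)

# Direct WLS normal equations: (X'WX) beta = X'W y
XtW = X.T * w
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

# OLS on transformed variables y* = W^{1/2} y, X* = W^{1/2} X
X_star, y_star = X * sw[:, None], y * sw
beta_ols_star, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

print(beta_wls)       # the two solutions agree up to numerical precision
print(beta_ols_star)
```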
Weight Selection and Matrix Construction
In weighted least squares (WLS), the ideal weights for observations with uncorrelated errors are the reciprocals of the error variances, w_i = 1/\sigma_i^2, where \sigma_i^2 is the variance of the error term for the i-th observation; this assignment ensures that more precise observations (lower variance) receive greater influence in the estimation.[1][21] The weight matrix W is then constructed as a diagonal matrix with these w_i on the diagonal, W = \operatorname{diag}(w_1, w_2, \dots, w_n), which simplifies computation while assuming independence among the errors.[1][19]

When errors are correlated, the weight matrix extends to the inverse of the full covariance matrix of the errors, W = \Sigma^{-1}, transforming WLS into generalized least squares (GLS); this full matrix accounts for both heteroscedasticity and serial or spatial correlations, though its estimation is more complex and typically deferred to GLS-specific methods.[19] For practical implementation, the diagonal form of W is preferred in standard WLS because of its computational efficiency and its sufficiency under the uncorrelated-errors assumption.[1][21]

Since error variances are often unknown in practice, weights must be estimated iteratively or through modeling. A common approach begins with an initial ordinary least squares (OLS) fit to compute residuals e_i = y_i - \hat{y}_i, from which preliminary variance estimates \hat{\sigma}_i^2 are derived, for example by squaring the residuals or by regressing them against predictors or fitted values to model the variance function.[1] These estimates yield initial weights \hat{w}_i = 1/\hat{\sigma}_i^2, which are then used in a WLS fit; the process iterates by updating residuals and weights until convergence, often requiring only one or two iterations for stability (see the sketch below).[1][19] Alternatively, model-based estimation assumes a parametric form for the variance, such as \sigma_i^2 = \sigma^2 h(x_i) where h(\cdot) is a known function (e.g., h(x_i) = 1 + 0.5 x_i^2), allowing the weights to be constructed directly from the predictors.[19]

Common pitfalls in weight selection include overfitting the variance model from limited data, which can lead to unstable or biased estimates, and failing to transform variables appropriately, which can produce a weight matrix that is not positive definite.[21][19] Small sample sizes exacerbate these issues, as estimated weights become sensitive to outliers or minor variations in the residuals, potentially undermining the efficiency gains of WLS.[21]
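The iterative weight-estimation procedure might look roughly like the following sketch, which assumes a log-linear variance model fitted to the squared residuals; the specific variance model, variable names, and number of iterations are illustrative choices rather than a fixed recipe.

```python
# Illustrative two-step (feasible) weighting: model log(sigma_i^2) from the
# squared OLS residuals, then refit by WLS; iterate once more for stability.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)
y = 4 + 1.5 * x + rng.normal(0, 0.4 * x)    # error SD proportional to x
X = sm.add_constant(x)

res = sm.OLS(y, X).fit()                    # step 1: unweighted fit

for _ in range(2):                          # one or two iterations usually suffice
    log_res2 = np.log(res.resid**2)         # step 2: model the variance
    var_fit = sm.OLS(log_res2, X).fit()
    sigma2_hat = np.exp(var_fit.fittedvalues)
    weights = 1.0 / sigma2_hat              # w_i = 1 / sigma_hat_i^2
    res = sm.WLS(y, X, weights=weights).fit()  # step 3: refit with estimated weights

print(res.params)
```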
Estimation Methods
Closed-Form Solution
The closed-form solution for the weighted least squares (WLS) estimator \hat{\beta} is derived by minimizing the objective function, the weighted sum of squared residuals S(\beta) = (y - X\beta)^T W (y - X\beta), where y is the response vector, X is the design matrix, \beta is the parameter vector, and W is the positive definite diagonal weight matrix.[19][1]

To obtain the minimizer, differentiate S(\beta) with respect to \beta and set the result to zero:

\frac{\partial S}{\partial \beta} = -2 X^T W (y - X \beta) = 0.

This yields the normal equations:

X^T W X \beta = X^T W y.

Solving for \beta gives the explicit estimator:

\hat{\beta} = (X^T W X)^{-1} X^T W y.[19][1]

This solution exists provided that X^T W X is invertible, which requires the design matrix X to have full column rank and the weight matrix W to be positive definite.[19][1]

For numerical computation, direct inversion of X^T W X can be unstable because of potential ill-conditioning; instead, a QR decomposition of the transformed design matrix W^{1/2} X is recommended, as it provides a stable alternative by reducing the problem to an ordinary least squares solution on the transformed data.[22]
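A minimal sketch of the QR-based computation, compared against the normal equations on simulated data, is given below; the data-generating choices are arbitrary and serve only to illustrate the two solution paths.

```python
# Numerically stabler solve via QR of W^{1/2} X, compared with forming the
# normal equations directly.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
w = 1.0 / (0.1 + x)**2
y = 0.5 - 1.2 * x + rng.normal(0, 0.1 + x)

sw = np.sqrt(w)
Q, R = np.linalg.qr(X * sw[:, None])         # QR of the row-scaled design matrix
beta_qr = np.linalg.solve(R, Q.T @ (y * sw)) # back-solve R beta = Q' W^{1/2} y

XtW = X.T * w
beta_ne = np.linalg.solve(XtW @ X, XtW @ y)  # normal equations, for comparison
print(beta_qr, beta_ne)
```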
Iterative and Numerical Approaches
In cases where the weight matrix W depends on the unknown parameters \beta, such as in robust regression variants, the closed-form solution for weighted least squares (WLS) becomes infeasible, necessitating iterative methods to achieve convergence.[23]

Iteratively reweighted least squares (IRLS) addresses this by alternating between estimating \beta using ordinary WLS with the current weights and updating the weights based on the residuals from that estimate, repeating until the parameter values stabilize within a specified tolerance.[23] This approach, originally developed for robust estimation to downweight outliers via functions such as Huber's, ensures that the weights iteratively reflect the reliability of each observation.[23]

Numerical challenges arise in WLS computations, particularly when the matrix X^T W X is ill-conditioned due to multicollinearity in X or extreme variation in the diagonal elements of W, leading to unstable or imprecise parameter estimates.[24] To mitigate this, regularization techniques such as ridge regression can be incorporated by adding a penalty term to the objective function, or the Moore-Penrose pseudoinverse can be employed to compute a least-squares solution that handles rank deficiency.[24] These methods stabilize the inversion process without assuming full rank, though they may introduce bias that requires careful tuning.[25]

Software packages facilitate efficient WLS implementation, including iterative variants like IRLS. In R, the lm function supports weights via the weights argument, performing WLS directly, or iteratively if needed through extensions such as MASS::rlm for robust cases.[26] Similarly, Python's statsmodels library provides the WLS class in statsmodels.regression.linear_model, which accepts a weights array and handles numerical stability internally via optimized linear algebra routines.

IRLS and related numerical approaches are particularly suited for large datasets where direct matrix inversion is computationally prohibitive, for scenarios with non-diagonal covariance structures approximated by diagonal weights, or when weights must be refined based on provisional \beta estimates, as in robust or heteroscedasticity-adjusted models.[23]
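The following is a compact, illustrative IRLS implementation with Huber-type weights; the tuning constant, scale estimate, and convergence tolerance are conventional but assumed choices, not a prescribed standard.

```python
# IRLS sketch with Huber-type weights: weights depend on the current residuals,
# so beta and w are updated alternately until the estimates stabilize.
import numpy as np

def irls_huber(X, y, c=1.345, tol=1e-8, max_iter=50):
    """Iteratively reweighted least squares with Huber weights."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS starting values
    for _ in range(max_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12    # robust (MAD-based) scale
        u = r / scale
        w = np.minimum(1.0, c / np.maximum(np.abs(u), 1e-12))  # Huber weight function
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new, w
        beta = beta_new
    return beta, w

rng = np.random.default_rng(5)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])
y = 1 + 2 * x + rng.normal(0, 0.5, 100)
y[:5] += 10                                              # a few gross outliers
beta_hat, weights = irls_huber(X, y)
print(beta_hat)
```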
Statistical Properties
Estimator Variance and Covariance
The covariance matrix of the weighted least squares (WLS) estimator \hat{\beta} is given by \operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T W X)^{-1}, where \sigma^2 is the error variance, X is the design matrix, and W is the diagonal weight matrix, assuming the weights are correctly specified to account for the error structure.[5] This formula arises from the linear form of the estimator \hat{\beta} = (X^T W X)^{-1} X^T W y and the assumption that the errors have mean zero and covariance \sigma^2 W^{-1}.[2]

The diagonal elements of \operatorname{Var}(\hat{\beta}) are the variances of the individual parameter estimates \hat{\beta}_j, quantifying the uncertainty in each coefficient, while the off-diagonal elements are the covariances between pairs of estimates, indicating how the estimation of one parameter affects the others through the shared data structure.[5] These covariances are particularly useful for understanding multicollinearity or dependencies in the parameter space.[27]

When the weight matrix W = \Sigma^{-1}, where \Sigma is the true error covariance matrix, the WLS estimator achieves the minimum variance among all linear unbiased estimators, as established by the Gauss-Markov theorem in Aitken's generalized form, and in the Gaussian error case it attains the Cramér-Rao lower bound for efficiency.[2] Under heteroscedasticity, this results in smaller variances than ordinary least squares (OLS) estimates, which ignore the varying error variances and thus fail to minimize risk in such scenarios.[2] Additionally, under standard regularity conditions such as a fixed design matrix, bounded moments, and correct weight specification, the WLS estimator is asymptotically normal: \sqrt{n} (\hat{\beta} - \beta) \xrightarrow{d} N\left(0, \sigma^2 \lim_{n \to \infty} (X^T W X / n)^{-1}\right).[28]

The error variance \sigma^2 is typically estimated from the weighted residuals as \hat{\sigma}^2 = \frac{(y - X \hat{\beta})^T W (y - X \hat{\beta})}{n - p}, where n is the sample size and p is the number of parameters; this unbiased estimator adjusts the weighted sum of squared residuals by the degrees of freedom to provide a consistent measure of dispersion.[5] Substituting \hat{\sigma}^2 into the covariance matrix yields the estimated \widehat{\operatorname{Var}}(\hat{\beta}), which is used for inference on the parameters.[5]
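These quantities can be computed directly from the closed-form expressions; the sketch below, on simulated data with the weights treated as known, estimates \hat{\sigma}^2 from the weighted residuals and forms the estimated covariance matrix and standard errors.

```python
# Sketch: estimated covariance matrix sigma2_hat * (X'WX)^{-1} and standard errors.
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 2
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
sigma = 0.2 * x
y = 3 + 0.8 * x + rng.normal(0, sigma)
w = 1.0 / sigma**2                              # weights treated as known

XtW = X.T * w
XtWX_inv = np.linalg.inv(XtW @ X)
beta_hat = XtWX_inv @ (XtW @ y)

resid = y - X @ beta_hat
sigma2_hat = (resid * w) @ resid / (n - p)      # weighted residual variance estimate
cov_beta = sigma2_hat * XtWX_inv
print(np.sqrt(np.diag(cov_beta)))               # standard errors of beta_hat
```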
Confidence Intervals for Parameters
In weighted least squares (WLS), inference for the parameters \hat{\beta} relies on the variance-covariance matrix of the estimator, which provides the foundation for standard errors and, in turn, confidence intervals.[1] The standard error for the j-th parameter estimate is

\text{SE}(\hat{\beta}_j) = \sqrt{ \left[ (X^T W X)^{-1} \right]_{jj} \hat{\sigma}^2 },

where \hat{\sigma}^2 is the estimated error variance, typically computed as the weighted mean squared error of the residuals.[5] This measure quantifies the precision of \hat{\beta}_j after accounting for the heteroscedasticity addressed by the weight matrix W.[5]

For finite samples, confidence intervals for individual parameters assume approximate normality of the estimator and use the t-distribution:

\hat{\beta}_j \pm t_{n-p, 1-\alpha/2} \cdot \text{SE}(\hat{\beta}_j),

with n - p degrees of freedom, where n is the number of observations and p is the number of parameters.[5] This construction ensures that the interval captures the true \beta_j with probability 1 - \alpha, adjusting for the reduced degrees of freedom in the WLS framework.[5]

Hypothesis testing for parameters in WLS follows standard procedures adapted to the weighted model. For a single coefficient, the Wald test statistic \hat{\beta}_j / \text{SE}(\hat{\beta}_j) follows a t-distribution with n - p degrees of freedom under the null hypothesis H_0: \beta_j = 0.[5] For joint tests on subsets of \beta involving q restrictions, an F-test is used, with the statistic distributed as F_{q, n-p} under the null, comparing the restricted and unrestricted weighted residual sums of squares.[29]

When weights are estimated rather than known—such as in feasible generalized least squares—inference based on the conventional standard errors may be invalid owing to uncertainty in W. In these cases, robust standard errors via the sandwich estimator provide consistent inference, adjusting the covariance matrix to account for potential misspecification: \widehat{\text{Var}}(\hat{\beta}) = (X^T W X)^{-1} (X^T W \hat{\Omega} W X) (X^T W X)^{-1}, where \hat{\Omega} is a diagonal matrix of squared weighted residuals (often with finite-sample corrections such as HC3).[30] This approach maintains valid confidence intervals and tests even under weight estimation error.[30]
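In practice these intervals and their robust alternatives are available from standard software; the hedged sketch below uses statsmodels' WLS class on simulated data, first with the conventional covariance and then with an HC3 sandwich covariance.

```python
# Sketch: t-based confidence intervals and HC3 sandwich standard errors for WLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 150)
X = sm.add_constant(x)
sigma = 0.3 * x
y = 2 + 0.5 * x + rng.normal(0, sigma)
w = 1.0 / sigma**2

res = sm.WLS(y, X, weights=w).fit()
print(res.bse)                   # conventional SEs from sigma2_hat (X'WX)^{-1}
print(res.conf_int(alpha=0.05))  # t-based 95% intervals with n - p degrees of freedom

# If the weights were estimated rather than known, a sandwich (HC3) covariance
# is a common safeguard:
res_robust = sm.WLS(y, X, weights=w).fit(cov_type="HC3")
print(res_robust.bse)
```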
Residual Diagnostics
In weighted least squares (WLS) regression, residual diagnostics begin with the computation of weighted residuals, defined as r_i = \sqrt{w_i} (y_i - \hat{y}_i), where w_i is the weight for the i-th observation, y_i is the observed response, and \hat{y}_i is the fitted value. These weighted residuals standardize the ordinary residuals by the square root of the weights, accounting for the heteroscedasticity assumed in the model. If the weighting scheme correctly captures the variance structure, the weighted residuals should exhibit homoscedasticity (constant variance) and approximate a normal distribution with mean zero.[1]

Common diagnostic tools include graphical assessments and formal statistical tests applied to these weighted residuals. Plots of weighted residuals against fitted values or predictor variables should display no discernible patterns, such as funnel shapes indicative of unresolved heteroscedasticity, to confirm model adequacy.[31] For normality, a Q-Q plot of the weighted residuals versus theoretical quantiles can visually check for deviations from a straight line, while the Shapiro-Wilk test provides a formal hypothesis test; under the null hypothesis of normality, the test statistic W approaches 1, with small p-values signaling non-normality.[32] These diagnostics help verify that the assumptions underlying WLS estimation hold, ensuring reliable inference.

Influence measures in WLS adapt standard regression diagnostics to incorporate weights, identifying observations that disproportionately affect the parameter estimates. The weighted version of Cook's distance, D_i = \frac{ r_i^{*2} h_{ii} }{ p (1 - h_{ii}) }, where r_i^* is the studentized weighted residual and h_{ii} is the i-th leverage adjusted for weights, quantifies the change in fitted values across all observations when the i-th case is excluded; values exceeding 4/n (with n observations and p parameters) flag influential points.[31] Similarly, weighted DFFITS, which measures the scaled difference in predicted values with and without the observation, uses the formula \text{DFFITS}_i = r_i^* \sqrt{h_{ii}/(1 - h_{ii})}, with thresholds around 2\sqrt{p/n} indicating influence.[33] These metrics, computed via weighted leverage matrices, prioritize observations with higher weights in assessing impact.

To validate the model, examine whether the weighting has successfully addressed heteroscedasticity by re-plotting weighted residuals against fitted values or predictors; random scatter around zero confirms resolution, while persistent trends suggest misspecified weights.[1] If diagnostics reveal remaining issues, such as patterns implying correlated errors beyond simple heteroscedasticity, generalized least squares (GLS) may be appropriate as an extension, incorporating a full covariance structure for the errors.[34]
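A hand-computed sketch of these diagnostics on simulated data is shown below; it forms the weighted hat matrix, studentized weighted residuals, and the weighted Cook's distance using the formulas above, with the 4/n cutoff applied as the rule of thumb mentioned earlier.

```python
# Diagnostic sketch: weighted residuals, weighted leverages, and Cook's distance
# computed from the whitened (W^{1/2}-transformed) data.
import numpy as np

rng = np.random.default_rng(8)
n, p = 80, 2
x = rng.uniform(0, 4, n)
X = np.column_stack([np.ones(n), x])
sigma = 0.2 * (1 + x)
y = 1 + 1.2 * x + rng.normal(0, sigma)
w = 1.0 / sigma**2

sw = np.sqrt(w)
Xs, ys = X * sw[:, None], y * sw                 # transformed (whitened) data
H = Xs @ np.linalg.inv(Xs.T @ Xs) @ Xs.T         # weighted hat matrix
h = np.diag(H)                                   # weighted leverages h_ii

beta_hat = np.linalg.lstsq(Xs, ys, rcond=None)[0]
r_w = sw * (y - X @ beta_hat)                    # weighted residuals
s2 = np.sum(r_w**2) / (n - p)
r_stud = r_w / np.sqrt(s2 * (1 - h))             # studentized weighted residuals
cooks_d = r_stud**2 * h / (p * (1 - h))          # weighted Cook's distance

print(np.where(cooks_d > 4 / n)[0])              # flag potentially influential points
```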
Applications and Extensions
Practical Examples
One practical application of weighted least squares (WLS) arises in linear regression with heteroscedastic errors, where the variance of the residuals varies across observations. A classic example is Francis Galton's 1877 study on the inheritance of pea plant characteristics, using data on parent and progeny pea diameters to model the relationship between parental size and average offspring size. In this dataset of seven observations, the standard deviation (SD) of the progeny measurements increases with parental size, indicating heteroscedasticity.[1]

To apply WLS, weights are computed as the inverse of the squared standard deviations (w_i = 1/\sigma_i^2), where \sigma_i is the SD for each parental group; this downweights observations with higher variability and upweights those with lower variability. The WLS model is then estimated by minimizing the weighted sum of squared residuals, yielding the fitted equation \hat{Y} = 0.12796 + 0.2048 X, where Y is progeny diameter and X is parent diameter. In comparison, ordinary least squares (OLS) produces \hat{Y} = 0.12703 + 0.2100 X, which overemphasizes the noisier right-side data points. The WLS fit pulls the regression line closer to the low-variance points on the left, reducing the overall estimator variance and improving efficiency without biasing the coefficients.[1] This adjustment enhances predictive accuracy; for instance, WLS better captures the inheritance pattern by accounting for measurement precision, leading to more reliable inferences about genetic regression toward the mean.[1]

Another common use of WLS is in analyzing survey data to adjust for complex sampling designs, such as oversampling of certain subgroups, through post-stratification weights that align sample estimates with known population totals. Consider the 2009-2010 National Health and Nutrition Examination Survey (NHANES), a large-scale U.S. health study with 10,162 respondents, where sampling weights w_i correct for unequal selection probabilities (e.g., oversampling of children and minorities). These weights are normalized so their sum equals the sample size, and for WLS the diagonal weight matrix uses the positive square roots \sqrt{w_i} to incorporate design effects into the estimation.[35]

The model regresses body weight (kg) on height (cm), with WLS solving the weighted normal equations (X^T W X) \hat{\beta} = X^T W y, where W is diagonal with \sqrt{w_i}. Using \sqrt{w_i}, WLS yields a population-representative mean body weight of 70.86 kg and, in a no-intercept model, a height coefficient of 0.46, whereas the unweighted analysis gives a biased mean of 63.29 kg (underestimating because lighter subgroups are oversampled) and a slope of 0.47. WLS thus produces unbiased, design-consistent estimates, changing the coefficients to better represent the broader population and improving predictions for public health applications such as obesity modeling.[35]

In both cases, WLS refines the coefficient estimates and their variance properties, yielding more precise predictions tailored to data quality and sampling structure compared with unweighted methods.[1][35]
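The grouped-data workflow described above can be reproduced in outline with statsmodels; the numbers below are hypothetical group means and standard deviations, not the actual Galton or NHANES values, and serve only to show how inverse-variance weights enter the fit.

```python
# Hedged sketch of the grouped-data WLS workflow with hypothetical values.
import numpy as np
import statsmodels.api as sm

parent = np.array([0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21])     # illustrative X values
progeny = np.array([0.158, 0.160, 0.163, 0.165, 0.167, 0.169, 0.172])  # illustrative Y means
sd = np.array([0.019, 0.020, 0.021, 0.021, 0.022, 0.022, 0.023])  # illustrative group SDs

X = sm.add_constant(parent)
wls_fit = sm.WLS(progeny, X, weights=1.0 / sd**2).fit()  # w_i = 1 / sigma_i^2
ols_fit = sm.OLS(progeny, X).fit()
print(wls_fit.params)   # WLS coefficients (downweights the noisier groups)
print(ols_fit.params)   # OLS coefficients, for comparison
```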
Relation to Generalized Least Squares
Generalized least squares (GLS) provides a broader framework for linear regression that accommodates both heteroscedasticity and correlation in the error terms, positioning weighted least squares (WLS) as a specific instance within it. The GLS estimator minimizes the generalized sum of squared residuals

(y - X\beta)^T \Sigma^{-1} (y - X\beta),

where \Sigma denotes the full n \times n covariance matrix of the error vector, assuming \mathbb{E}(\epsilon) = 0 and \text{Var}(\epsilon) = \Sigma.[36] This approach yields the best linear unbiased estimator (BLUE) under the Gauss-Markov theorem when \Sigma is known, enhancing efficiency over ordinary least squares by accounting for the error structure.[2]

WLS emerges as a special case of GLS when \Sigma is diagonal, implying uncorrelated errors with potentially varying variances (heteroscedasticity) but no off-diagonal covariances.[37] In this scenario, the weights are the diagonal elements of \Sigma^{-1}, that is, the inverse variances, reducing GLS to the familiar WLS form in which each observation is scaled by its inverse variance.[27]

GLS is particularly advantageous over WLS in settings with correlated errors, such as autocorrelation in time series data or intra-cluster correlations in panel or grouped observations, where WLS would fail to capture the dependencies and lead to inefficient estimates.[38] For instance, in autoregressive time series models, GLS transforms the data to whiten the errors, yielding more precise parameter estimates than WLS, which assumes independence.[36]

When \Sigma is unknown, feasible GLS (FGLS) addresses this by iteratively estimating the covariance structure—often starting from an initial ordinary least squares fit to obtain residuals, then updating the weights via maximum likelihood or the method of moments—and re-estimating the model until convergence, akin to the iteratively reweighted least squares procedure.[27] This approximation performs nearly as well as true GLS under mild conditions and is widely implemented in statistical software for practical analysis.[2]

Historically, WLS preceded the GLS formulation; A. C. Aitken formalized in 1936 the estimator that minimizes a weighted sum of squared residuals for linear combinations of observations with known variances.[14] The generalization to arbitrary covariance structures was advanced by C. R. Rao in 1965, who developed a unified least squares theory for models with stochastic parameters, encompassing GLS as a cornerstone for inference in correlated error settings.[39]
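As a closing sketch, the example below contrasts GLS with a full error covariance against WLS restricted to the diagonal of that covariance, using statsmodels; the AR(1) correlation structure, its parameter, and the simulated data are illustrative assumptions.

```python
# Sketch: GLS with a full (assumed AR(1)-style) error covariance versus WLS,
# which uses only the diagonal of Sigma and therefore ignores the correlation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 100
x = rng.uniform(0, 1, n)
X = sm.add_constant(x)

rho = 0.6                                           # assumed AR(1) correlation
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Sigma)
y = 1 + 2 * x + L @ rng.normal(size=n)              # correlated errors

gls_fit = sm.GLS(y, X, sigma=Sigma).fit()           # full covariance -> GLS
# Diagonal of Sigma is all ones here, so this WLS fit reduces to OLS and
# leaves the serial correlation unmodeled.
wls_fit = sm.WLS(y, X, weights=1.0 / np.diag(Sigma)).fit()
print(gls_fit.params, wls_fit.params)
```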