
Linear least squares

Linear least squares is a fundamental method in statistics and linear algebra for estimating the parameters of a linear model by minimizing the sum of the squared differences between observed data points and the values predicted by the model, providing an optimal fit in the least-squares sense. This technique, also known as ordinary least squares (OLS), addresses overdetermined systems of linear equations where the number of equations exceeds the number of unknowns, seeking the solution \mathbf{x} that minimizes \|\mathbf{Ax} - \mathbf{b}\|_2^2, where \mathbf{A} is the design matrix, \mathbf{x} the parameter vector, and \mathbf{b} the observation vector. The solution is obtained by solving the normal equations \mathbf{A}^T \mathbf{A} \mathbf{x} = \mathbf{A}^T \mathbf{b}, which yield the best linear unbiased estimator under the Gauss-Markov assumptions of uncorrelated, homoscedastic errors. The method originated in the late 18th and early 19th centuries, with Carl Friedrich Gauss reportedly developing it around 1795 and later applying it to astronomical problems such as predicting the orbit of the asteroid Ceres, though he published it only in 1809. Adrien-Marie Legendre independently introduced the technique in 1805 in his work Nouvelles méthodes pour la détermination des orbites des comètes, formalizing its use for least-squares fitting. Robert Adrain also contributed similar ideas around 1808, but Gauss and Legendre are primarily credited with establishing the modern framework. By the mid-19th century, the method gained prominence in fields like geodesy and astronomy, and in the late 19th century Francis Galton applied it to biological data, paving the way for regression analysis. In practice, linear least squares is widely applied in regression analysis to model relationships between variables, such as in simple linear regression, where it fits a line y = ax + b to data points by minimizing \sum (y_i - (a x_i + b))^2. It underpins many statistical tools, including multiple regression, and is essential in engineering for calibration, signal processing, and control systems, as well as in machine learning for tasks like parameter estimation in linear models. Despite its optimality for Gaussian errors, the method is sensitive to outliers and assumes linearity, often requiring robust variants or preprocessing for real-world data. Computationally, it benefits from efficient algorithms such as QR decomposition or singular value decomposition (SVD) to handle ill-conditioned matrices.

Introduction and Formulation

Problem Definition

The linear least squares problem arises in the context of linear algebra, where basic concepts such as vectors, matrices, and norms provide the foundational framework. A vector is an ordered list of numbers, typically represented as a column of elements from the real numbers \mathbb{R}, such as \mathbf{x} \in \mathbb{R}^n, which can be thought of as a point in n-dimensional space. A matrix \mathbf{A} is a rectangular array of numbers arranged in rows and columns, with dimensions m \times n indicating m rows and n columns, used to represent linear transformations or systems of equations. The Euclidean norm, denoted \|\mathbf{v}\|_2 for a vector \mathbf{v}, measures its length as the square root of the sum of the squares of its components, i.e., \|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^n v_i^2}, providing a metric for the distance from the origin. In linear systems, an overdetermined system occurs when the number of equations exceeds the number of unknowns, formulated as \mathbf{A}\mathbf{x} = \mathbf{b} where \mathbf{A} is an m \times n matrix with m > n and \mathbf{b} \in \mathbb{R}^m is a column vector of observations. Such systems generally lack an exact solution because the equations are inconsistent, meaning no \mathbf{x} \in \mathbb{R}^n satisfies all equations simultaneously due to measurement errors, noise, or model approximations in real-world data. Instead, the goal is to find an approximate solution \hat{\mathbf{x}} that best fits the data in a quantifiable sense. The linear least squares problem specifically seeks the \hat{\mathbf{x}} that minimizes the squared Euclidean norm of the residual vector, defined as \|\mathbf{A}\hat{\mathbf{x}} - \mathbf{b}\|_2^2 = \sum_{i=1}^m ((\mathbf{A}\hat{\mathbf{x}})_i - b_i)^2, where the residual for each equation represents the deviation between the predicted value (\mathbf{A}\hat{\mathbf{x}})_i and the observed b_i. This minimization yields the "best" approximate solution by reducing the total squared error, making it particularly useful in applications like data fitting and regression where exact matches are infeasible. The solution to this optimization can be derived using the normal equations, as explored in subsequent sections. The method traces its origins to early 19th-century astronomy, where Adrien-Marie Legendre first published the approach in 1805 to determine the orbits of comets from observational data, emphasizing its role in handling erroneous measurements. Independently, Carl Friedrich Gauss developed the technique around 1795 at age 18, applying it to predict the position of the asteroid Ceres, though he publicized it only in 1809, amid a priority dispute with Legendre. Gauss's contributions included probabilistic justifications assuming normal errors, solidifying the method's foundational impact on statistics and computation.

Objective Function

The objective function in linear least squares is defined as the squared Euclidean norm of the residual vector, given by S(\mathbf{x}) = \|\mathbf{Ax} - \mathbf{b}\|_2^2 = (\mathbf{Ax} - \mathbf{b})^T (\mathbf{Ax} - \mathbf{b}), where \mathbf{A} is an m \times n matrix with m > n, \mathbf{b} \in \mathbb{R}^m is the observation vector, and \mathbf{x} \in \mathbb{R}^n is the parameter vector to be estimated. This formulation seeks to find the \mathbf{x} that minimizes S(\mathbf{x}) for an overdetermined system \mathbf{Ax} \approx \mathbf{b}. Expanding the quadratic form yields S(\mathbf{x}) = \mathbf{x}^T (\mathbf{A}^T \mathbf{A}) \mathbf{x} - 2 (\mathbf{A}^T \mathbf{b})^T \mathbf{x} + \mathbf{b}^T \mathbf{b}, where the constant term \mathbf{b}^T \mathbf{b} does not influence the minimization. The use of squared residuals in this objective minimizes the total squared deviations between the observed data \mathbf{b} and the model predictions \mathbf{Ax}, providing a smooth, convex loss function that is twice differentiable and amenable to gradient-based optimization. This squared error criterion is statistically motivated: assuming the errors \mathbf{b} - \mathbf{Ax} follow a multivariate normal distribution with zero mean and covariance proportional to the identity matrix, minimizing S(\mathbf{x}) is equivalent to maximizing the likelihood of the observations. Geometrically, the objective measures the squared Euclidean distance from \mathbf{b} to the subspace \{\mathbf{Ax} \mid \mathbf{x} \in \mathbb{R}^n\} (the column space of \mathbf{A}), with the residuals representing the perpendicular component from \mathbf{b} to that column space.
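As a quick sanity check of the expansion above, the following sketch (using NumPy, with made-up illustrative values for \mathbf{A}, \mathbf{b}, and a trial point \mathbf{x}) evaluates S(\mathbf{x}) both as a squared residual norm and via the expanded quadratic form:

```python
import numpy as np

# Minimal sketch: evaluate the least-squares objective S(x) two ways,
# directly as a squared residual norm and via the expanded quadratic form.
# The matrix A, vector b, and trial point x are made-up illustrative values.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2))   # overdetermined: m = 6 rows, n = 2 unknowns
b = rng.normal(size=6)
x = np.array([0.5, -1.0])

S_direct = np.linalg.norm(A @ x - b) ** 2
S_expanded = x @ (A.T @ A) @ x - 2 * (A.T @ b) @ x + b @ b
print(S_direct, S_expanded)   # the two values agree up to rounding
```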

Mathematical Frameworks

Standard Linear Model

The standard linear model in the context of regression posits that the observed response vector \mathbf{y}, an m \times 1 column of m measurements, can be expressed as a linear combination of predictor variables plus random errors: \mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}, where X is the m \times n design matrix containing the predictor values (including a column of ones for the intercept term), \boldsymbol{\beta} is the n \times 1 vector of unknown parameters, and \boldsymbol{\varepsilon} is the m \times 1 vector of error terms. This formulation frames linear least squares as a statistical model for estimating \boldsymbol{\beta} from data where the predictors in X are fixed and non-random, distinguishing it from the more general system A \mathbf{x} \approx \mathbf{b} by treating \mathbf{y} explicitly as noisy observations of a response and incorporating an intercept to model the mean response at zero predictor values. The ordinary least squares (OLS) estimator of the parameters is defined as the value \hat{\boldsymbol{\beta}} that minimizes the squared Euclidean norm of the residuals, \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - X \boldsymbol{\beta}\|_2^2 = \arg\min_{\boldsymbol{\beta}} (\mathbf{y} - X \boldsymbol{\beta})^\top (\mathbf{y} - X \boldsymbol{\beta}). This objective function quantifies the total squared deviation between observed and predicted responses, providing a measure of fit under the model's linear structure. Key assumptions underlying this model include linearity, which requires that the conditional expectation of the response given the predictors is a linear function of those predictors, E(\mathbf{y} | X) = X \boldsymbol{\beta}, ensuring the model's functional form accurately captures the relationship without inherent nonlinearity. Additionally, the errors are assumed to be independent across observations, meaning \text{Cov}(\varepsilon_i, \varepsilon_j) = 0 for i \neq j, which supports the validity of treating each data point as a separate realization without systematic dependencies. These assumptions focus on the structural and error properties essential for the model's setup, prior to considerations of inference or estimation properties.

Alternative Formulations

The linear least squares problem admits a geometric interpretation as the orthogonal projection of the response vector \mathbf{b} \in \mathbb{R}^m onto the column space of the matrix \mathbf{A} \in \mathbb{R}^{m \times n}, where m \geq n and \operatorname{rank}(\mathbf{A}) = n. The solution \hat{\mathbf{x}} yields the projection \mathbf{p} = \mathbf{A} \hat{\mathbf{x}}, which minimizes the distance \|\mathbf{b} - \mathbf{p}\|_2, and the residual \mathbf{r} = \mathbf{b} - \mathbf{A} \hat{\mathbf{x}} lies in the orthogonal complement of the column space of \mathbf{A}. This orthogonality condition is expressed as \mathbf{A}^T \mathbf{r} = \mathbf{0}, ensuring that \mathbf{r} is perpendicular to every column of \mathbf{A}. This formulation highlights the problem's connection to Hilbert space projections and underpins numerical methods like QR decomposition. For cases with linear equality constraints, such as \mathbf{C} \hat{\mathbf{x}} = \mathbf{d} where \mathbf{C} \in \mathbb{R}^{k \times n} and k < n, the problem becomes a constrained minimization of \|\mathbf{Ax} - \mathbf{b}\|_2^2 subject to the constraints. The Lagrange multiplier method introduces auxiliary variables \boldsymbol{\lambda} \in \mathbb{R}^k to form the Lagrangian \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = \|\mathbf{Ax} - \mathbf{b}\|_2^2 + 2 \boldsymbol{\lambda}^T (\mathbf{Cx} - \mathbf{d}). Stationarity requires \nabla_{\mathbf{x}} \mathcal{L} = 0 and \nabla_{\boldsymbol{\lambda}} \mathcal{L} = 0, leading to a coupled system that incorporates the original objective and constraints, as sketched in the example below. This setup is foundational for handling side conditions in applied estimation problems. Total least squares (TLS) provides an alternative when errors are present in both the explanatory variables (columns of \mathbf{A}) and the response \mathbf{b}, contrasting with the standard formulation that assumes errors only in \mathbf{b} and minimizes vertical (or orthogonal in the response direction) distances. In TLS, the objective is to minimize the Frobenius norm of the perturbation matrix [\Delta \mathbf{A}, \Delta \mathbf{b}] such that (\mathbf{A} + \Delta \mathbf{A}) \mathbf{x} = \mathbf{b} + \Delta \mathbf{b}, which geometrically corresponds to minimizing perpendicular distances from data points to the fitted subspace. This makes TLS suitable for orthogonal regression problems, such as in calibration or errors-in-variables models, though it is more sensitive to outliers than ordinary least squares. Iterative reweighted least squares (IRLS) reformulates the problem as a sequence of weighted least squares minimizations, where weights are updated iteratively based on the current parameter estimates to address heteroscedasticity or robustness needs. Starting from initial weights (often uniform), each iteration solves a weighted problem \min_{\mathbf{x}} \sum_{i=1}^m w_i (y_i - \mathbf{a}_i^T \mathbf{x})^2, with weights w_i derived from residuals or variance functions, converging to the solution of the original weighted objective. As a conceptual bridge to nonlinear estimation, IRLS underlies fitting algorithms for generalized linear models by linearizing link functions through weighted updates.
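To make the equality-constrained formulation concrete, the following sketch assembles and solves the stationarity (KKT) system implied by the Lagrangian above; the particular matrices A, b, C, and d are illustrative assumptions rather than data from the text:

```python
import numpy as np

# Minimal sketch of equality-constrained least squares via the Lagrangian:
# minimize ||Ax - b||^2 subject to Cx = d. All data are illustrative.
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)
C = np.array([[1.0, 1.0, 1.0]])   # one constraint: x1 + x2 + x3 = 1
d = np.array([1.0])

n, k = A.shape[1], C.shape[0]
# Stationarity of L(x, lam) = ||Ax - b||^2 + 2 lam^T (Cx - d) gives the block system
#   [A^T A  C^T] [x  ]   [A^T b]
#   [C      0  ] [lam] = [d    ]
KKT = np.block([[A.T @ A, C.T],
                [C, np.zeros((k, k))]])
rhs = np.concatenate([A.T @ b, d])
sol = np.linalg.solve(KKT, rhs)
x_constrained, lam = sol[:n], sol[n:]
print(x_constrained, C @ x_constrained)   # constraint holds: C x ≈ d
```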

Derivation and Computation

Normal Equations

The normal equations provide the closed-form solution to the linear least squares problem by minimizing the objective function S(\mathbf{x}) = \|\mathbf{b} - A\mathbf{x}\|_2^2, where A is an m \times n matrix with m \geq n and \mathbf{b} \in \mathbb{R}^m. To derive them, consider the squared-error objective and set its gradient with respect to \mathbf{x} to zero. Expanding S(\mathbf{x}) = (\mathbf{b} - A\mathbf{x})^T (\mathbf{b} - A\mathbf{x}) = \mathbf{b}^T \mathbf{b} - 2\mathbf{x}^T A^T \mathbf{b} + \mathbf{x}^T A^T A \mathbf{x}, the gradient is \nabla S(\mathbf{x}) = -2 A^T \mathbf{b} + 2 A^T A \mathbf{x}. Setting this equal to zero yields the normal equations: A^T A \mathbf{x} = A^T \mathbf{b}. This system arises algebraically from the orthogonality condition that the residual \mathbf{b} - A\hat{\mathbf{x}} is perpendicular to the column space of A, ensuring the projection minimizes the error. Assuming A has full column rank (i.e., its columns are linearly independent, which requires \operatorname{rank}(A) = n), the matrix A^T A is invertible because it is positive definite: for any nonzero \mathbf{y} \in \mathbb{R}^n, \mathbf{y}^T A^T A \mathbf{y} = \|A \mathbf{y}\|_2^2 > 0. The unique solution is then \hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b}. This formula gives the estimate \hat{\mathbf{x}} that minimizes S(\mathbf{x}). In the more general case where A may not have full column rank, the normal equations still hold, but the solution is not unique; any minimizer can be expressed using the Moore-Penrose pseudoinverse A^+, where \hat{\mathbf{x}} = A^+ \mathbf{b} and A^+ = (A^T A)^\dagger A^T with \dagger denoting the pseudoinverse of A^T A. This provides a generalized algebraic solution consistent with the minimum-norm least squares problem. The normal equations were originally formulated by Adrien-Marie Legendre in his 1805 work on comet orbits, marking a foundational contribution to the method of least squares.
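A minimal sketch of the normal-equations route, assuming a small full-column-rank problem with arbitrary synthetic data, and comparing against NumPy's built-in least-squares solver:

```python
import numpy as np

# Minimal sketch: solve the normal equations A^T A x = A^T b for a small
# synthetic full-column-rank problem and compare with numpy's lstsq.
rng = np.random.default_rng(2)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # closed form via normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # library solution for comparison
print(np.allclose(x_normal, x_lstsq))              # True for well-conditioned A
```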

Numerical Methods

To solve the linear least squares problem \min_x \|Ax - b\|_2 numerically, methods based on orthogonal factorizations are preferred over the normal equations A^T A x = A^T b due to better stability, as the latter can square the condition number and amplify errors in ill-conditioned cases.

QR Decomposition

The QR decomposition factors the matrix A \in \mathbb{R}^{m \times n} (with m \geq n) as A = QR, where Q has orthonormal columns (Q^T Q = I) and R is upper triangular. Substituting into the least squares problem yields \min_x \|QRx - b\|_2 = \min_x \|Rx - Q^T b\|_2, since multiplication by an orthogonal matrix preserves the Euclidean norm. The solution is then obtained by solving the triangular system Rx = Q^T b via back-substitution, avoiding explicit computation of A^T A or its inverse. This approach is computed using Householder reflections or Givens rotations, with a computational cost of O(mn^2) flops for dense matrices. QR is particularly stable for full-rank problems, as the backward error is small and independent of conditioning.
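A minimal sketch of this QR route in Python, assuming synthetic data and using SciPy's triangular solver for the back-substitution step:

```python
import numpy as np
from scipy.linalg import solve_triangular

# Minimal sketch of the QR route: factor A = QR (thin QR), then solve the
# upper-triangular system R x = Q^T b by back-substitution. Data are illustrative.
rng = np.random.default_rng(3)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)

Q, R = np.linalg.qr(A)            # thin QR: Q is 10x3 with orthonormal columns
x = solve_triangular(R, Q.T @ b)  # exploits the triangular structure of R
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
```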

Singular Value Decomposition (SVD)

For more general cases, including rank-deficient or ill-conditioned matrices, the singular value decomposition A = U \Sigma V^T is used, where U and V are orthogonal, and \Sigma is diagonal with non-negative singular values \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0. The pseudoinverse solution is \hat{x} = V \Sigma^+ U^T b, where \Sigma^+ inverts the non-zero singular values (setting \sigma_i^{-1} = 0 for zero or tiny \sigma_i). This handles rank deficiency by projecting onto the row space of A and provides a minimum-norm solution among those minimizing the residual. The SVD costs O(\min(mn^2, m^2 n)) flops but is backward stable, with forward errors bounded by the condition number times the machine precision. It is the method of choice for revealing numerical rank via a threshold on the singular values.
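The following sketch implements the pseudoinverse solution via the SVD, deliberately constructing a rank-deficient matrix; the tolerance used for the rank decision is a common heuristic, not the only possible choice:

```python
import numpy as np

# Minimal sketch of the SVD / pseudoinverse route, which also handles
# rank-deficient A: zero out the reciprocals of tiny singular values.
rng = np.random.default_rng(4)
A = rng.normal(size=(10, 3))
A[:, 2] = A[:, 0] + A[:, 1]               # make A rank-deficient on purpose
b = rng.normal(size=10)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]   # rank-revealing threshold (heuristic)
s_inv = np.where(s > tol, 1.0 / s, 0.0)           # pseudo-invert only significant values
x = Vt.T @ (s_inv * (U.T @ b))                    # minimum-norm least-squares solution
print(np.allclose(x, np.linalg.pinv(A) @ b))
```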

Iterative Methods

For large-scale sparse systems where direct methods are prohibitive, iterative techniques like the conjugate gradient method applied to the normal equations (CGLS) approximate the solution without forming A^T A explicitly. CGLS minimizes the residual in the least-squares sense over Krylov subspaces generated by matrix-vector products with A and A^T, converging in at most n iterations in exact arithmetic but typically faster for well-conditioned problems. Preconditioning can accelerate convergence, though it requires monitoring residuals to avoid stagnation in ill-conditioned cases. These methods scale to millions of variables when sparsity is exploited, with per-iteration cost dominated by the products with A and A^T, i.e., proportional to the number of nonzero entries (O(mn) in the dense case).
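A minimal CGLS sketch, assuming a dense random matrix stands in for a large sparse operator; a production implementation would use sparse matrix-vector products and a preconditioner:

```python
import numpy as np

# Minimal CGLS sketch: conjugate gradient on the normal equations without
# forming A^T A explicitly. The dense random A is an illustrative stand-in.
def cgls(A, b, iters=50, tol=1e-10):
    x = np.zeros(A.shape[1])
    r = b - A @ x                 # residual in data space
    s = A.T @ r                   # normal-equation residual (negative gradient direction)
    p = s.copy()
    gamma = s @ s
    for _ in range(iters):
        q = A @ p
        alpha = gamma / (q @ q)
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = s @ s
        if np.sqrt(gamma_new) < tol:
            break
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

rng = np.random.default_rng(5)
A = rng.normal(size=(200, 20))
b = rng.normal(size=200)
print(np.allclose(cgls(A, b), np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-6))
```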

Comparison of Methods

QR decomposition offers excellent stability for well-conditioned, full-rank problems, with forward error bounded by \kappa(A) \cdot \epsilon_{\text{mach}}, where \kappa(A) = \sigma_{\max}/\sigma_{\min} is the condition number and \epsilon_{\text{mach}} is the machine precision; it is also faster than the SVD for dense matrices. The SVD excels in ill-conditioned or rank-deficient scenarios, providing error estimates via the singular values and handling \kappa(A) \gg 1 without squaring the condition number, though at higher cost. Iterative methods like conjugate gradient are essential for sparse, large systems but may lose accuracy if \kappa(A) is extreme without preconditioning, unlike the robust direct factorizations.

Properties and Analysis

Statistical Properties

Under the assumptions of the classical linear regression model, where the response vector \mathbf{y} satisfies \mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon} with E[\boldsymbol{\epsilon}] = \mathbf{0} and \text{Var}(\boldsymbol{\epsilon}) = \sigma^2 I_m, the ordinary least squares (OLS) estimator \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y} exhibits desirable statistical properties. The Gauss-Markov theorem establishes that \hat{\boldsymbol{\beta}} is the best linear unbiased estimator (BLUE), meaning it has the minimum variance among all linear unbiased estimators of \boldsymbol{\beta}, provided the errors are homoscedastic (constant variance \sigma^2) and uncorrelated (no autocorrelation). This theorem holds without requiring normality of the errors, relying solely on linearity, unbiasedness, and the specified error covariance structure. The variance-covariance matrix of the OLS estimator is given by \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^T X)^{-1}, which quantifies the precision of the estimates and depends on the error variance and the information content in the design matrix X. Since \sigma^2 is typically unknown, it is estimated using the residual sum of squares: \hat{\sigma}^2 = \frac{1}{m - n} \sum_{i=1}^m \hat{\epsilon}_i^2 = \frac{\text{RSS}}{m - n}, where m is the number of observations, n is the number of parameters, \hat{\epsilon}_i are the residuals, and RSS is the residual sum of squares; this estimator is unbiased under the model assumptions. The standard errors of individual coefficients, \text{se}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \cdot c_{jj}}, where c_{jj} is the j-th diagonal element of (X^T X)^{-1}, then enable inference. For large samples, under additional assumptions such as i.i.d. errors with finite variance and a design matrix satisfying a full-rank condition asymptotically, the OLS estimator is consistent, meaning \hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta} as m \to \infty. Furthermore, it is asymptotically normal: \sqrt{m} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \text{AsyVar}(\hat{\boldsymbol{\beta}})), where the asymptotic variance accounts for error heteroscedasticity if present, facilitating large-sample hypothesis tests and confidence intervals. In finite samples, assuming normally distributed errors, confidence intervals for each coefficient take the form \hat{\beta}_j \pm t_{m-n, 1-\alpha/2} \cdot \text{se}(\hat{\beta}_j), where t_{m-n, 1-\alpha/2} is the critical value from the t-distribution with m - n degrees of freedom; this provides a 100(1-\alpha)\% interval covering the true \beta_j with the stated probability.
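The following sketch illustrates these finite-sample formulas on simulated data (the true intercept 1.0, slope 2.0, and noise standard deviation 0.5 are assumptions chosen for the example), computing \hat{\sigma}^2, the coefficient standard errors, and 95% t-intervals:

```python
import numpy as np
from scipy import stats

# Minimal sketch of the finite-sample inference formulas: fit OLS, estimate
# sigma^2 from the RSS, and form 95% t-intervals per coefficient.
rng = np.random.default_rng(6)
m = 50
x = rng.uniform(0, 10, size=m)
X = np.column_stack([np.ones(m), x])               # design matrix with intercept column
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=m)  # simulated data (assumed truth)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
n_params = X.shape[1]
sigma2_hat = resid @ resid / (m - n_params)        # unbiased estimate of sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)     # Var(beta_hat) = sigma^2 (X^T X)^{-1}
se = np.sqrt(np.diag(cov_beta))
t_crit = stats.t.ppf(0.975, df=m - n_params)
for j, (b, s) in enumerate(zip(beta_hat, se)):
    print(f"beta_{j}: {b:.3f} ± {t_crit * s:.3f}")
```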

Geometric and Algebraic Properties

The orthogonality principle in linear least squares states that the residual vector \mathbf{r} = \mathbf{b} - A \hat{\mathbf{x}} is orthogonal to the column space of the matrix A, meaning that every column of A is orthogonal to \mathbf{r}. This property arises from the normal equations A^T (\mathbf{b} - A \hat{\mathbf{x}}) = \mathbf{0}, which ensure that the projection of \mathbf{b} onto the column space of A minimizes the Euclidean distance. Geometrically, this orthogonality implies that the least-squares solution \hat{\mathbf{x}} corresponds to the point in the column space closest to \mathbf{b}, with the residual forming a right angle to the approximating subspace. The partition of the total sum of squares in linear least squares decomposes the squared norm of the observation vector into explained and residual components, reflecting the geometric decomposition of \mathbf{b} into its projection onto the column space and the orthogonal residual. Specifically, \|\mathbf{b}\|^2 = \|\hat{\boldsymbol{\mu}}\|^2 + \|\mathbf{r}\|^2, where \hat{\boldsymbol{\mu}} = A \hat{\mathbf{x}} is the vector of fitted values and the equality follows from the Pythagorean theorem due to the orthogonality of \hat{\boldsymbol{\mu}} and \mathbf{r}. This decomposition quantifies how much of the total variation in \mathbf{b} is captured by the linear model, with the explained sum of squares \|\hat{\boldsymbol{\mu}}\|^2 measuring the fit and the residual sum of squares \|\mathbf{r}\|^2 indicating unexplained variation. The uniqueness of the least-squares solution depends on the rank of A and the structure of its null space. If A has full column rank (rank equal to the number of columns n), the null space of A is trivial (\text{Null}(A) = \{\mathbf{0}\}), and the solution \hat{\mathbf{x}} is unique, given by \hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b}. Conversely, if the rank is deficient (less than n), the null space has dimension greater than zero, leading to infinitely many solutions that differ by vectors in \text{Null}(A), though all yield the same minimum residual norm. This algebraic property ensures that the solution set forms an affine subspace parallel to \text{Null}(A). Sensitivity to perturbations in the data is governed by the condition number of A, which bounds the amplification of errors in the computed solution. The condition number \kappa(A) = \|A\|_2 \|A^+\|_2, where A^+ is the Moore-Penrose pseudoinverse, measures how ill-conditioned the problem is; for perturbations in \mathbf{b}, the relative error satisfies \frac{\|\Delta \hat{\mathbf{x}}\|_2}{\|\hat{\mathbf{x}}\|_2} \leq \kappa(A) \cdot \frac{\|\Delta \mathbf{b}\|_2}{\|\mathbf{b}\|_2} when the residual is small. For perturbations in A, the bound worsens to involve [\kappa(A)]^2 in cases of large residuals, highlighting increased vulnerability near rank deficiency. These error bounds underscore the numerical stability challenges in ill-conditioned least-squares problems.
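These two properties are easy to verify numerically; the sketch below, using arbitrary synthetic data, checks that A^T \mathbf{r} \approx \mathbf{0} and that the Pythagorean decomposition holds to rounding error:

```python
import numpy as np

# Minimal numerical check of the orthogonality and Pythagorean properties:
# A^T r ≈ 0 and ||b||^2 = ||A x_hat||^2 + ||r||^2. Data are illustrative.
rng = np.random.default_rng(7)
A = rng.normal(size=(12, 4))
b = rng.normal(size=12)

x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
fitted = A @ x_hat
r = b - fitted
print(np.allclose(A.T @ r, 0))                        # residual is orthogonal to col(A)
print(np.isclose(b @ b, fitted @ fitted + r @ r))     # Pythagorean decomposition holds
```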

Limitations and Extensions

Key Limitations

One key limitation of linear least squares estimation arises from its sensitivity to outliers in the data. The method minimizes the sum of squared residuals, which disproportionately amplifies the influence of large errors compared to smaller ones, as the square of a large residual grows quadratically while smaller residuals contribute minimally. This can cause the fitted model to deviate significantly from the underlying pattern, pulling parameters toward atypical observations and yielding unreliable estimates even if the majority of the data fits well. For instance, a single outlier can alter the slope and intercept dramatically, as demonstrated in analyses where stray points strongly influence ordinary least squares (OLS) results. Another inherent weakness is multicollinearity among the predictor variables, occurring when columns of the design matrix X exhibit exact or near-linear dependence. This leads to instability in the inverse (X^T X)^{-1}, which forms the basis for parameter estimates, resulting in large variances and highly sensitive coefficient values that fluctuate wildly with minor data changes. High multicollinearity inflates the variance of the estimators, making individual parameter interpretations unreliable despite the overall model fit remaining adequate. The BLUE (best linear unbiased estimator) property of OLS, established by the Gauss-Markov theorem under classical assumptions including homoscedasticity and no autocorrelation in the errors, also fails when these assumptions are violated. Heteroscedasticity, where error variances differ across observations, invalidates the usual variance formulas, leading to inefficient estimates with understated standard errors. Similarly, autocorrelation in residuals, common in time-series data, breaches the independence assumption, causing biased variance estimates and loss of the BLUE status for OLS. In such cases, while OLS remains unbiased, it no longer achieves minimal variance among linear unbiased estimators. Finally, ill-conditioning in the design matrix X poses a severe numerical challenge, characterized by a high condition number \kappa(X) that amplifies small perturbations in the input data or rounding errors into large discrepancies in the solution. For the normal-equations formulation, the effective condition number is approximately \kappa(X)^2, since solutions depend on (X^T X)^{-1}, exacerbating error amplification in floating-point computations. Consider a matrix X with singular values spanning several orders of magnitude; a relative perturbation of 10^{-6} in the right-hand side b can yield relative errors in the parameter vector \hat{\beta} exceeding 10^{-2}, losing multiple digits of accuracy even with numerically stable algorithms. This sensitivity is intrinsic to the problem formulation and particularly acute in overdetermined systems with near-singular X^T X.
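The conditioning issue can be demonstrated with a nearly collinear polynomial design matrix; in the sketch below (the specific Vandermonde construction is an illustrative assumption), the normal-equations solution loses several digits relative to an orthogonalization-based solver:

```python
import numpy as np

# Illustrative sketch of ill-conditioning: a Vandermonde design matrix with
# nearly dependent columns makes the normal-equations route (effective
# condition number roughly kappa(X)^2) less accurate than lstsq.
x = np.linspace(0.0, 1.0, 50)
X = np.vander(x, 10, increasing=True)        # columns 1, x, ..., x^9 are nearly dependent
beta_true = np.ones(10)
y = X @ beta_true                            # exact right-hand side for a known solution

print("cond(X)    :", np.linalg.cond(X))
print("cond(X^T X):", np.linalg.cond(X.T @ X))      # roughly cond(X) squared
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)         # normal equations
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]      # orthogonalization-based
print("normal-equation error:", np.linalg.norm(beta_ne - beta_true))
print("lstsq error          :", np.linalg.norm(beta_ls - beta_true))
```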

Generalizations and Variants

Weighted least squares (WLS) extends the ordinary least squares method to address heteroscedasticity, where the variance of the errors varies across observations. In this approach, the objective is to minimize the weighted sum of squared residuals, given by (y - X\beta)^T W (y - X\beta), where W is a diagonal positive definite matrix with elements inversely proportional to the error variances for each observation. This weighting gives greater influence to observations with smaller variances, improving the efficiency of the estimator. The concept of weighting observations inversely to their estimated variances was investigated for its efficiency in sampling contexts, demonstrating gains over unweighted methods when variances differ substantially. Generalized least squares (GLS) further generalizes WLS to handle both heteroscedasticity and correlated errors, using a full positive definite weight matrix \Omega^{-1} that accounts for the covariance structure of the errors. The minimization problem becomes (y - X\beta)^T \Omega^{-1} (y - X\beta), yielding the estimator \hat{\beta} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y. GLS is the best linear unbiased estimator under these conditions, as established in the foundational formulation of the generalized regression model with non-spherical errors. When errors are uncorrelated but heteroscedastic, GLS reduces to WLS with a diagonal \Omega. Ridge regression addresses multicollinearity in the design matrix X, where ordinary least squares estimates have high variance due to near-linear dependencies among predictors. It modifies the objective by adding an L_2 penalty term \lambda \|\beta\|_2^2, resulting in the minimization of (y - X\beta)^T (y - X\beta) + \lambda \|\beta\|_2^2, with solution \hat{\beta}_\lambda = (X^T X + \lambda I)^{-1} X^T y for \lambda > 0. This bias-variance trade-off shrinks coefficients toward zero, stabilizing estimates while introducing controlled bias, and is particularly effective when the number of predictors approaches or exceeds the sample size. The method was introduced as a practical tool for nonorthogonal problems in applied settings. Other variants include total least squares (TLS), which accounts for errors in both the response and predictor variables, unlike standard least squares, which assumes error-free predictors. TLS minimizes orthogonal perturbations to the data matrix to best fit a linear relation, solving \min \|E\|_F^2 + \|e\|_2^2 subject to (X + E)\beta = y + e, and is particularly useful in orthogonal regression and errors-in-variables models. In contrast, quantile regression estimates conditional quantiles rather than the conditional mean, minimizing a weighted absolute deviation \sum \rho_\tau (y_i - x_i^T \beta), where \rho_\tau(u) = u(\tau - I(u < 0)) and \tau is the quantile level; this provides robustness to outliers and heterogeneous error distributions, in contrast to least squares' focus on squared errors for the mean.
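The following sketch illustrates two of these variants on simulated data: weighted least squares implemented by rescaling rows with the square roots of assumed weights, and ridge regression via its closed form; the weights, penalty \lambda, and data are illustrative assumptions:

```python
import numpy as np

# Minimal sketches of two variants: weighted least squares via row rescaling
# (equivalent to minimizing sum w_i (y_i - x_i^T beta)^2) and ridge regression
# via its closed form (X^T X + lambda I)^{-1} X^T y. All data are illustrative.
rng = np.random.default_rng(8)
m, n = 40, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=m)

# Weighted least squares: weights assumed inversely proportional to error variances.
w = rng.uniform(0.5, 2.0, size=m)                 # illustrative per-observation weights
w_sqrt = np.sqrt(w)
beta_wls = np.linalg.lstsq(X * w_sqrt[:, None], y * w_sqrt, rcond=None)[0]

# Ridge regression: the L2 penalty shrinks coefficients and stabilizes X^T X.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
print(beta_wls, beta_ridge)
```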

Applications

Regression and Data Fitting

Linear least squares serves as the foundational method for regression analysis, enabling the estimation of relationships between variables in observational data to support prediction and inference. In this context, it minimizes the sum of squared differences between observed and predicted values, providing unbiased estimates under assumptions of linearity, independence, and homoscedasticity of the errors. This approach is particularly valuable for identifying trends, such as forecasting sales from advertising spend or assessing the impact of education on income, where the goal is to draw reliable conclusions about population parameters from sample data. Beyond prediction, linear least squares facilitates data smoothing and interpolation by fitting linear or polynomial models to noisy measurements, effectively reducing random fluctuations while preserving underlying patterns. For instance, in time-series analysis, it can smooth erratic sensor readings to reveal true signals, or interpolate missing values in datasets by estimating intermediate points along a best-fit line or plane. Techniques like the Savitzky-Golay filter extend this by applying local least squares polynomial fits for enhanced smoothing without excessive distortion of peaks and valleys. Implementation of linear least squares for regression is streamlined through widely used software libraries. In Python, NumPy's linalg.lstsq function computes the least-squares solution to overdetermined systems, accepting matrices for design and response variables to yield coefficients and residuals. Similarly, R's lm() function fits linear models via ordinary least squares, supporting formula-based specification for multiple predictors and providing diagnostic outputs like confidence intervals. In machine learning, linear least squares underpins supervised linear models, such as scikit-learn's LinearRegression class for regression tasks in predictive pipelines and RidgeClassifier for classification, establishing it as a baseline for more complex algorithms and offering interpretability and computational efficiency in high-dimensional settings.
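As a usage sketch of the NumPy routine mentioned above, the following fits a trend line to a simulated noisy series (the series itself is an assumption chosen for illustration):

```python
import numpy as np

# Minimal sketch of numpy.linalg.lstsq: fit a trend line to noisy measurements.
rng = np.random.default_rng(9)
t = np.arange(30, dtype=float)
signal = 0.4 * t + 2.0 + rng.normal(scale=1.5, size=t.size)   # linear trend plus noise

A = np.column_stack([np.ones_like(t), t])          # design matrix: intercept and time
coef, residuals, rank, sv = np.linalg.lstsq(A, signal, rcond=None)
trend = A @ coef                                   # smoothed/fitted values
print("estimated intercept and slope:", coef)
```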

Engineering and Sciences

In signal processing, linear least squares methods are employed for tasks such as linear prediction, where filter coefficients are determined to forecast future signal values by minimizing errors based on past observations. Smoothing applications use least squares to reduce noise in signals, such as electrocardiograms, by balancing fidelity to the original data with smoothness constraints through regularization parameters. Deconvolution leverages these techniques to recover an input signal from a known output and system response, incorporating regularization to stabilize solutions against noise and ill-conditioned matrices. In control systems, least squares facilitates parameter estimation for system identification by fitting models to observed data, capturing system dynamics under conditions of noisy or correlated measurements. This approach enables the derivation of optimal filters and numerical solutions for linear systems, supporting estimation of model parameters essential for controller design. Geodesy and surveying rely on least-squares adjustment to process redundant measurements, minimizing squared residuals for a best-fit model of network configurations. In GPS positioning, this method linearizes pseudorange observations from multiple satellites to estimate receiver coordinates and clock bias, using a design matrix based on satellite geometry to iteratively refine solutions, with at least four satellites required for robustness. Total least squares variants extend this by accounting for errors in both observations and parameters, improving accuracy in coordinate transformations and deformation analysis, such as reducing residuals to centimeters in high-precision surveys. In physics, linear least squares fits experimental data to straight-line models, such as calibration curves relating instrument response to physical quantities like concentration or temperature, by minimizing squared residuals between observed and predicted values. This quantifies uncertainties through the scatter of deviations about the line, enabling reliable predictions and confidence intervals for parameters like the slope and intercept in spectroscopic or measurement experiments.

Illustrative Examples

Straight Line Fitting

A concrete illustration of linear least squares applied to straight-line fitting involves a simple dataset consisting of five paired observations: (x_i, y_i) = (0, 1), (1, 3), (2, 2), (3, 5), (4, 7). These points represent measurements where the goal is to find the best-fitting line y = β_0 + β_1 x that minimizes the sum of squared vertical residuals. To compute the parameters, form the design matrix X (an n × 2 matrix with n = 5) as follows: X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, and the response vector y = [1, 3, 2, 5, 7]^T. The least squares estimator is given by \hat{β} = (X^T X)^{-1} X^T y, which yields the intercept β_0 = 0.8 and slope β_1 = 1.4. This solution satisfies the normal equations derived from minimizing the residual sum of squares. The fitted line is thus y = 0.8 + 1.4x. The residuals e_i = y_i - \hat{y}_i are 0.2, 0.8, -1.6, 0, and 0.6, with the sum of squared residuals (SSE) equal to 3.6. The coefficient of determination R^2, which measures the proportion of variance in y explained by the model, is calculated as R^2 = 1 - (SSE / SST), where SST is the total sum of squares (23.2); thus, R^2 ≈ 0.845. In a visualization, the five data points would appear scattered with a positive trend, and the fitted line would pass closely through them, overpredicting at x = 2 and underpredicting slightly at the other points, demonstrating the method's ability to capture the underlying linear relationship despite noise.
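The arithmetic of this example can be reproduced with a few lines of NumPy, as in the sketch below:

```python
import numpy as np

# Verification sketch of the worked straight-line example above.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 7.0])

X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sse = resid @ resid
sst = np.sum((y - y.mean()) ** 2)
print(beta)                     # ≈ [0.8, 1.4]  (intercept, slope)
print(resid)                    # ≈ [0.2, 0.8, -1.6, 0.0, 0.6]
print(sse, sst, 1 - sse / sst)  # 3.6, 23.2, ≈ 0.845
```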

Curve and Surface Fitting

Polynomial curve fitting extends the principles of linear least squares to model nonlinear relationships, such as parabolas, by expressing the model as a linear combination of basis functions, specifically powers of the independent variable. In the standard formulation, the design matrix X incorporates these powers as columns, transforming the problem into a linear system solvable via the least squares estimator \hat{\beta} = (X^T X)^{-1} X^T y. For a quadratic model y \approx \beta_0 + \beta_1 x + \beta_2 x^2, the matrix X has three columns: one for the constant term, one for x, and one for x^2. A numerical example illustrates quadratic fitting using data on live births per 1,000 women by maternal age group. The dataset consists of seven points: (x, y) = (16, 34), (18.5, 86.5), (22, 111.1), (27, 113.9), (32, 84.5), (37, 35.4), (42, 6.8), where x is age and y is the birth rate. Solving the least squares problem yields coefficients \hat{\beta}_0 \approx -238.5, \hat{\beta}_1 \approx 25.95, \hat{\beta}_2 \approx -0.487, so the fitted parabola is y \approx -238.5 + 25.95x - 0.487x^2. This model captures the rise and fall in birth rates with age, with an R^2 value of 0.869 indicating a strong fit. Surface fitting applies similar techniques in higher dimensions, modeling a surface such as a plane z \approx \beta_0 + \beta_1 x + \beta_2 y through a set of points. The design matrix X now includes columns for the constant, x, and y, and the solution minimizes the vertical distances from the points to the plane. For illustration, consider four points with slight noise: (0, 0, 0.1), (1, 0, 1.2), (0, 1, 1.0), (1, 1, 2.1). The normal equations yield \hat{\beta}_0 = 0.1, \hat{\beta}_1 = 1.1, \hat{\beta}_2 = 0.9, fitting the plane z \approx 0.1 + 1.1x + 0.9y. This approximates the underlying relation z = x + y despite the perturbations. The basis expansion for polynomials relies on the Vandermonde matrix, where each row corresponds to a data point and columns to increasing powers of x_i, enabling efficient computation of coefficients for higher-degree curves while preserving the linearity of the least-squares framework.
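The quadratic fit above can be reproduced with NumPy's Vandermonde helper and least-squares solver, as in the following sketch:

```python
import numpy as np

# Verification sketch of the quadratic birth-rate fit using a Vandermonde
# design matrix (columns 1, x, x^2) and numpy's least-squares solver.
age = np.array([16.0, 18.5, 22.0, 27.0, 32.0, 37.0, 42.0])
rate = np.array([34.0, 86.5, 111.1, 113.9, 84.5, 35.4, 6.8])

X = np.vander(age, 3, increasing=True)        # columns: 1, x, x^2
beta, *_ = np.linalg.lstsq(X, rate, rcond=None)
fitted = X @ beta
r2 = 1 - np.sum((rate - fitted) ** 2) / np.sum((rate - rate.mean()) ** 2)
print(beta)   # ≈ [-238.5, 25.95, -0.487]
print(r2)     # ≈ 0.869
```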