Least squares
The least squares method is a fundamental mathematical technique used to find the best approximate solution to an overdetermined system of linear equations, Ax = b, where the matrix A has more rows than columns, by minimizing the sum of the squares of the residuals, \| Ax - b \|^2.[1] This approach, also known as ordinary least squares (OLS) in its basic form, determines the vector \hat{x} that produces the closest fit in the Euclidean norm sense, making it ideal for problems where exact solutions do not exist due to noise or inconsistencies in data.[2] Developed by the German mathematician Carl Friedrich Gauss around 1795–1801, initially for predicting the orbit of the asteroid Ceres, the method was first publicly described by the French mathematician Adrien-Marie Legendre in 1805 as an algebraic procedure for handling observational errors in astronomy.[3] Gauss later formalized its probabilistic justification in 1809, assuming normally distributed errors, which laid the groundwork for modern statistical inference.[3]

The technique's normal equations, A^T A \hat{x} = A^T b, provide a closed-form solution when A has full column rank, ensuring a unique minimizer.[1] Least squares has broad applications across disciplines, including linear regression in statistics for modeling relationships between variables, curve fitting in engineering for approximating experimental data, and signal processing for tasks like linear prediction and smoothing.[4] Extensions such as weighted least squares account for varying error variances, while nonlinear variants handle non-linear models, making it indispensable in fields like econometrics, physics, and machine learning.[5]

Introduction and Fundamentals
Problem Statement
The least squares problem seeks to find the parameters of a model that minimize the sum of the squared residuals, defined as the differences between observed data points and the corresponding model predictions. This approach is widely used in regression and fitting tasks to determine the best approximate solution when exact matches are not possible.[6] Geometrically, the least squares solution in the linear setting corresponds to the orthogonal projection of the observation vector onto the column subspace of the design matrix, ensuring that the residual vector is perpendicular to this subspace and thus minimizing the Euclidean distance.[7] Statistically, when the errors or noise in the observations are assumed to be independent and identically distributed as Gaussian with zero mean and constant variance, the least squares estimator coincides with the maximum likelihood estimator for the model parameters.[8]

In its general linear form, the problem is expressed as \arg \min_x \| Ax - b \|_2^2, where A is the m \times n design matrix encoding the model structure, x is the n \times 1 vector of parameters to estimate, and b is the m \times 1 vector of observations.[2] The nature of the system depends on the relationship between m and n: it is overdetermined if m > n (more observations than parameters, typically yielding no exact solution but an optimal approximation via least squares), underdetermined if m < n (fewer observations than parameters, often requiring additional constraints like minimum norm to select among infinitely many solutions), or square if m = n (potentially solvable exactly if consistent).[9]
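As a minimal illustration of this formulation, the following NumPy sketch (with a small, arbitrary 4 \times 2 system) solves \arg \min_x \| Ax - b \|_2^2 and verifies the geometric property that the residual is orthogonal to the columns of A:

```python
import numpy as np

# A small overdetermined system: 4 equations, 2 unknowns (made-up numbers).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.1, 1.9, 3.2, 3.9])

# Least squares solution of arg min_x ||Ax - b||_2^2.
x_hat, residual_ss, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)

# The residual b - A x_hat is orthogonal to the column space of A,
# so A^T r should be (numerically) zero.
r = b - A @ x_hat
print("x_hat =", x_hat)
print("A^T r =", A.T @ r)   # approximately [0, 0]
```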
Basic Principles and Assumptions
The least squares method relies on several key assumptions to ensure the validity and reliability of its estimates, particularly in the linear case. The primary assumption is linearity in the parameters, meaning the model expresses the dependent variable as a linear combination of the independent variables plus an error term.[10] Additionally, the errors must be independent of each other and of the explanatory variables, homoscedastic (with constant variance across all levels of the independent variables), and normally distributed to support statistical inference.[11][12] Violations of these assumptions can compromise the method's performance, though weighted variants exist to address issues like heteroscedasticity.[13]

One major advantage of the least squares approach is its computational efficiency, as it often admits closed-form solutions that avoid iterative procedures, making it suitable for large datasets.[14] Under the assumptions of the Gauss-Markov theorem (including linearity, independence, and homoscedasticity), the ordinary least squares estimator is the best linear unbiased estimator (BLUE), achieving minimum variance among all linear unbiased estimators.[15] With the added assumption of normally distributed errors, it coincides with the maximum likelihood estimator, providing statistical optimality for parameter inference.[16] Furthermore, the residuals, the differences between observed and predicted values, offer interpretability, allowing direct assessment of model fit and error patterns.[17]

Despite these strengths, the method has notable limitations. It is highly sensitive to outliers, which can disproportionately influence the estimates due to the squaring of residuals, leading to skewed results.[18] Violations of core assumptions, such as heteroscedasticity or dependence, can result in biased or inefficient estimators, undermining the BLUE property.[19] In high-dimensional settings, where the number of parameters approaches or exceeds the number of observations, least squares problems become ill-posed, often requiring regularization to stabilize solutions.[20]

Residuals play a central role in model diagnostics for least squares, enabling the detection of assumption violations and model inadequacies. By plotting residuals against fitted values or predictors, analysts can identify patterns indicating nonlinearity, heteroscedasticity, or non-independence, such as funnel shapes or trends.[21] Normality tests on residuals, like Q-Q plots or Shapiro-Wilk tests, further validate the distributional assumption, while outlier detection leverages standardized residuals to flag influential points.[22] These diagnostics guide refinements, ensuring the model's robustness before application.[23]
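The standardized-residual diagnostics mentioned above can be sketched briefly (NumPy, on simulated data; the cutoff of 2 for standardized residuals is a common rule of thumb rather than a fixed standard):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)   # simulated data

# Ordinary least squares fit of y on [1, x].
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Hat-matrix diagonal (leverages) and internally standardized residuals.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
n, p = X.shape
s2 = resid @ resid / (n - p)                              # residual variance estimate
standardized = resid / np.sqrt(s2 * (1.0 - leverage))

# Flag candidate outliers; patterns in resid vs. fitted values would indicate
# nonlinearity or heteroscedasticity.
print("possible outliers:", np.where(np.abs(standardized) > 2)[0])
```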
Historical Development
Early Contributions
The method of least squares emerged in the early 19th century as a practical algebraic tool for reconciling imprecise observational data, particularly in astronomy. Adrien-Marie Legendre introduced it formally in 1805 through an appendix to his work Nouvelles méthodes pour la détermination des orbites des comètes, where he described the technique for fitting parabolic orbits to comet observations by minimizing the sum of squared residuals between predicted and measured positions.[24] This approach addressed the challenge of multiple observations containing small errors, offering a deterministic way to derive optimal parameters without assuming underlying error distributions.[25]

Carl Friedrich Gauss independently conceived the principle earlier, applying it privately as early as 1795 and notably to compute the orbit of the asteroid Ceres from fragmentary 1801 sightings, which allowed successful predictions of its position.[26] Gauss detailed the method in his 1809 publication Theoria motus corporum coelestium in sectionibus conicis solem ambientium, claiming prior use dating back to 1795 and framing it as a general principle for combining observations to minimize error effects in celestial mechanics.[27] His exposition emphasized its utility for orbit determination under measurement uncertainties, though it ignited a lasting priority debate with Legendre, as Gauss asserted his unpublished precedence.[28]

Initial motivations centered on error reduction in fields demanding high precision, such as astronomy for planetary and comet trajectory calculations and geodesy for survey network adjustments.[25] In astronomy, it enabled robust fits to sparse, noisy data from telescopes, as seen in Legendre's comet work and Gauss's Ceres solution.[24][26] Geodetic applications followed soon after, with the method adapted for reconciling triangulation measurements in large-scale mapping projects.[29] Pierre-Simon Laplace, a prominent figure in the French Academy of Sciences, endorsed the technique in his contemporaneous analyses of planetary perturbations, recognizing its effectiveness for observational synthesis before integrating probabilistic rationales.[28]

Formulation and Advancements
In the early 19th century, Carl Friedrich Gauss formalized the least squares method through his derivation of the normal equations, providing a systematic approach to minimizing errors in astronomical observations. In his 1809 work, Theoria motus corporum coelestium, Gauss assumed errors followed a normal distribution and derived the normal equations by maximizing the likelihood of the observations, leading to the solution that minimizes the sum of squared residuals. This derivation established the least squares estimator as equivalent to the maximum likelihood estimator under Gaussian errors. Later, beginning in 1821 with his Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Gauss refined this justification without relying on normality, demonstrating that the least squares solution yields the minimum variance among unbiased linear estimators of the parameters.

Building on this foundation in the 1880s, Francis Galton extended least squares to the study of heredity, introducing the concepts of regression and correlation. In his 1885 paper "Regression Towards Mediocrity in Hereditary Stature," Galton analyzed parent-child height data, observing that offspring of extreme parents tended to regress toward the population mean, and fitted lines to grouped data using a precursor to least squares by minimizing squared deviations. He quantified this phenomenon with a regression coefficient, linking it to the correlation between variables, and demonstrated that the slope depended on the ratio of standard deviations and the correlation measure. Galton's work transformed least squares from a purely computational tool into a means for exploring bivariate relationships in natural sciences.

In the early 20th century, Karl Pearson extended these ideas to multivariate settings, enabling the fitting of lines and planes to multiple variables. In his 1901 paper "On Lines and Planes of Closest Fit to Systems of Points in Space," Pearson derived methods for fitting lines and planes by minimizing the sum of squared perpendicular distances from the points to the fitted subspace (an orthogonal variant of least squares), extending Galton's bivariate approach to higher dimensions. He formalized the product-moment correlation and regression coefficients for multiple predictors, providing equations for estimating parameters in systems where outcomes depend on several factors, such as ancestral influences in heredity. This work laid the groundwork for modern multivariate analysis, emphasizing the geometric interpretation of least squares in multi-dimensional space.[30]

Ronald Fisher integrated least squares with probability theory in the 1920s, rigorously establishing its optimality properties under statistical assumptions. In his 1922 paper "On the Mathematical Foundations of Theoretical Statistics," Fisher synthesized earlier developments and placed least squares within a modern inferential framework, in which, under linearity, unbiasedness, and homoscedasticity, the ordinary least squares estimator is the best linear unbiased estimator (BLUE), possessing minimum variance among all linear unbiased estimators (the Gauss-Markov theorem). This integration elevated least squares from an ad hoc fitting method to a cornerstone of statistical inference, particularly in experimental design and analysis of variance.

A key milestone in the mid-20th century was the development of matrix formulations, which streamlined computations for large-scale problems.
In 1936, Alexander Craig Aitken introduced a compact matrix notation for generalized least squares in "On Least Squares and Linear Combination of Observations," representing the model as \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} with covariance matrix \mathbf{V}, yielding the solution \hat{\boldsymbol{\beta}} = (\mathbf{X}' \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}' \mathbf{V}^{-1} \mathbf{y}. This formulation facilitated efficient handling of correlated errors and weighted observations, influencing subsequent computational advancements in statistics and econometrics.[31]

Linear Least Squares
Mathematical Formulation
In linear least squares, the underlying model relates observed response values \mathbf{y} \in \mathbb{R}^m to a set of predictor variables through a linear function of the parameter vector \boldsymbol{\beta} \in \mathbb{R}^n, expressed as \mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}, where X \in \mathbb{R}^{m \times n} is the design matrix (with m \geq n), and \boldsymbol{\epsilon} denotes the error term, typically assumed to follow a multivariate normal distribution with mean zero and covariance matrix \sigma^2 I (homoscedasticity and independence).[2] Key assumptions include linearity in parameters, no perfect multicollinearity among predictors, and errors with constant variance and zero mean.[32]

The objective is to estimate \boldsymbol{\beta} by minimizing the sum of squared residuals, S(\boldsymbol{\beta}) = \sum_{i=1}^m \left( y_i - \mathbf{x}_i^T \boldsymbol{\beta} \right)^2 = \| \mathbf{y} - X \boldsymbol{\beta} \|^2_2, where the residual vector is \mathbf{r}(\boldsymbol{\beta}) = \mathbf{y} - X \boldsymbol{\beta}.[2] This leads to the normal equations X^T X \boldsymbol{\beta} = X^T \mathbf{y}, which provide a closed-form solution \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y} when X has full column rank.[33] Unlike nonlinear least squares, the linear problem is convex with a unique global minimum under the full rank assumption, allowing direct analytical solutions without iterative methods.[2]
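The closed-form estimator can be illustrated with a short NumPy sketch on simulated data; solving the normal equations directly is shown here only for exposition, since the orthogonal factorizations of the next subsection are preferred numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 3
X = rng.normal(size=(m, n))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=m)

# Normal equations: solve (X^T X) beta = X^T y (avoid forming an explicit inverse).
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution computed by NumPy's least squares routine.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_ne, beta_ls))   # True for this well-conditioned problem
```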
Analytical Solutions
Analytical solutions to the linear least squares problem, which seeks to minimize \| y - X \beta \|^2_2 for a response vector y \in \mathbb{R}^m and design matrix X \in \mathbb{R}^{m \times n} with m \geq n, rely on direct matrix factorizations that exploit the structure of the normal equations X^T X \beta = X^T y.[34] These methods provide exact solutions when X has full column rank and handle ill-conditioned or rank-deficient cases through appropriate decompositions.[35]

One primary direct method is the QR decomposition, which factors X = Q R where Q \in \mathbb{R}^{m \times n} has orthonormal columns and R \in \mathbb{R}^{n \times n} is upper triangular.[34] Substituting into the normal equations yields R^T R \beta = R^T Q^T y, so \beta = R^{-1} Q^T y, computed in practice by back substitution on the triangular system R \beta = Q^T y.[34] This approach is numerically stable due to the orthogonality of Q, avoiding the squaring of condition numbers that occurs in the normal equations.[36]

For cases where X^T X is positive definite (i.e., X has full column rank), the Cholesky factorization provides an efficient alternative by decomposing X^T X = L L^T with L lower triangular.[37] The solution is then obtained by solving L z = X^T y for z via forward substitution, followed by back substitution in L^T \beta = z.[37] While computationally cheaper than QR (requiring approximately half the operations for dense matrices), this method is less stable for ill-conditioned problems, as errors in forming X^T X are amplified.[37]

In rank-deficient scenarios, where the columns of X are linearly dependent, the singular value decomposition (SVD) offers a robust solution by factoring X = U \Sigma V^T, with U \in \mathbb{R}^{m \times m} and V \in \mathbb{R}^{n \times n} orthogonal, and \Sigma diagonal containing the singular values.[34] The minimum-norm least squares solution is \beta = V \Sigma^+ U^T y, where \Sigma^+ is the pseudoinverse of \Sigma, formed by inverting the non-zero singular values and leaving the zero singular values at zero.[34] This method automatically handles underdetermined systems and provides the unique solution of minimum Euclidean norm among all minimizers.[35]

Geometrically, the least squares solution \hat{y} = X \beta represents the orthogonal projection of y onto the column space of X, denoted \mathcal{C}(X).[38] The projection matrix, known as the hat matrix, is H = X (X^T X)^{-1} X^T for full rank X, satisfying H^2 = H and H^T = H, which projects any vector onto \mathcal{C}(X).[39] Thus, \hat{y} = H y and the residuals are (I - H) y, orthogonal to \mathcal{C}(X).[39]

These direct methods generally incur a computational complexity of O(n^3) for an n \times n system in the dense case, though QR and Cholesky scale as O(m n^2) for m \gg n, with SVD being more expensive at O(m n^2 + n^3).[40] Stability trade-offs favor QR and SVD over Cholesky for practical implementations, particularly when conditioning is poor.[36]
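The three factorizations can be compared directly, as in the following NumPy/SciPy sketch on simulated data; all three routes agree for a well-conditioned, full-rank X:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
m, n = 200, 4
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# QR: X = Q R with orthonormal Q; solve R beta = Q^T y by back substitution.
Q, R = np.linalg.qr(X)                       # "reduced" QR by default
beta_qr = solve_triangular(R, Q.T @ y, lower=False)

# Cholesky on the normal equations: X^T X = L L^T, then two triangular solves.
L = np.linalg.cholesky(X.T @ X)
z = solve_triangular(L, X.T @ y, lower=True)
beta_chol = solve_triangular(L.T, z, lower=False)

# SVD: X = U Sigma V^T; beta = V Sigma^+ U^T y (minimum-norm solution if rank-deficient).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)

print(np.allclose(beta_qr, beta_chol), np.allclose(beta_qr, beta_svd))
```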
Nonlinear Least Squares
Mathematical Formulation
In nonlinear least squares, the underlying model relates observed response values \mathbf{y} \in \mathbb{R}^m to a set of predictor variables \mathbf{x} \in \mathbb{R}^p through a nonlinear function of the parameter vector \boldsymbol{\beta} \in \mathbb{R}^n, expressed as \mathbf{y} = \mathbf{f}(\mathbf{x}, \boldsymbol{\beta}) + \boldsymbol{\epsilon}, where \mathbf{f}: \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m is nonlinear in \boldsymbol{\beta}, and \boldsymbol{\epsilon} denotes the error term, typically assumed to follow a multivariate normal distribution with mean zero and positive definite covariance matrix \boldsymbol{\Sigma}.[41] The objective is to estimate \boldsymbol{\beta} by minimizing the sum of squared residuals, S(\boldsymbol{\beta}) = \sum_{i=1}^m \left[ y_i - f(x_i, \boldsymbol{\beta}) \right]^2 = \| \mathbf{r}(\boldsymbol{\beta}) \|^2_2, where the residual vector is \mathbf{r}(\boldsymbol{\beta}) = \mathbf{y} - \mathbf{f}(\mathbf{x}, \boldsymbol{\beta}).[41] This formulation extends the linear least squares problem, which is recovered when \mathbf{f} is affine in \boldsymbol{\beta}.

For numerical optimization, the problem is often locally linearized around a current estimate \boldsymbol{\beta}_k using the Jacobian matrix \mathbf{J}(\boldsymbol{\beta}_k) = \frac{\partial \mathbf{r}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \bigg|_{\boldsymbol{\beta} = \boldsymbol{\beta}_k} \in \mathbb{R}^{m \times n}, whose (i,j)-th entry is the partial derivative \frac{\partial r_i}{\partial \beta_j}. Key challenges in this setup stem from the nonlinearity: the objective S(\boldsymbol{\beta}) is generally non-convex, leading to multiple local minima and the absence of closed-form solutions, which complicates convergence and requires robust starting values to avoid poor local optima.[41]

Iterative Solution Methods
Iterative solution methods for nonlinear least squares problems approximate the solution through successive linearizations of the nonlinear model, leveraging the Jacobian matrix to update parameter estimates. These methods are essential when analytical solutions are unavailable, as they iteratively minimize the sum of squared residuals by solving local linear least squares subproblems. The Jacobian, which captures the first-order sensitivities of the residuals with respect to the parameters, plays a central role in these approximations.[42]

The Gauss-Newton method is a foundational iterative algorithm that linearizes the residual vector around the current parameter estimate \beta_k and solves the resulting linear least squares problem to obtain the next iterate \beta_{k+1}. Specifically, it computes the update as \beta_{k+1} = \beta_k - (J^T J)^{-1} J^T r, where J is the Jacobian matrix evaluated at \beta_k and r is the vector of residuals at \beta_k. This approach approximates the Hessian of the objective function by J^T J, ignoring second-order terms, which simplifies computation but can lead to instability if J^T J is ill-conditioned. Under suitable conditions, such as when the residuals are small near the solution and the Jacobian has full column rank, the method exhibits local quadratic convergence, meaning the error decreases quadratically close to the optimum.[42]

To address the potential non-convergence or slow progress of the Gauss-Newton method, particularly far from the solution or in ill-conditioned cases, the Levenberg-Marquardt algorithm introduces a damping parameter \lambda > 0 to blend Gauss-Newton steps with gradient descent. The update is given by \beta_{k+1} = \beta_k - (J^T J + \lambda I)^{-1} J^T r, where the added \lambda I term acts as a trust region mechanism, ensuring the step is a descent direction and stabilizing the inversion when J^T J is singular or nearly so. Originally proposed by Levenberg in 1944 as an extension for nonlinear problems and refined by Marquardt in 1963 to improve efficiency, this method adjusts \lambda dynamically: decreasing it for Gauss-Newton-like behavior near the solution and increasing it for safer, smaller steps elsewhere. It achieves local quadratic convergence similar to Gauss-Newton when \lambda \to 0 near the optimum, while offering global convergence guarantees under line search or trust region strategies.[43][44][42]

Convergence of these iterative methods is typically assessed using criteria such as the norm of the gradient of the objective function falling below a tolerance (e.g., \| J^T r \| < \epsilon), the residual norm \| r \| < \delta, or the change in parameters \| \beta_{k+1} - \beta_k \| < \gamma, where \epsilon, \delta, \gamma are user-specified thresholds. These stopping rules ensure the algorithm halts when improvements are negligible, balancing computational cost and accuracy.[42]

The choice of initial guess \beta_0 is critical, as these methods rely on local approximations and may converge to local minima or diverge if starting far from the true solution; techniques like grid search or prior knowledge from linear approximations are often used to select a reasonable \beta_0. Ill-conditioning, arising when the Jacobian columns are nearly linearly dependent, can amplify errors in the update steps, necessitating preconditioning or careful scaling of variables to improve numerical stability.[42]
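A minimal Levenberg-Marquardt sketch for an assumed exponential model y \approx a e^{-bt} is given below (NumPy only; the function names and the damping schedule are illustrative choices, and production code would typically call a library routine such as scipy.optimize.least_squares):

```python
import numpy as np

def residuals(beta, t, y):
    a, b = beta
    return y - a * np.exp(-b * t)                 # r(beta) = y - f(t, beta)

def jacobian(beta, t, y):
    a, b = beta
    e = np.exp(-b * t)
    return np.column_stack([-e, a * t * e])       # columns: dr/da, dr/db

def levenberg_marquardt(beta0, t, y, lam=1e-3, tol=1e-8, max_iter=200):
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        r = residuals(beta, t, y)
        J = jacobian(beta, t, y)
        if np.linalg.norm(J.T @ r) < tol:         # gradient-based stopping rule
            break
        # Damped Gauss-Newton step: solve (J^T J + lam I) step = -J^T r.
        step = np.linalg.solve(J.T @ J + lam * np.eye(beta.size), -J.T @ r)
        if np.sum(residuals(beta + step, t, y) ** 2) < np.sum(r ** 2):
            beta, lam = beta + step, lam * 0.5    # accept: move toward Gauss-Newton
        else:
            lam *= 2.0                            # reject: increase damping
    return beta

# Illustrative use on synthetic data from the model y = a * exp(-b t) + noise.
rng = np.random.default_rng(3)
t = np.linspace(0.0, 4.0, 30)
y = 2.5 * np.exp(-1.3 * t) + 0.01 * rng.normal(size=t.size)
print(levenberg_marquardt([1.0, 1.0], t, y))      # approximately (2.5, 1.3)
```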
Variations and Extensions
Weighted Least Squares
Weighted least squares (WLS) is a generalization of ordinary least squares (OLS) designed to address heteroscedasticity, where the variance of the errors in a linear regression model is not constant across observations. Unlike OLS, which assumes homoscedastic errors with equal variance \sigma^2, WLS assigns higher importance to observations with smaller error variances by incorporating a weight matrix, thereby yielding more efficient parameter estimates under the violated assumption.[45]

In the linear model y = X\beta + \varepsilon, where \varepsilon has mean zero and a diagonal covariance matrix \Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2), the WLS objective is to minimize the weighted sum of squared residuals: (y - X\beta)^T W (y - X\beta), with W a diagonal matrix such that W = \Sigma^{-1}, ensuring the errors are effectively standardized to have unit variance. The solution for the parameter vector \beta is obtained by solving the normal equations, resulting in the estimator \hat{\beta} = (X^T W X)^{-1} X^T W y. This closed-form expression parallels the OLS solution but incorporates the weights to downweight observations with larger variances.[46]

The choice of weights is critical and typically based on the known or estimated error variances, with the i-th diagonal entry w_{ii} = 1/\sigma_i^2, where \sigma_i^2 reflects the variability specific to the i-th observation. In practice, these variances may be estimated from the data, such as by fitting an auxiliary model to the squared residuals or using domain knowledge about the data collection process. For instance, if observations are averages of multiple measurements, \sigma_i^2 might be inversely proportional to the sample size underlying each average.[45]

WLS finds applications in regression settings where error variances vary systematically, such as when they increase with the level of a predictor or fitted value, improving the precision of estimates compared to unadjusted OLS. In survey sampling, WLS is employed to incorporate sampling weights that account for unequal selection probabilities, which induce heteroscedasticity by making variances differ across strata or clusters, thus providing unbiased and efficient estimates of population parameters.[45][47]
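A compact sketch of the weighted estimator \hat{\beta} = (X^T W X)^{-1} X^T W y (NumPy, simulated heteroscedastic data in which the variance model used for the weights is assumed known):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])
sigma = 0.2 * x                               # error standard deviation grows with x (assumed known)
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

# Weights w_i = 1 / sigma_i^2, applied without forming the diagonal matrix W explicitly.
w = 1.0 / sigma**2
XtW = X.T * w                                 # equals X^T W
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

# Ordinary least squares for comparison (still unbiased here, but less efficient).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("WLS:", beta_wls, "OLS:", beta_ols)
```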
Generalized Least Squares
Generalized least squares (GLS) is a statistical method for estimating the parameters of a linear regression model when the error terms are correlated, characterized by a full covariance matrix rather than assuming independence. Formulated by Alexander Aitken in 1936, GLS addresses scenarios where ordinary least squares (OLS) would be inefficient due to the violation of the homoscedasticity and no-autocorrelation assumptions. The approach transforms the model to account for the known or estimated error covariance structure, yielding the best linear unbiased estimator when that covariance structure is known.[31]

Consider the linear model y = X\beta + \varepsilon, where y is an n \times 1 vector of observations, X is an n \times k design matrix, \beta is a k \times 1 parameter vector, and \varepsilon is the error term distributed as \varepsilon \sim N(0, \Sigma), with \Sigma being a positive definite n \times n covariance matrix. The GLS estimator minimizes the weighted quadratic form (y - X\beta)^T \Sigma^{-1} (y - X\beta), which generalizes the OLS objective by incorporating the inverse covariance weighting. The closed-form solution is \hat{\beta}_{GLS} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y. This estimator is unbiased and achieves the minimum variance among linear unbiased estimators when \Sigma is known, as established by the Gauss-Markov theorem in its generalized form.

In practice, \Sigma is rarely known and must be estimated, leading to feasible GLS (FGLS). The procedure typically begins with an initial OLS estimation to obtain residuals, from which an estimate \hat{\Sigma} is derived, often assuming a parametric form such as an autoregressive structure for the errors. These residuals inform an iterative weighting scheme: subsequent GLS steps refine \hat{\beta} and \hat{\Sigma} until convergence, approximating the efficiency of true GLS while remaining computationally tractable. This iterative approach ensures consistency under standard regularity conditions, though it may introduce slight bias in small samples.[48]

GLS finds prominent application in time series analysis, where errors exhibit autocorrelation, such as in economic data with serial dependence; the Cochrane-Orcutt procedure implements FGLS for first-order autoregressive errors by estimating the autocorrelation parameter from OLS residuals and transforming the data accordingly. In spatial econometrics, GLS accommodates spatially correlated errors in models of geographic or regional data, enhancing parameter efficiency by modeling dependence through contiguity matrices or distance-based covariances.[49]
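A sketch of GLS computed by whitening (NumPy; the AR(1)-style covariance used for \Sigma is an assumed example, not prescribed by the text): factoring \Sigma = C C^T by Cholesky and premultiplying the model by C^{-1} reduces GLS to ordinary least squares on transformed data.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.5])

# Assumed AR(1)-type error covariance Sigma_ij = rho^|i-j| (illustrative choice).
rho = 0.7
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
C = np.linalg.cholesky(Sigma)
y = X @ beta_true + C @ rng.normal(size=n)    # errors with covariance Sigma

# GLS estimator (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y, computed by whitening:
# solve triangular systems with C instead of forming Sigma^{-1} explicitly.
Xw = np.linalg.solve(C, X)
yw = np.linalg.solve(C, y)
beta_gls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print("GLS:", beta_gls)
```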
Regularization Methods
Tikhonov Regularization
Tikhonov regularization, also known as ridge regression in statistical contexts, addresses the instability arising in ill-conditioned linear least squares problems by incorporating a penalty term that stabilizes the solution. The method modifies the standard least squares objective to balance fidelity to the data with a constraint on the magnitude of the solution parameters. The regularized objective is to minimize \|X\beta - y\|^2 + \lambda \|L\beta\|^2, where X is the design matrix, y is the response vector, \beta is the parameter vector to estimate, \lambda > 0 is the regularization parameter controlling the penalty strength, and L is a regularization matrix that is often taken as the identity matrix I for standard Tikhonov regularization. This formulation promotes solutions where the parameters \beta remain small in the norm defined by L, thereby mitigating the amplification of noise or errors in the data due to near-singular X^T X.

The closed-form solution to this optimization problem is given by \beta = (X^T X + \lambda L^T L)^{-1} X^T y, which can be computed efficiently when L = I and extends the ordinary least squares solution by adding a scaled identity (or more general) matrix to X^T X, ensuring the matrix is well-conditioned. Using the singular value decomposition X = U \Sigma V^T, the solution for L = I can be written as \beta = \sum_j \frac{\sigma_j \, u_j^T y}{\sigma_j^2 + \lambda} v_j, where the contributions of small singular values \sigma_j are damped to reduce sensitivity to perturbations.

The choice of the regularization parameter \lambda is crucial and can be determined through methods such as generalized cross-validation (GCV), which estimates prediction error without requiring knowledge of the noise level, or the L-curve method, which plots the residual norm against the solution norm to identify a corner balancing bias and variance. GCV minimizes an unbiased estimate of the mean squared error of prediction, while the L-curve visually selects \lambda at the point of maximum curvature, often near the optimal trade-off.

Tikhonov regularization induces shrinkage in the estimated coefficients \beta, biasing them toward zero (or the null space of L) to achieve a favorable bias-variance trade-off, where the introduced bias reduces variance and improves overall prediction accuracy in the presence of multicollinearity or high dimensionality. This shrinkage effect stabilizes the solution without enforcing exact zeros, leading to more reliable inferences compared to unregularized least squares in noisy or ill-posed settings.
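For L = I, the closed-form and SVD-filtered expressions can be checked against each other in a short NumPy sketch on simulated, nearly collinear data; both compute the same \beta for a given \lambda:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 60, 5
base = rng.normal(size=m)
X = np.column_stack([base + 0.01 * rng.normal(size=m) for _ in range(n)])  # nearly collinear columns
y = base + 0.1 * rng.normal(size=m)
lam = 1.0

# Closed form: (X^T X + lambda I)^{-1} X^T y, via a linear solve.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Equivalent SVD form with filter factors sigma_j / (sigma_j^2 + lambda).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s * (U.T @ y)) / (s**2 + lam))

print(np.allclose(beta_ridge, beta_svd))   # True
```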
Lasso Regularization
Lasso regularization, also known as the least absolute shrinkage and selection operator, addresses the challenge of overfitting in high-dimensional linear least squares problems by incorporating an L1 penalty that promotes sparse solutions.[50] The method minimizes the objective function \| \mathbf{X} \boldsymbol{\beta} - \mathbf{y} \|^2_2 + \lambda \| \boldsymbol{\beta} \|_1, where \mathbf{X} is the design matrix, \mathbf{y} is the response vector, \boldsymbol{\beta} is the coefficient vector, \lambda \geq 0 controls the penalty strength, and \| \cdot \|_1 denotes the L1 norm (sum of absolute values).[50]

A key property of Lasso solutions is their sparsity: the L1 penalty drives many coefficients to exactly zero, enabling automatic feature selection, unlike L2 penalties that only shrink coefficients toward zero.[50] This sparsity arises through a soft-thresholding mechanism in the coordinate-wise updates, where each coefficient \beta_j is adjusted as \beta_j = \text{sign}(z_j) \max(|z_j| - \lambda / 2, 0), with z_j being the partial residual for the j-th feature, effectively thresholding small correlations while preserving larger ones.[51] The optimization problem is convex, ensuring that every local minimum is global for any fixed \lambda and guaranteeing efficient solvability via standard convex optimization techniques.[50]

Efficient algorithms for solving Lasso include least angle regression (LARS), which traces the entire regularization path by sequentially adding features with the highest correlation to the residuals, equating to forward stagewise regression under the L1 constraint.[52] Another widely used approach is coordinate descent, which iteratively optimizes one coefficient at a time via soft-thresholding, making it scalable for large datasets and implemented in tools like glmnet.[53]

In applications, Lasso excels in feature selection for high-dimensional data, such as genomics, where it identifies key genetic markers from thousands of single nucleotide polymorphisms in genome-wide association studies.[54] Similarly, in econometrics, it facilitates variable selection in sparse high-dimensional models, improving inference on treatment effects by selecting relevant covariates from large economic datasets.[55]
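A minimal coordinate-descent sketch for the stated objective \| X\beta - y \|_2^2 + \lambda \| \beta \|_1, using the soft-thresholding update above and assuming unit-norm columns (NumPy; practical work would normally rely on a tuned implementation such as glmnet or scikit-learn's Lasso):

```python
import numpy as np

def soft_threshold(z, thresh):
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for ||X b - y||_2^2 + lam * ||b||_1,
    assuming the columns of X have unit Euclidean norm."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Correlation of column j with the partial residual that excludes feature j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            z_j = X[:, j] @ r_j
            beta[j] = soft_threshold(z_j, lam / 2.0)
    return beta

# Illustrative use: sparse ground truth, columns scaled to unit norm as the update assumes.
rng = np.random.default_rng(7)
m, n = 100, 10
X = rng.normal(size=(m, n))
X /= np.linalg.norm(X, axis=0)
beta_true = np.zeros(n)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.01 * rng.normal(size=m)
print(np.round(lasso_cd(X, y, lam=0.1), 3))   # most entries come out exactly zero
```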
Advanced Topics and Connections
Uncertainty Quantification
In the linear least squares model, the uncertainty in the parameter estimates \hat{\beta} is quantified through the variance-covariance matrix, which under the assumptions of homoscedasticity and independence of errors is given by \operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}, where \sigma^2 is the error variance and X is the design matrix.[56] This matrix provides the variances of individual parameters along the diagonal and covariances between pairs of parameters off the diagonal, enabling assessment of the precision and potential correlations among estimates.[56] The error variance \sigma^2 is typically estimated from the residuals of the fitted model.

Confidence intervals for the parameters \hat{\beta}_j are constructed using the standard errors derived from the diagonal elements of the variance-covariance matrix, scaled by the t-distribution: \hat{\beta}_j \pm t_{\alpha/2, n-p} \sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_j)}, where t_{\alpha/2, n-p} is the critical value from the t-distribution with n-p residual degrees of freedom (n observations and p parameters), and \widehat{\operatorname{Var}}(\hat{\beta}_j) is the estimated variance of the j-th parameter.[57] These intervals capture the range within which the true parameters are likely to lie with probability 1 - \alpha, accounting for both estimation variability and the uncertainty in \sigma^2.[57]

For nonlinear least squares or cases where assumptions like normality or homoscedasticity are violated, bootstrap methods provide a robust alternative for uncertainty quantification by resampling the data to approximate the sampling distribution of \hat{\beta}.[58] In the nonparametric bootstrap, synthetic datasets are generated by resampling with replacement from the original data, refitting the model to each, and using the resulting distribution of \hat{\beta}^* to compute percentile-based confidence intervals or standard errors.[58] This approach is particularly useful for complex models, as it does not rely on asymptotic normality and can handle dependencies in the data.[58]

Distinguishing between confidence intervals for parameters and prediction intervals for new observations is essential in least squares applications. Confidence intervals address uncertainty in the estimated mean response at a given predictor value, while prediction intervals incorporate additional variability from the error term to quantify uncertainty in individual future observations, resulting in wider intervals.[59] For example, in linear regression, the prediction interval for a new y at x_0 is \hat{y}_0 \pm t_{\alpha/2, n-p} s \sqrt{1 + x_0^T (X^T X)^{-1} x_0}, where s estimates \sigma, reflecting both parameter estimation error and irreducible noise.[59]
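A short sketch computing the estimated covariance matrix s^2 (X^T X)^{-1}, t-based confidence intervals, and a nonparametric (pairs) bootstrap on simulated data (NumPy and SciPy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.5, -0.8])
y = X @ beta_true + 0.3 * rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                       # unbiased estimate of sigma^2
cov_beta = s2 * np.linalg.inv(X.T @ X)             # estimated Var(beta_hat)
se = np.sqrt(np.diag(cov_beta))

# 95% confidence intervals using the t distribution with n - p degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - p)
lower, upper = beta_hat - t_crit * se, beta_hat + t_crit * se
for j in range(p):
    print(f"beta_{j}: {beta_hat[j]:.3f}  [{lower[j]:.3f}, {upper[j]:.3f}]")

# Nonparametric (pairs) bootstrap standard errors, as a distribution-free alternative.
B = 1000
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
print("bootstrap SEs:", boot.std(axis=0, ddof=1))
```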
Statistical Testing and Hypothesis Evaluation
In least squares regression, statistical testing plays a crucial role in evaluating the significance of model parameters and assessing overall model adequacy, enabling researchers to draw inferences about the underlying relationships in the data. These tests rely on the assumptions of the linear model, such as normality of residuals, to construct test statistics that compare observed data to null hypotheses of no effect or poor fit. By examining individual coefficients and global model performance, hypothesis evaluation helps validate whether the least squares estimates provide meaningful insights beyond random variation.[60]

The t-test for individual coefficients assesses whether a specific regression parameter β_j differs significantly from zero, indicating that the corresponding predictor contributes to explaining the response variable. The test statistic is computed as t = β̂_j / SE(β̂_j), where β̂_j is the least squares estimate of the coefficient and SE(β̂_j) is its standard error, derived from the variance-covariance matrix of the estimates. Under the null hypothesis H_0: β_j = 0, this statistic follows a t-distribution with n - p - 1 degrees of freedom, where n is the sample size and p is the number of predictors; rejection of the null at a chosen significance level (e.g., α = 0.05) suggests the coefficient is statistically significant. This approach allows for hypothesis testing on the importance of each variable while controlling for others in the model.[60]

For evaluating the overall fit of the model, the F-test compares the explained variance to the unexplained variance, testing the joint null hypothesis that all slope coefficients are zero (H_0: β_1 = β_2 = ... = β_p = 0). The test statistic is given by F = \frac{\text{SSR}/p}{\text{SSE}/(n - p - 1)}, where SSR is the sum of squares due to regression, SSE is the sum of squares of the residuals, p is the number of predictors, and n - p - 1 are the residual degrees of freedom. Under the null, F follows an F-distribution with p and n - p - 1 degrees of freedom; a large F-value leads to rejection, implying the model as a whole explains a significant portion of the variability in the response beyond an intercept-only model. This test is particularly useful for confirming whether the inclusion of predictors improves the model over a baseline.[61]

Goodness-of-fit is often quantified using the coefficient of determination, R², which measures the proportion of total variance in the response variable explained by the model: R² = 1 - (SSE / SST), where SST is the total sum of squares. Values range from 0 to 1, with higher values indicating better fit, though R² always increases with added predictors regardless of relevance, potentially leading to overfitting. To address this, the adjusted R² penalizes model complexity by incorporating the number of parameters: adjusted R² = 1 - [(1 - R²)(n - 1)/(n - p - 1)], providing a more reliable metric for comparing models with different numbers of predictors; it can decrease if added variables do not sufficiently improve fit. These metrics guide hypothesis evaluation by highlighting how well the least squares model captures data patterns without excessive parameterization.[62]

Diagnostic tests further support hypothesis evaluation by checking assumptions underlying least squares inference, particularly through residual analysis.
The Durbin-Watson test detects first-order autocorrelation in residuals, which violates the independence assumption and can bias standard errors; the test statistic d ranges from 0 to 4, with d ≈ 2 indicating no autocorrelation, d < 2 suggesting positive autocorrelation, and d > 2 indicating negative autocorrelation. Developed for regression residuals, it uses tabulated critical values to test H_0: ρ = 0 (no autocorrelation) against alternatives, with inconclusive zones requiring further bounds testing. Similarly, the Breusch-Pagan test examines heteroscedasticity, where residual variance changes with fitted values or predictors, undermining homoscedasticity; it involves regressing squared residuals on the independent variables and testing the significance of that auxiliary regression via a chi-squared statistic (LM = n R²_aux ~ χ²_p under H_0: homoscedasticity). If violated, these diagnostics prompt model adjustments, such as weighted least squares, to restore valid inference.[63][64]
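These quantities can be computed directly from a fitted model, as in the following NumPy/SciPy sketch on simulated data; the Durbin-Watson statistic is evaluated from its definition d = \sum_t (e_t - e_{t-1})^2 / \sum_t e_t^2:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, p = 100, 2                                    # p slope predictors plus an intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
df_resid = n - p - 1

# t statistics and two-sided p-values for each coefficient.
s2 = e @ e / df_resid
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
t_stats = beta / se
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=df_resid)

# Overall F test, R^2 and adjusted R^2.
sst = np.sum((y - y.mean()) ** 2)
sse = e @ e
ssr = sst - sse
F = (ssr / p) / (sse / df_resid)
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

# Durbin-Watson statistic for first-order autocorrelation in the residuals.
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(t_stats, p_vals, F, r2, adj_r2, dw)
```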
Relation to Principal Component Analysis
Principal component regression (PCR) integrates least squares estimation with principal component analysis (PCA) to address challenges in multiple linear regression, particularly when predictors exhibit multicollinearity. In PCR, PCA is first applied to the predictor matrix to derive orthogonal principal components that capture the maximum variance in the data. These components, or their scores, then serve as the new predictors in an ordinary least squares (OLS) regression model to estimate the response variable. This approach decorrelates the predictors, stabilizing coefficient estimates and reducing variance inflation caused by high correlations among original variables.[65]

A close relationship exists between OLS applied to principal components and regularized least squares. Performing OLS on a subset of the principal components imposes a form of regularization by discarding the low-variance components entirely, whereas ridge regression shrinks each component continuously by a factor determined by its eigenvalue and the regularization parameter; both strategies reduce variance relative to full OLS. This truncation mitigates overfitting in high-dimensional settings while preserving the least squares objective on the reduced space. Full use of all components recovers the standard OLS solution, but selective inclusion provides a data-driven regularization strategy.[66][67]

Geometrically, both least squares and PCA rely on orthogonal projections, though they optimize different criteria. In least squares regression, the solution minimizes the squared Euclidean distance by projecting the response vector onto the column space of the predictor matrix, yielding residuals orthogonal to that space. PCA, in contrast, identifies an orthogonal basis of directions that successively maximize the projected variance of the data cloud, with projections onto the principal subspace minimizing reconstruction error in the least squares sense for that subspace. This shared projection framework underscores their complementarity in dimensionality reduction, where PCA preprocesses data for subsequent least squares fitting.[68][69]

PCR is particularly advantageous in scenarios involving high-dimensional data with correlated predictors, such as genomics or econometrics, where multicollinearity can render standard OLS unstable or uninterpretable. By regressing on a few leading principal components, PCR reduces model complexity, enhances predictive accuracy, and facilitates interpretation through variance-explained metrics, making it a practical tool for such applications without requiring explicit penalty terms.[65]
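A compact PCR sketch (NumPy, simulated collinear predictors): PCA is obtained from the SVD of the centered design matrix, OLS is run on the leading k component scores, and the coefficients are mapped back to the original variables.

```python
import numpy as np

rng = np.random.default_rng(10)
m, n_feat, k = 120, 6, 2                         # keep k leading components
z = rng.normal(size=(m, 2))
X = z @ rng.normal(size=(2, n_feat)) + 0.05 * rng.normal(size=(m, n_feat))  # collinear predictors
y = z[:, 0] - 2.0 * z[:, 1] + 0.1 * rng.normal(size=m)

# PCA via SVD of the centered design matrix; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:k].T                           # principal component scores (m x k)

# OLS of the centered response on the leading component scores.
yc = y - y.mean()
gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)

# Map back to coefficients on the original (centered) predictors.
beta_pcr = Vt[:k].T @ gamma
print("PCR coefficients:", np.round(beta_pcr, 3))
```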
Links to Measure Theory
In the framework of measure theory, the least squares method can be interpreted as the minimization of the L^2 norm in a Hilbert space equipped with an inner product defined by integration with respect to a probability measure. Specifically, for a random variable Y and a model subspace \mathcal{H} of square-integrable functions, the least squares estimator minimizes \mathbb{E}[(Y - f)^2] over f \in \mathcal{H}, which corresponds to the L^2(\Omega, \mathcal{F}, P) norm where P is the underlying probability measure.[70] This abstraction generalizes the finite-dimensional case to infinite-dimensional settings, such as function spaces over continuous domains.[71]

The solution to this minimization problem is provided by the projection theorem in Hilbert spaces, which states that for a closed subspace \mathcal{M} \subset \mathcal{H}, every element y \in \mathcal{H} has a unique orthogonal projection P_{\mathcal{M}} y \in \mathcal{M} such that \|y - P_{\mathcal{M}} y\| = \inf_{m \in \mathcal{M}} \|y - m\|, and the residual y - P_{\mathcal{M}} y is orthogonal to \mathcal{M}. In the least squares context, the minimizer is thus the orthogonal projection of the observed data onto the model subspace, ensuring that the error is uncorrelated with the predictors in the L^2 sense.[72] This theorem underpins the geometric interpretation of least squares as a projection, extending it to abstract measure-theoretic spaces.[70]

This Hilbert space formulation naturally generalizes to stochastic processes, where the least squares problem involves estimating parameters in models driven by processes with covariance structures defined via expectations over the probability space. In functional analysis, the method applies to reproducing kernel Hilbert spaces (RKHS) associated with stochastic processes, allowing for nonparametric regression where the minimizer is the representer of evaluation functionals in the RKHS.[72] For Gaussian processes, the least squares solution coincides with the conditional expectation, bridging estimation with Bayesian inference in infinite-dimensional settings.[71]

The ergodic theorem further connects least squares to measure theory in the analysis of time series, where ergodicity ensures that time averages converge almost surely to ensemble expectations, justifying the consistency of least squares estimators for stationary ergodic processes. Under ergodicity of the underlying measure-preserving transformation, the sample covariance matrices used in least squares converge to their theoretical counterparts, enabling asymptotic analysis of prediction errors in stationary time series.[73] This implication is crucial for long-run forecasting, as it validates the use of empirical least squares for processes where realizations behave like typical samples from the invariant measure.[74]

Practical Examples
Simple Linear Regression Example
To illustrate the application of least squares in simple linear regression, consider a dataset consisting of heights (in inches) and corresponding weights (in pounds) for 10 students, as used in statistical education examples. This bivariate data allows fitting a straight-line model to predict weight based on height, demonstrating how least squares minimizes the sum of squared residuals to find the best-fitting line. The dataset is presented in the following table:

| Student | Height (inches) | Weight (pounds) |
|---|---|---|
| 1 | 63 | 127 |
| 2 | 64 | 121 |
| 3 | 66 | 142 |
| 4 | 69 | 157 |
| 5 | 69 | 162 |
| 6 | 71 | 156 |
| 7 | 71 | 169 |
| 8 | 72 | 165 |
| 9 | 73 | 181 |
| 10 | 75 | 208 |
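The fit can be computed directly from the closed-form formulas for the slope and intercept, as in the following NumPy sketch; the printed coefficients are approximate:

```python
import numpy as np

height = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
weight = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)

# Closed-form simple linear regression: slope = Sxy / Sxx, intercept = ybar - slope * xbar.
xbar, ybar = height.mean(), weight.mean()
sxy = np.sum((height - xbar) * (weight - ybar))
sxx = np.sum((height - xbar) ** 2)
slope = sxy / sxx
intercept = ybar - slope * xbar

residuals = weight - (intercept + slope * height)
print(f"fitted line: weight = {intercept:.1f} + {slope:.2f} * height")
print(f"sum of squared residuals: {np.sum(residuals ** 2):.1f}")
```

For this data the fitted line is approximately weight = -266.5 + 6.14 × height, so each additional inch of height corresponds to roughly six additional pounds under the fitted model.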
Nonlinear Fitting Example
A common application of nonlinear least squares arises in fitting exponential decay models to observational data, such as counts from radioactive decay experiments measured at discrete time points.[76] In such scenarios, the underlying physical process follows an exponential form, where the number of counts decreases proportionally to the current amount, leading to a nonlinear parameter estimation problem.[77] Consider the following representative dataset of measured counts y_i at times t_i (in arbitrary units):

| Time t_i | Counts y_i |
|---|---|
| 0 | 100 |
| 1 | 99 |
| 2 | 98 |
| 3 | 97 |
| 4 | 96 |
| 5 | 95 |
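A sketch of the corresponding fit using SciPy's curve_fit, which solves this nonlinear least squares problem with a Levenberg-Marquardt-type routine for unconstrained problems; the model y \approx N_0 e^{-\lambda t} and the initial guess follow the exponential decay description above:

```python
import numpy as np
from scipy.optimize import curve_fit

t = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([100, 99, 98, 97, 96, 95], dtype=float)

def decay(t, n0, lam):
    """Exponential decay model f(t; N0, lambda) = N0 * exp(-lambda * t)."""
    return n0 * np.exp(-lam * t)

# Initial guess: starting count taken from the data, small assumed decay rate.
params, cov = curve_fit(decay, t, y, p0=[y[0], 0.01])
n0_hat, lam_hat = params
stderr = np.sqrt(np.diag(cov))                  # approximate standard errors from the covariance

print(f"N0 = {n0_hat:.2f}, lambda = {lam_hat:.4f}")   # a slow decay, roughly 1% per time unit
print("approximate standard errors:", np.round(stderr, 4))
```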