
Ordinary least squares

Ordinary least squares (OLS) is a statistical method for estimating the parameters of a model by minimizing the sum of the squared residuals between observed and predicted values. Developed independently by Adrien-Marie Legendre, who published it in 1805, and Carl Friedrich Gauss, who claimed prior use from 1795 and published in 1809, OLS forms the foundation of regression analysis and is widely applied across the natural and social sciences for modeling relationships between variables. Under the Gauss-Markov theorem, OLS produces the best linear unbiased estimator (BLUE) when certain assumptions hold, including linearity in parameters, independence of errors, homoscedasticity (constant variance of errors), no perfect multicollinearity among predictors, and zero mean errors. The method assumes the model is linear in its parameters, meaning the response variable is a linear combination of explanatory variables plus an error term, though the relationship need not be a straight line if transformations are applied. For inference, such as constructing confidence intervals, an additional assumption of normally distributed errors is often invoked, though it is not strictly necessary for large samples due to the central limit theorem. OLS is computationally straightforward, typically solved via normal equations derived from setting the partial derivatives of the sum of squared residuals to zero, yielding closed-form solutions such as the slope estimate b = \frac{\sum (Y_i - \bar{Y})(X_i - \bar{X})}{\sum (X_i - \bar{X})^2} for simple linear regression. It is sensitive to outliers and violations of assumptions like heteroscedasticity, which can lead to inefficient or biased estimates, prompting alternatives such as weighted least squares or robust regression in such cases. Despite these limitations, OLS remains a cornerstone of statistical modeling due to its interpretability, efficiency with small datasets, and ability to provide prediction intervals when assumptions are met. The technique's historical roots in astronomy and geodesy underscore its enduring role in handling observational errors.

Model Formulation

Scalar Form

The ordinary least squares (OLS) method begins with the formulation of a linear regression model in scalar notation, which expresses the relationship between a dependent variable and one or more independent variables for each observation. In the simplest case of univariate (simple) linear regression, the model for a single observation is given by Y = \beta_0 + \beta_1 X + \varepsilon, where Y is the response variable, X is the predictor variable, \beta_0 is the intercept representing the expected value of Y when X=0, \beta_1 is the slope indicating the change in Y for a one-unit increase in X, and \varepsilon is the random error term capturing unexplained variation. For the more general multivariate case with n observations and k predictors, the model is specified as Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \varepsilon_i, \quad i = 1, \dots, n, where Y_i is the i-th observed response, X_{ij} is the value of the j-th predictor for the i-th observation, \beta_j (for j=1,\dots,k) are the parameters measuring the effect of each predictor on Y_i holding the others constant, and \beta_0 is the intercept. The error terms \varepsilon_i are assumed to have mean zero, E(\varepsilon_i) = 0, and constant variance, \mathrm{Var}(\varepsilon_i) = \sigma^2, in line with the assumptions of zero-mean errors and homoscedasticity. This scalar form provides an intuitive, element-wise representation of the linear model, originating from early applications in astronomy by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809, who developed the approach to fit orbits to astronomical data amid measurement errors. The notation can be extended compactly to vector and matrix forms for multivariate derivations.
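
To make the notation concrete, the short sketch below simulates the scalar model Y = \beta_0 + \beta_1 X + \varepsilon for hypothetical parameter values; the coefficients and data ranges are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)

# Assumed illustrative parameters for Y = beta0 + beta1 * X + eps.
beta0, beta1, sigma = 2.0, 0.5, 1.0

n = 100
X = rng.uniform(0, 10, size=n)            # predictor values
eps = rng.normal(0, sigma, size=n)        # zero-mean errors with constant variance
Y = beta0 + beta1 * X + eps               # observed responses

print(Y[:5])
```
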

Vector and Matrix Form

The vector and matrix formulation of ordinary least squares (OLS) extends the scalar representation of the linear regression model to handle multiple observations and predictors simultaneously, leveraging linear algebra for compact notation and computational efficiency. In this framework, the dependent variable is expressed as an n \times 1 column vector \mathbf{Y}, where n denotes the number of observations, compiling all response values y_i for i = 1, \dots, n. The parameter vector is a (k+1) \times 1 column vector \boldsymbol{\beta}, encompassing the intercept \beta_0 and k coefficients \beta_1, \dots, \beta_k. The error term becomes an n \times 1 vector \boldsymbol{\varepsilon}, capturing the deviations for each observation. The core model equation in matrix form is \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \mathbf{X} is the n \times (k+1) design matrix that organizes the predictor values. This equation generalizes the scalar form y_i = \beta_0 + \sum_{j=1}^k \beta_j x_{ij} + \varepsilon_i across all i, enabling matrix operations for regression analysis. The design matrix \mathbf{X} plays a pivotal role by structuring the input features, with its first column consisting of ones to accommodate the intercept term, followed by columns containing the k predictor variables x_{ij} across the n rows. For numerical stability and interpretability, the columns of \mathbf{X} (excluding the intercept) are often centered by subtracting their means or scaled by dividing by their standard deviations, which does not alter the fitted model but mitigates issues like multicollinearity or ill-conditioning in computations. Partitioning \mathbf{X} further distinguishes the intercept column from the predictor columns, such as \mathbf{X} = [\mathbf{1} \mid \mathbf{Z}], where \mathbf{1} is the n \times 1 vector of ones and \mathbf{Z} is the n \times k matrix of centered or scaled predictors; this separation facilitates modular analysis of model components.
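
A minimal sketch of assembling the design matrix described above, including the leading column of ones and optional centering of the predictor columns; the raw data values are hypothetical.

```python
import numpy as np

# Hypothetical raw predictors: n = 5 observations, k = 2 variables.
Z = np.array([[1.2, 30.0],
              [0.7, 45.0],
              [1.9, 28.0],
              [1.4, 35.0],
              [0.9, 52.0]])
n, k = Z.shape

# Optionally center the predictors; this leaves the fitted model unchanged
# apart from the intercept but improves conditioning.
Z_centered = Z - Z.mean(axis=0)

# Design matrix X = [1 | Z]: intercept column followed by the predictors.
X = np.column_stack([np.ones(n), Z_centered])
print(X.shape)   # (5, k + 1)
```
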

Estimation Methods

Least Squares Objective

The ordinary least squares (OLS) objective seeks to estimate the parameters of a linear regression model by minimizing the sum of squared residuals, which measures the discrepancy between observed and predicted values. In scalar form, for a model Y_i = \beta_0 + \sum_{j=1}^p \beta_j X_{ij} + \epsilon_i with i = 1, \dots, n, the objective function is S(\boldsymbol{\beta}) = \sum_{i=1}^n \left( Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij} \right)^2, where \boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^\top. Equivalently, in matrix notation, let \mathbf{Y} be the n \times 1 vector of responses, \mathbf{X} the n \times (p+1) design matrix (including a column of ones for the intercept), and \boldsymbol{\beta} the (p+1) \times 1 parameter vector; the objective then becomes minimizing \| \mathbf{Y} - \mathbf{X} \boldsymbol{\beta} \|^2_2, the squared Euclidean norm of the residual vector. Residuals are defined as the differences e_i = Y_i - \hat{Y}_i, where \hat{Y}_i = \beta_0 + \sum_{j=1}^p \beta_j X_{ij} represents the fitted value for the i-th observation under the estimated parameters. Squaring these residuals in the objective function penalizes larger deviations more severely than smaller ones, which emphasizes fitting accuracy for outliers, while also preventing positive and negative errors from canceling each other out in the sum. Squaring further yields a smooth, convex objective, amenable to standard optimization techniques. Under the classical linear model assumptions (linearity, strict exogeneity, homoskedasticity, and no perfect multicollinearity), minimizing this objective yields the best linear unbiased estimator (BLUE) of \boldsymbol{\beta}, meaning it has the minimum variance among all linear unbiased estimators, as established by the Gauss-Markov theorem. Geometrically, the objective corresponds to finding the point in the column space of \mathbf{X} closest to \mathbf{Y} in Euclidean distance, the minimized value being equal to the squared length of the perpendicular from \mathbf{Y} to that point.
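
The objective S(\boldsymbol{\beta}) is easy to evaluate directly; the sketch below, using hypothetical simulated data, compares its value at an arbitrary candidate \boldsymbol{\beta} with its value at the least squares solution, which is never larger.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical data.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def sum_of_squares(beta):
    """S(beta) = || y - X beta ||^2, the OLS objective."""
    r = y - X @ beta
    return r @ r

beta_guess = np.array([0.0, 0.0])                 # arbitrary candidate
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizer of S

print(sum_of_squares(beta_guess) >= sum_of_squares(beta_ols))  # True
```
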

Closed-Form Estimator

The closed-form estimator for the coefficients in ordinary least squares (OLS) regression provides an explicit algebraic solution to the least squares objective, applicable when the design matrix satisfies certain conditions. In the vector and matrix formulation, the normal equations that define the OLS estimator are \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}, where \mathbf{X} is the n \times (p+1) design matrix (including the intercept column of ones), \boldsymbol{\beta} is the (p+1) \times 1 vector of coefficients, and \mathbf{y} is the n \times 1 response vector; these equations hold assuming \mathbf{X}^\top \mathbf{X} is invertible, which requires \mathbf{X} to have full column rank. Solving the normal equations yields the closed-form estimator \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. For the simple case with a single predictor, the estimator simplifies to \hat{\beta}_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}, and the intercept to \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, where \bar{X} and \bar{Y} denote sample means. In practice, direct computation of (\mathbf{X}^\top \mathbf{X})^{-1} can suffer from numerical instability if \mathbf{X}^\top \mathbf{X} is ill-conditioned due to multicollinearity or scaling issues; instead, QR decomposition of \mathbf{X} = \mathbf{QR} allows solving \mathbf{R} \hat{\boldsymbol{\beta}} = \mathbf{Q}^\top \mathbf{y} for improved stability, while singular value decomposition (SVD) of \mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top handles rank-deficient cases by effectively computing a pseudoinverse.
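
The sketch below, under assumed simulated data, computes the estimator three ways: from the normal equations, via a general least squares solver, and via QR decomposition; all three agree for a well-conditioned design matrix.

```python
import numpy as np

rng = np.random.default_rng(12)

n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# 1. Direct normal equations (fine here, but numerically fragile if X'X is ill-conditioned).
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 2. Library least squares solver (SVD-based, handles rank deficiency).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# 3. QR decomposition: solve R beta = Q' y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(beta_normal, beta_lstsq), np.allclose(beta_qr, beta_lstsq))
```
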

Derivations of Estimator

Geometric Projection

The ordinary least squares (OLS) estimator admits a natural derivation through the lens of vector geometry in \mathbb{R}^n. Consider the response vector \mathbf{Y} \in \mathbb{R}^n, where n denotes the number of observations, and the design matrix \mathbf{X} \in \mathbb{R}^{n \times p} with p predictors (p \leq n). The columns of \mathbf{X} span a p-dimensional subspace \mathcal{C}(\mathbf{X}) \subseteq \mathbb{R}^n. The OLS estimate \hat{\beta} selects the vector in this subspace closest to \mathbf{Y} in the Euclidean norm, such that the fitted values \hat{\mathbf{Y}} = \mathbf{X} \hat{\beta} form the orthogonal projection of \mathbf{Y} onto \mathcal{C}(\mathbf{X}). This projection minimizes the squared distance \| \mathbf{Y} - \mathbf{X} \beta \|_2^2 over all \beta \in \mathbb{R}^p. The orthogonality of the projection implies that the residual vector \mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} is orthogonal to \mathcal{C}(\mathbf{X}), satisfying \mathbf{X}^T \mathbf{e} = \mathbf{0}. Substituting the expression for the residuals yields the normal equations \mathbf{X}^T (\mathbf{Y} - \mathbf{X} \hat{\beta}) = \mathbf{0}, which uniquely determine \hat{\beta} when \mathbf{X}^T \mathbf{X} is invertible (i.e., \mathbf{X} has full column rank). This condition ensures the residuals lie in the orthogonal complement of the column space, geometrically partitioning \mathbf{Y} into its projection onto \mathcal{C}(\mathbf{X}) and the residual. The projection is formalized by the hat matrix \mathbf{H} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T, a symmetric and idempotent matrix (\mathbf{H}^T = \mathbf{H} and \mathbf{H}^2 = \mathbf{H}) that maps any vector in \mathbb{R}^n onto \mathcal{C}(\mathbf{X}). Thus, the fitted values are \hat{\mathbf{Y}} = \mathbf{H} \mathbf{Y}, and the residuals are \mathbf{e} = (\mathbf{I} - \mathbf{H}) \mathbf{Y}, where \mathbf{I} is the identity matrix. This matrix representation underscores how OLS geometrically "hats" the observed \mathbf{Y} by projecting it onto the subspace defined by the predictors. To illustrate in the context of simple linear regression (one predictor plus an intercept), visualize the data as points (x_i, y_i) in a 2D scatterplot. The column space \mathcal{C}(\mathbf{X}) is spanned by the vector of ones and the predictor vector \mathbf{x} = (x_1, \dots, x_n)^T. The OLS fitted line \hat{y} = \hat{\alpha} + \hat{\beta} x minimizes the sum of squared vertical distances from the points to the line, corresponding to the orthogonal projection of \mathbf{Y} onto this 2D subspace in \mathbb{R}^n. Geometrically, the residuals appear as vertical segments in the plot, but their orthogonality condition ensures \sum e_i = 0 and \sum x_i e_i = 0, so that the fitted line passes through the data centroid while the residual vector remains perpendicular to the subspace in the full observation space.
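
The projection interpretation can be verified numerically; the sketch below, with hypothetical data, constructs the hat matrix and checks its symmetry, idempotence, and the orthogonality of the residuals to the column space.

```python
import numpy as np

rng = np.random.default_rng(13)

n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat (projection) matrix
Y_hat = H @ Y                            # fitted values: projection of Y onto C(X)
e = (np.eye(n) - H) @ Y                  # residuals: component orthogonal to C(X)

print(np.allclose(H, H.T))               # symmetric
print(np.allclose(H @ H, H))             # idempotent
print(np.allclose(X.T @ e, 0.0))         # residuals orthogonal to the columns of X
```
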

Maximum Likelihood Approach

Under the assumption that the errors in the linear regression model are independent and identically distributed normal random variables, the ordinary least squares (OLS) estimator can be derived as the maximum likelihood estimator (MLE) of the model parameters. Consider the classical linear model $Y = X\beta + \epsilon$, where $Y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$ design matrix, $\beta$ is a $p \times 1$ vector of unknown parameters, and $\epsilon$ is an $n \times 1$ error vector with $\epsilon_i \sim N(0, \sigma^2)$ i.i.d. for $i = 1, \dots, n$. The likelihood function for the parameters $\beta$ and $\sigma^2$ given the data $Y$ and $X$ is

$$L(\beta, \sigma^2 \mid Y, X) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \|Y - X\beta\|^2 \right),$$

which is proportional to $(\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \|Y - X\beta\|^2 \right)$. To obtain the MLE, maximize the log-likelihood

$$\ell(\beta, \sigma^2 \mid Y, X) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \|Y - X\beta\|^2.$$

Maximizing this with respect to $\beta$ for fixed $\sigma^2$ (or jointly) leads to the normal equations $X^T X \beta = X^T Y$, whose solution is the OLS estimator $\hat{\beta} = (X^T X)^{-1} X^T Y$.[](https://data.princeton.edu/wws509/notes/c2s2.html) This equivalence holds because the least squares objective directly corresponds to the exponential term in the likelihood under the normality assumption.[](https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf) The derivation requires the errors to be i.i.d. normal with mean zero and constant variance $\sigma^2$, which justifies the probabilistic interpretation of OLS as MLE.[](http://web.stanford.edu/~rjohari/teaching/notes/226_lecture11_inference.pdf) In this setting, the OLS estimator achieves full efficiency, meaning it has the smallest possible variance among all unbiased estimators, as per the Cramér-Rao lower bound.[](https://math.arizona.edu/~jwatkins/o-mle.pdf) Without normality, the Gauss-Markov theorem still establishes OLS as the best linear unbiased estimator (BLUE) with minimal variance among linear unbiased estimators, but the MLE framework provides broader efficiency guarantees only when normality holds.[](http://assets.press.princeton.edu/chapters/s6946.pdf)
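
The coincidence of the MLE and the OLS solution can be checked numerically. The sketch below, under assumed simulated data and hypothetical parameter values, maximizes the Gaussian log-likelihood with a general-purpose optimizer and compares the result to the closed-form OLS estimate.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated data under assumed, purely illustrative parameter values.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

def neg_log_likelihood(theta):
    """Negative Gaussian log-likelihood; theta = (beta_0,...,beta_{p-1}, log sigma)."""
    beta, log_sigma = theta[:-1], theta[-1]
    sigma2 = np.exp(2 * log_sigma)
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

# Numerical MLE of (beta, log sigma).
result = minimize(neg_log_likelihood, x0=np.zeros(p + 1), method="BFGS")
beta_mle = result.x[:-1]

# Closed-form OLS solution for comparison.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The two agree up to optimizer tolerance.
print(np.round(beta_mle, 4))
print(np.round(beta_ols, 4))
```
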
Method of Moments

The method of moments derives the ordinary least squares (OLS) estimator by equating [population](/page/Population) moments implied by the [linear regression](/page/Linear_regression) model to their sample counterparts. In the [population](/page/Population), the model posits that the [conditional expectation](/page/Conditional_expectation) of the dependent variable $Y$ given the regressors $X$ is $\mathbb{E}(Y \mid X) = X \beta$, where $\beta$ is the [vector](/page/Vector) of unknown parameters. This implies that the error term $u = Y - X \beta$ satisfies $\mathbb{E}(u \mid X) = 0$, or equivalently, the unconditional moments $\mathbb{E}(u) = 0$ and $\mathbb{E}(X' u) = 0$. These moment conditions form the foundation for [estimation](/page/Estimation), as they express the [orthogonality](/page/Orthogonality) between the regressors and the errors.[](https://warwick.ac.uk/fac/soc/economics/staff/swarulampalam/panel/gmm.pdf)

To obtain the sample estimator, the method of moments replaces the population expectations with sample averages. For a dataset of $n$ observations, the sample analog of $\mathbb{E}(X' u) = 0$ is the condition

$$\frac{1}{n} X' (Y - X \hat{\beta}) = 0,$$

where $X$ and $Y$ are the $n \times k$ [design matrix](/page/Design_matrix) and $n \times 1$ response vector, respectively, and $\hat{\beta}$ is the [estimator](/page/Estimator). Solving this equation yields the normal equations $X' X \hat{\beta} = X' Y$, and assuming $X' X$ is invertible, the OLS estimator follows as

$$\hat{\beta} = (X' X)^{-1} X' Y.$$

This derivation highlights how OLS emerges directly from moment matching without invoking minimization of a quadratic form.[](https://www.reed.edu/economics/parker/s10/312/notes/Notes2.pdf)

Within the broader method of moments framework, originally proposed by Pearson in [1894](/page/1894) and extended to the [generalized method of moments](/page/Generalized_method_of_moments) (GMM) by Hansen in 1982, OLS represents a special case where the moment conditions are linear in the parameters. The general setup involves a vector of moment functions $g(\theta) = \mathbb{E}[m(Z, \theta)] = 0$, with the sample analog $\frac{1}{n} \sum_{i=1}^n m(z_i, \theta) = 0$ solved for $\theta$; for OLS, $m(z_i, \beta) = x_i (y_i - x_i' \beta)$, making it exactly identified with as many moments as parameters. This linear structure simplifies computation and ensures the estimator coincides with the least squares solution. Compared to [maximum likelihood estimation](/page/Maximum_likelihood_estimation), which typically assumes a full [probability distribution](/page/Probability_distribution) for the errors (such as [normality](/page/Normality)), the method of moments approach for OLS requires only the validity of these unconditional moment conditions, rendering it robust to some distributional misspecifications as long as the moments exist and the [orthogonality](/page/Orthogonality) holds.[](https://warwick.ac.uk/fac/soc/economics/staff/swarulampalam/panel/gmm.pdf)

Assumptions and Violations

Classical Assumptions

The classical assumptions underlying ordinary least squares (OLS) regression, often referred to as the Gauss–Markov assumptions, provide the foundational conditions for the OLS estimator to achieve optimal properties in linear models. These assumptions ensure that the model is appropriately specified and that the errors behave in a manner that supports unbiased and efficient [estimation](/page/Estimation). The underlying result was formally introduced by [Carl Friedrich Gauss](/page/Carl_Friedrich_Gauss) in his 1821 work *Theoria Combinationis Observationum Erroribus Minimis Obnoxiae* and later generalized by [Andrey Markov](/page/Andrey_Markov) in 1912 to encompass broader linear models with correlated observations. Under these assumptions, the [Gauss–Markov theorem](/page/Gauss–Markov_theorem) establishes that the OLS estimator is the best linear unbiased estimator (BLUE), possessing the smallest variance among all linear unbiased estimators of the parameters.[](https://www.statlect.com/fundamentals-of-statistics/Gauss-Markov-theorem)

The first assumption is **linearity in parameters**, which posits that the [conditional expectation](/page/Conditional_expectation) of the dependent variable given the regressors is a [linear function](/page/Linear_function) of those regressors.
This is expressed as $E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X} \boldsymbol{\beta}$, where $\mathbf{Y}$ is the $n \times 1$ vector of observations, $\mathbf{X}$ is the $n \times (k+1)$ design matrix (including an intercept column), and $\boldsymbol{\beta}$ is the $(k+1) \times 1$ vector of parameters. This assumption requires the model to be correctly specified in its [linear form](/page/Linear_form), allowing the systematic component to capture the true relationship without omitted nonlinearities or misspecifications.

The second key assumption is **strict exogeneity**, stating that the error term is uncorrelated with the regressors, such that $E(\boldsymbol{\epsilon} \mid \mathbf{X}) = \mathbf{0}$. This implies that the regressors are not systematically related to the unobserved factors captured by the errors, ensuring no [endogeneity](/page/Endogeneity) or [omitted variable bias](/page/Omitted-variable_bias) that would lead to inconsistent estimates. Strict exogeneity is crucial for the unbiasedness of the OLS estimator, as violations could introduce correlation between $\mathbf{X}$ and $\boldsymbol{\epsilon}$, biasing parameter estimates.[](https://statisticsbyjim.com/regression/gauss-markov-theorem-ols-blue/)

Homoskedasticity forms the third assumption, requiring that the conditional variance of each error term is constant and equal to $\sigma^2$ across all observations, given the regressors: $\text{Var}(\epsilon_i \mid \mathbf{X}) = \sigma^2$ for $i = 1, \dots, n$. This equal variance condition, often combined with the spherical errors assumption (no autocorrelation, so $\text{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0$ for $i \neq j$), ensures that the errors have a homoskedastic and uncorrelated structure. Homoskedasticity is essential for the efficiency of OLS, as it guarantees the minimum variance property within the class of linear unbiased estimators.[](https://www.statlect.com/fundamentals-of-statistics/Gauss-Markov-theorem)

Finally, the assumption of **no perfect multicollinearity** requires that the design matrix $\mathbf{X}$ has full column rank, specifically $\text{rank}(\mathbf{X}) = k+1$, where $k$ is the number of explanatory variables. This prevents the regressors from being perfectly linearly dependent, which would make the parameters non-identifiable and the OLS estimator undefined. Without perfect multicollinearity, the inverse $(\mathbf{X}^\top \mathbf{X})^{-1}$ exists, allowing computation of the closed-form OLS solution.
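
A minimal way to check the no-perfect-multicollinearity condition in practice is to inspect the rank and condition number of the design matrix; the sketch below uses a small, hypothetical design matrix built with NumPy.

```python
import numpy as np

# Hypothetical design matrix: intercept plus two predictors, one of which
# is an exact linear combination of the other (perfect multicollinearity).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1                      # exactly collinear with x1
X_bad = np.column_stack([np.ones(5), x1, x2])
X_ok = np.column_stack([np.ones(5), x1, np.array([2.1, 1.9, 3.5, 4.2, 4.8])])

for name, X in [("collinear", X_bad), ("full rank", X_ok)]:
    rank = np.linalg.matrix_rank(X)
    cond = np.linalg.cond(X)
    # rank < number of columns means X'X is singular and OLS is undefined;
    # a very large condition number signals near-collinearity (ill-conditioning).
    print(name, "rank =", rank, "of", X.shape[1], "cond =", cond)
```
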
Collectively, these assumptions underpin the Gauss–Markov theorem's conclusion that OLS yields [BLUE](/page/Blue) estimates, a result that holds without requiring [normality](/page/Normality) of the errors, though [normality](/page/Normality) is often added for [inference](/page/Inference) purposes in finite samples.[](https://statisticsbyjim.com/regression/gauss-markov-theorem-ols-blue/)

Heteroskedasticity and Autocorrelation

In ordinary least squares (OLS) regression, heteroskedasticity occurs when the conditional variance of the error term varies across observations, such that $\operatorname{Var}(\varepsilon_i \mid \mathbf{X}_i) = \sigma_i^2$, where $\sigma_i^2$ depends on the regressors $\mathbf{X}_i$ or other factors.[](https://www.ucl.ac.uk/~uctp41a/b203/lecture9.pdf) This violation of the classical assumption of homoskedasticity implies that the errors have unequal spreads, often increasing with the level of the predictors, as seen in [cross-sectional data](/page/Cross-sectional_data) like income regressions where variance rises with income levels.[](https://www.reed.edu/economics/parker/s12/312/notes/Notes8.pdf)

Under heteroskedasticity, the OLS estimator $\hat{\beta}$ remains unbiased and consistent, meaning it converges in probability to the true $\beta$ as the sample size grows.[](https://stats.stackexchange.com/questions/378851/why-use-ols-when-it-is-assumed-there-is-heteroscedasticity) However, $\hat{\beta}$ becomes inefficient, exhibiting larger variance than the best linear unbiased [estimator](/page/Estimator), and the conventional standard errors are biased, typically underestimated, which invalidates t-tests, F-tests, and confidence intervals by overstating [statistical significance](/page/Statistical_significance).[](https://spureconomics.com/heteroscedasticity-causes-and-consequences/)

To detect heteroskedasticity, the Breusch-Pagan test regresses the squared OLS residuals on the regressors and applies a [Lagrange multiplier](/page/Lagrange_multiplier) statistic that follows a [chi-squared distribution](/page/Chi-squared_distribution) under the null of homoskedasticity.[](https://edu.hansung.ac.kr/~jecon/art/BreuschPagan_1979.pdf) This test, proposed by Breusch and Pagan, is widely used for its simplicity and power against common forms of heteroskedasticity.[](https://www.semanticscholar.org/paper/A-simple-test-for-heteroscedasticity-and-random-vol-Breusch-Pagan/a05a732eaa9462ba7df9195dad17d78218533efd)
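
A sketch of the Breusch-Pagan test in statsmodels, under assumed simulated data whose error spread grows with the predictor:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)

# Hypothetical data with variance that grows with the predictor.
n = 300
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x, size=n)  # heteroskedastic errors

X = sm.add_constant(x)               # design matrix with intercept column
ols_fit = sm.OLS(y, X).fit()

# Breusch-Pagan: regress squared residuals on the regressors and form the LM statistic.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X)
print(f"LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")  # small p-value -> reject homoskedasticity
```
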
Autocorrelation, or serial correlation, arises when the errors are not [independent](/page/Independent), with $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) \neq 0$ for $i \neq j$, frequently in [time series](/page/Time_series) data due to omitted variables, [inertia](/page/Inertia), or measurement errors.[](https://online.stat.psu.edu/stat501/lesson/t/t.2/t.2.3-testing-and-remedial-measures-autocorrelation) Positive autocorrelation, where high errors follow high errors, is common in economic [time series](/page/Time_series) like GDP growth.[](https://www.investopedia.com/terms/d/durbin-watson-statistic.asp) Like heteroskedasticity, autocorrelation leaves the OLS $\hat{\beta}$ unbiased and consistent under standard conditions, but renders it inefficient with inflated variance; moreover, the usual standard errors are biased (often underestimated under positive autocorrelation), leading to overstated t-statistics and unreliable inference.[](https://analystprep.com/study-notes/cfa-level-2/quantitative-method/explain-serial-correlation-and-how-it-affects-statistical-inference/) Both issues compound in panel or time series models, where ignoring them can mislead policy analysis by suggesting spurious precision.[](https://www3.nd.edu/~wevans1/econ30331/autocorrelation.pdf)

The Durbin-Watson test detects first-order [autocorrelation](/page/Autocorrelation) by computing a statistic $d = \frac{\sum_{t=2}^n (\hat{e}_t - \hat{e}_{t-1})^2}{\sum_{t=1}^n \hat{e}_t^2}$, where $\hat{e}_t$ are OLS residuals, and comparing it to critical bounds; values near 2 indicate no autocorrelation, below 2 suggest positive, and above 2 negative.[](https://academic.oup.com/biomet/article-abstract/37/3-4/409/176531) Developed by Durbin and Watson, this test is approximate but effective for models without lagged dependents.

Remedies for these violations prioritize correcting [inference](/page/Inference) without altering point estimates. [Heteroskedasticity-consistent standard errors](/page/Heteroskedasticity-consistent_standard_errors), introduced by [White](/page/White), estimate the [covariance matrix](/page/Covariance_matrix) as $\hat{V}(\hat{\beta}) = (X'X)^{-1} \left( \sum \hat{e}_i^2 x_i x_i' \right) (X'X)^{-1}$, providing valid t-tests even under unknown heteroskedasticity forms.[](https://crooker.faculty.unlv.edu/econ441/econ_papers/White-Heteroskedasticity-Correction-1980.pdf) For both heteroskedasticity and [autocorrelation](/page/Autocorrelation), [generalized least squares](/page/Generalized_least_squares) (GLS) transforms the model to $\tilde{Y} = \tilde{X} \beta + \tilde{\varepsilon}$, with $\tilde{Y} = \Sigma^{-1/2} Y$, $\tilde{X} = \Sigma^{-1/2} X$, and $\operatorname{Var}(\tilde{\varepsilon}) = I$, yielding efficient estimates if the error [covariance](/page/Covariance) $\Sigma$ is known; feasible GLS (FGLS) estimates $\Sigma$ from OLS residuals for practical use.[](http://web.vu.lt/mif/a.buteikis/wp-content/uploads/2019/11/MultivariableRegression_4.pdf) These approaches restore efficiency and valid [inference](/page/Inference), with robust errors sufficing for large samples.[](https://www.bauer.uh.edu/rsusmel/phd/ec1-11.pdf)
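
The White covariance correction described above is available directly in statsmodels; the sketch below, under assumed simulated heteroskedastic data, compares classical and heteroskedasticity-consistent (HC1) standard errors. The point estimates are unchanged; only the covariance estimate differs.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical heteroskedastic data: error spread grows with x.
n = 500
x = rng.uniform(0, 5, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + 0.5 * x, size=n)
X = sm.add_constant(x)

classical = sm.OLS(y, X).fit()                  # conventional s^2 (X'X)^{-1} covariance
robust = sm.OLS(y, X).fit(cov_type="HC1")       # White sandwich estimator (HC1 variant)

print("classical SEs:", classical.bse)
print("robust SEs:   ", robust.bse)             # typically larger here
print("same betas:", np.allclose(classical.params, robust.params))
```
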
Non-Normality and Outliers

In ordinary least squares (OLS) regression, the assumption of normally distributed errors is not required for the estimator to remain unbiased or consistent, provided the other Gauss-Markov conditions hold, such as zero conditional mean and homoskedasticity.[](https://pmc.ncbi.nlm.nih.gov/articles/PMC8613103/) However, non-normality can distort finite-sample inference procedures, including t-tests and F-tests for coefficient significance, as these rely on the exact normality of the error distribution for their validity; in such cases, asymptotic approximations or robust standard errors may be necessary to ensure reliable p-values and confidence intervals.[](https://www.carlislerainey.com/papers/heavy-tails.pdf)

To detect non-normality in the residuals, the Jarque-Bera test is commonly applied, which assesses skewness and kurtosis against normal distribution expectations using the statistic $JB = n \left( \frac{S^2}{6} + \frac{(K-3)^2}{24} \right)$, where $n$ is the sample size, $S$ is the sample skewness, and $K$ is the sample kurtosis; under the null hypothesis of normality, this follows a chi-squared distribution with 2 degrees of freedom.[](https://www.jstor.org/stable/1403192)

Outliers in OLS can manifest as points with large residuals, indicating poor model fit for that observation, or as leverage points, where the predictor values are distant from the bulk of the data in the design space, potentially pulling the fitted line toward them despite fitting well.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493) Leverage is quantified by the diagonal elements $h_{ii}$ of the hat matrix $H = X(X^T X)^{-1} X^T$, with values exceeding $\frac{2(k+1)}{n}$ (where $k$ is the number of predictors and $n$ the sample size) signaling high influence from the predictors alone.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493)

Cook's distance provides a unified measure of an observation's joint influence from both residual and leverage, defined as

$$D_i = \frac{e_i^2}{(k+1) s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},$$

where $e_i$ is the i-th residual, $s^2$ is the mean squared error, and $h_{ii}$ is the leverage; values of $D_i > \frac{4}{n-k-1}$ typically indicate substantial influence, as they reflect how much the predicted values change if the i-th observation is removed.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493) Influential observations are those whose deletion causes a notable shift in the OLS coefficient estimates $\hat{\beta}$, often quantified by the difference $\hat{\beta}_{(i)} - \hat{\beta}$, where $\hat{\beta}_{(i)}$ excludes the i-th point; such points can bias the overall fit if they disproportionately affect the parameter vector.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493) For instance, a single influential point might inflate or deflate specific coefficients, leading to misleading interpretations of relationships.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493)

To mitigate the effects of outliers and non-normality, robust regression methods downweight extreme residuals using M-estimators, such as the [Huber](/page/Huber) estimator, which minimizes a [loss function](/page/Loss_function) that is [quadratic](/page/Quadratic) for small errors but linear for large ones, thereby reducing sensitivity to contamination while preserving efficiency under approximate [normality](/page/Normality).[](https://projecteuclid.org/journals/annals-of-statistics/volume-1/issue-5/Robust-Regression-Asymptotics-Conjectures-and-Monte-Carlo/10.1214/aos/1176342503.full) Alternatively, trimming involves excluding suspected outliers based on diagnostic measures before applying OLS, though this risks information loss if the points are not truly aberrant.[](https://projecteuclid.org/journals/annals-of-statistics/volume-1/issue-5/Robust-Regression-Asymptotics-Conjectures-and-Monte-Carlo/10.1214/aos/1176342503.full) Numerical issues, such as rounding errors in computed [data](/page/Data), can also behave like outliers by amplifying discrepancies in ill-conditioned designs, where small perturbations in one [observation](/page/Observation) propagate to distort the entire coefficient vector.[](https://www.jstor.org/stable/2348515)
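
The leverage and Cook's distance diagnostics above can be computed directly from the hat matrix and residuals; the sketch below uses a small, hypothetical dataset and plain NumPy.

```python
import numpy as np

# Hypothetical data with one unusual point in the predictor space.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 15.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.2, 20.0])
n, k = len(x), 1

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat                            # residuals

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
h = np.diag(H)                                  # leverages h_ii
s2 = e @ e / (n - k - 1)                        # mean squared error

cooks_d = (e**2 / ((k + 1) * s2)) * h / (1 - h)**2

high_leverage = h > 2 * (k + 1) / n
influential = cooks_d > 4 / (n - k - 1)
print("leverage:", np.round(h, 3))
print("Cook's D:", np.round(cooks_d, 3))
print("flagged observations:", np.where(high_leverage | influential)[0])
```
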
Finite-Sample Properties

Unbiasedness and Efficiency

Under the assumptions of [linearity](/page/Linearity) in parameters, strict exogeneity (E[ε | X] = 0), and no perfect [multicollinearity](/page/Multicollinearity), the ordinary least squares (OLS) estimator β̂ is unbiased in finite samples, meaning its [expected value](/page/Expected_value) equals the true [parameter](/page/Parameter) vector: E[β̂] = β.[](https://www.stat.berkeley.edu/~census/GaussMr2.pdf) This unbiasedness follows from the [linearity](/page/Linearity) of the OLS [estimator](/page/Estimator) in the observed [data](/page/Data). The [closed-form expression](/page/Closed-form_expression) for β̂ takes the [linear form](/page/Linear_form) β̂ = (X'X)^{-1} X' y, where y = Xβ + ε. Substituting yields β̂ = β + (X'X)^{-1} X' ε. Taking expectations (conditional on X) gives E[β̂] = β + (X'X)^{-1} X' E[ε] = β, since E[ε] = 0 under the exogeneity assumption.[](http://www.unm.edu/~jikaczmarski/working_papers/gm_proof.pdf)

Under the additional Gauss-Markov assumption of homoskedasticity (Var(ε | X) = σ² I), the OLS [estimator](/page/Estimator) β̂ is the best linear unbiased [estimator](/page/Estimator) (BLUE), possessing the minimum variance among all linear unbiased estimators of β.[](https://www2.stat.duke.edu/courses/Spring23/sta211.01/slides/lec-8.pdf) Specifically, the [covariance matrix](/page/Covariance_matrix) of β̂ is Var(β̂) = σ² (X'X)^{-1}, which is the smallest possible in the positive semi-definite sense for any competing linear unbiased estimator.[](http://www.stat.ucla.edu/~nchristo/statistics100C/gauss_markov_theorem.pdf)

The proof of the [BLUE](/page/Blue) property uses the [decomposition](/page/Decomposition) for any other linear unbiased [estimator](/page/Estimator) γ̂ = C y, where E[γ̂] = β implies C X = I. Let B = C - (X'X)^{-1} X', so B X = 0. Then Var(γ̂) = σ² C C' = σ² [((X'X)^{-1} X' + B) ((X'X)^{-1} X' + B)'] = σ² [(X'X)^{-1} + B B'], since the cross terms vanish (X' B' = (B X)' = 0). Thus, Var(γ̂) - Var(β̂) = σ² B B', which is positive semi-definite. This efficiency holds without requiring normality of the errors, relying solely on the classical assumptions for [linearity](/page/Linearity) and unbiasedness.[](https://www.stat.berkeley.edu/~census/GaussMr2.pdf)[](https://web.ics.purdue.edu/~jltobias/671/lecture_notes/regression4.pdf)
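
A quick Monte Carlo check of unbiasedness under assumed parameter values: the sketch below repeatedly redraws the errors for a fixed design, re-estimates β̂, and compares the average estimate to the true β.

```python
import numpy as np

rng = np.random.default_rng(3)

n, reps = 50, 5000
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])  # fixed design
beta_true = np.array([3.0, -1.5])                              # assumed true parameters

estimates = np.empty((reps, 2))
for r in range(reps):
    eps = rng.normal(scale=2.0, size=n)      # zero-mean, homoskedastic errors
    y = X @ beta_true + eps
    estimates[r], *_ = np.linalg.lstsq(X, y, rcond=None)

# The average of the estimates should be close to beta_true (unbiasedness).
print("mean of beta_hat:", estimates.mean(axis=0))
print("true beta:       ", beta_true)
```
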
Variance of Estimators

Under the Gauss-Markov assumptions of the classical linear regression model, where the errors are uncorrelated, have constant variance σ², and zero mean, the ordinary least squares (OLS) estimator $\hat{\beta}$ has a finite-sample covariance matrix that can be derived directly from its expression as a linear function of the errors. Specifically, the model is $Y = X\beta + \epsilon$, with $\mathbb{E}(\epsilon) = 0$ and $\operatorname{Var}(\epsilon) = \sigma^2 I_n$. Substituting yields $\hat{\beta} = (X^\top X)^{-1} X^\top Y = \beta + (X^\top X)^{-1} X^\top \epsilon$. The covariance matrix then follows as $\operatorname{Var}(\hat{\beta}) = (X^\top X)^{-1} X^\top (\sigma^2 I_n) X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}$, assuming $X$ is non-stochastic or conditioned upon.[](https://www.stat.berkeley.edu/~census/general.pdf)

Since σ² is unknown in practice, it is estimated using the residuals from the fitted model. The [residual](/page/Residual) vector is $e = Y - X\hat{\beta} = (I_n - X(X^\top X)^{-1}X^\top) \epsilon = (I_n - P_X) \epsilon$, where $P_X$ is the [projection matrix](/page/Projection_matrix) onto the column space of $X$. The sum of squared residuals is $e^\top e = \|Y - X\hat{\beta}\|^2$, and the unbiased [estimator](/page/Estimator) of σ² is $s^2 = \frac{\|Y - X\hat{\beta}\|^2}{n - k - 1}$, where $n$ is the sample size and $k$ is the number of regressors (excluding the intercept). This estimator is unbiased under the Gauss-Markov assumptions. Under the additional assumption of normally distributed errors, $(n - k - 1)s^2 / \sigma^2 \sim \chi^2_{n-k-1}$.[](https://users.stat.umn.edu/~helwig/notes/mvlr-Notes.pdf) The estimated covariance matrix of $\hat{\beta}$ is then $s^2 (X^\top X)^{-1}$, and the standard errors of the individual coefficients are the square roots of its diagonal elements, i.e., $\operatorname{se}(\hat{\beta}_j) = \sqrt{s^2 \cdot [(X^\top X)^{-1}]_{jj}}$ for the $j$-th coefficient. These standard errors quantify the sampling variability of each $\hat{\beta}_j$ and form the basis for inference in finite samples.[](https://www.stat.berkeley.edu/~census/general.pdf)

In partitioned regression, where the regressors are split as $X = [X_1 \, X_2]$ with corresponding coefficients $\beta = [\beta_1^\top \, \beta_2^\top]^\top$, the Frisch-Waugh-Lovell theorem provides a focused expression for the variance of the subset estimator $\hat{\beta}_1$. The theorem states that $\hat{\beta}_1$ equals the OLS coefficient from regressing the residuals of $Y$ on $X_2$ (denoted $y^{(2)}$) onto the residuals of $X_1$ on $X_2$ (denoted $X_1^{(2)}$), yielding $\hat{\beta}_1 = (X_1^{(2)\top} X_1^{(2)})^{-1} X_1^{(2)\top} y^{(2)}$. The corresponding variance is $\operatorname{Var}(\hat{\beta}_1) = \sigma^2 (X_1^{(2)\top} X_1^{(2)})^{-1}$, which isolates the variability attributable to the variables in $X_1$ after accounting for those in $X_2$. This result, originally demonstrated for partial regressions, facilitates computational efficiency and interpretation in models with grouped regressors.[](https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf)[](http://repec.wesleyan.edu/pdf/mlovell/2005012_lovell.pdf)
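
The estimated covariance matrix and standard errors follow directly from these formulas; a minimal NumPy sketch under assumed simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data (assumed values): two predictors plus an intercept.
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 0.8, -0.3])
y = X @ beta_true + rng.normal(scale=1.2, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

s2 = resid @ resid / (n - k - 1)          # unbiased estimate of sigma^2
cov_beta = s2 * XtX_inv                   # estimated covariance matrix of beta_hat
se = np.sqrt(np.diag(cov_beta))           # standard errors of the coefficients

print("beta_hat:  ", np.round(beta_hat, 3))
print("std errors:", np.round(se, 3))
```
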
Influential Data Points

In ordinary least squares (OLS) [estimation](/page/Estimation), certain data points can disproportionately affect the [parameter](/page/Parameter) estimates in finite samples due to their position in the design space or the near-linear dependencies among predictors. These influential observations arise because the OLS [estimator](/page/Estimator) weights points based on their [leverage](/page/Leverage), potentially leading to biased or unstable inferences if not identified. The impact is particularly pronounced in small samples, where the inverse moment [matrix](/page/Matrix) $(X'X)^{-1}$ amplifies the contribution of atypical points.

Leverage quantifies the potential influence of the $i$-th observation on the fitted values solely through its predictor values, independent of the response. It is given by the diagonal element of the hat matrix $H = X(X'X)^{-1}X'$, specifically $h_{ii} = x_i'(X'X)^{-1}x_i$, where $x_i$ is the $i$-th row of the design matrix $X$. Values of $h_{ii}$ range from $1/n$ to 1, with an average of $p/n$ across all observations, where $n$ is the sample size and $p$ is the number of parameters; points with $h_{ii} > 2p/n$ are typically flagged as high-leverage. High-leverage points, often located far from the centroid of the predictors, can dominate the estimation even if their residuals are small, as they pull the fit toward their position in the $X$-space.[](https://dspace.mit.edu/bitstream/handle/1721.1/48325/linearregression00wels.pdf?sequence=1&isAllowed=y)

To assess the actual change in individual [coefficient](/page/Coefficient) estimates, the DFBETAS [statistic](/page/Statistic) measures the standardized difference in the $j$-th [parameter](/page/Parameter) when the $i$-th [observation](/page/Observation) is deleted:

$$\text{DFBETAS}_{j(i)} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{s_{(i)} \sqrt{c_{jj}}},$$

where $\hat{\beta}_{j(i)}$ is the estimate without the $i$-th point, $s_{(i)}$ is the [residual](/page/Residual) standard error from the deleted fit, and $c_{jj}$ is the $j$-th diagonal element of $(X'X)^{-1}$. A point is considered influential if $|\text{DFBETAS}_{j(i)}| > 2/\sqrt{n}$, indicating a notable shift in $\hat{\beta}_j$. This metric combines [leverage](/page/Leverage) with the [residual](/page/Residual) discrepancy, highlighting observations that alter specific parameters.[](https://dspace.mit.edu/bitstream/handle/1721.1/48325/linearregression00wels.pdf?sequence=1&isAllowed=y)

Multicollinearity exacerbates the influence of data points by inflating the variance of the estimators through large elements in $(X'X)^{-1}$. When predictors are highly correlated, the inverse matrix develops elevated diagonal and off-diagonal entries, particularly in directions of near-linear dependence, making coefficients sensitive to individual observations aligned with those dependencies. This variance inflation can render even moderate-leverage points highly influential, as small changes in $X$ or $y$ propagate disproportionately to $\hat{\beta}$. Diagnostics for collinearity, such as condition indices derived from the [singular value decomposition](/page/Singular_value_decomposition) of $X$, help identify these unstable configurations.

In small samples, a single point can dominate the OLS fit; for instance, consider a simple linear regression with $n=5$ and one predictor, where four points follow the trend $y \approx x$ (at $x = 1, 2, 3, 4$) and the fifth lies far out at $(x, y) = (10, 15)$. The leverage of the outlying point is about 0.92, compared with values below 0.4 for the clustered points, and the fitted slope is pulled from the cluster's value of roughly 1 up to about 1.6 because $(X'X)^{-1}$ amplifies its weight. Removing this single point restores the slope implied by the cluster, illustrating how finite-sample scarcity allows one observation to control the estimates.

Asymptotic Properties

Consistency

In the context of ordinary least squares (OLS) estimation, consistency refers to the property that the estimator $\hat{\beta}$ converges in probability to the true parameter vector $\beta$ as the sample size $n$ approaches [infinity](/page/Infinity), denoted as $\operatorname{plim}_{n \to \infty} \hat{\beta} = \beta$. This large-sample convergence holds under relatively weak conditions, including a fixed number of parameters $k$, strict exogeneity $\mathbb{E}[u \mid X] = 0$, and a rank condition ensuring that the [design matrix](/page/Design_matrix) $X$ has full column rank asymptotically. Additionally, the data-generating process must satisfy [ergodicity](/page/Ergodicity) or independence and identical distribution (i.i.d.)
assumptions to invoke the [law of large numbers](/page/Law_of_large_numbers) (LLN), along with finite second moments for the regressors and errors to ensure the necessary probability limits exist.[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)

The proof of consistency relies on rewriting the OLS estimator in probability limit form and applying Slutsky's theorem. Specifically, express $\hat{\beta} = \left( \frac{X'X}{n} \right)^{-1} \frac{X'y}{n}$, where $y = X\beta + u$. Under the stated assumptions, the LLN implies $\operatorname{plim}_{n \to \infty} \frac{X'X}{n} = Q$, a positive definite matrix representing the second moment of the regressors, and $\operatorname{plim}_{n \to \infty} \frac{X'y}{n} = Q\beta$, due to exogeneity ensuring $\mathbb{E}[Xu] = 0$. By continuous mapping and [Slutsky's theorem](/page/Slutsky's_theorem), $\operatorname{plim}_{n \to \infty} \hat{\beta} = Q^{-1} (Q\beta) = \beta$. Notably, this result does not require normality of the errors or homoskedasticity, as those assumptions are pertinent to finite-sample efficiency or asymptotic normality rather than point convergence.[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)[](https://bstewart.scholar.princeton.edu/document/205)[](https://www.bauer.uh.edu/rsusmel/phd/ec1-7.pdf)

Under slightly stronger moment conditions, such as finite fourth moments for the errors and regressors, the OLS estimator is $\sqrt{n}$-consistent, meaning $\sqrt{n} (\hat{\beta} - \beta)$ remains stochastically bounded and converges in distribution to a normal limit (though the distributional aspect is addressed elsewhere). This rate underscores the practical utility of OLS in large samples, where estimation error diminishes proportionally to $1/\sqrt{n}$.[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)

Asymptotic Normality

Under suitable regularity conditions, the ordinary least squares (OLS) estimator $\hat{\beta}$ exhibits asymptotic normality, meaning that as the sample size $n$ approaches infinity, the scaled difference $\sqrt{n}(\hat{\beta} - \beta)$ converges in distribution to a [multivariate normal distribution](/page/Multivariate_normal_distribution). This property, which builds on the consistency of $\hat{\beta}$, is fundamental for large-sample inference in [linear regression](/page/Linear_regression) models.[](https://doi.org/10.1016/S1573-4412(05)80005-4)

The asymptotic normality arises from applying the [central limit theorem](/page/Central_limit_theorem) (CLT) to the term $\frac{1}{\sqrt{n}} X' \epsilon$, where $X$ is the [design matrix](/page/Design_matrix) and $\epsilon$ is the error vector. Assuming independent and identically distributed (i.i.d.) errors with mean zero and finite variance $\sigma^2$, and that the regressors satisfy $\operatorname{plim}_{n \to \infty} \frac{1}{n} X'X = Q$ where $Q$ is positive definite, the CLT implies

$$\frac{1}{\sqrt{n}} X' \epsilon \xrightarrow{d} N(0, \sigma^2 Q).$$
Combined with the [consistency](/page/Consistency) of $\hat{\beta}$ and a [continuous mapping theorem](/page/Continuous_mapping_theorem) such as Slutsky's, this yields the key distributional result:

$$\sqrt{n} (\hat{\beta} - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1}),$$

where $\sigma^2 Q^{-1}$ is the asymptotic [covariance matrix](/page/Covariance_matrix).[](https://doi.org/10.1016/S1573-4412(05)80005-4)[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)

In the presence of heteroskedasticity, where the error variances $\operatorname{Var}(\epsilon_i | X_i) = \sigma_i^2$ may differ across observations but remain finite, the homoskedastic form no longer holds. Instead, the asymptotic covariance takes the "sandwich" form $\Omega = Q^{-1} \left( \operatorname{plim}_{n \to \infty} \frac{1}{n} X' \operatorname{diag}(\epsilon^2) X \right) Q^{-1}$, ensuring valid inference without assuming constant variance. This robust estimator, which adjusts for heteroskedasticity-consistent standard errors, was formalized in the seminal work on covariance matrix estimation.[](https://www.jstor.org/stable/1912934)

Inference Procedures

In large samples, inference procedures for the ordinary least squares (OLS) estimator $\hat{\beta}$ rely on its asymptotic [normality](/page/Normality) to construct tests and [confidence](/page/Confidence) intervals for hypotheses about the true parameters $\beta$. These methods approximate the finite-sample distributions with [normal](/page/Normal) or chi-squared distributions, providing robust inference even when classical assumptions like homoskedasticity or [normality](/page/Normality) of errors are violated, as long as [consistency](/page/Consistency) and the [central limit theorem](/page/Central_limit_theorem) hold.[](https://eml.berkeley.edu/~powell/e240b_sp10/alsnotes.pdf)[](https://www.schmidheiny.name/teaching/ols.pdf)

The Wald test is a general framework for testing linear or nonlinear restrictions on $\beta$. For a hypothesis $H_0: g(\beta) = 0$, where $g$ is a function of dimension $r$, the test statistic is $W_N = N \cdot g(\hat{\beta})' \left[ \hat{G} \cdot \widehat{\text{Avar}}(\hat{\beta}) \cdot \hat{G}' \right]^{-1} g(\hat{\beta})$, which converges in distribution to $\chi^2_r$ under the null, with $\hat{G}$ as the Jacobian of $g$ evaluated at $\hat{\beta}$ and $\widehat{\text{Avar}}(\hat{\beta})$ as a consistent estimator of the asymptotic variance, often using the robust sandwich form $\hat{D}^{-1} \hat{C} \hat{D}^{-1}$, where $\hat{D} = N^{-1} \sum x_i x_i'$ and $\hat{C} = N^{-1} \sum \hat{\epsilon}_i^2 x_i x_i'$. For linear restrictions $H_0: R\beta = q$ with $R$ of dimension $J \times K$, this simplifies to $W = (R\hat{\beta} - q)' \left[ R \widehat{\text{Avar}}(\hat{\beta}) R' \right]^{-1} (R\hat{\beta} - q) \stackrel{a}{\sim} \chi^2_J$. The null is rejected if $W$ exceeds the critical value from the chi-squared distribution at the desired significance level.[](https://eml.berkeley.edu/~powell/e240b_sp10/alsnotes.pdf)[](https://www.bauer.uh.edu/rsusmel/phd/ec1-7.pdf)[](https://www.schmidheiny.name/teaching/ols.pdf)

Asymptotic t-tests, often called z-tests in large samples, assess individual coefficients or simple linear combinations. For $H_0: \beta_j = \beta_{j0}$, the statistic is $z_j = \frac{\hat{\beta}_j - \beta_{j0}}{\text{se}(\hat{\beta}_j)}$, where $\text{se}(\hat{\beta}_j)$ is the square root of the $j$-th diagonal element of $\widehat{\text{Avar}}(\hat{\beta})$; under the null, $z_j \stackrel{a}{\sim} N(0,1)$.
The test rejects if $|z_j|$ exceeds the critical value from the standard normal distribution, such as 1.96 for a 5% two-sided test. This procedure extends to any linear hypothesis $H_0: r' \beta = q$ via $z = \frac{r' \hat{\beta} - q}{\sqrt{r' \widehat{\text{Avar}}(\hat{\beta}) r}}$.[](https://eml.berkeley.edu/~powell/e240b_sp10/alsnotes.pdf)[](https://www.schmidheiny.name/teaching/ols.pdf)[](https://www.bauer.uh.edu/rsusmel/phd/ec1-7.pdf)

Confidence intervals for $\beta$ or linear combinations follow directly from the asymptotic normality. A $(1 - \alpha) \times 100\%$ interval for $\beta_j$ is $\hat{\beta}_j \pm z_{\alpha/2} \cdot \text{se}(\hat{\beta}_j)$, where $z_{\alpha/2}$ is the $(1 - \alpha/2)$-quantile of the standard normal distribution; for example, $\pm 1.96 \cdot \text{se}(\hat{\beta}_j)$ at the 95% level. For the full vector $\beta$, elementwise intervals are formed in the same way, using the square roots of the diagonal elements of $\widehat{\text{Avar}}(\hat{\beta})$. These intervals capture the true $\beta_j$ with probability approaching $1 - \alpha$ as the sample size grows.[](https://www.schmidheiny.name/teaching/ols.pdf)[](https://eml.berkeley.edu/~powell/e240b_sp10/alsnotes.pdf)

The F-test for subsets of coefficients, such as testing joint significance of a group of regressors, is asymptotically equivalent to the Wald test under large $n$. For $H_0: R\beta = 0$ where $R$ selects $J$ coefficients, the statistic $F = \frac{1}{J} (R\hat{\beta})' \left[ R \widehat{\text{Avar}}(\hat{\beta}) R' \right]^{-1} (R\hat{\beta}) \stackrel{a}{\sim} \chi^2_J / J$, but is often scaled to match the Wald form for chi-squared approximation directly. This provides a test for restricted models, rejecting the null if the statistic exceeds the critical value, and is particularly useful for comparing nested models in large samples.[](https://www.bauer.uh.edu/rsusmel/phd/ec1-7.pdf)[](https://www.schmidheiny.name/teaching/ols.pdf)
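
A compact sketch of these procedures, under assumed simulated data, using statsmodels for the fit and its F-test (asymptotically equivalent to the Wald test described above) for a joint linear restriction:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Simulated data (assumed values): intercept plus two predictors.
n = 400
X = sm.add_constant(rng.normal(size=(n, 2)))
beta_true = np.array([0.5, 1.0, 0.0])           # second slope truly zero
y = X @ beta_true + rng.normal(size=n)

fit = sm.OLS(y, X).fit()

# Individual t/z statistics and 95% confidence intervals for each coefficient.
print(fit.params)                  # beta_hat
print(fit.bse)                     # standard errors
print(fit.tvalues)                 # coefficient / standard error
print(fit.conf_int(alpha=0.05))    # elementwise confidence intervals

# Joint test of the restriction R beta = 0 on both slope coefficients.
R = np.array([[0, 1, 0],
              [0, 0, 1]])
print(fit.f_test(R))               # F form; J * F approximates the chi-squared Wald statistic
```
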
Prediction and Diagnostics

Fitted Values and Residuals

In ordinary least squares (OLS) regression, the fitted values represent the predicted response values based on the estimated model parameters. For a dataset with response vector $\mathbf{y}$ (of length $n$) and design matrix $\mathbf{X}$ (of dimension $n \times (k+1)$, including an intercept column), the fitted values are denoted $\hat{\mathbf{y}}$ and computed as $\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}}$, where $\hat{\boldsymbol{\beta}}$ is the OLS coefficient vector.[](https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf) Equivalently, in matrix notation, $\hat{\mathbf{y}} = \mathbf{H} \mathbf{y}$, where $\mathbf{H} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is known as the [hat matrix](/page/Hat).[](https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf)

The [hat matrix](/page/Hat) $\mathbf{H}$ is symmetric ($\mathbf{H}^\top = \mathbf{H}$) and idempotent ($\mathbf{H}^2 = \mathbf{H}$), properties that reflect its role as an orthogonal projection onto the column space of $\mathbf{X}$.[](http://users.stat.umn.edu/~helwig/notes/mlr-Notes.pdf) Additionally, the trace of $\mathbf{H}$, denoted $\operatorname{tr}(\mathbf{H})$, equals the number of parameters in the model, $k+1$, which quantifies the effective dimensionality of the projection.[](http://www.mysmu.edu/faculty/anthonytay/MFE/OLS_using_Matrix_Algebra.pdf)

The residuals, denoted $\mathbf{e}$, measure the discrepancies between the observed responses and the fitted values, defined as $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H}) \mathbf{y}$, where $\mathbf{I}$ is the $n \times n$ [identity matrix](/page/Identity_matrix).[](https://www.math.wm.edu/~leemis/otext3.pdf) By construction of the OLS estimator, the residuals satisfy two key orthogonality conditions: their sum is zero ($\sum_{i=1}^n e_i = \mathbf{1}^\top \mathbf{e} = 0$, where $\mathbf{1}$ is a vector of ones) and they are orthogonal to each column of $\mathbf{X}$ ($\mathbf{X}^\top \mathbf{e} = \mathbf{0}$). These properties ensure that the residuals capture the unexplained variation after accounting for the linear effects in $\mathbf{X}$, and they hold under the standard OLS assumptions of full column [rank](/page/Rank) in $\mathbf{X}$.

A primary summary statistic derived from the fitted values and residuals is the [coefficient of determination](/page/Coefficient_of_determination), $R^2$, which quantifies the proportion of total variation in $\mathbf{y}$ explained by the model.
It is given by

$$R^2 = 1 - \frac{\| \mathbf{e} \|^2}{\| \mathbf{y} - \bar{y} \mathbf{1} \|^2} = 1 - \frac{\sum_{i=1}^n e_i^2}{\sum_{i=1}^n (y_i - \bar{y})^2},$$

where $\| \cdot \|^2$ denotes the squared [Euclidean](/page/Euclidean) [norm](/page/Norm), $\bar{y}$ is the [sample mean](/page/Mean) of $\mathbf{y}$, and the denominator is the [total sum of squares](/page/Total_sum_of_squares) (SST).[](https://users.wfu.edu/cottrell/ecn215/regress_print.pdf) Equivalently, $R^2 = \frac{\| \hat{\mathbf{y}} - \bar{y} \mathbf{1} \|^2}{\| \mathbf{y} - \bar{y} \mathbf{1} \|^2}$, the ratio of the [explained sum of squares](/page/Explained_sum_of_squares) (SSR) to SST.[](http://users.stat.umn.edu/~helwig/notes/slr-Notes.pdf)

To account for model complexity and sample size, the adjusted $R^2$ penalizes the inclusion of additional parameters:

$$R^2_{\text{adj}} = 1 - \frac{\| \mathbf{e} \|^2 / (n - k - 1)}{\| \mathbf{y} - \bar{y} \mathbf{1} \|^2 / (n - 1)},$$

which divides the mean squared residual by its unbiased degrees-of-freedom estimate and compares it to the unbiased total variance.[](https://stats.oarc.ucla.edu/spss/output/regression-analysis/) For a model with an intercept, $R^2$ ranges from 0 to 1, with higher values indicating better fit; $R^2_{\text{adj}}$ is at most $R^2$ and can be slightly negative for poorly fitting models, but it is preferred for model comparison across different $k$.[](https://users.wfu.edu/cottrell/ecn215/regress_print.pdf)
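
A short NumPy sketch of these quantities, under assumed simulated data, verifying the residual orthogonality conditions and computing both $R^2$ and adjusted $R^2$:

```python
import numpy as np

rng = np.random.default_rng(6)

n, k = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, -0.7]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat                   # fitted values
e = y - y_hat                          # residuals

# Orthogonality: residuals sum to zero and are orthogonal to every column of X.
print(np.isclose(e.sum(), 0.0), np.allclose(X.T @ e, 0.0))

sse = e @ e
sst = np.sum((y - y.mean())**2)
r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(round(r2, 3), round(r2_adj, 3))
```
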
Confidence Intervals for Predictions

In ordinary least squares (OLS) [regression](/page/Regression), confidence intervals quantify the uncertainty around the estimated [mean](/page/Mean) response at a given predictor [vector](/page/Vector) $x_0$, while prediction intervals account for the additional variability in a new individual observation at the same $x_0$. The point estimate $\hat{Y}_0 = x_0^T \hat{\beta}$ centers both types of intervals, where $\hat{\beta}$ is the OLS coefficient [vector](/page/Vector).[](https://www.stat.purdue.edu/~fmliang/STAT512/lect3.pdf)

The $(1 - \alpha) \times 100\%$ [confidence interval](/page/Confidence_interval) for the mean response $E(y \mid x_0)$ is constructed as

$$\hat{Y}_0 \pm t_{n-k-1,\, 1-\alpha/2} \, s \, \sqrt{x_0^T (X^T X)^{-1} x_0},$$

where $t_{n-k-1, 1-\alpha/2}$ is the critical value from the [Student's t-distribution](/page/Student's_t-distribution) with $n - k - 1$ [degrees of freedom](/page/Degrees_of_freedom) ($n$ is the sample size and $k$ is the number of predictors), $s = \sqrt{\text{MSE}}$ is the residual [standard error](/page/Standard_error) with $\text{MSE} = \text{SSE}/(n - k - 1)$, and $X$ is the [design matrix](/page/Design_matrix) including an intercept column. This interval relies on the assumptions of [linearity](/page/Linearity), [independence](/page/Independence), homoscedasticity, and [normality](/page/Normality) of errors in the OLS model.[](https://www.stat.purdue.edu/~fmliang/STAT512/lect3.pdf)

For predicting a new response $y_0$ at $x_0$, the corresponding $(1 - \alpha) \times 100\%$ [prediction interval](/page/Prediction_interval) is

$$\hat{Y}_0 \pm t_{n-k-1,\, 1-\alpha/2} \, s \, \sqrt{1 + x_0^T (X^T X)^{-1} x_0}.$$

The added 1 under the [square root](/page/Square_root) incorporates the variance of the new [error](/page/Error) term $\sigma^2$, estimated by $s^2$, which explains why prediction intervals are always wider than [confidence](/page/Confidence) intervals for the mean response.[](https://www.stat.purdue.edu/~fmliang/STAT512/lect3.pdf)

For large sample sizes, the finite-sample t-based intervals approximate asymptotic normal intervals by replacing $t_{n-k-1, 1-\alpha/2}$ with the standard [normal](/page/Normal) critical value $z_{1-\alpha/2}$ (approximately 1.96 for $\alpha = 0.05$) and using the [consistent estimator](/page/Consistent_estimator) $s$ for $\sigma$. In large samples the variance of $\hat{Y}_0$ is approximately $\sigma^2 x_0^T (X^T X)^{-1} x_0$, leveraging the asymptotic normality of the OLS estimator under standard regularity conditions.[](https://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf)
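
These interval formulas can be applied directly; the sketch below, under assumed simulated data and a hypothetical query point x0, uses scipy for the t critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n, k = 60, 1
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 4.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s = np.sqrt(resid @ resid / (n - k - 1))        # residual standard error
t_crit = stats.t.ppf(0.975, df=n - k - 1)       # 95% two-sided critical value

x0 = np.array([1.0, 5.0])                       # hypothetical query point (intercept, x = 5)
y0_hat = x0 @ beta_hat
q = x0 @ XtX_inv @ x0

ci = (y0_hat - t_crit * s * np.sqrt(q), y0_hat + t_crit * s * np.sqrt(q))           # mean response
pi = (y0_hat - t_crit * s * np.sqrt(1 + q), y0_hat + t_crit * s * np.sqrt(1 + q))   # new observation
print("CI:", np.round(ci, 2), "PI:", np.round(pi, 2))   # the PI is wider
```
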
### Model Diagnostics

Model diagnostics in ordinary least squares (OLS) regression are post-estimation procedures used to validate key assumptions, including [linearity](/page/Linearity), homoskedasticity, [normality](/page/Normality) of errors, absence of [multicollinearity](/page/Multicollinearity) among predictors, and lack of [autocorrelation](/page/Autocorrelation) in residuals. These diagnostics rely primarily on the residuals, defined as the differences between observed and predicted values, to identify potential model misspecifications that could [bias](/page/Bias) estimates or invalidate [inference](/page/Inference).[](https://online.stat.psu.edu/stat462/node/116/)

Residual plots provide a visual [assessment](/page/Assessment) of several assumptions. The plot of residuals versus fitted values evaluates [linearity](/page/Linearity) and homoskedasticity; under the assumptions, residuals should scatter randomly around the horizontal line at zero, with no discernible patterns such as curves (indicating nonlinearity) or funnel shapes (indicating heteroskedasticity).[](https://library.virginia.edu/data/articles/diagnostic-plots) Deviations in this plot suggest the need for model adjustments, such as [polynomial](/page/Polynomial) terms or variance-stabilizing transformations.[](https://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm) Similarly, the normal Q-Q (quantile-quantile) plot checks the [normality](/page/Normality) assumption by comparing ordered residuals to theoretical quantiles of the [normal](/page/Normal) distribution; residuals aligning closely with the reference line support [normality](/page/Normality), while systematic deviations, such as heavy tails, indicate non-normality.[](https://library.virginia.edu/data/articles/diagnostic-plots)

The Ramsey RESET (regression equation specification error test) addresses functional-form misspecification by testing whether higher-order terms of the fitted values significantly improve the model. Introduced by Ramsey in [1969](/page/1969), the test augments the original OLS model with powers (typically squares and cubes) of the fitted values and performs an [F-test](/page/F-test) on the coefficients of these added terms; rejection of the [null hypothesis](/page/Null_hypothesis) (all added coefficients zero) signals omitted variables or an incorrect functional form.[](https://www.jstor.org/stable/2984219)

Multicollinearity among predictors is quantified using variance inflation factors (VIF), which measure how much the variance of an OLS [coefficient](/page/Coefficient) is inflated by correlations with other predictors. For the $j$-th predictor,

\text{VIF}_j = \frac{1}{1 - R_j^2},

where $R_j^2$ is the [coefficient of determination](/page/Coefficient_of_determination) from an auxiliary OLS [regression](/page/Regression) of $X_j$ on all other predictors; values exceeding 5 or 10 typically indicate high [multicollinearity](/page/Multicollinearity) and potentially unstable estimates, though the appropriate threshold depends on context.[](https://www.tandfonline.com/doi/abs/10.1080/00401706.1970.10488699)[](https://online.stat.psu.edu/stat462/node/180/)

Autocorrelation in residuals, common in time-series data, is tested with the Durbin-Watson statistic, which examines [first-order](/page/First-order) serial correlation. Developed by Durbin and Watson in 1950, the test statistic is

d = \frac{\sum_{t=2}^n (e_t - e_{t-1})^2}{\sum_{t=1}^n e_t^2},

where $e_t$ are the residuals; $d$ ranges from 0 to 4, with values near 2 supporting the [null hypothesis](/page/Null_hypothesis) of no [autocorrelation](/page/Autocorrelation), while low values ($d < 1.5$) or high values ($d > 2.5$) suggest positive or negative [autocorrelation](/page/Autocorrelation), respectively, often requiring adjustments such as including lagged variables. Critical values for significance depend on the sample size and the number of regressors and are tabulated in Durbin-Watson tables.[](https://www.jstor.org/stable/2332391)
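Both of these numerical diagnostics are straightforward to compute directly. The following sketch (the function names are illustrative) obtains each $\text{VIF}_j$ from its auxiliary regression and evaluates the Durbin-Watson statistic from the residual series, assuming `X` is a design matrix whose first column is the intercept:

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns.

    X : (n, k+1) design matrix whose first column is the intercept;
        VIFs are returned for the k non-intercept columns.
    """
    vifs = []
    for j in range(1, X.shape[1]):
        xj = X[:, j]
        others = np.delete(X, j, axis=1)          # keeps the intercept column
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2_j = 1.0 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)

def durbin_watson(e):
    """Durbin-Watson statistic d for a residual series e (ordered in time)."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```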
## Applications and Examples

### Simple Linear Regression Case

In [simple linear regression](/page/Simple_linear_regression), ordinary least squares (OLS) estimates the linear relationship between a dependent [variable](/page/Variable) $Y$ and a single independent [variable](/page/Variable) $X$ by minimizing the [sum](/page/Sum) of squared residuals. The model is expressed as $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon_i$ are the errors. The OLS estimators for these parameters are given by

\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X},

where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, respectively.[](https://www.amherst.edu/system/files/media/1287/SLR_Leastsquares.pdf) These estimators are the best linear unbiased estimates under the Gauss-Markov assumptions.

To assess the significance of the slope $\beta_1$, a t-test is conducted under the null hypothesis $H_0: \beta_1 = 0$. The test statistic is

t = \frac{\hat{\beta}_1}{\text{se}(\hat{\beta}_1)}, \qquad \text{se}(\hat{\beta}_1) = \frac{s}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2}},

where $s = \sqrt{\frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{n-2}}$ is the [residual](/page/Residual) standard error.[](https://www.stat.cmu.edu/~hseltman/309/Book/chapter9.pdf) Under the [null hypothesis](/page/Null_hypothesis), the [t-statistic](/page/T-statistic) follows a [t-distribution](/page/T-distribution) with $n-2$ [degrees of freedom](/page/Degrees_of_freedom), allowing a [p-value](/page/P-value) to be computed to determine whether the predictor significantly explains variation in the response.

The [coefficient of determination](/page/Coefficient_of_determination), $R^2$, measures the [goodness of fit](/page/Goodness_of_fit) and is interpreted as the proportion of the total variance in $Y$ explained by the model:

R^2 = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}.

Values of $R^2$ closer to 1 indicate a stronger linear relationship.[](https://online.stat.psu.edu/stat462/node/95/)

### Example Computation

Consider a dataset on the effects of LSD concentration ($X$) on math performance scores ($Y$), with $n=7$ observations: (1.17, 78.93), (2.97, 58.20), (3.26, 67.47), (4.69, 37.47), (5.83, 45.65), (6.00, 32.92), (6.41, 29.97). The sample means are $\bar{X} \approx 4.333$ and $\bar{Y} \approx 50.087$. Using the OLS formulas, the slope is $\hat{\beta}_1 \approx -9.009$ and the intercept is $\hat{\beta}_0 \approx 89.123$, yielding the fitted line $\hat{Y} = 89.123 - 9.009X$. The residual standard error is $s \approx 7.129$, and $\text{se}(\hat{\beta}_1) \approx 1.500$. The t-statistic for testing $\beta_1 = 0$ is $t \approx -6.01$ (df = 5, p < 0.001), indicating a significant negative relationship. Finally, $R^2 \approx 0.878$, meaning the model explains about 87.8% of the variance in scores.[](https://users.stat.ufl.edu/~winner/sta6208/notes1.pdf)
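Assuming the data are exactly as listed above, this example can be reproduced with a few lines of Python using only the closed-form simple-regression formulas; the printed values should agree with the reported estimates up to rounding:

```python
import numpy as np
from scipy import stats

# LSD concentration (X) and math score (Y) from the example above
X = np.array([1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41])
Y = np.array([78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97])

n = X.size
Sxx = np.sum((X - X.mean()) ** 2)
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx   # slope
beta0 = Y.mean() - beta1 * X.mean()                     # intercept

resid = Y - (beta0 + beta1 * X)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))               # residual standard error
se_beta1 = s / np.sqrt(Sxx)
t_stat = beta1 / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
r2 = 1 - np.sum(resid ** 2) / np.sum((Y - Y.mean()) ** 2)

print(beta0, beta1, s, se_beta1, t_stat, p_value, r2)
```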
### Multiple Regression with Real Data

To illustrate the application of ordinary least squares (OLS) in multiple regression, consider Francis Galton's classic 1885 dataset on family heights, which records the heights of 934 adult children from 205 English families, with separate measurements for fathers, mothers, and children (heights in inches). This dataset allows child height to be modeled as a function of both parental heights, accounting for potential gender differences in inheritance patterns. A common approach fits separate analyses by child gender to capture distinct effects, using OLS to estimate the parameters. The [design matrix](/page/Design_matrix) $ \mathbf{X} $ is constructed as an $ n \times (k+1) $ matrix, where $ n $ is the number of observations (e.g., 481 for sons), the first column is a [vector](/page/Vector) of ones for the intercept, and the remaining columns contain the predictors: father's height and mother's height. The response [vector](/page/Vector) $ \mathbf{y} $ contains the child heights. The OLS estimator $ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $ is then computed, typically via statistical software such as [R](/page/R) or Python's statsmodels library, since inverting $ \mathbf{X}^T \mathbf{X} $ directly can be numerically ill-conditioned for larger datasets (see the code sketch at the end of this section).

For sons, the fitted model is $ \hat{y} = \beta_0 + \beta_1 \cdot \text{father's height} + \beta_2 \cdot \text{mother's height} $, with least-squares estimates $ \hat{\beta}_1 = 0.36 $ and $ \hat{\beta}_2 = 0.25 $, indicating that a one-inch increase in father's [height](/page/Height) predicts a 0.36-inch increase in son's height, holding mother's height constant, while the mother's effect is smaller but positive. For daughters, the coefficients are $ \hat{\beta}_1 \approx 0.31 $ for the [father](/page/Father) and $ \hat{\beta}_2 \approx 0.28 $ for the [mother](/page/Mother), showing comparable parental influences.[](https://arxiv.org/pdf/1508.02942) The multiple $ R^2 $ is approximately 0.45 for sons and 0.42 for daughters, so the models explain 42-45% of the variance in child [height](/page/Height). These results illustrate how OLS isolates partial effects in multivariate settings, revealing the nuanced [heritability](/page/Heritability) patterns originally noted by Galton. Model diagnostics include plotting the residuals (observed minus fitted values) against the fitted values to assess linearity and homoscedasticity; in Galton's data, the residual plots show no strong patterns, supporting the assumptions, though some heteroscedasticity appears at the height extremes. An extended model incorporating gender as a dummy variable (a binary indicator for male children) reaches $ R^2 = 0.66 $, improving the fit by accounting for an average sex-based height difference of about 5 inches.

For a modern multivariate example, the [Boston](/page/Boston) Housing dataset (506 observations from 1970 U.S. [Census](/page/Census) tracts) models [median](/page/Median) housing value (MV, in &#36;1,000s) against 13 predictors, including structural factors such as average rooms per dwelling (RM), socioeconomic indicators such as the lower-status population proportion (LSTAT), and environmental variables such as nitrogen oxide concentration (NOX). The [design matrix](/page/Design_matrix) $ \mathbf{X} $ includes an intercept column plus these predictors, and $ \hat{\beta} $ is again estimated via OLS software. A semilog specification, $ \log(\text{MV}) = \beta_0 + \sum \beta_k x_k $, yields $ R^2 = 0.81 $, indicating strong explanatory power.[](https://www.journals.elsevier.com/journal-of-environmental-economics-and-management) Key coefficients from the hedonic model include $ \hat{\beta}_{\text{RM}} \approx 0.11 $ (a one-room increase raises [log](/page/Log) value by 0.11, or about 12% at the [mean](/page/Mean) MV), $ \hat{\beta}_{\text{LSTAT}} = -0.015 $ (a 1% higher low-status [population](/page/Population) share lowers log value by about 1.5%), and $ \hat{\beta}_{\text{NOX}^2} = -0.0064 $ (a [quadratic](/page/Quadratic) term capturing nonlinear [air pollution](/page/Air_pollution) effects, reducing value by roughly &#36;1,613 per pphm increase in NOX at the means). These estimates reveal trade-offs, such as how [accessibility](/page/Accessibility) (e.g., distance to [employment](/page/Employment), DIS) affects values positively while [crime](/page/Crime) (CRIM) and taxes (TAX) affect them negatively. [Residual](/page/Residual) plots versus fitted values confirm approximate linearity, and the $ R^2 = 0.81 $ underscores OLS's utility in policy-relevant hedonic pricing, though multicollinearity among predictors such as the industrial proportion (INDUS) and NOX warrants caution in [inference](/page/Inference).
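Returning to the Galton-style specification, the sketch below illustrates the fitting procedure described above; the arrays `father`, `mother`, and `child` are hypothetical stand-ins rather than Galton's actual measurements, and the solve uses `numpy.linalg.lstsq` to avoid forming $(\mathbf{X}^T \mathbf{X})^{-1}$ explicitly:

```python
import numpy as np

# Hypothetical height measurements (inches); substitute the real Galton columns.
father = np.array([70.0, 68.5, 72.0, 65.5, 71.0])
mother = np.array([64.0, 62.5, 66.0, 60.0, 65.5])
child  = np.array([69.5, 66.0, 71.5, 63.0, 70.0])

# Design matrix: intercept column followed by the two predictors.
X = np.column_stack([np.ones_like(father), father, mother])
y = child

# Solve the least-squares problem with a numerically stable solver.
beta_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
e = y - fitted                                   # residuals for diagnostics

r2 = 1 - e @ e / np.sum((y - y.mean()) ** 2)     # multiple R^2
print("coefficients:", beta_hat, "R^2:", r2)

# Equivalent fit with standard errors and t-tests:
#   import statsmodels.api as sm
#   print(sm.OLS(y, X).fit().summary())
```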

References

  1. [1]
    4.1.4.1. Linear Least Squares Regression
  2. [2]
    [PDF] The Method of Least Squares - The University of Texas at Dallas
    It exists with several variations: Its simpler version is called ordinary least squares (OLS), a more sophisticated version is called weighted least squares ( ...
  3. [3]
    The Origins of Ordinary Least Squares Assumptions – Feature Column
    Mar 1, 2022 · A technique called ordinary least squares (OLS), aka linear regression, is a principled way to pick the “best” line.
  4. [4]
    [PDF] Simple Linear Regression - School of Statistics
    Jan 4, 2017 · Ordinary Least Squares: Scalar Form. The ordinary least squares (OLS) problem is min b0,b1∈R n. X i=1. (yi − b0 − b1xi). 2 and the OLS ...<|separator|>
  5. [5]
    [PDF] Multiple Linear Regression - School of Statistics
    Jan 4, 2017 · Overview of MLR Model. Scalar Model Form. MLR Model: Form. The multiple linear regression model has the form yi = b0 + p. X j=1 bjxij + ei for i ...
  6. [6]
    [PDF] Topic 3 Chapter 5: Linear Regression in Matrix Form
    The SLR Model in Scalar Form. Yi = β0 + β1Xi + i where i ∼iid N(0,σ. 2. ) ... The Multiple Regression Model. Yi = β0 + β1Xi,1 + β2Xi,2 + ... + βp−1Xi,p−1 ...
  7. [7]
    Gauss and the Invention of Least Squares - Project Euclid
    The most famous priority dispute in the history of statistics is that between Gauss and Legendre, over the discovery of the method of least squares.
  8. [8]
    [PDF] OLS in Matrix Form
    The assumption of no autocorrelation (uncorrelated errors) means that cov(εi, εj | X) = 0 ∀ i ≠ j, i.e. knowing something about the disturbance term for one.
  9. [9]
    [PDF] Matrix Algebra for OLS Estimator
    and U as an n × 1 vector of error terms. The linear multiple regression model in matrix form is. Y = Xβ + U. • Read Appendix D of the textbook.
  10. [10]
    5.4 - A Matrix Formulation of the Multiple Regression Model
    Here, we review basic matrix algebra, as well as learn some of the more important multiple regression formulas in matrix form.
  11. [11]
    The Method of Least Squares
    A least-squares solution solves the equation Ax = b as closely as possible, in the sense that the sum of the squares of the difference b − Ax is minimized.
  12. [12]
    [PDF] Lecture 1 Least Squares
    CLM – OLS. • Example: One explanatory variable model. (A1') DGP: y = β1+ β2 x + ε. Objective function: S x;𝜽 = ∑ ε = ∑ y – β1 – β2 x. F.o.c. (2 ...
  13. [13]
    Regression Analysis: Method of Least Squares - MIT
    Apr 15, 1998 · The method of least squares is a very common technique used for this purpose. ... squared errors over all the observations is minimized.
  14. [14]
    [PDF] Chapter 2: simple regression model
    i is called residual sum squares (RSS). Squaring is intended to penalize big error and avoid cancelation of positive and negative errors. Also a squared ...
  15. [15]
    Introductory Econometrics Chapter 14: The Gauss-Markov Theorem
    This last statement is often stated in shorthand as “OLS is BLUE” (best linear unbiased estimator) and is known as the Gauss–Markov theorem.
  17. [17]
    [PDF] OLS: Estimation and Standard Errors - MIT OpenCourseWare
    The OLS procedure is nothing more than finding the orthogonal projection of y on the subspace spanned by the regressors, because then the vector of residuals ...
  18. [18]
    [PDF] Simple Linear Regression - Kosuke Imai
    Consider simple linear regression without an intercept: β̂ = Σᵢ₌₁ⁿ ... Orthogonal projection matrix or “Hat” matrix: Ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy.
  19. [19]
    [PDF] Lecture 11 Weighted Least Squares and Review
    Oct 7, 2015 · From a geometric interpretation of the least squares estimator, we introduce an important matrix PX called the projection matrix. P = X(XtX).
  20. [20]
    [PDF] Data Analysis in Atmospheric and Oceanic Sciences
    We started with a geometric interpretation, examining a regression line on a 2D x-y diagram. Following that, we considered the algebraic interpretation of ˆα ...
  21. [21]
    [PDF] Linear regression
    Geometric interpretation. ▷ Any vector XT β is in the span of the rows of X. ▷ The OLS estimate is the closest vector to y that can be represented in this ...
  22. [22]
    [PDF] Lecture 11: Maximum likelihood - MS&E 226: “Small” Data
    Now, we have shown that in addition if we assume the εi are i.i.d. normal random variables, then OLS is the maximum likelihood estimate.
  23. [23]
    [PDF] Lecture 6: The Method of Maximum Likelihood for Simple Linear ...
    Sep 19, 2015 · As you will recall, the estimators for the slope and the intercept exactly match the least squares estimators. This is a special property of ...
  24. [24]
    [PDF] Topic 15: Maximum Likelihood Estimation - Arizona Math
    In particular, ordinary (unweighted) least square estimators are unbiased. In computing the optimal values using introductory differential calculus, the ...
  25. [25]
    2.2. Estimation of the Parameters - Statistics and Population
    This gives an explicit formula for the ordinary least squares (OLS) or maximum likelihood estimator of the linear parameters:.
  26. [26]
    [PDF] Finite-Sample Properties of OLS - Princeton University
    Are the OLS Assumptions Satisfied? To justify the use of least squares, we need to make sure that Assumptions 1.1–. 1.4 are satisfied for the equation (1.7.4) ...
  27. [27]
    [PDF] GMM 1. OLS as a Method of Moment Estimator Consider a simple ...
    We use instrumental variable estimation using say z as instruments. Assume number of instruments=L and L ≥ K. The population moment conditions are: E(zi'ui)=0.
  28. [28]
    [PDF] Section 2 Simple Regression
    OLS is an estimator selected by the method of least squares and method of moments regardless of the underlying model (as long as the relevant moments exist). • ...
  29. [29]
    Gauss Markov theorem - StatLect
    Assumptions. OLS is linear and unbiased. What it means to be best. The covariance matrix of the OLS estimator. OLS is BLUE. Assumptions. The regression model is ...
  30. [30]
    The Gauss-Markov Theorem and BLUE OLS Coefficient Estimates
    When your model satisfies the assumptions, the Gauss-Markov theorem states that the OLS procedure produces unbiased estimates that have the minimum variance.
  31. [31]
    [PDF] Heteroskedasticity
    Heteroskedasticity means that the variance of the errors is not constant across observations. • In particular the variance of the errors may be a function of.
  32. [32]
    [PDF] Section 8 Heteroskedasticity
    The Breusch-Pagan test is a formal way to test whether the error variance depends on anything observable.
  33. [33]
    Why use OLS when it is assumed there is heteroscedasticity?
    Nov 26, 2018 · Under heteroscedasticity, OLS remains unbiased and consistent, but you lose efficiency. So unless you're certain of the form of ...
  34. [34]
    Heteroscedasticity: Causes and Consequences - SPUR ECONOMICS
    Feb 8, 2023 · The coefficients end up having larger standard errors and lower precision in the presence of heteroscedasticity. Hence, OLS estimators become ...
  35. [35]
    [PDF] a simple test for heteroscedasticity and random
    Econometrica, Vol. 47, No. 5 (September, 1979). A SIMPLE TEST FOR HETEROSCEDASTICITY AND RANDOM. COEFFICIENT VARIATION. BY T. S. BREUSCH AND A. R. PAGAN¹. A ...
  36. [36]
    A simple test for heteroscedasticity and random coefficient variation ...
    Sep 1, 1979 · A simple test for heteroscedastic disturbances in a linear regression model is developed using the framework of the Lagrangian multiplier ...
  37. [37]
    T.2.3 - Testing and Remedial Measures for Autocorrelation | STAT 501
    Here we present some formal tests and remedial measures for dealing with error autocorrelation. Durbin-Watson Test. We usually assume that the error terms ...
  38. [38]
    Durbin Watson Test Explained: Autocorrelation in Regression Analysis
    Positive autocorrelation in a stock means that if the price fell yesterday, it's likely to fall today as well. Negative autocorrelation implies that if a ...
  39. [39]
    Explain Serial Correlation and How It Affects Statistical Inference
    Dec 21, 2022 · The positive serial correlation makes the OLS standard errors for the regression coefficients underestimate the true standard errors. Moreover, ...
  40. [40]
    [PDF] Impact of Autocorrelation on OLS Estimates
    Even in the presence of autocorrelation, OLS estimates are unbiased. The variance of β̂1 in the presence of autocorrelation: By definition, ...
  41. [41]
    TESTING FOR SERIAL CORRELATION IN LEAST SQUARES ...
    J. DURBIN, G. S. WATSON; TESTING FOR SERIAL CORRELATION IN LEAST SQUARES REGRESSION. I, Biometrika, Volume 37, Issue 3-4, 1 December 1950, Pages 409–428, h.
  42. [42]
    [PDF] A Heteroskedasticity-Consistent Covariance Matrix Estimator and a ...
    Econometrica, Volume 48, May 1980, Number 4. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.
  43. [43]
    [PDF] Generalized Least Squares, Heteroskedastic and Autocorrelated ...
    Note: HAC not only corrects for autocorrelation, but also for heteroskedasticity. Do not be alarmed if you see slightly different HAC standard errors in ...
  44. [44]
    [PDF] Lecture 11 GLS
    (easy for heteroscedasticity, complicated for autocorrelation.) – Wald tests and F-tests with usual asymptotic χ2 distributions. Generalized Least Squares (GLS).
  45. [45]
    Violating the normality assumption may be the lesser of two evils
    When data are not normally distributed, researchers are often uncertain whether it is legitimate to use tests that assume Gaussian errors, or whether one ...
  46. [46]
    [PDF] When BLUE is not best: non-normal errors and the linear model
    Least squares (LS) is the best estimator (BLUE) when errors are normal. If errors are not normal, other non-linear estimators may be more efficient.
  47. [47]
    A Test for Normality of Observations and Regression Residuals - jstor
    Bera, A.K. & Jarque, C. M. (1982). Model specification tests: A simultaneous approach. J. Econometrics 20,. 59-82. Bowman, K.O. & Shenton ...
  48. [48]
    Detection of Influential Observation in Linear Regression
    A new measure based on confidence ellipsoids is developed for judging the contribution of each data point to the determination of the least squares estimate.
  49. [49]
    Robust Regression: Asymptotics, Conjectures and Monte Carlo
    September, 1973 Robust Regression: Asymptotics, Conjectures and Monte Carlo. Peter J. Huber · DOWNLOAD PDF + SAVE TO MY LIBRARY. Ann. Statist. 1(5): 799-821 ...
  50. [50]
    Bounds on Rounding Errors in Linear Regression Models - jstor
    Abstract. Rounding errors in regression analysis occur either systematically over all observations in a design matrix X or in just one observation.
  51. [51]
    [PDF] Yet Another Proof of the Gauss-Markov Theorem
    THEOREM. GAUSS-MARkOV. The OLS estimator is BLUE. The acronym BLUE stands for Best Linear Unbiased Estimator, i.e., the one with the smallest variance.
  52. [52]
    [PDF] ORDINARY LEAST SQUARES AND B.L.U.E 1
    This document aims to provide a concise and clear proof that the ordinary least squares model is BLUE. BLUE stands for Best, Linear, Unbiased, Estimator. In ...
  53. [53]
    [PDF] The Gauss-Markov Theorem - STA 211 - Stat@Duke
    Mar 7, 2023 · The Gauss-Markov Theorem asserts that under some assumptions, the OLS estimator is the “best” (has the lowest variance) among all estimators in ...
  54. [54]
    [PDF] Nicolas Christou Gauss-Markov theorem
    The Gauss-Markov theorem states that these OLS estimates have the smallest variance among all the linear unbiased estimators. We say that the OLS estimates ...
  55. [55]
    [PDF] Regression #4: Properties of OLS Estimator (Part 2)
    We now move on to discuss an important result, related to the efficiency of the OLS estimator, known as the Gauss-Markov Theorem. This theorem states: ...
  56. [56]
    [PDF] General formulas for bias and variance in OLS
    The OLS estimator is β̂ = (X′X)⁻¹X′Y. At the moment, no assumptions are imposed on ε. Lemma 1. β̂ = β + (X′X)⁻¹X′ε. Proof.
  58. [58]
    [PDF] A Simple Proof of the FWL (Frisch-Waugh-Lovell) Theorem
    Dec 28, 2005 · Waugh (1933) demonstrated a remarkable property of the method of least squares in a paper published in the very first volume of Econometrica.
  59. [59]
    [PDF] Linear regression diagnostics - DSpace@MIT
    LINEAR REGRESSION DIAGNOSTICS*. Soy E. Welsch and Edwin Kuh. Massachusetts Institute of Technology and. NBER Computer Research Center. WP 923-77. April 1977.
  60. [60]
    [PDF] Asymptotic Theory for OLS - Colin Cameron
    Examples include: (1) bN is an estimator, say 7θ; (2) bN is a component of an estimator, such as N-1 Σi xiui; (3) bN is a test statistic.
  61. [61]
    [PDF] Regression #3: Properties of OLS Estimator - Purdue University
    In this lecture, we establish some desirable properties associated with the OLS estimator. These include proofs of unbiasedness and consistency for both ˆβ and.
  62. [62]
    [PDF] Week 5: Simple Linear Regression - Brandon Stewart
    Gauss-Markov Theorem. All estimators unbiased linear. OLS is efficient in the class of unbiased, linear estimators. OLS is BLUE--best linear unbiased estimator.
  63. [63]
    [PDF] Lecture 7 Asymptotics of OLS
    Theorem: Convergence for sample moments. Under certain assumptions (for example, i.i.d. with finite mean), sample moments converge in probability to their ...
  65. [65]
    A Heteroskedasticity-Consistent Covariance Matrix Estimator and a ...
    Together, Assumptions 1-3 allow the multivariate Liapounov central limit theorem given by White [23] to be applied. The asymptotic normality result is as ...
  66. [66]
    [PDF] Asymptotics for Least Squares - University of California, Berkeley
    Asymptotics for least squares uses weaker assumptions, estimates the best linear predictor, and its consistency and asymptotic normality are shown using the ...
  67. [67]
    [PDF] The Multiple Linear Regression Model - Kurt Schmidheiny
    Sep 17, 2025 · The multiple linear regression model and its estimation using ordinary least squares (OLS) is doubtless the most widely used tool in ...
  68. [68]
    [PDF] OLS using Matrix Algebra - my.SMU
    Exercise: Show that the matrix M is symmetric and idempotent, with trace equal to N −K. Exercise: Consider another “M” matrix which we'll call “M0”: M0 = I(N×N) ...
  69. [69]
    [PDF] Chapter 3: Topics in Regression
    Show that the residuals ei = Yi − Ŷi for i = 1, 2,..., n, can be written in terms of the hat matrix H as e = (I−H)Y. 3.24. For the simple linear ...
  70. [70]
    [PDF] Regression Analysis: Basic Concepts
    R2 = (1 − SSR/SST) is 1 minus the proportion of the variation in yi that is unexplained. It shows the proportion of the variation in yi that is accounted for ...
  71. [71]
    Regression Analysis | SPSS Annotated Output - OARC Stats - UCLA
    The value of R-square was .489, while the value of Adjusted R-square was .479 Adjusted R-squared is computed using the formula 1 – ((1 – Rsq)(N – 1 )/ (N – k – ...
  72. [72]
    [PDF] Chapter 3: Multiple Regression - Purdue Department of Statistics
    F-statistic we just computed. Page 37. 5 Confidence interval: estimation of the mean response. We may construct a confidence interval on the mean response at ...
  73. [73]
    4.1 - Residuals | STAT 462
    The basic idea of residual analysis, therefore, is to investigate the observed residuals to see if they behave “properly.”
  74. [74]
    Understanding Diagnostic Plots for Linear Regression Analysis | UVA Library
  75. [75]
    5.2.4. Are the model residuals well-behaved?
    The overall pattern of the residuals should be similar to the bell-shaped pattern observed when plotting a histogram of normally distributed data. We emphasize ...
  76. [76]
    Tests for Specification Errors in Classical Linear Least-Squares ...
    THE objectives of this paper are two. The first is to derive the distributions of the classical linear least-squares residuals under a variety of ...
  77. [77]
    Generalized Inverses, Ridge Regression, Biased Linear Estimation ...
    Apr 9, 2012 · The paper exhibits theoretical properties shared by generalized inverse estimators, ridge estimators, and corresponding nonlinear estimation procedures.
  78. [78]
    Detecting Multicollinearity Using Variance Inflation Factors | STAT 462
    Many regression analysts often rely on what are called variance inflation factors (VIF) to help detect multicollinearity.
  79. [79]
    Testing for Serial Correlation in Least Squares Regression: I - jstor
    A great deal of use has undoubtedly been made of least squares regression methods in circumstances in which they are known to be inapplicable.
  80. [80]
    [PDF] Simple Linear Regression Least Squares Estimates of β0 and β1
    Ŷ = µY|X = β0 + β1X. This document derives the least squares estimates of β0 and β1. It is simply for your own information. You will not be held responsible ...
  81. [81]
    [PDF] Chapter 9 Simple Linear Regression - Statistics & Data Science
    In simple regression the p-value for the null hypothesis H0 : β1 = 0 comes from the t-test for b1. If applicable, a similar test is made for β0. SPSS also gives ...
  82. [82]
    2.5 - The Coefficient of Determination, r-squared | STAT 462
    The coefficient of determination or r-squared value, denoted r 2 , is the regression sum of squares divided by the total sum of squares.
  83. [83]
    [PDF] 1 Simple Linear Regression - Statistics
    ... Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi. H0 : β1 = β2. (k = 1). Yi = β0+β1Xi1+β1Xi2+β3Xi3+β4Xi4+εi == β0+β1(Xi1+Xi2)+β3Xi3+β4Xi4+εi = β0+β1X∗.