
Linear regression

Linear regression is a statistical method used to model the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables) by fitting a straight line to observed data, where the best-fitting line is determined by minimizing the sum of the squares of the differences between observed and predicted values, known as the least squares method. This approach assumes a linear relationship, often expressed for simple linear regression as Y = a + bX + \epsilon, where Y is the dependent variable, X is the independent variable, a is the y-intercept, b is the slope, and \epsilon represents the error term. The method extends to multiple linear regression, incorporating several predictors as Y = a + b_1X_1 + b_2X_2 + \dots + b_nX_n + \epsilon. Key assumptions underlying linear regression include linearity in the parameters, independence of errors, homoscedasticity (constant variance of errors), and often normality of the error distribution for inference purposes, though the estimation itself does not require normality. These assumptions can be checked using diagnostic plots such as scatterplots for linearity and residual plots for homoscedasticity and independence. Violations may necessitate data transformations or alternative models, but the technique remains robust for many applications because the central limit theorem yields approximately normal estimators in large samples. Linear regression serves multiple purposes, including describing relationships between variables, estimating unknown values of the dependent variable, and predicting future outcomes or prognostication, such as identifying prognostic factors in clinical studies. It is widely applied across fields like economics, finance, biology, and the social sciences for tasks ranging from forecasting sales based on advertising spend to analyzing the impact of environmental factors on crop yields. The model's simplicity and interpretability, with coefficients directly indicating the change in the dependent variable per unit change in an independent variable, make it a mainstay of statistical practice. The origins of linear regression trace back to the development of the least squares method, first published by Adrien-Marie Legendre in 1805 for astronomical calculations, though Carl Friedrich Gauss claimed prior invention around 1795 and formalized its probabilistic justification in 1809. The term "regression" was coined by Francis Galton in the 1880s during his studies on heredity with sweet pea and human stature data, observing how offspring traits "regressed" toward the population mean, with Karl Pearson later refining the mathematical framework in the 1890s through correlation and multiple regression extensions. This evolution transformed least squares from a computational tool into a cornerstone of modern inferential statistics.

Fundamentals

Definition and Model Formulation

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The model assumes that the conditional mean of the response variable, given the predictors, can be expressed as a linear combination of the predictors. This approach is foundational in statistics and is widely applied in fields such as economics, biology, and engineering for predictive modeling and inference. In its general matrix formulation for multiple linear regression, the model is expressed as \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \mathbf{Y} is an n \times 1 vector of observed responses, \mathbf{X} is an n \times p design matrix containing the predictor variables (with the first column typically being a column of ones to account for the intercept), \boldsymbol{\beta} is a p \times 1 vector of unknown regression coefficients, and \boldsymbol{\varepsilon} is an n \times 1 vector of random error terms. The error terms satisfy E(\boldsymbol{\varepsilon}) = \mathbf{0} and \mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}_n, where \sigma^2 is the error variance and \mathbf{I}_n is the n \times n identity matrix, implying that the errors have mean zero and are uncorrelated with constant variance. The linearity in the model refers specifically to the parameters \boldsymbol{\beta}, meaning the response is a linear function of these coefficients, though the predictors in \mathbf{X} may involve nonlinear transformations of the original variables (e.g., polynomials or interactions). The intercept term \beta_0 is included as the first element of \boldsymbol{\beta}, corresponding to the constant column in \mathbf{X}, and represents the expected value of \mathbf{Y} when all predictors are zero. This formulation allows the conditional expectation E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta} to serve as the mean response surface, derived from the zero mean of the errors: E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta} + E(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}, assuming the errors are independent of the predictors. For the simple case with a single predictor, the model simplifies to y_i = \beta_0 + \beta_1 x_i + \varepsilon_i for i = 1, \dots, n, where y_i is the i-th response, x_i is the predictor value, \beta_0 is the intercept, \beta_1 is the slope coefficient, and \varepsilon_i is the error term with E(\varepsilon_i) = 0 and \mathrm{Var}(\varepsilon_i) = \sigma^2. This setup captures the essence of the linear relationship, where the expected response E(y_i \mid x_i) = \beta_0 + \beta_1 x_i increases or decreases linearly with x_i, depending on the sign of \beta_1.
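The matrix formulation can be made concrete with a small numerical sketch. The following Python snippet, using synthetic data purely for illustration, builds a design matrix with an intercept column of ones and generates responses from \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}:

```python
import numpy as np

# Minimal sketch of the matrix formulation Y = X beta + eps (synthetic data).
rng = np.random.default_rng(0)
n = 5
x1 = rng.uniform(0, 10, n)                 # a single original predictor
X = np.column_stack([np.ones(n), x1])      # design matrix: intercept column of ones, then x1
beta = np.array([2.0, 0.5])                # "true" coefficients (beta_0, beta_1)
eps = rng.normal(0.0, 1.0, n)              # errors with mean zero, constant variance
Y = X @ beta + eps                         # observed responses
print(X.shape, Y.shape)                    # (5, 2) (5,)
```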

Simple Linear Regression Example

To illustrate simple linear regression, consider a hypothetical dataset consisting of heights (in inches, denoted as X) and weights (in pounds, denoted as Y) for 10 adult males, drawn from a larger body measurements study. The data are presented in the following table:
Individual   Height, X (inches)   Weight, Y (pounds)
1            67.75                154.25
2            72.25                173.25
3            66.25                154.00
4            72.25                184.75
5            71.25                184.25
6            74.75                210.25
7            69.75                181.00
8            72.50                176.00
9            74.00                191.00
10           73.50                198.25
A scatter plot of these points shows a clear positive linear trend, with weight increasing as height increases, though some variability exists around the trend line. The ordinary least squares method fits the model by estimating the slope \beta_1 and intercept \beta_0. The slope is given by \beta_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}, where \text{Cov}(X, Y) is the sample covariance and \text{Var}(X) is the sample variance of X. Here, the sample means are \bar{x} = 71.425 inches and \bar{y} = 180.7 pounds, yielding \text{Cov}(X, Y) \approx 43.07 and \text{Var}(X) \approx 7.51, so \beta_1 \approx 5.73 pounds per inch. The intercept is then \beta_0 = \bar{y} - \beta_1 \bar{x} \approx 180.7 - (5.73)(71.425) \approx -228.6 pounds. Thus, the fitted equation is \hat{y} = -228.6 + 5.73x. Using this equation, the predicted weight for a new individual with height x = 70 inches is \hat{y} = -228.6 + 5.73(70) \approx 172.5 pounds. To assess the fit, residuals are computed as e_i = y_i - \hat{y}_i for each observation. For example, for the first individual (height 67.75 inches, weight 154.25 pounds), \hat{y}_1 \approx -228.6 + 5.73(67.75) \approx 159.6 pounds, so e_1 \approx 154.25 - 159.6 = -5.35 pounds. Residuals quantify the model's prediction errors and play a key role in evaluating whether the data align with the assumptions of linearity and independence of errors.
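The calculations above can be reproduced with a few lines of code. The sketch below, assuming NumPy is available, computes the slope as the sample covariance divided by the sample variance, the intercept from the means, and the residuals; the printed values should match the worked numbers up to rounding:

```python
import numpy as np

# Reproduces the worked example: OLS slope and intercept for the 10 height/weight pairs.
height = np.array([67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, 74.00, 73.50])
weight = np.array([154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, 176.00, 191.00, 198.25])

slope = np.cov(height, weight, ddof=1)[0, 1] / np.var(height, ddof=1)  # Cov(X, Y) / Var(X)
intercept = weight.mean() - slope * height.mean()                      # b0 = ybar - b1 * xbar
print(round(slope, 2), round(intercept, 1))      # roughly 5.73 and -228.6

fitted = intercept + slope * height
residuals = weight - fitted                       # e_i = y_i - yhat_i
print(round(intercept + slope * 70, 1))           # predicted weight at 70 inches, about 172.5
```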

Notation and Basic Interpretation

In linear regression, the dependent variable is typically denoted as y, representing the outcome or response of interest, while the independent variables are denoted as x_1, x_2, \dots, x_p, where p is the number of predictors. The model parameters include the coefficients \beta_0, \beta_1, \dots, \beta_p, where \beta_0 is the intercept and \beta_j (for j = 1, \dots, p) are the slope coefficients associated with each predictor. The error term's variance is denoted as \sigma^2, capturing the variability not explained by the predictors, and the sample size is n, the number of observations. The coefficient \beta_j is interpreted as the expected change in the dependent variable y for a one-unit increase in the independent variable x_j, holding all other predictors constant. This partial-effect interpretation highlights the unique contribution of each predictor to the response in a multivariate setting. The coefficient of determination, R^2, measures the proportion of the total variance in y that is explained by the model, ranging from 0 (no explanatory power) to 1 (perfect fit). The intercept \beta_0 represents the expected value of y when all independent variables x_j = 0 for j = 1, \dots, p. This baseline value provides a reference point for the response under the condition of zero predictor values, though its practical relevance depends on whether such a scenario is meaningful in the data context. The magnitude and interpretation of coefficients are sensitive to the units of measurement for both y and the x_j; for instance, changing the scale of a predictor from meters to centimeters rescales the corresponding \beta_j by a factor of 100, altering its numerical value while preserving the underlying relationship. Standardization of variables can mitigate such scaling effects, yielding coefficients that reflect relative importance in standard deviation units.
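To illustrate the unit-sensitivity of coefficients, the hypothetical sketch below fits the same synthetic relationship twice, once with the predictor expressed in meters and once in centimeters, showing how the slope rescales while the fitted relationship is unchanged:

```python
import numpy as np

# Rescaling a predictor from meters to centimeters divides its estimated slope by 100;
# the underlying fitted relationship is the same.
rng = np.random.default_rng(1)
x_m = rng.uniform(1.0, 2.0, 50)                  # hypothetical predictor in meters
y = 3.0 + 4.0 * x_m + rng.normal(0, 0.5, 50)     # synthetic response

def ols_slope(x, y):
    X = np.column_stack([np.ones_like(x), x])    # intercept plus single predictor
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(ols_slope(x_m, y))          # slope per meter, about 4
print(ols_slope(x_m * 100, y))    # slope per centimeter, about 0.04
```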

Assumptions and Limitations

Core Assumptions

The assumptions underlying the linear regression model ensure that the ordinary least squares (OLS) estimator is unbiased, consistent, and efficient for estimation and inference. These assumptions, collectively known as the Gauss-Markov assumptions when excluding normality, form the foundation for the model's theoretical properties, including the BLUE (best linear unbiased estimator) property of OLS. While some are required for unbiased estimation and others primarily for valid inference, violations can compromise the reliability of estimates and tests. The linearity assumption requires that the conditional expectation of the dependent variable Y given the predictors X is a linear function of the parameters: \mathbb{E}(Y \mid X) = X \beta, where \beta is the vector of coefficients and X includes an intercept column. This implies that the model correctly captures the systematic relationship between predictors and the expected response, with deviations from this line attributed solely to random error. In the standard formulation Y = X \beta + \varepsilon, linearity ensures that the error term \varepsilon has a conditional mean of zero, \mathbb{E}(\varepsilon \mid X) = 0, preventing systematic bias in predictions. Independence of the errors assumes that the error terms \varepsilon_i for different observations i are statistically independent, meaning \text{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0 for all i \neq j. This condition, stronger than the mere uncorrelatedness required in the Gauss-Markov framework, is essential in cross-sectional or experimental data to ensure that observations do not influence one another through the errors, supporting the validity of standard error calculations and confidence intervals. Homoscedasticity requires that the variance of the errors is constant across all levels of the predictors: \text{Var}(\varepsilon_i \mid X) = \sigma^2 for all i, where \sigma^2 is a positive constant. This equal spread of errors around the regression line ensures that the variance-covariance matrix of the errors is spherical (\sigma^2 I), which is crucial for the efficiency of OLS estimates and the reliability of t-tests and F-tests. Without it, the precision of estimates would vary systematically with predictor values, leading to inefficient estimation and unreliable inference. Normality of the errors is assumed for valid statistical inference: \varepsilon_i \mid X \sim N(0, \sigma^2) independently for each i. This Gaussian distribution facilitates exact finite-sample inference, such as the t-distribution for coefficient significance and the F-distribution for overall model fit, particularly in small samples. Although not required for the consistency or unbiasedness of OLS point estimates, it becomes unnecessary for large-sample inference, where the central limit theorem provides approximate normality of the estimators. Finally, the absence of perfect multicollinearity assumes that the predictor matrix X is of full column rank, ensuring (X^T X) is invertible and that the coefficients \beta can be uniquely estimated. This prevents linear dependence among the predictors, which would otherwise make individual effects indistinguishable and render OLS undefined. High but imperfect multicollinearity may inflate variances but does not violate this core requirement.

Consequences of Violations

Violations of assumptions in linear regression can lead to biased estimates, inefficient predictions, and invalid statistical inferences, undermining the reliability of the model for both explanatory and predictive purposes. Specifically, these breaches affect the ordinary least squares (OLS) estimator's properties, such as unbiasedness and efficiency, and the validity of standard errors, t-tests, and confidence intervals. While OLS remains unbiased under many violations, the consequences often manifest in overstated precision or unreliable hypothesis testing, particularly in finite samples. Nonlinearity in the relationship between predictors and the response variable results in systematically biased predictions, as the fitted line attempts to approximate a curved or nonadditive relationship with a straight line, leading to over- or under-predictions across parts of the predictor range. This violation also tends to underestimate the true variance of the errors, inflating measures of model fit like R² and producing overly narrow intervals that fail to capture the actual uncertainty. Heteroscedasticity, where error variance changes with the level of predictors, renders OLS estimates inefficient, meaning they have larger variance than the best linear unbiased estimator, though unbiasedness is preserved. More critically, it invalidates the usual standard errors, leading to unreliable t-tests and F-tests that may incorrectly reject or fail to reject hypotheses; for instance, standard errors can be either underestimated or overestimated depending on the form of heteroscedasticity. To address this for inference, heteroscedasticity-consistent standard errors, such as White's estimator, can provide robust alternatives without altering the point estimates. Autocorrelation in error terms, common in time series data, causes the OLS estimates to remain unbiased and consistent but inefficient, with underestimated standard errors that inflate the significance of coefficients and lead to overly optimistic t-statistics and p-values. This serial correlation also artificially inflates the R² statistic, suggesting stronger explanatory power than actually exists, and renders confidence intervals too narrow, increasing the risk of Type I errors in hypothesis testing. Non-normality of errors does not bias OLS estimates or affect their consistency, but it compromises exact inference in small samples, where the t- and F-distributions no longer hold exactly, leading to inaccurate p-values and intervals. However, for large samples, the central limit theorem ensures asymptotic normality of the estimators, making inference robust; in small samples, severe non-normality can distort significance tests, though the impact diminishes with sample size exceeding 30–50 observations. In multiple linear regression, multicollinearity among predictors inflates the variance of the coefficient estimates, making them highly sensitive to small changes in data and leading to unstable predictions with wide confidence intervals, even if the overall model fit remains adequate. This high variance does not bias the estimates but reduces their precision, often resulting in insignificant individual t-tests despite a significant overall F-test; the variance inflation factor (VIF), calculated as \text{VIF}_j = \frac{1}{1 - R_j^2} where R_j^2 is the R² from regressing predictor j on the others, quantifies this, with VIF > 10 indicating problematic multicollinearity.
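As one illustration of handling a violation for inference, the sketch below (synthetic data; statsmodels assumed available) compares conventional standard errors with a heteroscedasticity-consistent variant; the point estimates are identical, only the standard errors change:

```python
import numpy as np
import statsmodels.api as sm

# Compare conventional and heteroscedasticity-consistent (White-type) standard errors
# on synthetic data whose error spread grows with x. HC3 is one robust variant.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 * (1 + x), 200)

X = sm.add_constant(x)
fit_ols = sm.OLS(y, X).fit()                      # usual (non-robust) standard errors
fit_robust = sm.OLS(y, X).fit(cov_type="HC3")     # robust standard errors
print(fit_ols.bse, fit_robust.bse)                # the two sets of SEs generally differ
print(np.allclose(fit_ols.params, fit_robust.params))  # True: point estimates unchanged
```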

Diagnostic Techniques

Diagnostic techniques in linear regression involve graphical and statistical methods to evaluate the validity of model assumptions after estimation, such as linearity, homoscedasticity, normality, independence, and lack of multicollinearity. These tools primarily rely on residuals, defined as the differences between observed and fitted values, to identify potential violations that could lead to unreliable inferences. By examining residuals, analysts can detect patterns indicative of model misspecification or data issues, enabling informed decisions about model refinement. Graphical methods are foundational for assessing several assumptions. A scatterplot of residuals against fitted values helps diagnose nonlinearity and heteroscedasticity; under ideal conditions, points should scatter randomly around zero without systematic trends or funneling patterns that suggest non-constant variance. Similarly, a quantile-quantile (Q-Q) plot compares the distribution of residuals to a theoretical normal distribution, with points aligning closely to the reference line indicating approximate normality; deviations in the tails may signal outliers or non-normality. For detecting autocorrelation in residuals, particularly in time-series data, the Durbin-Watson test provides a statistical measure. The test statistic is calculated as DW = \frac{\sum_{i=1}^{n-1} (e_{i+1} - e_i)^2}{\sum_{i=1}^n e_i^2}, where e_i are the residuals and n is the sample size; values near 2 indicate no first-order autocorrelation, while values below 1.5 or above 2.5 suggest positive or negative autocorrelation, respectively, with critical values depending on the number of predictors and sample size. Multicollinearity among predictors is assessed using the variance inflation factor (VIF) for each regressor x_j, defined as VIF_j = \frac{1}{1 - R_j^2}, where R_j^2 is the coefficient of determination from regressing x_j on all other predictors; VIF values exceeding 5 or 10 typically indicate problematic multicollinearity, inflating coefficient variances and standard errors. Influence and leverage diagnostics identify observations that disproportionately affect the fitted model. Leverage measures, derived from the hat matrix H = X(X^T X)^{-1} X^T, quantify how much each observation pulls the fit toward itself, with diagonal elements h_{ii} ranging from 0 to 1; high leverage (h_{ii} > 2p/n, where p is the number of parameters) warrants scrutiny. Cook's distance further combines leverage and residual size to measure overall influence: D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}, where e_i is the residual, MSE is the mean squared error, and values exceeding 4/n, or the 50th percentile of the F_{p, n-p} distribution, suggest influential points that may distort estimates. Outliers, which can distort results, are detected using studentized residuals, defined as t_i = e_i / \sqrt{MSE (1 - h_{ii})}, which adjust each residual for its estimated standard deviation. These follow a t-distribution under the model, allowing formal tests; absolute values greater than about 2.5 to 3 (depending on the chosen significance level) flag potential outliers, as they deviate markedly from the variation expected under the model.
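A minimal diagnostic workflow, assuming statsmodels and synthetic data, might compute several of these quantities at once; the thresholds applied to the outputs remain the analyst's judgment call:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Durbin-Watson, VIF, leverage, Cook's distance, and studentized residuals on synthetic data.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=n)      # mildly correlated with x1
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(durbin_watson(fit.resid))                    # near 2 if no first-order autocorrelation
print([variance_inflation_factor(X, j) for j in range(1, X.shape[1])])  # VIF for x1, x2

infl = fit.get_influence()
leverage = infl.hat_matrix_diag                    # h_ii values
cooks_d = infl.cooks_distance[0]                   # Cook's distance per observation
studentized = infl.resid_studentized_external      # externally studentized residuals
print(leverage.max(), cooks_d.max(), np.abs(studentized).max())
```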

Estimation Methods

Least Squares

Ordinary least squares (OLS) is the standard method for estimating the parameters of a linear regression model by minimizing the sum of squared residuals between observed and predicted values. Introduced by Adrien-Marie Legendre in 1805 for determining comet orbits and independently justified probabilistically by Carl Friedrich Gauss in 1809, OLS selects the parameter vector \beta that best fits the data in a least-squares sense. The residuals are the differences e_i = y_i - \hat{y}_i, and the objective function to minimize is the residual sum of squares: S(\beta) = (Y - X\beta)'(Y - X\beta), where Y is the n \times 1 vector of observations, X is the n \times (k+1) design matrix (including an intercept column of ones), and \beta is the (k+1) \times 1 parameter vector. To derive the OLS estimator, differentiate S(\beta) with respect to \beta and set the result to zero, yielding the normal equations: X'X \beta = X'Y. This system of k+1 equations in k+1 unknowns arises from the first-order conditions for minimization. Assuming X'X is invertible (which requires the columns of X to be linearly independent and of full rank), the closed-form solution is: \hat{\beta} = (X'X)^{-1} X'Y. This explicit formula allows direct computation of the estimates without iterative methods, provided the matrix inversion is feasible. Under the core assumptions of the linear regression model (linearity in parameters, strict exogeneity of errors, no perfect multicollinearity, and homoskedasticity of errors), the OLS estimator possesses desirable statistical properties. It is unbiased, meaning E(\hat{\beta}) = \beta, so on average the estimates equal the true parameters. Furthermore, by the Gauss-Markov theorem, \hat{\beta} is the best linear unbiased estimator (BLUE), with the smallest variance among all linear unbiased estimators of \beta. The variance-covariance matrix of \hat{\beta} is: \text{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}, where \sigma^2 is the variance of the error terms; this quantifies the precision of the estimates, with diagonal elements giving the variances of individual \hat{\beta}_j. Once \hat{\beta} is obtained, the model generates fitted values \hat{Y} = X \hat{\beta}, which serve as predictions for the response variable. The mean squared error (MSE) of these in-sample predictions, also known as the error variance estimate, is computed from the residuals as \widehat{\text{MSE}} = \frac{(Y - \hat{Y})'(Y - \hat{Y})}{n - k - 1}, providing a measure of prediction accuracy and an unbiased estimate of \sigma^2.
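The closed-form estimator, its covariance, and the unbiased error-variance estimate can be computed directly; the sketch below uses synthetic data and solves the normal equations rather than inverting X'X explicitly, which is numerically preferable:

```python
import numpy as np

# Closed-form OLS on synthetic data with k = 2 predictors plus an intercept.
rng = np.random.default_rng(4)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # solves the normal equations X'X b = X'y
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - k - 1)   # unbiased estimate of sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)     # Var(beta_hat) = sigma^2 (X'X)^{-1}
print(beta_hat, np.sqrt(np.diag(cov_beta)))        # estimates and their standard errors
```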

Maximum Likelihood Estimation

In the linear regression model, maximum likelihood estimation (MLE) provides a probabilistic framework for parameter estimation by assuming that the errors follow a normal distribution, i.e., \epsilon_i \sim \mathcal{N}(0, \sigma^2) independently for i = 1, \dots, n. Under this assumption, the likelihood function for the parameters \beta (the vector of coefficients) and \sigma^2 (the error variance) is L(\beta, \sigma^2) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta)^2 \right), where y_i is the observed response, x_i is the vector of predictors (including the intercept), and the sum of squared residuals measures the discrepancy between observed and predicted values. This formulation treats the observations as independent draws from \mathcal{N}(X\beta, \sigma^2 I_n), where X is the n \times (p+1) design matrix. To find the MLE, it is computationally convenient to maximize the log-likelihood instead: \ell(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \| y - X\beta \|^2, where \| \cdot \|^2 denotes the squared Euclidean norm. Differentiating \ell with respect to \beta yields the score equations \frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2} X^T (y - X\beta) = 0, which simplify to the normal equations X^T X \beta = X^T y under the assumption that X^T X is invertible. Solving these gives the MLE for \beta, \hat{\beta}_{\text{MLE}} = (X^T X)^{-1} X^T y, which coincides exactly with the ordinary least squares (OLS) estimator. For \sigma^2, substituting \hat{\beta}_{\text{MLE}} and differentiating \ell with respect to \sigma^2 produces \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n \hat{e}_i^2 = \frac{\| y - X\hat{\beta}_{\text{MLE}} \|^2}{n}, where \hat{e}_i = y_i - x_i^T \hat{\beta}_{\text{MLE}} are the residuals; this estimator is biased downward, in contrast to the unbiased OLS variance estimate s^2 = \frac{\| y - X\hat{\beta}_{\text{OLS}} \|^2}{n - p - 1}. Under standard regularity conditions (including normality of errors and a fixed design matrix with full rank), the MLE \hat{\beta}_{\text{MLE}} is consistent and asymptotically normal as n \to \infty: \sqrt{n} (\hat{\beta}_{\text{MLE}} - \beta) \xrightarrow{d} \mathcal{N}(0, \sigma^2 \text{plim}(n^{-1} X^T X)^{-1}), or more precisely, \hat{\beta}_{\text{MLE}} \sim \mathcal{N}(\beta, \sigma^2 (X^T X)^{-1}) in finite samples under exact normality. These properties enable asymptotic inference on \beta. The Wald test assesses hypotheses of the form H_0: R\beta = r (where R is a restriction matrix and r a vector) by forming the quadratic form W = n (R\hat{\beta}_{\text{MLE}} - r)^T [R \hat{V} R^T]^{-1} (R\hat{\beta}_{\text{MLE}} - r), where \hat{V} = \hat{\sigma}^2 (X^T X)^{-1} is the estimated asymptotic covariance; under H_0, W \xrightarrow{d} \chi^2_q as n \to \infty, with q = \text{rank}(R). Alternatively, the likelihood ratio test compares the log-likelihoods under the full and restricted models: \text{LR} = 2 [\ell(\hat{\beta}_{\text{MLE}}, \hat{\sigma}^2_{\text{MLE}}) - \ell(\hat{\beta}_R, \hat{\sigma}^2_R)], which also follows \chi^2_q asymptotically under H_0. Both tests leverage the information matrix equality, equating the expected outer product of scores to the negative expected Hessian of the log-likelihood, ensuring their validity for large samples.
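The practical difference between the MLE and the unbiased variance estimate is just the divisor, as the sketch below with synthetic data illustrates; the coefficient estimates themselves are identical for both approaches:

```python
import numpy as np

# Contrast the MLE of sigma^2 (divide by n) with the unbiased estimate (divide by n - p - 1).
rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=p + 1)
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS = MLE for beta under normal errors
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_mle = rss / n                   # biased downward
s2_unbiased = rss / (n - p - 1)        # unbiased
print(sigma2_mle, s2_unbiased)         # the MLE is always the smaller of the two
```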

Alternative Estimators

Least absolute deviation (LAD) estimation provides a robust alternative to ordinary least squares (OLS) by minimizing the sum of absolute residuals rather than squared residuals, making it less sensitive to outliers. The LAD estimator is defined as \hat{\beta}_{\text{LAD}} = \arg\min_{\beta} \sum_{i=1}^n |y_i - \mathbf{x}_i' \beta|, where y_i is the observed response, \mathbf{x}_i are the predictors, and \beta are the coefficients. This objective function corresponds to the conditional median of the response given the predictors, which is inherently robust to extreme values since the median is less affected by outliers than the mean. Computationally, the LAD problem can be reformulated as a linear programming task, allowing efficient solution via standard optimization algorithms like the simplex method. Quantile regression extends the LAD approach to estimate conditional quantiles of the response distribution at any level \tau \in (0,1), offering a more complete picture of the relationship between predictors and the full range of response outcomes. The quantile regression estimator at level \tau is given by \hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^n \rho_{\tau}(y_i - \mathbf{x}_i' \beta), where the check function \rho_{\tau}(u) = u (\tau - I(u < 0)) weights positive and negative residuals asymmetrically based on \tau, with I(\cdot) as the indicator function. When \tau = 0.5, this reduces to median regression, recovering the LAD estimator. Like LAD, quantile regression can be solved using linear programming, and it is particularly useful in heteroscedastic settings or when interest lies in tail behaviors of the response. Ridge regression addresses multicollinearity in the predictors by introducing a penalty term that shrinks the coefficients toward zero, stabilizing estimates when OLS variances become large. The ridge estimator is obtained by minimizing \hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left( \| \mathbf{y} - \mathbf{X} \beta \|^2 + \lambda \| \beta \|^2 \right), where \lambda > 0 is a tuning parameter controlling the degree of shrinkage, \mathbf{y} is the response vector, and \mathbf{X} is the design matrix. This can be derived from the normal equations of OLS, \mathbf{X}' \mathbf{X} \beta = \mathbf{X}' \mathbf{y}, by adding a ridge penalty term \lambda \mathbf{I}: (\mathbf{X}' \mathbf{X} + \lambda \mathbf{I}) \beta = \mathbf{X}' \mathbf{y}, yielding the closed-form solution \hat{\beta}_{\text{ridge}} = (\mathbf{X}' \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}' \mathbf{y}. The bias introduced by shrinkage is traded off against reduced variance, often improving mean squared error in correlated predictor scenarios. Bayesian linear regression incorporates prior beliefs about the coefficients, yielding estimators such as posterior means under a conjugate Gaussian prior, which provides a probabilistic framework for inference beyond point estimates. Assuming a Gaussian likelihood for the errors and a prior \beta \sim \mathcal{N}(\mu_0, \Sigma_0), the posterior distribution of \beta is also Gaussian, with mean \hat{\beta}_{\text{Bayes}} = \left( \mathbf{X}' \mathbf{X} + \Sigma_0^{-1} \right)^{-1} \left( \mathbf{X}' \mathbf{y} + \Sigma_0^{-1} \mu_0 \right) (taking the error variance as known and absorbed into \Sigma_0), resembling a shrunk OLS estimator where the prior acts as regularization. This approach naturally handles multicollinearity through the prior covariance and allows for model comparison via posterior predictive checks.
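A sketch of the ridge estimator in closed form, on synthetic data with a nearly collinear pair of predictors, is shown below; centering the data first and leaving the intercept unpenalized is a common convention, not the only possible one:

```python
import numpy as np

# Closed-form ridge estimator with centered data; the intercept is recovered afterward.
rng = np.random.default_rng(6)
n, p = 60, 4
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)      # nearly collinear pair of predictors
y = 1.5 + X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=n)

Xc = X - X.mean(axis=0)                            # center predictors
yc = y - y.mean()                                  # center response
lam = 1.0                                          # shrinkage strength lambda
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
intercept = y.mean() - X.mean(axis=0) @ beta_ridge
print(beta_ridge, intercept)
```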

Extensions and Variants

Multiple Linear Regression

Multiple linear regression extends the simple linear regression framework to incorporate multiple predictor variables, allowing for the modeling of more complex relationships between the response variable and a set of explanatory factors. The model is expressed as Y_i = \beta_0 + \sum_{j=1}^p \beta_j X_{ij} + \epsilon_i for i = 1, \dots, n, where Y_i is the response, X_{ij} are the predictors, \beta_j are the coefficients, and \epsilon_i are independent errors with mean zero and constant variance. In matrix notation, this becomes \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, where \mathbf{Y} is an n \times 1 vector of responses, \mathbf{X} is an n \times (p+1) design matrix with a column of ones for the intercept, \boldsymbol{\beta} is a (p+1) \times 1 vector of coefficients, and \boldsymbol{\epsilon} is an n \times 1 error vector. This formulation facilitates computational efficiency and theoretical analysis, particularly for estimation via ordinary least squares, where the coefficient vector is \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}, assuming \mathbf{X}^\top \mathbf{X} is invertible. The coefficients \beta_j in multiple linear regression represent partial effects, capturing the change in the expected value of Y associated with a one-unit increase in X_j, while holding all other predictors constant at their means or specific values. Unlike marginal effects in simple regression (where p=1), these partial coefficients adjust for confounding among predictors, isolating the unique contribution of each variable to the response. This interpretation is crucial for covariate adjustment in observational data but assumes the model is correctly specified and free of severe multicollinearity that could inflate standard errors. To assess overall model fit and significance, the coefficient of determination R^2 measures the proportion of variance explained by the predictors, but it increases with added variables regardless of their relevance. The adjusted R^2 addresses this by penalizing model complexity: \bar{R}^2 = 1 - (1 - R^2) \frac{n-1}{n - p - 1}, where n is the sample size and p is the number of predictors; higher values indicate better fit relative to baseline models. For overall significance, the F-test evaluates whether the model explains more variance than an intercept-only model: F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}, which follows an F-distribution with p and n - p - 1 degrees of freedom under the null hypothesis that \beta_j = 0 for all j \geq 1. A significant F-statistic (low p-value) supports retaining the full model. Variable selection in multiple linear regression aims to identify a subset of predictors that balances fit and parsimony, often using stepwise methods guided by information criteria. Forward stepwise selection begins with an intercept-only model and iteratively adds the predictor that most improves fit (e.g., via the largest reduction in the residual sum of squares or the largest partial F-statistic), stopping when no addition yields a significant improvement. Backward stepwise selection starts with all predictors and removes the least contributory one step-by-step until removals degrade fit unacceptably. These procedures commonly employ the Akaike information criterion (AIC) for selection, defined as \text{AIC} = -2 \log L + 2k, where L is the maximized likelihood and k = p + 1 is the number of parameters; lower AIC values favor models with good predictive accuracy penalized for complexity. While computationally efficient, stepwise methods risk overfitting and are sensitive to collinearity, prompting caution in interpretation.
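The fit statistics described above follow directly from the residual and total sums of squares; the sketch below computes them on synthetic data (the AIC here uses the Gaussian form up to an additive constant, and parameter-counting conventions vary between software packages):

```python
import numpy as np

# R^2, adjusted R^2, overall F-statistic, and a Gaussian AIC (up to a constant) from an OLS fit.
rng = np.random.default_rng(7)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.8, 0.0, -1.2]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)              # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)                  # total sum of squares
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))
aic = n * np.log(rss / n) + 2 * (p + 1)            # -2 log L up to a constant, k = p + 1 as above
print(r2, adj_r2, f_stat, aic)
```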

Generalized Linear Models

Generalized linear models (GLMs) extend the framework of linear regression to accommodate response variables that follow distributions other than the normal, such as the binomial or Poisson, by incorporating a link function that connects the mean of the response to a linear predictor. The model consists of three main components: a random component specifying the distribution of the response variable Y, typically from the exponential family (e.g., Poisson for count data or binomial for binary or proportion data); a linear predictor \eta = X\beta, where X is the design matrix and \beta are the coefficients; and a link function g(\mu) = \eta, where \mu = E(Y) is the expected value of the response. This structure allows GLMs to model non-constant variance and non-linear relationships between predictors and the response while maintaining the linear form in the predictors. A key feature of GLMs is the use of canonical link functions, which simplify estimation and interpretation by aligning the link with the natural parameter of the exponential family. For the Gaussian distribution, the canonical link is the identity function g(\mu) = \mu, which recovers the standard linear regression model. In the case of the binomial distribution, the canonical link g(\mu) = \log(\mu / (1 - \mu)) is the logit, suitable for modeling probabilities. For the Poisson distribution, the log link g(\mu) = \log(\mu) serves as the canonical form, enabling the analysis of count data with multiplicative effects. Parameter estimation in GLMs is typically performed using iteratively reweighted least squares (IRLS), an algorithm that iteratively solves weighted least squares problems to maximize the likelihood. In each iteration, a working response is constructed as z_i = \eta_i + (y_i - \mu_i) g'(\mu_i), where \eta_i is the current linear predictor, y_i is the observed response, \mu_i is the current fitted mean, and g'(\mu_i) is the derivative of the link function; the coefficients \beta are then updated by weighted ordinary least squares with weights w_i = 1 / [V(\mu_i) (g'(\mu_i))^2], where V(\mu) is the variance function of the distribution. This process converges to the maximum likelihood estimates under the specified distribution. Goodness-of-fit in GLMs is assessed using the deviance, defined as D = 2 [l(\text{saturated}) - l(\text{fitted})], where l denotes the log-likelihood, analogous to -2 \log L in likelihood ratio tests. The deviance measures the discrepancy between the fitted model and a saturated model that fits the data perfectly, with smaller values indicating better fit; under the null hypothesis of adequate fit, it approximately follows a chi-squared distribution with degrees of freedom equal to the number of observations minus the number of parameters. To handle overdispersion, where the observed variance exceeds that implied by the model (e.g., \phi > 1 for the dispersion parameter in quasi-likelihood approaches), GLMs can incorporate a dispersion parameter \phi such that \text{Var}(Y) = \phi V(\mu), allowing robust inference without altering the mean structure. This extension, introduced to address situations like clustered or heterogeneous data, scales the standard errors by \sqrt{\phi} while preserving the IRLS algorithm.
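A bare-bones IRLS loop for a Poisson GLM with the canonical log link illustrates the working response and weights described above; this is a didactic sketch on synthetic data rather than a production implementation:

```python
import numpy as np

# IRLS for a Poisson GLM with log link: z = eta + (y - mu) g'(mu) and
# weights w = 1 / [V(mu) g'(mu)^2], which simplify to mu when V(mu) = mu and g'(mu) = 1/mu.
rng = np.random.default_rng(8)
n = 200
x = rng.uniform(0, 2, n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(0.3 + 1.1 * x))

beta = np.zeros(2)
for _ in range(25):                        # fixed iteration count stands in for a convergence check
    eta = X @ beta
    mu = np.exp(eta)                       # inverse of the log link
    z = eta + (y - mu) / mu                # working response, since g'(mu) = 1/mu
    w = mu                                 # weights 1 / [V(mu) g'(mu)^2] with V(mu) = mu
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))   # weighted least squares update
print(beta)                                # close to the generating values (0.3, 1.1)
```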

Robust and Regularized Variants

Robust and regularized variants of linear regression extend the classical model to address violations of core assumptions such as homoscedasticity, independence in clustered data, measurement errors in predictors, endogeneity due to correlation between regressors and errors, and challenges in high-dimensional settings where the number of predictors exceeds the number of observations. These methods improve estimation efficiency, reduce bias, and promote sparsity or interpretability while maintaining the linear structure. They are particularly valuable in applied fields like econometrics, social sciences, and machine learning, where real-world data often deviate from ideal conditions. Heteroscedastic models arise when the variance of the errors is not constant across observations, violating the homoscedasticity assumption and leading to inefficient estimates. Weighted least squares (WLS) addresses this by assigning weights inversely proportional to the error variances, yielding more efficient estimators under known or estimated heteroscedasticity. The WLS estimator is given by \hat{\beta} = (X^T W X)^{-1} X^T W Y, where W is a diagonal matrix with entries w_i = 1 / \mathrm{Var}(y_i), and the variances may be estimated iteratively from the residuals of an initial fit. This approach, originally formulated as a generalization of least squares for weighted observations, enhances precision in settings like cross-sectional economic data with varying scales. Hierarchical or multilevel models, also known as random effects models, accommodate data with nested structures, such as students within schools or repeated measures within individuals, where observations are not independent due to group-level variation. These models partition the variation into fixed effects shared across groups and random effects specific to each group, allowing intercepts or slopes to vary hierarchically. A basic two-level model for grouped data is y_{ij} = X_{ij} \beta + Z_{ij} u_j + \epsilon_{ij}, where i indexes observations within group j, u_j \sim N(0, \Sigma_u) captures random effects at the group level, and \epsilon_{ij} \sim N(0, \sigma^2) is the residual error, assumed independent within groups. Estimation typically uses maximum likelihood or restricted maximum likelihood, accounting for the covariance structure induced by the random effects. This framework, developed largely for educational and biomedical research, properly handles within-group dependence and provides more accurate inference for group-varying parameters. Errors-in-variables models arise when the predictors X are measured with error, causing ordinary least squares to yield biased and inconsistent estimates of \beta due to attenuation bias. Total least squares (TLS), a robust alternative, minimizes the perpendicular distances from data points to the fitted hyperplane, perturbing both X and Y to account for errors in all variables. The TLS solution involves the singular value decomposition of the augmented matrix [X \ Y], where \hat{\beta} is derived from the right singular vector corresponding to the smallest singular value. This method, analyzed through numerical linear algebra, yields consistent estimators under classical error assumptions and is widely applied in calibration problems and in computing approximate solutions to overdetermined systems. For cases where errors in X correlate with the true regressors, instrumental variables provide an alternative adjustment by using exogenous instruments Z uncorrelated with the errors but correlated with X. Lasso regression introduces regularization to linear models, particularly useful in high-dimensional settings where p > n (the number of predictors exceeds the sample size), by shrinking coefficients toward zero and performing automatic variable selection through sparsity.
The lasso estimator solves the penalized problem \hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1, where \lambda \geq 0 controls the strength of the L1 penalty on the absolute values of the coefficients, driving the coefficients of irrelevant predictors exactly to zero. This promotes parsimonious models with improved prediction accuracy and interpretability, often outperforming ordinary least squares in scenarios with many correlated or irrelevant features, as demonstrated in simulation studies and real datasets. The lasso combines the benefits of subset selection and ridge-style shrinkage while remaining computationally tractable as a convex optimization problem. Instrumental variables (IV) regression mitigates endogeneity, where regressors correlate with the error term due to omitted variables, simultaneity, or measurement error, leading to biased ordinary least squares estimates. IV uses external instruments Z that are correlated with the endogenous regressors but uncorrelated with the errors, enabling identification of causal effects. The simple IV estimator for a single endogenous regressor is \hat{\beta}_{\text{IV}} = (Z^T X)^{-1} Z^T Y, which can be viewed as a two-stage procedure: first regress X on Z to obtain fitted values \hat{X}, then regress Y on \hat{X}. This approach, rooted in early econometric work on supply-demand systems, provides consistent estimates under valid instrument conditions and is foundational for causal inference in observational data, though it requires careful testing for instrument strength and validity.
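As a small illustration of the sparsity induced by the L1 penalty, the sketch below fits ordinary least squares and a lasso (via scikit-learn, with an arbitrarily chosen penalty) to synthetic data in which only two of ten predictors are truly relevant:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Lasso shrinkage on synthetic data where only 2 of 10 predictors matter.
rng = np.random.default_rng(9)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)          # alpha plays the role of lambda
print(np.round(ols.coef_, 2))               # small but nonzero noise coefficients
print(np.round(lasso.coef_, 2))             # most irrelevant coefficients driven to exactly 0
```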

Applications

Trend Analysis and Prediction

Linear regression serves as a fundamental tool for trend analysis by fitting a straight line to time-series data, capturing the underlying linear pattern over time. The model is typically expressed as y_t = \beta_0 + \beta_1 t + \varepsilon_t, where y_t is the observed value at time t, \beta_0 is the intercept, \beta_1 is the slope representing the trend per unit of time, and \varepsilon_t is the error term, assumed to be normally distributed with zero mean and constant variance. This approach enables detrending, where the linear component is subtracted from the data to isolate random fluctuations or cyclical patterns for further analysis. For prediction, linear regression provides point forecasts by extrapolating the fitted trend line beyond the observed data range. However, to quantify uncertainty, prediction intervals are constructed around these forecasts, given by \hat{y}_{\text{new}} \pm t_{n-p-1, 1-\alpha/2} \, s \sqrt{1 + \mathbf{x}_{\text{new}}' (\mathbf{X}'\mathbf{X})^{-1} \mathbf{x}_{\text{new}}}, where \hat{y}_{\text{new}} is the predicted value, t_{n-p-1, 1-\alpha/2} is the critical value from the t-distribution with n-p-1 degrees of freedom, s is the residual standard error, p is the number of predictors, and \mathbf{x}_{\text{new}} is the vector for the new observation. These intervals widen as extrapolation extends further from the data, reflecting increased uncertainty due to potential violations of model assumptions outside the fitted range. Confidence bands, in contrast, provide intervals for the mean trend function rather than individual predictions and are narrower, following a similar form but omitting the "+1" term inside the square root: \hat{y}_{\text{new}} \pm t_{n-p-1, 1-\alpha/2} \, s \sqrt{\mathbf{x}_{\text{new}}' (\mathbf{X}'\mathbf{X})^{-1} \mathbf{x}_{\text{new}}}. These bands parallel the trend line and are used to assess the reliability of the estimated mean trajectory, such as in visualizing long-term growth patterns. In economic forecasting, linear regression models are commonly applied to predict indicators like GDP growth, often after detrending to focus on deviations from steady-state paths. For instance, quarterly GDP data can be modeled with a linear trend in logged values to estimate growth rates, but prior checks for stationarity, using tests such as the Augmented Dickey-Fuller test, are essential to avoid spurious regressions from non-stationary series. Non-stationarity, indicated by unit roots, implies that shocks have persistent effects, necessitating differencing or cointegration analysis before applying the model. Long-term predictions using linear regression face significant limitations when underlying assumptions drift, such as shifts in the trend due to structural economic changes or evolving relationships between variables. Violations of linearity or stationarity can lead to biased forecasts and unreliable intervals, as the model fails to capture nonlinear or persistent trends that emerge over extended horizons. In such cases, the widening prediction intervals underscore the model's reduced accuracy for distant forecasts, emphasizing the need for periodic re-estimation.
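A hedged sketch of trend fitting with prediction intervals, using statsmodels on a synthetic series, is shown below; note how the interval columns widen as the forecast horizon extends beyond the observed data:

```python
import numpy as np
import statsmodels.api as sm

# Linear trend fit to a synthetic time series with 95% prediction intervals for extrapolated points.
rng = np.random.default_rng(10)
t = np.arange(1, 61)                               # 60 observed periods
y = 10 + 0.5 * t + rng.normal(scale=2.0, size=t.size)

X = sm.add_constant(t.astype(float))
fit = sm.OLS(y, X).fit()

t_new = np.arange(61, 73)                          # forecast 12 periods ahead
X_new = sm.add_constant(t_new.astype(float))
pred = fit.get_prediction(X_new).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]].head())   # intervals widen with horizon
```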

Use in Specific Disciplines

In epidemiology, linear regression is commonly employed to model dose-response relationships, such as regressing disease rates or health outcomes on exposure levels while adjusting for confounders like age, smoking, or environmental factors. For instance, studies have used linear and log-linear regression models to examine the impact of blood lead levels on children's IQ, controlling for variables including maternal IQ, education, and socioeconomic status, revealing that a doubling of blood lead concentration is associated with a 2.6-point IQ decrement. Similarly, linear regression has been applied to assess lead exposure's effect on blood pressure, where log-linear models show a 1.0 mm Hg increase in systolic blood pressure per doubling of blood lead levels after confounder adjustment. These models enable public health policymakers to quantify risks and evaluate interventions, such as lead reduction programs that have yielded substantial economic benefits. In finance, linear regression underpins the capital asset pricing model (CAPM), where the beta coefficient measures an asset's systematic risk relative to the market. The beta is calculated as \beta = \frac{\mathrm{Cov}(R_i, R_m)}{\mathrm{Var}(R_m)}, with R_i as the asset's return and R_m as the market return, derived from regressing excess asset returns on excess market returns. This approach allows investors to estimate expected returns and assess systematic risk, forming a cornerstone of portfolio management and asset pricing since its development in the 1960s. Economists utilize linear regression in regression discontinuity designs to infer causal effects from policy interventions at known cutoff thresholds, comparing outcomes immediately above and below the cutoff to isolate treatment impacts. For example, analyses of vote share cutoffs near zero have estimated incumbency advantages, showing discontinuities of 5-10% in vote shares and 37-57% in reelection probabilities. Other applications include evaluating class size reductions at enrollment thresholds, where regression models reveal effects on student test scores, and financial aid eligibility based on test scores, demonstrating boosts in college enrollment. These designs provide quasi-experimental rigor for assessing policy efficacy, such as in education and electoral systems, under the assumption of continuity in potential outcomes absent the cutoff. In environmental science, linear regression models relate pollutant concentrations to covariates like land use, traffic density, and meteorological factors, including temperature, to map spatial and temporal variations. Land use regression, a form of multiple linear regression, has been used globally to predict nitrogen dioxide (NO₂) levels from variables such as proximity to major roads and satellite-derived emissions, explaining up to 54% of variation in annual concentrations across diverse regions. Temperature serves as a key covariate in such models, as higher values can exacerbate ozone formation or alter particulate matter (PM₂.₅) components like nitrate and sulfate through photochemical reactions and atmospheric stability changes. These applications support air quality forecasting and regulatory assessments, such as identifying emission hotspots influenced by seasonal temperature shifts. Building science leverages linear regression to predict energy consumption based on insulation properties and related variables, aiding in the design of efficient structures. Models regress energy use on factors such as insulation thickness and U-values (measures of thermal transmittance), infiltration rates, and window-to-wall ratios, validated against real-world data with errors under 10%. This approach integrates with building energy simulation to optimize thermal performance and support green certification processes.
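For the CAPM example, beta can be estimated as the slope of a simple regression of excess asset returns on excess market returns; the sketch below uses synthetic return data and a hypothetical risk-free rate purely for illustration:

```python
import numpy as np

# CAPM beta as the slope of excess asset returns on excess market returns (synthetic data).
rng = np.random.default_rng(11)
rf = 0.002                                          # hypothetical risk-free rate per period
market = rng.normal(0.008, 0.04, 120)               # 10 years of monthly market returns
asset = rf + 1.3 * (market - rf) + rng.normal(0, 0.02, 120)   # "true" beta of 1.3

excess_asset = asset - rf
excess_market = market - rf
beta = np.cov(excess_asset, excess_market, ddof=1)[0, 1] / np.var(excess_market, ddof=1)
print(round(beta, 2))                               # close to 1.3
```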

Integration with Machine Learning

In machine learning pipelines, linear regression often begins with feature preprocessing to ensure model stability and interpretability. A key step is standardization, where each feature x_j is transformed to x_j' = \frac{x_j - \mu_j}{\sigma_j}, centering the feature at zero mean and scaling it to unit variance. This preprocessing makes coefficients comparable across features with different scales, preventing those with larger variances from disproportionately influencing the fit, and is particularly beneficial for gradient-based optimization in linear models. Popular machine learning libraries integrate linear regression as a core component, facilitating seamless use within broader workflows. For instance, scikit-learn's LinearRegression class implements ordinary least squares fitting and supports pipeline integration for tasks like prediction and evaluation. Regularized variants, such as ridge and lasso regression, incorporate cross-validation to tune the regularization parameter \lambda; RidgeCV and LassoCV automate this by evaluating multiple \lambda values via k-fold cross-validation, selecting the one minimizing validation error to balance bias and variance. Lasso, in particular, promotes sparsity by driving some coefficients to zero, aiding feature selection in high-dimensional data. To capture nonlinearity while retaining the simplicity of linear regression, basis expansions transform the input space. Polynomial features extend the model to forms like y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon, where higher-degree terms are generated via preprocessing (e.g., scikit-learn's PolynomialFeatures), allowing the linear framework to approximate curved relationships without altering the core estimation machinery. Splines provide a flexible alternative, using piecewise polynomials joined smoothly at knots to avoid the high-degree oscillations of global polynomials, as detailed in foundational statistical learning texts. Linear regression also serves as a base learner in ensemble methods, enhancing predictive performance through iterative refinement. In gradient boosting, weak linear models are sequentially fitted to the residuals of prior fits, building an additive ensemble that progressively corrects remaining errors; this approach, while often using trees, leverages linear bases for scenarios requiring interpretability over complexity. A primary advantage of linear regression in machine learning is its inherent interpretability compared to black-box models like deep neural networks. The coefficients \beta directly quantify feature impacts on the target, but for deeper insights, SHAP values decompose predictions into additive feature attributions, aligning with game-theoretic fairness axioms and extending interpretability to ensemble or regularized linear models.
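A typical composition of these pieces in scikit-learn might look like the following sketch, which standardizes features, expands them to polynomial terms, and selects the ridge penalty by cross-validation on synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import RidgeCV

# Standardize, expand to degree-2 polynomial terms, and fit ridge regression with
# the regularization strength chosen by cross-validation.
rng = np.random.default_rng(12)
X = rng.uniform(-3, 3, size=(200, 2))
y = 1 + 2 * X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    RidgeCV(alphas=np.logspace(-3, 3, 13)),    # candidate regularization strengths
)
model.fit(X, y)
print(model[-1].alpha_)                        # selected penalty
print(model.predict(X[:3]))                    # sample predictions
```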

Historical Development

Early Origins

The foundations of linear regression trace back to the early 19th century, when astronomers sought precise methods to fit observational data to theoretical models, particularly for predicting celestial orbits. In 1805, the French mathematician Adrien-Marie Legendre introduced the method of least squares in the appendix of his work Nouvelles méthodes pour la détermination des orbites des comètes, marking the first formal publication of this technique. Legendre presented it as a practical tool for minimizing the sum of squared residuals between observed and predicted positions of comets, enabling more accurate orbital determinations amid noisy astronomical measurements. This approach, though algorithmic, laid the groundwork for regression by emphasizing error minimization in linear relationships. Shortly thereafter, in 1809, Carl Friedrich Gauss published his own formulation of the method of least squares in the second volume of Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, applying it to astronomical problems such as planetary and cometary orbits. Gauss claimed to have developed the technique as early as 1795, and he used it successfully to predict the position of the asteroid Ceres from the limited observations recorded by Giuseppe Piazzi in 1801. His work extended Legendre's deterministic formulation by incorporating a probabilistic framework, assuming that observational errors followed a normal distribution with mean zero, which justified least squares as the maximum-likelihood estimator for parameter recovery. This error model, derived from the idea that errors arise from numerous small, independent causes, provided a theoretical basis for the normality assumption in regression analysis. The term "regression" itself emerged later in the 19th century through studies of biological inheritance. In 1886, the British polymath Francis Galton coined the phrase in his paper "Regression Towards Mediocrity in Hereditary Stature," published in the Journal of the Anthropological Institute of Great Britain and Ireland. Analyzing data on the heights of 930 adult children and their 205 mid-parentages, Galton observed a tendency for the children of exceptionally tall or short parents to have heights closer to the population average, a phenomenon he termed "regression towards mediocrity." This empirical insight, visualized through scatter plots and fitted lines, highlighted mean reversion in linear relationships and connected least squares estimation to the study of correlated traits, influencing the statistical interpretation of regression.

Key Milestones and Contributors

Building on Galton's empirical observations, Karl Pearson refined the mathematical framework of regression in the 1890s. In 1896, he developed the product-moment correlation coefficient to quantify the strength of linear relationships between variables, and by the early 1900s he had extended these ideas to multiple regression, providing equations for predicting a dependent variable from several predictors and establishing a rigorous basis for biometrical analysis. In the 1920s, Ronald A. Fisher laid foundational groundwork for modern linear regression analysis by developing the analysis of variance (ANOVA) framework, which decomposes total variance into components attributable to different sources, and introducing tests such as the F-test for assessing the significance of regression coefficients. These concepts were detailed in his seminal book Statistical Methods for Research Workers (1925), enabling researchers to evaluate the fit and significance of linear models in experimental data. During the 1930s, Jerzy Neyman and Egon Pearson advanced the integration of hypothesis testing into linear regression by formulating the Neyman-Pearson lemma, which identifies the most powerful tests for simple hypotheses based on likelihood ratios, thereby providing a rigorous basis for inference on model parameters such as slopes and intercepts. Their collaborative work, including the 1933 paper "On the Problem of the Most Efficient Tests of Statistical Hypotheses," established a decision-theoretic approach that complemented Fisher's methods and became essential for validating linear regression assumptions. In the 1940s, Trygve Haavelmo propelled linear regression's application in econometrics by addressing biases in systems of simultaneous equations, where endogenous variables violate classical assumptions; his analysis demonstrated the need for identification strategies, paving the way for instrumental variables (IV) methods to obtain consistent estimates. This insight was articulated in his 1943 paper "The Statistical Implications of a System of Simultaneous Equations," which shifted econometric modeling toward probabilistic frameworks and influenced simultaneous-equations estimation techniques. Computational advancements in the 1960s enhanced the stability of linear regression estimation amid growing data volumes; Gene Golub's 1965 paper "Numerical Methods for Solving Linear Least Squares Problems" introduced an orthogonalization approach using Householder transformations, offering a numerically stable alternative to the direct normal equations by factorizing the design matrix and avoiding ill-conditioning issues. Software developments in the 1970s democratized linear regression by embedding it in user-friendly tools; the Statistical Analysis System (SAS), originating from a 1966 project at North Carolina State University and incorporated in 1976, provided comprehensive procedures for regression modeling, including diagnostics and extensions, making advanced analysis accessible beyond specialists. By the 1990s, the open-source R programming language, initiated in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland as an implementation of the S language, further popularized linear regression through its lm() function and extensible packages, fostering widespread adoption in statistical computing and reproducible research.

    J. DURBIN, G. S. WATSON; TESTING FOR SERIAL CORRELATION IN LEAST SQUARES REGRESSION. I, Biometrika, Volume 37, Issue 3-4, 1 December 1950, Pages 409–428, h.
  31. [31]
    [PDF] Legendre On Least Squares - University of York
    Gauss says in his work on the Theory of Mo- tions of the Heavenly Bodies (1809) that he had made use of this principle since 1795 but that it was first ...
  32. [32]
    Gauss and the Invention of Least Squares - Project Euclid
    The most famous priority dispute in the history of statistics is that between Gauss and Legendre, over the discovery of the method of least squares.
  33. [33]
    Normal equations in simple and multiple linear regression models
    In linear regression analysis, the normal equations are a system of equations whose solution is the Ordinary Least Squares (OLS) estimator of the regression ...
  34. [34]
    Properties of the OLS estimator | Consistency, asymptotic normality
    In this lecture we discuss under which assumptions the OLS (Ordinary Least Squares) estimator has desirable statistical properties such as consistency and ...
  35. [35]
    [PDF] 1 Ordinary Least Squares
    (18). We call ˆσ2 the MSE of the regression and ˆσ the root MSE of the regression. 4.6 The least squares predictor. 4.4.6 The least squares predictor. • Predict.
  36. [36]
    [PDF] Lecture 6: The Method of Maximum Likelihood for Simple Linear ...
    Sep 19, 2015 · The method of maximum likelihood does not always work; there are models where it gives poor or even pathological estimates. For Gaussian-noise ...
  37. [37]
    [PDF] Linear Regression via Maximization of the Likelihood - cs.Princeton
    Sep 17, 2018 · Unsurprisingly, the maximum likelihood estimate in this model (regardless of σ2) is the sample average of the data. MLE Regression with Gaussian ...
  38. [38]
    [PDF] Topic 15: Maximum Likelihood Estimation - Arizona Math
    ... maximum likelihood is equivalent to the least squares criterion for ordinary linear regression. The maximum likelihood estimators α and β give the regression ...
  39. [39]
    1.2 - Maximum Likelihood Estimation | STAT 415
    So, the "trick" is to take the derivative of ln ⁡ L ( p ) (with respect to p ) rather than taking the derivative of L ( p ) . Again, doing so often makes the ...
  40. [40]
    [PDF] Score, Wald, and Likelihood Ratio - MyWeb
    We've now covered the most important theoretical properties of the MLE: it is consistent, asymptotically normal, and efficient.
  41. [41]
    [PDF] Lecture 4 Testing in the Classical Linear Model
    - Wald Tests - Based on the asymptotic normality of the MLE. - Score Tests - Based on the asymptotic normality of the log- likelihood. General Methods. 37.
  42. [42]
    Optimal Estimation of Executive Compensation by Linear ...
    Least Absolute Deviation. 14 July 2023. Introduction. 1 March 2023 ... Charnes, W. W. Cooper, R. O. Ferguson, (1955) Optimal Estimation of Executive ...
  43. [43]
    Regression Quantiles - jstor
    1 (January, 1978). REGRESSION QUANTILES'. BY ROGER KOENKER AND GILBERT BASSETT, JR. A simple minimization problem yielding the ordinary sample quantiles in ...
  44. [44]
    Ridge Regression: Biased Estimation for Nonorthogonal Problems
    Ridge Regression: Biased Estimation for Nonorthogonal Problems. Arthur E. Hoerl University of Delaware and E. 1. du Pont de Nemours & Co. &. Robert W. Kennard ...
  45. [45]
    Lesson 5: Multiple Linear Regression (MLR) Model & Evaluation
    Be able to interpret the coefficients of a multiple regression model. Understand what the scope of the model is in the multiple regression model. Understand ...
  46. [46]
    [PDF] Multiple Linear Regression
    The multiple regression model equation is. Y = β0. + β1 x1. + β2 x2. + ... + βp ... The multiple regression model can be written in matrix form. Page 5. 5.
  47. [47]
    5.3 - The Multiple Linear Regression Model | STAT 462
    SSTO. Coefficient of Determination, R-squared, and Adjusted R-squared. As in simple linear regression, R^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}, and represents ...
  48. [48]
    4.1 - Variable Selection for the Linear Model | STAT 897D
    Stepwise Selection · First we approximate the response variable y with a constant (i.e., an intercept-only regression model). · Then we gradually add one more ...Missing: multiple | Show results with:multiple
  49. [49]
    [PDF] Chapter 10: Variable Selection - Purdue Department of Statistics
    Small value of AIC are preferred. An alternative to AIC is the Bayes Information criterion, or BIC, which is given by. BICp = nlog(SSRes ...<|control11|><|separator|>
  50. [50]
    Generalized Linear Models - jstor
    Nelder (1966) gives examples of inverse polynomials calculated using the first approximation of the method in this paper. More generally, as shown in Nelder ( ...
  51. [51]
    Quasi-Likelihood Functions, Generalized Linear Models, and ... - jstor
    Wedderburn. The approach described in this paper sheds new light on some existing data-analytic techniques, and also suggests new ones. An example is given ...
  52. [52]
    IV.—On Least Squares and Linear Combination of Observations
    Sep 15, 2014 · In a series of papers WF Sheppard (1912, 1914) has considered the approximate representation of equidistant, equally weighted, and uncorrelated observations.
  53. [53]
    Application of hierarchical linear models to assessing change.
    This two-stage conceptualization, illustrated with data on Head Start children, allows investigators to model individual change, predict future development.Missing: paper | Show results with:paper
  54. [54]
    An Analysis of the Total Least Squares Problem
    Golub and C. F. Van Loan, SIAM J. Numer. Anal., 17 (1980), pp. 883–893) introduced this method into the field of numerical analysis and developed an algorithm ...
  55. [55]
    Regression Shrinkage and Selection Via the Lasso - Oxford Academic
    SUMMARY. We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute valu.
  56. [56]
    [PDF] Retrospectives Who Invented Instrumental Variable Regression?
    In Sewall Wright's (1921,1923) initial expositions, the method of path coefficients is equivalent to multiple regression using ordinary least squares.
  57. [57]
    5.1 The linear model | Forecasting: Principles and Practice (2nd ed)
    In the simplest case, the regression model allows for a linear relationship between the forecast variable y y and a single predictor variable x x : yt=β0+β1xt+ε ...
  58. [58]
    [PDF] Chapter 3: Regression Methods for Trends
    ▶ We commonly express such time series models using the form Yt = µt + Xt, where µt is a trend and Xt is a random process with mean zero for all t.Missing: ε_t source
  59. [59]
    3.3 - Prediction Interval for a New Response | STAT 501
    In this section, we are concerned with the prediction interval for a new response, y n e w , when the predictor's value is x h .
  60. [60]
    Confidence and prediction bands (linear regression) - GraphPad
    Confidence bands define the interval of the best-fit line, while prediction bands show where 95% of data points are expected to fall. Use confidence bands to ...
  61. [61]
    14 Introduction to Time Series Regression and Forecasting
    In this context we also discuss the concept of stationarity, an important property which has far-reaching consequences. Most empirical applications in this ...
  62. [62]
    [PDF] Measuring Uncertainty about Long-Run Predictions
    Most of the existing literature on long-horizon forecasting stresses the diffi culty of constructing good long-term forecasts under uncertainty about the.
  63. [63]
    Testing the Dose–Response Specification in Epidemiology - NIH
    A linear model suggests that equal reduction in population BPb is accompanied by equal reduction in health consequence from any starting level of lead. Under a ...Missing: seminal papers
  64. [64]
    How to control confounding effects by statistical analysis - PMC - NIH
    There are various ways to exclude or control confounding variables including Randomization, Restriction and Matching.
  65. [65]
    CAPITAL ASSET PRICES: A THEORY OF MARKET EQUILIBRIUM ...
    CAPITAL ASSET PRICES: A THEORY OF MARKET EQUILIBRIUM UNDER CONDITIONS OF RISK* - Sharpe - 1964 - The Journal of Finance - Wiley Online Library.Introduction · II. Optimal Investment Policy... · III. Equilibrium in the Capital...
  66. [66]
    [PDF] Regression Discontinuity Designs in Economics - Princeton University
    Regression Discontinuity (RD) designs were first introduced by Donald L. Thistlethwaite and Donald T. Campbell. (1960) as a way of estimating treatment.Missing: seminal | Show results with:seminal
  67. [67]
    Global Land Use Regression Model for Nitrogen Dioxide Air Pollution
    May 18, 2017 · Nitrogen dioxide is a common air pollutant with growing evidence of health impacts independent of other common pollutants such as ozone and ...
  68. [68]
    Climate impact on ambient PM2.5 elemental concentration in the ...
    We evaluated the impacts of weather changes on seven major components of ambient PM2.5, including elemental carbon (EC), organic carbon (OC), nitrate, sulfate, ...
  69. [69]
    Using Regression Model to Develop Green Building Energy ... - MDPI
    For the rest of the independent building design variables, linear regression models are used to analyse the relationship between it and energy consumption.
  70. [70]
    7.3. Preprocessing data — scikit-learn 1.7.2 documentation
    In general, many learning algorithms such as linear models benefit from standardization of the data set (see Importance of Feature Scaling).Importance of Feature Scaling · 7.4. Imputation of missing values · MaxAbsScaler
  71. [71]
    LinearRegression
    ### Overview of LinearRegression in scikit-learn
  72. [72]
    RidgeCV — scikit-learn 1.7.2 documentation
    Ridge regression with built-in cross-validation. See glossary entry for cross-validation estimator. By default, it performs efficient Leave-One-Out Cross- ...
  73. [73]
    PolynomialFeatures
    **Summary of Polynomial Features for Basis Expansion in Linear Regression:**
  74. [74]
    Greedy function approximation: A gradient boosting machine.
    Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification.
  75. [75]
    Adrien-Marie Legendre Publishes the Method of Least Squares
    In 1805 French mathematician Adrien-Marie Legendre Offsite Link published Nouvelles méthodes pour la détermination des orbites des comètes Offsite Link .
  76. [76]
    Nouvelles méthodes pour la détermination des orbites des comètes
    Jun 18, 2008 · Publication date: 1805 ; Publisher: F. Didot ; Collection: americana ; Book from the collections of: New York Public Library ; Language: French.Missing: squares | Show results with:squares
  77. [77]
    Carl Friedrich Gauss & Adrien-Marie Legendre Discover the Method ...
    Carl Friedrich Gauss Offsite Link is credited with developing the fundamentals of the basis for least-squares analysis in 1795 at the age of eighteen.
  78. [78]
  79. [79]
    Gauss on least-squares and maximum-likelihood estimation
    Apr 2, 2022 · Gauss' 1809 discussion of least squares, which can be viewed as the beginning of mathematical statistics, is reviewed.
  80. [80]
    [PDF] Gauss on least-squares and maximum-likelihood estimation1
    Dec 18, 2021 · Abstract: Gauss' 1809 discussion of least squares, which can be viewed as the beginning of mathematical statistics, is reviewed.
  81. [81]
    Regression Towards Mediocrity in Hereditary Stature - jstor
    ANTHROPOLOGICAL MISCELLANEA. REGRESSION towvards MEDIOCRITY in IIEREDITARY STATURE. By FRANCIS GALTON, F.R.S., &c. [WITH PLATES IX AND X.]
  82. [82]
    [PDF] F-TEST and Analysis of Variance (ANOVA) - Lucknow University
    ANOVA was developed by statistician and eugenicist Ronald Fisher. Though many statisticians including Fisher worked on the development of ANOVA model but it ...
  83. [83]
    IX. On the problem of the most efficient tests of statistical hypotheses
    On the problem of the most efficient tests of statistical hypotheses. Jerzy Neyman.
  84. [84]
    The Statistical Implications of a System of Simultaneous Equations
    VOLUME 11 JANUARY, 1943 NUMBER 1. THE STATISTICAL IMPLICATIONS OF A SYSTEM. OF SIMULTANEOUS EQUATIONS. By TRYGVE HAAVELMO. 1. INTRODUCTION. Measurement of ...
  85. [85]
    Numerical methods for solving linear least squares problems
    Numerical methods for solving linear least squares problems. G. Golub. Numerische Mathematik volume 7, pages 206–216 (1965)Cite ...Abstract · References
  86. [86]
    SAS History
    In the late 1960s, eight Southern universities came together to develop a general purpose statistical software package to analyze agricultural data.
  87. [87]
    [PDF] The R Project: A Brief History and Thoughts About the Future
    Sep 16, 1997 · Ross Ihaka joins the Department of Statistics at the. University of Auckland. • Robert Gentleman spends sabbatical from the University of.