Linear regression
Linear regression is a fundamental statistical method used to model the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables) by fitting a linear equation to observed data. The best-fitting line is determined by minimizing the sum of the squares of the differences between observed and predicted values, a procedure known as the method of least squares.[1] This approach assumes a linear relationship, often expressed for simple linear regression as Y = a + bX + \epsilon, where Y is the dependent variable, X is the independent variable, a is the y-intercept, b is the slope, and \epsilon represents the error term.[2] The method extends to multiple linear regression, incorporating several predictors as Y = a + b_1X_1 + b_2X_2 + \dots + b_nX_n + \epsilon.[1]

Key assumptions underlying linear regression include linearity in the parameters, independence of errors, homoscedasticity (constant variance of errors), and often normality of the error distribution for inference purposes, though the least squares estimation itself does not require normality.[3] These assumptions can be checked using diagnostic plots such as scatterplots for linearity and residual plots for homoscedasticity and normality.[1] Violations may necessitate data transformations or alternative models, but the technique remains robust for many applications because the central limit theorem makes the estimators approximately normal in large samples.[2]

Linear regression serves multiple purposes, including describing relationships between variables, estimating unknown values of the dependent variable, and predicting future outcomes or prognostication, such as identifying risk factors in medical studies (e.g., predicting blood pressure from age and weight).[1] It is widely applied across fields like economics, biology, engineering, and social sciences for tasks ranging from forecasting sales based on advertising spend to analyzing the impact of environmental factors on crop yields.[2] The model's simplicity and interpretability, with coefficients directly indicating the change in the dependent variable per unit change in an independent variable, make it a cornerstone of statistical analysis.[1]

The origins of linear regression trace back to the development of the least squares method, first published by Adrien-Marie Legendre in 1805 for astronomical calculations, though Carl Friedrich Gauss claimed prior invention around 1795 and formalized its probabilistic justification in 1809.[4] The term "regression" was coined by Francis Galton in the 1880s during his studies on heredity, first with sweet peas and later with human stature data, observing how offspring traits "regressed" toward the population mean; Karl Pearson later refined the mathematical framework in the 1890s through correlation and multiple regression extensions.[5] This evolution transformed least squares from a computational tool into a foundation of modern inferential statistics.[4]
Fundamentals
Definition and Model Formulation
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The model assumes that the conditional expectation of the response variable, given the predictors, can be expressed as a linear combination of the predictors. This approach is foundational in statistics and is widely applied in fields such as economics, biology, and engineering for predictive modeling and inference.[6]

In its general matrix formulation for multiple linear regression, the model is expressed as \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \mathbf{Y} is an n \times 1 vector of observed responses, \mathbf{X} is an n \times p design matrix containing the predictor variables (with the first column typically being a vector of ones to account for the intercept), \boldsymbol{\beta} is a p \times 1 vector of unknown regression coefficients, and \boldsymbol{\varepsilon} is an n \times 1 vector of random error terms. The error terms satisfy E(\boldsymbol{\varepsilon}) = \mathbf{0} and \mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}_n, where \sigma^2 is the error variance and \mathbf{I}_n is the n \times n identity matrix, implying that the errors have mean zero and are uncorrelated with constant variance.[7][8]

The linearity in the model refers specifically to the parameters \boldsymbol{\beta}, meaning the response is a linear function of these coefficients, though the predictors in \mathbf{X} may involve nonlinear transformations of the original variables (e.g., polynomials or interactions). The intercept term \beta_0 is included as the first element of \boldsymbol{\beta}, corresponding to the constant column in \mathbf{X}, and represents the expected value of \mathbf{Y} when all predictors are zero. This formulation allows for the conditional expectation E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta} to serve as the mean response surface, derived from the zero mean of the errors: E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta} + E(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}, assuming the errors are independent of the predictors.[9][8][6]

For the simple case with a single predictor, the model simplifies to y_i = \beta_0 + \beta_1 x_i + \varepsilon_i for i = 1, \dots, n, where y_i is the i-th response, x_i is the predictor value, \beta_0 is the intercept, \beta_1 is the slope coefficient, and \varepsilon_i is the error term with E(\varepsilon_i) = 0 and \mathrm{Var}(\varepsilon_i) = \sigma^2. This setup captures the essence of the linear relationship, where the expected response E(y_i \mid x_i) = \beta_0 + \beta_1 x_i increases or decreases linearly with x_i, depending on the sign of \beta_1.[9][6]
Simple Linear Regression Example
To illustrate simple linear regression, consider a hypothetical dataset consisting of heights (in inches, denoted as X) and weights (in pounds, denoted as Y) for 10 adult males, drawn from a larger body measurements study.[10] The data are presented in the following table; a least-squares fit to these values is sketched after the table.

| Individual | Height (X) | Weight (Y) |
|---|---|---|
| 1 | 67.75 | 154.25 |
| 2 | 72.25 | 173.25 |
| 3 | 66.25 | 154.00 |
| 4 | 72.25 | 184.75 |
| 5 | 71.25 | 184.25 |
| 6 | 74.75 | 210.25 |
| 7 | 69.75 | 181.00 |
| 8 | 72.50 | 176.00 |
| 9 | 74.00 | 191.00 |
| 10 | 73.50 | 198.25 |
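
The least-squares line for these data can be computed directly from the model formulation above. The following minimal Python sketch (using NumPy; the variable names are illustrative) estimates the intercept and slope by solving the normal equations:

```python
import numpy as np

# Heights (X, inches) and weights (Y, pounds) for the 10 individuals in the table
x = np.array([67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, 74.00, 73.50])
y = np.array([154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, 176.00, 191.00, 198.25])

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X'X beta = X'y (equivalently, use np.linalg.lstsq)
beta = np.linalg.solve(X.T @ X, X.T @ y)
b0, b1 = beta

residuals = y - X @ beta
r2 = 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

print(f"intercept = {b0:.2f}, slope = {b1:.2f}, R^2 = {r2:.3f}")
```

The fitted slope estimates the expected change in weight, in pounds, per additional inch of height for this sample.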
Notation and Basic Interpretation
In linear regression, the dependent variable is typically denoted as y, representing the outcome or response of interest, while the independent variables are denoted as x_1, x_2, \dots, x_p, where p is the number of predictors.[13] The model parameters include the coefficients \beta_0, \beta_1, \dots, \beta_p, where \beta_0 is the intercept and \beta_j (for j = 1, \dots, p) are the slope coefficients associated with each predictor.[13] The error term's variance is denoted as \sigma^2, capturing the variability not explained by the predictors, and the sample size is n, the number of observations.

The coefficient \beta_j is interpreted as the expected change in the dependent variable y for a one-unit increase in the independent variable x_j, holding all other predictors constant.[14] This partial effect highlights the unique contribution of each predictor to the response in a multivariate setting.[14] The coefficient of determination, R^2, measures the proportion of the total variance in y that is explained by the model, ranging from 0 (no explanatory power) to 1 (perfect fit).[15]

The intercept \beta_0 represents the expected value of y when all independent variables x_j = 0 for j = 1, \dots, p.[14] This baseline value provides a reference point for the response under the condition of zero predictor values, though its practical relevance depends on whether such a scenario is meaningful in the data context.[14]

The magnitude and interpretation of coefficients are sensitive to the units of measurement for both y and the x_j; for instance, rescaling a predictor from meters to centimeters rescales the corresponding \beta_j by a factor of 1/100 (a one-unit change now represents one centimeter rather than one meter), altering its numerical value while preserving the underlying relationship.[14] Standardization of variables can mitigate such scaling effects, yielding coefficients that reflect relative importance in standard deviation units.[16]
Assumptions and Limitations
Core Assumptions
The core assumptions underlying the linear regression model ensure that the ordinary least squares (OLS) estimator is unbiased, consistent, and efficient for inference and prediction. These assumptions, collectively known as the Gauss-Markov assumptions when excluding normality, form the foundation for the model's theoretical properties, including the BLUE (best linear unbiased estimator) status of OLS.[17] While some are required for estimation and others primarily for statistical inference, violations can compromise the reliability of coefficient estimates and hypothesis tests.[3]

The linearity assumption requires that the conditional expectation of the dependent variable Y given the predictors X is a linear function of the parameters: \mathbb{E}(Y \mid X) = X \beta, where \beta is the vector of coefficients and X includes an intercept column. This implies that the model correctly captures the systematic relationship between predictors and the expected response, with deviations from this line attributed solely to random error.[17] In the standard formulation Y = X \beta + \varepsilon, linearity ensures that the error term \varepsilon has a conditional mean of zero, \mathbb{E}(\varepsilon \mid X) = 0, preventing systematic bias in predictions.[3]

Independence of the errors assumes that the error terms \varepsilon_i for different observations i are independent, which implies \text{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0 for all i \neq j. This condition, stronger than the mere uncorrelatedness required in the Gauss-Markov framework, is essential in cross-sectional or experimental data to ensure that observations do not influence one another through the errors, supporting the validity of standard error calculations and confidence intervals.[17]

Homoscedasticity requires that the conditional variance of the errors is constant across all levels of the predictors: \text{Var}(\varepsilon_i \mid X) = \sigma^2 for all i, where \sigma^2 is a positive constant. This equal spread of errors around the regression line ensures that the variance-covariance matrix of the errors is spherical (\sigma^2 I), which is crucial for the efficiency of OLS estimates and the reliability of t-tests and F-tests.[17] Without it, the precision of estimates would vary systematically with predictor values, leading to inefficient inference.[3]

Normality of the errors is assumed for valid statistical inference: \varepsilon_i \mid X \sim N(0, \sigma^2) independently for each i. This Gaussian distribution facilitates exact finite-sample inference, such as the t-distribution for coefficient significance and the F-distribution for overall model fit, particularly in small samples.[17] It is not required for the consistency or unbiasedness of OLS point estimates, and it becomes less critical in large samples, where the central limit theorem yields approximately normal estimators.[3]

Finally, the absence of perfect multicollinearity assumes that the predictor matrix X is of full column rank, ensuring (X^T X) is invertible and that the coefficients \beta can be uniquely estimated. This prevents linear dependence among the predictors, which would otherwise make individual effects indistinguishable and render OLS undefined.[17] High but imperfect multicollinearity may inflate variances but does not violate this core requirement.[3]
Consequences of Violations
Violations of the core assumptions in linear regression can lead to biased estimates, inefficient predictions, and invalid statistical inferences, undermining the reliability of the model for both explanatory and predictive purposes.[3] Specifically, these breaches affect the ordinary least squares (OLS) estimator's properties, such as unbiasedness, efficiency, and the validity of standard errors, t-tests, and confidence intervals.[18] While OLS remains unbiased under many violations, the consequences often manifest in overstated precision or unreliable hypothesis testing, particularly in finite samples.[19]

Nonlinearity in the relationship between predictors and the response variable results in systematically biased predictions, as the linear model attempts to approximate a curved or nonadditive pattern with a straight line, leading to over- or under-predictions across parts of the data range.[20] This violation also tends to underestimate the true variance of the errors, inflating measures of model fit like R² and producing overly narrow confidence intervals that fail to capture the actual uncertainty.[3]

Heteroscedasticity, where error variance changes with the level of predictors, renders OLS estimates inefficient, meaning they have larger variance than the best linear unbiased estimator, though unbiasedness is preserved.[21] More critically, it invalidates the usual standard errors, leading to unreliable t-tests and F-tests that may incorrectly reject or fail to reject hypotheses; for instance, standard errors can be either underestimated or overestimated depending on the form of heteroscedasticity.[19] To address this for inference, heteroscedasticity-consistent standard errors, such as White's estimator, can provide robust alternatives without altering the point estimates.[22]

Autocorrelation in error terms, common in time series data, causes the OLS estimates to remain unbiased and consistent but inefficient, with underestimated standard errors that inflate the significance of coefficients and lead to overly optimistic t-statistics and p-values.[23] This serial correlation also artificially inflates the R² statistic, suggesting stronger explanatory power than actually exists, and renders confidence intervals too narrow, increasing the risk of Type I errors in hypothesis testing.[24]

Non-normality of errors does not bias OLS estimates or affect their consistency, but it compromises exact inference in small samples, where t- and F-distributions no longer hold, leading to inaccurate p-values and confidence intervals.[18] However, for large samples, the central limit theorem ensures asymptotic normality of the estimators, making inference robust; in small samples, severe non-normality can distort significance tests, though the impact diminishes with sample size exceeding 30–50 observations.[25]

In multiple linear regression, multicollinearity among predictors inflates the variance of the coefficient estimates, making them highly sensitive to small changes in data and leading to unstable predictions with wide confidence intervals, even if the overall model fit remains adequate.[26] This high variance does not bias the estimates but reduces their precision, often resulting in insignificant individual t-tests despite a significant overall F-test; the variance inflation factor (VIF), calculated as \text{VIF}_j = \frac{1}{1 - R_j^2}, where R_j^2 is the R² from regressing predictor j on the other predictors, quantifies this, with VIF > 10 indicating problematic multicollinearity.[27]
Diagnostic Techniques
Diagnostic techniques in linear regression involve graphical and statistical methods to evaluate the validity of model assumptions after estimation, such as linearity, homoscedasticity, normality, independence, and lack of multicollinearity. These tools primarily rely on residuals, defined as the differences between observed and fitted values, to identify potential violations that could lead to unreliable inferences. By examining residuals, analysts can detect patterns indicative of model misspecification or data issues, enabling informed decisions about model refinement.

Graphical methods are foundational for assessing several assumptions. A scatterplot of residuals against fitted values helps diagnose linearity and homoscedasticity; under ideal conditions, points should scatter randomly around zero without systematic trends or funneling patterns that suggest non-constant variance.[28] Similarly, a quantile-quantile (Q-Q) plot compares the distribution of residuals to a theoretical normal distribution, with points aligning closely to the reference line indicating approximate normality; deviations in the tails may signal outliers or non-normality.[28]

For detecting autocorrelation in residuals, particularly in time-series data, the Durbin-Watson test provides a statistical measure. The test statistic is calculated as DW = \frac{\sum_{i=1}^{n-1} (e_{i+1} - e_i)^2}{\sum_{i=1}^n e_i^2}, where e_i are the residuals and n is the sample size; values near 2 indicate no first-order autocorrelation, while values below 1.5 or above 2.5 suggest positive or negative autocorrelation, respectively, with critical values depending on the number of predictors and sample size.[29]

Multicollinearity among predictors is assessed using the variance inflation factor (VIF) for each regressor x_j, defined as \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, where R_j^2 is the coefficient of determination from regressing x_j on all other predictors; VIF values exceeding 5 or 10 typically indicate problematic multicollinearity, inflating coefficient variances and standard errors.

Influence and leverage diagnostics identify observations that disproportionately affect the fitted model. Leverage measures, derived from the hat matrix H = X(X^T X)^{-1} X^T, quantify how much each observation pulls the fit toward itself, with diagonal elements h_{ii} ranging from 0 to 1; high leverage (h_{ii} > 2p/n, where p is the number of parameters) warrants scrutiny. Cook's distance further combines leverage and residual size to measure overall influence: D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}, where e_i is the residual and MSE is the mean squared error; values exceeding 4/n, or exceeding the 50th percentile of the F_{p,\,n-p} distribution, suggest influential points that may distort parameter estimates.

Outliers, which can bias results, are detected using studentized residuals, defined as t_i = e_i / \sqrt{MSE (1 - h_{ii})}, adjusting the ordinary residual for its estimated standard error. These follow a t-distribution under the model, allowing formal tests; absolute values greater than about 2.5 or 3 (depending on the significance level) flag potential outliers, as they deviate markedly from expected behavior.
Estimation Methods
Ordinary Least Squares
Ordinary least squares (OLS) is the canonical method for estimating the parameters of a linear regression model by minimizing the sum of squared residuals between observed and predicted values. Introduced by Adrien-Marie Legendre in 1805 for determining comet orbits and independently justified probabilistically by Carl Friedrich Gauss in 1809, OLS selects the parameter vector \beta that best fits the data in a least-squares sense.[30][31] The residuals are the differences e_i = y_i - \hat{y}_i, and the objective function to minimize is the residual sum of squares: S(\beta) = (Y - X\beta)'(Y - X\beta), where Y is the n \times 1 vector of observations, X is the n \times (k+1) design matrix (including an intercept column of ones), and \beta is the (k+1) \times 1 parameter vector.[32]

To derive the OLS estimator, differentiate S(\beta) with respect to \beta and set the result to zero, yielding the normal equations: X'X \beta = X'Y. This system of k+1 equations in k+1 unknowns arises from the first-order conditions for minimization. Assuming X'X is invertible (which requires the columns of X to be linearly independent and full rank), the closed-form solution is: \hat{\beta} = (X'X)^{-1} X'Y. This explicit formula allows direct computation of the estimates without iterative methods, provided the matrix inversion is feasible.[32]

Under the core assumptions of the linear regression model (linearity in parameters, strict exogeneity of errors, no perfect multicollinearity, and homoscedasticity of errors), the OLS estimator possesses desirable statistical properties. It is unbiased, meaning E(\hat{\beta}) = \beta, so on average, the estimates equal the true parameters.[33] Furthermore, by the Gauss-Markov theorem, \hat{\beta} is the best linear unbiased estimator (BLUE), with the smallest variance among all linear unbiased estimators of \beta. The variance-covariance matrix of \hat{\beta} is: \text{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}, where \sigma^2 is the variance of the error terms; this matrix quantifies the precision of the estimates, with diagonal elements giving the variances of individual \hat{\beta}_j.[33]

Once \hat{\beta} is obtained, the model generates fitted values \hat{Y} = X \hat{\beta}, which serve as predictions for the response variable. The mean squared error (MSE) of these in-sample predictions, also known as the error variance estimate, is computed from the residuals as \widehat{\text{MSE}} = \frac{(Y - \hat{Y})'(Y - \hat{Y})}{n - k - 1}, providing a measure of prediction accuracy and an unbiased estimate of \sigma^2.[34]
Maximum Likelihood Estimation
In the linear regression model, maximum likelihood estimation (MLE) provides a probabilistic framework for parameter estimation by assuming that the errors follow a normal distribution, i.e., \epsilon_i \sim \mathcal{N}(0, \sigma^2) independently for i = 1, \dots, n.[35] Under this assumption, the likelihood function for the parameters \beta (the vector of coefficients) and \sigma^2 (the error variance) is L(\beta, \sigma^2) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta)^2 \right), where y_i is the observed response, x_i is the vector of predictors (including the intercept), and the sum of squared residuals measures the discrepancy between observed and predicted values.[36] This formulation treats the response vector as a draw from \mathcal{N}(X\beta, \sigma^2 I_n), where X is the n \times (p+1) design matrix.[37]

To find the MLE, it is computationally convenient to maximize the log-likelihood instead: \ell(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \| y - X\beta \|^2, where \| \cdot \|^2 denotes the squared Euclidean norm.[38] Differentiating \ell with respect to \beta yields the score equations \frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2} X^T (y - X\beta) = 0, which simplify to the normal equations X^T X \beta = X^T y under the assumption that X^T X is invertible.[35] Solving these gives the MLE for \beta, \hat{\beta}_{\text{MLE}} = (X^T X)^{-1} X^T y, which coincides exactly with the ordinary least squares (OLS) estimator.[37]

For \sigma^2, substituting \hat{\beta}_{\text{MLE}} and differentiating \ell with respect to \sigma^2 produces \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n \hat{e}_i^2 = \frac{\| y - X\hat{\beta}_{\text{MLE}} \|^2}{n}, where \hat{e}_i = y_i - x_i^T \hat{\beta}_{\text{MLE}} are the residuals; this estimator is biased downward, in contrast to the unbiased OLS variance estimator s^2 = \frac{\| y - X\hat{\beta}_{\text{OLS}} \|^2}{n - p - 1}.[36]

Under standard regularity conditions (including normality of errors and a fixed design matrix with full rank), the MLE \hat{\beta}_{\text{MLE}} is consistent and asymptotically normal as n \to \infty: \sqrt{n} (\hat{\beta}_{\text{MLE}} - \beta) \xrightarrow{d} \mathcal{N}(0, \sigma^2 \operatorname{plim}(n^{-1} X^T X)^{-1}); more precisely, \hat{\beta}_{\text{MLE}} \sim \mathcal{N}(\beta, \sigma^2 (X^T X)^{-1}) in finite samples under exact normality.[39]

These properties enable asymptotic inference on \beta. The Wald test assesses hypotheses of the form H_0: R\beta = r (where R is a restriction matrix and r a vector) by forming the quadratic form W = (R\hat{\beta}_{\text{MLE}} - r)^T [R \hat{V} R^T]^{-1} (R\hat{\beta}_{\text{MLE}} - r), where \hat{V} = \hat{\sigma}^2 (X^T X)^{-1} is the estimated covariance matrix of \hat{\beta}_{\text{MLE}}; under H_0, W \xrightarrow{d} \chi^2_q as n \to \infty, with q = \text{rank}(R).[40] Alternatively, the likelihood ratio test compares the log-likelihoods under the full and restricted models: \text{LR} = 2 [\ell(\hat{\beta}_{\text{MLE}}, \hat{\sigma}^2_{\text{MLE}}) - \ell(\hat{\beta}_R, \hat{\sigma}^2_R)], which also follows \chi^2_q asymptotically under H_0.[40] Both tests leverage the information matrix equality, equating the expected outer product of scores to the negative expected Hessian of the log-likelihood, ensuring their validity for large samples.[39]
Alternative Estimators
Least absolute deviation (LAD) estimation provides a robust alternative to ordinary least squares (OLS) by minimizing the sum of absolute residuals rather than squared residuals, making it less sensitive to outliers.[41] The LAD estimator is defined as \hat{\beta}_{\text{LAD}} = \arg\min_{\beta} \sum_{i=1}^n |y_i - \mathbf{x}_i' \beta|, where y_i is the observed response, \mathbf{x}_i are the predictors, and \beta are the coefficients. This objective function corresponds to the conditional median of the response given the predictors, which is inherently robust to extreme values since the median is less affected by outliers than the mean.[41] Computationally, the LAD problem can be reformulated as a linear programming task, allowing efficient solution via standard optimization algorithms like the simplex method.[41]

Quantile regression extends the LAD approach to estimate conditional quantiles of the response distribution at any level \tau \in (0,1), offering a more complete picture of the relationship between predictors and the full range of response outcomes.[42] The estimator is given by \hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^n \rho_{\tau}(y_i - \mathbf{x}_i' \beta), where the check function \rho_{\tau}(u) = u (\tau - I(u < 0)) weights positive and negative residuals asymmetrically based on \tau, with I(\cdot) as the indicator function.[42] When \tau = 0.5, this reduces to median regression, recovering the LAD estimator.[42] Like LAD, quantile regression can be solved using linear programming, and it is particularly useful in heteroscedastic settings or when interest lies in tail behaviors of the response.[42]

Ridge regression addresses multicollinearity in the predictors by introducing a penalty term that shrinks the coefficients toward zero, stabilizing estimates when OLS variances become large.[43] The ridge estimator is obtained by minimizing a penalized residual sum of squares: \hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \left( \| \mathbf{y} - \mathbf{X} \beta \|^2 + \lambda \| \beta \|^2 \right), where \lambda > 0 is a tuning parameter controlling the degree of shrinkage, \mathbf{y} is the response vector, and \mathbf{X} is the design matrix.[43] The solution follows from the OLS normal equations, \mathbf{X}' \mathbf{X} \beta = \mathbf{X}' \mathbf{y}, by adding the ridge term \lambda \mathbf{I} to \mathbf{X}' \mathbf{X}: (\mathbf{X}' \mathbf{X} + \lambda \mathbf{I}) \beta = \mathbf{X}' \mathbf{y}, yielding the closed-form solution \hat{\beta}_{\text{ridge}} = (\mathbf{X}' \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}' \mathbf{y}.[43] The bias introduced by shrinkage is traded off against reduced variance, often improving mean squared error in correlated predictor scenarios.[43]

Bayesian linear regression incorporates prior beliefs about the coefficients, yielding estimators as posterior means under a normal prior, which provides a probabilistic framework for uncertainty quantification beyond point estimates. Assuming a normal likelihood for the errors with known variance \sigma^2 and a normal prior \beta \sim \mathcal{N}(\mu_0, \Sigma_0), the posterior distribution is also normal, with mean \hat{\beta}_{\text{Bayes}} = \left( \mathbf{X}' \mathbf{X} + \sigma^2 \Sigma_0^{-1} \right)^{-1} \left( \mathbf{X}' \mathbf{y} + \sigma^2 \Sigma_0^{-1} \mu_0 \right), resembling a shrunk OLS estimator in which the prior acts as regularization. This approach naturally handles multicollinearity through the prior covariance and allows for model comparison via posterior predictive checks.
Extensions and Variants
Multiple Linear Regression
Multiple linear regression extends the simple linear regression framework to incorporate multiple predictor variables, allowing for the modeling of more complex relationships between the response variable and a set of explanatory factors. The model is expressed as Y_i = \beta_0 + \sum_{j=1}^p \beta_j X_{ij} + \epsilon_i for i = 1, \dots, n, where Y_i is the response, X_{ij} are the predictors, \beta_j are the coefficients, and \epsilon_i are independent errors with mean zero and constant variance.[8] In matrix notation, this becomes \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, where \mathbf{Y} is an n \times 1 vector of responses, \mathbf{X} is an n \times (p+1) design matrix with a column of ones for the intercept, \boldsymbol{\beta} is a (p+1) \times 1 vector of coefficients, and \boldsymbol{\epsilon} is an n \times 1 error vector.[8] This formulation facilitates computational efficiency and theoretical analysis, particularly for estimation via ordinary least squares, where the coefficient vector is \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}, assuming \mathbf{X}^\top \mathbf{X} is invertible.[8]

The coefficients \beta_j in multiple linear regression represent partial effects, capturing the change in the expected value of Y associated with a one-unit increase in X_j, while holding all other predictors constant at their means or specific values.[44] Unlike marginal effects in simple regression (where p=1), these partial coefficients adjust for confounding among predictors, isolating the unique contribution of each variable to the response.[45] This interpretation is crucial for causal inference in observational data but assumes the model is correctly specified and free of multicollinearity that could inflate standard errors.[44]

To assess overall model fit and significance, the coefficient of determination R^2 measures the proportion of variance explained by the predictors, but it increases with added variables regardless of relevance.[46] The adjusted R^2 addresses this by penalizing model complexity: \bar{R}^2 = 1 - (1 - R^2) \frac{n-1}{n - p - 1}, where n is the sample size and p is the number of predictors; higher values indicate better fit relative to baseline models.[46] For overall significance, the F-test evaluates whether the model explains more variance than an intercept-only model: F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}, which follows an F-distribution with p and n - p - 1 degrees of freedom under the null hypothesis that all \beta_j = 0 for j \geq 1.[45] A significant F-statistic (low p-value) supports retaining the full model.

Variable selection in multiple linear regression aims to identify a subset of predictors that balances fit and parsimony, often using stepwise methods guided by information criteria, as sketched below and described in detail afterward.
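A minimal sketch of such a procedure is shown here, using NumPy and a hand-computed AIC (up to an additive constant); the variable names X (predictor matrix without an intercept column) and y (response vector) are assumed, and the selection strategies and the AIC itself are described in the text that follows.

```python
import numpy as np

def aic_linear(design, y):
    """AIC (up to an additive constant) for an OLS fit of y on the given design matrix."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_select(X, y):
    """Greedy forward selection over the columns of X; the intercept is always included."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    current = np.ones((n, 1))                 # intercept-only design matrix
    best_aic = aic_linear(current, y)
    improved = True
    while improved and remaining:
        improved = False
        scores = [(aic_linear(np.column_stack([current, X[:, j]]), y), j) for j in remaining]
        score, j = min(scores)                # candidate giving the lowest AIC
        if score < best_aic:                  # keep it only if AIC actually improves
            best_aic = score
            selected.append(j)
            remaining.remove(j)
            current = np.column_stack([current, X[:, j]])
            improved = True
    return selected
```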
Forward stepwise selection begins with an intercept-only model and iteratively adds the predictor that most improves fit (e.g., via the largest reduction in residual sum of squares or the F-statistic), stopping when no addition yields significant gain.[47] Backward stepwise selection starts with all predictors and removes the least contributory one step-by-step until removals degrade fit unacceptably.[47] These procedures commonly employ the Akaike information criterion (AIC) for selection, defined as \text{AIC} = -2 \log L + 2k, where L is the likelihood and k = p + 1 is the number of parameters; lower AIC values favor models with good predictive accuracy penalized for complexity.[48] While computationally efficient, stepwise methods risk overfitting and are sensitive to collinearity, prompting caution in interpretation.[47]
Generalized Linear Models
Generalized linear models (GLMs) extend the framework of linear regression to accommodate response variables that follow distributions other than the normal distribution, such as binomial or Poisson, by incorporating a link function that connects the mean of the response to a linear predictor.[49] The model consists of three main components: a random component specifying the distribution of the response variable Y, typically from the exponential family (e.g., Poisson for count data or binomial for binary/proportion data); a linear predictor \eta = X\beta, where X is the design matrix and \beta are the coefficients; and a link function g(\mu) = \eta, where \mu = E(Y) is the expected value of the response.[49] This structure allows GLMs to model non-constant variance and non-linear relationships between predictors and the mean response while maintaining the linear form in the predictors.[49]

A key feature of GLMs is the use of canonical link functions, which simplify estimation and interpretation by aligning the link with the natural parameter of the exponential family distribution.[49] For the Gaussian distribution, the canonical link is the identity function g(\mu) = \mu, which recovers the standard linear regression model.[49] In the case of the binomial distribution, the logit link g(\mu) = \log(\mu / (1 - \mu)) is canonical, suitable for modeling probabilities.[49] For the Poisson distribution, the log link g(\mu) = \log(\mu) serves as the canonical form, enabling the analysis of count data with multiplicative effects.[49]

Parameter estimation in GLMs is typically performed using iteratively reweighted least squares (IRLS), an algorithm that iteratively solves weighted least squares problems to maximize the likelihood.[49] In each iteration, a working response is constructed as z_i = \eta_i + (y_i - \mu_i) g'(\mu_i), where \eta_i is the current linear predictor, y_i is the observed response, \mu_i is the current fitted mean, and g'(\mu_i) is the derivative of the link function; the coefficients \beta are then updated by weighted ordinary least squares with weights w_i = 1 / [V(\mu_i) (g'(\mu_i))^2], where V(\mu) is the variance function of the distribution.[49] This process converges to the maximum likelihood estimates under the specified distribution.[49]

Goodness-of-fit in GLMs is assessed using the deviance, defined as D = 2 [l(\text{saturated}) - l(\text{fitted})], where l denotes the log-likelihood, analogous to -2 \log L in likelihood ratio tests.[49] The deviance measures the discrepancy between the fitted model and a saturated model that perfectly fits the data, with smaller values indicating better fit; under the null hypothesis of adequate fit, it approximately follows a chi-squared distribution with degrees of freedom equal to the number of observations minus the number of parameters.[49] To handle overdispersion, where the observed variance exceeds that implied by the model (e.g., \phi > 1 for a dispersion parameter in quasi-likelihood approaches), GLMs can incorporate a scale parameter \phi such that \text{Var}(Y) = \phi V(\mu), allowing robust estimation without altering the mean structure.[50] This quasi-likelihood extension, introduced to address situations like clustered or heterogeneous data, scales the standard errors by \sqrt{\phi} while preserving the IRLS estimation procedure.[50]
Robust and Regularized Variants
Robust and regularized variants of linear regression extend the classical model to address violations of core assumptions such as homoscedasticity, independence in clustered data, measurement errors in predictors, endogeneity due to correlation between regressors and errors, and challenges in high-dimensional settings where the number of predictors exceeds observations. These methods improve estimation efficiency, reduce bias, and promote sparsity or interpretability while maintaining the linear structure. They are particularly valuable in applied fields like econometrics, social sciences, and machine learning, where real-world data often deviate from ideal conditions.

Heteroscedastic models arise when the variance of the errors is not constant across observations, violating the homoscedasticity assumption and leading to inefficient ordinary least squares estimates. Weighted least squares (WLS) addresses this by assigning weights inversely proportional to the error variances, yielding more efficient estimators under known or estimated heteroscedasticity. The WLS estimator is given by \hat{\beta} = (X^T W X)^{-1} X^T W Y, where W is a diagonal matrix with entries w_i = 1 / \mathrm{Var}(y_i), and the variances may be estimated iteratively from residuals of an initial ordinary least squares fit. This approach, originally formulated as a generalization of least squares for weighted observations, enhances precision in settings like cross-sectional economic data with varying scales.[51]

Hierarchical or multilevel models, also known as random effects models, accommodate data with nested structures, such as students within schools or repeated measures within individuals, where observations are not independent due to group-level variation. These models partition the error term into fixed effects common across groups and random effects specific to each group, allowing intercepts or slopes to vary hierarchically. A basic two-level model for grouped data is y_{ij} = X_{ij} \beta + Z_{ij} u_j + \epsilon_{ij}, where i indexes observations within group j, u_j \sim N(0, \Sigma_u) captures random effects at the group level, and \epsilon_{ij} \sim N(0, \sigma^2) is the residual error, assuming independence within groups. Estimation typically uses maximum likelihood or restricted maximum likelihood, accounting for the covariance structure induced by the random effects. This framework, developed for educational and social research, properly handles intraclass correlation and provides more accurate inference for group-varying parameters.[52]

Errors-in-variables models occur when predictors X contain measurement errors, causing ordinary least squares to produce biased and inconsistent estimates of \beta due to attenuation bias. Total least squares (TLS), a robust alternative, minimizes the perpendicular distances from data points to the fitted hyperplane, perturbing both X and Y to account for errors in all variables. The TLS solution involves the singular value decomposition of the augmented matrix [X \ Y], where \hat{\beta} is derived from the right singular vector corresponding to the smallest singular value of the perturbed matrix. This method, analyzed through numerical linear algebra, yields consistent estimators under classical error assumptions and is widely applied in calibration problems and approximate solutions to overdetermined systems.
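The SVD construction of the TLS estimator can be written compactly. The following minimal NumPy sketch assumes every column of X (and y) is measured with error; columns known to be error-free, such as an intercept column of ones, are usually handled separately, for example by centering the data first:

```python
import numpy as np

def tls_fit(X, y):
    """Total least squares: beta is derived from the right singular vector
    associated with the smallest singular value of the augmented matrix [X y]."""
    C = np.column_stack([X, y])              # augmented n x (p+1) matrix
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    v = Vt[-1]                               # right singular vector for the smallest singular value
    p = X.shape[1]
    if np.isclose(v[p], 0.0):
        raise ValueError("TLS solution does not exist (last component is zero)")
    return -v[:p] / v[p]                     # TLS coefficient vector

# For comparison, the OLS fit on the same data would be:
# beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```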
For cases where measurement errors in X correlate with the true regressors, instrumental variables provide an adjustment by using exogenous instruments Z uncorrelated with the errors but correlated with X.[53]

Lasso regression introduces regularization to linear models, particularly useful in high-dimensional settings where p > n (the number of predictors exceeds the sample size), by shrinking coefficients toward zero and performing automatic variable selection through sparsity. The lasso estimator solves the penalized optimization problem \hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1, where \lambda \geq 0 controls the strength of the L1 penalty on the absolute values of the coefficients, driving irrelevant predictors exactly to zero. This promotes parsimonious models with improved prediction accuracy and interpretability, outperforming ordinary least squares in scenarios with multicollinearity or irrelevant features, as demonstrated in simulation studies and real datasets like gene expression analysis. The method combines the benefits of subset selection and ridge regression while ensuring computational tractability via convex optimization.[54]

Instrumental variables (IV) regression mitigates endogeneity, where regressors correlate with the error term due to omitted variables, simultaneity, or measurement error, leading to biased ordinary least squares estimates. IV uses external instruments Z that are correlated with the endogenous regressors but uncorrelated with the errors, enabling identification of causal effects. The simple IV estimator for a single endogenous regressor is \hat{\beta}_{\text{IV}} = (Z^T X)^{-1} Z^T Y, which can be viewed as a two-stage least squares procedure: first regress X on Z to obtain fitted values \hat{X}, then regress Y on \hat{X}. This approach, rooted in early econometric work on supply-demand systems, provides consistent estimates under valid instrument conditions and is foundational for causal inference in observational data, though it requires careful testing for instrument strength and validity.[55]
Applications
Trend Analysis and Prediction
Linear regression serves as a fundamental tool for trend analysis by fitting a straight line to time-series data, capturing the underlying linear pattern over time. The model is typically expressed as y_t = \beta_0 + \beta_1 t + \varepsilon_t, where y_t is the observed value at time t, \beta_0 is the intercept, \beta_1 is the slope representing the trend rate, and \varepsilon_t is the error term assumed to be normally distributed with mean zero and constant variance.[56] This approach enables detrending, where the linear component is subtracted from the data to isolate random fluctuations or cyclical patterns for further analysis.[57]

For prediction, linear regression provides point forecasts by extrapolating the fitted trend line beyond the observed data range. However, to quantify uncertainty, prediction intervals are constructed around these forecasts, given by \hat{y}_{\text{new}} \pm t_{n-p-1, 1-\alpha/2} \, s \sqrt{1 + \mathbf{x}_{\text{new}}' (\mathbf{X}'\mathbf{X})^{-1} \mathbf{x}_{\text{new}}}, where \hat{y}_{\text{new}} is the predicted value, t_{n-p-1, 1-\alpha/2} is the critical value from the t-distribution with n-p-1 degrees of freedom, s is the residual standard error, p is the number of predictors, and \mathbf{x}_{\text{new}} is the vector for the new observation.[58] These intervals widen as extrapolation extends further from the data, reflecting increased uncertainty due to potential violations of model assumptions outside the fitted range.[58]

Confidence bands, in contrast, provide intervals for the mean trend function rather than individual predictions and are narrower, following a similar form but omitting the "+1" term inside the square root: \hat{y}_{\text{new}} \pm t_{n-p-1, 1-\alpha/2} \, s \sqrt{\mathbf{x}_{\text{new}}' (\mathbf{X}'\mathbf{X})^{-1} \mathbf{x}_{\text{new}}}. These bands bracket the trend line, narrowest near the center of the observed data, and are used to assess the reliability of the estimated mean trajectory, such as in visualizing long-term growth patterns.[59]

In economic forecasting, linear regression models are commonly applied to predict indicators like GDP growth, often after detrending to focus on deviations from steady-state paths. For instance, quarterly GDP data can be modeled with a linear trend in logged values to estimate growth rates, but prior checks for stationarity, using tests like the Augmented Dickey-Fuller test, are essential to avoid spurious regressions from non-stationary series.[60] Non-stationarity, indicated by unit roots, implies that shocks have persistent effects, necessitating differencing or cointegration analysis before applying the model.[60]

Long-term predictions using linear regression face significant limitations when underlying assumptions drift, such as shifts in linearity due to structural economic changes or evolving relationships between variables. Violations of linearity or stationarity can lead to biased forecasts and unreliable extrapolation, as the model fails to capture nonlinear dynamics or persistent trends that emerge over extended horizons.[3] In such cases, the widening prediction intervals underscore the model's reduced accuracy for distant forecasts, emphasizing the need for periodic re-estimation.[61]
Use in Specific Disciplines
In epidemiology, linear regression is commonly employed to model dose-response relationships, such as regressing disease rates or health outcomes on exposure levels while adjusting for confounders like age, socioeconomic status, or environmental factors. For instance, studies have used linear and log-linear regression models to examine the impact of blood lead levels on children's IQ, controlling for variables including maternal IQ, education, and birth weight, revealing that a doubling of blood lead concentration is associated with a 2.6-point IQ decrement. Similarly, linear regression has been applied to assess lead exposure's effect on blood pressure, where log-linear models show a 1.0 mm Hg increase in systolic blood pressure per doubling of blood lead levels after confounder adjustment. These models enable public health policymakers to quantify risks and evaluate interventions, such as lead reduction programs that have yielded substantial economic benefits.[62][63]

In finance, linear regression underpins the Capital Asset Pricing Model (CAPM), where the beta coefficient measures an asset's systematic risk relative to the market. The beta is calculated as \beta = \frac{\mathrm{Cov}(R_i, R_m)}{\mathrm{Var}(R_m)}, with R_i as the asset's return and R_m as the market return, derived from regressing excess asset returns on excess market returns. This approach allows investors to estimate expected returns and assess portfolio risk, forming a cornerstone for asset pricing and cost of capital determination since its formulation in equilibrium theory.[64]

Economists utilize linear regression in regression discontinuity designs to infer causal effects from policy interventions at known cutoff thresholds, comparing outcomes immediately above and below the cutoff to isolate treatment impacts. For example, analyses of vote share cutoffs near zero have estimated incumbency advantages, showing discontinuities of 5-10% in vote shares and 37-57% in reelection probabilities. Other applications include evaluating class size reductions at enrollment thresholds, where regression models reveal effects on student test scores, and financial aid eligibility based on test scores, demonstrating boosts in college enrollment. These designs provide quasi-experimental rigor for assessing policy efficacy, such as in education and electoral systems, under the assumption of continuity in potential outcomes absent the cutoff.[65]

In environmental science, linear regression models relate pollutant concentrations to covariates like land use, traffic density, and meteorological factors, including temperature, to map spatial and temporal variations. Land use regression, a form of multiple linear regression, has been used globally to predict nitrogen dioxide (NO₂) levels from variables such as proximity to major roads and satellite-derived emissions, explaining up to 54% of variation in annual concentrations across diverse regions. Temperature serves as a key covariate in such models, as higher values can exacerbate ozone formation or alter particulate matter (PM₂.₅) components like nitrate and sulfate through photochemical reactions and atmospheric stability changes.
These applications support air quality forecasting and regulatory assessments, such as identifying emission hotspots influenced by seasonal temperature shifts.[66][67]

Building science leverages linear regression to predict energy consumption based on insulation properties and related variables, aiding in the design of efficient structures. Models regress energy use intensity on factors like wall and roof U-values (measures of insulation effectiveness), infiltration rates, and window-to-wall ratios, validated against real-world data with errors under 10%. This approach integrates with building information modeling to optimize thermal performance and support green certification processes.[68]
Integration with Machine Learning
In machine learning pipelines, linear regression often begins with feature engineering to ensure model stability and interpretability. A key step is standardization, where each feature x_j is transformed to x_j' = \frac{x_j - \mu_j}{\sigma_j}, centering the data at zero mean and unit variance. This preprocessing makes coefficients comparable across features with different scales, preventing those with larger variances from disproportionately influencing the fit, and is particularly beneficial for gradient-based optimization in linear models.[69]

Popular machine learning libraries integrate linear regression as a core component, facilitating seamless use within broader workflows. For instance, scikit-learn's LinearRegression class implements ordinary least squares fitting and supports pipeline integration for tasks like prediction and evaluation. Regularized variants, such as ridge and lasso regression, incorporate cross-validation to tune the regularization parameter \lambda; RidgeCV and LassoCV automate this by evaluating multiple \lambda values via k-fold cross-validation, selecting the one minimizing validation error to balance bias and variance. Lasso, in particular, promotes sparsity by driving some coefficients to zero, aiding feature selection in high-dimensional data.[70][71]
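As a minimal sketch of how these pieces fit together in scikit-learn, standardization and cross-validated regularization can be combined in a single pipeline; the synthetic dataset and the candidate penalty grid below are illustrative stand-ins:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic data in place of a real feature matrix X and target y
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Candidate regularization strengths (scikit-learn calls lambda "alpha")
alphas = np.logspace(-3, 3, 13)

# Standardize features, then fit ridge regression with the alpha chosen by cross-validation
ridge_pipeline = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
ridge_pipeline.fit(X, y)

# Lasso counterpart: LassoCV builds its own alpha path and drives some coefficients to zero
lasso_pipeline = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso_pipeline.fit(X, y)

print(ridge_pipeline.named_steps["ridgecv"].alpha_)   # selected ridge penalty
print(lasso_pipeline.named_steps["lassocv"].coef_)    # sparse coefficient vector
```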
To capture nonlinearity while retaining the simplicity of linear regression, basis expansions transform the input space. Polynomial features extend the model to forms like y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon, where higher-degree terms are generated via preprocessing (e.g., scikit-learn's PolynomialFeatures), allowing the linear framework to approximate curved relationships without altering the core algorithm. Splines provide a flexible alternative, using piecewise polynomials joined smoothly at knots to avoid the high-degree oscillations of global polynomials, as detailed in foundational statistical learning texts.[72]
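A basis-expansion sketch under the same kind of assumptions (synthetic data stands in for a real dataset) shows that generating polynomial terms as a preprocessing step leaves the estimator itself linear in the coefficients:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 0.5 * x[:, 0] - 0.8 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

# Degree-2 expansion: columns x and x^2 are fed to an ordinary linear regression
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(x, y)

print(model.named_steps["linearregression"].coef_)   # coefficients on x and x^2
```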
Linear regression also serves as a base learner in ensemble methods, enhancing predictive performance through iterative refinement. In gradient boosting, weak linear models are sequentially fitted to the residuals of prior fits, building an additive ensemble in which each stage corrects the errors of the stages before it; this approach, while most often used with trees, can leverage linear bases in scenarios requiring interpretability over complexity.[73]
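The residual-fitting loop can be sketched directly; the toy example below (illustrative names, synthetic data) boosts simple linear fits with a shrinkage factor. Because a sum of linear models is itself linear, boosting with a purely linear base learner converges toward the single OLS fit, so the sketch is meant only to make the additive, residual-correcting structure concrete:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=300)

learning_rate = 0.1
n_rounds = 50
prediction = np.full_like(y, y.mean())   # start from the mean, as in standard boosting
models = []

for _ in range(n_rounds):
    residual = y - prediction            # pseudo-residuals for squared-error loss
    stage = LinearRegression().fit(X, residual)
    models.append(stage)
    prediction += learning_rate * stage.predict(X)

print(np.mean((y - prediction) ** 2))    # training error of the boosted ensemble
```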
A primary advantage of linear regression in machine learning is its inherent interpretability compared to black-box models like deep neural networks. The coefficients \beta directly quantify feature impacts on the target, but for deeper insights, SHAP values decompose predictions into additive feature attributions, aligning with game-theoretic fairness axioms and extending interpretability to ensemble or regularized linear models.
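For a linear model with (approximately) independent features, the SHAP attribution of feature j for observation i reduces to \beta_j (x_{ij} - \bar{x}_j), so the decomposition can be computed without a specialized library. The sketch below (synthetic data and illustrative names) verifies that the attributions plus the average prediction recover each individual prediction:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = 3.0 + X @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X, y)

# For a linear model with independent features, the SHAP value of feature j for
# observation i is beta_j * (x_ij - mean(x_j)); the base value is the average prediction.
base_value = model.predict(X).mean()
shap_values = model.coef_ * (X - X.mean(axis=0))

# Each row of attributions sums (with the base value) to that row's prediction
reconstructed = base_value + shap_values.sum(axis=1)
print(np.allclose(reconstructed, model.predict(X)))   # True
```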
Historical Development
Early Origins
The foundations of linear regression trace back to the early 19th century, when astronomers sought precise methods to fit observational data to theoretical models, particularly for predicting celestial orbits. In 1805, French mathematician Adrien-Marie Legendre introduced the method of least squares in the appendix of his work Nouvelles méthodes pour la détermination des orbites des comètes, marking the first formal publication of this technique.[74][75] Legendre presented it as a practical tool for minimizing the sum of squared residuals between observed and predicted positions of comets, enabling more accurate orbital determinations amid noisy astronomical measurements.[74] This approach, though algorithmic, laid the groundwork for regression by emphasizing error minimization in linear relationships.[74]

Shortly thereafter, in 1809, Carl Friedrich Gauss published his own formulation of the least squares method in the second volume of Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, applying it to astronomical data such as planetary and asteroid orbits.[76][77] Gauss claimed to have developed the technique as early as 1795, using it successfully to predict the position of the asteroid Ceres based on limited observations from Giuseppe Piazzi in 1801.[76] His work extended Legendre's deterministic method by incorporating a probabilistic framework, assuming that observational errors followed a normal distribution with mean zero, which justified least squares as the maximum-likelihood estimator for parameter recovery.[78][79] This error model, derived from the idea that errors arise from numerous small, independent causes, provided a theoretical basis for the normality assumption in regression analysis.[78]

The term "regression" itself emerged later in the 19th century through studies of biological inheritance. In 1886, British polymath Francis Galton coined the phrase in his paper "Regression Towards Mediocrity in Hereditary Stature," published in the Journal of the Anthropological Institute of Great Britain and Ireland.[80] Analyzing data on the heights of 930 adult children and their 205 mid-parentages, Galton observed a tendency for offspring of exceptionally tall or short parents to have heights closer to the population average, a phenomenon he termed "regression towards mediocrity."[80] This empirical insight, visualized through scatter plots and fitted lines, highlighted mean reversion in linear relationships and connected least squares estimation to the study of correlated traits, influencing the statistical interpretation of regression.[80]
Key Milestones and Contributors
Building on Galton's empirical observations, Karl Pearson refined the mathematical framework of regression in the 1890s. In 1895, he developed the product-moment correlation coefficient to quantify the strength of linear relationships between variables, and by the early 1900s, he extended these ideas to multiple regression, providing equations for predicting a dependent variable from several independent ones and establishing a rigorous basis for biometrical analysis.[5]

In the 1920s, Ronald Fisher laid foundational groundwork for modern linear regression analysis by developing the analysis of variance (ANOVA) framework, which decomposes total variance into components attributable to different sources, and introducing the F-test for assessing the significance of regression coefficients. These concepts were detailed in his seminal book Statistical Methods for Research Workers (1925), enabling researchers to evaluate the fit and explanatory power of linear models in experimental data.[81]

During the 1930s, Jerzy Neyman and Egon Pearson advanced the integration of hypothesis testing into linear regression by formulating the Neyman-Pearson lemma, which identifies the most powerful tests for simple hypotheses based on likelihood ratios, thereby providing a rigorous basis for inference on model parameters such as slopes and intercepts. Their collaborative work, including the 1933 paper "On the Problem of the Most Efficient Tests of Statistical Hypotheses," established a decision-theoretic approach that complemented Fisher's methods and became essential for validating linear regression assumptions.[82]

In the 1940s, Trygve Haavelmo propelled linear regression's application in econometrics by addressing biases in systems of simultaneous equations, where endogenous variables violate classical assumptions; his analysis demonstrated the need for identification strategies, paving the way for instrumental variables (IV) methods to obtain consistent estimates. This insight was articulated in his 1943 paper "The Statistical Implications of a System of Simultaneous Equations," which shifted econometric modeling toward probabilistic frameworks and influenced robust regression techniques.[83]

Computational advancements in the 1960s enhanced the stability of linear regression estimation amid growing data volumes; Gene Golub's 1965 paper "Numerical Methods for Solving Linear Least Squares Problems" introduced the QR decomposition using Householder transformations, offering a numerically stable alternative to direct normal equations by orthogonalizing the design matrix and avoiding ill-conditioning issues.[84]

Software developments in the 1970s democratized linear regression by embedding it in user-friendly tools; the Statistical Analysis System (SAS), originating from a 1966 project at North Carolina State University and incorporated in 1976, provided comprehensive procedures for regression modeling, including diagnostics and extensions, making advanced analysis accessible beyond specialists.[85] By the 1990s, the open-source R programming language, initiated in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland as an implementation of the S language, further popularized linear regression through its lm() function and extensible packages, fostering widespread adoption in statistical computing and reproducible research.[86]