Linear model
A linear model in statistics is a framework for modeling the relationship between a response variable and one or more predictor variables, assuming that the expected value of the response is a linear function of the predictors, typically expressed in matrix form as Y = X\beta + \epsilon, where Y is the n \times 1 vector of observed responses, X is the n \times p design matrix incorporating the predictors, \beta is the p \times 1 vector of unknown parameters, and \epsilon is the n \times 1 vector of random errors with mean zero.[1][2] The origins of linear models trace back to the late 19th century, when Sir Francis Galton developed the concept of regression while studying hereditary traits in sweet peas, introducing the idea of a linear relationship tending toward the mean, which Karl Pearson later formalized in the early 20th century through the development of the product-moment correlation and multiple regression techniques.[3][4] Linear models encompass a wide range of applications, including simple and multiple linear regression for prediction and inference, analysis of variance (ANOVA) for comparing group means, and analysis of covariance (ANCOVA) for adjusting means across covariates.[2][1] Under the Gauss-Markov assumptions, where errors have zero mean, constant variance \sigma^2, and are uncorrelated, the ordinary least squares estimator of \beta is the best linear unbiased estimator (BLUE), providing efficient parameter estimates via the solution to the normal equations X'X\hat{\beta} = X'Y.[2] Extensions include generalized least squares for heteroscedastic or correlated errors, as in the Aitken model where \text{cov}(\epsilon) = \sigma^2 V with known V, and further generalizations to linear mixed models incorporating random effects for clustered or hierarchical data.[2][5] These models are foundational in fields like economics, biology, and social sciences, enabling hypothesis testing via F-statistics and confidence intervals under normality assumptions.[1][2]
Basic Concepts
Definition and Scope
In statistics, a linear model describes the relationship between a dependent variable and one or more independent variables as a linear function of the parameters, meaning the expected value of the dependent variable is a linear combination of the independent variables weighted by unknown coefficients.[2] This linearity pertains specifically to the parameters rather than the variables themselves, allowing transformations of the variables (such as logarithms or polynomials) to maintain the linear structure in the coefficients.[6] A canonical example is the simple linear regression model, where the dependent variable Y is modeled as Y = \beta_0 + \beta_1 X + \epsilon, with \beta_0 and \beta_1 as the intercept and slope parameters, X as the independent variable, and \epsilon as a random error term representing unexplained variation.[7] This formulation emphasizes the additivity of effects, where the influence of each independent variable contributes independently to the outcome, aligning with the superposition principle that permits combining solutions through scaling and addition.[8] The scope of linear models encompasses a broad range of applications in statistics, including regression analysis, analysis of variance, and experimental design, primarily for predicting outcomes and drawing inferences about relationships in fields such as economics, social sciences, and engineering.[9] Unlike nonlinear models, where conditional and marginal effects may diverge, linear models ensure these effects coincide, simplifying interpretation and enabling properties like homogeneity and additivity that support scalable solutions.[10] A key advantage is that linearity in parameters facilitates closed-form solutions for parameter estimation, making them computationally efficient and analytically tractable compared to nonlinear alternatives requiring iterative methods.[11]
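To make the distinction concrete, the sketch below fits a model that is quadratic in X but still linear in its coefficients, using closed-form least squares; the data, seed, and coefficient values are synthetic and chosen here purely for illustration.

```python
# Minimal sketch (synthetic data): a model that is nonlinear in X but linear
# in its parameters is still a linear model, so it can be fit in closed form.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=n)  # true curve plus noise

# Design matrix with columns [1, x, x^2]: transforming x does not break
# linearity, because E[Y] remains a linear combination of the betas.
X = np.column_stack([np.ones(n), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # least-squares estimates of (beta_0, beta_1, beta_2)
```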
Historical Background
The development of linear models traces its roots to the early 19th century, when astronomers and mathematicians sought methods to fit observational data amid measurement errors. In 1805, Adrien-Marie Legendre introduced the method of least squares in his work Nouvelles méthodes pour la détermination des orbites des comètes, applying it to minimize the sum of squared residuals for predicting comet orbits based on astronomical observations. This deterministic approach marked a foundational step in handling overdetermined systems. Four years later, in 1809, Carl Friedrich Gauss published Theoria motus corporum coelestium in sectionibus conicis solem ambientum, where he claimed prior use of least squares since 1795 and provided a probabilistic justification by linking it to the normal distribution of errors, establishing it as a maximum-likelihood estimator under Gaussian assumptions.

The concept of regression emerged in the late 19th century through studies of inheritance patterns. In 1886, Francis Galton coined the term "regression" in his paper "Regression Towards Mediocrity in Hereditary Stature," published in the Journal of the Anthropological Institute, while analyzing height data from parents and children.[12] Galton observed that extreme parental heights tended to produce offspring heights closer to the population average, introducing the idea of linear relationships between variables and laying the groundwork for bivariate regression as a tool in biometrics. This work shifted focus from mere curve fitting to modeling dependencies, influencing subsequent statistical applications in natural sciences.[13]

Building on Galton's ideas, Karl Pearson formalized key aspects of linear regression in the late 19th and early 20th centuries. In 1895, Pearson developed the product-moment correlation coefficient to quantify the strength of linear relationships between variables. He further extended this to multiple regression techniques around 1900–1910, enabling the modeling of a dependent variable against several predictors, which provided a mathematical foundation for broader applications in biometrics and beyond.[14]

In the 1920s, Ronald A. Fisher advanced linear models into a unified framework for experimental design and analysis. Working at the Rothamsted Experimental Station, Fisher developed analysis of variance (ANOVA) in his 1925 book Statistical Methods for Research Workers, extending least squares to partition variance in designed experiments, such as agricultural trials. By the mid-1930s, in works like The Design of Experiments (1935), Fisher synthesized regression, ANOVA, and covariance analysis into the general linear model, incorporating probabilistic error terms to enable inference from sample data.[15] This evolution transformed linear models from deterministic tools to probabilistic frameworks essential for hypothesis testing in experimental sciences.

The 1930s saw Jerzy Neyman and Egon Pearson formalize inference procedures for linear models through their Neyman-Pearson lemma, introduced in the 1933 paper "On the Problem of the Most Efficient Tests of Statistical Hypotheses" in Philosophical Transactions of the Royal Society.[16] Their framework emphasized controlling error rates (Type I and Type II) and power in hypothesis testing, providing a rigorous basis for applying linear models to decision-making under uncertainty.
Post-World War II computational advances, including early electronic computers like the ENIAC (1945) and statistical software developments in the 1950s, facilitated the widespread adoption of these methods by enabling efficient matrix computations for large datasets.[17] This period marked the transition of linear models from theoretical constructs to practical tools in fields like economics and social sciences.
Mathematical Formulation
General Linear Model Equation
The general linear model expresses the relationship between a response variable and one or more predictor variables as a linear combination of parameters plus an error term. In its scalar form, for each observation i = 1, \dots, n, the model is given by Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i, where Y_i is the observed response, \beta_0 is the intercept, \beta_j (for j = 1, \dots, p-1) are the slope coefficients associated with the predictors X_{ij}, and \varepsilon_i represents the random error for the ith observation.[18][19] This formulation can be compactly represented in vector-matrix notation as \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \mathbf{Y} is an n \times 1 vector of responses, \mathbf{X} is an n \times p design matrix with rows corresponding to observations, \boldsymbol{\beta} is a p \times 1 vector of parameters (including the intercept), and \boldsymbol{\varepsilon} is an n \times 1 vector of errors.[19][20] The errors \varepsilon_i are typically assumed to be independently and identically distributed as normal with mean zero and constant variance \sigma^2, though full details on these assumptions appear in the relevant section.[19] In this model, each coefficient \beta_j (for j \geq 1) represents the partial effect of the jth predictor on the response, interpreted as the expected change in Y for a one-unit increase in X_j, holding all other predictors constant.[21][22] The design matrix \mathbf{X} structures the predictors for estimation; its first column consists entirely of 1s to accommodate the intercept term \beta_0.[23] Categorical predictors are incorporated by creating dummy variables, where each category (except one reference category) is represented by a binary column in \mathbf{X} to avoid multicollinearity.[24]
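The short sketch below illustrates how such a design matrix can be assembled in practice: an intercept column of 1s plus dummy coding of a categorical predictor against a reference category. The data frame, its column names (income, region), and the use of pandas are illustrative choices, not part of the model itself.

```python
# Minimal sketch (hypothetical data): building the design matrix X for
# Y = X beta + epsilon with an intercept column and dummy-coded categories.
import pandas as pd

df = pd.DataFrame({
    "income": [42.0, 55.5, 38.2, 61.0],
    "region": ["north", "south", "south", "west"],  # categorical predictor
})

# drop_first=True omits the reference category ("north") so the dummy columns
# are not an exact linear combination of the intercept column of 1s.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=float)

X = pd.concat([df[["income"]], dummies], axis=1)
X.insert(0, "intercept", 1.0)  # column of 1s for beta_0
print(X)
```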
Matrix Representation
The design matrix X serves as the foundational structure in the matrix representation of linear models, typically an n \times p matrix where n denotes the number of observations and p the number of parameters (including the intercept). Its rows represent individual observations, while columns correspond to the predictor variables. For the model to be identifiable and to ensure unique parameter estimates, X must have full column rank, meaning its rank equals p, which implies that the columns are linearly independent and there is no perfect multicollinearity. This full column rank condition guarantees the invertibility of the matrix X^\top X, a key property that enables the explicit solution for model parameters. The parameter vector \beta, a p \times 1 column vector containing the coefficients, is estimated in matrix form via the ordinary least squares (OLS) estimator \hat{\beta} = (X^\top X)^{-1} X^\top Y, where Y is the n \times 1 response vector; this form previews the efficient algebraic solution without requiring iterative methods, with full details on its derivation provided in the section on ordinary least squares estimation.[25]

A central element in this representation is the projection matrix H = X (X^\top X)^{-1} X^\top, often termed the hat matrix, which orthogonally projects the response vector Y onto the column space of X to yield the fitted values \hat{Y} = H Y. This matrix H is symmetric (H^\top = H) and idempotent (H^2 = H), properties that reflect its role as an orthogonal projection operator and facilitate analytical manipulations such as variance computations. Complementing H is the residual maker matrix M = I_n - H, where I_n is the n \times n identity matrix, which produces the residuals e = M Y by projecting Y onto the orthogonal complement of the column space of X. The matrix M annihilates X such that M X = 0, ensuring residuals are uncorrelated with the predictors in the column space, and it is also symmetric (M^\top = M) and idempotent (M^2 = M). These properties underscore M's utility in isolating deviations unexplained by the model.[25]

The matrix formulation offers substantial computational advantages, particularly for large datasets, as it leverages optimized linear algebra algorithms in statistical software to perform operations like matrix inversion and multiplication efficiently, scaling to high-dimensional problems where scalar-based approaches would be prohibitive.[26]
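The following sketch (synthetic data) computes \hat{\beta}, H, and M directly from these formulas and verifies the symmetry, idempotence, and annihilation properties stated above. Forming (X^\top X)^{-1} explicitly is done only to mirror the algebra; in practice, numerically stabler routines such as a QR decomposition or np.linalg.lstsq are preferred.

```python
# Minimal sketch (synthetic data): OLS estimate, hat matrix H, and residual
# maker M computed directly from the matrix formulas.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # full column rank
beta = np.array([1.0, 2.0, -1.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)   # invertible because rank(X) = p
beta_hat = XtX_inv @ X.T @ y       # (X'X)^{-1} X'Y
H = X @ XtX_inv @ X.T              # hat matrix: projects y onto col(X)
M = np.eye(n) - H                  # residual maker

fitted = H @ y
residuals = M @ y
print(np.allclose(H, H.T), np.allclose(H @ H, H))  # H symmetric and idempotent
print(np.allclose(M @ X, 0.0))                     # M annihilates X
```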
Assumptions and Diagnostics
Core Assumptions
The validity of the linear model, particularly in the context of ordinary least squares (OLS) estimation and statistical inference, hinges on several foundational assumptions that ensure the parameter estimates are unbiased, consistent, and efficient. These assumptions pertain to the relationship between the response variable Y and predictors X, as well as the properties of the error terms \epsilon. Violations can lead to biased estimates or invalid inference, though some robustness holds in large samples via the central limit theorem (CLT). The core assumptions are linearity, independence, homoscedasticity, normality, no perfect multicollinearity, and exogeneity.[27][28][29]

Linearity requires that the conditional expectation of the response variable is a linear function of the predictors, expressed as E(Y \mid X) = X\beta, where \beta is the vector of parameters. This assumption implies that the effects of the predictors on the mean response are additive and linear in the parameters, holding the other variables fixed. It does not necessitate a linear relationship in the raw data but rather in the population model; nonlinearities can be addressed through transformations or additional terms, but the core model form must satisfy this for OLS to yield unbiased estimates.[27][28][29]

Independence assumes that the error terms \epsilon_i for different observations are statistically independent, meaning no correlation between residuals across observations, such as the autocorrelation that can arise in time series. This ensures that the variance-covariance matrix of the errors is diagonal, a condition required for OLS to attain its optimality under the Gauss-Markov theorem. Independence is crucial for the standard errors and hypothesis tests to be valid, as dependence can inflate or deflate them.[27][28]

Homoscedasticity stipulates that the variance of the errors is constant across all levels of the predictors, i.e., \text{Var}(\epsilon_i \mid X) = \sigma^2 for some constant \sigma^2. This equal spread of residuals prevents heteroscedasticity, where variance changes with X, which could lead to inefficient OLS estimates and unreliable standard errors. Under this assumption, combined with exogeneity and uncorrelated errors, OLS achieves the best linear unbiased estimator (BLUE) property.[27][28][29]

Normality posits that the errors follow a normal distribution, \epsilon_i \sim N(0, \sigma^2), which is necessary for exact finite-sample inference, such as t-tests and F-tests on coefficients. However, this assumption is not required for consistency or unbiasedness of OLS; for large samples, the CLT ensures asymptotic normality of the estimators, making inference approximately valid even without it. Normality primarily affects the distribution of test statistics in small samples.[27][28]

No perfect multicollinearity requires that the predictors are not linearly dependent, meaning the design matrix X has full column rank, so no predictor is an exact linear combination of others. This ensures that (\mathbf{X}^\top\mathbf{X})^{-1} exists, so the OLS estimates are uniquely defined. While mild multicollinearity is tolerable, perfect collinearity renders the coefficients non-identifiable.[29]

Exogeneity, or the zero conditional mean assumption, states that the errors are uncorrelated with the predictors, E(\epsilon_i \mid X) = 0, implying no omitted variables or endogeneity biasing the estimates. This strict exogeneity ensures OLS estimators are unbiased and consistent, as any correlation would violate the orthogonality condition essential for projection-based estimation.[29]
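As an informal illustration of these assumptions at work, the simulation sketch below generates data that satisfies them (linear mean, exogenous full-rank predictors, independent homoscedastic normal errors) and checks that OLS estimates center on the true coefficients across repeated samples; the particular coefficient values and sample sizes are arbitrary.

```python
# Minimal sketch (simulation): under the core assumptions, OLS estimates of
# beta average out to the true values over many repeated samples.
import numpy as np

rng = np.random.default_rng(42)
n, reps = 200, 2000
beta = np.array([0.5, 2.0, -1.0])

estimates = np.empty((reps, beta.size))
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # exogenous, full rank
    eps = rng.normal(scale=1.0, size=n)                         # iid N(0, sigma^2) errors
    y = X @ beta + eps                                          # linear mean E(Y|X) = X beta
    estimates[r], *_ = np.linalg.lstsq(X, y, rcond=None)

print(estimates.mean(axis=0))  # close to (0.5, 2.0, -1.0), reflecting unbiasedness
```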
Violation Detection and Remedies
Detecting violations of the linear model's assumptions is essential for ensuring the validity of inferences drawn from the analysis. Post-estimation diagnostics primarily focus on the residuals, which represent the differences between observed and predicted values. These tools help identify departures from linearity, independence, homoscedasticity, and normality, allowing practitioners to assess model adequacy before proceeding to remedies.[30]

To check for linearity, plots of residuals against fitted values are commonly used; a random scatter around zero indicates the assumption holds, while patterns such as curves or funnels suggest nonlinearity. For independence, particularly in time series contexts, the Durbin-Watson test detects first-order autocorrelation by computing a statistic that compares adjacent residuals; values near 2 indicate no autocorrelation, while deviations toward 0 or 4 signal positive or negative autocorrelation, respectively.[31] Heteroscedasticity is assessed via the Breusch-Pagan test, which regresses squared residuals on the predictors and tests the significance of the resulting coefficients under a chi-squared distribution; a significant result rejects constant variance.[32] Normality of residuals is evaluated using quantile-quantile (Q-Q) plots, where points aligning closely with a straight line support the assumption, and deviations indicate skewness or heavy tails.[33]

Multicollinearity among predictors can inflate variance estimates and destabilize coefficients, even if other assumptions hold. The variance inflation factor (VIF) measures this by quantifying how much the variance of a coefficient is increased due to correlation with other predictors; for each predictor, VIF is computed as 1 over (1 - R²) from regressing it on the others, with values exceeding 10 signaling severe multicollinearity requiring attention.[34]

Outliers and influential points can disproportionately affect model fit and parameter estimates. Cook's distance identifies influential observations by measuring the change in fitted values when a single data point is removed; values greater than 4/n (where n is the sample size) or exceeding an F-threshold flag potential issues, combining leverage and residual magnitude.[35] Leverage plots, based on hat values from the projection matrix, highlight high-leverage points that lie far from the mean of predictors.[35]
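Several of these diagnostics are available in standard software. The sketch below applies the Durbin-Watson statistic, the Breusch-Pagan test, VIFs, and Cook's distance to a fit on synthetic data using statsmodels; the data and printed thresholds are illustrative only.

```python
# Minimal sketch (synthetic data): a few of the diagnostics described above,
# computed with statsmodels after fitting an OLS model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 150
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)   # deliberately correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

print("Durbin-Watson:", durbin_watson(results.resid))        # near 2: no autocorrelation
lm_stat, lm_pval, _, _ = het_breuschpagan(results.resid, X)  # small p-value: heteroscedasticity
print("Breusch-Pagan p-value:", lm_pval)
print("VIFs:", [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
cooks_d, _ = results.get_influence().cooks_distance          # flag points above roughly 4/n
print("Max Cook's distance:", cooks_d.max())
```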
Once violations are detected, targeted remedies can restore assumption validity without discarding the linear framework. For heteroscedasticity, logarithmic or Box-Cox transformations stabilize variance by adjusting the scale of the response or predictors; the Box-Cox family, parameterized by λ, applies (y^λ - 1)/λ for λ ≠ 0 or log(y) for λ = 0, with maximum likelihood estimating the optimal λ to achieve homoscedasticity.[36] Robust standard errors, such as those proposed by White, adjust inference by estimating the covariance matrix in a way that accounts for heteroscedasticity, providing consistent standard errors without altering the coefficients.[37] Autocorrelation, often present in temporal data, can be addressed by including lagged dependent variables as predictors, effectively modeling the serial dependence and reducing residual correlation.[38] For multicollinearity, ridge regression introduces a penalty term (λ times the sum of squared coefficients) to the least squares objective, shrinking estimates toward zero and stabilizing them in correlated predictor spaces; λ is tuned via cross-validation or ridge traces.[34] Outliers may be handled by removing influential points identified via Cook's distance if they are verifiable errors, or by robust regression methods that downweight them, though sensitivity analyses are recommended to confirm robustness.[35] Transformations like those in the Box-Cox framework can also mitigate multiple violations simultaneously by improving overall distributional properties.[36]
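The sketch below illustrates two of these remedies on synthetic heteroscedastic data: White-type (HC3) robust standard errors via statsmodels and a Box-Cox transformation of the response via SciPy. The simulated data and chosen options are illustrative, not prescriptive.

```python
# Minimal sketch (synthetic heteroscedastic data): robust standard errors and
# a Box-Cox transformation of the response, two remedies described above.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1, 10, size=n)
y = 10.0 + 2.0 * x + rng.normal(scale=0.3 * x, size=n)  # error spread grows with x

X = sm.add_constant(x)

# Remedy 1: keep the OLS coefficients, but report heteroscedasticity-consistent
# (HC3) standard errors for inference.
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")
print(robust_fit.bse)

# Remedy 2: Box-Cox transform the (strictly positive) response, then refit;
# stats.boxcox returns the transformed data and the ML estimate of lambda.
y_bc, lam = stats.boxcox(y)
bc_fit = sm.OLS(y_bc, X).fit()
print(lam, bc_fit.params)
```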
Estimation and Inference
Ordinary Least Squares Estimation
The ordinary least squares (OLS) estimation method seeks to find the parameter vector \hat{\beta} that minimizes the sum of squared residuals, given by \sum_{i=1}^n (y_i - \mathbf{x}_i' \hat{\beta})^2, where y_i is the observed response and \mathbf{x}_i' \hat{\beta} is the predicted value.[39] This objective function measures the total deviation between observed and fitted values, penalizing larger errors more heavily through the squaring of each residual.[40] To derive the OLS estimator, differentiate the sum of squared residuals with respect to \beta and set the result to zero, yielding the normal equations X'X \beta = X'y, where X is the design matrix and y is the response vector.[41] Assuming X'X is invertible (which requires no perfect multicollinearity), the solution is \hat{\beta} = (X'X)^{-1} X'y.[39] In matrix representation, as detailed in the Mathematical Formulation section, this projection of y onto the column space of X ensures the residuals are orthogonal to the predictors.[40]

Under the assumptions of linearity in parameters, strict exogeneity of regressors, and no perfect multicollinearity, the OLS estimator is unbiased, satisfying E[\hat{\beta}] = \beta.[42] Furthermore, by the Gauss-Markov theorem, under the additional assumptions of homoscedastic and uncorrelated errors, \hat{\beta} is the best linear unbiased estimator (BLUE), possessing the minimum variance among all linear unbiased estimators of \beta. The variance-covariance matrix of the estimator is \operatorname{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}, where \sigma^2 is the error variance, highlighting how the precision of \hat{\beta} improves with more informative data in X.[43]

OLS estimation is widely implemented in statistical software for practical application. In R, the lm() function from the base stats package fits linear models using OLS by default, accepting a formula interface for specifying predictors and responses. In Python, the statsmodels library provides the OLS class in statsmodels.regression.linear_model, which computes \hat{\beta} and related statistics via methods like fit().
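As a brief usage sketch (with synthetic data), the code below fits a model with the statsmodels OLS class mentioned above and confirms that the fitted coefficients match the closed-form solution of the normal equations.

```python
# Minimal sketch (synthetic data): OLS via statsmodels, checked against the
# closed-form solution (X'X)^{-1} X'y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # prepend the column of 1s
results = sm.OLS(y, X).fit()                    # OLS fit via the array interface
print(results.params)                           # beta_hat: intercept and two slopes

beta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)    # solve the normal equations
print(np.allclose(results.params, beta_closed_form))    # True
```

The corresponding R call would be lm(y ~ x1 + x2), using the formula interface noted above.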