Simple linear regression
Simple linear regression is a statistical method that models the relationship between two continuous variables: one independent variable (predictor) and one dependent variable (response), assuming a straight-line relationship between them.[1] The model is expressed as y = \beta_0 + \beta_1 x + \epsilon, where y is the dependent variable, x is the independent variable, \beta_0 is the y-intercept, \beta_1 is the slope, and \epsilon represents the random error term.[2] This technique enables prediction of the dependent variable from the independent variable and is foundational in fields such as economics, biology, and engineering for analyzing linear associations.[3]

The origins of simple linear regression trace back to the late 19th century, when Sir Francis Galton developed the concept while studying heredity and the phenomenon of regression toward the mean in biological traits, such as the heights of parents and children.[4] Building on earlier work in least squares estimation by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss, Galton introduced the term "regression" to describe the tendency of extreme values to move toward the average in subsequent generations.[5] Karl Pearson later formalized the mathematical framework in the early 20th century, extending it through correlation analysis, which solidified linear regression as a core tool in statistical inference.[6]

To ensure valid inferences, simple linear regression relies on several key assumptions: linearity (the true relationship is linear in parameters), independence of observations, homoscedasticity (constant variance of residuals across levels of the independent variable), and normality of the error terms.[7] Violations of these assumptions, such as nonlinearity or heteroscedasticity, can lead to biased estimates or invalid predictions, necessitating diagnostic checks like residual plots.[8]

Parameter estimates are typically obtained via ordinary least squares (OLS), which minimizes the sum of squared residuals between observed and predicted values, providing unbiased and efficient estimators under the model assumptions.[9] In practice, simple linear regression is widely applied for prediction, hypothesis testing on the slope (to assess the significance of the relationship), and understanding associative patterns in data, though it cannot establish causation without additional experimental design.[10] Extensions include multiple linear regression for more predictors and robust methods for handling assumption violations, but the simple form remains essential for introductory statistical modeling due to its interpretability and computational simplicity.

Model and Assumptions
Definition and Model Equation
Simple linear regression is a fundamental statistical technique used to model and analyze the linear relationship between a single predictor variable, denoted as X, and a response variable, denoted as Y. It posits that the expected value of the response variable can be expressed as a straight-line function of the predictor, enabling predictions and inferences about how changes in X affect Y. This method is widely applied in fields such as economics, biology, and engineering to quantify associations in bivariate data.[1]

The population model for simple linear regression is given by the equation

Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i = 1, 2, \dots, n,

where Y_i is the i-th observation of the response variable, X_i is the corresponding predictor value, \beta_0 represents the y-intercept (the expected value of Y when X = 0), \beta_1 denotes the slope (the expected change in Y for a one-unit increase in X), and \epsilon_i is the random error term for the i-th observation, assumed to be independent with mean zero and constant variance \sigma^2. The error terms \epsilon_i capture the unexplained variation in Y after accounting for the linear effect of X.[11][10]

In practice, with a sample of n data points drawn from the population, the parameters \beta_0 and \beta_1 are unknown and must be estimated, typically using Roman letters such as b_0 and b_1 to distinguish sample estimates from the true population values represented by Greek letters. This distinction underscores the inferential nature of regression analysis, where sample-based estimates inform broader population characteristics.[2]
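To make the notation concrete, the following sketch generates a sample from the population model Y_i = \beta_0 + \beta_1 X_i + \epsilon_i. The parameter values, sample size, and noise level are arbitrary choices for illustration, not values implied by the article.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative "true" population parameters (arbitrary assumptions).
beta0, beta1, sigma = 2.0, 0.5, 1.0
n = 50

x = rng.uniform(0, 10, size=n)      # predictor values X_i
eps = rng.normal(0, sigma, size=n)  # errors eps_i with mean 0 and variance sigma^2
y = beta0 + beta1 * x + eps         # responses Y_i = beta_0 + beta_1 * X_i + eps_i
```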
Key Assumptions
The simple linear regression model is built upon a set of classical assumptions that underpin the validity of parameter estimation and statistical inference. These assumptions ensure that the model's predictions align with the underlying data-generating process and that the ordinary least squares (OLS) estimators possess desirable properties, such as unbiasedness and minimum variance under the Gauss-Markov theorem.[12] While the core assumptions apply to both simple and multiple regression, in the simple case they simplify due to the presence of only one predictor variable.

Linearity: The primary assumption is that the conditional expected value of the response variable Y given the predictor X is a linear function of X, expressed as E(Y \mid X) = \beta_0 + \beta_1 X. This posits that the mean response changes linearly with the predictor, allowing the model to capture a straight-line relationship without curvature or higher-order terms.[7] Violation of linearity, such as when the true relationship is quadratic, can lead to biased estimates, though graphical diagnostics like scatterplots can help detect this.[13]

Independence: The errors \varepsilon_i across observations must be independent, meaning that the value of one error does not influence another. This assumption arises from the requirement that the data constitute a random sample, ensuring no serial correlation or dependence structure, such as in time-series data.[7] In the simple linear regression context, independence implies that observations are drawn without clustering or autocorrelation, which is crucial for the validity of standard errors.[14]

Homoscedasticity: The variance of the errors is constant across all levels of the predictor, so \text{Var}(\varepsilon_i) = \sigma^2 for all i, regardless of X. This equal spread of residuals around the regression line rules out heteroscedasticity, where the variance increases or decreases with X, which would make the usual standard error formulas unreliable.[7] The assumption is part of the Gauss-Markov conditions that make OLS the best linear unbiased estimator (BLUE).[14]

Normality: For exact finite-sample inference, such as t-tests and F-tests, the errors are assumed to be normally distributed, \varepsilon_i \sim N(0, \sigma^2). This Gaussian assumption facilitates the derivation of the sampling distribution of the OLS estimators.[7] However, it is not required for consistency or unbiasedness; in large samples, the central limit theorem ensures asymptotic normality of the estimators even under non-normal errors.

No perfect multicollinearity: In simple linear regression, this reduces to the requirement that the predictor X is not constant across all observations, ensuring enough variation in X to estimate \beta_1. Without this, the model parameters cannot be uniquely identified.[12]

Violations of these assumptions can compromise the model's reliability, but simple linear regression is robust in several ways, particularly with large sample sizes. For instance, breaches of homoscedasticity or normality often do not severely affect point estimates, though they may affect inference; asymptotic theory supports valid hypothesis tests as the number of observations grows.[15] Violations of linearity and independence, however, tend to have more pronounced effects, potentially requiring model respecification or alternative methods.[15]
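As a rough, informal illustration of such diagnostic checks (a sketch only, not a formal test), the helper below inspects the residuals from a fitted line: their mean should be near zero, and their spread should not change systematically with the predictor. The inputs x, y, and yhat are assumed to be NumPy arrays of observed predictors, responses, and fitted values.

```python
import numpy as np

def residual_diagnostics(x, y, yhat):
    """Crude checks of the error assumptions based on residuals e_i = y_i - yhat_i."""
    e = y - yhat
    # Zero-mean errors: the residuals should average out near zero.
    mean_resid = e.mean()
    # Homoscedasticity: compare residual spread in the lower and upper halves of x;
    # a ratio far from 1 suggests the variance changes with the predictor.
    order = np.argsort(x)
    lower = e[order[: len(e) // 2]]
    upper = e[order[len(e) // 2:]]
    spread_ratio = upper.std(ddof=1) / lower.std(ddof=1)
    return {"mean_residual": mean_resid, "spread_ratio": spread_ratio}
```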
Estimation Methods
Ordinary Least Squares
Ordinary least squares (OLS) is the primary method for estimating the parameters of the simple linear regression model, introduced by Adrien-Marie Legendre in 1805 as a technique for fitting lines to observational data by minimizing the sum of squared errors.[16] The core principle of OLS is to select the intercept b_0 and slope b_1 that minimize the sum of squared residuals (the error sum of squares, SSE), defined as \text{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2, where \hat{Y}_i = b_0 + b_1 X_i represents the predicted value for the i-th observation.[17] This minimization criterion measures errors vertically, from the observed response Y to the line, emphasizing the model's predictive accuracy for Y given X.[18]

Geometrically, the OLS regression line can be interpreted as the straight line passing through the centroid (\bar{X}, \bar{Y}) of the data cloud that minimizes the sum of the squared vertical distances from each data point to the line.[18] This property ensures the line balances the data around the mean, providing an intuitive visual representation of the best linear fit in the plane spanned by the observations.[17]

To derive the OLS estimates, the sum of squared residuals is treated as a function of b_0 and b_1, and its partial derivatives are set to zero, yielding a system of linear equations known as the normal equations:

\sum_{i=1}^n Y_i = n b_0 + b_1 \sum_{i=1}^n X_i,
\sum_{i=1}^n X_i Y_i = b_0 \sum_{i=1}^n X_i + b_1 \sum_{i=1}^n X_i^2.

These equations arise directly from the calculus-based optimization and form the foundation for solving the parameter estimates.[17][18]

Under the assumptions of linearity in parameters and strict exogeneity (E[\varepsilon \mid X] = 0), the OLS estimators are unbiased, meaning their expected values equal the true population parameters.[19] Furthermore, the Gauss-Markov theorem establishes that OLS produces the best linear unbiased estimators (BLUE), with the smallest variance among all linear unbiased estimators, provided the additional assumption of homoscedasticity holds.[20][21] OLS is also computationally straightforward, relying solely on sums and products of the data, which facilitates its implementation even with limited resources.[18]
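As a sketch of this derivation, the normal equations can be written as a 2×2 linear system in (b_0, b_1) and solved directly. The snippet below is illustrative and assumes x and y are NumPy arrays of equal length.

```python
import numpy as np

def ols_via_normal_equations(x, y):
    """Solve the two normal equations for the intercept b0 and slope b1."""
    n = len(x)
    # Normal equations in matrix form:
    #   sum(y)   = n * b0      + b1 * sum(x)
    #   sum(x*y) = b0 * sum(x) + b1 * sum(x**2)
    A = np.array([[n, x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    b0, b1 = np.linalg.solve(A, rhs)
    return b0, b1
```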
Coefficient Formulas
The ordinary least squares (OLS) estimators for the coefficients in simple linear regression are obtained by solving the normal equations, which minimize the sum of squared residuals.[2] The slope estimator b_1 is given by the sample covariance of X and Y divided by the sample variance of X:

b_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)},

where \bar{X} and \bar{Y} are the sample means of the predictor and response variables, respectively.[2] This formulation, rooted in the method of least squares introduced by Adrien-Marie Legendre in 1805, expresses the slope as a measure of linear association scaled by the variability in X.[22]

An alternative computational form for the slope, useful for direct calculation from raw data, is

b_1 = \frac{n \sum_{i=1}^n X_i Y_i - \left( \sum_{i=1}^n X_i \right) \left( \sum_{i=1}^n Y_i \right)}{n \sum_{i=1}^n X_i^2 - \left( \sum_{i=1}^n X_i \right)^2}.

This expression avoids explicit computation of means and is equivalent to the covariance form.[2] The intercept estimator b_0 is then

b_0 = \bar{Y} - b_1 \bar{X},

ensuring the regression line passes through the point of means (\bar{X}, \bar{Y}).[2]

The fitted values for the response variable are predicted by the estimated model: \hat{Y}_i = b_0 + b_1 X_i for each observation i = 1, \dots, n.[2] The residuals, which represent the deviations between observed and fitted values, are defined as e_i = Y_i - \hat{Y}_i.[2]
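A minimal sketch of these closed-form estimators, assuming x and y are equal-length NumPy arrays, is shown below; it returns the intercept, slope, fitted values, and residuals defined above.

```python
import numpy as np

def ols_closed_form(x, y):
    """Compute b1 = Sxy / Sxx and b0 = ybar - b1 * xbar, plus fitted values and residuals."""
    xbar, ybar = x.mean(), y.mean()
    sxy = ((x - xbar) * (y - ybar)).sum()  # sum of cross-deviations (numerator of b1)
    sxx = ((x - xbar) ** 2).sum()          # sum of squared deviations in x (denominator)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    yhat = b0 + b1 * x                     # fitted values
    resid = y - yhat                       # residuals e_i
    return b0, b1, yhat, resid
```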
Interpretation
Slope Meaning
In simple linear regression, the slope coefficient, denoted b_1, represents the estimated change in the expected value of the response variable Y for each one-unit increase in the predictor variable X.[23] This interpretation holds under the model's assumptions, where no other factors are involved, providing a measure of the average linear association between X and Y.[24]

The sign of b_1 indicates the direction of this association: a positive value suggests a direct relationship, where increases in X are associated with increases in Y, while a negative value implies an inverse relationship, with increases in X linked to decreases in Y.[23] For instance, in a model relating height to weight, a positive b_1 would mean that taller individuals tend to weigh more, with the magnitude specifying the average weight gain per additional unit of height.[23] The units of b_1 are determined by the scales of Y and X, specifically units of Y per unit of X, ensuring the coefficient's interpretability remains tied to the data's measurement context.[23] A slope of zero indicates no linear association between X and Y, implying that changes in X do not systematically predict changes in Y under the model.[25]

Furthermore, b_1 is directly related to the covariance between X and Y, scaled by the inverse of the variance of X, which quantifies how the joint variability of the variables contributes to the estimated linear effect.[26] This connection underscores the slope's role in capturing the strength and direction of the linear dependence relative to the predictor's spread.[27]

Intercept Meaning
In simple linear regression, the estimated intercept, denoted b_0, gives the predicted value \hat{Y} when the predictor variable X equals zero; the corresponding population parameter \beta_0 is the expected value of the response Y at X = 0.[28][29] This interpretation follows directly from the model equation E(Y \mid X = x) = \beta_0 + \beta_1 x.[10]

However, the practical relevance of the intercept can be limited if X = 0 falls outside the observed range of the data or represents an impossible scenario in the context of the variables.[28] For instance, in a regression model predicting weight from height, an intercept implying a negative weight at zero height lacks physical meaning, as heights are positive.[28] In such cases, the intercept serves more as a mathematical adjustment than a substantive prediction.[10]

The ordinary least squares estimate of the intercept ensures that the fitted regression line passes through the point of means (\bar{X}, \bar{Y}), which centers the model around the data.[30] This property is reflected in the formula b_0 = \bar{Y} - b_1 \bar{X}, guaranteeing that the predicted value at the average predictor equals the average response.[30] Omitting the intercept by setting b_0 = 0 fundamentally alters the model, forcing the line through the origin and potentially biasing estimates unless theoretically justified.

Correlation Coefficient
The Pearson correlation coefficient, denoted as r, is a standardized measure of the strength and direction of the linear relationship between two variables, X and Y. It is defined as the covariance between X and Y divided by the product of their standard deviations:

r = \frac{\text{Cov}(X, Y)}{s_X s_Y},

where s_X and s_Y are the sample standard deviations of X and Y, respectively. Equivalently, it can be computed as

r = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2 \sum_{i=1}^n (Y_i - \bar{Y})^2}},

which normalizes the data by centering at the means \bar{X} and \bar{Y}. The value of r ranges from -1 to 1, where r = 1 indicates a perfect positive linear relationship, r = -1 a perfect negative linear relationship, and r = 0 no linear association.[31][32]

In the context of simple linear regression, the correlation coefficient is closely related to the slope estimate b_1. Specifically, r = b_1 \frac{s_X}{s_Y}, or equivalently, b_1 = r \frac{s_Y}{s_X}, showing that the sign of r always matches the sign of the slope, while its magnitude reflects the slope scaled by the ratio of standard deviations. This relationship highlights how r standardizes the association to be unitless, unlike the slope, which depends on the units of X and Y.[33]

The absolute value |r| indicates the strength of the linear association: values near 1 suggest a strong linear relationship, while values near 0 indicate weak or no linear association. The sign of r conveys the direction: positive for direct association and negative for inverse. Additionally, the square of the correlation coefficient, R^2 = r^2, is the coefficient of determination, representing the proportion of the variance in Y that is explained by the linear variation in X under the regression model. For example, if r = 0.8, then R^2 = 0.64, meaning 64% of the variability in Y is accounted for by X.[34][33]

Despite its utility, the Pearson correlation coefficient has notable limitations. It measures only linear associations and may detect no correlation even for strong nonlinear relationships, such as quadratic patterns. Furthermore, it is highly sensitive to outliers, which can disproportionately influence the coefficient and lead to misleading interpretations of the association strength.[35][36]
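The sketch below computes r from its definition and recovers the OLS slope from it via b_1 = r (s_Y / s_X); the function names are illustrative and the inputs are assumed to be NumPy arrays.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation r = Cov(x, y) / (s_x * s_y)."""
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

def slope_from_r(x, y):
    """Recover the OLS slope via b1 = r * (s_y / s_x); R^2 is simply r ** 2."""
    r = pearson_r(x, y)
    return r * y.std(ddof=1) / x.std(ddof=1)
```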
Statistical Properties
Unbiasedness
In simple linear regression, the ordinary least squares (OLS) estimators b_0 and b_1 are unbiased, meaning their expected values equal the true population parameters: E(b_1) = \beta_1 and E(b_0) = \beta_0.[37] This property holds under the core assumptions of the model, specifically linearity in parameters and the strict exogeneity condition that the errors have zero conditional mean given the predictors, E(\varepsilon_i \mid X_i) = 0.[21]

To demonstrate unbiasedness for the slope estimator, consider the formula

b_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2},

where \bar{X} and \bar{Y} are the sample means. Substituting the model Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i yields Y_i - \bar{Y} = \beta_1 (X_i - \bar{X}) + (\varepsilon_i - \bar{\varepsilon}), so

b_1 = \beta_1 + \frac{\sum_{i=1}^n (X_i - \bar{X})(\varepsilon_i - \bar{\varepsilon})}{\sum_{i=1}^n (X_i - \bar{X})^2}.

Taking expectations,

E(b_1) = \beta_1 + E\left[ \frac{\sum_{i=1}^n (X_i - \bar{X})(\varepsilon_i - \bar{\varepsilon})}{\sum_{i=1}^n (X_i - \bar{X})^2} \right].

Under the assumptions, the expected value of the second term is zero because E(\varepsilon_i) = 0 for all i and the errors are independent of the predictors (treating X as fixed or conditioning on X).[37][18]

For the intercept estimator, b_0 = \bar{Y} - b_1 \bar{X}. The sample mean \bar{Y} is unbiased for its expectation, E(\bar{Y}) = \beta_0 + \beta_1 \bar{X}, and since E(b_1) = \beta_1, it follows that E(b_0) = E(\bar{Y}) - E(b_1) \bar{X} = \beta_0.[37]

Unbiasedness requires only linearity and the zero-conditional-mean (exogeneity) assumption on the errors; it does not depend on normality of the errors or on homoscedasticity.[19] Under the Gauss-Markov theorem, which assumes linearity, strict exogeneity, homoscedasticity, and no serial correlation in the errors, the OLS estimators are the best linear unbiased estimators (BLUE), meaning they have the minimum variance among all linear unbiased estimators.[38] This theorem, originally developed by Carl Friedrich Gauss in his 1821 work on least squares, provides the foundational justification for OLS in linear models.[39]
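Because unbiasedness is a statement about repeated sampling, a small Monte Carlo sketch can illustrate it: over many simulated samples from a known model, the average of the slope estimates settles near the true \beta_1. The parameter values, sample size, and number of replications below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
beta0, beta1, sigma = 1.0, 2.0, 3.0   # illustrative "true" parameters
n, reps = 30, 5000                    # sample size and number of replications

x = rng.uniform(0, 10, size=n)        # fixed design reused across replications
slopes = np.empty(reps)
for i in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    xd = x - x.mean()
    slopes[i] = (xd * (y - y.mean())).sum() / (xd ** 2).sum()  # OLS slope b1

# The Monte Carlo average of b1 should land close to the true beta1 = 2.0,
# illustrating E(b1) = beta1.
print(slopes.mean())
```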
Variances of Estimators
In simple linear regression, the ordinary least squares (OLS) estimators b_0 and b_1 are random variables due to the stochastic nature of the error terms, and their variances quantify the sampling variability around their expected values. Under the standard assumptions of the linear model (linearity, strict exogeneity, homoscedasticity with constant error variance \sigma^2, and no perfect collinearity), the variance of the slope estimator is given by

\text{Var}(b_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2},

where \sum_{i=1}^n (x_i - \bar{x})^2 denotes the total variation in the predictor variable x, often abbreviated as S_{xx}. This formula arises under the Gauss-Markov conditions, which establish the OLS estimators as the best linear unbiased estimators with minimum variance.[40]

The variance of the intercept estimator is

\text{Var}(b_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right),

reflecting contributions from both the sample size n and the position of the mean \bar{x} relative to the spread in x. Larger values of \sum_{i=1}^n (x_i - \bar{x})^2 reduce \text{Var}(b_1), improving precision by leveraging greater dispersion in the predictors, while increasing n diminishes \text{Var}(b_0) through the 1/n term.[40] The covariance between b_0 and b_1 is \text{Cov}(b_0, b_1) = -\bar{x} \sigma^2 / \sum_{i=1}^n (x_i - \bar{x})^2, indicating a negative dependence that strengthens when \bar{x} is farther from zero.

Since \sigma^2 is unknown in practice, it is estimated unbiasedly by s^2 = \sum_{i=1}^n e_i^2 / (n-2), where e_i = y_i - \hat{y}_i are the residuals and n-2 reflects the degrees of freedom lost to estimating two parameters. The standard error of the slope, s_{b_1} = s / \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}, then provides an estimate of \sqrt{\text{Var}(b_1)} for inference purposes, assuming homoscedasticity holds.[40] These expressions highlight how estimator precision depends on the error variance and the data configuration, with violations of homoscedasticity potentially invalidating these variance formulas.
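A short sketch of these calculations, assuming x and y are NumPy arrays and b_0, b_1 are OLS estimates obtained as above, is given below; it returns the residual standard error together with the estimated standard errors of both coefficients.

```python
import numpy as np

def ols_standard_errors(x, y, b0, b1):
    """Estimate sigma^2 by s^2 = SSE / (n - 2) and return (s, se_b0, se_b1)."""
    n = len(x)
    resid = y - (b0 + b1 * x)
    s2 = (resid ** 2).sum() / (n - 2)              # unbiased estimate of sigma^2
    sxx = ((x - x.mean()) ** 2).sum()              # S_xx, the spread of the predictor
    se_b1 = np.sqrt(s2 / sxx)                      # estimated sqrt(Var(b1))
    se_b0 = np.sqrt(s2 * (1 / n + x.mean() ** 2 / sxx))
    return np.sqrt(s2), se_b0, se_b1
```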
Inference Procedures
Confidence Intervals
In simple linear regression, confidence intervals quantify the uncertainty around estimates of the regression coefficients and predicted values by providing a range likely to contain the true population values with a specified confidence level, such as 95%. These intervals rely on the t-distribution with n-2 degrees of freedom to account for the additional variability from estimating the error variance \sigma^2 with s^2. The standard errors used in these intervals derive from the variances of the estimators, ensuring the intervals reflect the sampling variability under the model's assumptions.[41] The confidence interval for the slope coefficient \beta_1 is constructed as
b_1 \pm t_{\alpha/2, n-2} \, s_{b_1},
where s_{b_1} = s / \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} is the standard error of the slope, and s = \sqrt{\sum_{i=1}^n (y_i - \hat{y}_i)^2 / (n-2)} is the residual standard error. Similarly, the interval for the intercept \beta_0 is
b_0 \pm t_{\alpha/2, n-2} \, s_{b_0},
with s_{b_0} = s \sqrt{1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2}. These intervals indicate the plausible range for the true coefficients, with narrower widths for larger sample sizes or stronger linear relationships.[42][43] For the mean response at a specific predictor value x_0, the confidence interval estimates the expected value of y and is given by
\hat{y}_0 \pm t_{\alpha/2, n-2} \, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}.
This formula incorporates the leverage of x_0 relative to the data, making the interval wider when x_0 is distant from \bar{x}, as extrapolation increases uncertainty.[41] The prediction interval for an individual future observation at x_0 extends beyond the mean response to account for both the uncertainty in the mean and the inherent variability of a single y, yielding
\hat{y}_0 \pm t_{\alpha/2, n-2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}.
The extra term under the square root ensures this interval is always wider than the corresponding interval for the mean response, reflecting the added variability of individual observations. Both types of interval widen as x_0 moves away from the center of the observed predictor values, emphasizing the risks of extrapolation.[44] In large samples, an asymptotic approximation replaces the t critical value with the z critical value from the standard normal distribution, particularly when strict normality of the errors is not assumed, though the t-based approach remains preferred for smaller samples.[41]
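A sketch of these interval formulas is given below; it assumes x and y are NumPy arrays, uses scipy.stats for the t critical value, and returns the confidence interval for the slope together with the mean-response and prediction intervals at a chosen x_0.

```python
import numpy as np
from scipy import stats

def regression_intervals(x, y, x0, conf=0.95):
    """Intervals for the slope, the mean response at x0, and a new observation at x0."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = ((x - xbar) ** 2).sum()
    b1 = ((x - xbar) * (y - ybar)).sum() / sxx
    b0 = ybar - b1 * xbar
    s = np.sqrt(((y - b0 - b1 * x) ** 2).sum() / (n - 2))  # residual standard error
    tcrit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)      # t critical value

    slope_ci = (b1 - tcrit * s / np.sqrt(sxx), b1 + tcrit * s / np.sqrt(sxx))
    lev = 1 / n + (x0 - xbar) ** 2 / sxx                   # leverage-type term at x0
    yhat0 = b0 + b1 * x0
    mean_ci = (yhat0 - tcrit * s * np.sqrt(lev), yhat0 + tcrit * s * np.sqrt(lev))
    pred_pi = (yhat0 - tcrit * s * np.sqrt(1 + lev), yhat0 + tcrit * s * np.sqrt(1 + lev))
    return slope_ci, mean_ci, pred_pi
```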
Hypothesis Testing
In simple linear regression, hypothesis testing is used to assess the significance of the relationship between the predictor variable x and the response variable y. The primary test focuses on the slope parameter \beta_1, evaluating whether it differs from zero, which would indicate no linear relationship. The null hypothesis is H_0: \beta_1 = 0 (no linear association), and the alternative hypothesis is H_a: \beta_1 \neq 0 (a linear association exists).[45] The test statistic is the t-ratio, given by

t = \frac{b_1}{s_{b_1}},

where b_1 is the estimated slope and s_{b_1} is its standard error. Under H_0, this statistic follows a t-distribution with n-2 degrees of freedom, where n is the sample size.[45][46]

For the overall model fit, an F-test examines whether the regression explains a significant portion of the variability in y. The null hypothesis is again H_0: \beta_1 = 0, testing whether the model improves on a horizontal line through the mean. The test statistic is

F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/1}{\text{SSE}/(n-2)},

where SSR is the regression sum of squares (variability explained by the model) and SSE is the error sum of squares (unexplained variability). This simplifies to F = \frac{R^2}{1 - R^2} \cdot (n-2), with R^2 as the coefficient of determination. Under H_0, F follows an F-distribution with 1 and n-2 degrees of freedom. In simple linear regression, the F-test is mathematically equivalent to the square of the t-test for the slope, as F = t^2.[47][48]

Decisions in hypothesis testing rely on p-values, which give the probability of observing a test statistic at least as extreme as the calculated value under H_0. The null hypothesis is rejected if the p-value is less than the significance level \alpha (commonly 0.05), indicating sufficient evidence of a linear relationship. Critical values from the t- or F-distribution can also be used for comparison.[45][46]

The analysis of variance (ANOVA) table provides a structured summary for these tests, decomposing the total sum of squares (SST) into SSR and SSE components: \text{SST} = \text{SSR} + \text{SSE}, where SST measures the total variability in y around its mean. The table includes degrees of freedom (df: 1 for regression, n-2 for error, n-1 total), mean squares (MSR = SSR/1, MSE = SSE/(n-2)), the F-statistic, and the p-value. This breakdown quantifies how much variance the model captures versus random error.[47][48]

Power considerations in these tests highlight the importance of sample size for detecting true effects. The power (1 - \beta) is the probability of rejecting H_0 when it is false, and it depends on the effect size (e.g., the standardized slope), the significance level \alpha, and n. For instance, detecting a small slope difference often requires larger samples, with formulas or software like G*Power used to compute the required n for a desired power (e.g., 0.80).[49][50]
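The sketch below works through these tests and the ANOVA decomposition for arrays x and y, using scipy.stats for the p-values; the function name and return structure are illustrative choices.

```python
import numpy as np
from scipy import stats

def slope_tests(x, y):
    """t- and F-tests of H0: beta1 = 0, with the ANOVA decomposition SST = SSR + SSE."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = ((x - xbar) ** 2).sum()
    b1 = ((x - xbar) * (y - ybar)).sum() / sxx
    b0 = ybar - b1 * xbar
    yhat = b0 + b1 * x

    sst = ((y - ybar) ** 2).sum()   # total sum of squares
    sse = ((y - yhat) ** 2).sum()   # error (residual) sum of squares
    ssr = sst - sse                 # regression sum of squares

    mse = sse / (n - 2)
    t_stat = b1 / np.sqrt(mse / sxx)
    f_stat = (ssr / 1) / mse        # equals t_stat ** 2 in simple regression
    p_t = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value for the t-test
    p_f = stats.f.sf(f_stat, dfn=1, dfd=n - 2)    # p-value for the F-test
    return {"t": t_stat, "F": f_stat, "p_t": p_t, "p_F": p_f,
            "SSR": ssr, "SSE": sse, "SST": sst}
```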
Numerical Example
Data Setup
To illustrate simple linear regression, consider data from a sample of 10 students examining the relationship between height (in inches) and weight (in pounds).[51] The dataset consists of the following paired observations:

| Height (inches) | Weight (pounds) |
|---|---|
| 63 | 127 |
| 64 | 121 |
| 66 | 142 |
| 69 | 157 |
| 69 | 162 |
| 71 | 156 |
| 71 | 169 |
| 72 | 165 |
| 73 | 181 |
| 75 | 208 |
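For readers following along in code, the paired observations in the table can be entered directly as arrays; the variable names below are arbitrary.

```python
import numpy as np

# Heights (inches) and weights (pounds) for the 10 students in the table above.
height = np.array([63, 64, 66, 69, 69, 71, 71, 72, 73, 75], dtype=float)
weight = np.array([127, 121, 142, 157, 162, 156, 169, 165, 181, 208], dtype=float)
```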