Line fitting
Line fitting is a statistical procedure for determining the straight line that best approximates a set of scattered data points in a two-dimensional plane, typically representing the linear relationship between an independent variable (often denoted x) and a dependent variable (often denoted y).[1] The primary goal is to minimize the discrepancies, or residuals, between the observed data and the predicted values on the line, enabling predictions, trend identification, and modeling of linear associations in fields such as statistics, economics, engineering, and the natural sciences.[2]

The most widely used technique for line fitting is the method of least squares, which constructs the line by minimizing the sum of the squared vertical residuals between each data point and the line.[3] For a dataset with n points (x_i, y_i), the line equation is \hat{y} = b_0 + b_1 x, where the slope is b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} and the intercept is b_0 = \bar{y} - b_1 \bar{x}, with \bar{x} and \bar{y} the means of the x and y values, respectively.[1] This approach assumes that errors occur primarily in the y-direction and follow a Gaussian distribution, yielding unbiased estimators under conditions such as linearity and homoscedasticity.[3] The goodness of fit is often assessed with the coefficient of determination r^2, which measures the proportion of the variance in y explained by the line and ranges from 0 to 1.[1]

Historically, the least squares method emerged in the early 19th century amid efforts to refine astronomical observations. Adrien-Marie Legendre first published the technique in 1805 in his work Nouvelles méthodes pour la détermination des orbites des comètes, presenting it as an algebraic tool for minimizing errors in planetary position calculations without probabilistic grounding.[4] Carl Friedrich Gauss, in his 1809 book Theoria Motus Corporum Coelestium, claimed prior development around 1795 and provided a theoretical justification based on the assumption of normally distributed errors, arguing that least squares yields the maximum likelihood estimate.[4] This sparked a priority dispute, though both contributions advanced the method's adoption; independently, the American mathematician Robert Adrain derived a similar approach in 1808 for surveying problems.[4]

While least squares dominates due to its mathematical simplicity and statistical properties, alternative methods address limitations such as errors in both variables or sensitivity to outliers. Total least squares (or orthogonal regression) minimizes perpendicular distances to the line and is suitable when measurement errors affect both x and y equally, as in calibration problems.[5] Robust techniques, such as M-estimation or median-based fitting, reduce the impact of outliers by replacing squared residuals with less sensitive loss functions.[6] These alternatives are particularly valuable for noisy datasets from physics or environmental science, though they often require more computational effort.[6]

In practice, line fitting underpins simple linear regression and extends to diagnostic tools such as residual plots, which are used to validate assumptions and detect non-linearity or heteroscedasticity.[3] Software such as MATLAB, R, or Python's SciPy implements these methods efficiently, facilitating applications from economic forecasting to biological growth modeling.[7]
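To make the least-squares formulas above concrete, the following Python sketch (one possible illustration, using NumPy and the SciPy library mentioned above) computes the slope b_1, intercept b_0, and r^2 for a small set of made-up data points and cross-checks the result against scipy.stats.linregress.

```python
import numpy as np
from scipy.stats import linregress

# Illustrative data (hypothetical values): x is the predictor, y the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

# Closed-form least-squares estimates from the formulas above.
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
b0 = y_bar - b1 * x_bar                                            # intercept

# Coefficient of determination r^2: proportion of variance in y explained.
y_hat = b0 + b1 * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y_bar) ** 2)

# Cross-check with SciPy's implementation of the same estimator.
result = linregress(x, y)
print(b0, b1, r_squared)
print(result.intercept, result.slope, result.rvalue ** 2)
```

Because both computations implement ordinary least squares for the simple two-variable case, the closed-form estimates and the library call should agree to floating-point precision.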
Overview

Definition
Line fitting, also known as simple linear regression, is a fundamental statistical technique used to identify the straight line that best approximates a set of two-dimensional data points, each consisting of an observed pair (x_i, y_i) for i = 1 to n.[8][9] In this context, the line is typically represented by the equation y = mx + c, where m denotes the slope (the rate of change in y per unit change in x) and c is the y-intercept (the value of y when x = 0).[8][10] The primary purposes of line fitting include summarizing underlying trends in the data, predicting future values of the dependent variable y from the independent variable x, and modeling the linear relationship between two quantitative variables.[8][9] This approach assumes that the relationship can be adequately captured by a linear function, enabling quantitative analysis in fields such as economics, biology, and engineering.[11]

Line fitting models can be distinguished as either deterministic or stochastic. A deterministic model posits an exact relationship without accounting for variability or error, so that each y is precisely determined by x via the line equation.[12][13] In contrast, a stochastic model incorporates random error terms to reflect real-world measurement inaccuracies or unexplained variation, treating the observed points as realizations scattered around the true line.[13][14] The quality of the fit in these models can be interpreted geometrically as the line that minimizes deviations from the data points in a relevant metric.[15]
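As a brief illustration of the deterministic versus stochastic distinction, the sketch below (with a hypothetical slope, intercept, and noise level chosen only for illustration) generates data from the same line both ways: once exactly, and once with additive random error.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Hypothetical true line: slope m and intercept c (made-up values).
m, c = 0.6, 41.0
x = np.linspace(0.0, 10.0, 20)

# Deterministic model: each y is exactly determined by x.
y_deterministic = m * x + c

# Stochastic model: observed y values scatter around the true line
# with additive random error of standard deviation sigma.
sigma = 1.5
y_stochastic = m * x + c + rng.normal(0.0, sigma, size=x.size)
```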
Geometric interpretations

In line fitting, the vertical distance quantifies the discrepancy between an observed data point (x_i, y_i) and the corresponding point on the fitted line (x_i, \hat{y}_i), defined as the residual e_i = y_i - \hat{y}_i, where \hat{y}_i = m x_i + c and m, c are the slope and intercept parameters.[16] This measure assumes that the independent variable x_i is measured without error, while deviations occur only in the dependent variable y_i, making it suitable for models where x is controlled or precisely known.[17]

The perpendicular, or orthogonal, distance provides a more symmetric geometric measure: the shortest Euclidean distance from a data point to the fitted line, expressed as d_i = \frac{|a x_i + b y_i + c|}{\sqrt{a^2 + b^2}} for the line equation a x + b y + c = 0. This distance treats errors in the x and y directions equally, projecting the point orthogonally onto the line rather than vertically.[18]

The distinction between vertical and perpendicular distances is critical: vertical distances are appropriate for asymmetric error models where predictions focus on y given x, whereas perpendicular distances account for isotropic errors in both variables, leading to a more balanced fit in scenarios such as calibration or principal component analysis. For instance, in ordinary least squares, residuals appear as vertical segments in visualizations, emphasizing vertical deviations, while total least squares uses perpendicular segments to minimize orthogonal offsets.[19]

Scatter plots illustrate these concepts effectively by overlaying the fitted line on the data cloud and depicting residual vectors as arrows from each point to the line. In a classic example using brushtail possum measurements, a scatter plot of head length versus total length shows the regression line \hat{y} = 41 + 0.59x with vertical residuals, such as -1.1 for the point (77.0, 85.3) and +7.45 for (85.0, 98.6), highlighting how points above the line contribute positive residuals and those below contribute negative ones.[16] Similarly, in studies of alcohol consumption and muscle strength, residual plots against fitted values show random scatter around zero if the linear model holds, with vertical distances mirroring deviations from the line in the original scatter plot.[17] These visualizations underscore the geometric quality of the fit, where tight clustering of residuals indicates good alignment.
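The two distance measures can be compared directly using the possum example quoted above; the short sketch below evaluates the signed vertical residual and the orthogonal distance to the line \hat{y} = 41 + 0.59x (the helper function names are illustrative, not from any particular library).

```python
import math

# Fitted line from the possum example in the text: y_hat = 41 + 0.59 * x.
m, c = 0.59, 41.0

def vertical_residual(x, y):
    """Signed vertical distance from (x, y) to the line y = m*x + c."""
    return y - (m * x + c)

def perpendicular_distance(x, y):
    """Shortest (orthogonal) distance from (x, y) to the same line,
    written in general form m*x - y + c = 0 (so a = m, b = -1)."""
    return abs(m * x - y + c) / math.sqrt(m ** 2 + 1.0)

print(vertical_residual(77.0, 85.3))       # about -1.1, as quoted above
print(vertical_residual(85.0, 98.6))       # about +7.45
print(perpendicular_distance(77.0, 85.3))  # smaller than |vertical residual|
```

For any point, the orthogonal distance equals the absolute vertical residual divided by \sqrt{1 + m^2}, so it is never larger than the vertical deviation.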
Mathematical foundations

Model specification
In line fitting, the parametric linear model describes the relationship between a response variable y_i and a predictor variable x_i for i = 1, \dots, n observations as

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
where \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon_i is the error term representing the deviation from the true line.[20] This model posits that the expected value of y_i given x_i follows a straight line, with the errors capturing unexplained variability.

The error terms \varepsilon_i are typically assumed to be independent and identically distributed as \varepsilon_i \sim N(0, \sigma^2), that is, Gaussian with mean zero and constant variance \sigma^2.[21] This normality assumption facilitates statistical inference, such as confidence intervals and hypothesis tests for the parameters. The model can, however, be generalized to non-normal error distributions, as in generalized linear models, depending on the characteristics of the data.[21]

A key aspect of the error structure is homoscedasticity, in which the variance \sigma^2 remains constant across all levels of x_i; in contrast, heteroscedasticity occurs when the error variance varies with x_i, potentially requiring adjusted estimation techniques.[22] In the context of fitting a line to bivariate points, these errors usually correspond to vertical distances from the points to the line, though geometric interpretations may instead involve perpendicular distances.[20] Simple linear regression represents a special case of line fitting in which the predictors x_i are treated as fixed and non-stochastic and errors are confined to the response direction.[23]
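The assumptions in this specification can be illustrated by simulation. The sketch below (with arbitrary, made-up parameter values) draws homoscedastic Gaussian errors around a known line, refits the line with scipy.stats.linregress, and checks that the residuals behave roughly as the model predicts.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)  # seeded for reproducibility

# Hypothetical true parameters of the model y_i = beta0 + beta1*x_i + eps_i.
beta0, beta1, sigma = 2.0, 0.5, 0.3
n = 200
x = rng.uniform(0.0, 10.0, size=n)

# Homoscedastic errors: constant variance sigma^2 at every level of x.
eps = rng.normal(0.0, sigma, size=n)
y = beta0 + beta1 * x + eps

# Fit the line and inspect the residuals; under the model assumptions
# they should scatter around zero with roughly constant spread.
fit = linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)
print(fit.intercept, fit.slope)           # close to beta0 and beta1
print(residuals.mean(), residuals.std())  # near 0 and near sigma
```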