Regression
Regression is a term with multiple meanings across various fields. In mathematics and statistics, it refers to regression analysis, a method for modeling relationships between variables; regression toward the mean, a statistical phenomenon; and related concepts like linear and nonlinear regression. In computing and software engineering, it denotes regression testing, the process of verifying that recent changes have not adversely affected existing functionality, and performance regression, a decline in system performance. In psychology and hypnosis, it includes hypnotic regression, a technique to recover memories, and age regression therapy, a therapeutic practice involving reversion to earlier developmental stages. In biology and medicine, it describes developmental regression, the loss of acquired skills, and tumor regression, the reduction in tumor size. In physical sciences, it covers regression of the nodes in astronomy and secular regression in orbital mechanics. In arts and entertainment, it appears in themes of regression in literature, film, and television.
Mathematics and Statistics
Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.[1] In simple linear regression, there is one independent variable (predictor), allowing the estimation of how changes in that predictor affect the dependent variable.[2] Multiple linear regression extends this to include two or more independent variables, capturing more complex relationships while assuming linearity in the parameters.[3] The core model for simple linear regression is expressed as: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i where Y_i is the observed value of the dependent variable for the i-th observation, \beta_0 is the y-intercept (the expected value of Y when X=0), \beta_1 is the slope coefficient (indicating the change in Y for a one-unit increase in X), X_i is the value of the independent variable, and \epsilon_i is the random error term representing unexplained variation.[2] For multiple linear regression, the equation generalizes to Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \epsilon_i, where p is the number of predictors.[3] Linear regression relies on several key assumptions to ensure valid inferences. 
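The estimators for this model can be computed directly from data. The following is a minimal sketch with made-up numbers, using the standard closed-form OLS formulas; no statistical library is required:

```python
# Simple linear regression fit "by hand" using the ordinary least
# squares (OLS) closed-form estimators. The data are invented for
# illustration (roughly y = 2x).

def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # beta1_hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def r_squared(x, y, beta0, beta1):
    """Coefficient of determination: 1 - SSR/SST."""
    y_bar = sum(y) / len(y)
    ssr = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ssr / sst

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols_fit(x, y)
print(b0, b1, r_squared(x, y, b0, b1))  # slope near 2, intercept near 0
```

The same fit is available in statistical libraries, but computing it by hand makes clear that OLS depends only on the sample means, cross-products, and squared deviations of the data.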
These assumptions include linearity, meaning the relationship between predictors and the dependent variable is linear in the parameters; independence of errors, so observations are not correlated; homoscedasticity, or constant variance of the error terms across all levels of the predictors; and normality of the residuals, which supports certain statistical tests though it is not strictly required for estimation.[4] Violations of these assumptions can lead to biased estimates or invalid predictions, necessitating diagnostic checks.[4] Parameter estimates in linear regression are typically obtained using ordinary least squares (OLS), a method that minimizes the sum of squared residuals (SSR), defined as \sum_{i=1}^n (y_i - \hat{y}_i)^2, where \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i is the predicted value.[5] The OLS estimators for \beta_0 and \beta_1 are derived as \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} and \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, providing unbiased and efficient estimates under the model assumptions.[2] To assess model fit, the coefficient of determination, R^2, measures the proportion of variance in the dependent variable explained by the independent variables, calculated as R^2 = 1 - \frac{SSR}{SST}, where SST is the total sum of squares.[6] Values range from 0 to 1, with higher values indicating better fit, though R^2 always increases with added predictors, potentially leading to overfitting.[6] The adjusted R^2 addresses this by penalizing unnecessary variables: \bar{R}^2 = 1 - (1 - R^2) \frac{n-1}{n-p-1}, where n is the sample size and p is the number of predictors.[6] The concept of linear regression originated in the late 19th century through Francis Galton's studies on heredity, where he observed that offspring traits tended to revert toward the population mean, leading to the term "regression" in 1885.[7] Karl Pearson later formalized the method mathematically in the 1890s, developing the
least squares approach and correlation measures that underpin modern linear regression.[7] A practical example is predicting house prices based on square footage. Suppose data from recent sales show an average price increase of $150 per square foot, with the model Price = 50,000 + 150 \times SqFt + \epsilon; for a 2,000-square-foot house, the predicted price is $350,000, illustrating how linear regression quantifies real estate trends.[8]
Nonlinear Regression
Nonlinear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables when the relationship does not follow a linear pattern, such as in cases where data exhibit exponential, logistic, or polynomial curves. This approach becomes necessary when linear models fail to capture the underlying dynamics, for instance, in processes involving growth or decay that accelerate or saturate over time, allowing for more accurate predictions and parameter estimation in complex systems.[9] Common examples of nonlinear regression models include the exponential growth model, given by Y = a e^{bX}, where a represents the initial value and b the growth rate, often applied to population dynamics or radioactive decay.[10] Another frequent model is the logistic curve, expressed as Y = \frac{L}{1 + e^{-k(X - X_0)}}, which describes S-shaped growth approaching a maximum capacity L, with k as the growth rate and X_0 the midpoint; this is particularly useful for modeling population growth limited by resources or adoption curves in epidemiology. Parameter estimation in nonlinear regression typically relies on nonlinear least squares, which minimizes the sum of squared residuals between observed and predicted values but lacks a closed-form solution, requiring iterative numerical optimization.[11] Algorithms such as the Gauss-Newton method approximate the Jacobian matrix of partial derivatives to update parameter estimates iteratively, while the Levenberg-Marquardt algorithm enhances robustness by blending Gauss-Newton steps with gradient descent, adjusting a damping factor to improve convergence near local minima.[12][11] Key challenges in nonlinear regression include the sensitivity to initial parameter guesses, which can lead to convergence failures or entrapment in local minima rather than the global optimum, especially in
multimodal objective functions.[13] Additionally, issues like ill-conditioned Jacobians or correlated parameters may cause slow convergence or numerical instability, necessitating careful model diagnostics and multiple starting points for reliable fits.[13] Software tools facilitate nonlinear regression implementation; in R, the nls() function from the base stats package performs nonlinear least squares fitting by specifying a formula and initial parameter values.[14] Similarly, Python's SciPy library provides curve_fit() in the optimize module, which uses least squares to fit user-defined functions to data, returning optimized parameters and covariance estimates.[15]
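As a brief illustration, the exponential growth model Y = a e^{bX} from above can be fit with curve_fit(). The data here are synthetic, generated from assumed true values a = 2 and b = 0.5 with added noise:

```python
import numpy as np
from scipy.optimize import curve_fit

# Nonlinear least squares fit of an exponential growth model.
# Synthetic data: generated from a = 2.0, b = 0.5 plus Gaussian noise.

def exp_model(x, a, b):
    return a * np.exp(b * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 30)
y = 2.0 * np.exp(0.5 * x) + rng.normal(0, 0.1, x.size)

# Initial guesses matter: the optimization is iterative, and a poor
# starting point can stall convergence or land in a local minimum.
p0 = [1.0, 1.0]
params, cov = curve_fit(exp_model, x, y, p0=p0)
a_hat, b_hat = params
print(a_hat, b_hat)  # close to 2.0 and 0.5
```

Re-running the fit from several different p0 values and checking that the estimates agree is a simple guard against the local-minimum problem described above.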
Applications of nonlinear regression are prominent in pharmacokinetics, where models like exponential decay describe drug concentration over time following administration, enabling predictions of dosing intervals and therapeutic levels.[16] In biology, it fits growth curves to data on microbial populations or plant development, capturing phases of rapid expansion followed by plateauing due to environmental constraints.[17]
Historically, nonlinear regression emerged in the early 20th century within biometrics for analyzing curved relationships in biological data, with methods advancing significantly in the 1960s through computational tools that enabled iterative algorithms for practical use.[18]
Regression to the Mean
Regression to the mean is a statistical phenomenon describing the tendency for extreme observations in a variable—either unusually high or low—to be followed by subsequent observations that are closer to the average value of that variable, due to natural variability and measurement error rather than any systematic process.[19] This effect arises because extreme values often include random fluctuations or errors that are unlikely to repeat in the same direction, pulling later measurements back toward the population mean.[20] The concept was first identified by Francis Galton in his studies on heredity during the 1880s. Galton initially observed the phenomenon in experiments with the sizes of sweet pea seeds, where offspring seeds from large or small parent seeds tended to have sizes closer to the average than their parents.[7] He formalized it in 1885 through analysis of human height data from 928 adult children of 205 families, noting that children of exceptionally tall or short parents were taller or shorter than average but less extreme than their mid-parental height (the average of both parents' heights, adjusted for sex).[21] Galton coined the term "regression towards mediocrity" in his 1885 paper, later shortened to "regression to the mean," to describe this co-varying tendency in correlated variables. Mathematically, in a bivariate normal distribution with imperfect correlation (ρ < 1), the expected value of one variable given an extreme value of the other is given by Galton's regression formula: E[Y \mid X = x] = \mu_y + \rho \frac{\sigma_y}{\sigma_x} (x - \mu_x) where μ_x and μ_y are the means, σ_x and σ_y are the standard deviations, and ρ is the correlation coefficient.[7] This equation shows that the predicted deviation from μ_y is only ρ (σ_y / σ_x) times the deviation of x from μ_x; in standardized units the regression line has slope ρ, so only when ρ = 1 does the prediction remain as extreme as the observation, and any ρ < 1 produces regression toward the mean.[20] Classic examples illustrate the effect.
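The pull toward the mean predicted by Galton's formula can also be checked with a short simulation. This sketch assumes standardized scores (mean 0, standard deviation 1) and an illustrative correlation of ρ = 0.5:

```python
import random

# Simulation sketch of regression to the mean. Parent and child
# "scores" are standardized (mean 0, sd 1) with correlation rho.
# Galton's formula predicts E[child | parent = x] = rho * x, so
# children of extreme parents are, on average, less extreme.

random.seed(42)
rho = 0.5
n = 200_000

extreme_parents = []
their_children = []
for _ in range(n):
    parent = random.gauss(0, 1)
    # Mix in independent noise so that corr(parent, child) = rho
    # while the child score keeps variance 1.
    child = rho * parent + (1 - rho**2) ** 0.5 * random.gauss(0, 1)
    if parent > 2:  # select parents roughly 2 SD above the mean
        extreme_parents.append(parent)
        their_children.append(child)

mean_parent = sum(extreme_parents) / len(extreme_parents)
mean_child = sum(their_children) / len(their_children)
print(mean_parent, mean_child)  # child mean is roughly rho * parent mean
```

No causal mechanism appears anywhere in the simulation; the children of extreme parents end up less extreme purely because the two scores are imperfectly correlated.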
In Galton's height study, children of parents at the 90th percentile for height had offspring heights around the 75th percentile, demonstrating regression.[21] In sports, the "hot hand fallacy" misinterprets basketball shooting streaks as momentum, when regression to the mean explains why players after an exceptional run of makes are likely to score closer to their average thereafter. Similarly, for IQ scores, children of parents with IQs two standard deviations above the mean (around 130) typically have IQs about one standard deviation above (around 115), regressing toward the population mean of 100 due to imperfect heritability and measurement variability. A common misconception is that regression to the mean results from interventions or causal mechanisms, such as treatments causing improvement in extreme cases; in reality, it is a non-causal statistical artifact independent of any action taken.[22] It is often confused with causality, leading to erroneous attributions like assuming a coaching change ended a slump when the player simply regressed from an unusually poor performance.[19] The implications are significant in fields like medicine and education. In clinical trials, regression to the mean contributes to apparent placebo effects, as patients selected for high symptom severity at baseline naturally improve toward their average, mimicking treatment benefits; studies estimate it accounts for much of the observed improvement in uncontrolled trials.[22] In education, average test scores for groups identified by extreme prior performance (e.g., low-achieving schools) tend to rise or fall toward the overall mean on retesting, regardless of interventions, complicating evaluations of teaching reforms.[23]
Computing and Software Engineering
Regression Testing
Regression testing is a software testing practice that involves re-executing a subset or the entirety of existing test cases to verify that recent code changes, such as bug fixes, enhancements, or integrations, have not adversely affected previously functioning features.[24] This process can be performed manually or, more commonly in modern development, automated to ensure the software's existing functionality remains intact after modifications.[25] The term "regression" refers to the potential for new changes to cause a return to a prior, erroneous state in the software's behavior.[26] The practice emerged in the late 1970s amid the rise of structured programming methodologies, which emphasized modular code design and incremental development, necessitating systematic re-verification of changes.[27] The term itself first appeared around the same time in an IBM technical report, highlighting the need to test for unintended side effects in evolving systems.[27] By the 1990s, regression testing was formalized in industry standards, such as IEEE Std 829-1998, which outlines test documentation practices including regression procedures to support maintainable software lifecycles. In the regression testing process, relevant test cases are selected from an established test suite—often based on code impact analysis—and executed following key events like code commits, builds, or pull request merges in version control systems.[24] This integration with continuous integration/continuous deployment (CI/CD) pipelines automates the workflow, enabling frequent and efficient verification without manual intervention for every change.[25] Common types of regression testing include:
- Full regression testing, which re-runs the entire test suite to provide comprehensive coverage, ideal for critical releases but resource-intensive.[26]
- Partial regression testing, focusing on tests related to modified code modules or dependencies to isolate potential impacts efficiently.[24]
- Selective regression testing, which prioritizes high-risk or frequently failing test cases using techniques like risk-based selection to optimize time and resources.[25]
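To make the idea concrete, here is a minimal, hypothetical sketch of a regression test in plain Python. The function, its past bug, and all names are invented for illustration: a once-buggy function is fixed, and a test pins the corrected behavior so that re-running the suite after any future change catches a reintroduction of the bug.

```python
# Hypothetical example of a regression test. Suppose average() once
# crashed on an empty list; after the fix, a dedicated test pins the
# corrected behavior and is re-run on every subsequent change.

def average(values):
    """Return the arithmetic mean, or 0.0 for an empty input."""
    if not values:          # the bug fix: guard against empty input
        return 0.0
    return sum(values) / len(values)

def test_average_normal_case():
    assert average([2, 4, 6]) == 4.0

def test_average_empty_list_regression():
    # Regression test: guards against the old empty-list crash.
    assert average([]) == 0.0

if __name__ == "__main__":
    test_average_normal_case()
    test_average_empty_list_regression()
    print("all regression tests passed")
```

In practice such tests live in a suite managed by a framework and are triggered automatically by the CI/CD events described above, rather than being run by hand.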