Regression toward the mean
Regression toward the mean is a statistical phenomenon in which an extreme value of a random variable, whether unusually high or low, tends to be followed by an observation closer to the overall average upon remeasurement. It results from natural variability and imperfect correlation between measurements rather than from any causal intervention.[1][2] Extreme observations often owe part of their extremity to transient factors such as measurement error or random fluctuation; because those factors are unlikely to recur with the same size and sign, repeated measurements tend to fall closer to the mean, independent of any underlying trend.[3][4]

The concept was first systematically described by Francis Galton in 1886 in his analysis of hereditary stature. He observed that children of exceptionally tall or short parents tended to have heights between their parents' heights and the population mean, a pattern he termed "regression towards mediocrity."[3][5] Galton's family height data quantified this reversion, laying the groundwork for linear regression analysis and highlighting its non-causal, probabilistic nature, which is rooted in bivariate distributions whose correlation coefficient is less than one in absolute value.[6] The discovery underscored the importance of distinguishing statistical artifacts from genuine hereditary or environmental influences, and it has shaped fields from biometrics to modern econometrics.[3]

Regression toward the mean has critical implications for interpreting changes in performance in domains such as sports, education, and clinical trials, where selecting groups on the basis of extreme outcomes can produce illusory improvements or declines at follow-up without any true effect.[1][7] Failure to account for it has led to persistent errors, such as overestimating treatment efficacy in studies of high-risk patients or attributing random athletic streaks to skill development, which is why randomized controls and baseline adjustments are needed in causal inference.[2][8] Despite its straightforward mathematical basis, which can be derived from the properties of conditional expectations in Gaussian distributions, the phenomenon remains underappreciated. It is often confused with reversion caused by corrective actions, which perpetuates methodological pitfalls in the analysis of observational data.[4][9]
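The conditional-expectation argument can be illustrated with a small simulation. For two standardized, jointly Gaussian measurements X and Y with correlation ρ, the expected value of Y given X = x is ρx, so a group selected for extreme X is expected to score closer to the mean on Y whenever |ρ| < 1. The sketch below is illustrative only and is not drawn from the cited sources. It assumes a hypothetical model in which each subject's score is a stable component plus independent noise of equal variance, which gives ρ = 0.5. The function name, parameters, and sample size are all assumptions chosen for the example.

```python
import random
import statistics


def simulate_regression_to_mean(n=20000, noise_sd=1.0, top_fraction=0.1, seed=0):
    """Return (mean test-1 score, mean test-2 score) for the top scorers on test 1.

    Hypothetical model: score = stable component (SD 1) + independent noise.
    No intervention happens between the two tests.
    """
    rng = random.Random(seed)
    stable = [rng.gauss(0.0, 1.0) for _ in range(n)]
    first = [s + rng.gauss(0.0, noise_sd) for s in stable]
    second = [s + rng.gauss(0.0, noise_sd) for s in stable]

    # Select the subjects with the highest scores on the first test only.
    k = int(n * top_fraction)
    top = sorted(range(n), key=lambda i: first[i], reverse=True)[:k]

    mean_first = statistics.fmean([first[i] for i in top])
    mean_second = statistics.fmean([second[i] for i in top])
    return mean_first, mean_second


if __name__ == "__main__":
    m1, m2 = simulate_regression_to_mean()
    # Correlation implied by the model: var(stable) / (var(stable) + var(noise)).
    rho = 1.0 / (1.0 + 1.0 ** 2)
    print(f"top group, test 1 mean: {m1:.2f}")
    print(f"top group, test 2 mean: {m2:.2f}")
    print(f"rho * test 1 mean:      {rho * m1:.2f}")
```

Under these assumptions the selected group's second-test mean should fall well below its first-test mean and close to ρ times that value, even though nothing was done to the subjects between tests. Raising noise_sd lowers ρ and strengthens the effect; setting it to zero removes the effect.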