Overfitting
Overfitting is a modeling error in statistical learning that occurs when a function is tailored too closely to a finite sample of data, fitting random fluctuations and outliers along with any true signal and therefore failing to generalize to independent data drawn from the same distribution.[1][2] The error reflects the high variance of complex models: they can fit the training observations almost perfectly yet produce inflated prediction errors on unseen cases, because they capture noise rather than the underlying generative process.[1]

In practice, overfitting is diagnosed by an elevated validation loss relative to training loss, often visualized with learning curves in which the divergence between the two grows with training epochs or model capacity.[2] It contrasts with underfitting, in which a model performs poorly because of excessive bias and insufficient expressiveness. Overfitting instead illustrates the risks of overparameterization, especially when data are scarce relative to a model's degrees of freedom, since finite samples inevitably contain sampling variability that high-capacity estimators can exploit.[1]

Prevention strategies emphasize parsimony and rigorous validation. They include regularization techniques such as L1 or L2 penalties, which constrain parameter magnitudes and favor simpler models; ensemble methods such as bagging, which average out variance across resampled fits; and data augmentation, which broadens the empirical coverage of the training set without altering the data-generating mechanism.[1][2] These approaches restore out-of-sample reliability, underscoring that effective prediction depends on inductive biases suited to the problem rather than rote memorization of historical artifacts.[2]
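The train/test gap and the effect of an L2 penalty can be illustrated with a small numerical sketch. The example below (an assumed setup, not drawn from any source above: a degree-15 polynomial fit to 12 noisy samples of a sine wave) gives the model more parameters than data points, so an unpenalized fit memorizes the noise, while a ridge penalty trades a little training error for better test error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of a sine wave. All names and constants
# here are illustrative choices.
def make_data(n):
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(12)    # fewer samples than parameters below
x_test, y_test = make_data(200)

def design(x, degree):
    # Polynomial feature matrix [1, x, x^2, ..., x^degree].
    return np.vander(x, degree + 1, increasing=True)

def fit(x, y, degree, l2=0.0):
    # Ridge regression via an augmented least-squares system;
    # l2 = 0 reduces to ordinary least squares.
    X = design(x, degree)
    A = np.vstack([X, np.sqrt(l2) * np.eye(X.shape[1])])
    b = np.concatenate([y, np.zeros(X.shape[1])])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def mse(x, y, w, degree):
    return float(np.mean((design(x, degree) @ w - y) ** 2))

degree = 15                                    # 16 parameters > 12 samples
w_ols = fit(x_train, y_train, degree)          # no penalty: memorizes noise
w_ridge = fit(x_train, y_train, degree, l2=1e-3)

for name, w in [("unregularized", w_ols), ("ridge", w_ridge)]:
    print(f"{name:>13}  train MSE = {mse(x_train, y_train, w, degree):.4f}"
          f"  test MSE = {mse(x_test, y_test, w, degree):.4f}")
```

The unregularized fit drives training error toward zero while its test error stays well above the irreducible noise level, which is exactly the learning-curve divergence described above.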