Double descent

Double descent is a phenomenon observed in statistical and machine learning models where the test error initially decreases as model complexity increases, reaches a peak near the interpolation threshold (where the model perfectly fits the training data), and then decreases again as complexity continues to grow beyond that point, challenging traditional notions of overfitting. This behavior manifests in various settings, including linear models, random features, decision trees, and deep neural networks, across datasets such as MNIST and CIFAR-10. The concept was first formalized in 2019 by Mikhail Belkin and colleagues, who proposed the "double-descent" risk curve to reconcile classical statistical learning theory—characterized by a U-shaped bias-variance tradeoff curve—with the empirical success of overparameterized modern models that interpolate training data yet generalize effectively. Building on this, Preetum Nakkiran et al. demonstrated the phenomenon in deep learning contexts, showing that test performance worsens and then improves with increasing model size, training epochs, or even dataset size in certain regimes, introducing the idea of "effective model complexity" as a unifying measure.

Mechanistically, double descent arises from the interplay between underparameterization (high bias), classical overfitting (high variance near the interpolation threshold), and overparameterization (improved generalization post-interpolation due to inductive biases favoring smoother solutions among interpolators). In deep learning, this is influenced by factors such as architectural choices, optimization dynamics, and data noise, where larger models can mitigate variance through implicit regularization effects.

The implications of double descent extend to practical model design, explaining why scaling model capacity and data often yields better performance despite apparent overfitting, and prompting reevaluation of explicit regularization techniques such as weight decay or dropout. It has been observed in diverse domains, including vision transformers and language models, underscoring its broad relevance, though recent analyses highlight that noisy labels can amplify or alter the curve's shape. Overall, double descent underscores the limitations of classical capacity control and supports the trend toward ever-larger models in contemporary machine learning systems.

Fundamentals

Bias-Variance Tradeoff in Classical Statistics

In classical statistics, the bias-variance tradeoff describes the fundamental tension between two sources of error in statistical estimation: bias and variance. Bias refers to the systematic error introduced by approximating a true underlying function with a simpler model, arising from incorrect assumptions about the data-generating process. Variance, on the other hand, measures the sensitivity of the estimator to fluctuations in the training data, leading to instability when the model is overly flexible and captures noise rather than signal. This tradeoff underscores the challenge of selecting model complexity to achieve reliable predictions on unseen data.

The expected prediction error, often termed the risk, decomposes into these components plus an irreducible noise term. For a regression problem with squared loss, the risk R(\hat{f}) = \mathbb{E}[(y - \hat{f}(x))^2] breaks down as R(\hat{f}) = \Bias^2(\hat{f}(x)) + \Var(\hat{f}(x)) + \sigma^2, where \sigma^2 is the variance of the noise \epsilon in the true model y = f(x) + \epsilon. Here, the squared bias \Bias^2(\hat{f}(x)) = (\mathbb{E}[\hat{f}(x)] - f(x))^2 quantifies the average deviation of the estimator from the truth, while the variance \Var(\hat{f}(x)) = \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2] captures the estimator's variability across different training samples.

In the classical statistical setting, the risk curve as a function of model complexity exhibits a characteristic U-shape. As complexity increases from low levels—such as simple linear models—bias decreases because the model better captures the underlying patterns, leading to an initial reduction in risk. However, beyond an optimal complexity, variance begins to dominate as the model overfits the training data, causing the risk to rise sharply. This behavior motivates regularization techniques that penalize excessive complexity and balance the tradeoff. The concept gained prominence in the statistical literature of the 1970s and 1980s, particularly through developments in nonparametric regression and smoothing methods, with an influential treatment of the bias/variance dilemma for neural networks appearing in the early 1990s in Geman et al. (1992).

The mathematical derivation of the bias-variance decomposition for squared loss in regression proceeds as follows. Consider the true data-generating process y = f(x) + \epsilon, where \mathbb{E}[\epsilon] = 0 and \Var(\epsilon) = \sigma^2, and let \hat{f} be the estimator trained on a sample, with expectation taken over both the training data and the noise. The expected squared error at a fixed x is:

\mathbb{E}[(y - \hat{f}(x))^2] = \mathbb{E}[(f(x) + \epsilon - \hat{f}(x))^2].

Expanding the square yields:

\mathbb{E}[(f(x) - \hat{f}(x) + \epsilon)^2] = \mathbb{E}[(f(x) - \hat{f}(x))^2] + 2\mathbb{E}[(f(x) - \hat{f}(x))\epsilon] + \mathbb{E}[\epsilon^2].

The cross term vanishes because \epsilon is independent of the training data and thus of \hat{f}(x), so \mathbb{E}[(f(x) - \hat{f}(x))\epsilon] = 0. The noise term simplifies to \mathbb{E}[\epsilon^2] = \sigma^2. For the first term, note that:

\mathbb{E}[(f(x) - \hat{f}(x))^2] = \mathbb{E}[(f(x) - \mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)] - \hat{f}(x))^2] = (\mathbb{E}[\hat{f}(x)] - f(x))^2 + \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2],

which is precisely the squared bias plus the variance. Thus, the decomposition holds: R(\hat{f}(x)) = \Bias^2(\hat{f}(x)) + \Var(\hat{f}(x)) + \sigma^2. This formula highlights why the U-shaped risk curve arises: low-complexity models incur high bias but low variance, while high-complexity models reverse this pattern.
In classical underparameterized regimes, where the number of parameters is much smaller than the sample size, this tradeoff guides model selection to minimize overall risk.
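The decomposition can be checked numerically. The following sketch (an illustrative simulation, not drawn from the cited literature) estimates the squared bias and variance of polynomial least-squares fits of increasing degree by averaging predictions over many resampled training sets; the target function, noise level, and degree grid are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth regression function for the simulation.
    return np.sin(2 * np.pi * x)

n, sigma, n_trials = 30, 0.3, 500
x_test = np.linspace(0.0, 1.0, 200)

for degree in [1, 3, 6, 12]:
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n)
        y = true_f(x) + sigma * rng.normal(size=n)
        coefs = np.polyfit(x, y, degree)        # least-squares polynomial fit
        preds[t] = np.polyval(coefs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    var = preds.var(axis=0).mean()
    print(f"degree={degree:2d}  bias^2={bias2:.3f}  variance={var:.3f}  "
          f"risk ~ {bias2 + var + sigma**2:.3f}")
```

In this underparameterized setting the printed estimates typically trace the classical pattern: low-degree fits show high bias and low variance, while high-degree fits show the reverse.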

Overparameterization and Interpolation

In modern machine learning, overparameterization refers to the use of models with a number of parameters p that greatly exceeds the number of training samples n, often denoted as p \gg n. This setup contrasts with classical statistical assumptions where model complexity is limited to avoid overfitting, enabling models to achieve unprecedented flexibility in fitting complex data distributions. The interpolation regime arises in overparameterized models when the training error reaches zero, meaning the model perfectly memorizes the data rather than generalizing from underlying patterns. For instance, in unregularized linear regression, interpolation occurs precisely when p \geq n, as the linear system becomes underdetermined and admits solutions that pass exactly through all data points. Deep neural networks in high dimensions similarly enter this regime, where their vast parameter counts allow them to interpolate noisy training sets without explicit constraints.

Optimization algorithms such as gradient descent introduce implicit regularization in the interpolation regime, biasing solutions toward those with desirable properties despite the absence of explicit penalties. Specifically, continuous-time gradient flow on underdetermined least-squares problems, initialized at zero, converges to the minimum \ell_2-norm interpolator among all solutions that achieve zero training loss, favoring simpler functions that tend to generalize better. This effect stems from the dynamics of gradient descent started from zero initialization, which keeps the iterates in the row space of the data and thus avoids high-norm solutions. Examples of interpolating models include unregularized (ridgeless) least-squares and kernel regression, where the minimum-norm solution emerges naturally, and overparameterized deep networks trained via stochastic gradient descent, which exhibit similar biases toward low-complexity interpolants in high-dimensional spaces.

From a statistical physics perspective, the transition from underparameterized (p < n) to overparameterized (p > n) regimes can be viewed as a phase transition akin to those in disordered systems, where the model's ability to fit the data shifts abruptly, altering the landscape of possible solutions and the resulting generalization behavior.
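Both the minimum-norm interpolator and the implicit bias of gradient descent can be seen in a few lines of numpy. This is a minimal sketch under the assumption of isotropic Gaussian features; the dimensions, noise level, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 200                      # overparameterized: p >> n
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p) / np.sqrt(p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Minimum ell_2-norm interpolator via the Moore-Penrose pseudoinverse.
beta_min_norm = np.linalg.pinv(X) @ y
print("training MSE of min-norm solution:", np.mean((X @ beta_min_norm - y) ** 2))

# Plain gradient descent on the squared loss, started from zero, stays in the
# row space of X and therefore converges to the same minimum-norm interpolator.
beta_gd = np.zeros(p)
lr = 0.05
for _ in range(5000):
    beta_gd -= lr * X.T @ (X @ beta_gd - y) / n
print("distance between GD and min-norm solutions:",
      np.linalg.norm(beta_gd - beta_min_norm))
```

The first print confirms interpolation (training error essentially zero), and the second shows that the gradient-descent iterate ends up numerically indistinguishable from the pseudoinverse solution.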

Phenomenon Description

The Double Descent Curve

The double descent phenomenon is characterized by a non-monotonic test error curve that deviates from the classical U-shaped bias-variance tradeoff, featuring two distinct minima separated by a peak. In the underparameterized regime, as model complexity increases (e.g., number of parameters p or features D), the test error initially decreases, reflecting improved fit to the data. This is followed by a rise in test error near the interpolation threshold, where the number of parameters approximates the number of training samples (p \approx n), marking the onset of textbook overfitting with high variance and poor generalization. Beyond this threshold, in the overparameterized regime, the test error decreases again, often achieving better performance than at the classical minimum, as larger models find solutions that interpolate the training data while generalizing well to unseen examples.

This curve is commonly visualized with test risk plotted against model width or complexity for fixed sample size n, showing the "classical descent" in the underparameterized phase, a peak at the interpolation point representing "textbook overfitting," and the "modern descent" in the interpolating regime where p \gg n. A canonical empirical demonstration of this shape used a random features model trained on a subset of the MNIST dataset, where test risk peaked when the number of random features D matched the number of training samples before declining as D increased further, demonstrating that overparameterization does not inevitably lead to degradation. Variations of the double descent curve include plotting risk against sample size n for fixed model complexity p, which similarly exhibits initial improvement, a peak, and renewed decrease, highlighting sample-wise non-monotonicity.

Schematically, the test risk R exhibits a classical U-shape for p < n and transitions to non-monotonic behavior around p \approx n; in the underparameterized regime it behaves roughly as R \propto \frac{\sigma^2 p}{n} + \text{bias term} \quad (p \ll n), with the variance term dominating near p \approx n to produce the peak, followed by risk reduction in the overparameterized limit due to implicit regularization effects. This framework reconciles classical statistical expectations with modern machine learning practices.
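The qualitative shape can be reproduced with a toy linear experiment. The sketch below (illustrative only; the sample size, feature count, decaying signal, and noise level are arbitrary assumptions) fits a minimum-norm least-squares model on the first p of d available features and averages the test error over many draws; it typically shows an initial descent, a tall peak near p = n, and a second descent. Whether the second minimum undercuts the first depends on the noise level and signal structure.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, n_trials = 40, 120, 0.5, 300

beta = 1.0 / np.arange(1, d + 1)        # signal decays across the d features
beta /= np.linalg.norm(beta)            # normalize so ||beta*|| = 1

def test_mse(p):
    """Average test MSE of the min-norm least-squares fit on the first p features."""
    errs = []
    for _ in range(n_trials):
        X = rng.normal(size=(n, d))
        y = X @ beta + sigma * rng.normal(size=n)
        Xte = rng.normal(size=(1000, d))
        yte = Xte @ beta + sigma * rng.normal(size=1000)
        b = np.linalg.pinv(X[:, :p]) @ y        # OLS for p < n, min-norm interpolator for p >= n
        errs.append(np.mean((Xte[:, :p] @ b - yte) ** 2))
    return np.mean(errs)                        # heavy-tailed near p = n, so the peak is tall

print(f"   p  test MSE  (n = {n})")
for p in [2, 5, 10, 20, 30, 38, 40, 42, 60, 90, 120]:
    print(f"{p:4d}  {test_mse(p):9.2f}")
```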

Key Characteristics and Variations

The peak in the double descent curve typically occurs at the interpolation threshold, where the number of model parameters p is approximately equal to the number of training samples n. The height of this peak depends on the signal-to-noise ratio (SNR) in the data, with lower SNR resulting in a taller peak due to increased sensitivity to noise at the threshold. The shape of the second descent varies across model types; in neural networks, it often appears roughly linear when model size is plotted on a logarithmic scale, reflecting rapid improvements in generalization with further overparameterization. In contrast, linear models exhibit a slower second descent, as the benefits of additional parameters diminish more gradually after interpolation.

Double descent manifests differently depending on the loss function: in regression tasks with squared loss, the phenomenon concerns mean-squared-error minimization, while in classification with 0-1 loss it appears as improvements in misclassification rates beyond interpolation. Noise significantly influences the double descent curve; high noise levels amplify the peak at the interpolation threshold by exacerbating overfitting to corrupted labels, whereas low noise smooths the curve, reducing the peak's prominence (see the sketch below). A variant known as computational double descent arises from optimization challenges, where test error spikes occur at specific model widths due to difficulties in converging to effective minima during training. Epoch-wise double descent refers to the pattern observed during training dynamics, where validation error initially decreases, rises as the model overfits, and then descends again with continued epochs, mirroring the model-wise curve but tied to training iterations.
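The noise dependence of the peak can be probed directly. In the following sketch (illustrative only; sizes and noise levels are arbitrary assumptions), a minimum-norm linear regression in dimension p is evaluated with p/n below, at, and above 1 for two noise levels; the median over repetitions is reported because the error at the threshold is heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 40, 400

def median_test_mse(p, sigma):
    """Median test MSE of a min-norm linear fit in dimension p with noise level sigma."""
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(n, p))
        beta = rng.normal(size=p) / np.sqrt(p)          # signal of roughly unit norm
        y = X @ beta + sigma * rng.normal(size=n)
        Xte = rng.normal(size=(500, p))
        yte = Xte @ beta + sigma * rng.normal(size=500)
        b = np.linalg.pinv(X) @ y                       # min-norm least squares
        errs.append(np.mean((Xte @ b - yte) ** 2))
    return np.median(errs)

for sigma in (0.1, 1.0):
    row = "  ".join(f"p={p}: {median_test_mse(p, sigma):8.2f}" for p in (20, 40, 80))
    print(f"sigma={sigma:3.1f}  {row}")
```

In both rows the error at p = n stands out, but the contrast between the peak and its neighbors is far larger at the higher noise level, consistent with the SNR dependence described above.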

Historical Development

Early Indications

Early indications of double descent-like behavior emerged in the late 1980s and early 1990s through studies in statistical learning and high-dimensional regression, where models exhibited improved generalization despite apparent overfitting, challenging classical expectations of the bias-variance tradeoff. In 1989, Vallet et al. experimentally demonstrated this phenomenon using minimum-norm (pseudoinverse) solutions on artificial data, observing that test error initially decreased with model complexity, peaked near the interpolation threshold, and then decreased again in the overparameterized regime. Opper et al. followed in 1990 with a theoretical analysis in fixed-design linear models, showing that generalization could improve beyond the interpolation point under certain asymptotic conditions, hinting at benign overfitting in high dimensions. These findings highlighted anomalies in risk estimation methods, such as Stein's unbiased risk estimate, which unexpectedly showed stable or decreasing out-of-sample error in overparameterized settings, contrary to traditional predictions of degradation.

In parallel, early experiments with shallow neural networks in the 1990s revealed similar patterns of strong generalization despite near-zero training error, prompting reevaluation of theoretical limits. Researchers reported that networks with many parameters achieved low test error even when training error approached zero, contradicting expectations from VC dimension theory. Vapnik, in his work on statistical learning theory, emphasized that high VC dimension in neural networks should lead to poor generalization due to overfitting risks, yet empirical results from practical applications showed otherwise, with models maintaining performance in overparameterized regimes.

By the 2000s, studies in kernel methods further evidenced risk stabilization after interpolation. Duin (2000) analyzed pseudo-Fisher linear discriminants—equivalent to minimum-norm least-squares classification in certain settings—on real-world datasets, observing double-descent-shaped feature curves where performance improved post-interpolation without explicit regularization. Connections to ensemble methods like boosting also surfaced; in 1998, Breiman examined AdaBoost in the context of arcing classifiers, noting its ability to drive training error to near zero while sustaining low test error over many iterations, a behavior that resisted typical overfitting degradation. These scattered observations across linear models, neural networks, kernels, and boosting lacked a unified framework or a name like "double descent," remaining isolated insights that puzzled researchers but were not synthesized until later decades.

Modern Formulation and Key Milestones

The modern formulation of double descent emerged in 2018 with the seminal work of Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, who introduced the term and demonstrated the phenomenon through experiments on random features models, reconciling classical bias-variance theory with overparameterized modern practice. Their analysis revealed a non-monotonic test error curve, where error decreases initially, rises at the interpolation threshold, and descends again in the overparameterized regime, challenging traditional views on model complexity. In 2019, Belkin, Hsu, and Xu extended these insights to linear models, confirming the double descent curve's presence in simple settings such as ridgeless least squares with randomly selected weak features, and emphasizing its generality beyond nonlinear architectures. Earlier empirical scaling studies, such as that of Joel Hestness et al. (2017), had already highlighted predictable improvements from larger models and datasets, providing indirect support for the benefits of overparameterization in training practice.

By 2020, theoretical advancements deepened the understanding, with papers like that of Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai deriving exact asymptotic expressions for ridgeless regression under label noise, quantifying the "price of interpolation" and explaining the second descent phase through bias-variance decompositions. This period also saw growing community engagement, with an increasing number of papers, workshops, and discussions on overparameterization at conferences like ICML and NeurIPS.

From 2021 to 2023, double descent was connected to related phenomena such as grokking—where overparameterized models suddenly generalize after prolonged training—and to extensions of the lottery ticket hypothesis, illustrating how sparse subnetworks in large models exhibit similar non-monotonic generalization patterns. Comprehensive surveys and tutorials during this era synthesized these developments, underscoring double descent's role in rethinking generalization bounds. Recent work from 2024 to 2025 has reported persistent double descent in advanced generative models—for instance, comparisons of discrete diffusion versus autoregressive models in which overparameterization improves sample efficiency beyond classical limits—and in large language models (LLMs), where extended training on massive datasets can show a second descent in loss curves. These findings, often shared via preprints, affirm the phenomenon's relevance in contemporary architectures. Overall, the adoption of double descent has driven a paradigm shift in the machine learning community, moving from an emphasis on explicit regularization toward embracing overparameterization as a reliable path to strong generalization, as evidenced by the growing focus on scaling studies in NeurIPS and ICML proceedings.

Theoretical Explanations

Linear Regression Models

In linear regression models, double descent emerges as a fundamental phenomenon in the analysis of overparameterized learning, and this is the simplest setting in which the risk curve can be derived explicitly. Consider the setup where the training data consist of n samples with p-dimensional features drawn from an isotropic Gaussian distribution, \mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, I_p), and labels generated by a teacher-student linear model: y_i = \langle \mathbf{x}_i, \boldsymbol{\beta}^* \rangle + \epsilon_i, where \boldsymbol{\beta}^* is the fixed true parameter vector with \|\boldsymbol{\beta}^*\|^2 = 1 for normalization, and noise \epsilon_i \sim \mathcal{N}(0, \sigma^2). The model is fitted using ridgeless least squares, which minimizes the squared loss without regularization. In the underparameterized regime (p < n), this yields the ordinary least squares estimator; in the overparameterized regime (p > n), it corresponds to the minimum-norm interpolator that achieves zero training error by selecting the solution with the smallest \ell_2-norm among all interpolators. This minimum-norm solution carries an implicit bias toward small-norm solutions, equivalent to the limit of ridge regression as the regularization parameter \lambda \to 0^+.

The generalization risk, defined as the expected out-of-sample prediction error \mathbb{E}[(y' - \hat{y}')^2] for a new test point (\mathbf{x}', y'), can be decomposed into bias and variance terms and analyzed in the high-dimensional asymptotic regime where n, p \to \infty with fixed ratio \gamma = n/p. In the random feature-selection variant—where the p fitted features are a random subset of a larger true support of size D \gg p, preserving isotropy—the exact risk expressions reveal the double descent curve. For the underparameterized case (\gamma > 1, or p \leq n-2), the risk is

R(\gamma) = \left( \left(1 - \frac{p}{D}\right) \|\boldsymbol{\beta}^*\|^2 + \sigma^2 \right) \left( 1 + \frac{p}{n - p - 1} \right).

Here, the first factor captures the bias from uncaptured signal features (decreasing as p increases) plus noise, while the second is the variance inflation factor. As p grows (\gamma decreases toward 1), the bias reduction initially dominates, causing the risk to decrease; near the interpolation threshold, variance dominates, driving the risk upward to a peak. In the overparameterized case (\gamma < 1, or p \geq n+2), the risk becomes

R(\gamma) = \|\boldsymbol{\beta}^*\|^2 \left(1 - \frac{n}{p}\right) + \left( \left(1 - \frac{p}{D}\right) \|\boldsymbol{\beta}^*\|^2 + \sigma^2 \right) \left( 1 + \frac{n}{p - n - 1} \right).

The term 1 - \gamma represents the squared bias from the component of \boldsymbol{\beta}^* orthogonal to the row space of the design matrix (a consequence of the minimum-norm bias toward the origin in the null space), while the remaining terms reflect variance dominated by the interpolation penalty near \gamma \approx 1, diverging as \gamma \to 1^- and decreasing as \gamma \to 0 (p \gg n). In the limit D \to \infty (a weak signal spread across many dimensions), the risk simplifies asymptotically to forms like R(\gamma) \approx (1 - \gamma) + \frac{\sigma^2}{1 - \gamma} for \gamma < 1 (up to constants), highlighting the second descent as overparameterization reduces variance while the residual bias vanishes slowly. The full curve thus decreases in the underparameterized regime (bias reduction), peaks near the interpolation threshold \gamma \approx 1 (variance explosion), and decreases again in the overparameterized regime (benign overfitting).
This derivation relies on properties of the Moore-Penrose pseudoinverse and expectations over Wishart-distributed matrices. These results hold under the assumption of isotropic Gaussian features, enabling closed-form bias-variance decomposition via random matrix theory.
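As a sanity check, the two branches above can be evaluated numerically. The helper below simply transcribes the displayed expressions (it is not an independent derivation); the choices n = 100, D = 1000, and \sigma^2 = 0.25 are arbitrary, and with these constants the slow bias reduction is outweighed by variance inflation before the threshold, so the classical first descent is weak while the peak near p \approx n and the second descent are pronounced.

```python
def risk_weak_features(p, n, D, sigma2, beta_norm2=1.0):
    """Transcription of the two risk branches displayed above (weak-features model)."""
    missed = (1.0 - p / D) * beta_norm2                 # signal on unselected features
    if p <= n - 2:                                      # underparameterized branch
        return (missed + sigma2) * (1.0 + p / (n - p - 1.0))
    if p >= n + 2:                                      # overparameterized branch
        return beta_norm2 * (1.0 - n / p) + (missed + sigma2) * (1.0 + n / (p - n - 1.0))
    return float("inf")                                 # expressions diverge near p = n

n, D, sigma2 = 100, 1000, 0.25
for p in [10, 50, 90, 98, 102, 150, 300, 600, 1000]:
    print(f"p={p:5d}  gamma=n/p={n/p:5.2f}  risk={risk_weak_features(p, n, D, sigma2):9.3f}")
```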

Generalizations to Non-Linear and Kernel Methods

The double descent phenomenon, initially characterized in linear regression, extends to non-linear models and kernel methods, demonstrating the broader applicability of overparameterization insights across diverse function classes. In kernel regression, particularly with random features approximations, the risk curve exhibits a characteristic peak at the interpolation threshold followed by a descent in the overparameterized regime. For instance, in ridgeless random features regression, the test error follows a double descent pattern, with explicit risk bounds revealing how the number of features influences the bias and variance components similarly to the linear case but adapted to the kernel's properties. This behavior arises from the spectral properties of the random features matrix, where increasing the number of features beyond the sample size leads to improved generalization despite interpolation.

In the neural tangent kernel (NTK) regime, infinite-width neural networks trained with gradient descent behave as kernel machines, where the NTK governs the training dynamics. The generalization error in this setting mirrors the linear case but incorporates the eigenvalue decay of the NTK, leading to double descent (and, in some high-dimensional settings, multiple descents) as the model width increases. Specifically, the asymptotic risk accounts for the decaying eigenvalues, which smooth the transition and explain the descent phase through enhanced alignment with the signal subspace. This framework highlights how non-linear architectures in the lazy training regime inherit double descent from kernel methods, with the peak moderated by the kernel's effective dimensionality.

Extensions to non-linear activations further confirm the persistence of double descent, albeit with modifications to the risk curve's shape. Perturbation analysis around the linear regime shows that non-linearities, such as ReLU, alter the peak's location and height but preserve the overall double descent structure in random features models. For example, in high-dimensional settings, the generalization error of non-linear random features models exhibits double descent, with the non-linearity affecting the variance term through changes in feature correlations, as derived via asymptotic bias-variance decompositions. Similarly, in generalized linear models (GLMs) with non-linear link functions, such as logistic regression, double descent emerges in the high-dimensional limit, where the risk is analyzed using generalized bias-variance tradeoffs that incorporate the link function's convexity properties. These results demonstrate that double descent is robust to non-linear transformations, provided the model remains overparameterized relative to the data.

From an information-theoretic perspective, double descent can be interpreted as a transition in the mutual information between inputs and learned representations in overparameterized function spaces. In high-dimensional regression analyzed under the information bottleneck framework, the optimal compression rate yields an information-theoretic analog of double descent, in which the level of overfitting (test error) decreases as the number of parameters increases, due to a transition from redundant to synergistic encoding of the signal. This view connects double descent to capacity limits in non-linear models, emphasizing how overparameterization enhances information efficiency beyond the interpolation threshold. Recent theoretical advances, as of 2025, further elucidate the underlying mechanisms: fine-grained bias-variance decompositions provide deeper insight into its occurrence in least-squares settings, while Bayesian analyses reveal re-descending risk curves in probabilistic models, extending the phenomenon to uncertainty estimation.
Despite these advances, theoretical generalizations face challenges, particularly the computational intractability of exact risk analysis for finite-width non-linear networks, where feature interactions and training dynamics deviate from kernel approximations. This limits closed-form derivations, often requiring numerical simulations or mean-field assumptions to approximate double descent behavior.
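A random features experiment makes the kernel connection concrete. The sketch below (illustrative; the single-index sine target, ReLU features, and all sizes are arbitrary assumptions, not taken from the cited analyses) performs ridgeless regression on m random ReLU features of the inputs and typically shows a peak near m = n followed by a second descent.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, n_test, sigma = 100, 10, 2000, 0.1

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)            # hypothetical single-index target direction

def target(X):
    return np.sin(X @ w_star)

Xtr = rng.normal(size=(n, d))
ytr = target(Xtr) + sigma * rng.normal(size=n)
Xte = rng.normal(size=(n_test, d))
yte = target(Xte)                           # noiseless test labels

def rf_test_mse(m, trials=20):
    """Ridgeless (min-norm) regression on m random ReLU features."""
    errs = []
    for _ in range(trials):
        W = rng.normal(size=(d, m)) / np.sqrt(d)
        Ztr = np.maximum(Xtr @ W, 0.0)      # fixed random first layer
        Zte = np.maximum(Xte @ W, 0.0)
        a = np.linalg.pinv(Ztr) @ ytr       # trained second layer, min-norm solution
        errs.append(np.mean((Zte @ a - yte) ** 2))
    return np.mean(errs)

for m in [10, 30, 60, 90, 100, 110, 150, 300, 1000]:
    print(f"random features={m:5d}  test MSE ~ {rf_test_mse(m):.3f}")
```

Because only the second layer is fitted while the first remains random, this is exactly the lazy-training caricature discussed above: the non-linearity reshapes the feature spectrum, but the double descent mechanism is inherited from ridgeless linear regression in feature space.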

Empirical Evidence

Deep Neural Networks

Empirical observations of double descent in deep neural networks have been prominently demonstrated in image classification tasks using convolutional neural networks (CNNs) and residual networks (ResNets). In experiments on datasets such as CIFAR-10 and CIFAR-100, increasing model width leads to test error curves that initially decrease, reach a peak near the threshold where the model's capacity roughly matches the number of training samples, and then decrease again in the overparameterized regime. For instance, Nakkiran et al. (2020) trained ResNets of varying widths on CIFAR-10 and observed this double descent pattern, with the peak error occurring around 10^5 parameters for 50,000 training samples, followed by improved generalization as width scaled to millions of parameters. Similar behavior was reported on larger-scale image benchmarks, where bigger models beyond the interpolation point achieved lower top-1 error rates, challenging traditional bias-variance assumptions.

Double descent also manifests when scaling the width or depth of multilayer perceptrons (MLPs) on simpler datasets like MNIST. As the number of hidden units or layers increases, test error first traces the classical U-shape, peaks roughly where the network begins to interpolate the training set, and then improves again; experiments with two-layer MLPs on MNIST subsets show test accuracy dipping near the point where the parameter count approaches the number of training samples and then rising above 98% with wider networks, as overparameterization enables better generalization without explicit regularization. Depth scaling in deeper MLPs exhibits analogous curves, with error peaking at moderate depths (e.g., 4-5 layers) before improving in very deep architectures trained via SGD.

In terms of training dynamics, double descent emerges over the course of SGD optimization epochs, independent of model size scaling. Test error decreases initially with more epochs, but can exhibit a peak before descending again, often linked to a phase in which the model transitions from underfitting to memorizing noise before refining its generalizations. Heckel and Yilmaz (2020) analyzed this epoch-wise double descent in overparameterized networks, attributing it to the superposition of two or more bias-variance tradeoffs that arise on different time scales during training; for instance, in wide ResNets, the test error can peak around 50-100 epochs before dropping below its earlier level by 200 epochs. This phenomenon highlights how prolonged training in overparameterized settings can recover generalization after an apparent overfitting spike.

At large scales, language models exhibit double descent when varying parameter count against metrics like perplexity on language modeling tasks. GPT-like architectures trained on text corpora have been reported to show test loss decreasing with model size, a bump near the interpolation threshold, and further descent as parameters grow into the billions, with overparameterized regimes coinciding with emergent abilities. Nakkiran et al. (2020) first observed model-wise double descent in Transformers trained on language translation tasks, with curves mirroring those for image tasks; more recent scaling studies report related behavior, with models on the order of 1.5B parameters achieving lower loss post-peak than smaller interpolating models. Recent evidence from 2024-2025 extends double descent to vision transformers (ViTs) and generative models, particularly under data scarcity. In ViTs trained on reduced datasets, a sparse double descent appears when pruning parameters, with test error peaking at intermediate sparsity levels before improving as the network is pruned further, though suitable L2 regularization can mitigate the peak.
For generative models in low-data regimes, such as discrete diffusion models for text generation trained on limited samples, double descent appears in sample-efficiency curves: autoregressive baselines underperform post-interpolation, while discrete diffusion variants descend faster, achieving lower negative log-likelihoods (e.g., improvements of 2-3 nats) in overparameterized setups with under 10^4 samples. These findings underscore double descent's relevance in modern generative architectures facing data constraints.
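Published results like those above require substantial compute, but the underlying protocol—train models of increasing width to (near-)zero training error on noisy labels and record test error—can be prototyped on synthetic data. The sketch below uses scikit-learn's MLPClassifier purely as an illustration; the dataset, width grid, label-noise rate, and training budget are arbitrary assumptions, and whether a visible peak appears depends on these choices and on the random seed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical toy setup: 300 noisy training points, widths swept past the point
# where the network can interpolate them. flip_y injects label noise.
X, y = make_classification(n_samples=1200, n_features=20, n_informative=10,
                           flip_y=0.15, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=300, random_state=0)

for width in [2, 4, 8, 16, 32, 64, 128, 256]:
    net = MLPClassifier(hidden_layer_sizes=(width,), alpha=0.0,
                        max_iter=2000, tol=1e-7, random_state=0)
    net.fit(Xtr, ytr)
    print(f"width={width:4d}  train error={1 - net.score(Xtr, ytr):.3f}  "
          f"test error={1 - net.score(Xte, yte):.3f}")
```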

Other Machine Learning Domains

Double descent has been observed in classical machine learning methods beyond deep neural networks, demonstrating the phenomenon's generality across diverse algorithmic paradigms. In decision trees and random forests, the risk curve exhibits a characteristic peak near the interpolation threshold, where increasing model complexity—measured by the number of leaves per tree or by ensemble size—leads to initial overfitting followed by improved generalization. For instance, empirical studies on datasets like MNIST show that increasing the number of trees in a forest, or the maximum number of leaves per tree, results in a double-descent pattern, with test error declining after the point where the model perfectly fits the training data. This behavior aligns with findings that interpolating trees in ensembles can enhance robustness to noise, contrasting earlier views of random forests as inherently resistant to overfitting. Recent evaluations on genomic datasets further confirm double descent in decision tree regressors, where test error rises with leaf count before descending again in overparameterized regimes.

Support vector machines, particularly kernel-based variants, also display double descent when scaling the number of support vectors or features. On UCI datasets, kernel SVMs tuned with increasing model capacity—such as through random Fourier features—reveal a risk peak at the interpolation threshold, followed by a second descent as the effective dimensionality grows. This mirrors the minimum-norm dynamics seen in other methods, where larger kernel approximations yield smoother solutions and better test performance. Empirical curves from such experiments highlight how overparameterization mitigates the classical bias-variance tradeoff in non-linear classification tasks. In boosting algorithms such as AdaBoost and gradient boosting, double descent emerges as the committee size or number of boosting iterations increases toward interpolation: test risk initially decreases, peaks when the ensemble achieves zero training error, and then descends further with additional weak learners, a pattern observed across benchmarks that underscores the benefits of overparameterized ensembles in reducing variance without explicit regularization. For Gaussian processes, which operate in non-parametric limits, double descent appears as a function of the effective dimensionality of the kernel, with cross-validation metrics showing the non-monotonic behavior characteristic of the phenomenon; analytical results demonstrate that uncertainty estimates in GPs exhibit a risk peak before improving in high-dimensional settings, providing insights into Bayesian non-parametric modeling.

More recent applications extend double descent to specialized domains. In recommender systems using matrix factorization, overparameterization—via an increased number of latent factors—leads to double descent curves on sparse user-item data, where test reconstruction error peaks at the interpolation threshold and then declines, improving recommendation accuracy. Similarly, in time-series forecasting, Transformer-based models trained on public benchmarks such as electricity load data exhibit epoch-wise deep double descent, with validation loss rising mid-training before a second drop, challenging traditional early-stopping practices in overparameterized fits. These findings, from experiments reported in 2023, highlight the relevance of double descent in sequential prediction tasks. Unified empirical plots across these methods reveal consistent double-descent shapes, with complexity axes normalized to the interpolation threshold for comparability.
For example, figures comparing random forests, SVMs, and boosting on shared datasets like MNIST illustrate aligned risk peaks around the point of 100% training accuracy, followed by parallel descents, emphasizing the phenomenon's broad occurrence in classical machine learning. Such cross-domain visualizations underscore how overparameterization can enable better generalization in non-neural settings, broadening the scope of double descent beyond modern architectures.
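The overparameterized tail of this behavior is easy to probe with off-the-shelf tools. In the sketch below (an illustrative setup with arbitrary synthetic data and label noise, not a reproduction of any cited experiment), every fully grown tree interpolates the noisy training labels, so adding trees moves further into the interpolating regime, yet test accuracy keeps improving; reproducing the peak itself requires additionally sweeping single-model complexity below the threshold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Arbitrary synthetic task with 10% label noise; every fully grown tree
# (max_depth=None, bootstrap=False) fits the training labels exactly.
X, y = make_classification(n_samples=4000, n_features=30, n_informative=15,
                           flip_y=0.10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=1000, random_state=0)

for n_trees in [1, 2, 5, 10, 50, 200]:
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=None,
                                bootstrap=False, max_features="sqrt",
                                random_state=0, n_jobs=-1)
    rf.fit(Xtr, ytr)
    print(f"trees={n_trees:4d}  train acc={rf.score(Xtr, ytr):.3f}  "
          f"test acc={rf.score(Xte, yte):.3f}")
```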

Implications

Impact on Model Selection and Generalization

The discovery of double descent has prompted a reevaluation of regularization strategies in machine learning, particularly in overparameterized regimes where models have more parameters than training samples. Traditional explicit regularization techniques, such as \ell_2 penalties or dropout, were historically employed to prevent overfitting by constraining model complexity below the interpolation threshold. However, double descent demonstrates that in these regimes, interpolating models—those achieving zero training error—can generalize effectively without such penalties, as the test error declines after the initial peak. This shift favors sources of implicit regularization, such as the bias of stochastic gradient descent or architectural choices, over explicit penalties alone, allowing practitioners to leverage larger models for improved performance.

Classical model selection techniques, including cross-validation, encounter significant pitfalls when applied to overparameterized models exhibiting double descent. Standard cross-validation assumes a U-shaped risk curve driven by the bias-variance tradeoff, leading it to favor models just before the interpolation threshold where test error appears minimal. At and beyond the interpolation threshold, however, cross-validation often fails to identify optimal overparameterized models, as the peak in test error misleads selection toward underperforming configurations. To address this, researchers recommend validation protocols suited to interpolators, such as holdout sets evaluated on fully trained interpolating models or modified schemes that account for the post-interpolation descent, ensuring more reliable hyperparameter choices.

Insights from double descent have advanced the understanding of generalization, particularly through the lens of benign overfitting, where overparameterized interpolators achieve low test error despite memorizing training data. This phenomenon explains why models with parameters far exceeding sample size can still generalize, attributing success to implicit biases in optimization and architecture that favor smoother functions in high dimensions. Consequently, it has implications for sample efficiency: overparameterization can reduce the data requirements for reaching low error rates, as the second descent phase mitigates the risks of classical overfitting.

Best practices emerging from double descent emphasize scaling models beyond the interpolation threshold to access the second descent, where performance tends to improve with increased capacity. Practitioners are advised to monitor both underparameterization risks (high bias) and overparameterization risks (elevated test error near the interpolation peak), using techniques like ensembling to smooth transitions. This approach contrasts with prior conservatism, encouraging experimentation with wider or deeper networks trained to interpolation rather than stopping early to avoid apparent overfitting. Quantitatively, embracing the second descent has yielded notable error reductions in vision tasks; for example, on CIFAR-10 with ResNet architectures, scaling model width beyond the interpolation threshold has improved test accuracy by roughly 5-10% relative to critically parameterized baselines, from errors around 15% at the peak to under 7% in the overparameterized regime.

Double descent also exhibits connections to several related phenomena in deep learning. Grokking, the delayed generalization observed in overparameterized models trained on small algorithmic datasets, mirrors the second descent phase of double descent through analogous learning dynamics, where test performance improves sharply after prolonged training. This link arises from inductive biases that prioritize slow-emerging, generalizing patterns over initially fast-learned, memorization-heavy representations, as demonstrated in transformer models on algorithmic tasks such as modular arithmetic.
The lottery ticket hypothesis complements double descent by proposing that overparameterized networks contain sparse subnetworks—identified via iterative magnitude pruning—that match or exceed full-network performance, often bypassing the risk peak associated with dense interpolation regimes. In sparse double descent settings, however, such subnetworks can alter the shape of the test error curve, revealing that "winning tickets" do not always align with the optimal trajectory of the unpruned model, particularly under label noise. Double descent also integrates with neural scaling laws, unifying power-law loss reductions with model size and compute in the overparameterized regime, where the second descent reflects smoother navigation of high-dimensional loss landscapes. This connection helps explain why larger models, despite interpolating their training data, achieve better generalization without the classical overfitting penalty, as seen in empirical loss curves for language models.

Open questions in double descent research include the exact conditions triggering the second descent under non-i.i.d. data distributions, such as training with repeated samples, where empirical studies show test loss spikes attributed to memorization but lack precise theoretical thresholds. Its interplay with adversarial robustness remains unresolved: overparameterization has been observed to induce double descent in robust training losses, potentially enhancing resilience in data-rich regimes while complicating vulnerability in others. Causality within optimization dynamics—specifically, how gradient flows in non-convex settings drive the risk curve's re-descent—continues to elude full explanation, with ongoing work probing inductive biases and the competition between memorizing and generalizing patterns.

Future directions include extending double descent to federated learning, where local updates on non-i.i.d. client data may amplify or reshape the descent in distributed settings, as well as investigations into diffusion models and trillion-parameter large language models (LLMs). Post-2023 work has examined sample efficiency in autoregressive versus diffusion architectures, the inheritance of descent behaviors during fine-tuning of foundation models, and links between double descent and emergent abilities in LLMs; studies of low-rank adaptation (LoRA) fine-tuning, for example, report transient instabilities in training loss and propose methods such as momentum-guided perturbations to mitigate them. Persistent gaps remain in linking double descent to non-convex optimization landscapes beyond linear approximations, underscoring ongoing challenges in these areas.

References

  1. Belkin, M., Hsu, D., Ma, S., & Mandal, S. "Reconciling modern machine-learning practice and the classical bias–variance trade-off." PNAS, 2019.
  2. Nakkiran, P., et al. "Deep Double Descent: Where Bigger Models and More Data Hurt." arXiv, December 2019.
  3. "Understanding the Double Descent Phenomenon in Deep Learning." Tutorial, March 2024.
  4. Geman, S., Bienenstock, E., & Doursat, R. "Neural Networks and the Bias/Variance Dilemma." 1992.
  5. Craven, P., & Wahba, G. "Smoothing Noisy Data with Spline Functions."
  6. Belkin, M., Hsu, D., & Xu, J. "Two models of double descent for weak features." arXiv:1903.07571, March 2019.
  7. "On the Role of Optimization in Double Descent: A Least Squares Study."
  8. Loog, M., et al. "A brief prehistory of double descent." PNAS, 2020.
  9. Belkin, M., Hsu, D., Ma, S., & Mandal, S. "Reconciling modern machine learning practice and the bias-variance trade-off." arXiv, December 2018.
  10. Hestness, J., et al. "Deep Learning Scaling is Predictable, Empirically." arXiv:1712.00409, December 2017.
  11. ICML 2020 Workshops.
  12. NeurIPS 2020 Workshops.
  13. "Ridgeless Regression with Random Features." arXiv:2205.00477, May 2022.
  14. "A Precise Performance Analysis of Learning with Random Features." arXiv:2008.11904, August 2020.
  15. "Generalization Error of Generalized Linear Models in High Dimensions."
  16. "Information bottleneck theory of high-dimensional regression."
  17. Heckel, R., & Yilmaz, F. F. "Early Stopping in Deep Networks: Double Descent and How to Eliminate It." January 2021.
  18. "Sparse Double Descent in Vision Transformers." arXiv:2307.14253, July 2023.
  19. "Double Descent as a Lens for Sample Efficiency in Autoregressive ..." September 2025.
  20. "Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes." arXiv, October 2022.
  21. "Investigating Overparameterization for Non-Negative Matrix Factorization." September 2021.
  22. "Deep Double Descent for Time Series Forecasting." arXiv:2311.01442, November 2023.
  23. "Unifying Grokking and Double Descent." arXiv:2303.06173, March 2023.
  26. "Sparse Double Descent: Where Network Pruning Aggravates Overfitting." arXiv, June 2022.
  27. "Scaling Laws for Neural Language Models." arXiv:2001.08361, 2020.
  28. "Unified Neural Network Scaling Laws and Scale-time Equivalence." arXiv:2409.05782, September 2024.
  29. "Scaling Laws and Interpretability of Learning from Repeated Data." arXiv:2205.10487, May 2022.
  30. "The Curious Case of Adversarially Robust Models." arXiv:2002.11080, February 2020.
    Feb 25, 2020 · In the medium adversary regime, with more training data, the generalization loss exhibits a double descent curve. This implies that in this ...Missing: robustness | Show results with:robustness