Double descent

Double descent is a phenomenon observed in statistical and machine learning models where the test error initially decreases as model complexity increases, reaches a peak near the interpolation threshold (where the model perfectly fits the training data), and then decreases again as complexity continues to grow beyond that point, challenging traditional notions of overfitting. This behavior manifests in various settings, including linear models, random features, decision trees, and deep neural networks, across datasets such as MNIST and CIFAR-10. The concept was first formalized in 2019 by Mikhail Belkin and colleagues, who proposed the "double-descent" risk curve to reconcile classical statistical learning theory—characterized by a U-shaped bias-variance tradeoff curve—with the empirical success of overparameterized modern models that interpolate training data yet generalize effectively. Building on this, Preetum Nakkiran et al. demonstrated the phenomenon in deep learning contexts, showing that test performance worsens and then improves with increasing model size, training epochs, or even dataset size in certain regimes, introducing the idea of "effective model complexity" as a unifying measure.

Mechanistically, double descent arises from the interplay between underparameterization (high bias), classical overfitting (high variance near the interpolation threshold), and overparameterization (improved generalization post-interpolation due to inductive biases favoring smoother solutions among interpolators). In deep learning, this is influenced by factors such as architectural choices, optimization dynamics, and data noise, where larger models can mitigate variance through implicit regularization effects.

The implications of double descent extend to practical model design, explaining why scaling model capacity and data often yields better performance despite apparent overfitting, and prompting reevaluation of explicit regularization techniques such as weight decay or dropout. It has been observed in diverse domains, including vision transformers and language models, underscoring its broad relevance, though recent analyses highlight that noisy labels can amplify or alter the curve's shape. Overall, double descent underscores the limitations of classical capacity control and supports the trend toward ever-larger models in contemporary machine learning systems.

Fundamentals

Bias-Variance Tradeoff in Classical Statistics

In classical statistics, the bias-variance tradeoff describes the fundamental tension between two sources of error in statistical estimation: bias and variance. Bias refers to the systematic error introduced by approximating a true underlying function with a simpler model, arising from incorrect assumptions about the data-generating process. Variance, on the other hand, measures the sensitivity of the estimator to fluctuations in the training data, leading to instability when the model is overly flexible and captures noise rather than signal. This tradeoff underscores the challenge of selecting model complexity to achieve reliable predictions on unseen data.

The expected prediction error, often termed the risk, decomposes into these components plus an irreducible noise term. For a regression problem with squared loss, the risk R(\hat{f}) = \mathbb{E}[(y - \hat{f}(x))^2] breaks down as R(\hat{f}) = \Bias^2(\hat{f}(x)) + \Var(\hat{f}(x)) + \sigma^2, where \sigma^2 is the variance of the noise \epsilon in the true model y = f(x) + \epsilon. Here, the squared bias \Bias^2(\hat{f}(x)) = (\mathbb{E}[\hat{f}(x)] - f(x))^2 quantifies the average deviation of the estimator from the truth, while the variance \Var(\hat{f}(x)) = \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2] captures the estimator's variability across different training samples.

In the classical statistical setting, the risk curve as a function of model complexity exhibits a characteristic U-shape. As complexity increases from low levels—such as simple linear models—bias decreases because the model better captures the underlying patterns, leading to an initial reduction in risk. However, beyond an optimal complexity, variance begins to dominate as the model overfits the training data, causing the risk to rise sharply. This behavior motivates regularization techniques that penalize excessive complexity and balance the tradeoff. The concept gained prominence in the statistical literature of the 1970s and 1980s, particularly through developments in nonparametric regression and smoothing methods, with an influential treatment of the bias/variance dilemma for neural networks appearing in the early 1990s in Geman et al. (1992).

The mathematical derivation of the bias-variance decomposition for squared loss in regression proceeds as follows. Consider the true data-generating process y = f(x) + \epsilon, where \mathbb{E}[\epsilon] = 0 and \Var(\epsilon) = \sigma^2, and let \hat{f} be the estimator trained on a sample, with expectation taken over both the training data and the noise. The expected squared error at a fixed x is:

\mathbb{E}[(y - \hat{f}(x))^2] = \mathbb{E}[(f(x) + \epsilon - \hat{f}(x))^2].

Expanding the square yields:

\mathbb{E}[(f(x) - \hat{f}(x) + \epsilon)^2] = \mathbb{E}[(f(x) - \hat{f}(x))^2] + 2\mathbb{E}[(f(x) - \hat{f}(x))\epsilon] + \mathbb{E}[\epsilon^2].

The cross term vanishes because \epsilon is independent of the training data and thus of \hat{f}(x), so \mathbb{E}[(f(x) - \hat{f}(x))\epsilon] = 0. The noise term simplifies to \mathbb{E}[\epsilon^2] = \sigma^2. For the first term, note that:

\mathbb{E}[(f(x) - \hat{f}(x))^2] = \mathbb{E}[(f(x) - \mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)] - \hat{f}(x))^2] = (\mathbb{E}[\hat{f}(x)] - f(x))^2 + \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2],

which is precisely the squared bias plus the variance. Thus, the decomposition holds: R(\hat{f}(x)) = \Bias^2(\hat{f}(x)) + \Var(\hat{f}(x)) + \sigma^2. This formula highlights why the U-shaped risk curve arises: low-complexity models incur high bias but low variance, while high-complexity models reverse this pattern.
In classical underparameterized regimes, where the number of parameters is much smaller than the sample size, this tradeoff guides model selection to minimize overall risk.
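The decomposition can be checked numerically. The following sketch (an illustrative simulation, not drawn from the cited literature) estimates the squared bias and variance of polynomial least-squares fits of increasing degree by averaging predictions over many resampled training sets; the target function, noise level, and degree grid are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth regression function for the simulation.
    return np.sin(2 * np.pi * x)

n, sigma, n_trials = 30, 0.3, 500
x_test = np.linspace(0.0, 1.0, 200)

for degree in [1, 3, 6, 12]:
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n)
        y = true_f(x) + sigma * rng.normal(size=n)
        coefs = np.polyfit(x, y, degree)        # least-squares polynomial fit
        preds[t] = np.polyval(coefs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    var = preds.var(axis=0).mean()
    print(f"degree={degree:2d}  bias^2={bias2:.3f}  variance={var:.3f}  "
          f"risk ~ {bias2 + var + sigma**2:.3f}")
```

In this underparameterized setting the printed estimates typically trace the classical pattern: low-degree fits show high bias and low variance, while high-degree fits show the reverse.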

Overparameterization and Interpolation

In modern machine learning, overparameterization refers to the use of models with a number of parameters p that greatly exceeds the number of training samples n, often denoted as p \gg n. This setup contrasts with classical statistical assumptions where model complexity is limited to avoid overfitting, enabling models to achieve unprecedented flexibility in fitting complex data distributions. The interpolation regime arises in overparameterized models when the training error reaches zero, meaning the model perfectly memorizes the data rather than generalizing from underlying patterns. For instance, in unregularized linear regression, interpolation occurs precisely when p \geq n, as the linear system becomes underdetermined and admits solutions that pass exactly through all data points. Deep neural networks in high dimensions similarly enter this regime, where their vast parameter counts allow them to interpolate noisy training sets without explicit constraints.

Optimization algorithms such as gradient descent introduce implicit regularization in the interpolation regime, biasing solutions toward those with desirable properties despite the absence of explicit penalties. Specifically, continuous-time gradient flow on underdetermined least-squares problems, initialized at zero, converges to the minimum \ell_2-norm interpolator among all solutions that achieve zero training loss, favoring simpler functions that tend to generalize better. This effect stems from the dynamics of gradient descent started from zero initialization, which keeps the iterates in the row space of the data and thus avoids high-norm solutions. Examples of interpolating models include unregularized (ridgeless) least-squares and kernel regression, where the minimum-norm solution emerges naturally, and overparameterized deep networks trained via stochastic gradient descent, which exhibit similar biases toward low-complexity interpolants in high-dimensional spaces.

From a statistical physics perspective, the transition from underparameterized (p < n) to overparameterized (p > n) regimes can be viewed as a phase transition akin to those in disordered systems, where the model's ability to fit the data shifts abruptly, altering the landscape of possible solutions and the resulting generalization behavior.
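Both the minimum-norm interpolator and the implicit bias of gradient descent can be seen in a few lines of numpy. This is a minimal sketch under the assumption of isotropic Gaussian features; the dimensions, noise level, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 200                      # overparameterized: p >> n
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p) / np.sqrt(p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Minimum ell_2-norm interpolator via the Moore-Penrose pseudoinverse.
beta_min_norm = np.linalg.pinv(X) @ y
print("training MSE of min-norm solution:", np.mean((X @ beta_min_norm - y) ** 2))

# Plain gradient descent on the squared loss, started from zero, stays in the
# row space of X and therefore converges to the same minimum-norm interpolator.
beta_gd = np.zeros(p)
lr = 0.05
for _ in range(5000):
    beta_gd -= lr * X.T @ (X @ beta_gd - y) / n
print("distance between GD and min-norm solutions:",
      np.linalg.norm(beta_gd - beta_min_norm))
```

The first print confirms interpolation (training error essentially zero), and the second shows that the gradient-descent iterate ends up numerically indistinguishable from the pseudoinverse solution.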

Phenomenon Description

The Double Descent Curve

The double descent phenomenon is characterized by a non-monotonic test error curve that deviates from the classical U-shaped bias-variance tradeoff, featuring two distinct minima separated by a peak. In the underparameterized regime, as model complexity increases (e.g., number of parameters p or features D), the test error initially decreases, reflecting improved fit to the data. This is followed by a rise in test error near the interpolation threshold, where the number of parameters approximates the number of training samples (p \approx n), marking the onset of textbook overfitting with high variance and poor generalization. Beyond this threshold, in the overparameterized regime, the test error decreases again, often achieving better performance than at the classical minimum, as larger models find solutions that interpolate the training data while generalizing well to unseen examples.

This curve is commonly visualized with test risk plotted against model width or complexity for fixed sample size n, showing the "classical descent" in the underparameterized phase, a peak at the interpolation point representing "textbook overfitting," and the "modern descent" in the interpolating regime where p \gg n. A canonical empirical demonstration of this shape used a random features model trained on a subset of the MNIST dataset, where test risk peaked when the number of random features D matched the number of training samples before declining as D increased further, demonstrating that overparameterization does not inevitably lead to degradation. Variations of the double descent curve include plotting risk against sample size n for fixed model complexity p, which similarly exhibits initial improvement, a peak, and renewed decrease, highlighting sample-wise non-monotonicity.

Schematically, the test risk R exhibits a classical U-shape for p < n and transitions to non-monotonic behavior around p \approx n; in the underparameterized regime it behaves roughly as R \propto \frac{\sigma^2 p}{n} + \text{bias term} \quad (p \ll n), with the variance term dominating near p \approx n to produce the peak, followed by risk reduction in the overparameterized limit due to implicit regularization effects. This framework reconciles classical statistical expectations with modern machine learning practices.
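The qualitative shape can be reproduced with a toy linear experiment. The sketch below (illustrative only; the sample size, feature count, decaying signal, and noise level are arbitrary assumptions) fits a minimum-norm least-squares model on the first p of d available features and averages the test error over many draws; it typically shows an initial descent, a tall peak near p = n, and a second descent. Whether the second minimum undercuts the first depends on the noise level and signal structure.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, n_trials = 40, 120, 0.5, 300

beta = 1.0 / np.arange(1, d + 1)        # signal decays across the d features
beta /= np.linalg.norm(beta)            # normalize so ||beta*|| = 1

def test_mse(p):
    """Average test MSE of the min-norm least-squares fit on the first p features."""
    errs = []
    for _ in range(n_trials):
        X = rng.normal(size=(n, d))
        y = X @ beta + sigma * rng.normal(size=n)
        Xte = rng.normal(size=(1000, d))
        yte = Xte @ beta + sigma * rng.normal(size=1000)
        b = np.linalg.pinv(X[:, :p]) @ y        # OLS for p < n, min-norm interpolator for p >= n
        errs.append(np.mean((Xte[:, :p] @ b - yte) ** 2))
    return np.mean(errs)                        # heavy-tailed near p = n, so the peak is tall

print(f"   p  test MSE  (n = {n})")
for p in [2, 5, 10, 20, 30, 38, 40, 42, 60, 90, 120]:
    print(f"{p:4d}  {test_mse(p):9.2f}")
```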

Key Characteristics and Variations

The peak in the double descent curve typically occurs at the interpolation threshold, where the number of model parameters p is approximately equal to the number of training samples n. The height of this peak depends on the signal-to-noise ratio (SNR) in the data, with lower SNR resulting in a taller peak due to increased sensitivity to noise at the threshold. The shape of the second descent varies across model types; in neural networks, it often appears roughly linear when model size is plotted on a logarithmic scale, reflecting rapid improvements in generalization with further overparameterization. In contrast, linear models exhibit a slower second descent, as the benefits of additional parameters diminish more gradually after interpolation.

Double descent manifests differently depending on the loss function: in regression tasks with squared loss, the phenomenon concerns mean-squared-error minimization, while in classification with 0-1 loss it appears as improvements in misclassification rates beyond interpolation. Noise significantly influences the double descent curve; high noise levels amplify the peak at the interpolation threshold by exacerbating overfitting to corrupted labels, whereas low noise smooths the curve, reducing the peak's prominence (see the sketch below). A variant known as computational double descent arises from optimization challenges, where test error spikes occur at specific model widths due to difficulties in converging to effective minima during training. Epoch-wise double descent refers to the pattern observed during training dynamics, where validation error initially decreases, rises as the model overfits, and then descends again with continued epochs, mirroring the model-wise curve but tied to training iterations.
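The noise dependence of the peak can be probed directly. In the following sketch (illustrative only; sizes and noise levels are arbitrary assumptions), a minimum-norm linear regression in dimension p is evaluated with p/n below, at, and above 1 for two noise levels; the median over repetitions is reported because the error at the threshold is heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 40, 400

def median_test_mse(p, sigma):
    """Median test MSE of a min-norm linear fit in dimension p with noise level sigma."""
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(n, p))
        beta = rng.normal(size=p) / np.sqrt(p)          # signal of roughly unit norm
        y = X @ beta + sigma * rng.normal(size=n)
        Xte = rng.normal(size=(500, p))
        yte = Xte @ beta + sigma * rng.normal(size=500)
        b = np.linalg.pinv(X) @ y                       # min-norm least squares
        errs.append(np.mean((Xte @ b - yte) ** 2))
    return np.median(errs)

for sigma in (0.1, 1.0):
    row = "  ".join(f"p={p}: {median_test_mse(p, sigma):8.2f}" for p in (20, 40, 80))
    print(f"sigma={sigma:3.1f}  {row}")
```

In both rows the error at p = n stands out, but the contrast between the peak and its neighbors is far larger at the higher noise level, consistent with the SNR dependence described above.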

Historical Development

Early Indications

Early indications of double descent-like behavior emerged in the late 1980s and early 1990s through studies in statistical learning and high-dimensional regression, where models exhibited improved generalization despite apparent overfitting, challenging classical expectations of the bias-variance tradeoff. In 1989, Vallet et al. experimentally demonstrated this phenomenon using minimum-norm (pseudoinverse) solutions on artificial data, observing that test error initially decreased with model complexity, peaked near the interpolation threshold, and then decreased again in the overparameterized regime. Opper et al. followed in 1990 with a theoretical analysis in fixed-design linear models, showing that generalization could improve beyond the interpolation point under certain asymptotic conditions, hinting at benign overfitting in high dimensions. These findings highlighted anomalies in risk estimation methods, such as Stein's unbiased risk estimate, which unexpectedly showed stable or decreasing out-of-sample error in overparameterized settings, contrary to traditional predictions of degradation.

In parallel, early experiments with shallow neural networks in the 1990s revealed similar patterns of strong generalization despite near-zero training error, prompting reevaluation of theoretical limits. Researchers reported that networks with many parameters achieved low test error even when training error approached zero, contradicting expectations from VC dimension theory. Vapnik, in his work on statistical learning theory, emphasized that high VC dimension in neural networks should lead to poor generalization due to overfitting risks, yet empirical results from practical applications showed otherwise, with models maintaining performance in overparameterized regimes.

By the 2000s, studies in kernel methods further evidenced risk stabilization after interpolation. Duin (2000) analyzed pseudo-Fisher linear discriminants—equivalent to minimum-norm least-squares classification in certain settings—on real-world datasets, observing double-descent-shaped feature curves where performance improved post-interpolation without explicit regularization. Connections to ensemble methods like boosting also surfaced; in 1998, Breiman examined AdaBoost in the context of arcing classifiers, noting its ability to drive training error to near zero while sustaining low test error over many iterations, a behavior that resisted typical overfitting degradation. These scattered observations across linear models, neural networks, kernels, and boosting lacked a unified framework or a name like "double descent," remaining isolated insights that puzzled researchers but were not synthesized until later decades.

Modern Formulation and Key Milestones

The modern formulation of double descent emerged in 2018 with the seminal work of Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, who introduced the term and demonstrated the phenomenon through experiments on random features models, reconciling classical bias-variance theory with overparameterized modern practice. Their analysis revealed a non-monotonic test error curve, where error decreases initially, rises at the interpolation threshold, and descends again in the overparameterized regime, challenging traditional views on model complexity. In 2019, Belkin, Hsu, and Xu extended these insights to linear models, confirming the double descent curve's presence in simple settings such as ridgeless least squares with randomly selected weak features, and emphasizing its generality beyond nonlinear architectures. Earlier empirical scaling studies, such as that of Joel Hestness et al. (2017), had already highlighted predictable improvements from larger models and datasets, providing indirect support for the benefits of overparameterization in training practice.

By 2020, theoretical advancements deepened the understanding, with papers like that of Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai deriving exact asymptotic expressions for ridgeless regression under label noise, quantifying the "price of interpolation" and explaining the second descent phase through bias-variance decompositions. This period also saw growing community engagement, with an increasing number of papers, workshops, and discussions on overparameterization at conferences like ICML and NeurIPS.

From 2021 to 2023, double descent was connected to related phenomena such as grokking—where overparameterized models suddenly generalize after prolonged training—and to extensions of the lottery ticket hypothesis, illustrating how sparse subnetworks in large models exhibit similar non-monotonic generalization patterns. Comprehensive surveys and tutorials during this era synthesized these developments, underscoring double descent's role in rethinking generalization bounds. Recent work from 2024 to 2025 has reported persistent double descent in advanced generative models—for instance, comparisons of discrete diffusion versus autoregressive models in which overparameterization improves sample efficiency beyond classical limits—and in large language models (LLMs), where extended training on massive datasets can show a second descent in loss curves. These findings, often shared via preprints, affirm the phenomenon's relevance in contemporary architectures. Overall, the adoption of double descent has driven a paradigm shift in the machine learning community, moving from an emphasis on explicit regularization toward embracing overparameterization as a reliable path to strong generalization, as evidenced by the growing focus on scaling studies in NeurIPS and ICML proceedings.

Theoretical Explanations

Linear Regression Models

In linear regression models, double descent emerges as a fundamental phenomenon in the analysis of overparameterized learning, and this is the simplest setting in which the risk curve can be derived explicitly. Consider the setup where the training data consist of n samples with p-dimensional features drawn from an isotropic Gaussian distribution, \mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, I_p), and labels generated by a teacher-student linear model: y_i = \langle \mathbf{x}_i, \boldsymbol{\beta}^* \rangle + \epsilon_i, where \boldsymbol{\beta}^* is the fixed true parameter vector with \|\boldsymbol{\beta}^*\|^2 = 1 for normalization, and noise \epsilon_i \sim \mathcal{N}(0, \sigma^2). The model is fitted using ridgeless least squares, which minimizes the squared loss without regularization. In the underparameterized regime (p < n), this yields the ordinary least squares estimator; in the overparameterized regime (p > n), it corresponds to the minimum-norm interpolator that achieves zero training error by selecting the solution with the smallest \ell_2-norm among all interpolators. This minimum-norm solution carries an implicit bias toward small-norm solutions, equivalent to the limit of ridge regression as the regularization parameter \lambda \to 0^+.

The generalization risk, defined as the expected out-of-sample prediction error \mathbb{E}[(y' - \hat{y}')^2] for a new test point (\mathbf{x}', y'), can be decomposed into bias and variance terms and analyzed in the high-dimensional asymptotic regime where n, p \to \infty with fixed ratio \gamma = n/p. In the random feature-selection variant—where the p fitted features are a random subset of a larger true support of size D \gg p, preserving isotropy—the exact risk expressions reveal the double descent curve. For the underparameterized case (\gamma > 1, or p \leq n-2), the risk is

R(\gamma) = \left( \left(1 - \frac{p}{D}\right) \|\boldsymbol{\beta}^*\|^2 + \sigma^2 \right) \left( 1 + \frac{p}{n - p - 1} \right).

Here, the first factor captures the bias from uncaptured signal features (decreasing as p increases) plus noise, while the second is the variance inflation factor. As p grows (\gamma decreases toward 1), the bias reduction initially dominates, causing the risk to decrease; near the interpolation threshold, variance dominates, driving the risk upward to a peak. In the overparameterized case (\gamma < 1, or p \geq n+2), the risk becomes

R(\gamma) = \|\boldsymbol{\beta}^*\|^2 \left(1 - \frac{n}{p}\right) + \left( \left(1 - \frac{p}{D}\right) \|\boldsymbol{\beta}^*\|^2 + \sigma^2 \right) \left( 1 + \frac{n}{p - n - 1} \right).

The term 1 - \gamma represents the squared bias from the component of \boldsymbol{\beta}^* orthogonal to the row space of the design matrix (a consequence of the minimum-norm bias toward the origin in the null space), while the remaining terms reflect variance dominated by the interpolation penalty near \gamma \approx 1, diverging as \gamma \to 1^- and decreasing as \gamma \to 0 (p \gg n). In the limit D \to \infty (a weak signal spread across many dimensions), the risk simplifies asymptotically to forms like R(\gamma) \approx (1 - \gamma) + \frac{\sigma^2}{1 - \gamma} for \gamma < 1 (up to constants), highlighting the second descent as overparameterization reduces variance while the residual bias vanishes slowly. The full curve thus decreases in the underparameterized regime (bias reduction), peaks near the interpolation threshold \gamma \approx 1 (variance explosion), and decreases again in the overparameterized regime (benign overfitting).
This derivation relies on properties of the Moore-Penrose pseudoinverse and expectations over Wishart-distributed matrices. These results hold under the assumption of isotropic Gaussian features, enabling closed-form bias-variance decomposition via random matrix theory.
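As a sanity check, the two branches above can be evaluated numerically. The helper below simply transcribes the displayed expressions (it is not an independent derivation); the choices n = 100, D = 1000, and \sigma^2 = 0.25 are arbitrary, and with these constants the slow bias reduction is outweighed by variance inflation before the threshold, so the classical first descent is weak while the peak near p \approx n and the second descent are pronounced.

```python
def risk_weak_features(p, n, D, sigma2, beta_norm2=1.0):
    """Transcription of the two risk branches displayed above (weak-features model)."""
    missed = (1.0 - p / D) * beta_norm2                 # signal on unselected features
    if p <= n - 2:                                      # underparameterized branch
        return (missed + sigma2) * (1.0 + p / (n - p - 1.0))
    if p >= n + 2:                                      # overparameterized branch
        return beta_norm2 * (1.0 - n / p) + (missed + sigma2) * (1.0 + n / (p - n - 1.0))
    return float("inf")                                 # expressions diverge near p = n

n, D, sigma2 = 100, 1000, 0.25
for p in [10, 50, 90, 98, 102, 150, 300, 600, 1000]:
    print(f"p={p:5d}  gamma=n/p={n/p:5.2f}  risk={risk_weak_features(p, n, D, sigma2):9.3f}")
```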

Generalizations to Non-Linear and Kernel Methods

The double descent phenomenon, initially characterized in linear regression, extends to non-linear models and kernel methods, demonstrating the broader applicability of overparameterization insights across diverse function classes. In kernel regression, particularly with random features approximations, the risk curve exhibits a characteristic peak at the interpolation threshold followed by a descent in the overparameterized regime. For instance, in ridgeless random features regression, the test error follows a double descent pattern, with explicit risk bounds revealing how the number of features influences the bias and variance components similarly to the linear case but adapted to the kernel's properties. This behavior arises from the spectral properties of the random features matrix, where increasing the number of features beyond the sample size leads to improved generalization despite interpolation.

In the neural tangent kernel (NTK) regime, infinite-width neural networks trained with gradient descent behave as kernel machines, where the NTK governs the training dynamics. The generalization error in this setting mirrors the linear case but incorporates the eigenvalue decay of the NTK, leading to double descent (and, in some high-dimensional settings, multiple descents) as the model width increases. Specifically, the asymptotic risk accounts for the decaying eigenvalues, which smooth the transition and explain the descent phase through enhanced alignment with the signal subspace. This framework highlights how non-linear architectures in the lazy training regime inherit double descent from kernel methods, with the peak moderated by the kernel's effective dimensionality.

Extensions to non-linear activations further confirm the persistence of double descent, albeit with modifications to the risk curve's shape. Perturbation analysis around the linear regime shows that non-linearities, such as ReLU, alter the peak's location and height but preserve the overall double descent structure in random features models. For example, in high-dimensional settings, the generalization error of non-linear random features models exhibits double descent, with the non-linearity affecting the variance term through changes in feature correlations, as derived via asymptotic bias-variance decompositions. Similarly, in generalized linear models (GLMs) with non-linear link functions, such as logistic regression, double descent emerges in the high-dimensional limit, where the risk is analyzed using generalized bias-variance tradeoffs that incorporate the link function's convexity properties. These results demonstrate that double descent is robust to non-linear transformations, provided the model remains overparameterized relative to the data.

From an information-theoretic perspective, double descent can be interpreted as a transition in the mutual information between inputs and learned representations in overparameterized function spaces. In high-dimensional regression analyzed under the information bottleneck framework, the optimal compression rate yields an information-theoretic analog of double descent, in which the level of overfitting (test error) decreases as the number of parameters increases, due to a transition from redundant to synergistic encoding of the signal. This view connects double descent to capacity limits in non-linear models, emphasizing how overparameterization enhances information efficiency beyond the interpolation threshold. Recent theoretical advances, as of 2025, further elucidate the underlying mechanisms: fine-grained bias-variance decompositions provide deeper insight into its occurrence in least-squares settings, while Bayesian analyses reveal re-descending risk curves in probabilistic models, extending the phenomenon to uncertainty estimation.
Despite these advances, theoretical generalizations face challenges, particularly the computational intractability of exact risk analysis for finite-width non-linear networks, where feature interactions and training dynamics deviate from kernel approximations. This limits closed-form derivations, often requiring numerical simulations or mean-field assumptions to approximate double descent behavior.
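A random features experiment makes the kernel connection concrete. The sketch below (illustrative; the single-index sine target, ReLU features, and all sizes are arbitrary assumptions, not taken from the cited analyses) performs ridgeless regression on m random ReLU features of the inputs and typically shows a peak near m = n followed by a second descent.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, n_test, sigma = 100, 10, 2000, 0.1

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)            # hypothetical single-index target direction

def target(X):
    return np.sin(X @ w_star)

Xtr = rng.normal(size=(n, d))
ytr = target(Xtr) + sigma * rng.normal(size=n)
Xte = rng.normal(size=(n_test, d))
yte = target(Xte)                           # noiseless test labels

def rf_test_mse(m, trials=20):
    """Ridgeless (min-norm) regression on m random ReLU features."""
    errs = []
    for _ in range(trials):
        W = rng.normal(size=(d, m)) / np.sqrt(d)
        Ztr = np.maximum(Xtr @ W, 0.0)      # fixed random first layer
        Zte = np.maximum(Xte @ W, 0.0)
        a = np.linalg.pinv(Ztr) @ ytr       # trained second layer, min-norm solution
        errs.append(np.mean((Zte @ a - yte) ** 2))
    return np.mean(errs)

for m in [10, 30, 60, 90, 100, 110, 150, 300, 1000]:
    print(f"random features={m:5d}  test MSE ~ {rf_test_mse(m):.3f}")
```

Because only the second layer is fitted while the first remains random, this is exactly the lazy-training caricature discussed above: the non-linearity reshapes the feature spectrum, but the double descent mechanism is inherited from ridgeless linear regression in feature space.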

Empirical Evidence

Deep Neural Networks

Empirical observations of double descent in deep neural networks have been prominently demonstrated in image classification tasks using convolutional neural networks (CNNs) and residual networks (ResNets). In experiments on datasets such as CIFAR-10 and CIFAR-100, increasing model width leads to test error curves that initially decrease, reach a peak near the threshold where the model's capacity roughly matches the number of training samples, and then decrease again in the overparameterized regime. For instance, Nakkiran et al. (2020) trained ResNets of varying widths on CIFAR-10 and observed this double descent pattern, with the peak error occurring around 10^5 parameters for 50,000 training samples, followed by improved generalization as width scaled to millions of parameters. Similar behavior was reported on larger-scale image benchmarks, where bigger models beyond the interpolation point achieved lower top-1 error rates, challenging traditional bias-variance assumptions.

Double descent also manifests when scaling the width or depth of multilayer perceptrons (MLPs) on simpler datasets like MNIST. As the number of hidden units or layers increases, test error first traces the classical U-shape, peaks roughly where the network begins to interpolate the training set, and then improves again; experiments with two-layer MLPs on MNIST subsets show test accuracy dipping near the point where the parameter count approaches the number of training samples and then rising above 98% with wider networks, as overparameterization enables better generalization without explicit regularization. Depth scaling in deeper MLPs exhibits analogous curves, with error peaking at moderate depths (e.g., 4-5 layers) before improving in very deep architectures trained via SGD.

In terms of training dynamics, double descent emerges over the course of SGD optimization epochs, independent of model size scaling. Test error decreases initially with more epochs, but can exhibit a peak before descending again, often linked to a phase in which the model transitions from underfitting to memorizing noise before refining its generalizations. Heckel and Yilmaz (2020) analyzed this epoch-wise double descent in overparameterized networks, attributing it to the superposition of two or more bias-variance tradeoffs that arise on different time scales during training; for instance, in wide ResNets, the test error can peak around 50-100 epochs before dropping below its earlier level by 200 epochs. This phenomenon highlights how prolonged training in overparameterized settings can recover generalization after an apparent overfitting spike.

At large scales, language models exhibit double descent when varying parameter count against metrics like perplexity on language modeling tasks. GPT-like architectures trained on text corpora have been reported to show test loss decreasing with model size, a bump near the interpolation threshold, and further descent as parameters grow into the billions, with overparameterized regimes coinciding with emergent abilities. Nakkiran et al. (2020) first observed model-wise double descent in Transformers trained on language translation tasks, with curves mirroring those for image tasks; more recent scaling studies report related behavior, with models on the order of 1.5B parameters achieving lower loss post-peak than smaller interpolating models. Recent evidence from 2024-2025 extends double descent to vision transformers (ViTs) and generative models, particularly under data scarcity. In ViTs trained on reduced datasets, a sparse double descent appears when pruning parameters, with test error peaking at intermediate sparsity levels before improving as the network is pruned further, though suitable L2 regularization can mitigate the peak.
For generative models in low-data regimes, such as discrete diffusion models for text generation trained on limited samples, double descent appears in sample-efficiency curves: autoregressive baselines underperform post-interpolation, while discrete diffusion variants descend faster, achieving lower negative log-likelihoods (e.g., improvements of 2-3 nats) in overparameterized setups with under 10^4 samples. These findings underscore double descent's relevance in modern generative architectures facing data constraints.
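Published results like those above require substantial compute, but the underlying protocol—train models of increasing width to (near-)zero training error on noisy labels and record test error—can be prototyped on synthetic data. The sketch below uses scikit-learn's MLPClassifier purely as an illustration; the dataset, width grid, label-noise rate, and training budget are arbitrary assumptions, and whether a visible peak appears depends on these choices and on the random seed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical toy setup: 300 noisy training points, widths swept past the point
# where the network can interpolate them. flip_y injects label noise.
X, y = make_classification(n_samples=1200, n_features=20, n_informative=10,
                           flip_y=0.15, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=300, random_state=0)

for width in [2, 4, 8, 16, 32, 64, 128, 256]:
    net = MLPClassifier(hidden_layer_sizes=(width,), alpha=0.0,
                        max_iter=2000, tol=1e-7, random_state=0)
    net.fit(Xtr, ytr)
    print(f"width={width:4d}  train error={1 - net.score(Xtr, ytr):.3f}  "
          f"test error={1 - net.score(Xte, yte):.3f}")
```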

Other Machine Learning Domains

Double descent has been observed in classical machine learning methods beyond deep neural networks, demonstrating the phenomenon's generality across diverse algorithmic paradigms. In decision trees and random forests, the risk curve exhibits a characteristic peak near the interpolation threshold, where increasing model complexity—measured by the number of leaves per tree or by ensemble size—leads to initial overfitting followed by improved generalization. For instance, empirical studies on datasets like MNIST show that increasing the number of trees in a forest, or the maximum number of leaves per tree, results in a double-descent pattern, with test error declining after the point where the model perfectly fits the training data. This behavior aligns with findings that interpolating trees in ensembles can enhance robustness to noise, contrasting earlier views of random forests as inherently resistant to overfitting. Recent evaluations on genomic datasets further confirm double descent in decision tree regressors, where test error rises with leaf count before descending again in overparameterized regimes.

Support vector machines, particularly kernel-based variants, also display double descent when scaling the number of support vectors or features. On UCI datasets, kernel SVMs tuned with increasing model capacity—such as through random Fourier features—reveal a risk peak at the interpolation threshold, followed by a second descent as the effective dimensionality grows. This mirrors the minimum-norm dynamics seen in other methods, where larger kernel approximations yield smoother solutions and better test performance. Empirical curves from such experiments highlight how overparameterization mitigates the classical bias-variance tradeoff in non-linear classification tasks. In boosting algorithms such as AdaBoost and gradient boosting, double descent emerges as the committee size or number of boosting iterations increases toward interpolation: test risk initially decreases, peaks when the ensemble achieves zero training error, and then descends further with additional weak learners, a pattern observed across benchmarks that underscores the benefits of overparameterized ensembles in reducing variance without explicit regularization. For Gaussian processes, which operate in non-parametric limits, double descent appears as a function of the effective dimensionality of the kernel, with cross-validation metrics showing the non-monotonic behavior characteristic of the phenomenon; analytical results demonstrate that uncertainty estimates in GPs exhibit a risk peak before improving in high-dimensional settings, providing insights into Bayesian non-parametric modeling.

More recent applications extend double descent to specialized domains. In recommender systems using matrix factorization, overparameterization—via an increased number of latent factors—leads to double descent curves on sparse user-item data, where test reconstruction error peaks at the interpolation threshold and then declines, improving recommendation accuracy. Similarly, in time-series forecasting, Transformer-based models trained on public benchmarks such as electricity load data exhibit epoch-wise deep double descent, with validation loss rising mid-training before a second drop, challenging traditional early-stopping practices in overparameterized fits. These findings, from experiments reported in 2023, highlight the relevance of double descent in sequential prediction tasks. Unified empirical plots across these methods reveal consistent double-descent shapes, with complexity axes normalized to the interpolation threshold for comparability.
For example, figures comparing random forests, SVMs, and boosting on shared datasets like MNIST illustrate aligned risk peaks around the point of 100% training accuracy, followed by parallel descents, emphasizing the phenomenon's broad occurrence in classical machine learning. Such cross-domain visualizations underscore how overparameterization can enable better generalization in non-neural settings, broadening the scope of double descent beyond modern architectures.
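The overparameterized tail of this behavior is easy to probe with off-the-shelf tools. In the sketch below (an illustrative setup with arbitrary synthetic data and label noise, not a reproduction of any cited experiment), every fully grown tree interpolates the noisy training labels, so adding trees moves further into the interpolating regime, yet test accuracy keeps improving; reproducing the peak itself requires additionally sweeping single-model complexity below the threshold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Arbitrary synthetic task with 10% label noise; every fully grown tree
# (max_depth=None, bootstrap=False) fits the training labels exactly.
X, y = make_classification(n_samples=4000, n_features=30, n_informative=15,
                           flip_y=0.10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=1000, random_state=0)

for n_trees in [1, 2, 5, 10, 50, 200]:
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=None,
                                bootstrap=False, max_features="sqrt",
                                random_state=0, n_jobs=-1)
    rf.fit(Xtr, ytr)
    print(f"trees={n_trees:4d}  train acc={rf.score(Xtr, ytr):.3f}  "
          f"test acc={rf.score(Xte, yte):.3f}")
```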

Implications

Impact on Model Selection and Generalization

The discovery of double descent has prompted a reevaluation of regularization strategies in machine learning, particularly in overparameterized regimes where models have more parameters than training samples. Traditional explicit regularization techniques, such as \ell_2 penalties or dropout, were historically employed to prevent overfitting by constraining model complexity below the interpolation threshold. However, double descent demonstrates that in these regimes, interpolating models—those achieving zero training error—can generalize effectively without such penalties, as the test error declines after the initial peak. This shift favors sources of implicit regularization, such as the bias of stochastic gradient descent or architectural choices, over explicit penalties alone, allowing practitioners to leverage larger models for improved performance.

Classical model selection techniques, including cross-validation, encounter significant pitfalls when applied to overparameterized models exhibiting double descent. Standard cross-validation assumes a U-shaped risk curve driven by the bias-variance tradeoff, leading it to favor models just before the interpolation threshold where test error appears minimal. At and beyond the interpolation threshold, however, cross-validation often fails to identify optimal overparameterized models, as the peak in test error misleads selection toward underperforming configurations. To address this, researchers recommend validation protocols suited to interpolators, such as holdout sets evaluated on fully trained interpolating models or modified schemes that account for the post-interpolation descent, ensuring more reliable hyperparameter choices.

Insights from double descent have advanced the understanding of generalization, particularly through the lens of benign overfitting, where overparameterized interpolators achieve low test error despite memorizing training data. This phenomenon explains why models with parameters far exceeding sample size can still generalize, attributing success to implicit biases in optimization and architecture that favor smoother functions in high dimensions. Consequently, it has implications for sample efficiency: overparameterization can reduce the data requirements for reaching low error rates, as the second descent phase mitigates the risks of classical overfitting.

Best practices emerging from double descent emphasize scaling models beyond the interpolation threshold to access the second descent, where performance tends to improve with increased capacity. Practitioners are advised to monitor both underparameterization risks (high bias) and overparameterization risks (elevated test error near the interpolation peak), using techniques like ensembling to smooth transitions. This approach contrasts with prior conservatism, encouraging experimentation with wider or deeper networks trained to interpolation rather than stopping early to avoid apparent overfitting. Quantitatively, embracing the second descent has yielded notable error reductions in vision tasks; for example, on CIFAR-10 with ResNet architectures, scaling model width beyond the interpolation threshold has improved test accuracy by roughly 5-10% relative to critically parameterized baselines, from errors around 15% at the peak to under 7% in the overparameterized regime.

Double descent also exhibits connections to several related phenomena in deep learning. Grokking, the delayed generalization observed in overparameterized models trained on small algorithmic datasets, mirrors the second descent phase of double descent through analogous learning dynamics, where test performance improves sharply after prolonged training. This link arises from inductive biases that prioritize slow-emerging, generalizing patterns over initially fast-learned, memorization-heavy representations, as demonstrated in transformer models on algorithmic tasks such as modular arithmetic.
The lottery ticket hypothesis complements double descent by proposing that overparameterized networks contain sparse subnetworks—identified via iterative magnitude pruning—that match or exceed full-network performance, often bypassing the risk peak associated with dense interpolation regimes. In sparse double descent settings, however, such subnetworks can alter the shape of the test error curve, revealing that "winning tickets" do not always align with the optimal trajectory of the unpruned model, particularly under label noise. Double descent also integrates with neural scaling laws, unifying power-law loss reductions with model size and compute in the overparameterized regime, where the second descent reflects smoother navigation of high-dimensional loss landscapes. This connection helps explain why larger models, despite interpolating their training data, achieve better generalization without the classical overfitting penalty, as seen in empirical loss curves for language models.

Open questions in double descent research include the exact conditions triggering the second descent under non-i.i.d. data distributions, such as training with repeated samples, where empirical studies show test loss spikes attributed to memorization but lack precise theoretical thresholds. Its interplay with adversarial robustness remains unresolved: overparameterization has been observed to induce double descent in robust training losses, potentially enhancing resilience in data-rich regimes while complicating vulnerability in others. Causality within optimization dynamics—specifically, how gradient flows in non-convex settings drive the risk curve's re-descent—continues to elude full explanation, with ongoing work probing inductive biases and the competition between memorizing and generalizing patterns.

Future directions include extending double descent to federated learning, where local updates on non-i.i.d. client data may amplify or reshape the descent in distributed settings, as well as investigations into diffusion models and trillion-parameter large language models (LLMs). Post-2023 work has examined sample efficiency in autoregressive versus diffusion architectures, the inheritance of descent behaviors during fine-tuning of foundation models, and links between double descent and emergent abilities in LLMs; studies of low-rank adaptation (LoRA) fine-tuning, for example, report transient instabilities in training loss and propose methods such as momentum-guided perturbations to mitigate them. Persistent gaps remain in linking double descent to non-convex optimization landscapes beyond linear approximations, underscoring ongoing challenges in these areas.

References

  1. Belkin, M., Hsu, D., Ma, S., & Mandal, S. "Reconciling modern machine-learning practice and the classical bias–variance trade-off." PNAS, 2019.
  2. Nakkiran, P., et al. "Deep Double Descent: Where Bigger Models and More Data Hurt." arXiv, December 2019.
  3. "Understanding the Double Descent Phenomenon in Deep Learning." Tutorial, March 2024.
  4. Geman, S., Bienenstock, E., & Doursat, R. "Neural Networks and the Bias/Variance Dilemma." 1992.
  5. Craven, P., & Wahba, G. "Smoothing Noisy Data with Spline Functions."
  6. Belkin, M., Hsu, D., & Xu, J. "Two models of double descent for weak features." arXiv:1903.07571, March 2019.
  7. "On the Role of Optimization in Double Descent: A Least Squares Study."
  8. Loog, M., et al. "A brief prehistory of double descent." PNAS, 2020.
  9. Belkin, M., Hsu, D., Ma, S., & Mandal, S. "Reconciling modern machine learning practice and the bias-variance trade-off." arXiv, December 2018.
  10. Hestness, J., et al. "Deep Learning Scaling is Predictable, Empirically." arXiv:1712.00409, December 2017.
  11. ICML 2020 Workshops.
  12. NeurIPS 2020 Workshops.
  13. "Ridgeless Regression with Random Features." arXiv:2205.00477, May 2022.
  14. "A Precise Performance Analysis of Learning with Random Features." arXiv:2008.11904, August 2020.
  15. "Generalization Error of Generalized Linear Models in High Dimensions."
  16. "Information bottleneck theory of high-dimensional regression."
  17. Heckel, R., & Yilmaz, F. F. "Early Stopping in Deep Networks: Double Descent and How to Eliminate It." January 2021.
  18. "Sparse Double Descent in Vision Transformers." arXiv:2307.14253, July 2023.
  19. "Double Descent as a Lens for Sample Efficiency in Autoregressive ..." September 2025.
  20. "Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes." arXiv, October 2022.
  21. "Investigating Overparameterization for Non-Negative Matrix Factorization." September 2021.
  22. "Deep Double Descent for Time Series Forecasting." arXiv:2311.01442, November 2023.
  23. "Unifying Grokking and Double Descent." arXiv:2303.06173, March 2023.
  26. "Sparse Double Descent: Where Network Pruning Aggravates Overfitting." arXiv, June 2022.
  27. "Scaling Laws for Neural Language Models." arXiv:2001.08361, 2020.
  28. "Unified Neural Network Scaling Laws and Scale-time Equivalence." arXiv:2409.05782, September 2024.
  29. "Scaling Laws and Interpretability of Learning from Repeated Data." arXiv:2205.10487, May 2022.
  30. "The Curious Case of Adversarially Robust Models." arXiv:2002.11080, February 2020.
    Feb 25, 2020 · In the medium adversary regime, with more training data, the generalization loss exhibits a double descent curve. This implies that in this ...Missing: robustness | Show results with:robustness