
Catastrophic interference

Catastrophic interference, also known as catastrophic forgetting, is a fundamental problem in artificial neural networks in which the acquisition of new information or skills causes a rapid and drastic loss of previously acquired information, leading to impaired performance on earlier tasks. This phenomenon arises because updates to the network's connection weights during training on sequential tasks overwrite parameters critical to prior learning, disrupting the stability of stored representations. First identified in the late 1980s, catastrophic interference was demonstrated through simulations of connectionist networks trained on arithmetic facts, such as the ones and twos addition tables, where learning one set of operations severely impaired recall of the other. In these early models, the issue highlighted limitations in sequential learning paradigms, contrasting with human cognition's ability to retain old knowledge while adapting to new experiences. By the 2010s, the problem gained renewed attention in deep learning, particularly with large-scale models like convolutional neural networks trained on benchmarks such as MNIST digit classification followed by unrelated tasks, revealing near-total forgetting without protective measures.

The causes of catastrophic interference stem from the shared parameter space in neural networks and the stability-plasticity dilemma, where models must balance plasticity (the ability to learn new information) with stability (the retention of old knowledge). Optimization for new objectives, often via gradient descent, alters weights indiscriminately, lacking mechanisms to preserve task-specific importance. This is especially pronounced in scenarios involving non-stationary data distributions, such as continual or lifelong learning, where models must adapt over time without access to past training data. In modern applications, including large language models and autonomous systems, it poses risks like degraded reliability in dynamic environments, such as self-driving vehicles forgetting safe driving rules after updates for new routes.

Efforts to mitigate catastrophic interference have led to advancements in continual learning techniques, including regularization methods like elastic weight consolidation, which penalize changes to weights important for old tasks, and replay-based approaches that rehearse past data to reinforce memories. These strategies aim to enable more biologically plausible learning, drawing parallels to hippocampal replay in the brain. Despite progress, the problem remains a key barrier to robust lifelong learning, underscoring the need for architectures that support stable, incremental knowledge accumulation.

Introduction

Definition and Importance

Catastrophic interference, also known as catastrophic forgetting, refers to the abrupt and severe loss of previously acquired knowledge in artificial neural networks when they are trained on new tasks or data in a sequential manner. This phenomenon arises because the adjustment of connection weights to accommodate new information drastically disrupts the representations established for prior knowledge, leading to a sudden and often complete degradation in performance on old tasks. Unlike the gradual forgetting observed in biological systems, where old memories fade incrementally over time, catastrophic interference in neural networks is characterized by rapid and near-total erasure, highlighting a fundamental brittleness in current architectures. A simple illustration of this effect involves a network first trained to recognize images of cats (Task A), achieving high accuracy on that task. Upon subsequent training on images of dogs (Task B), the network's performance on cat recognition may plummet to near-zero levels, as if the original knowledge had been entirely overwritten. This "catastrophic" nature stems from the distributed nature of representations in connectionist networks, where shared weights encode multiple pieces of information, making isolated updates highly disruptive.

The importance of addressing catastrophic interference cannot be overstated, as it poses a central obstacle to developing systems capable of lifelong or continual learning, where agents must accumulate knowledge over time without access to all past data. This limitation severely hampers applications in domains requiring adaptive, cumulative expertise, such as robotics, autonomous vehicles, and personalized assistants, where forgetting prior skills could lead to unsafe or inefficient behavior. In contrast to the human brain, which demonstrates robust retention through mechanisms like synaptic consolidation despite ongoing learning, the susceptibility of neural networks underscores their current inability to mimic biological adaptability, fueling research into the stability-plasticity dilemma that balances retention of old knowledge with acquisition of new information.
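
As a concrete illustration, the following minimal PyTorch sketch trains one small network sequentially on two synthetic binary tasks standing in for the cat and dog images; the toy data generator, architecture, and hyperparameters are illustrative assumptions rather than a reference experiment.

```python
# Minimal sketch of catastrophic forgetting with synthetic data standing in for
# the cat (Task A) and dog (Task B) images described above. All names and the
# toy data are illustrative assumptions, not a reference implementation.
import torch
import torch.nn as nn


def make_task(seed: int, n: int = 512, dim: int = 20):
    """Generate a linearly separable binary task from a random projection."""
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, dim, generator=g)
    w = torch.randn(dim, generator=g)
    y = (x @ w > 0).long()
    return x, y


def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()


def train(model, x, y, epochs: int = 200):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()


model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
xa, ya = make_task(seed=0)   # "Task A"
xb, yb = make_task(seed=1)   # "Task B"

train(model, xa, ya)
print(f"Task A accuracy after learning A: {accuracy(model, xa, ya):.2f}")

train(model, xb, yb)          # sequential training on Task B only
print(f"Task A accuracy after learning B: {accuracy(model, xa, ya):.2f}")  # typically drops sharply
```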

Stability-Plasticity Dilemma

The stability-plasticity dilemma refers to the inherent tradeoff in neural systems between preserving previously acquired knowledge through synaptic stability and incorporating new information via synaptic plasticity. In artificial neural networks (ANNs), this dilemma arises because learning algorithms like backpropagation update weights uniformly across the network, potentially overwriting representations critical for old tasks when adapting to new ones. Excessive plasticity leads to rapid adaptation but at the cost of prior learning, while excessive stability results in rigidity that prevents effective learning of novel patterns. Biological neural systems address this dilemma through complementary learning mechanisms, such as the division of labor between the hippocampus and the neocortex. The hippocampus enables fast, episodic learning of new experiences with high plasticity, while the neocortex supports gradual, stable consolidation of long-term knowledge through slower integration processes like replay during sleep. This dual-system architecture, as proposed in complementary learning systems theory, allows the brain to balance rapid adaptation with the preservation of established memories. In ANNs, the lack of such selective mechanisms exacerbates the dilemma, making uniform weight updates particularly prone to disrupting task-specific representations during sequential learning. This imbalance manifests as catastrophic interference, where new training destabilizes weights essential for prior performance. The dilemma was first conceptually framed in the late 1980s within connectionist models, highlighting the need for selective update rules that mimic biological selectivity and enable continual learning without wholesale forgetting.

Discovery

McCloskey and Cohen (1989)

McCloskey and Cohen's seminal work, published as a chapter titled "Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem" in the Psychology of Learning and Motivation series, identified a critical limitation in backpropagation-based connectionist models during sequential learning tasks. The authors argued that these networks, relying on distributed representations, exhibit severe disruption of previously acquired knowledge when trained on new information, a phenomenon they termed "catastrophic interference," which undermines their suitability for modeling human cognitive processes that handle new learning far more gracefully.

In their first experiment, McCloskey and Cohen trained a three-layer network (28 input units, 50 hidden units, 24 output units) using backpropagation to learn basic addition facts. The network was initially trained on the "ones" facts, such as 1+1=2 up to 1+9=10, until achieving perfect performance (100% accuracy under a best-match criterion). Subsequently, training shifted to the "twos" facts, like 2+1=3 up to 2+9=11, which overlapped in sums with the prior set (e.g., both 1+2 and 2+1 yield 3). After just two epochs on the new facts, performance on the original ones facts plummeted to approximately 30% accuracy, with systematic errors in which the network's outputs aligned more closely with the new twos equivalents (e.g., treating 1+2 as 2+1=3 but biasing toward higher sums). This abrupt degradation highlighted how weight updates for the second task rapidly overwrote the distributed encodings of the first, unlike the gradual forgetting observed in human arithmetic learning.

The second experiment extended this to a paired-associate learning paradigm, simulating classic retroactive interference studies in human memory. Using a similar network, the model was trained on an A-B list of eight nonsense-syllable and adjective pairs (e.g., dux-regal, zib-majestic), achieving high recall before introducing an interfering A-C list sharing the same stimuli but paired with new responses (e.g., dux-noble). After only three training epochs on the A-C list, recall of the A-B list collapsed to 0% accuracy, in stark contrast to human data from Barnes and Underwood (1959), where participants retained about 51% (4.12 out of 8 items) after 20 trials on the interfering list. This demonstrated that connectionist networks not only forget old associations catastrophically but do so far more severely than the moderate forgetting seen in human interference tasks.

McCloskey and Cohen concluded that catastrophic interference arises fundamentally from the use of distributed representations, where knowledge is encoded across shared connection weights, allowing new learning to corrupt the fragile patterns supporting prior information, a problem less pronounced in localist or propositional models that store facts independently. They emphasized that this issue is inherent to the stability-plasticity dilemma in sequential learning, where accommodating new knowledge destabilizes the old. The paper's empirical demonstrations profoundly influenced the field, igniting widespread debate on the viability of connectionist architectures for cognitive modeling and prompting decades of research into interference mitigation techniques.
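
A loose, modernized sketch of this sequential arithmetic setup is shown below; the one-hot operand encoding, layer sizes, activation function, and training schedule are simplifying assumptions and do not reproduce the original model's exact 28-50-24 encoding or training regime.

```python
# Loose, modernized reconstruction of the sequential arithmetic setup: train on
# the "ones" addition facts to criterion, then on the "twos" facts, and observe
# accuracy on the ones facts degrade. Encodings and hyperparameters are assumptions.
import torch
import torch.nn as nn


def encode(a: int, b: int) -> torch.Tensor:
    x = torch.zeros(20)          # two operands, each one-hot over digits 1..10
    x[a - 1] = 1.0
    x[10 + b - 1] = 1.0
    return x


def make_facts(first_operand: int):
    xs = torch.stack([encode(first_operand, b) for b in range(1, 10)])
    ys = torch.tensor([first_operand + b for b in range(1, 10)])  # sums as class indices
    return xs, ys


def train(model, xs, ys, epochs):
    opt = torch.optim.SGD(model.parameters(), lr=0.5)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(xs), ys).backward()
        opt.step()


def acc(model, xs, ys):
    with torch.no_grad():
        return (model(xs).argmax(1) == ys).float().mean().item()


model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 20))  # outputs index the sum
ones_x, ones_y = make_facts(1)
twos_x, twos_y = make_facts(2)

train(model, ones_x, ones_y, epochs=2000)          # learn ones facts to criterion
print("ones facts:", acc(model, ones_x, ones_y))   # ~1.0
train(model, twos_x, twos_y, epochs=50)            # brief training on twos facts
print("ones facts after twos:", acc(model, ones_x, ones_y))  # typically much lower
```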

Ratcliff (1990)

In his 1990 paper "Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions," published in Psychological Review, Roger Ratcliff evaluated multilayer connectionist networks trained with the backpropagation learning algorithm as models of human recognition memory, highlighting fundamental limitations in their ability to simulate realistic learning and forgetting dynamics. Building on earlier empirical demonstrations of catastrophic interference in connectionist systems, such as those by McCloskey and Cohen (1989), Ratcliff focused on theoretical constraints arising from the incompatibility between network learning functions and observed human memory behaviors.

A central analysis in the paper concerns the form of forgetting curves: human recognition memory exhibits power-law decay, with retention declining gradually as a power function of time (with a decay exponent typically around 0.88), reflecting slow, asymptotic forgetting. In contrast, backpropagation networks produce either flat forgetting curves (minimal loss of old knowledge during new learning) or abrupt drops to near-chance performance, leading to catastrophic interference in which acquiring new information severely disrupts prior memories. This mismatch arises because the algorithm's weight updates, driven by error minimization, overwrite distributed representations of old items when adapting to new ones, preventing the gradual degradation required for biological realism.

To illustrate these issues, Ratcliff conducted simulations using a three-layer network trained on lists of abstract items represented by random vectors, tasked with discriminating old items from novel distractors in a recognition paradigm. The network achieved high accuracy (over 90%) on an initial list of 20 items after extended training, but subsequent training on a new list caused performance on the original list to plummet to chance levels (around 50%), effectively erasing prior discriminations despite the absence of explicit unlearning. These results demonstrated that catastrophic interference was not merely a training artifact but a systemic outcome of the learning algorithm, exacerbated in sequential tasks mimicking human memory acquisition.

Ratcliff further derived constraints showing that no single learning rate could simultaneously support rapid acquisition of new information (requiring high rates for quick learning) and slow, power-law forgetting of old information (requiring low rates to preserve prior knowledge), as high rates induced rapid overwriting while low rates delayed initial learning unacceptably. He proposed that purely distributed connectionist architectures are inherently limited for modeling stable recognition memory and advocated localist-distributed models, where localist units maintain dedicated, interference-resistant representations for familiar items alongside distributed processing for novel ones. This analysis influenced subsequent research by emphasizing the necessity of biologically plausible forgetting mechanisms in neural networks to mitigate catastrophic interference, paving the way for explorations into alternative learning rules and architectures that better balance plasticity and stability.

Mechanisms

Learning and Forgetting Dynamics

In neural networks, learning dynamics are governed by backpropagation, which computes gradients to minimize error via gradient descent, producing proportional updates across all weights that induce global changes in the network's parameters. These updates enable the network to adapt to new data but can disrupt existing configurations when training proceeds sequentially. The stability-plasticity dilemma underscores this tension, where the need for plasticity in acquiring new knowledge often compromises the stability of prior learning. Forgetting dynamics arise because old knowledge is encoded in specific patterns of weights, which new gradients can overwrite if the representations overlap, leading to a rapid and substantial drop in performance on previous tasks. This overwriting occurs because error minimization for new inputs adjusts shared weights without regard for their role in older tasks, amplifying the loss of previously acquired knowledge. The role of distributed representations exacerbates this interference, as artificial neural networks store information across interconnected weights rather than in modular, localized structures like those in biological brains, making isolated preservation of knowledge challenging. In contrast to modular biological systems, which can compartmentalize functions to limit interference, the holistic encoding in ANNs means that adjustments for one task propagate broadly, heightening vulnerability to forgetting. Empirically, interference is more severe for similar tasks due to greater overlap in their representations, as evidenced by backward transfer metrics that quantify the average change in accuracy on old tasks after learning a new one. Backward transfer, defined as the mean difference between post-new-task and pre-new-task accuracies on previous tasks, often yields negative values indicating forgetting, particularly when tasks share representational subspaces. For instance, in sequential learning benchmarks, similar tasks show steeper declines than dissimilar ones, highlighting the impact of representational similarity.
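
The backward-transfer computation described above can be made concrete with a short sketch; the accuracy-matrix convention used here (R[i][j] as accuracy on task j after training through task i) is a common one and is assumed for illustration.

```python
# Hedged sketch of backward transfer: given an accuracy matrix R where R[i][j]
# is accuracy on task j after training through task i, BWT averages the change
# on earlier tasks once the final task has been learned. Negative values mean
# forgetting. The matrix layout is an assumed convention, not from a specific paper.
from typing import List


def backward_transfer(R: List[List[float]]) -> float:
    T = len(R)
    deltas = [R[T - 1][j] - R[j][j] for j in range(T - 1)]
    return sum(deltas) / len(deltas)


# Example: accuracy on task 0 drops from 0.95 to 0.40 after learning task 1.
R = [
    [0.95, 0.10],
    [0.40, 0.93],
]
print(backward_transfer(R))  # -0.55, i.e., substantial forgetting
```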

Sequential Learning Challenges

In sequential learning scenarios, neural networks are trained on tasks presented non-stationarily, meaning data from previous tasks becomes unavailable after the network adapts to the current one, forcing reliance on stored weights for retention. This setup mimics real-world continual learning but exposes networks to the core issue of catastrophic interference, where updates for new tasks overwrite representations critical to old ones. Key challenges arise from backward transfer effects, where learning a new task influences performance on prior tasks; negative backward transfer manifests as detrimental forgetting, while positive transfer is rare and minimal in standard architectures. Interference intensifies with task similarity, as overlapping input representations (such as shared stimuli in arithmetic facts or paired associates) lead to greater weight conflicts and representational overlap, disrupting established knowledge more severely than dissimilar tasks. Interference is quantified by measuring the average accuracy drop on previous tasks after training on a new one, often expressed relative to initial performance. A common metric is the interference index I, defined as I = \frac{A_{\text{old}} - A_{\text{post}}}{A_{\text{old}}}, where A_{\text{old}} is the accuracy on prior tasks before new-task training and A_{\text{post}} is the accuracy afterward; values near 1 indicate near-complete forgetting. Benchmarks like permuted MNIST, where each task applies a unique random pixel permutation to the digits, demonstrate this: standard networks exhibit near-total forgetting, with accuracy on earlier tasks dropping below 5% after just a few sequential permutations. In contrast to artificial neural networks, biological sequential learning in humans leverages episodic memory systems, such as the hippocampus, to isolate and consolidate distinct experiences, enabling stable retention without widespread overwriting, a capability absent in conventional ANNs that rely solely on distributed weight updates.
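
A brief sketch of the interference index and the pixel-permutation construction behind permuted-MNIST-style benchmarks follows; the helper names and tensor shapes are illustrative assumptions.

```python
# Sketch of the interference index defined above, plus the pixel-permutation
# transform used to build permuted-MNIST-style task sequences. Any source of
# flattened image tensors can be permuted in the same way.
import torch


def interference_index(acc_old_before: float, acc_old_after: float) -> float:
    """I = (A_old - A_post) / A_old; values near 1.0 mean near-total forgetting."""
    return (acc_old_before - acc_old_after) / acc_old_before


def make_permutation_task(seed: int, n_pixels: int = 28 * 28) -> torch.Tensor:
    """Each task is defined by a fixed random permutation of pixel indices."""
    g = torch.Generator().manual_seed(seed)
    return torch.randperm(n_pixels, generator=g)


def apply_permutation(images: torch.Tensor, perm: torch.Tensor) -> torch.Tensor:
    """images: (batch, 784) flattened digits; returns the permuted view."""
    return images[:, perm]


print(interference_index(0.98, 0.04))  # ~0.96: near-complete forgetting
```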

Mitigation Strategies

Representation-Based Methods

Representation-based methods address catastrophic interference by modifying the input or hidden layer representations to reduce overlap between tasks, thereby minimizing the disruption caused by shared neural resources during sequential learning. These techniques emerged in the early 1990s as responses to the stability-plasticity dilemma, focusing on separable subspaces or selective activations without altering the network architecture or replaying data. By promoting distinct, non-interfering representations, they enable better preservation of prior knowledge while accommodating new information.

A key approach involves enforcing orthogonality in task representations to limit weight sharing and interference. In this method, task-specific vectors are designed to be orthogonal, such as by rotating inputs to project them into non-overlapping subspaces, ensuring that updates for one task do not alter weights critical for others. One 1992 proposal introduced a dynamic orthogonalization procedure that iteratively adjusts hidden activations during training to produce distributed yet orthogonal representations, demonstrating reduced forgetting in backpropagation networks on sequential tasks.

The node sharpening technique enhances the selectivity of hidden units through output-dependent inhibition, which suppresses overlapping activations and sharpens responses dedicated to old tasks. This promotes semi-distributed representations in which individual hidden nodes specialize in task-specific features, limiting the spread of new learning to previously established pathways. French (1992) proposed this algorithm for feedforward networks, showing that it significantly mitigates interference by increasing sparsity and exclusivity in hidden layer patterns during sequential training (see the sketch below).

Another strategy is the novelty rule, which identifies the novel components of an input relative to existing knowledge and directs learning toward resources not already dedicated to prior patterns, avoiding modifications to nodes tuned for earlier tasks. This allocation isolates new representations, preserving the integrity of old ones. Kortge (1990) developed this rule for simple networks, illustrating its ability to prevent overwriting by committing capacity only to the unfamiliar aspects of new patterns.

Pre-training initializes the network on a large, diverse body of domain knowledge to form general, robust representations before supervised task-specific training. This establishes a broad foundational structure that subsequent learning extends rather than overwrites. McRae and Hetherington (1993) showed through simulations on associative tasks that such pre-training eliminates catastrophic interference, as the pre-established hidden representations provide stable anchors for new mappings.

These methods effectively reduce catastrophic interference in shallow networks, with studies reporting retention rates exceeding 90% on prior tasks after learning unrelated new ones in controlled experiments. However, their efficacy diminishes in deeper architectures, where maintaining orthogonality or selective allocation becomes computationally intensive and less scalable as representational complexity grows.
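
The sketch below illustrates the sharpening idea in simplified form: the most active hidden units are nudged toward full activation and the rest toward zero, yielding sparser, more exclusive hidden codes. The top-k rule and sharpening factor are assumptions for illustration, not French's exact procedure.

```python
# Hedged sketch of activation ("node") sharpening in the spirit described above.
# The sharpened activations can serve as hidden-layer targets for an extra weight
# update, concentrating each pattern on fewer dedicated units.
import torch


def sharpen(hidden: torch.Tensor, k: int = 2, factor: float = 0.3) -> torch.Tensor:
    """hidden: (batch, n_hidden) activations in [0, 1]; returns sharpened targets."""
    topk = hidden.topk(k, dim=1).indices
    mask = torch.zeros_like(hidden).scatter_(1, topk, 1.0)
    # Move the winners part of the way toward 1 and the rest part of the way toward 0.
    return hidden + factor * (mask - hidden)


h = torch.tensor([[0.60, 0.55, 0.20, 0.10]])
print(sharpen(h, k=1))  # the most active unit is boosted, the others suppressed
```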

Rehearsal Methods

Rehearsal methods address catastrophic interference by maintaining a small buffer of representative examples from previous tasks and interleaving them with new data to jointly optimize the model, thereby reinforcing prior knowledge during sequential learning. This approach mimics human rehearsal processes and has been foundational in continual learning frameworks since the early demonstrations of replay in neural networks. By co-training on buffered old samples, these methods prevent the overwriting of established representations, achieving significant reductions in forgetting compared to naive sequential training.

Traditional rehearsal techniques include pseudo-recurrent networks, which partition the network into distinct modules, with one component handling new inputs while the other recurrently replays hidden states from past experiences to stabilize learning. Another early variant is self-refreshing memory, where the network periodically generates and retrains on internal pseudopatterns derived from random activations interleaved with new data, enabling the learning of temporal sequences without disrupting prior knowledge. These methods laid the groundwork for buffer-based rehearsal, often employing simple strategies like random sampling to select exemplars for replay.

Generative replay extends traditional rehearsal by training a separate generative model, such as a generative adversarial network, alongside the primary classifier to synthesize plausible samples from previous tasks, avoiding the need to store actual data. Introduced in the Deep Generative Replay framework, this dual architecture trains the generator on a mix of real data and replayed samples from old tasks before training the classifier on new inputs augmented with generated replays, demonstrating effectiveness on permuted MNIST and rotated MNIST benchmarks where storage-constrained methods fail. This technique reduces storage overhead while preserving performance across task sequences.

Spontaneous replay draws inspiration from hippocampal dynamics in the brain, where experiences are reactivated offline during idle periods, such as sleep, to consolidate memories. In neural networks, this involves replaying internally generated hidden representations, rather than raw inputs, during training pauses or via contextual sampling, which has been shown to mitigate forgetting in sequential image classification tasks by promoting diverse reactivation without explicit storage of past data. Such brain-inspired variants enhance consolidation by simulating sleep-like replay, leading to more robust knowledge retention.

Variants of rehearsal often incorporate reservoir sampling for efficient buffer management, in which each incoming sample replaces a stored one with probability equal to the buffer size divided by the number of samples seen so far, ensuring a uniformly representative subset of past data without bias toward recent tasks; a buffer sketch follows below. This strategy is particularly effective in class-incremental learning scenarios, such as on CIFAR-100, where methods like iCaRL combine herding-based exemplar selection with knowledge distillation to achieve average accuracies around 55% across 10 incremental batches of classes (with 2000 exemplars), far surpassing non-rehearsal baselines. Reservoir sampling balances computational efficiency and coverage, making it a staple in scalable implementations.

Despite their efficacy, rehearsal methods incur notable limitations, including high storage costs for maintaining buffers of real data, which scale poorly with the number of tasks, and privacy concerns when retaining sensitive examples from prior distributions. These challenges have spurred ongoing refinements, though such trade-offs remain inherent to data-dependent replay approaches.
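
A minimal sketch of a reservoir-sampling buffer and an interleaved rehearsal step is given below, assuming a generic PyTorch classifier; the buffer capacity and replay batch size are illustrative choices rather than values from any specific method.

```python
# Minimal sketch of a reservoir-sampling replay buffer and interleaved rehearsal.
import random
import torch


class ReservoirBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.seen = 0
        self.data = []  # list of (x, y) pairs

    def add(self, x: torch.Tensor, y: torch.Tensor) -> None:
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            # Keep the new sample with probability capacity / seen.
            j = random.randint(0, self.seen - 1)
            if j < self.capacity:
                self.data[j] = (x, y)

    def sample(self, k: int):
        batch = random.sample(self.data, min(k, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)


def rehearsal_step(model, opt, loss_fn, x_new, y_new, buffer, replay_k=32):
    """One update on new data mixed with replayed exemplars from old tasks."""
    loss = loss_fn(model(x_new), y_new)
    if buffer.data:
        x_old, y_old = buffer.sample(replay_k)
        loss = loss + loss_fn(model(x_old), y_old)
    opt.zero_grad()
    loss.backward()
    opt.step()
    for x, y in zip(x_new, y_new):
        buffer.add(x, y)
```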

Regularization Methods

Regularization methods mitigate catastrophic interference by modifying the loss function during training on new tasks to include penalty terms that constrain updates to parameters deemed important for previously learned tasks, thereby balancing plasticity and stability without requiring access to past data. These approaches analytically prioritize the retention of old knowledge by weighting gradient changes based on estimated parameter importance, often derived from task-specific loss landscapes or posterior approximations.

A foundational example is Elastic Weight Consolidation (EWC), introduced by Kirkpatrick et al. in 2017, which quantifies parameter importance using the diagonal of the Fisher information matrix F, computed as the expected squared gradients of the loss with respect to the parameters under the old task distribution. The total loss for the new task becomes:

\mathcal{L} = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2} \sum_{i} F_i (\theta_i - \theta_{\text{old},i})^2

where \mathcal{L}_{\text{new}} is the standard loss on new data, \lambda scales the regularization strength, \theta are the current parameters, and \theta_{\text{old}} are the parameters optimized for the previous task. This quadratic penalty approximates the change in old-task loss induced by parameter shifts, effectively safeguarding critical weights. EWC's penalty emerges from an approximate Bayesian framework, in which the posterior over parameters after prior tasks serves as a Gaussian prior for subsequent learning, approximated via a Laplace approximation around the old-task optimum to yield the Fisher-weighted penalty; a code sketch of this penalty follows below.

Building on similar principles, Synaptic Intelligence (SI), proposed by Zenke et al. in 2017, estimates parameter importance online by accumulating each parameter's contribution to the reduction in loss along the training trajectory, computed as a path integral of its gradient times its update, without needing separate Fisher computations after each task. This enables an online regularization term akin to EWC's, applied cumulatively to prevent forgetting in sequential multi-task settings. Another variant, Learning without Forgetting (LwF) by Li and Hoiem in 2016, regularizes by distilling knowledge from the old model: it trains a branched network on new data while using a distillation loss to match the softened output probabilities of the original model on new inputs, preserving representational capabilities for old tasks through output consistency rather than direct parameter penalties.

These regularization techniques have demonstrated efficacy in supervised continual learning benchmarks, such as Split-MNIST, where EWC reduces average forgetting by over 90% compared to naive fine-tuning across five incremental tasks, highlighting their role in maintaining performance on disjoint class subsets without architectural modifications.
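
A hedged sketch of the EWC-style penalty follows: a diagonal Fisher estimate accumulated on the old task, then a quadratic anchor toward the old parameters. The estimation loop and the value of \lambda are illustrative, not the published recipe.

```python
# Hedged sketch of an EWC-style penalty matching the loss above.
import torch
import torch.nn as nn


def estimate_diag_fisher(model, data_loader, loss_fn):
    """Approximate diagonal Fisher as averaged squared gradients on old-task data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}


def ewc_penalty(model, fisher, old_params, lam: float = 1000.0) -> torch.Tensor:
    """0.5 * lambda * sum_i F_i * (theta_i - theta_old_i)^2"""
    penalty = torch.zeros(())
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty


# During training on the new task:
#   loss = loss_fn(model(x_new), y_new) + ewc_penalty(model, fisher, old_params)
# where old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# is snapshotted right after finishing the previous task.
```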

Architectural Methods

Architectural methods address catastrophic interference by modifying the neural network's structure to create dedicated subspaces or components for each task, thereby isolating parameters and minimizing overwriting of prior knowledge. These approaches prioritize plasticity for new tasks while preserving stability for old ones through explicit architectural separation, avoiding the need for replay buffers or regularization penalties on shared weights. Unlike regularization techniques that constrain updates within a fixed parameter space, architectural methods expand or partition the model to enable task-specific learning.

Parameter isolation techniques allocate distinct sets of parameters to different tasks, often by copying portions of the network or expanding it with task-specific modules. A seminal example is Progressive Neural Networks (PNNs), which construct a system of frozen "columns" for previous tasks and add new columns for subsequent tasks, connected via lateral links that allow knowledge transfer without altering earlier parameters. This design ensures zero forgetting on prior tasks while enabling full plasticity for the current one, as demonstrated on reinforcement learning benchmarks such as Atari games, where PNNs retained performance across sequences of 10 tasks with only linear parameter growth. Similar isolation strategies have been extended to graph networks and language models, where private parameters are dynamically assigned so that prior knowledge remains unaffected during updates.

Nested learning builds on hierarchical architectures that layer new task components atop representations from prior tasks, fostering incremental learning without interference. In PNNs, this nesting manifests as a progressive buildup, where each new column reuses low-level features from earlier ones via adapters, promoting forward transfer while isolating high-level task-specific computations. Such hierarchies have shown effectiveness in sequential visual tasks, maintaining near-original accuracies on permuted MNIST variants by avoiding the shared weight updates that cause overwriting.

Dynamic expansion methods further enhance efficiency by selectively growing the network with lightweight, task-specific additions like heads or adapters rather than full copies. Low-rank adaptations (LoRA) exemplify this by injecting low-dimensional trainable matrices into pre-trained layers, enabling task-specific adaptation with minimal parameter increase, often less than 1% of the original model, while freezing the base weights to prevent forgetting (see the sketch below). In continual learning settings, LoRA variants like CL-LoRA use dual adapters (shared and private) to balance transfer and isolation, achieving competitive accuracies on class-incremental CIFAR-100 without rehearsal, outperforming baselines by up to 10% in average performance across tasks. PackNet provides another resource-efficient example, iteratively pruning unimportant weights from the current task's network and reallocating the freed parameters to new tasks, packing multiple models into a single architecture. On fine-grained classification benchmarks like CUB-200, PackNet supported three tasks in a VGG-16 backbone with accuracies within 2% of individually trained networks, proving effective in parameter-constrained settings like edge devices.

These methods trade increased model size, typically linear or sublinear growth with the number of tasks, for the benefit of rehearsal-free operation, eliminating the storage and privacy concerns associated with data replay. While parameter expansion can lead to scalability issues in very long task sequences, techniques like iterative pruning in PackNet mitigate this by reusing capacity, maintaining feasibility for practical deployment.
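
The following sketch shows a LoRA-style low-rank adapter wrapped around a frozen linear layer to illustrate the parameter-isolation idea; the rank, scaling, and wrapping strategy are assumptions for illustration, not a specific library's API.

```python
# Minimal sketch of a LoRA-style low-rank adapter around a frozen linear layer.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus trainable low-rank update (B @ A).
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())


layer = LoRALinear(nn.Linear(128, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3f}")  # small relative to the frozen base
```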

Catastrophic Remembering

Catastrophic remembering refers to the phenomenon in artificial neural networks where excessive stability in learned representations leads to over-retention of prior knowledge, severely impairing the network's ability to acquire and adapt to new tasks or data distributions. This results in a loss of discriminative capacity, as the network persistently outputs responses associated with old patterns even when confronted with novel inputs, effectively "remembering" outdated knowledge at the expense of plasticity. Unlike catastrophic forgetting, which involves abrupt loss of old knowledge, catastrophic remembering manifests as an imbalance toward hyper-stability, preventing meaningful updates to the model's parameters during sequential learning.

The primary causes of catastrophic remembering include overly conservative weight updates that minimize changes to existing parameters, thereby avoiding overwrite of entrenched representations, and rigid architectures that lack sufficient flexibility to accommodate new information without disrupting prior stability. Such mechanisms often arise inadvertently from strategies designed to counteract forgetting, such as excessive replay of old data, which can overgeneralize the network across tasks and reduce its adaptability. This excessive conservatism is the flip side of the stability-plasticity dilemma, where an overemphasis on preserving old knowledge stifles the network's capacity for forward adaptation.

In practical examples, such as multi-class classification in continual learning scenarios, catastrophic remembering appears as the dominance of old class predictions over regions of the input space intended for new classes, leading to poor separation and high misclassification rates for novel data. It can be quantitatively assessed through negative forward transfer, where prior task knowledge hinders initial performance on a subsequent task, resulting in lower accuracy or slower convergence compared to training from scratch. For instance, in sequential image classification, a network might rigidly assign new object categories to previously learned ones, reflecting an inability to form distinct decision boundaries. Historically, early observations of this overgeneralization were noted in cascaded neural network architectures, as discussed by Sharkey and Sharkey (1995), who highlighted how such designs promoted inflexible generalization patterns that prioritized old-task fidelity over new learning.

The implications of catastrophic remembering extend beyond computational challenges, underscoring the need for balanced continual learning frameworks that avoid extremes of instability or rigidity, ensuring networks maintain both retention and adaptability in dynamic environments. Recent work explores brain-inspired approaches, such as metaplasticity in Bayesian neural networks, to mitigate both catastrophic forgetting and remembering.
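
The negative-forward-transfer check mentioned above can be sketched as follows; the accuracy-matrix layout and the use of a from-scratch baseline are common conventions assumed here for illustration.

```python
# Sketch of a negative-forward-transfer check: compare accuracy on each new task
# just after it is learned (with prior-task knowledge in place) against a
# from-scratch baseline on the same task. Negative values indicate that entrenched
# prior knowledge is hindering new acquisition, one signature of catastrophic remembering.
from typing import List


def forward_transfer(R: List[List[float]], baseline: List[float]) -> float:
    """R[i][j]: accuracy on task j after training through task i.
    baseline[j]: accuracy of an independently trained model on task j."""
    T = len(R)
    deltas = [R[j][j] - baseline[j] for j in range(1, T)]
    return sum(deltas) / len(deltas)


print(forward_transfer(R=[[0.95, 0.0], [0.90, 0.55]], baseline=[0.95, 0.80]))  # -0.25
```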

Overgeneralization in Transformers

Overgeneralization in transformer models manifests as an excessive reliance on prior knowledge from pretraining, which can impair adaptation to new tasks through persistent biases and degraded performance on novel data. This arises because transformers, with their attention mechanisms, tend to amplify pretrained representations, causing new training to reinforce rather than sufficiently override old patterns. In transformer-based language models, such as those in the GPT family, fine-tuning on new datasets frequently leads to the dominance of pretraining biases, where factual inaccuracies or stylistic patterns from the base model persist despite updates. This is particularly evident in sequential fine-tuning, where models exhibit rigidity, applying outdated heuristics to new domains and amplifying errors at inference. Analyses of representational rigidity highlight how attention layers can propagate prior embeddings too rigidly, exacerbating the issue across multi-task sequences. Adapter tuning, a parameter-efficient method for adapting transformers, can help mitigate overgeneralization by isolating task-specific updates in lightweight modules inserted into the core model, though careful balancing is needed to avoid incomplete unlearning of old biases. Empirical studies indicate challenges in scaling continual learning to larger transformer models, though specific trends in overgeneralization require further investigation.
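
A minimal sketch of a bottleneck adapter of the kind used in adapter tuning is shown below; the hidden size, bottleneck width, and placement after a frozen sub-layer are illustrative assumptions.

```python
# Hedged sketch of a bottleneck adapter: a small down-/up-projection with a
# residual connection, intended to sit after a frozen transformer sub-layer.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)      # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(torch.relu(self.down(hidden)))


# During fine-tuning, only adapter parameters (and typically layer norms) are
# updated, leaving the pretrained transformer weights, and hence prior behavior,
# largely intact; one adapter can be kept per task.
x = torch.randn(2, 16, 768)                  # (batch, sequence, hidden)
print(Adapter()(x).shape)                    # torch.Size([2, 16, 768])
```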

Recent Advances

Brain-Inspired Approaches

Brain-inspired approaches to mitigating catastrophic interference draw on neuroscience principles to enable more stable sequential learning in artificial neural networks (ANNs), emphasizing mechanisms like selective synaptic updates and hybrid architectures that mimic biological consolidation processes. A notable example is the functionally invariant path (FIP) algorithm developed at Caltech in 2024, which selectively updates neural connections by traversing invariant paths in weight space, thereby retaining prior knowledge with minimal computational overhead. This brain-like selective updating prevents widespread interference by focusing changes on specific pathways, allowing the network to adapt to new data without overwriting established representations. Tested on image classification tasks such as MNIST variants, the FIP algorithm demonstrated robust performance in continual learning scenarios, maintaining accuracy on previous tasks while achieving high proficiency on new ones.

Building on such ideas, hybrid neural networks that integrate ANNs with spiking neural networks (SNNs) emulate the replay-like consolidation observed in corticohippocampal circuits, facilitating memory consolidation and reducing forgetting. In a 2025 study published in Nature Communications, researchers introduced a corticohippocampal-inspired hybrid neural network (CH-HNN) that leverages spiking neurons for temporal coding akin to hippocampal replay, combined with ANN layers for stable cortical storage. This achieved a 50% reduction in forgetting rates on standard continual learning benchmarks, such as permuted MNIST and CIFAR-100, by dynamically replaying experiences in a biologically plausible manner without requiring external buffers. The approach draws on replay methods but advances them through intrinsic neural spiking for efficiency.

Further insights into sparse mechanisms come from the Cobweb/4V model, a hierarchical concept-formation system detailed in a 2025 preprint, which employs sparse and selective updates to explain and achieve robustness against catastrophic forgetting. By incrementally clustering instances through information-theoretic principles, Cobweb/4V confines updates to only the relevant nodes, thereby preserving prior knowledge structures during sequential task learning. Experiments on image datasets, including MedMNIST, showed that this sparse updating, coupled with adaptive structural reorganization, outperformed gradient-based neural baselines in retention, with interference reduced by limiting global parameter changes. These findings highlight how biologically motivated sparsity can foster stability without high computational costs.

Empirical parallels between human cognition and ANNs further inform these designs, as evidenced by a 2025 Nature Human Behaviour study revealing similar interference and transfer patterns across both systems during continual learning. Humans and networks exhibited comparable negative transfer when tasks shared features but diverged in structure, underscoring shared computational principles, such as task similarity governing interference. This alignment supports the pursuit of bio-plausible strategies for scalable deep networks with relatively low computational overhead while avoiding the pitfalls of dense updates. Overall, these approaches demonstrate effectiveness in general continual learning settings, paving the way for interference-resistant models inspired by neural efficiency.

Continual Learning in Large Language Models

Recent empirical studies have highlighted the extent of catastrophic forgetting in large language models (LLMs) during sequential fine-tuning on natural language understanding (NLU) tasks. In a 2025 analysis, researchers evaluated open-source LLMs of varying parameter sizes on benchmarks from the GLUE suite, revealing that smaller models (under 10 billion parameters) exhibit more severe forgetting compared to larger counterparts, primarily due to limited representational capacity that hinders retention of prior knowledge amid new task adaptations. This scaling effect underscores the need for tailored mitigation strategies in resource-constrained LLM deployments.

A notable advancement in addressing this issue is the Forgetting-Aware Pruning Metric (FAPM), introduced at EMNLP 2025, which enables efficient model pruning while minimizing forgetting. FAPM prunes redundant parameters by integrating traditional magnitude-based criteria with a forgetting-aware term, computed from the difference between pre- and post-fine-tuning parameter states on prior tasks; this dual approach preserved 99.67% of downstream accuracy while limiting catastrophic forgetting to just 0.25% in sequential fine-tuning experiments on models like Llama-2-7B. Complementing such parameter-efficient techniques, self-synthesized rehearsal methods, as proposed in 2024 and extended in subsequent works, leverage the LLM itself to generate synthetic instances mimicking old tasks for replay buffering, thereby avoiding the storage overhead of real historical datasets and achieving up to 15% improvement in knowledge retention during continual pre-training.

Gradient-based methods have also emerged for continual LLM scenarios, with techniques like Continual Gradient Low-Rank Projection (GORP) restricting updates to low-rank subspaces to preserve core representations; applied to large language models, GORP reduced forgetting by 20-30% on multi-task sequences without full parameter retraining (a generic sketch of gradient projection appears below). Comprehensive surveys from 2025 categorize these approaches into replay (e.g., synthetic data generation) and regularization (e.g., gradient projection) paradigms, emphasizing their efficacy in balancing stability and plasticity for LLMs under evolving data streams. However, challenges persist, including model collapse induced by recursive training loops, where prolonged exposure to model-generated samples amplifies forgetting and degrades performance, as evidenced in 2025 analyses of LLM pre-training pipelines.
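
As an illustration of the gradient-projection idea, the generic sketch below removes the component of a new-task gradient that lies in a stored subspace of directions important to old tasks; this is a general construction in the spirit of such methods, not the specific GORP algorithm.

```python
# Illustrative sketch of gradient projection for continual learning: the gradient
# component lying in a stored subspace of old-task directions is removed before
# the update, so new learning avoids disturbing those directions.
import torch


def project_out(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """grad: flattened gradient (d,); basis: (d, k) orthonormal columns spanning
    directions deemed important for previous tasks."""
    return grad - basis @ (basis.t() @ grad)


d, k = 1000, 10
basis, _ = torch.linalg.qr(torch.randn(d, k))   # stand-in for an old-task subspace
grad = torch.randn(d)
protected_grad = project_out(grad, basis)
# The protected gradient is orthogonal to the stored subspace:
print(torch.norm(basis.t() @ protected_grad))   # ~0 (up to floating-point error)
```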

References

  1. [1]
  2. [2]
  3. [3] What is Catastrophic Forgetting? - IBM
  4. [4] Catastrophic Interference in Connectionist Networks - Michael McCloskey and Neal J. Cohen
  5. [5] Overcoming catastrophic forgetting in neural networks - PNAS (2017)
  6. [6]
  7. [7] Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem
  8. [8] Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem - M. McCloskey, N. J. Cohen (1989)
  9. [9] Catastrophic forgetting in connectionist networks
  10. [10] A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning (2023)
  11. [11] Continual lifelong learning with neural networks: A review
  12. [12] Using Semi-Distributed Representations to Overcome Catastrophic Forgetting
  13. [13] Catastrophic Interference is Eliminated in Pretrained Networks
  14. [14] Continual Learning Through Synaptic Intelligence - arXiv:1703.04200
  15. [15] Learning without Forgetting - arXiv (2017)
  16. [16] Understanding Catastrophic Forgetting and Remembering in ... - arXiv (2021)
  17. [17] The stability-plasticity dilemma: investigating the continuum from ...
  18. [18] Progressive learning: A deep learning framework for continual learning
  19. [19]
  20. [20] Progressive Neural Networks - arXiv:1606.04671 (2016)
  21. [21] Memory Efficient Continual Learning with Transformers
  22. [22] An Empirical Study of Catastrophic Forgetting in Large Language Models - arXiv:2308.08747 (2023)
  23. [23] New Algorithm Enables Neural Networks to Learn Continuously - Caltech (2024)
  24. [24] Hybrid neural networks for continual learning inspired by ... - Nature (2025)
  25. [25] Explaining Robustness to Catastrophic Forgetting ... - arXiv:2510.23756 (2025)
  26. [26] Humans and neural networks show similar patterns of transfer and ... (2025)
  27. [27] Catastrophic Forgetting in LLMs: A Comparative Analysis Across ... (2025)
  28. [28] Mitigating Catastrophic Forgetting in Large Language Models with ... (Forgetting-Aware Pruning Metric)
  29. [29] Mitigating Catastrophic Forgetting in Large Language Models with ... (Self-Synthesized Rehearsal)
  30. [30] Understanding Catastrophic Forgetting in Continual Learning (2025)
  31. [31] New Study Warns of Catastrophic Overtraining in Large ... - HPCwire (2025)