
Sequence learning

Sequence learning is the ability to acquire knowledge about the regularities or patterns in sequences of information or actions, a capability studied across several disciplines. In machine learning, it involves developing models to process, analyze, and generate data where the order of elements is essential, such as time series, natural language, or biological sequences. Unlike traditional tasks that treat data as independent samples, sequence learning algorithms account for dependencies and temporal structure within the data, enabling tasks such as the prediction, classification, or generation of sequential outputs. This approach has become central to handling real-world data streams where context and progression matter, and it draws on methods ranging from statistical models to advanced neural architectures.

Psychological foundations of sequence learning emerged in the mid-20th century, with early studies on serial order in behavior and on implicit learning. Computational approaches trace back to recurrent neural networks (RNNs), developed in the 1980s and 1990s, which process sequences by maintaining a hidden state that captures information from previous steps. A significant advance came in 2014 with the sequence-to-sequence (seq2seq) framework, which uses encoder-decoder architectures, often built from long short-term memory (LSTM) units, to map input sequences to output sequences, with machine translation as its best-known application. This model scored 34.8 BLEU on the WMT-14 English-to-French translation task, outperforming a strong phrase-based baseline.

Modern sequence learning has shifted toward transformer-based models, which rely on self-attention mechanisms to capture long-range dependencies efficiently without the sequential processing bottleneck of RNNs. Transformers, first detailed in 2017, enable scalable training on massive datasets and have driven breakthroughs in natural language processing, such as large language models like the GPT series. These architectures also apply to sequential decision tasks, including reinforcement learning, where modeling trajectories as sequences can improve sample efficiency and generalization.

Key applications of sequence learning span diverse domains: in natural language processing, it powers translation, summarization, and chatbots; in speech recognition, it transcribes audio sequences; in time series forecasting, it predicts stock prices or weather patterns; and in bioinformatics, it analyzes DNA or protein sequences. Recent innovations, such as integrating sequence models with reinforcement learning, have enhanced personalized recommendations and autonomous systems, demonstrating the field's ongoing evolution toward more adaptive and efficient AI.

Fundamentals

Definition and Scope

Sequence learning refers to the process by which human or artificial systems acquire the ability to predict, generate, or respond to ordered elements, capturing the temporal patterns inherent in sequences such as motor actions. This acquisition involves learning the relationships between successive elements, enabling the system to anticipate future states based on prior context. It underpins both cognitive processes and computational modeling across diverse domains.

The scope of sequence learning is interdisciplinary, bridging the study of human cognition and artificial intelligence: it manifests in human skill acquisition, such as mastering coordinated movements like riding a bicycle, and in artificial systems tackling predictive tasks, like forecasting stock prices from historical trends. Unlike analyses of independent data points, sequence learning prioritizes the ordered nature of inputs, allowing systems to exploit dependencies that arise from temporal progression rather than static associations. This emphasis on order makes it essential for applications involving continuity, from biological processes to engineered predictions.

Central to sequence learning is the concept of sequential dependencies, where the likelihood of an event or output depends on preceding elements in the chain. In basic formulations, this follows the Markov property, which posits that the next state depends solely on the current one and so simplifies the modeling of short-range influences. This contrasts sharply with the handling of non-sequential data, which ignores ordering and treats elements as interchangeable, potentially missing critical contextual cues. Representative examples include everyday human activities such as typing on a keyboard, which requires fluid execution of finger sequences, and computational challenges like sequence classification, formalized as learning a function f(x_1, x_2, \dots, x_n) \to y where the output y hinges on the specific sequence order.
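The first-order dependency described above, in which the next element depends only on the current one, can be illustrated by estimating transition probabilities from counts. This is a minimal sketch rather than a reference implementation: the function names `fit_markov_chain` and `predict_next` and the toy key-press sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_markov_chain(sequence):
    """Estimate first-order transition probabilities P(next | current) by counting."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


def predict_next(model, current):
    """Return the most probable successor of `current`, or None if it was never followed."""
    followers = model.get(current)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical toy sequence of key presses (illustrative data only).
keys = list("abcabcabd")
model = fit_markov_chain(keys)
prediction = predict_next(model, "b")  # "b" was followed by "c" twice and "d" once
```

Because the model conditions only on the current element, it cannot tell apart contexts that share the same last symbol, which is the short-range limitation the Markov assumption trades for simplicity.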

Cognitive and Computational Perspectives

From a cognitive perspective, sequence learning is integral to procedural memory, enabling the automation of skills such as playing an instrument through repeated exposure without reliance on declarative recall. This form of learning supports the gradual shift from effortful control to fluent execution, where sequences become habitual and less demanding on cognitive resources over time. Evidence from developmental studies highlights its early emergence; for instance, 8-month-old infants can detect statistical regularities in auditory sequences after brief exposure, segmenting continuous input into predictable units akin to words. Neuropsychological research further underscores its implicit nature: patients with amnesia, who struggle to form new explicit memories, nevertheless exhibit normal sequence learning in tasks like the serial reaction time task (SRTT), where reaction times decrease for repeated patterns without conscious awareness of the structure.

In contrast, computational perspectives on sequence learning prioritize efficiency at scale, focusing on methods that process variable-length inputs such as those encountered in natural language processing or time-series analysis. These approaches enable systems to model temporal dependencies across extensive corpora, optimizing for accuracy in applications where sequences of arbitrary length must be parsed and generated coherently.

Key differences between human and machine sequence learning lie in their mechanisms and strengths. Humans excel at implicit, context-adaptive acquisition that integrates sensory-motor information with environmental cues for flexible adjustments, often without explicit rules. Machines, by contrast, excel at scalable, explicit extraction of patterns from massive datasets, leveraging optimization techniques to achieve high performance on structured tasks while struggling with the nuanced, one-shot adaptability seen in biological systems.

Interdisciplinary connections bridge these domains, with cognitive insights informing AI design; for example, the human tendency to chunk motor sequences into hierarchical units for efficient recall has inspired computational models that build layered representations, enhancing learning of complex, long-horizon tasks in reinforcement learning frameworks.

Historical Development

Early Psychological Foundations

In the early 20th century, behaviorist psychologists such as Margaret Floy Washburn and John B. Watson conceptualized sequence learning primarily through the lens of reflex chains and habit formation, viewing complex behaviors as linked series of stimulus-response associations. Washburn, in her motor theory, argued that mental processes, including sequential actions, arise from partial or inhibited movements organized into chains, allowing for the anticipation and execution of ordered responses without invoking internal mental states. Watson extended this by emphasizing that habits form through the repeated conditioning of reflexes, breaking down behaviors like walking or speaking into sequential responses triggered by environmental stimuli, thereby explaining learning as the strengthening of these chains over time. This approach dominated psychological thought for decades, prioritizing observable behaviors and dismissing cognitive mediation.

By the mid-20th century, cognitive perspectives began challenging these serial chaining models, with Karl Lashley's critique marking a pivotal shift toward hierarchical and plan-based accounts of sequence learning. Lashley argued that simple reflex chains could not account for the flexibility and rapidity observed in skilled sequential actions, such as piano playing, where errors in one element do not disrupt the overall order. Instead, he proposed that sequences are governed by higher-level cognitive plans that organize actions hierarchically, allowing for anticipation and correction. Evidence for this came from observations of rapid motor sequences, where performers maintain timing despite interruptions, suggesting pre-planned structures rather than linear chaining. Lashley also introduced the concept of chunking, where individual actions are grouped into larger units to facilitate learning and execution, as seen in the fluent segmentation of words in speech or phrases in typing.

Key experiments from this era illuminated how sequence order influences learning efficiency, demonstrating steeper learning curves for logically structured progressions. In their pioneering studies on telegraph operators, William L. Bryan and Noble Harter (1899) tracked performance across hierarchical levels, from individual letters to words and phrases, revealing that acquisition accelerates when sequences follow meaningful or logical orders, such as common word patterns, compared to random arrangements. Participants showed plateaus in learning at each level, resolved only by advancing to chunked higher-order units, underscoring the role of sequence structure in overcoming initial difficulties and achieving fluent performance. These findings highlighted that illogical or arbitrary orders prolong learning, while progressive structures enable faster acquisition and reduced error rates.

Foundational to these developments was the emerging distinction between conscious planning and automatic execution in sequence learning, which laid the groundwork for later memory system theories. Lashley's analysis differentiated deliberate, schema-like planning for novel sequences from the automatic, ballistic execution of practiced ones, where conscious control fades as habits consolidate. This prefigured the formal distinction drawn by Cohen and Squire in 1980 between procedural memory, for implicit, skill-based sequences like riding a bicycle, and declarative memory, for explicit, fact-based knowledge that can be verbally described; they drew on mid-century behavioral evidence to separate automatic habit formation from conscious recall.

Emergence in Artificial Intelligence

Early neural networks for sequential information extended foundational models like Frank Rosenblatt's perceptrons, originally developed for pattern recognition in the 1950s and 1960s, to accommodate temporal dependencies, enabling networks to handle inputs that unfold over time rather than as static patterns. This integration marked an initial shift toward computational systems that could mimic aspects of human-like sequence processing, drawing parallels to psychological notions of chunking without delving into behavioral experiments.

A pivotal milestone came with the development of recurrent neural networks (RNNs) in the mid-1980s, particularly through the work of David Rumelhart, Geoffrey Hinton, and Ronald Williams, who adapted backpropagation to train networks on sequential data, allowing them to capture temporal dependencies via recurrent connections. By the 1990s, these architectures gained traction, with Jeff Elman's 1990 introduction of the simple recurrent network (SRN) showcasing emergent capabilities in discovering grammatical and structural patterns in sequences, such as predicting the next element in a stream of inputs. Concurrently, hidden Markov models (HMMs), formalized in the 1970s but prominently applied in the 1980s, revolutionized sequence modeling in fields like speech recognition by representing observable sequences as probabilistic emissions from hidden states, as detailed in Lawrence Rabiner's influential 1989 tutorial. Influential studies further bridged cognitive and computational perspectives, such as Clegg, DiGirolamo, and Keele's 1998 review, which examined implicit sequence learning mechanisms in machine models that paralleled human cognitive processes, emphasizing how algorithms could acquire sequential knowledge without explicit rules.

These developments moved sequence learning from theoretical cognitive inspiration to practical tools, with early applications emerging in time-series prediction, where RNNs forecast stock prices or weather patterns from historical data, and in language processing, such as Elman's demonstrations of word sequence prediction in simple sentences. This era laid the groundwork for efficient computational handling of temporal data, linking psychological foundations to scalable methodologies.

Types of Sequence Learning

Implicit and Explicit Learning

Implicit learning refers to the unconscious acquisition of sequential patterns, such as statistical regularities in visual or motor stimuli, without the learner's awareness or intent to learn. In this process, individuals detect and internalize underlying structures in sequences through exposure alone, leading to improved performance on related tasks. A classic demonstration comes from the serial reaction time task (SRTT), where participants respond faster to repeating spatial patterns, indicating sequence acquisition, yet report no conscious knowledge of the pattern. In contrast, explicit learning involves conscious, strategy-based mastery of sequences, such as deliberately memorizing the steps of a routine, which allows for verbal description and intentional recall. This form of learning is typically slower and more effortful than implicit learning but enables greater flexibility in applying rules across contexts.

Mechanistically, implicit learning of motor sequences relies heavily on the basal ganglia, a group of subcortical structures that facilitate habit formation without conscious mediation. Neuroimaging and lesion studies support this, showing basal ganglia activation during implicit sequence tasks and deficits when the region is compromised. Debates persist on whether implicit learning captures abstract rules or is limited to item-specific associations; evidence suggests it can extract higher-order dependencies, though the extent varies with task complexity.

Pioneering work by Arthur Reber in the 1960s established implicit learning through artificial grammar paradigms, where participants classified novel letter strings conforming to hidden rules at above-chance levels without being able to articulate the rules. Patient studies further illuminate the distinction: individuals with disorders involving basal ganglia dysfunction exhibit impaired implicit sequence learning in SRTT variants while preserving explicit strategies, underscoring the region's selective role.

Supervised, Unsupervised, and Reinforcement-Based Learning

Sequence learning in machine learning encompasses three primary paradigms: supervised, unsupervised, and reinforcement-based approaches, each tailored to handle sequential data through distinct training mechanisms and objectives. These paradigms enable systems to process and generate ordered data, such as time series or text, by leveraging different forms of feedback during training.

In supervised sequence learning, models are trained on labeled datasets consisting of input-output sequence pairs, where the goal is to learn mappings that predict subsequent elements or entire output sequences from observed inputs. For instance, this approach is used to predict the next word in a sentence given the preceding words as input, requiring annotated corpora to capture contextual dependencies. The reliance on explicit labels allows for precise prediction but demands substantial human-annotated data, which can be resource-intensive for long sequences.

Unsupervised sequence learning, by contrast, operates without labeled outputs, focusing instead on discovering inherent patterns and structures within unlabeled sequential data through statistical properties like similarity or repetition. A representative application involves clustering sequential data to detect anomalies, where the model identifies deviations based on sequence statistics rather than predefined categories. This paradigm excels in exploratory tasks, enabling pattern discovery in vast, unannotated datasets, though it may yield less interpretable results than supervised methods.

Reinforcement-based sequence learning involves agents interacting with an environment to learn optimal sequences through trial and error, guided by delayed rewards that reflect long-term outcomes rather than immediate feedback. For example, in sequential decision tasks like games, agents refine policies to maximize cumulative rewards over extended episodes, incorporating feedback from environmental responses. This approach is particularly suited to dynamic settings, but it requires careful design to handle sparse rewards and temporal dependencies in sequences.

The key distinctions among these paradigms lie in their feedback mechanisms and objectives: supervised learning emphasizes accurate prediction from labeled pairs for precise mapping; unsupervised learning prioritizes pattern discovery from unlabeled data for exploratory insights; and reinforcement-based learning focuses on sequential decision optimization via reward signals, making it ideal for interactive, goal-oriented tasks with prolonged horizons.

Models and Algorithms

Statistical and Probabilistic Models

Statistical and probabilistic models form a cornerstone of sequence learning, providing mathematical frameworks that capture dependencies in sequential data through explicit probability distributions and assumptions about underlying structure. These approaches model sequences as realizations of stochastic processes, where observations are generated according to probabilistic rules, enabling inference about hidden patterns or future elements. Unlike data-driven neural methods, they rely on tractable computations and interpretable parameters, making them suitable for scenarios with well-defined generative assumptions.

Hidden Markov Models (HMMs) exemplify this paradigm, representing sequences as arising from a Markov chain of unobserved (hidden) states, each emitting observable symbols probabilistically. The model assumes the hidden state s_t at time t depends only on the previous state s_{t-1}, governed by transition probabilities P(s_t \mid s_{t-1}), while the observation o_t depends solely on the current state via emission probabilities P(o_t \mid s_t). This first-order Markov property facilitates efficient inference; for instance, the Viterbi algorithm employs dynamic programming to find the most likely state sequence given the observations, maximizing the joint probability \arg\max_{s} P(s \mid o) = \arg\max_{s} P(o \mid s) P(s) by recursively computing path probabilities. HMMs originated in the work of Baum and Petrie and were popularized through applications in speech recognition.

Autoregressive models extend this by directly predicting each sequence element from prior ones, assuming a linear dependence structure. In time series contexts, the autoregressive integrated moving average (ARIMA) model captures non-stationary sequences through differencing to achieve stationarity, followed by autoregressive terms for past values and moving average terms for past errors. The general form is y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q} + \epsilon_t, where y_t is the differenced series, \phi and \theta are parameters, and \epsilon_t is white noise. Developed by Box and Jenkins, ARIMA excels in forecasting univariate sequences with short-term correlations.

Bayesian approaches enhance these models by incorporating prior distributions to handle uncertainty and by enabling nonparametric extensions with flexible model size. For instance, the Hierarchical Dirichlet Process (HDP) prior in infinite HMMs allows an unbounded number of states, sharing transition and emission distributions across sequences via a Dirichlet process with concentration parameter \alpha and base measure G_0, coupled through a top-level process with concentration parameter \gamma. This facilitates posterior inference over variable-length sequences using techniques like Markov chain Monte Carlo sampling. Such methods, introduced by Teh et al., address the limitations of fixed-state models in discovering latent structure.

These models offer computational efficiency for sequences with local dependencies, as their probabilistic formulations support exact or approximate inference via algorithms like forward-backward for HMMs. However, they struggle with long-range dependencies because of assumptions like the Markov property: encoding longer histories requires a state space that grows exponentially, and path probabilities shrink toward zero over extended horizons. In bioinformatics, HMMs have proven impactful for tasks such as gene sequence alignment, where profile HMMs model conserved motifs in protein families to align and annotate sequences with high accuracy.
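The dynamic-programming search for the most likely hidden-state path in an HMM, known as the Viterbi algorithm, can be sketched in a few lines of NumPy. This is a minimal illustration, assuming discrete observations and dense probability tables; the function name `viterbi`, its argument names, and the two-state toy model (states 0 and 1, three observation symbols) are hypothetical and not taken from any particular library. Log-probabilities are used to avoid the underflow that long products of small numbers would cause.

```python
import numpy as np


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path and its log-probability.

    obs      : sequence of observation indices, length T
    start_p  : (N,)   initial state probabilities
    trans_p  : (N, N) trans_p[i, j] = P(s_t = j | s_{t-1} = i)
    emit_p   : (N, M) emit_p[i, k]  = P(o_t = k | s_t = i)
    """
    start_p, trans_p, emit_p = (
        np.asarray(a, dtype=float) for a in (start_p, trans_p, emit_p)
    )
    n_steps, n_states = len(obs), len(start_p)
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        log_start, log_trans, log_emit = (
            np.log(start_p), np.log(trans_p), np.log(emit_p)
        )

    score = np.empty((n_steps, n_states))           # best log-prob of a path ending in each state
    back = np.zeros((n_steps, n_states), dtype=int) # best predecessor for each state
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, n_steps):
        cand = score[t - 1][:, None] + log_trans    # cand[i, j]: come from state i, move to j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]

    # Trace the best path backwards from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score[-1].max())


# Hypothetical toy model: two hidden states, three observation symbols.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
best_path, best_logp = viterbi([0, 1, 2], start, trans, emit)
```

For this toy model the recursion selects the path 0, 0, 1 with probability 0.3 × 0.7 × 0.4 × 0.3 × 0.6 = 0.01512. The cost is O(T·N²), which is why the Markov assumption makes exact decoding tractable.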

Neural Network Architectures

Neural network architectures have become central to computational approaches in sequence learning, enabling the modeling of temporal dependencies through dynamic processing of sequential data. Unlike static models, these architectures incorporate mechanisms to handle variable-length inputs and capture long-range interactions, making them suitable for tasks such as language modeling and time-series prediction.

Recurrent neural networks (RNNs) form the foundational class, where hidden states are updated iteratively to maintain memory across time steps, allowing the network to process sequences of arbitrary length by reusing the same weights in a looped structure. Standard RNNs, however, suffer from the vanishing gradient problem during training, which hinders learning over long sequences as gradients diminish exponentially through backpropagation through time. To address this, long short-term memory (LSTM) units were introduced, featuring specialized gates that regulate information flow and mitigate gradient issues. An LSTM cell at time step t computes the forget gate as f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), where \sigma is the sigmoid function, h_{t-1} is the previous hidden state, x_t is the current input, and W_f, b_f are learnable parameters; similarly, the input gate i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) and output gate o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) control what new information is added and what is output, respectively, enabling persistent memory retention.

As a computationally lighter alternative to LSTMs, Gated Recurrent Units (GRUs) simplify the gating mechanism while preserving much of the performance, using only an update gate z_t = \sigma(W_z [h_{t-1}, x_t]) to determine how much of the previous state to carry over and a reset gate r_t = \sigma(W_r [h_{t-1}, x_t]) to decide the extent to which the previous state influences the candidate activation. This design reduces the number of parameters compared to LSTMs and dispenses with a separate cell state, facilitating faster training.
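The LSTM gate equations above define f_t, i_t, and o_t but not how the gates combine into a state update. The following is a minimal NumPy sketch of one time step, assuming the commonly used formulation in which a tanh candidate is blended into the cell state; the candidate parameters `W_c` and `b_c`, the function name `lstm_step`, the `params` dictionary layout, and the random toy inputs are assumptions introduced for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step using the gate equations from the text."""
    concat = np.concatenate([h_prev, x_t])                  # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])   # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])   # input gate
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])   # output gate
    # Assumed standard candidate and state updates (not spelled out in the text).
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])
    c_t = f_t * c_prev + i_t * c_tilde                      # keep old memory, add new
    h_t = o_t * np.tanh(c_t)                                # expose part of the memory
    return h_t, c_t


# Toy dimensions and randomly initialised parameters (illustrative only).
hidden, inputs = 3, 2
rng = np.random.default_rng(0)
params = {f"W_{g}": rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for g in "fioc"}
params.update({f"b_{g}": np.zeros(hidden) for g in "fioc"})

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):   # a sequence of five input vectors
    h, c = lstm_step(x_t, h, c, params)
```

The additive form of the cell update, c_t = f_t · c_{t-1} + i_t · c̃_t, is what lets gradients pass across many steps when the forget gate stays close to one.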
GRUs have shown efficacy comparable to LSTMs on sequence tasks, often with fewer resources.

Transformers represent a paradigm shift by eschewing recurrence entirely in favor of attention mechanisms, allowing parallel computation across the entire sequence for efficient handling of long dependencies. The core self-attention operation is defined as \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V, where Q, K, and V are query, key, and value projections of the input, and d_k is the dimension of the keys, enabling the model to weigh the relevance of different positions dynamically. Stacked encoder-decoder layers in transformers process sequences bidirectionally in the encoder, outperforming RNNs on benchmarks like WMT 2014 English-to-German translation with a BLEU score of 28.4, compared to 26.3 for prior RNN ensembles.

Training these architectures involves adaptations of gradient-based optimization tailored to sequential data. For RNNs and their variants like LSTMs and GRUs, backpropagation through time (BPTT) unfolds the network across time steps, computing gradients by propagating errors backward from the output sequence to the initial inputs, though truncated versions limit unrolling to contain computational cost and instability. Transformers, being non-recurrent, use standard backpropagation but benefit from pre-training strategies, such as masked language modeling in BERT, where the model learns bidirectional representations on large corpora before fine-tuning, yielding state-of-the-art results at the time on GLUE tasks with an average score of 80.5.
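The scaled dot-product attention formula given in this section can be sketched directly in NumPy. This is a single-head illustration without masking or batching; the function names, the projection matrices `W_q`, `W_k`, `W_v`, and the random toy input are hypothetical and serve only to show the shapes involved.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return the weights as well."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights


# Toy input: 4 sequence positions with 8 features each (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
```

Every position attends to every other position in a single matrix product, which is what removes the step-by-step dependency of recurrent models; the price is a score matrix whose size grows quadratically with sequence length.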

Applications and Challenges

Sequence Prediction and Generation

Sequence prediction involves forecasting future elements in a sequence based on historical data, a core application in time series analysis. Long short-term memory (LSTM) networks, a type of recurrent neural network, have been widely applied to predict weather patterns by modeling temporal dependencies in meteorological data. For instance, LSTM models achieve improved accuracy in short-term forecasting compared to traditional methods, with error reductions observed in datasets from urban weather stations. Similarly, hybrid models combining autoregressive integrated moving average (ARIMA) with LSTM enhance stock price predictions by capturing both linear trends and non-linear patterns, demonstrating lower forecasting errors on financial time series such as closing prices of major indices.

Sequence generation focuses on producing new, coherent sequences that mimic learned patterns, often in creative or synthetic domains. Generative pre-trained transformer (GPT) models excel in text generation by autoregressively predicting subsequent tokens, enabling applications like story completion or dialogue simulation with high fluency. In music composition, Transformer-based architectures generate expressive piano sequences by attending to long-range dependencies in symbolic representations, producing minute-long pieces that align with stylistic constraints. These methods leverage self-attention mechanisms to maintain structural coherence across extended outputs.

Key challenges in sequence prediction and generation include handling the inherent uncertainty in sequential data, which probabilistic outputs address by providing confidence intervals or distributions over predictions rather than point estimates. For example, Bayesian extensions to neural models output probability distributions to quantify prediction variability in time series. Evaluation relies on domain-specific metrics: perplexity measures the model's surprise at test sequences in language tasks, with lower values indicating better predictive fit, while mean squared error quantifies deviation in continuous forecasts like weather or stock values.
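The two evaluation metrics named above can be computed in a few lines. This sketch assumes perplexity is defined as the exponential of the mean negative log-probability the model assigned to each observed token (natural logarithm); the function names and the toy numbers are hypothetical.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def mean_squared_error(y_true, y_pred):
    """Average squared deviation between forecasts and observed values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among four options.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# Toy forecast with a single error of 2 units on the last step.
forecast_mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Here `uniform_ppl` evaluates to 4 (up to floating-point rounding) and `forecast_mse` to 4/3, which illustrates why perplexity is often read as an effective branching factor.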
In natural language processing, Transformer architectures power machine translation by predicting target sequences from source inputs, achieving state-of-the-art BLEU scores on benchmarks like WMT through parallelized attention. In bioinformatics, AlphaFold 2 employs sequence modules to predict protein structures from amino acid sequences, reaching a median backbone accuracy of 0.96 Å (r.m.s.d.95) on the CASP14 assessment. Subsequent releases, such as AlphaFold 3 in 2024, further enhance predictions for protein complexes with small molecules and other biomolecules. These applications highlight sequence learning's role in automating complex pattern extrapolation across disciplines.

Sequential Decision Making

Sequential decision making in sequence learning involves agents that select actions over time to maximize cumulative rewards, often modeled within reinforcement learning frameworks where sequences represent states, actions, and transitions. A foundational structure for this is the Markov decision process (MDP), which formalizes sequential actions in stochastic environments with states S, actions A, transition probabilities P(s'|s,a), rewards R(s,a,s'), and a discount factor \gamma. In MDPs, an optimal policy \pi^*(s) is obtained by solving the Bellman equation for the value function, enabling agents to evaluate the long-term consequences of action sequences. Q-learning, a model-free algorithm for MDPs, iteratively updates action-value estimates using the temporal-difference rule Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right], where \alpha is the learning rate and r is the immediate reward, allowing convergence to optimal Q-values under certain conditions.

Applications of sequential decision making span diverse domains, leveraging MDPs and reinforcement learning for adaptive action sequences. In robotics, reinforcement learning optimizes path planning by training agents to navigate dynamic environments, such as avoiding obstacles in real time while minimizing energy use, as demonstrated in deep reinforcement learning approaches for mobile robots. In games, AlphaGo employs policy and value networks within a reinforcement learning setup to select move sequences, achieving superhuman performance by simulating millions of future board states through Monte Carlo tree search integrated with neural networks. In healthcare, sequential decision making supports treatment protocols, where reinforcement learning models patient trajectories to personalize interventions such as optimizing antipsychotic treatments for schizophrenia, balancing short-term risks with long-term outcomes.

Key challenges in sequential decision making include credit assignment, where agents struggle to attribute rewards to specific actions in long sequences due to delayed feedback, complicating learning in high-dimensional state spaces. The exploration-exploitation trade-off further hinders progress, as agents must balance trying novel actions to discover better policies against leveraging known strategies for immediate gains, often addressed through epsilon-greedy or entropy-regularized methods. To mitigate these, hierarchical methods like the options framework decompose sequences into temporally abstract sub-policies, or "options," each defined by an initiation set, policy, and termination function, enabling reusable chunks inspired by cognitive processes for scalable learning in complex tasks.
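The Q-learning update and the epsilon-greedy strategy mentioned in this section can be sketched on a tiny deterministic environment. Everything about the environment is a hypothetical illustration: a five-state corridor in which the agent starts at state 0, moves left or right, and receives a reward of 1 only on reaching the rightmost state. The hyperparameter values are arbitrary choices for the example.

```python
import random

N_STATES = 5
ACTIONS = (0, 1)          # 0 = move left, 1 = move right
GOAL = N_STATES - 1


def step(state, action):
    """Deterministic corridor: reward 1.0 only when the goal state is reached."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy choice; ties are also broken randomly so the
            # agent does not get stuck while all estimates are still zero.
            if rng.random() < epsilon or Q[s][0] == Q[s][1]:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[s][act])
            s2, r, done = step(s, a)
            # Temporal-difference target: r + gamma * max_a' Q(s', a'),
            # with no bootstrap term once the episode has ended.
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


Q = train()
policy = [max(ACTIONS, key=lambda act: Q[s][act]) for s in range(GOAL)]
```

After training, the greedy policy moves right in every non-terminal state, and the learned values decay by roughly a factor of gamma per step away from the goal, which shows how the delayed reward is propagated backward through the sequence of states.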

Current Research Directions

Advances in Cognitive Neuroscience

Recent neuroimaging studies using functional magnetic resonance imaging (fMRI) have elucidated distinct neural substrates for explicit and implicit sequence learning in humans. Explicit sequence learning, which involves conscious awareness of patterns, engages the prefrontal cortex to support rule-based processing and integration. In contrast, implicit sequence learning relies more heavily on the basal ganglia, including the striatum, for habitual and automatic pattern detection without explicit knowledge. For instance, a 2018 study demonstrated that activity in prefrontal and striatal regions correlates inversely with sequence entropy, highlighting their roles in reducing uncertainty during learning. Additionally, post-2015 research has revealed sequence replay during rest (including NREM sleep), where hippocampal and cortical networks reactivate learned motor sequences to consolidate memories, as observed in human intracortical recordings showing temporal compression of replay events akin to forward prediction.

Experiments in the 2020s have further explored predictive mechanisms in motor sequence learning, where the brain anticipates sensory outcomes to minimize prediction errors. Predictive processing frameworks, supported by fMRI and other recordings, indicate that regions including the parietal cortex update internal models of expected sequences during motor tasks, enhancing efficiency in response to violations of learned patterns. Moreover, sequence complexity modulates hippocampal engagement; higher-order or hierarchical sequences recruit the hippocampus more robustly for binding temporal elements, as evidenced in a 2021 intracranial recording study where hippocampal oscillations strengthened with increasing sequence depth. These findings underscore the hippocampus's role in constructing predictive representations beyond simple repetition.

Key discoveries highlight dopamine's involvement in reward-modulated sequence acquisition, where dopamine signals reinforce temporal predictions tied to outcomes. In human studies, elevated dopamine signaling facilitates faster learning of rewarded sequences by amplifying prediction error signals, particularly in probabilistic environments. Disruptions in this system appear in neurodevelopmental disorders; individuals with autism spectrum disorder exhibit impaired temporal sequencing, with reduced striatal and prefrontal activation during implicit motor tasks, leading to deficits in anticipating social or action sequences. Studies have also linked autism to challenges in processing surprising events within sequences.

Theoretical advancements integrate Bayesian inference into cognitive models of sequence expectations, positing that the brain performs probabilistic updates to form prior beliefs about upcoming elements. This approach, informed by predictive processing theories, models hippocampal and cortical circuits as Bayesian filters that weigh sensory evidence against learned priors to generate expectations. A 2019 study applied Bayesian modeling to multiscale sequence learning under memory load, revealing signatures of hierarchical belief updating in fronto-temporal regions. Such models bridge empirical data on replay and prediction errors, offering a unified framework for how humans adapt to sequential uncertainties. In 2025, research advanced the understanding of the neural dynamics involved in applying learned rules to sequences, revealing sequential neuronal activity patterns during behavior.

Innovations in Machine Learning

The introduction of the Transformer architecture in 2017 marked a pivotal shift in sequence learning by replacing recurrent mechanisms with self-attention, enabling parallel processing of sequences and achieving superior performance on tasks like machine translation. This addressed limitations in handling long-range dependencies and scaled effectively to massive datasets, leading to large language models (LLMs) such as GPT-4 in 2023, which perform zero-shot sequence tasks through in-context learning without task-specific fine-tuning. Transformers' ability to model sequences autoregressively has powered advances in text generation, where models produce coherent text by predicting each subsequent token from the prior context.

Building on this foundation, recent innovations have expanded sequence generation beyond text. Diffusion models, which generate data by iteratively denoising it, have been adapted for audio synthesis, while AudioGen (2022) takes an autoregressive approach, generating audio conditioned on textual descriptions over a discrete audio representation to produce sounds such as environmental noises. Complementing these, state-space models (SSMs) offer efficient alternatives to Transformers for long sequences. Mamba, introduced in 2023, employs selective SSMs to achieve linear scaling in sequence length and fast inference, reported at about 5× the throughput of Transformers; it outperforms Transformers of the same size on language modeling, and its performance on real data improves up to million-length sequences. These models mitigate the quadratic complexity of attention mechanisms, enabling practical applications such as time-series forecasting.

Despite these advances, interpretability remains a core challenge: the opaque decision-making of Transformer-based models obscures how sequences are learned and predicted, complicating debugging and trust in high-stakes domains like healthcare. Ethical concerns also intensify with generative AI, where biases in training data propagate through sequential predictions and can yield discriminatory outputs, such as stereotypical narrative completions in LLMs that reinforce societal inequities. Mitigation strategies, including debiasing, are essential to ensure fairness in sequence generation tasks.

Emerging directions integrate multimodal sequences, extending CLIP's contrastive learning to video-text pairs via models like Vita-CLIP (2023), which uses multimodal prompt learning to align video frames with descriptive text while balancing supervised and zero-shot performance. Additionally, quantum-inspired approaches are being applied to sequence learning: tensor network models, drawing on quantum principles, process sequences efficiently and are motivated by their correspondence to probabilistic graphical models and by interpretability, and a quantum-inspired reformulation of protein sequence design has been reported to outperform conventional sequence optimization even on classical machines. These developments may allow more efficient optimization of sequence models without requiring quantum hardware. As of 2025, innovations include nested learning paradigms for continual sequence adaptation, which treat models as nested optimization problems to improve efficiency in dynamic environments.
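
The contrast this section draws between self-attention and state-space recurrences can be made concrete with a minimal sketch. Both functions below are simplified illustrations under stated assumptions, not the Transformer or Mamba implementations: the attention function uses the inputs directly as queries, keys, and values (no learned projections, a single head), and the state-space function is a scalar linear recurrence with fixed, hypothetical parameters `a`, `b`, and `c`, whereas selective SSMs make such parameters depend on the input.

```python
import math


def causal_self_attention(x):
    """Single-head scaled dot-product self-attention with a causal mask.

    `x` is a list of n equal-length vectors. Queries, keys, and values are
    the inputs themselves (identity projections, an assumption for brevity).
    Work grows as O(n^2): each position scores itself and all earlier ones.
    """
    d = len(x[0])
    outputs = []
    for t, query in enumerate(x):
        # Causal mask: only positions 0..t are visible to position t.
        scores = [sum(q * k for q, k in zip(query, x[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(score - peak) for score in scores]
        norm = sum(weights)
        outputs.append([sum(w / norm * x[s][j] for s, w in enumerate(weights))
                        for j in range(d)])
    return outputs


def linear_state_space_scan(u, a=0.9, b=1.0, c=1.0):
    """Scalar linear recurrence h_t = a*h_{t-1} + b*u_t with output y_t = c*h_t.

    Runs in O(n) time with constant-size state. The parameters are fixed
    here; input-dependent (selective) parameters are deliberately omitted.
    """
    h, ys = 0.0, []
    for u_t in u:
        h = a * h + b * u_t
        ys.append(c * h)
    return ys


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(tokens)
scanned = linear_state_space_scan([1.0, 0.0, 0.0])
print(attended)
print(scanned)
```

The attention loop revisits every earlier position at each step, which is the source of the quadratic cost, while the recurrence carries a single running state forward, so an impulse at the first step simply decays by the factor `a` at each later step.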

References

  1. [1]
    Sequence Learning and NLP with Neural Networks
    Sequence learning refers to a variety of related tasks that neural nets can be trained to perform. What all these tasks have in common is that the input to the ...
  2. [2]
    Deep Learning in a Nutshell: Sequence Learning - NVIDIA Developer
    Mar 7, 2016 · In this post, we'll look at sequence learning with a focus on natural language processing. Part 4 of the series covers reinforcement learning.
  3. [3]
    Sequential Learning - an overview | ScienceDirect Topics
    Sequential learning refers to the process of training and evaluating models that can learn and make decisions in a sequential manner, such as model-based ...
  4. [4]
    Machine Learning for Sequential Data: A Review - ACM Digital Library
    This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for ...
  5. [5]
    None
    ### Abstract Summary
  6. [6]
    Large Sequence Models for Sequential Decision-Making: A Survey
    Jun 24, 2023 · This survey presents a comprehensive overview of recent works aimed at solving sequential decision-making tasks with sequence models such as the Transformer.
  7. [7]
    A Survey and Formal Analyses on Sequence Learning ... - IEEE Xplore
    This paper presents a literature survey and analysis on a variety of neural networks towards sequence learning. The conceptual models, methodologies, ...
  8. [8]
    Sequence Learning
    ### Summary of Sequence Learning Definition and Scope
  9. [9]
    Temporal-Sequential Learning With a Brain-Inspired Spiking Neural ...
    Jul 1, 2020 · Sequence learning is a fundamental cognitive function of the brain. However, the ways in which sequential information is represented and ...
  10. [10]
    The significance of brain oscillations in motor sequence learning
    Complex movements such as riding a bike or playing a musical instrument are composed of sequences of single mostly simple movements. Therefore, our capacity ...
  11. [11]
    Data-driven stock forecasting models based on neural networks
    This paper comprehensively reviews the literature on data-driven neural networks in the field of stock forecasting from 2015 to 2023.
  12. [12]
    [PDF] Markovian Models for Sequential Data Yoshua Bengio > Dept ...
    Hidden Markov Models (HMMs) are statistical models of sequential data that have been used successfully in many applications in artificial intelligence, pattern ...
  13. [13]
    Motor sequence learning - Scholarpedia
    May 24, 2018 · Motor sequence learning broadly refers to the process by which a sequence of movements comes to be performed faster and more accurately than before.
  14. [14]
    Encapsulation of Implicit and Explicit Memory in Sequence Learning
    Mar 1, 1998 · In our study, amnesic patients were given extensive SRT training. Their implicit and explicit test performance was compared to the performance ...
  15. [15]
    [PDF] Learning Structure from the Ground up—Hierarchical ...
    Chunking as a mechanism is a basis for humans to identify patterns as objects, assigning labels to them to facilitate memory compression [8, 9], sequence ...
  16. [16]
    Psychology as the Behaviorist Views it. John B. Watson (1913).
    It has been shown that improvement in habit comes unconsciously. The first we know of it is when it is achieved -- when it becomes an object. I believe that ' ...
  17. [17]
    [PDF] The Problem of Serial Order in Behavior - Language Log
    Jan 6, 2017 · Lashley's analysis lies in the fact that it exhibits the significant factors involved in the expression of ideas as well as in other instances ...
  18. [18]
    The problem of serial order in behavior: Lashley's legacy
    In 1951, Karl Lashley, a neurophysiologist at Harvard University, published a paper that has become a classic: “The Problem of Serial Order in Behavior.
  19. [19]
    Hierarchical processing in music, language, and action: Lashley ...
    Apr 2, 2014 · Karl Lashley suggested that complex action sequences, from simple motor acts to language and music, are a fundamental but neglected aspect of neural function.
  20. [20]
    Hierarchical processing in music, language, and action: Lashley ...
    Our study reveals how musical training refines the hierarchical neural processing of music and provides a neuro-computational account of this remarkable ...
  21. [21]
    Finding Structure in Time - Elman - 1990 - Cognitive Science
    Encoding sequential structure in simple recurrent networks (CMU Tech. Rep ... A learning algorithm for continually running fully recurrent neural networks (Tech.
  22. [22]
    Sequence learning - ScienceDirect.com
    Sequence learning has provided a natural domain for investigating the computations and neural structures involved in skill acquisition. As we have already ...
  23. [23]
    Implicit learning of artificial grammars - ScienceDirect.com
    An artificial grammar was used to generate the stimuli. Experiment I showed that Ss learned to become increasingly sensitive to the grammatical structure of the ...
  24. [24]
    Attentional requirements of learning: Evidence from performance ...
    Peter Bullemer. https://doi.org/10.1016 ... Journal of Experimental Psychology: Learning, Memory, and Cognition, 10 ...
  25. [25]
    A Neostriatal Habit Learning System in Humans - Science
    In contrast, patients with Parkinson's disease failed to learn the probabilistic classification task, despite having intact memory for the training episode.
  26. [26]
    Sequence learning in the human brain: A functional ...
    Feb 15, 2020 · The study provides solid evidence that, at least as tested with the visuo-motor SRT task, sequence learning in humans relies on the basal ganglia.
  27. [27]
    Association and Abstraction in Sequential Learning:“What is ...
    Aug 6, 2025 · The first part of the article addresses the major questions and challenges that underlie the debate on implicit and explicit learning. In ...
  28. [28]
    [PDF] Machine Learning for Sequential Data: A Review
    This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for ...
  29. [29]
    Machine Learning: Algorithms, Real-World Applications and ... - NIH
    This study's key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application ...
  30. [30]
    Supervised Machine Learning - DataCamp
    Aug 22, 2022 · Supervised machine learning learns patterns and relationships between input and output data. It is defined by its use of labeled data. A labeled ...
  31. [31]
    [PDF] Reinforcement Learning: An Introduction - Stanford University
    The reinforcement learning agent and its environment interact over a sequence of discrete time steps. The specification of their interface defines a ...
  32. [32]
    [PDF] Deep Reinforcement Learning for Sequence-to-Sequence Models
    We intend for this paper to provide a broad overview on the strength and complexity of combining seq2seq training with RL training and to guide researchers in ...
  33. [33]
    [PDF] A Tutorial on Hidden Markov Models and Selected Applications in ...
    This tutorial is intended to provide an overview of the basic theory of HMMs (as originated by Baum and his colleagues), provide practical details on methods of.
  34. [34]
    The viterbi algorithm | IEEE Journals & Magazine
    This paper gives a tutorial exposition of the algorithm and of how it is implemented and analyzed. Applications to date are reviewed. Increasing use of the ...
  35. [35]
    [PDF] The Infinite Hidden Markov Model - MLG Cambridge
    We have shown how a two-level Hierarchical Dirichlet Process can be used to define a non-parametric Bayesian HMM. The HDP implicitly integrates out the ...
  36. [36]
    Hidden Markov Models and their Applications in Biological ... - NIH
    We show how these HMMs can be used to solve various sequence analysis problems, such as pairwise and multiple sequence alignments, gene annotation, ...
  37. [37]
    [PDF] LONG SHORT-TERM MEMORY 1 INTRODUCTION
    For instance, in his postdoctoral thesis. (1993), Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal ...
  38. [38]
    [PDF] Long Short-Term Memory - Semantic Scholar
    Long Short-Term Memory · Sepp Hochreiter, J. Schmidhuber · Published in Neural Computation 1 November 1997 · Computer Science.
  39. [39]
    [1706.03762] Attention Is All You Need - arXiv
    Jun 12, 2017 · We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
  40. [40]
    [PDF] Backpropagation Through Time: What It Does and How to Do It
    This paper first reviews basic backpropagation, a simple method which is now being widely used in areas like pattern recognition and fault diagnosis, ...
  41. [41]
    BERT: Pre-training of Deep Bidirectional Transformers for Language ...
    Oct 11, 2018 · BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
  42. [42]
    [PDF] Short term temperature forecasting using LSTMS, and CNN
    May 31, 2021 · Long Short-Term Memory (LSTM) is a widely used deep learning architecture for time series forecasting. In this paper, we aim to predict one ...
  43. [43]
    [PDF] Predicting Stock Prices Using Hybrid LSTM and ARIMA Model - IAENG
    In this paper, we use the closing price of the sample as the prediction target, and the input and output of the training model are all one-dimensional matrices.
  44. [44]
    [PDF] Improving Language Understanding by Generative Pre-Training
    Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and.
  45. [45]
    [1809.04281] Music Transformer - arXiv
    Sep 12, 2018 · The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that ...
  46. [46]
    Highly accurate protein structure prediction with AlphaFold - Nature
    Jul 15, 2021 · The AlphaFold network directly predicts the 3D coordinates of all heavy atoms for a given protein using the primary amino acid sequence and ...
  47. [47]
    Q-learning | Machine Learning
    This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to ...
  48. [48]
    A Markovian Decision Process - Semantic Scholar
    A Markovian Decision Process · R. Bellman · Published 18 April 1957 · Mathematics · Indiana University Mathematics Journal.
  49. [49]
    [PDF] A Markovian Decision Process - DTIC
    A MARKOVIAN DECISION PROCESS. By Richard Bellman. §1. Introduction. The purpose of this paper is to discuss the asymptotic behavior of the sequence {f_N(i)} ...
  50. [50]
    A Review of Deep Reinforcement Learning Algorithms for Mobile ...
    This review paper discusses path-planning methods that use neural networks, including deep reinforcement learning, and its different types.
  51. [51]
    Mastering the game of Go with deep neural networks and tree search
    Jan 27, 2016 · Silver, D., Huang, A., Maddison, C. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
  52. [52]
    Informing sequential clinical decision-making through reinforcement ...
    This paper highlights the role that reinforcement learning can play in the optimization of treatment policies for informing clinical decision making. We have ...
  53. [53]
    Rethinking exploration–exploitation trade-off in reinforcement ...
    The exploration–exploitation dilemma is one of the fundamental challenges in deep reinforcement learning (RL). Agents must strike a trade-off between making ...
  54. [54]
    [PDF] A framework for temporal abstraction in reinforcement learning
    We can analyze options in terms of the SMDP and then use their MDP interpretation to change them and produce a new SMDP. Page 17. R.S. Sutton et al. / ...
  55. [55]
    Replay of Learned Neural Firing Sequences during Rest in Human ...
    The offline “replay” of neural firing patterns underlying waking experience, previously observed in non-human animals, is thought to be a mechanism for memory ...
  56. [56]
    Article Predictive sequence learning in the hippocampal formation
    Aug 7, 2024 · We developed a predictive autoencoder model of the hippocampus including the trisynaptic and monosynaptic circuits from the entorhinal cortex (EC).
  57. [57]
    Learning hierarchical sequence representations across human ...
    Feb 19, 2021 · In contrast, “chunking models” posit that learners represent statistically coherent units of information from the input in memory such that ...
  58. [58]
    Intact predictive motor sequence learning in autism spectrum disorder
    Oct 19, 2021 · We conclude that individuals with autism do not show atypicalities in response to surprising events in the context of motor sequence-learning.
  59. [59]
    Brain signatures of a multiscale process of sequence learning ... - eLife
    Feb 4, 2019 · Using Bayesian inference (again), the estimated statistics are turned into a prediction about the next stimulus. In that framework, surprise ...
  60. [60]
    [2303.08774] GPT-4 Technical Report - arXiv
    Mar 15, 2023 · We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
  61. [61]
    [2209.15352] AudioGen: Textually Guided Audio Generation - arXiv
    AudioGen is an auto-regressive model that generates audio samples conditioned on text inputs, using a discrete audio representation.
  62. [62]
    Mamba: Linear-Time Sequence Modeling with Selective State Spaces
    Dec 1, 2023 · Mamba is a neural network using selective SSMs, with fast inference and linear scaling, achieving state-of-the-art performance in language, ...
  63. [63]
    [2410.06070] Enforcing Interpretability in Time Series Transformers
    Oct 8, 2024 · We develop a framework based on Concept Bottleneck Models to enforce interpretability of time series Transformers.
  64. [64]
    Bias and Fairness in Large Language Models: A Survey
    We present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and ...
  65. [65]
    Large language models show amplified cognitive biases in moral ...
    Our experiments demonstrate that the decisions and advice of LLMs are systematically biased against doing anything, and this bias is stronger than in humans.
  66. [66]
    Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
    In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training.
  67. [67]
    Sequence processing with quantum-inspired tensor networks - Nature
    Feb 28, 2025 · We introduce efficient tensor network models for sequence processing motivated by correspondence to probabilistic graphical models, interpretability and ...
  68. [68]
    Protein Design by Integrating Machine Learning and Quantum ...
    Nov 15, 2024 · Strikingly, our quantum-inspired reformulation outperforms conventional sequence optimization even when adopted on classical machines. The ...