
Seq2seq

Sequence-to-sequence (seq2seq) models are a class of neural network architectures designed to transform an input sequence into an output sequence of potentially different length, using an encoder-decoder framework that processes variable-length data end-to-end. Introduced in 2014, these models make minimal assumptions about the input structure and leverage recurrent neural networks (RNNs), particularly long short-term memory (LSTM) units, to handle long-range dependencies in sequences. The core architecture consists of an encoder LSTM that reads the input timestep by timestep to produce a fixed-dimensional vector representation, followed by a decoder LSTM that generates the output conditioned on this representation. In their foundational implementation, Sutskever, Vinyals, and Le employed four-layer LSTMs with approximately 384 million parameters, trained on large parallel corpora, and incorporated a technique of reversing the source word order to facilitate gradient flow during training. This approach yielded a breakthrough in machine translation, achieving a BLEU score of 34.8 on the WMT'14 English-to-French dataset, surpassing traditional phrase-based systems.

A significant advancement came with the integration of an attention mechanism, proposed by Bahdanau, Cho, and Bengio in 2014, which addresses the bottleneck of fixed vector representations by enabling the decoder to dynamically weigh and attend to relevant portions of the input sequence. This "soft-alignment" process, computed as a weighted sum of encoder hidden states, improved translation quality for longer sentences, raising BLEU scores from 17.8 (basic encoder-decoder) to 26.8 on English-to-French tasks for models trained on sentences up to 50 words in length. Attention has since become a standard component in seq2seq models, enhancing their ability to model alignments between input and output elements.

Beyond translation, seq2seq models have been widely adopted for diverse natural language processing (NLP) tasks, including dialogue generation, where they produce contextually relevant responses to input utterances; abstractive text summarization, condensing long documents into concise outputs; and speech recognition, converting audio sequences to text transcripts. These applications demonstrate the model's versatility in handling sequential data, paving the way for further innovations like transformer-based variants that scale to even larger contexts.

Introduction

Definition and Purpose

Sequence-to-sequence (seq2seq) models constitute a foundational paradigm in machine learning for transforming an input sequence into an output sequence, employing an encoder-decoder framework to handle tasks such as mapping a sentence in a source language to its translation in a target language. The encoder compresses the variable-length input into a fixed-dimensional representation that captures essential information, while the decoder autoregressively generates the output by producing elements one at a time, conditioned on the encoded representation and previously generated tokens. The core purpose of seq2seq models is to facilitate end-to-end learning for sequence transduction problems, where the entire mapping from input to output is optimized jointly from raw data, bypassing the need for hand-engineered features or multi-stage pipelines. This addresses the constraints of earlier fixed-length models, which assume uniform input and output dimensions and fail to model sequential dependencies effectively. Seq2seq architectures emerged as a response to persistent challenges in machine translation, particularly the difficulties faced by traditional phrase-based statistical machine translation (SMT) systems in capturing long-range dependencies and handling non-monotonic alignments between source and target sequences. By enabling direct learning of sequence mappings, seq2seq models improve fluency and accuracy in generating coherent outputs from diverse inputs.

Key Components Overview

Seq2seq models consist of three primary components: an encoder, a decoder, and optionally an attention mechanism to enhance performance on longer sequences. The encoder processes the input sequence to produce a fixed-length context vector that captures the essential information of the input. The decoder then uses this context vector to generate the output sequence one element at a time. To handle variable-length sequences while preserving order and dependencies, seq2seq models employ recurrent layers, such as long short-term memory (LSTM) units, in both the encoder and decoder. These layers process the sequence timestep by timestep, allowing the model to manage inputs and outputs of differing lengths without fixed-size assumptions. Input sequences are represented through tokenization into words or subwords, followed by embedding layers that map tokens to dense vector representations, typically in high-dimensional spaces such as 1000 dimensions. Outputs are generated probabilistically using a softmax layer over a predefined vocabulary, enabling the model to predict the next token based on the accumulated context. For instance, in machine translation, an input like "Hello world" is tokenized and embedded by the encoder to form a context vector, which the decoder uses to produce tokens for the target language, such as "Bonjour le monde" in French, step by step during inference. Later enhancements, like attention mechanisms, allow the decoder to focus dynamically on relevant parts of the input, improving alignment for complex tasks.
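To make these components concrete, the following is a minimal PyTorch sketch of an LSTM encoder and decoder with an embedding layer and a vocabulary-sized output projection; the class names, dimensions, and vocabulary sizes are illustrative assumptions rather than a reference implementation.

```python
# Minimal seq2seq skeleton (illustrative sketch; hypothetical sizes and names).
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # token ids -> dense vectors
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                                   # src: (batch, src_len) of token ids
        _, (h, c) = self.lstm(self.embed(src))
        return h, c                                           # final states serve as the context

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)          # logits over the output vocabulary

    def forward(self, tgt, state):                            # tgt: (batch, tgt_len); state: encoder (h, c)
        output, state = self.lstm(self.embed(tgt), state)
        return self.out(output), state                        # softmax of logits gives next-token probabilities
```

At inference time the decoder would be called one token at a time, feeding back its own predictions until an end-of-sequence token is produced.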

Historical Development

Early Foundations

The foundations of sequence-to-sequence (seq2seq) models trace back to the pre-neural era of statistical machine translation (SMT), which dominated the field from the late 1980s to the early 2010s. SMT systems modeled translation as a noisy channel problem, estimating the probability of a target sentence given a source sentence using parallel corpora to learn translation and language models. These approaches shifted away from rigid rule-based systems toward data-driven methods that could handle real-world linguistic variability.

A cornerstone of early SMT was the set of IBM alignment models, developed by researchers at IBM's Thomas J. Watson Research Center in the early 1990s. These models, detailed in a seminal series of five progressively complex formulations, focused on word-level alignments between source and target languages to map sequences probabilistically. Model 1 introduced basic lexical translation probabilities assuming uniform alignment, while later models incorporated distortion (position-based alignments) and fertility (the number of target words generated per source word) to better capture structural differences between languages. By estimating parameters via expectation-maximization on bilingual data, the IBM models enabled automatic alignment extraction, forming the basis for phrase-based systems that improved translation quality significantly over prior methods.

Despite their impact, SMT frameworks faced key challenges in handling variable-length sequences and alignments. N-gram language models, commonly used in SMT for fluency scoring, suffered from data sparsity and limited context (typically 3-5 grams), failing to capture long-range dependencies essential for coherent translation. Alignment processes required heuristics to resolve ambiguities, such as one-to-many mappings, and struggled with reordering in linguistically distant language pairs, leading to error propagation in decoding. These limitations highlighted the need for models that could learn continuous representations and sequential dependencies more flexibly.

Neural precursors emerged in the late 2000s and early 2010s with recurrent neural networks (RNNs) applied to language modeling, offering a path toward addressing these issues. In 2010, Mikolov et al. introduced RNN-based language models (RNNLMs) that used hidden states to maintain context across arbitrary lengths, outperforming traditional n-grams on perplexity metrics for speech recognition and text prediction tasks. These models demonstrated RNNs' ability to learn distributed word representations and handle sequential data without explicit fixed-length context windows, paving the way for neural integration into translation pipelines. By 2011-2013, neural language models were incorporated as features in SMT systems for rescoring hypotheses, yielding modest but consistent BLEU score improvements (e.g., 0.5-1.0 points) on benchmarks like WMT.

Initial neural ideas for machine translation built on these advances, proposing encoder structures to compress source sequences into fixed representations. In 2013, Kalchbrenner and Blunsom presented recurrent continuous translation models, which used convolutional sentence encoders to represent source sentences and recurrent networks to generate target sequences, achieving competitive results on small-scale English-French tasks without relying on phrase tables. This work emphasized end-to-end learning of alignments through continuous embeddings, reducing the modular complexity of SMT. The transition from statistical to neural end-to-end learning accelerated around 2013-2014, driven by advances in GPU computing and larger datasets, enabling seq2seq paradigms to supplant hybrid SMT-neural systems.

Priority Dispute and Key Publications

The RNN encoder-decoder architecture, a foundational component of sequence-to-sequence (seq2seq) models, was first introduced by Kyunghyun Cho and colleagues in their June 2014 arXiv preprint, later published at EMNLP 2014. Authored by researchers from institutions including the Université de Montréal, the paper proposed using two recurrent neural networks (RNNs)—one as an encoder to compress input sequences into a fixed-length vector and another as a decoder to generate output sequences—for tasks such as statistical machine translation, with additional applications noted for automatic summarization, question answering, and dialogue systems. The model emphasized learning continuous phrase representations to improve integration with traditional phrase-based translation systems.

Shortly thereafter, Ilya Sutskever and colleagues from Google Brain published their September 2014 arXiv preprint, accepted at NeurIPS 2014, which formalized "sequence to sequence learning with neural networks" and popularized the "seq2seq" terminology. This work built upon the encoder-decoder framework by employing long short-term memory (LSTM) units to handle longer sequences more effectively, focusing primarily on end-to-end machine translation without relying on intermediate phrase alignments. Sutskever et al. explicitly cited Cho et al. as related prior work, acknowledging the encoder-decoder structure while advancing it through deeper LSTMs and empirical demonstrations on translation benchmarks.

The Sutskever paper had profound impact by demonstrating state-of-the-art performance on the WMT'14 English-to-French translation task, achieving a BLEU score of 34.81 with an ensemble of deep LSTMs—surpassing the previous phrase-based baseline of 33.3 and establishing seq2seq as a viable alternative to traditional methods. This result, obtained via direct sequence generation without relying on phrase-based components, highlighted the model's ability to capture long-range dependencies in natural language. The work's emphasis on reversing input sequences during training further improved convergence and performance. Subsequent influences included rapid integration of seq2seq models into major deep learning frameworks such as TensorFlow, where example implementations and tutorials for encoder-decoder architectures emerged by 2017 to facilitate experimentation in translation and beyond. These developments accelerated adoption across research and industry, cementing seq2seq as a cornerstone of neural sequence modeling.

Core Architecture

Encoder Mechanism

The encoder in a sequence-to-sequence (seq2seq) model serves to process an input sequence x_1, \dots, x_T, where each x_t is typically an embedding of a token such as a word or subword, through a series of recurrent layers to generate a sequence of hidden states h_1, \dots, h_T. These hidden states capture the contextual information from the input up to each time step t. In the basic formulation, the final hidden state h_T is often used as a fixed-dimensional context vector that summarizes the entire input for downstream processing. Recurrent neural networks (RNNs) form the backbone of the encoder, with the hidden state at each step computed as h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), where W_{xh} and W_{hh} are weight matrices and b_h is a bias term; this update allows the model to maintain a memory of prior inputs while incorporating the current one. To enhance context capture, especially for tasks requiring understanding of future tokens, bidirectional RNNs or LSTMs are commonly employed, processing the sequence in both forward and backward directions to produce concatenated hidden states. Long short-term memory (LSTM) units, which replace the simple tanh activation with a cell state and gating mechanisms—including forget, input, and output gates—address limitations in standard RNNs by selectively retaining or discarding information over extended sequences. A key challenge in encoder design is the vanishing gradient problem, where gradients diminish exponentially during backpropagation through time, hindering learning of long-range dependencies in sequences. LSTMs mitigate this by using multiplicative gates to maintain stable gradient flow, enabling effective encoding of lengthy inputs. For instance, in machine translation, the encoder processes a source sentence like "The cat sat on the mat" to produce hidden states that encode syntactic and semantic relationships, summarized in the context vector for decoding the target language output.
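As a concrete illustration of the recurrence above, the short NumPy sketch below rolls a vanilla RNN cell over an input sequence and collects the hidden states; the dimensions and random weights are toy assumptions, and in practice the tanh cell would be replaced by LSTM or bidirectional layers.

```python
# Vanilla RNN encoder sketch: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h). Toy sizes.
import numpy as np

def rnn_encode(inputs, W_xh, W_hh, b_h):
    """inputs: sequence of embedding vectors x_1..x_T; returns hidden states h_1..h_T."""
    h = np.zeros(W_hh.shape[0])                    # h_0 initialized to zeros
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # fold the current input into the running memory
        states.append(h)
    return states                                  # states[-1] is the fixed-length context vector h_T

rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(6)]        # six 4-dimensional input embeddings
hs = rnn_encode(xs, rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8))
```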

Decoder Mechanism

The decoder in a sequence-to-sequence (seq2seq) model is responsible for generating the output sequence y_1, \dots, y_{T'} autoregressively, conditioned on the input sequence x via a fixed-dimensional context vector c produced by the encoder. This process begins with the decoder's initial hidden state initialized from c, and each subsequent hidden state s_t is computed as s_t = f(y_{t-1}, s_{t-1}), where f denotes the recurrent function (typically an LSTM or GRU), y_{t-1} is the previous output symbol (as an embedding), and s_{t-1} is the prior hidden state; the context vector c influences the decoding through this initialization, enabling the model to produce variable-length outputs without explicit alignment to the input. During training, the decoder employs teacher forcing, where the ground-truth previous outputs y_{<t} are fed as inputs to compute the next symbol, rather than the model's own predictions. This facilitates efficient optimization by maximizing the conditional likelihood P(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_o s_t), where W_o is a learned output projection mapping the hidden state s_t to the vocabulary size, and the softmax yields the probability distribution over possible output tokens. The overall training objective decomposes into the product of these conditional probabilities across the output sequence, allowing the model to learn coherent generations step by step. At inference time, the decoder operates fully autoregressively, starting from a special start token and using its own predictions to generate subsequent symbols, which introduces a train-test discrepancy. To improve output quality over greedy decoding (which selects the highest-probability token at each step), beam search is commonly used; it maintains a fixed number of partial hypotheses (beam width B), expanding and pruning them based on cumulative log-probability to explore more likely sequences. Experiments on machine translation tasks demonstrate that beam search with B = 12 can yield substantial gains; for an ensemble of five models, it improves the BLEU score from 33.0 (greedy decoding) to 34.81 on the WMT'14 English-to-French task. A key challenge in this setup is exposure bias, arising from the mismatch between training (where errors do not propagate due to teacher forcing) and inference (where compounding errors from model predictions degrade performance over long sequences). This discrepancy causes the model to be exposed only to ground-truth prefixes during training, leading to suboptimal robustness when generating from its own outputs.
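The following Python sketch illustrates the beam search procedure described above; `log_prob_fn` is a hypothetical stand-in for the decoder, taking a prefix of token ids and returning (token, log-probability) pairs for candidate next tokens. Setting the beam width to 1 recovers greedy decoding.

```python
# Illustrative beam search (not the original implementation; the decoder is a hypothetical callable).
def beam_search(log_prob_fn, start_id, end_id, beam_width=4, max_len=20):
    beams = [([start_id], 0.0)]                        # each beam: (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                      # completed hypotheses are set aside
                finished.append((seq, score))
                continue
            for tok, logp in log_prob_fn(seq):         # expand with candidate next tokens
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        # prune to the top-B partial hypotheses by cumulative log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]        # best-scoring sequence found
```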

Training and Inference Differences

Seq2seq models are trained using maximum likelihood estimation, where the objective is to maximize the conditional log-likelihood of the target sequence given the input sequence, typically formulated as minimizing the cross-entropy loss \mathcal{L} = -\sum_t \log P(y_t \mid y_{<t}, x), with the sum taken over the target tokens y_t. This loss is computed autoregressively, but during training the decoder receives the ground-truth previous tokens as input—a technique known as teacher forcing—to accelerate convergence and avoid error propagation from early predictions. In contrast, inference in seq2seq models involves autoregressive decoding without access to ground-truth targets, where the decoder generates each token conditioned only on the model's previous outputs, leading to potential error accumulation over long sequences—a phenomenon called exposure bias that can degrade performance relative to training conditions. To mitigate the suboptimal greedy selection of the most probable token at each step, strategies like beam search are employed, maintaining a fixed-width beam of k hypotheses and exploring the top-k most likely extensions at each step to find a higher-probability output sequence overall. This process is computationally more intensive than greedy decoding, as it requires evaluating multiple partial sequences in parallel, often resulting in longer inference times, especially for larger beam widths. A key distinction arises from the absence of ground truth during inference, necessitating extrinsic evaluation metrics such as BLEU, which measures n-gram overlap between generated and reference sequences to approximate quality without human judgment. Additionally, to stabilize training of the underlying recurrent components, techniques like gradient clipping are applied, capping the gradient norm (e.g., at 5) to prevent exploding gradients that could destabilize optimization.
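A single teacher-forced training step can be sketched as follows, reusing the hypothetical encoder and decoder classes from the earlier sketch; the padding id, clipping threshold, and batch shapes are assumptions chosen for illustration.

```python
# One teacher-forced training step with cross-entropy loss and gradient clipping (sketch).
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src, tgt, pad_id=0, clip_norm=5.0):
    optimizer.zero_grad()
    state = encoder(src)                               # encode the source batch into (h, c)
    logits, _ = decoder(tgt[:, :-1], state)            # teacher forcing: feed the ground-truth prefix
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),           # (batch * steps, vocab_size)
        tgt[:, 1:].reshape(-1),                        # targets are the next ground-truth tokens
        ignore_index=pad_id,                           # ignore padding positions
    )
    loss.backward()
    params = list(encoder.parameters()) + list(decoder.parameters())
    torch.nn.utils.clip_grad_norm_(params, clip_norm)  # cap the gradient norm (e.g., at 5)
    optimizer.step()
    return loss.item()
```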

Attention Integration

Role of Attention in Seq2seq

In vanilla sequence-to-sequence (seq2seq) models, the encoder compresses the entire input into a single fixed-length context vector, which serves as the sole source of information for the decoder during generation. This approach creates a significant information bottleneck, particularly for long sequences, as the encoder must encode all relevant details into a limited representation, leading to information loss and degraded performance as input length increases. To address this limitation, attention mechanisms were introduced to enable the decoder to dynamically focus on different parts of the input sequence at each generation step, rather than relying on a static context vector. In their seminal work, Bahdanau et al. (2014) proposed an additive attention mechanism specifically for neural machine translation, where the decoder computes alignment scores between the current hidden state and all encoder annotations to weigh their contributions selectively. This allows the model to emphasize relevant input elements, such as specific words or phrases, improving alignment between source and target sequences. The integration of attention involves the decoder attending to the full set of encoder hidden states—representations produced by the encoder for each input position—at every decoding timestep. By forming a context vector as a weighted sum of these states, the model handles long sequences more effectively, avoiding the need to propagate all input information through deep recurrent layers solely via the initial context vector. This enhancement yields several benefits: it mitigates error propagation from the encoder by allowing direct retrieval of input information, and it produces interpretable soft-alignment weights that reveal plausible linguistic correspondences between input and output, aiding model analysis and debugging.

Computing Attention Weights

In sequence-to-sequence models, attention weights are computed to determine the relevance of each encoder hidden state to the current decoder state. The process begins with calculating alignment scores, which quantify the compatibility between the decoder's hidden state s_t at time step t and each encoder hidden state h_i for input positions i = 1 to T. In the additive mechanism introduced by Bahdanau et al., the score e_{ti} is computed as e_{ti} = v_a^T \tanh(W_a [s_t; h_i]), where W_a is a learnable weight matrix, v_a is a learnable vector, and [s_t; h_i] denotes concatenation. These raw scores are then normalized to form attention weights using the softmax function: \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^T \exp(e_{tj})}. The resulting weights \alpha_{ti} sum to 1 over i, enabling the computation of a context vector c_t = \sum_{i=1}^T \alpha_{ti} h_i, which aggregates a weighted sum of the encoder states to inform the decoder's output at step t. Alternative formulations exist for computing scores. For instance, the dot-product attention variant, proposed by Luong et al., simplifies the score to e_{ti} = s_t^T h_i (or a scaled version thereof), which is computationally efficient and effective when encoder and decoder states lie in the same vector space. Multi-head attention extends these mechanisms by projecting queries, keys, and values into multiple subspaces and computing attention in parallel, allowing the model to jointly attend to information from different representation subspaces, as later developed in Transformer architectures. Attention weights are often visualized as heatmaps, where rows correspond to decoder time steps and columns to encoder positions, with color intensity representing \alpha_{ti} values; such visualizations reveal alignments between source and target sequences, such as focusing on specific input words during translation.
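The computation can be sketched in a few lines of NumPy for a single decoding step; the state dimensions and random parameters below are toy assumptions, not values from the original models.

```python
# Additive (Bahdanau-style) attention for one decoder step: scores -> softmax -> context vector.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                 # subtract the max for numerical stability
    return e / e.sum()

def additive_attention(s_t, H, W_a, v_a):
    """s_t: decoder state; H: encoder states stacked as rows (T, d); returns (weights, context)."""
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_t, h_i])) for h_i in H])
    alpha = softmax(scores)                 # attention weights alpha_ti, summing to 1 over positions
    context = alpha @ H                     # c_t = sum_i alpha_ti * h_i
    return alpha, context

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))                 # five encoder hidden states of dimension 8
alpha, c_t = additive_attention(rng.normal(size=8), H,
                                rng.normal(size=(16, 16)), rng.normal(size=16))
```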

Applications and Extensions

Machine Translation

Sequence-to-sequence (seq2seq) models were initially developed for machine translation as an end-to-end approach, directly mapping source-language sentences to target-language outputs without relying on intermediate phrase alignments or hand-crafted rules. This paradigm, introduced in 2014, outperformed traditional phrase-based statistical machine translation (SMT) systems, which dominated prior benchmarks. For instance, on the WMT'14 English-to-French task, an ensemble of deep LSTM-based seq2seq models achieved a BLEU score of 34.8, surpassing the phrase-based SMT baseline of 33.3. The integration of attention mechanisms further enhanced performance by allowing the decoder to focus on relevant parts of the input sequence, yielding a BLEU score of 28.45 on the same dataset with an extended training regime.

In 2016, Google adopted neural machine translation based on seq2seq architectures with attention in its Google Translate service, marking a widespread commercial deployment. The Google Neural Machine Translation (GNMT) system, utilizing deep LSTM layers, reduced translation errors by 60% relative to phrase-based systems on English-to-French, English-to-Spanish, and English-to-Chinese pairs. This transition improved fluency and accuracy, establishing seq2seq as the foundation for production-scale translation systems.

Seq2seq models for machine translation are commonly evaluated on Workshop on Machine Translation (WMT) datasets, which provide standardized corpora for language pairs like English-French and English-German, using the BLEU metric to measure n-gram overlap with human references. Early attention-augmented seq2seq models achieved BLEU scores exceeding 28 on English-to-French tasks, setting new standards that phrase-based methods struggled to match. Later implementations, such as GNMT, pushed BLEU scores to 38.95 on WMT'14 English-to-French, demonstrating the scalability of seq2seq to larger datasets and deeper networks.

The evolution of seq2seq in machine translation progressed from basic RNN and LSTM architectures to optimized toolkits like fairseq, developed by Facebook AI Research, which facilitates efficient training of seq2seq models including LSTMs and convolutions for translation tasks. A key advancement addressed the challenge of rare words, which comprise up to 20% of the vocabulary in some corpora and often lead to unknown-token issues in fixed-vocabulary models. Byte pair encoding (BPE) subword units resolve this by decomposing rare and out-of-vocabulary words into frequent subword segments, enabling open-vocabulary translation; for example, on WMT 2015 English-to-German, BPE improved BLEU by 0.5 points and rare word accuracy (unigram F1) from 36.8% to 41.8%.

A notable case study in seq2seq translation involves visualizing attention-derived alignments, which reveal how the model associates source and target words. In the Bahdanau attention model, soft-alignment weights are depicted as grayscale matrices, illustrating alignments, including non-monotonic ones such as mapping "European Economic Area" in English to "zone économique européenne" in French, providing interpretability into the translation process. This visualization underscores attention's role in producing coherent, context-aware translations.
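To illustrate the subword idea, the toy Python sketch below learns a handful of BPE merge operations by repeatedly joining the most frequent adjacent symbol pair, in the spirit of Sennrich et al.; the miniature word-frequency dictionary and the number of merges are illustrative assumptions.

```python
# Toy byte-pair-encoding merges over a tiny word-frequency dictionary (illustrative only).
from collections import Counter

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq                      # count adjacent symbol pairs, weighted by frequency
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(pair, vocab):
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the chosen pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

# words pre-split into characters, with "</w>" marking the end of a word
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                                    # learn 10 merge operations
    pair = most_frequent_pair(vocab)
    if pair is None:
        break
    vocab = merge_pair(pair, vocab)
print(vocab)                                           # frequent subwords such as "est</w>" emerge
```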

Speech Recognition and Other NLP Tasks

Seq2seq models have been adapted for speech recognition by treating the problem as mapping acoustic input sequences, such as mel-frequency cepstral coefficients, to output sequences of characters or words. A seminal approach is the Listen, Attend and Spell (LAS) model, which employs an encoder-decoder architecture with attention to directly transcribe speech utterances without explicit alignment, achieving end-to-end learning from audio to text. This contrasts with connectionist temporal classification (CTC), an earlier alignment-free method that uses a recurrent network to compute probabilities over output labels at each time step, often combined with a language model for decoding, but lacking an attention mechanism for focusing on relevant audio segments. Attention-based decoders in seq2seq, as in LAS, generally outperform CTC in handling variable-length inputs and improving transcription accuracy on large-vocabulary tasks by dynamically weighting acoustic features.

Beyond speech, seq2seq architectures power other natural language processing tasks, such as abstractive text summarization, where an encoder processes the input document to produce a fixed representation, and the decoder generates a concise abstract by attending to key parts of the source text. A foundational model for this is the neural attention-based abstractive summarizer, which frames summarization as sequence transduction and demonstrates superior fluency and informativeness over extractive methods on datasets like the Gigaword corpus. In dialogue systems, seq2seq models enable response generation by encoding conversational context as input sequences and decoding coherent replies, as exemplified by the neural conversational model trained on multi-turn dialogues, which captures utterance dependencies to produce contextually relevant outputs.

Seq2seq extends to multimodal and non-text domains, including image captioning, where a convolutional neural network (CNN) encoder extracts visual features from images, feeding them into an RNN decoder to generate descriptive captions. The Show and Tell model pioneered this encoder-decoder paradigm for vision-to-language tasks, attaining state-of-the-art performance on benchmarks like MSCOCO, with later attention-based variants learning to align image regions with words. For time-series forecasting, seq2seq models encode historical data sequences to predict future values, providing a flexible framework for multivariate predictions that outperforms traditional autoregressive methods in capturing long-range dependencies. Modern advancements integrate seq2seq with pre-trained language models, such as T5, which unifies diverse tasks under a text-to-text format using an encoder-decoder Transformer, pre-trained on large corpora and fine-tuned to enhance performance in summarization, translation, and beyond while preserving the core seq2seq structure.
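As a usage-level example of the last point, the snippet below runs abstractive summarization with a pre-trained T5 checkpoint through the Hugging Face transformers library; the specific checkpoint name, prompt text, and generation settings are illustrative choices, not prescriptions.

```python
# Summarization with a pre-trained encoder-decoder model (T5) via Hugging Face transformers.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

article = "Sequence-to-sequence models map an input sequence to an output sequence ..."
# T5 casts every task as text-to-text, so the task is signalled with a textual prefix.
inputs = tokenizer("summarize: " + article, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=40, num_beams=4)   # beam-search decoding
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```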

References

  1. Sequence to Sequence Learning with Neural Networks. arXiv, Sep 10, 2014.
  2. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv, Sep 1, 2014.
  3. Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. Mar 5, 2017.
  4. Extended Translation Models in Phrase-based Decoding.
  5. Recurrent Neural Network Based Language Model.
  6. Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/pdf/1409.0473.pdf
  7. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv, Jun 3, 2014.
  8. Multi-task Sequence to Sequence Learning. Google Research.
  9. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing.
  10. Long Short-Term Memory. Neural Computation, Nov 15, 1997.
  11. Learning Long-Term Dependencies with Gradient Descent is Difficult.
  12. Sequence to Sequence Learning with Neural Networks. arXiv, Dec 14, 2014.
  13. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Jun 1, 1989.
  14. BLEU: a Method for Automatic Evaluation of Machine Translation.
  15. On the Difficulty of Training Recurrent Neural Networks. arXiv, Feb 16, 2013.
  16. Effective Approaches to Attention-based Neural Machine Translation. Aug 17, 2015.
  17. Attention Is All You Need. arXiv:1706.03762, Jun 12, 2017.
  18. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Sep 26, 2016.
  19. Neural Machine Translation of Rare Words with Subword Units. arXiv, Aug 31, 2015.
  20. Listen, Attend and Spell. arXiv:1508.01211, Aug 5, 2015.
  21. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.
  22. A Neural Attention Model for Abstractive Sentence Summarization. arXiv, Sep 2, 2015.
  23. A Neural Conversational Model. arXiv:1506.05869, Jun 19, 2015.
  24. Show and Tell: A Neural Image Caption Generator. arXiv:1411.4555, Nov 17, 2014.
  25. Foundations of Sequence-to-Sequence Modeling for Time Series. May 9, 2018.