
Seq2seq

Sequence-to-sequence (seq2seq) models are a class of neural network architectures designed to transform an input sequence into an output sequence of potentially different length, using an encoder-decoder framework that processes variable-length data end-to-end. Introduced in 2014, these models make minimal assumptions about the input structure and leverage recurrent neural networks (RNNs), particularly long short-term memory (LSTM) units, to handle long-range dependencies in sequences. The core architecture consists of an encoder LSTM that reads the input timestep by timestep to produce a fixed-dimensional vector representation, followed by a decoder LSTM that generates the output conditioned on this representation. In their foundational implementation, Sutskever, Vinyals, and Le employed four-layer LSTMs with approximately 384 million parameters, trained on large parallel corpora, and incorporated a technique of reversing the source word order to facilitate gradient flow during training. This approach yielded a breakthrough in machine translation, achieving a BLEU score of 34.8 on the WMT'14 English-to-French dataset, surpassing traditional phrase-based systems.

A significant advancement came with the integration of an attention mechanism, proposed by Bahdanau, Cho, and Bengio in 2014, which addresses the bottleneck of fixed vector representations by enabling the decoder to dynamically weigh and attend to relevant portions of the input sequence. This "soft-alignment" process, computed as a weighted sum of encoder hidden states, improved translation quality for longer sentences, raising BLEU scores from 17.8 (basic encoder-decoder) to 26.8 on English-to-French tasks for models trained on sentences up to 50 words in length. Attention has since become a standard component in seq2seq models, enhancing their ability to model alignments between input and output elements.

Beyond translation, seq2seq models have been widely adopted for diverse natural language processing (NLP) tasks, including dialogue generation, where they produce contextually relevant responses to input utterances; abstractive text summarization, condensing long documents into concise outputs; and speech recognition, converting audio sequences to text transcripts. These applications demonstrate the model's versatility in handling sequential data, paving the way for further innovations like transformer-based variants that scale to even larger contexts.

Introduction

Definition and Purpose

Sequence-to-sequence (seq2seq) models constitute a foundational paradigm in machine learning for transforming an input sequence into an output sequence, employing an encoder-decoder framework to handle tasks such as mapping a sentence in a source language to its translation in a target language. The encoder compresses the variable-length input into a fixed-dimensional representation that captures essential information, while the decoder autoregressively generates the output by producing elements one at a time, conditioned on the encoded representation and previously generated tokens. The core purpose of seq2seq models is to facilitate end-to-end learning for sequence transduction problems, where the entire mapping from input to output is optimized jointly from raw data, bypassing the need for hand-engineered features or multi-stage pipelines. This addresses the constraints of earlier fixed-length models, which assume uniform input and output dimensions and fail to model sequential dependencies effectively. Seq2seq architectures emerged as a response to persistent challenges in machine translation, particularly the difficulties faced by traditional phrase-based statistical machine translation (SMT) systems in capturing long-range dependencies and handling non-monotonic alignments between source and target sequences. By enabling direct learning of sequence mappings, seq2seq models improve fluency and accuracy in generating coherent outputs from diverse inputs.

Key Components Overview

Seq2seq models consist of three primary components: an encoder, a decoder, and optionally an attention mechanism to enhance performance on longer sequences. The encoder processes the input sequence to produce a fixed-length context vector that captures the essential information of the input. The decoder then uses this context vector to generate the output sequence one element at a time. To handle variable-length sequences while preserving order and dependencies, seq2seq models employ recurrent layers, such as long short-term memory (LSTM) units, in both the encoder and decoder. These layers process the sequence timestep by timestep, allowing the model to manage inputs and outputs of differing lengths without fixed-size assumptions. Input sequences are represented through tokenization into words or subwords, followed by embedding layers that map tokens to dense vector representations, typically in high-dimensional spaces such as 1000 dimensions. Outputs are generated probabilistically using a softmax layer over a predefined vocabulary, enabling the model to predict the next token based on the accumulated context. For instance, in machine translation, an input like "Hello world" is tokenized and embedded by the encoder to form a context vector, which the decoder uses to produce tokens for the target language, such as "Bonjour le monde" in French, step by step during inference. Later enhancements, like attention mechanisms, allow the decoder to focus dynamically on relevant parts of the input, improving alignment for complex tasks.
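To make these components concrete, the following is a minimal PyTorch sketch of an LSTM encoder and decoder with an embedding layer and a vocabulary-sized output projection; the class names, dimensions, and vocabulary sizes are illustrative assumptions rather than a reference implementation.

```python
# Minimal seq2seq skeleton (illustrative sketch; hypothetical sizes and names).
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # token ids -> dense vectors
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                                   # src: (batch, src_len) of token ids
        _, (h, c) = self.lstm(self.embed(src))
        return h, c                                           # final states serve as the context

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)          # logits over the output vocabulary

    def forward(self, tgt, state):                            # tgt: (batch, tgt_len); state: encoder (h, c)
        output, state = self.lstm(self.embed(tgt), state)
        return self.out(output), state                        # softmax of logits gives next-token probabilities
```

At inference time the decoder would be called one token at a time, feeding back its own predictions until an end-of-sequence token is produced.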

Historical Development

Early Foundations

The foundations of sequence-to-sequence (seq2seq) models trace back to the pre-neural era of statistical machine translation (SMT), which dominated the field from the late 1980s to the early 2010s. SMT systems modeled translation as a noisy channel problem, estimating the probability of a target sentence given a source sentence using parallel corpora to learn translation and language models. These approaches shifted away from rigid rule-based systems toward data-driven methods that could handle real-world linguistic variability.

A cornerstone of early SMT was the set of IBM alignment models, developed by researchers at IBM's Thomas J. Watson Research Center in the early 1990s. These models, detailed in a seminal series of five progressively complex formulations, focused on word-level alignments between source and target languages to map sequences probabilistically. Model 1 introduced basic lexical translation probabilities assuming uniform alignment, while later models incorporated distortion (position-based alignments) and fertility (the number of target words generated per source word) to better capture structural differences between languages. By estimating parameters via expectation-maximization on bilingual data, the IBM models enabled automatic alignment extraction, forming the basis for phrase-based systems that improved translation quality significantly over prior methods.

Despite their impact, SMT frameworks faced key challenges in handling variable-length sequences and alignments. N-gram language models, commonly used in SMT for fluency scoring, suffered from data sparsity and limited context (typically 3-5 grams), failing to capture long-range dependencies essential for coherent translation. Alignment processes required heuristics to resolve ambiguities, such as one-to-many mappings, and struggled with reordering in linguistically distant language pairs, leading to error propagation in decoding. These limitations highlighted the need for models that could learn continuous representations and sequential dependencies more flexibly.

Neural precursors emerged in the late 2000s and early 2010s with recurrent neural networks (RNNs) applied to language modeling, offering a path toward addressing these issues. In 2010, Mikolov et al. introduced RNN-based language models (RNNLMs) that used hidden states to maintain context across arbitrary lengths, outperforming traditional n-grams on perplexity metrics for speech recognition and text prediction tasks. These models demonstrated RNNs' ability to learn distributed word representations and handle sequential data without explicit fixed-length context windows, paving the way for neural integration into translation pipelines. By 2011-2013, neural language models were incorporated as features in SMT systems for rescoring hypotheses, yielding modest but consistent BLEU score improvements (e.g., 0.5-1.0 points) on benchmarks like WMT.

Initial neural ideas for machine translation built on these advances, proposing encoder structures to compress source sequences into fixed representations. In 2013, Kalchbrenner and Blunsom presented recurrent continuous translation models, which used convolutional sentence encoders to represent source sentences and recurrent networks to generate target sequences, achieving competitive results on small-scale English-French tasks without relying on phrase tables. This work emphasized end-to-end learning of alignments through continuous embeddings, reducing the modular complexity of SMT. The transition from statistical to neural end-to-end learning accelerated around 2013-2014, driven by advances in GPU computing and larger datasets, enabling seq2seq paradigms to supplant hybrid SMT-neural systems.

Priority Dispute and Key Publications

The RNN encoder-decoder architecture, a foundational component of sequence-to-sequence (seq2seq) models, was first introduced by Kyunghyun Cho and colleagues in their June 2014 arXiv preprint, later published at EMNLP 2014. Authored by researchers from institutions including the Université de Montréal, the paper proposed using two recurrent neural networks (RNNs)—one as an encoder to compress input sequences into a fixed-length vector and another as a decoder to generate output sequences—for tasks such as statistical machine translation, with additional applications noted for automatic summarization, question answering, and dialogue systems. The model emphasized learning continuous phrase representations to improve integration with traditional phrase-based translation systems.

Shortly thereafter, Ilya Sutskever and colleagues from Google Brain published their September 2014 arXiv preprint, accepted at NeurIPS 2014, which formalized "sequence to sequence learning with neural networks" and popularized the "seq2seq" terminology. This work built upon the encoder-decoder framework by employing long short-term memory (LSTM) units to handle longer sequences more effectively, focusing primarily on end-to-end machine translation without relying on intermediate phrase alignments. Sutskever et al. explicitly cited Cho et al. as related prior work, acknowledging the encoder-decoder structure while advancing it through deeper LSTMs and empirical demonstrations on translation benchmarks.

The Sutskever paper had profound impact by demonstrating state-of-the-art performance on the WMT'14 English-to-French translation task, achieving a BLEU score of 34.81 with an ensemble of deep LSTMs—surpassing the previous phrase-based baseline of 33.3 and establishing seq2seq as a viable alternative to traditional methods. This result, obtained via direct sequence generation without relying on phrase-based components, highlighted the model's ability to capture long-range dependencies in natural language. The work's emphasis on reversing input sequences during training further improved convergence and performance. Subsequent influences included rapid integration of seq2seq models into major deep learning frameworks such as TensorFlow, where example implementations and tutorials for encoder-decoder architectures emerged by 2017 to facilitate experimentation in translation and beyond. These developments accelerated adoption across research and industry, cementing seq2seq as a cornerstone of neural sequence modeling.

Core Architecture

Encoder Mechanism

The encoder in a sequence-to-sequence (seq2seq) model serves to process an input sequence x_1, \dots, x_T, where each x_t is typically an embedding of a token such as a word or subword, through a series of recurrent layers to generate a sequence of hidden states h_1, \dots, h_T. These hidden states capture the contextual information from the input up to each time step t. In the basic formulation, the final hidden state h_T is often used as a fixed-dimensional context vector that summarizes the entire input for downstream processing. Recurrent neural networks (RNNs) form the backbone of the encoder, with the hidden state at each step computed as h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), where W_{xh} and W_{hh} are weight matrices and b_h is a bias term; this update allows the model to maintain a memory of prior inputs while incorporating the current one. To enhance context capture, especially for tasks requiring understanding of future tokens, bidirectional RNNs or LSTMs are commonly employed, processing the sequence in both forward and backward directions to produce concatenated hidden states. Long short-term memory (LSTM) units, which replace the simple tanh activation with a cell state and gating mechanisms—including forget, input, and output gates—address limitations in standard RNNs by selectively retaining or discarding information over extended sequences. A key challenge in encoder design is the vanishing gradient problem, where gradients diminish exponentially during backpropagation through time, hindering learning of long-range dependencies in sequences. LSTMs mitigate this by using multiplicative gates to maintain stable gradient flow, enabling effective encoding of lengthy inputs. For instance, in machine translation, the encoder processes a source sentence like "The cat sat on the mat" to produce hidden states that encode syntactic and semantic relationships, summarized in the context vector for decoding the target language output.
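As a concrete illustration of the recurrence above, the short NumPy sketch below rolls a vanilla RNN cell over an input sequence and collects the hidden states; the dimensions and random weights are toy assumptions, and in practice the tanh cell would be replaced by LSTM or bidirectional layers.

```python
# Vanilla RNN encoder sketch: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h). Toy sizes.
import numpy as np

def rnn_encode(inputs, W_xh, W_hh, b_h):
    """inputs: sequence of embedding vectors x_1..x_T; returns hidden states h_1..h_T."""
    h = np.zeros(W_hh.shape[0])                    # h_0 initialized to zeros
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # fold the current input into the running memory
        states.append(h)
    return states                                  # states[-1] is the fixed-length context vector h_T

rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(6)]        # six 4-dimensional input embeddings
hs = rnn_encode(xs, rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8))
```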

Decoder Mechanism

The decoder in a sequence-to-sequence (seq2seq) model is responsible for generating the output sequence y_1, \dots, y_{T'} autoregressively, conditioned on the input sequence x via a fixed-dimensional context vector c produced by the encoder. This process begins with the decoder's initial hidden state initialized from c, and each subsequent hidden state s_t is computed as s_t = f(y_{t-1}, s_{t-1}), where f denotes the recurrent function (typically an LSTM or GRU), y_{t-1} is the previous output symbol (as an embedding), and s_{t-1} is the prior hidden state; the context vector c influences the decoding through this initialization, enabling the model to produce variable-length outputs without explicit alignment to the input. During training, the decoder employs teacher forcing, where the ground-truth previous outputs y_{<t} are fed as inputs to compute the next symbol, rather than the model's own predictions. This facilitates efficient optimization by maximizing the conditional likelihood P(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_o s_t), where W_o is a learned output projection mapping the hidden state s_t to the vocabulary size, and the softmax yields the probability distribution over possible output tokens. The overall training objective decomposes into the product of these conditional probabilities across the output sequence, allowing the model to learn coherent generations step by step. At inference time, the decoder operates fully autoregressively, starting from a special start token and using its own predictions to generate subsequent symbols, which introduces a train-test discrepancy. To improve output quality over greedy decoding (which selects the highest-probability token at each step), beam search is commonly used; it maintains a fixed number of partial hypotheses (beam width B), expanding and pruning them based on cumulative log-probability to explore more likely sequences. Experiments on machine translation tasks demonstrate that beam search with B = 12 can yield substantial gains; for an ensemble of five models, it improves the BLEU score from 33.0 (greedy decoding) to 34.81 on the WMT'14 English-to-French task. A key challenge in this setup is exposure bias, arising from the mismatch between training (where errors do not propagate due to teacher forcing) and inference (where compounding errors from model predictions degrade performance over long sequences). This discrepancy causes the model to be exposed only to ground-truth prefixes during training, leading to suboptimal robustness when generating from its own outputs.
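The following Python sketch illustrates the beam search procedure described above; `log_prob_fn` is a hypothetical stand-in for the decoder, taking a prefix of token ids and returning (token, log-probability) pairs for candidate next tokens. Setting the beam width to 1 recovers greedy decoding.

```python
# Illustrative beam search (not the original implementation; the decoder is a hypothetical callable).
def beam_search(log_prob_fn, start_id, end_id, beam_width=4, max_len=20):
    beams = [([start_id], 0.0)]                        # each beam: (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                      # completed hypotheses are set aside
                finished.append((seq, score))
                continue
            for tok, logp in log_prob_fn(seq):         # expand with candidate next tokens
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        # prune to the top-B partial hypotheses by cumulative log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]        # best-scoring sequence found
```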

Training and Inference Differences

Seq2seq models are trained using maximum likelihood estimation, where the objective is to maximize the conditional log-likelihood of the target sequence given the input sequence, typically formulated as minimizing the cross-entropy loss \mathcal{L} = -\sum_t \log P(y_t \mid y_{<t}, x), with the sum taken over the target tokens y_t. This loss is computed autoregressively, but during training the decoder receives the ground-truth previous tokens as input—a technique known as teacher forcing—to accelerate convergence and avoid error propagation from early predictions. In contrast, inference in seq2seq models involves autoregressive decoding without access to ground-truth targets, where the decoder generates each token conditioned only on the model's previous outputs, leading to potential error accumulation over long sequences—a phenomenon called exposure bias that can degrade performance relative to training conditions. To mitigate the suboptimal greedy selection of the most probable token at each step, strategies like beam search are employed, maintaining a fixed-width beam of k hypotheses and exploring the top-k most likely extensions at each step to find a higher-probability output sequence overall. This process is computationally more intensive than greedy decoding, as it requires evaluating multiple partial sequences in parallel, often resulting in longer inference times, especially for larger beam widths. A key distinction arises from the absence of ground truth during inference, necessitating extrinsic evaluation metrics such as BLEU, which measures n-gram overlap between generated and reference sequences to approximate quality without human judgment. Additionally, to stabilize training of the underlying recurrent components, techniques like gradient clipping are applied, capping the gradient norm (e.g., at 5) to prevent exploding gradients that could destabilize optimization.
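A single teacher-forced training step can be sketched as follows, reusing the hypothetical encoder and decoder classes from the earlier sketch; the padding id, clipping threshold, and batch shapes are assumptions chosen for illustration.

```python
# One teacher-forced training step with cross-entropy loss and gradient clipping (sketch).
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src, tgt, pad_id=0, clip_norm=5.0):
    optimizer.zero_grad()
    state = encoder(src)                               # encode the source batch into (h, c)
    logits, _ = decoder(tgt[:, :-1], state)            # teacher forcing: feed the ground-truth prefix
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),           # (batch * steps, vocab_size)
        tgt[:, 1:].reshape(-1),                        # targets are the next ground-truth tokens
        ignore_index=pad_id,                           # ignore padding positions
    )
    loss.backward()
    params = list(encoder.parameters()) + list(decoder.parameters())
    torch.nn.utils.clip_grad_norm_(params, clip_norm)  # cap the gradient norm (e.g., at 5)
    optimizer.step()
    return loss.item()
```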

Attention Integration

Role of Attention in Seq2seq

In vanilla sequence-to-sequence (seq2seq) models, the encoder compresses the entire input into a single fixed-length context vector, which serves as the sole source of information for the decoder during generation. This approach creates a significant information bottleneck, particularly for long sequences, as the encoder must encode all relevant details into a limited representation, leading to information loss and degraded performance as input length increases. To address this limitation, attention mechanisms were introduced to enable the decoder to dynamically focus on different parts of the input sequence at each generation step, rather than relying on a static context vector. In their seminal work, Bahdanau et al. (2014) proposed an additive attention mechanism specifically for neural machine translation, where the decoder computes alignment scores between the current hidden state and all encoder annotations to weigh their contributions selectively. This allows the model to emphasize relevant input elements, such as specific words or phrases, improving alignment between source and target sequences. The integration of attention involves the decoder attending to the full set of encoder hidden states—representations produced by the encoder for each input position—at every decoding timestep. By forming a context vector as a weighted sum of these states, the model handles long sequences more effectively, avoiding the need to propagate all input information through deep recurrent layers solely via the initial context vector. This enhancement yields several benefits: it mitigates error propagation from the encoder by allowing direct retrieval of input information, and it produces interpretable soft-alignment weights that reveal plausible linguistic correspondences between input and output, aiding model analysis and debugging.

Computing Attention Weights

In sequence-to-sequence models, attention weights are computed to determine the relevance of each encoder hidden state to the current decoder state. The process begins with calculating alignment scores, which quantify the compatibility between the decoder's hidden state s_t at time step t and each encoder hidden state h_i for input positions i = 1 to T. In the additive mechanism introduced by Bahdanau et al., the score e_{ti} is computed as e_{ti} = v_a^T \tanh(W_a [s_t; h_i]), where W_a is a learnable weight matrix, v_a is a learnable vector, and [s_t; h_i] denotes concatenation. These raw scores are then normalized to form attention weights using the softmax function: \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^T \exp(e_{tj})}. The resulting weights \alpha_{ti} sum to 1 over i, enabling the computation of a context vector c_t = \sum_{i=1}^T \alpha_{ti} h_i, which aggregates a weighted sum of the encoder states to inform the decoder's output at step t. Alternative formulations exist for computing scores. For instance, the dot-product attention variant, proposed by Luong et al., simplifies the score to e_{ti} = s_t^T h_i (or a scaled version thereof), which is computationally efficient and effective when encoder and decoder states lie in the same vector space. Multi-head attention extends these mechanisms by projecting queries, keys, and values into multiple subspaces and computing attention in parallel, allowing the model to jointly attend to information from different representation subspaces, as later developed in Transformer architectures. Attention weights are often visualized as heatmaps, where rows correspond to decoder time steps and columns to encoder positions, with color intensity representing \alpha_{ti} values; such visualizations reveal alignments between source and target sequences, such as focusing on specific input words during translation.
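The computation can be sketched in a few lines of NumPy for a single decoding step; the state dimensions and random parameters below are toy assumptions, not values from the original models.

```python
# Additive (Bahdanau-style) attention for one decoder step: scores -> softmax -> context vector.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                 # subtract the max for numerical stability
    return e / e.sum()

def additive_attention(s_t, H, W_a, v_a):
    """s_t: decoder state; H: encoder states stacked as rows (T, d); returns (weights, context)."""
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_t, h_i])) for h_i in H])
    alpha = softmax(scores)                 # attention weights alpha_ti, summing to 1 over positions
    context = alpha @ H                     # c_t = sum_i alpha_ti * h_i
    return alpha, context

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))                 # five encoder hidden states of dimension 8
alpha, c_t = additive_attention(rng.normal(size=8), H,
                                rng.normal(size=(16, 16)), rng.normal(size=16))
```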

Applications and Extensions

Machine Translation

Sequence-to-sequence (seq2seq) models were initially developed for machine translation as an end-to-end approach, directly mapping source-language sentences to target-language outputs without relying on intermediate phrase alignments or hand-crafted rules. This paradigm, introduced in 2014, outperformed traditional phrase-based statistical machine translation (SMT) systems, which dominated prior benchmarks. For instance, on the WMT'14 English-to-French task, an ensemble of deep LSTM-based seq2seq models achieved a BLEU score of 34.8, surpassing the phrase-based SMT baseline of 33.3. The integration of attention mechanisms further enhanced performance by allowing the decoder to focus on relevant parts of the input sequence, yielding a BLEU score of 28.45 on the same dataset with an extended training regime.

In 2016, Google adopted neural machine translation based on seq2seq architectures with attention in its Google Translate service, marking a widespread commercial deployment. The Google Neural Machine Translation (GNMT) system, utilizing deep LSTM layers, reduced translation errors by 60% relative to phrase-based systems on English-to-French, English-to-Spanish, and English-to-Chinese pairs. This transition improved fluency and accuracy, establishing seq2seq as the foundation for production-scale translation systems.

Seq2seq models for machine translation are commonly evaluated on Workshop on Machine Translation (WMT) datasets, which provide standardized corpora for language pairs like English-French and English-German, using the BLEU metric to measure n-gram overlap with human references. Early attention-augmented seq2seq models achieved BLEU scores exceeding 28 on English-to-French tasks, setting new standards that phrase-based methods struggled to match. Later implementations, such as GNMT, pushed BLEU scores to 38.95 on WMT'14 English-to-French, demonstrating the scalability of seq2seq to larger datasets and deeper networks.

The evolution of seq2seq in machine translation progressed from basic RNN and LSTM architectures to optimized toolkits like fairseq, developed by Facebook AI Research, which facilitates efficient training of seq2seq models including LSTMs and convolutions for translation tasks. A key advancement addressed the challenge of rare words, which comprise up to 20% of the vocabulary in some corpora and often lead to unknown-token issues in fixed-vocabulary models. Byte pair encoding (BPE) subword units resolve this by decomposing rare and out-of-vocabulary words into frequent subword segments, enabling open-vocabulary translation; for example, on WMT 2015 English-to-German, BPE improved BLEU by 0.5 points and rare word accuracy (unigram F1) from 36.8% to 41.8%.

A notable case study in seq2seq translation involves visualizing attention-derived alignments, which reveal how the model associates source and target words. In the Bahdanau attention model, soft-alignment weights are depicted as grayscale matrices, illustrating alignments, including non-monotonic ones such as mapping "European Economic Area" in English to "zone économique européenne" in French, providing interpretability into the translation process. This visualization underscores attention's role in producing coherent, context-aware translations.
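To illustrate the subword idea, the toy Python sketch below learns a handful of BPE merge operations by repeatedly joining the most frequent adjacent symbol pair, in the spirit of Sennrich et al.; the miniature word-frequency dictionary and the number of merges are illustrative assumptions.

```python
# Toy byte-pair-encoding merges over a tiny word-frequency dictionary (illustrative only).
from collections import Counter

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq                      # count adjacent symbol pairs, weighted by frequency
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(pair, vocab):
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the chosen pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

# words pre-split into characters, with "</w>" marking the end of a word
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                                    # learn 10 merge operations
    pair = most_frequent_pair(vocab)
    if pair is None:
        break
    vocab = merge_pair(pair, vocab)
print(vocab)                                           # frequent subwords such as "est</w>" emerge
```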

Speech Recognition and Other NLP Tasks

Seq2seq models have been adapted for speech recognition by treating the problem as mapping acoustic input sequences, such as mel-frequency cepstral coefficients, to output sequences of characters or words. A seminal approach is the Listen, Attend and Spell (LAS) model, which employs an encoder-decoder architecture with attention to directly transcribe speech utterances without explicit alignment, achieving end-to-end learning from audio to text. This contrasts with connectionist temporal classification (CTC), an earlier alignment-free method that uses a recurrent network to compute probabilities over output labels at each time step, often combined with a language model for decoding, but lacking an attention mechanism for focusing on relevant audio segments. Attention-based decoders in seq2seq, as in LAS, generally outperform CTC in handling variable-length inputs and improving transcription accuracy on large-vocabulary tasks by dynamically weighting acoustic features.

Beyond speech, seq2seq architectures power other natural language processing tasks, such as abstractive text summarization, where an encoder processes the input document to produce a fixed representation, and the decoder generates a concise abstract by attending to key parts of the source text. A foundational model for this is the neural attention-based abstractive summarizer, which frames summarization as sequence transduction and demonstrates superior fluency and informativeness over extractive methods on datasets like the Gigaword corpus. In dialogue systems, seq2seq models enable response generation by encoding conversational context as input sequences and decoding coherent replies, as exemplified by the neural conversational model trained on multi-turn dialogues, which captures utterance dependencies to produce contextually relevant outputs.

Seq2seq extends to multimodal and non-text domains, including image captioning, where a convolutional neural network (CNN) encoder extracts visual features from images, feeding them into an RNN decoder to generate descriptive captions. The Show and Tell model pioneered this encoder-decoder paradigm for vision-to-language tasks, attaining state-of-the-art performance on benchmarks like MSCOCO, with later attention-based variants learning to align image regions with words. For time-series forecasting, seq2seq models encode historical data sequences to predict future values, providing a flexible framework for multivariate predictions that outperforms traditional autoregressive methods in capturing long-range dependencies. Modern advancements integrate seq2seq with pre-trained language models, such as T5, which unifies diverse tasks under a text-to-text format using an encoder-decoder Transformer, pre-trained on large corpora and fine-tuned to enhance performance in summarization, translation, and beyond while preserving the core seq2seq structure.
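As a usage-level example of the last point, the snippet below runs abstractive summarization with a pre-trained T5 checkpoint through the Hugging Face transformers library; the specific checkpoint name, prompt text, and generation settings are illustrative choices, not prescriptions.

```python
# Summarization with a pre-trained encoder-decoder model (T5) via Hugging Face transformers.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

article = "Sequence-to-sequence models map an input sequence to an output sequence ..."
# T5 casts every task as text-to-text, so the task is signalled with a textual prefix.
inputs = tokenizer("summarize: " + article, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=40, num_beams=4)   # beam-search decoding
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```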

References

  1. Sequence to Sequence Learning with Neural Networks. arXiv, Sep 10, 2014.
  2. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv, Sep 1, 2014.
  3. Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. Mar 5, 2017.
  4. Extended Translation Models in Phrase-based Decoding.
  5. Recurrent Neural Network Based Language Model.
  6. Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/pdf/1409.0473.pdf
  7. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv, Jun 3, 2014.
  8. Multi-task Sequence to Sequence Learning. Google Research.
  9. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing.
  10. Long Short-Term Memory. Neural Computation, Nov 15, 1997.
  11. Learning Long-Term Dependencies with Gradient Descent is Difficult.
  12. Sequence to Sequence Learning with Neural Networks. arXiv, Dec 14, 2014.
  13. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Jun 1, 1989.
  14. BLEU: a Method for Automatic Evaluation of Machine Translation.
  15. On the Difficulty of Training Recurrent Neural Networks. arXiv, Feb 16, 2013.
  16. Effective Approaches to Attention-based Neural Machine Translation. Aug 17, 2015.
  17. Attention Is All You Need. arXiv:1706.03762, Jun 12, 2017.
  18. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Sep 26, 2016.
  19. Neural Machine Translation of Rare Words with Subword Units. arXiv, Aug 31, 2015.
  20. Listen, Attend and Spell. arXiv:1508.01211, Aug 5, 2015.
  21. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.
  22. A Neural Attention Model for Abstractive Sentence Summarization. arXiv, Sep 2, 2015.
  23. A Neural Conversational Model. arXiv:1506.05869, Jun 19, 2015.
  24. Show and Tell: A Neural Image Caption Generator. arXiv:1411.4555, Nov 17, 2014.
  25. Foundations of Sequence-to-Sequence Modeling for Time Series. May 9, 2018.