Neural machine translation
Neural machine translation (NMT) is an end-to-end approach to machine translation that employs deep neural networks to map a source-language sequence directly to a target-language sequence, modeling the entire translation process within a single trainable model.[1] Unlike earlier statistical machine translation (SMT) systems, which relied on separate modules for language modeling, alignment, and decoding using discrete phrase-based rules, NMT uses continuous vector representations to capture contextual dependencies and semantic nuances across sentences.[1] This paradigm shift enables more fluent and accurate translations by learning hierarchical representations from large parallel corpora without extensive feature engineering.[2]
The foundations of NMT were laid in the early 2010s with recurrent neural network (RNN)-based encoder-decoder architectures, which encode input sequences into fixed-length vectors from which outputs are decoded.[3] A pivotal advancement came in 2014 with the introduction of the attention mechanism, which allows the decoder to dynamically focus on relevant parts of the source sequence, addressing limitations in handling long sentences and improving alignment between source and target words.[4] By 2017, the Transformer architecture revolutionized NMT by replacing RNNs entirely with self-attention mechanisms, enabling parallelization, faster training, and superior performance on diverse language pairs through scaled dot-product attention and multi-head configurations.[5]
NMT systems have since become the state of the art in machine translation, powering applications such as Google Translate and producing highly fluent output for high-resource languages, though challenges persist in low-resource scenarios and for morphologically complex languages.[1] Key advantages include better generalization from data, reduced error propagation compared with SMT pipelines, and integration with techniques such as beam search decoding[2] and back-translation for data augmentation.[6] Ongoing research focuses on efficiency, multilingual capabilities, and robustness to domain shifts, with benchmarks such as BLEU showing consistent gains over earlier approaches.[1]
Fundamentals
Overview
Neural machine translation (NMT) is an end-to-end deep learning approach to automated translation that directly maps a source language sequence to a target language sequence using neural networks, without relying on intermediate linguistic representations or hand-crafted rules.[3] This paradigm formulates translation as the task of maximizing the conditional probability P(y \mid x), where x is the source sentence and y is the target sentence, learned from vast parallel corpora containing aligned sentence pairs.[7] By training on such data, NMT systems capture complex linguistic patterns, producing translations that are more fluent and contextually appropriate than those from earlier methods.
At its core, NMT operates on principles of probabilistic sequence modeling through neural architectures, enabling the system to infer semantic and syntactic relationships across languages.[7] The models learn distributed representations of words and sentences, allowing generalization to unseen inputs and handling of variable-length sequences, which facilitates context-aware translations that consider the entire source text rather than isolated fragments.
In contrast to traditional machine translation paradigms such as statistical machine translation (SMT), which decompose the process into modular pipelines with separate models for alignment, phrase extraction, and language modeling, NMT unifies these steps into a single differentiable network trained jointly via backpropagation. SMT systems, often phrase-based, rely on probabilistic alignments and bilingual dictionaries derived from parallel data but struggle with long-range dependencies and reordering due to their reliance on fixed phrase units.[8] This end-to-end nature of NMT reduces the need for linguistic expertise in system design and has led to superior translation quality across many language pairs.[7]
To illustrate, consider translating the English sentence "Hello, how are you?" into French. An NMT system processes the input sequence through its neural layers to generate the output "Bonjour, comment allez-vous?", preserving politeness and structure by modeling the probabilistic flow from source to target without explicit phrase matching. Typically, this involves an encoder-decoder framework in which the source is encoded into a contextual representation and the target is decoded autoregressively, with attention mechanisms aiding in aligning relevant parts of the input.[4]
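As a concrete and purely illustrative instance of this factorization, the probability the model assigns to the French output above decomposes over its output tokens; the segmentation shown is hypothetical, since real systems operate on learned subword units:
\begin{align*}
P(y \mid x) &= \prod_{t=1}^{T} P(y_t \mid y_{<t}, x) \\
&= P(\text{Bonjour} \mid x) \cdot P(\text{,} \mid \text{Bonjour},\, x) \cdot P(\text{comment} \mid \text{Bonjour ,},\, x) \cdots P(\text{?} \mid y_{<T},\, x).
\end{align*}
Decoding then amounts to searching for the output sequence that maximizes this product, typically with beam search rather than exhaustive enumeration.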
Key Components
Neural machine translation (NMT) systems rely on a core encoder-decoder architecture to process input sequences and generate output translations. The encoder transforms the source language sequence into a set of contextual representations, typically in the form of hidden states from recurrent or transformer layers, which capture the semantic and syntactic information of the input.[3] These representations serve as the foundation for the decoder, which autoregressively produces the target sequence word by word, conditioning each output token on the previously generated tokens and the encoder's outputs.[3] This framework marked a shift from earlier statistical models by enabling end-to-end learning directly from source to target texts.[3]
A fundamental component preceding the encoder and decoder is the embedding layer, which maps discrete tokens from the vocabulary into dense, continuous vector representations. These embeddings allow neural networks to perform algebraic operations on linguistic elements, facilitating the capture of semantic similarities between words or subwords.[3] In NMT, embeddings are typically learned during training alongside other model parameters, enabling the system to adapt representations to the translation task.
To manage vocabulary size and address out-of-vocabulary (OOV) issues, NMT models employ subword tokenization techniques such as Byte-Pair Encoding (BPE), which break words into smaller, frequent units. This approach handles rare words by decomposing them into known subword components, reducing the vocabulary to tens of thousands of units while maintaining coverage for diverse languages and morphologies.[9] For instance, a rare compound word like "unhappiness" might be tokenized as "un", "happi", and "ness", allowing the model to generalize translations based on compositional patterns.
Encoders in NMT often use bidirectional processing to access both past and future context in the source sequence, enhancing representation quality through forward and backward passes that are concatenated. In contrast, decoders operate unidirectionally to simulate real-time generation, processing only preceding tokens to prevent information leakage from the target sequence during training or inference. Attention mechanisms further align these encoder and decoder states; their detailed mechanics are covered in the Attention Mechanisms section below.
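Returning to the subword example above, the following Python sketch shows BPE-style segmentation applied to a single word at inference time. The merge table and the apply_bpe helper are hypothetical stand-ins for the tens of thousands of merges a real system learns from its training corpus:

# Illustrative BPE segmentation: greedily apply learned merges to one word.
# The merge table below is hypothetical; real systems learn it from data.
MERGES = [("u", "n"), ("h", "a"), ("ha", "p"), ("hap", "p"), ("happ", "i"),
          ("n", "e"), ("ne", "s"), ("nes", "s")]

def apply_bpe(word, merges):
    symbols = list(word)                      # start from individual characters
    for left, right in merges:                # apply merges in learned priority order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]   # fuse the adjacent pair
            else:
                i += 1
    return symbols

print(apply_bpe("unhappiness", MERGES))       # ['un', 'happi', 'ness'] with this table

Because any word can, in the worst case, be segmented all the way down to characters, no input token is ever truly out of vocabulary under such a scheme.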
Historical Development
Early Neural Approaches
Early explorations into neural methods for machine translation during the 2000s primarily involved shallow feedforward neural networks and word-based models aimed at phrase-level translation tasks, often as supplements to rule-based or statistical systems. These initial experiments focused on learning distributed representations of words to improve probability estimates for translation candidates, rather than end-to-end translation. For instance, researchers experimented with neural networks to model lexical mappings and basic phrase substitutions, demonstrating modest gains in handling local word order variations compared to purely symbolic approaches. However, these models were constrained by their shallow architectures and inability to capture broader contextual dependencies, limiting their applicability to short phrases or isolated components of the translation pipeline.[10]
Earlier, in the 1990s, recurrent neural networks (RNNs) had emerged as a key advancement for language modeling, with applications extending to machine translation through improved alignment modeling. Pioneering work by Elman introduced simple RNNs to discover syntactic structure in sequential data, laying groundwork for using recurrent architectures to predict word alignments in bilingual corpora. These RNN-based language models were adapted for MT by estimating conditional probabilities that aided in aligning source and target words, outperforming traditional n-gram models in capturing sequential dependencies for alignment tasks. Despite these contributions, the models struggled with long-range dependencies due to vanishing gradients, restricting their effectiveness to short sequences in translation alignments.
The period from 2003 to 2010 produced the first dedicated neural contributions to machine translation, emphasizing neural probabilistic models for reordering and alignment within statistical frameworks. Bengio et al. introduced a feedforward neural probabilistic language model that learned continuous word representations, which was subsequently integrated into statistical MT systems to enhance fluency and rerank translation hypotheses, yielding BLEU score improvements of up to 0.7 points on French-English tasks. Building on this, Schwenk developed continuous-space language models using neural networks to replace discrete n-gram models in SMT, achieving perplexity reductions and translation quality gains through more expressive probability distributions for target phrases. Additional efforts applied similar neural models to reordering, such as predicting orientation probabilities (monotone, swap, discontinuous) based on surrounding context, which improved handling of syntactic divergences in languages like English-Japanese by 1-2 BLEU points in phrase-based systems. For alignment, neural probabilistic approaches modeled soft alignments via distributed features, outperforming IBM models in low-resource settings by better generalizing rare word pairs.[10]
A primary limitation of these early neural approaches was their reliance on fixed-length vector representations, which compressed variable-length input sequences into uniform dimensions, leading to information loss for longer phrases or sentences. This bottleneck hindered scalability to full-sentence translation, as the models could not effectively encode extended contexts without truncation or padding, resulting in degraded performance on structurally complex inputs. These constraints underscored the need for more flexible architectures capable of processing sequences of arbitrary length, paving the way for subsequent advancements in sequence modeling.[11]
Sequence-to-Sequence Models
The sequence-to-sequence (seq2seq) paradigm emerged in 2014 as a transformative framework for neural machine translation, shifting from rule-based or statistical methods to fully differentiable, end-to-end neural architectures capable of mapping variable-length input sequences to variable-length outputs. Independently introduced by Sutskever et al. and Cho et al., these models utilized recurrent neural networks (RNNs) in an encoder-decoder structure, with the encoder employing LSTM or GRU variants to process the source sequence and the decoder generating the target sequence autoregressively.[3][12] This design enabled direct optimization of translation quality through backpropagation across the entire system, bypassing the need for explicit alignment or phrase extraction common in prior approaches.[3][12]
Central to the seq2seq approach is the encoder's role in compressing the input sequence of arbitrary length into a fixed-dimensional context vector, which encapsulates the semantic essence of the source for the decoder to condition upon during output generation. This fixed intermediate representation allows the model to handle inputs and outputs of differing lengths without predefined alignments, a flexibility that proved essential for natural language tasks like translation.[3][12] By training on parallel corpora, the system learns to represent sequences in a continuous vector space, facilitating smoother handling of syntactic and semantic variations across languages.[12]
Initial deployments of seq2seq models emphasized hybrid integration with phrase-based statistical machine translation (SMT) systems, leveraging the neural component to refine phrase scoring or re-ranking for better fluency and adequacy. Cho et al. specifically applied the RNN encoder-decoder to learn dense representations of phrases, which were then incorporated into SMT pipelines to outperform baseline phrase tables on tasks like English-to-French translation.[12] These hybrids demonstrated seq2seq's potential as a complementary tool, bridging neural expressiveness with SMT's robustness in low-data scenarios.[12]
The seq2seq framework catalyzed rapid progress in neural machine translation, enabling purely neural systems to rank among the top submissions in WMT 2015 shared tasks and to broadly surpass SMT performance on WMT benchmarks around 2016. This milestone, achieved through scaled-up training on large corpora, underscored seq2seq's scalability and its role in establishing neural methods as viable alternatives to decades-old statistical paradigms.
Despite these advances, seq2seq models faced a critical limitation in the fixed-size context vector, which acts as an information bottleneck for long sentences, often resulting in loss of distant dependencies and reduced translation accuracy for extended inputs. This issue prompted further refinements, most notably the integration of attention mechanisms to dynamically access input representations and mitigate the bottleneck (detailed in Attention Mechanisms).
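The fixed-context design can be sketched in a few lines of PyTorch. This is an illustrative reconstruction under assumed hyperparameters (GRU cells, 256-dimensional embeddings, 512-dimensional hidden states), not the configuration used by Sutskever et al. or Cho et al.:

import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal RNN encoder-decoder: the whole source sentence is compressed
    into the encoder's final hidden state (the fixed context vector)."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode: only the final hidden state is kept as the context vector.
        _, context = self.encoder(self.src_emb(src_ids))
        # Decode: every target step is conditioned on that single vector
        # (teacher forcing: the ground-truth prefix is fed during training).
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_states)  # logits over the target vocabulary

Note that only context, the encoder's final hidden state, ever reaches the decoder, which is precisely the bottleneck described above.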
Transformer Architecture
The Transformer architecture, introduced in the seminal 2017 paper "Attention Is All You Need" by Vaswani et al., marked a pivotal shift in neural machine translation by fully replacing recurrent and convolutional layers with a stack of self-attention and feed-forward components, establishing an attention-only paradigm. This innovation addressed key limitations of prior recurrent models, such as sequential computation bottlenecks, while maintaining or exceeding translation quality. The model relies entirely on attention mechanisms to capture dependencies, dispensing with recurrence to process input sequences in parallel.[5]
At its core, the Transformer uses an encoder-decoder structure. The encoder comprises a stack of six identical layers, each featuring a multi-head self-attention sub-layer, which lets the model jointly attend to information from different representation subspaces and weigh the importance of different words in the input sequence, followed by a position-wise feed-forward network that applies the same transformation to each position separately. The decoder mirrors this with six layers but incorporates three sub-layers: a masked multi-head self-attention that prevents attending to future positions during generation, a multi-head attention over the encoder outputs that aligns source and target representations, and a position-wise feed-forward network. Positional encodings are added to the input embeddings to incorporate sequence order, as the architecture lacks inherent recurrence. Residual connections and layer normalization are applied around each sub-layer for stable training.[5]
A primary advantage of this design is enhanced parallelization: the absence of sequential recurrence allows all positions to be computed simultaneously, leading to substantially faster training on GPUs compared with recurrent architectures. The original models reached state-of-the-art quality at a small fraction of the training cost of earlier architectures, with the large configuration converging in about 3.5 days on eight GPUs. On the WMT 2014 English-to-German benchmark, the Transformer attained a BLEU score of 28.4, outperforming prior best results, including ensemble systems, by more than 2 BLEU points and setting a new standard for single-model performance.[5]
The architecture's modularity and efficiency have facilitated scalability in subsequent neural machine translation systems, with variants scaling to over 6 billion parameters in massively multilingual models to improve translation across hundreds of languages. This scalability has underpinned broader adaptations, including in large language models for translation tasks.[5][13]
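A condensed sketch of one such encoder layer, written with PyTorch's built-in multi-head attention module; the dimensions follow the base configuration reported in the paper, but the code is an illustrative reconstruction rather than the reference implementation (which differs in details such as dropout placement and normalization order):

import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: multi-head self-attention followed by a
    position-wise feed-forward network, each wrapped in residual + layer norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # All positions attend to each other in parallel (no recurrence).
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.drop(attn_out))      # residual + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))   # position-wise FFN
        return x

The full encoder stacks six such layers; the decoder layer adds a masked self-attention sub-layer and an attention sub-layer over the encoder outputs, following the same residual pattern.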
Integration with Large Language Models
Following the introduction of large language models (LLMs) around 2020, neural machine translation (NMT) underwent a significant paradigm shift, with models like GPT-3 and T5 demonstrating zero-shot translation capabilities through pretraining on vast multilingual corpora.[14][15] GPT-3, pretrained on diverse internet text including multiple languages, exhibited emergent translation proficiency without task-specific fine-tuning, as evidenced by its ability to handle simple language pairs in few-shot prompts. Similarly, the T5 model, which casts all tasks in a unified text-to-text format, and its multilingual extension mT5, pretrained on the mC4 dataset spanning 101 languages, enabled zero-shot or few-shot translation by reformulating translation as text generation, outperforming earlier bilingual NMT systems on low-resource pairs in initial evaluations.[16] This pretraining approach leveraged shared representations across languages, allowing LLMs to infer translations from monolingual and parallel data implicitly present in the training corpus.
Fine-tuning strategies further integrated LLMs into NMT by adapting pretrained multilingual Transformers to specific translation tasks, enhancing performance on targeted language pairs. Models such as mBART, pretrained with denoising objectives on 25 languages, were fine-tuned on parallel corpora to achieve state-of-the-art results in multilingual NMT, with continued denoising during adaptation preserving cross-lingual transfer.[17] mT5 followed a similar path, with fine-tuning on supervised translation data improving BLEU scores compared to training from scratch, making it suitable for low-resource scenarios through continued pretraining on monolingual text.[16] These methods reduced the need for massive parallel data, enabling efficient adaptation via techniques like parameter-efficient fine-tuning (e.g., adapters), which update only a fraction of parameters while maintaining the model's broad linguistic knowledge.
Emergent abilities in LLMs, driven by scaling laws, revealed that larger models exhibit improved translation quality for low-resource languages without dedicated MT training, as performance gains follow predictable power-law relationships with model size and data volume. For instance, as model parameters exceeded 100 billion, LLMs like PaLM and BLOOM showed improvements in zero-shot translation for underrepresented languages, attributed to enhanced cross-lingual alignment from massive multilingual pretraining. By 2023, LLMs showed strong performance on benchmarks like WMT and FLORES-200, surpassing specialized NMT systems in some directions (e.g., 7 out of 13 in WMT23) but with mixed results overall, particularly in fluency for certain low-resource pairs.[18]
Advancements in 2024 and 2025 introduced mixture-of-experts (MoE) architectures to enable efficient multilingual MT within LLMs, routing inputs to specialized sub-networks for scalability. MoE-LLM frameworks, which sparsely activate experts per token, improved translation efficiency over dense LLMs while enhancing performance on multilingual benchmarks through targeted expert allocation for language-specific tasks.[19] Emerging hybrids combined NMT with diffusion models for iterative text generation, as in discrete diffusion approaches submitted to WMT 2024, which generate translations via noise addition and denoising, yielding competitive results on high-resource pairs with better diversity than autoregressive baselines.[20] Retrieval-augmented NMT hybrids, which integrate external monolingual or parallel retrieval, further enhanced low-resource translation in 2024-2025 by conditioning generation on retrieved segments, improving performance on pairs like English-Yoruba without full retraining.[21]
Architectures
Encoder-Decoder Framework
The encoder-decoder framework forms the foundational paradigm for neural machine translation (NMT), consisting of two interconnected neural network components that transform a source sequence into a target sequence. The encoder processes the input sequence to produce a compact representation, while the decoder leverages this representation to generate the output sequence iteratively. This architecture enables end-to-end learning directly from source-target pairs, bypassing traditional phrase-based alignments.[12][3]
In operation, the encoder, a recurrent neural network (RNN) or one of its variants, reads the source sentence token by token, updating a series of hidden states that progressively capture contextual information. For an input sequence of length T, the encoder computes hidden states h_1, h_2, \dots, h_T, where each h_t encodes the preceding tokens up to position t. The final hidden state h_T serves as a fixed-length summary of the entire source, though later enhancements allow the decoder to access all hidden states. The decoder, also typically an RNN, initializes its state from the encoder's output and generates the target sequence autoregressively: at each step, it predicts the next token conditioned on the previous tokens and the encoder's representations, producing tokens one at a time until an end-of-sequence marker is emitted. This step-by-step generation ensures the output respects the sequential dependencies of the target language.[12][3]
Autoregressive decoding is central to the framework, as the probability of the target sequence y_1, y_2, \dots, y_{T'} given the source x_1, x_2, \dots, x_T is modeled as P(y_1 \mid x) \prod_{j=2}^{T'} P(y_j \mid y_{<j}, x), where each conditional probability is computed by the decoder. This approach, originating from early sequence-to-sequence models, allows flexible handling of variable-length inputs and outputs but requires techniques like beam search during inference to mitigate error propagation.[3]
In multilingual NMT setups, the framework adapts by sharing components across languages to promote parameter efficiency and cross-lingual transfer. A shared encoder processes inputs from multiple source languages, producing representations that capture universal linguistic features, while language-specific decoders generate outputs tailored to each target language; alternatively, a fully shared encoder-decoder pair uses target-language indicators (e.g., special tokens) to route generation appropriately. Separate encoders and decoders per language pair, in contrast, avoid interference but scale poorly with the number of languages. The shared approach enables zero-shot translation between unseen language pairs by leveraging learned alignments.[22]
Consider translating the English phrase "Hello world" into French as "Bonjour le monde". The encoder processes "Hello" to form the initial hidden state h_1, then incorporates "world" to yield h_2, which summarizes the greeting's semantics. The decoder, starting from a begin-of-sequence token, uses these states to predict "Bonjour" as the first token, conditioned on the full representation; it then generates "le" based on "Bonjour" and the encoder states, followed by "monde", until completion. This flow demonstrates how hidden states bridge source understanding and target production.[3]
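The autoregressive factorization above translates directly into a generation loop. The following schematic Python sketch assumes a hypothetical model object exposing encode and decode_step methods (names chosen purely for illustration); production systems typically replace the greedy argmax with beam search:

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=100):
    """Generate target tokens one at a time, each conditioned on the
    source representations and all previously generated tokens."""
    enc_states = model.encode(src_ids)          # hypothetical encoder call
    output = [bos_id]
    for _ in range(max_len):
        # P(y_j | y_<j, x): distribution over the vocabulary for the next token
        next_probs = model.decode_step(enc_states, output)
        next_token = int(next_probs.argmax())   # greedy choice; beam search widens this
        output.append(next_token)
        if next_token == eos_id:                # stop at the end-of-sequence marker
            break
    return output[1:]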
Attention Mechanisms
Attention mechanisms represent a pivotal innovation in neural machine translation (NMT), enabling models to dynamically focus on relevant parts of the input sequence rather than relying solely on a fixed-length context vector.[4] Introduced to address the bottleneck in early encoder-decoder architectures, where all source information is compressed into a single vector, attention allows the decoder to softly weigh and align source elements during translation.[4] This soft alignment improves handling of long-range dependencies and variable-length inputs, leading to more accurate translations.[23]
One of the earliest attention variants is the additive attention proposed by Bahdanau et al. in 2014, which computes alignment weights using a feedforward neural network to score the relevance between decoder states and source hidden states.[4] Specifically, for a target hidden state h_t at time t and source hidden state h_s at position s, the alignment score is given by a_t(s) = v_a^\top \tanh(W_a h_s + U_a h_t), where W_a and U_a are weight matrices and v_a is a weight vector.[4] These scores are normalized via a softmax to form a context vector as a weighted sum of source states, enabling the model to jointly learn alignment and translation.[4]
Building on this, Luong et al. in 2015 introduced dot-product attention as a simpler alternative, using direct similarity scoring between source and target representations without additional nonlinear transformations.[23] The score is computed as the dot product \text{score}(h_s, h_t) = h_s^\top h_t, followed by softmax normalization to derive attention weights.[23] This approach, part of their global attention mechanism that attends to all source words, proved computationally efficient and effective, achieving state-of-the-art results on English-to-German translation tasks.[23]
A refined form, scaled dot-product attention, was later formalized by Vaswani et al. in 2017 to enhance stability in deeper models.[5] The mechanism takes a query matrix Q, key matrix K, and value matrix V, computing attention as
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V,
where d_k is the dimension of the keys.[5] The scaling factor \sqrt{d_k} counters the growth of dot-product magnitudes with increasing key dimension, which would otherwise push the softmax into regions with extremely small gradients and hinder learning.[5] This formulation allows efficient parallel computation via matrix operations, making it suitable for large-scale NMT.[5]
The benefits of attention mechanisms include dynamic weighting of source elements, which resolves the information bottleneck in vanilla sequence-to-sequence models by providing access to the full input context at each decoding step.[4] This leads to better performance on long sentences, as demonstrated by improvements in BLEU scores on benchmark datasets like WMT'14 English-to-French, where attention-augmented models outperformed non-attentive baselines by several points.[23]
To capture diverse relationships, multi-head attention extends the basic mechanism by performing attention in parallel across multiple subspaces, or "heads."[5] Each head projects Q, K, and V linearly into lower-dimensional subspaces, computes scaled dot-product attention independently, and the outputs are concatenated and transformed back to the original dimension.[5] This allows the model to attend to information from different representation subspaces simultaneously, enhancing expressiveness while keeping the total computational cost similar to that of single-head attention with full dimensionality.[5]
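The scaled dot-product formulation above can be transcribed directly into a few lines of NumPy. This is a self-contained, single-head sketch with toy dimensions chosen purely for illustration:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)         # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # alignment scores
    weights = softmax(scores, axis=-1)               # attention distribution over keys
    return weights @ V, weights

# Toy example: 3 decoder queries attending over 5 encoder states of width d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)  # (3, 8) (3, 5)

Multi-head attention runs this computation several times in parallel on lower-dimensional projections of Q, K, and V and concatenates the results.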
Positional Encoding and Variants
In neural machine translation models based on the Transformer architecture, whose self-attention mechanisms are inherently permutation-invariant, positional encodings are essential to inject information about the relative or absolute positions of tokens in the input sequence, thereby preserving order without recurrent structures.[5] This enables the model to capture sequential dependencies critical for translating source sentences into target languages, where word order significantly affects meaning.
The original Transformer model employs fixed sinusoidal positional encodings, defined as follows for a position \text{pos} and dimension i in a model of embedding dimension d_{\text{model}}:
\begin{align*} \text{PE}(\text{pos}, 2i) &= \sin\left( \text{pos} / 10000^{2i / d_{\text{model}}} \right), \\ \text{PE}(\text{pos}, 2i+1) &= \cos\left( \text{pos} / 10000^{2i / d_{\text{model}}} \right). \end{align*}
These encodings produce periodic signals with wavelengths spanning from 2\pi to 10000 \cdot 2\pi, allowing the model to represent positions using linear combinations that facilitate learning relative distances; for any fixed offset k, \text{PE}(\text{pos} + k, i) can be expressed as a linear function of \text{PE}(\text{pos}, i).[5] An alternative is learned positional embeddings, which are trainable parameters optimized alongside the model weights and have been shown to yield comparable performance to sinusoidal encodings in translation tasks.[5]
Variants such as relative positional encodings address limitations in handling longer contexts, as introduced in Transformer-XL, where encodings represent relative distances i - j between positions rather than absolute ones and are injected directly into the attention computation to maintain temporal coherence across segmented sequences.[24] This enables modeling dependencies up to 450% longer than standard Transformers during inference, which is particularly beneficial for NMT on long sentences or documents.[24]
More recent variants include Rotary Position Embeddings (RoPE), introduced in 2021, which encode absolute positions using rotation matrices applied to query and key vectors in self-attention, naturally incorporating relative positional information and improving length extrapolation in Transformer models.[25] RoPE enhances performance in sequence tasks like NMT by decaying inter-token dependencies with distance and supporting efficient relative attention without additional parameters. Another variant, Attention with Linear Biases (ALiBi), proposed in 2021, avoids explicit positional embeddings altogether by adding linear biases to attention scores that penalize distant positions, promoting better generalization to longer sequences in NMT and other applications.[26]
Overall, these encodings supply the order information that content-based attention otherwise lacks, a key enabler for effective sequence modeling in NMT.[5] Notably, the fixed nature of sinusoidal encodings supports some extrapolation to sequence lengths longer than those encountered during training, unlike learned embeddings, which are bounded by the trained range.
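A direct NumPy transcription of the sinusoidal definition above, given as an illustrative sketch (d_model is assumed even; the original implementation differs only in framework details):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # wavelengths 2*pi ... 10000*2*pi
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512); added elementwise to the token embeddings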
Training and Inference
Training Objectives
Neural machine translation (NMT) models are primarily trained using the cross-entropy loss function, which serves as the standard objective for maximum likelihood estimation in sequence-to-sequence learning. This loss measures the token-level prediction error by computing the negative log-likelihood of the target sequence given the source input, formalized as
\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x),
where y_t is the target token at position t, y_{<t} are the preceding target tokens, x is the source sequence, and p(\cdot) is the model's predicted probability distribution over the vocabulary obtained via a softmax.[3] This objective encourages the model to assign high probability to the correct target tokens autoregressively, aligning with the goal of generating fluent and accurate translations by optimizing conditional probabilities directly.[3]
While the cross-entropy loss operates at the token level, sequence-level objectives address the mismatch between training and evaluation metrics like BLEU by optimizing the entire output sequence holistically, often using reinforcement learning or risk minimization techniques.[27] For instance, minimum risk training samples multiple candidate translations during training and penalizes deviations from the reference based on sequence-level metrics, leading to better correlation with translation quality scores than pure token-level training.[27] More recent sequence-level approaches include direct preference optimization (DPO), which aligns model outputs with human preferences by optimizing pairwise comparisons without explicit reward modeling, showing improvements in translation quality on benchmarks as of 2025.[28] Token-level objectives nevertheless remain dominant because of their computational efficiency and stability under gradient-based optimization.
To mitigate overconfidence in predictions, which can lead to poor generalization, label smoothing is commonly applied by softening the one-hot target distribution during the cross-entropy computation. A small portion of probability mass (typically \epsilon = 0.1) is spread uniformly over the vocabulary, so the hard target y_k = 1 for the correct class k becomes 1 - \epsilon + \epsilon / V and every other class receives \epsilon / V, where V is the vocabulary size. Introduced into NMT to regularize softmax outputs, this technique has been shown to boost BLEU scores on benchmarks like WMT English-to-German by encouraging calibrated probabilities, even though it slightly worsens perplexity.
Supervised training of NMT models relies on large-scale bilingual corpora, such as those from the Workshop on Machine Translation (WMT), which provide parallel sentence pairs for source-target alignment. For data augmentation in low-resource scenarios, unsupervised objectives like back-translation leverage monolingual data by automatically translating it into the source language with an existing model and then treating the synthetic pairs as additional supervised examples to enhance translation performance.[29] Recent advancements incorporate contrastive losses to improve semantic alignment, for example by pulling positive translation pairs closer in embedding space while repelling negatives, yielding gains in fluency and accuracy on multilingual benchmarks as of 2024.[30]
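The label-smoothed token-level objective can be sketched in PyTorch as follows; the \epsilon value, tensor shapes, and helper name are illustrative rather than taken from any particular toolkit:

import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, targets, epsilon=0.1):
    """Cross-entropy over target tokens with softened one-hot targets:
    the correct class keeps 1 - eps + eps/V, the rest share eps uniformly."""
    log_probs = F.log_softmax(logits, dim=-1)          # model distribution p(y_t | y_<t, x)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # -log p(correct token)
    smooth = -log_probs.mean(dim=-1)                   # uniform component over the vocabulary
    return ((1.0 - epsilon) * nll + epsilon * smooth).mean()

# Toy usage: batch of 2 sentences, 5 target positions, vocabulary of 1000 subwords.
logits = torch.randn(2, 5, 1000)
targets = torch.randint(0, 1000, (2, 5))
print(label_smoothed_nll(logits, targets))

Setting epsilon to 0 recovers the plain cross-entropy objective \mathcal{L} given above.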