
BLEU

BLEU (Bilingual Evaluation Understudy) is an automatic metric for evaluating the quality of machine translation outputs by measuring their similarity to one or more human-generated reference translations, primarily through modified n-gram precision for sequences of 1 to 4 words combined with a brevity penalty to account for translation length differences. Developed to address the high cost and time demands of human evaluations, BLEU provides a quick, language-independent score ranging from 0 (no match) to 1 (perfect match), enabling efficient assessment of translation systems during development. Introduced in 2002 by researchers Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, BLEU was presented at the 40th Annual Meeting of the Association for Computational Linguistics (ACL) as a solution to the evaluation bottleneck in machine translation research. The method computes precision by counting overlapping n-grams between the candidate translation and references, clipping counts to avoid overcounting, and then applies a geometric mean across n-gram orders weighted equally, multiplied by the brevity penalty factor of \min(1, \exp(1 - r/c)), where r is the reference length and c is the candidate length. Tested on a corpus of approximately 500 sentences from news stories with 2–4 references per sentence, BLEU demonstrated high correlation with human adequacy and fluency judgments, achieving coefficients of 0.99 for monolingual evaluators and 0.96 for bilingual ones. BLEU's impact on natural language processing (NLP) has been profound, serving as a foundational metric that accelerated advancements in machine translation by allowing researchers to iterate rapidly without extensive human annotation. With nearly 20,000 citations by 2022, it remains a standard metric in shared tasks like the Conference on Machine Translation (WMT) and has influenced the development of related metrics such as ROUGE for summarization. Its simplicity and reliability across diverse languages have made it indispensable for comparing system performance, though scores can vary based on the number of references used. Despite its strengths, BLEU has limitations, such as sensitivity to reference diversity and potential underpenalization of fluent but reordered translations, prompting the emergence of complementary metrics like METEOR and BERTScore in modern evaluation practice. Nonetheless, BLEU continues to be widely adopted for its corpus-level reliability and role in driving progress toward more accurate multilingual translation systems.

Background

Introduction

BLEU, or Bilingual Evaluation Understudy, is an automatic metric for evaluating the quality of machine-generated translations by computing their overlap with reference translations using n-grams. This metric enables rapid assessment of translation performance without relying on extensive human judgment, which is time-consuming and expensive. The score ranges from 0 to 1, with higher values indicating greater similarity to the references, and it is designed to be language-independent, making it applicable across diverse linguistic pairs. Developed in 2001 and introduced in 2002 by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu at IBM, BLEU correlates well with human evaluations and has become a standard benchmark in the field. BLEU is extensively used in machine translation research, including the development of neural models, and in competitions like the annual Workshop on Machine Translation (WMT), where it serves as a primary automatic metric. While effective, it can be sensitive to the number of available reference translations, potentially underestimating quality when references are limited.

Historical Development

The BLEU metric was developed in 2001 at IBM's Thomas J. Watson Research Center by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. This work addressed the growing complexity of machine translation (MT) systems in the early 2000s, where manual human evaluations—assessing aspects like adequacy and fluency—were becoming increasingly time-consuming and costly due to the scale of data involved. The researchers sought an automated alternative that could approximate human judgments efficiently, drawing inspiration from fluency and adequacy but operationalizing them through n-gram overlap statistics to enable quick, language-independent assessments. The metric was first detailed in the seminal paper "BLEU: a Method for Automatic Evaluation of Machine Translation," presented at the 40th Annual Meeting of the Association for Computational Linguistics (ACL) in 2002. Initially employed internally at IBM for evaluating statistical MT prototypes in 2001, BLEU's public release marked a pivotal moment, providing a standardized, automated measure that correlated strongly with human evaluations while requiring minimal resources. Post-publication, BLEU rapidly gained traction as the de facto standard for MT evaluation. In 2002, the National Institute of Standards and Technology (NIST) adopted it as the official metric for its MT evaluation series under the DARPA TIDES program, enabling consistent benchmarking across systems and languages. This widespread use facilitated the field's shift from rule-based to statistical MT approaches, as BLEU's automated scoring supported rapid experimentation and progress tracking in data-driven models. By the mid-2000s, it had been integrated into key open-source tools like the Moses statistical MT toolkit, released in 2007, further solidifying its role in research and development workflows.

Mathematical Formulation

Setup and Notation

The BLEU (Bilingual Evaluation Understudy) metric evaluates the quality of a machine-generated translation by comparing it to one or more human-produced reference translations. In the standard setup, a single candidate translation C is assessed against m \geq 1 reference translations R_1, R_2, \dots, R_m, where each translation is treated as an unsegmented sequence of words following tokenization appropriate to the language. BLEU operates under key assumptions that ensure its broad applicability. It is designed to be language-independent, relying solely on the ability to tokenize text into words or subword units, without dependence on specific linguistic features like morphology or syntax. The metric focuses on n-gram overlaps, considering sequences of contiguous tokens up to order 4—that is, unigrams (n=1), bigrams (n=2), trigrams (n=3), and 4-grams (n=4)—as higher-order n-grams capture longer-range correspondences between the candidate and references. An n-gram is defined as a contiguous sequence of n tokens from a translation. The notation used in BLEU's formulation is as follows: C denotes the candidate translation, with |C| representing its total number of words (often denoted as c); r is the effective reference length, computed as the closest reference length to |C| for sentence-level evaluation, or as the sum over all sentences of the closest reference length to each candidate sentence for corpus-level evaluation. For any word sequence w (an n-gram), \mathrm{Count}(w) is the number of times w appears in C, while \mathrm{Count}_{\mathrm{clip}}(w) is the clipped count, defined as the minimum of \mathrm{Count}(w) and the maximum count of w across all reference translations, \max_i \mathrm{Count}_{R_i}(w). The vocabulary V encompasses all unique word sequences appearing in the candidate and reference translations relevant to the n-gram orders considered. Although BLEU can be computed at the sentence level, it is typically aggregated at the corpus level for greater statistical stability, summing clipped n-gram counts across all sentences in the corpus before normalization by the total number of n-grams in the candidate corpus. This corpus-level approach mitigates variability from short sentences and provides a more reliable overall score.
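To make the notation concrete, the following minimal Python sketch (using hypothetical example sentences and illustrative helper names, not a library API) extracts n-grams from tokenized text and computes the clipped counts \mathrm{Count}_{\mathrm{clip}}(w) defined above.

from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical candidate and references, already tokenized on whitespace.
candidate = "the the the cat sat on the mat".split()
references = ["the cat sat on the mat".split(),
              "there is a cat on the mat".split()]

# Clipped count of each candidate unigram: its count in the candidate,
# capped by the maximum count observed in any single reference.
n = 1
cand_counts = ngrams(candidate, n)
max_ref_counts = Counter()
for ref in references:
    for gram, count in ngrams(ref, n).items():
        max_ref_counts[gram] = max(max_ref_counts[gram], count)

clipped = {gram: min(count, max_ref_counts[gram])
           for gram, count in cand_counts.items()}
print(clipped)  # ('the',) is clipped from 4 in the candidate to 2, its maximum in any single reference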

N-gram Precision

The n-gram precision in BLEU is a modified precision metric that evaluates the overlap between n-grams in a candidate translation C and those in one or more reference translations R_1, R_2, \dots, R_m, focusing on content adequacy while penalizing overgeneration. For unigrams (n=1), the precision p_1 is defined as the ratio of the sum of clipped matching unigrams to the total number of unigrams in the candidate: p_1 = \frac{\sum_{\text{word} \in C} \text{Count}_{\text{clip}}(\text{word})}{\sum_{\text{word} \in C} \text{Count}(\text{word})}, where \text{Count}(\text{word}) is the number of times the word appears in C, and the clipping ensures that matches are not overcounted beyond what appears in the references. The clipping mechanism, \text{Count}_{\text{clip}}(\text{ngram}) = \min(\text{Count}_C(\text{ngram}), \max_i \text{Count}_{R_i}(\text{ngram})), limits the contribution of any n-gram in the candidate to the maximum count it attains in any single reference, thereby preventing the metric from rewarding excessive repetitions or hallucinations in the output. This modification addresses a key limitation of standard precision, which would otherwise inflate scores for translations that reuse the same words or phrases far more often than justified by the references. For instance, consider a candidate translation containing the word "the" three times, while the maximum occurrence in any reference is two; the clipped count for "the" would be two, reducing the numerator accordingly and lowering p_1 to penalize the overgeneration. This approach ensures that precision reflects faithful content overlap without bias toward verbose or repetitive outputs. The unigram precision generalizes to higher-order n-grams (n=2 to n=4) in a similar fashion: p_n = \frac{\sum_{\text{ngram} \in C} \text{Count}_{\text{clip}}(\text{ngram})}{\sum_{\text{ngram} \in C} \text{Count}(\text{ngram})}, where clipping is applied per n-gram type across all candidates and references, capturing both local word order (via bigrams and trigrams) and longer-range fluency (via 4-grams). The rationale for this clipping across multiple n-gram orders is to balance adequacy (unigrams) with fluency (higher n-grams), as empirical analysis showed that even single n-gram precisions distinguish good from poor translations, but combining them enhances robustness against overgeneration.
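A minimal sketch of the modified precision p_n, using the classic overgeneration example of a candidate that repeats "the" far beyond the references (the helper is illustrative, not a library function):

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram matches divided by total candidate n-grams."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Clip ceiling for each n-gram: its maximum count in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))  # 2/7 ≈ 0.286 rather than 7/7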

Brevity Penalty

The brevity penalty (BP) in BLEU is a multiplicative factor designed to penalize candidate translations that are shorter than the reference translations, addressing the tendency of raw n-gram precision to favor concise outputs with fewer opportunities for errors. It is defined as BP = \begin{cases} 1 & \text{if } |c| > r \\ \exp\left(1 - \frac{r}{|c|}\right) & \text{if } |c| \leq r \end{cases} where |c| is the length of the candidate translation (in words) and r is the effective length of the reference translation. The rationale for the penalty stems from the observation that unmodified precision scores can be artificially high for very short candidates, as they contain fewer n-grams that might mismatch the references, thus rewarding incomplete translations over those that better capture fluency and adequacy. By applying the penalty, BLEU balances content overlap with overall translation completeness, mimicking human evaluators' preference for outputs that are neither too terse nor excessively verbose. In practice, r is determined at the corpus level by selecting, for each candidate sentence, the reference sentence length closest to the candidate's length, then summing these values across all sentences to yield the effective reference length r; this "best match" approach avoids over-penalizing isolated short sentences while ensuring the penalty reflects aggregate brevity. The exponential form of the penalty for short candidates provides a smooth, non-zero reduction, yielding a factor of approximately 0.37 when |c| is half of r and preventing extreme suppression of scores for moderately concise corpora. For example, consider a candidate translation of length |c| = 10 words against a reference effective length r = 15; here, BP = \exp(1 - 15/10) = \exp(-0.5) \approx 0.607, reducing the overall score to discourage excessive brevity. This component was introduced in the original BLEU formulation to align automatic scores more closely with human adequacy judgments, achieving high correlation (e.g., 0.81–0.99) in empirical tests on English-French translations.
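The piecewise definition translates directly into a few lines of code; the sketch below reproduces the worked example from the text (the function name is illustrative):

import math

def brevity_penalty(candidate_len, effective_ref_len):
    """1 if the candidate is longer than the effective reference length,
    otherwise exp(1 - r/c), a smooth penalty for overly short output."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > effective_ref_len:
        return 1.0
    return math.exp(1.0 - effective_ref_len / candidate_len)

print(brevity_penalty(10, 15))  # exp(-0.5) ≈ 0.607, as in the example above
print(brevity_penalty(20, 15))  # 1.0: longer candidates are not penalized by BP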

Final Score Calculation

The final BLEU score aggregates the modified n-gram precisions p_n (for n = 1 to 4) and the brevity penalty BP into a single metric using the formula: \text{BLEU} = BP \times \exp\left( \sum_{n=1}^{4} w_n \log p_n \right) where w_n = 1/4 are uniform weights, p_n is the modified precision for n-grams (computed via clipping to avoid overcounting, as detailed in the n-gram precision section), and BP penalizes overly short translations (as derived in the brevity penalty section). This formulation employs a geometric mean of the p_n values, obtained by exponentiating the weighted average of their logarithms, to ensure balanced contribution across n-gram orders; the logarithmic aggregation produces a multiplicative combination that strongly penalizes weakness in any single p_n, reflecting the roughly exponential decay of n-gram precision as n increases. The uniform weighting scheme treats all n-gram orders equally in the standard implementation, though alternative non-uniform weights (e.g., higher w_n for longer n-grams to emphasize fluency) have been explored in extensions. BLEU scores are inherently in the range [0, 1] but are conventionally multiplied by 100 and reported as percentages (0–100) for interpretability in published evaluations. The score is computed at the corpus level, aggregating statistics across all sentences in the test set to provide a stable system-level assessment.
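As a sketch of the aggregation step (with hypothetical corpus-level statistics), the precisions and brevity penalty combine as follows; note that the unsmoothed score is zero whenever any p_n is zero, since \log 0 is undefined.

import math

def bleu_from_stats(precisions, candidate_len, effective_ref_len):
    """Combine modified precisions p_1..p_4 with the brevity penalty using
    uniform weights w_n = 1/4 (geometric mean via logarithms)."""
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed BLEU is zero if any p_n is zero
    if candidate_len > effective_ref_len:
        bp = 1.0
    else:
        bp = math.exp(1.0 - effective_ref_len / candidate_len)
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_avg)

# Hypothetical corpus-level precisions for n = 1..4 and corpus lengths.
print(round(100 * bleu_from_stats([0.75, 0.5, 0.4, 0.3], 110, 115), 1))  # ≈ 44.0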

Computation

Algorithm Overview

The computation of the BLEU score involves a multi-step process that measures n-gram precision between a candidate translation C and a set of reference translations \{R_1, \dots, R_m\}, adjusted by a brevity penalty to account for translation length. This procedure is applied at the corpus level for robust evaluation, aggregating statistics across multiple sentences. The first step entails tokenizing the candidate and reference texts into sequences of words. This includes normalization steps such as case folding, and for languages without explicit word boundaries (e.g., those using logographic scripts), appropriate segmentation is applied to identify word units. Multiple reference translations are prepared in the same way to enable comparison. Next, for each n-gram order n from 1 to 4, all contiguous n-grams are extracted from the candidate C and from each reference R_i. The count of each n-gram in C is then clipped to the maximum count of that n-gram observed in any single reference translation, preventing overcounting of n-grams that appear more frequently in the candidate than in the references. This clipping uses the highest frequency across the m references for each n-gram. The modified n-gram precision p_n for each n is calculated as the ratio of the sum of clipped n-gram counts to the total number of n-grams of order n in the candidate, aggregated over the entire corpus. Specifically, p_n = \frac{\sum_{\text{ngram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{ngram})}{\sum_{\text{ngram} \in C} \mathrm{Count}(\text{ngram})}. These precisions are computed separately for n=1,2,3,4. The brevity penalty (BP) addresses the tendency of unpenalized n-gram matching to favor short candidates. For each sentence, the effective reference length r is the length of the reference translation closest in length to the candidate; the corpus-level r is the sum of these per-sentence values, while c is the total candidate length across the corpus. The BP is then BP = \begin{cases} 1 & \text{if } c > r, \\ \exp(1 - r/c) & \text{if } c \leq r. \end{cases} This penalizes candidates that are shorter than the references, while overly long candidates are constrained by the precision terms rather than by BP. Finally, the BLEU score is aggregated as the brevity penalty multiplied by the geometric mean of the precisions, using equal weights for the four n-gram orders: \text{BLEU} = BP \cdot \exp\left( \frac{1}{4} \sum_{n=1}^{4} \ln(p_n) \right). For corpora with multiple sentences, all statistics (precisions and lengths) are pooled before this computation to yield a single score.
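The procedure above can be summarized in a self-contained corpus-level sketch (not a library implementation; ties in reference length are broken toward the shorter reference, a common but not universal convention):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu_sketch(candidates, references_per_candidate, max_n=4):
    """Pool clipped counts and lengths over all sentences before combining."""
    clipped_totals = [0] * max_n    # numerators of p_1..p_4, pooled over the corpus
    candidate_totals = [0] * max_n  # denominators of p_1..p_4
    cand_len, eff_ref_len = 0, 0

    for cand, refs in zip(candidates, references_per_candidate):
        cand_len += len(cand)
        # Effective reference length: the reference closest in length to the candidate.
        eff_ref_len += min((len(r) for r in refs),
                           key=lambda rl: (abs(rl - len(cand)), rl))
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for ref in refs:
                for gram, c in ngrams(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], c)
            clipped_totals[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            candidate_totals[n - 1] += sum(cand_counts.values())

    precisions = [c / t if t else 0.0 for c, t in zip(clipped_totals, candidate_totals)]
    if min(precisions) == 0.0:
        return 0.0
    bp = 1.0 if cand_len > eff_ref_len else math.exp(1.0 - eff_ref_len / cand_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyps = ["the cat is on the mat".split()]
refs = [["the cat is on the mat".split(), "there is a cat on the mat".split()]]
print(round(100 * corpus_bleu_sketch(hyps, refs), 1))  # 100.0 for an exact match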

Implementation Details

Computing the BLEU score requires careful preprocessing, particularly tokenization, which is language-specific and critical for accurate n-gram extraction. For English, standard tokenization splits text on whitespace and punctuation, while languages like Japanese may use tools such as MeCab for morphological analysis. In neural machine translation systems, subword units (e.g., via Byte-Pair Encoding) are common during training, but BLEU evaluation typically applies detokenization followed by standardized word-level tokenization to ensure comparability, as subword segmentation can inflate scores if not normalized. Several software libraries facilitate BLEU computation, building on the original implementation by the IBM researchers. The NLTK library in Python provides a flexible corpus_bleu function that handles multiple references and smoothing options. The Moses toolkit includes the multi-bleu.perl script for efficient scoring against multiple references. Modern implementations like the SacreBLEU package standardize the process, incorporating WMT tokenization and producing reproducible signatures to avoid discrepancies. For corpus-level evaluation, n-gram counts are summed across all sentences before computing precisions, rather than averaging sentence-level scores, to prevent length bias and ensure stable estimates. This aggregation aligns with BLEU's design as a corpus-level metric, where individual sentence scores can be unreliable due to n-gram sparsity. Reliable results typically require corpora of at least 1000 sentences, as smaller sets amplify variance from outliers. Edge cases must be handled explicitly to avoid undefined or misleading scores. An empty candidate translation yields a brevity penalty of 0 and thus a BLEU score of 0, as no n-grams can match. With a single reference (m=1), n-gram clipping is based solely on that reference's counts, potentially leading to harsher penalties for rare terms compared to multiple references, where the maximum count across references is used for clipping, allowing higher clipped counts. Reproducibility challenges arise primarily from tokenization variations, with different schemes causing score differences of up to 1.8 BLEU points, as reported in a 2018 study on preprocessing impacts. To mitigate this, best practices recommend standardized tokenizers (e.g., WMT's 13a scheme) and tools like SacreBLEU, which enforce consistent normalization and report full computation signatures.
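As a usage sketch with NLTK (assuming the package is installed; the example sentences are hypothetical), corpus_bleu takes one list of tokenized references per hypothesis and an optional smoothing function:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

hypotheses = ["the cat is on the mat".split()]
# One list of references per hypothesis; each reference is a list of tokens.
references = [["the cat is on the mat".split(),
               "there is a cat on the mat".split()]]

score = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(round(100 * score, 2))  # 100.0 here, since the hypothesis matches a reference exactly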

Evaluation

Correlation with Human Judgments

The original validation of BLEU demonstrated strong correlation with human judgments of translation quality. In the seminal 2002 study, BLEU achieved a correlation coefficient of 0.96 with bilingual human assessments emphasizing adequacy on a Chinese-to-English task involving 500 sentences from news articles, using multiple reference translations. For monolingual judgments focused on fluency and adequacy, the correlation reached 0.99, indicating BLEU's ability to closely track expert evaluations across different aspects of translation quality. Subsequent empirical studies have confirmed BLEU's consistent performance in correlating with human judgments, particularly at the system level in large-scale benchmarks. In the Workshop on Machine Translation (WMT) evaluations, BLEU has shown system-level agreement of around 0.7 with human ratings across various language pairs, demonstrating reliability in relative system ranking. For instance, in the NIST 2005 machine translation evaluation, BLEU generally ranked systems in the same order as human assessors for Arabic-to-English and Chinese-to-English tasks, providing a robust indicator of comparative quality despite limitations in fine-grained scoring. BLEU's strengths lie in its reliability for evaluating statistical machine translation systems, where it effectively detects improvements in n-gram overlap that align with enhanced translation quality. It excels at capturing modifications that boost local sequence matching, such as better word ordering and phrase reproduction, which human judges often prioritize in adequacy and fluency assessments. Additionally, BLEU is language-independent when applied to tokenized inputs, allowing consistent comparison across diverse linguistic structures without heavy reliance on morphology-specific adjustments. In the modern context of neural machine translation, including Transformer-based models, BLEU remains a standard baseline metric, showing moderate to high correlation with learned neural evaluators like BLEURT and COMET while serving as a quick proxy for human-aligned progress. For example, in WMT benchmarks involving neural systems, BLEU's system-level correlations with human judgments continue to support its use for tracking iterative improvements, even as neural metrics achieve superior absolute agreement.

Limitations and Criticisms

BLEU's reliance on n-gram matching primarily measures lexical overlap between candidate translations and references, but it largely disregards semantics, synonyms, and grammatical correctness. For instance, ungrammatical phrases like "cat the" could receive a high unigram score if they match a reference containing those words, even though such outputs lack grammaticality or meaning. This theoretical flaw leads to BLEU rewarding surface-level matches over meaningful translations, as synonyms or paraphrases are only accounted for if explicitly present in the reference set. The brevity penalty (BP) component introduces further issues by harshly penalizing translations shorter than their references, potentially undervaluing concise yet accurate outputs that align closely with human preferences. Additionally, BLEU's insensitivity to paraphrasing means equivalent translations using different wording receive lower scores unless the references include those variants, exacerbating its limitations in capturing adequacy. BLEU performs poorly on morphologically rich languages, where inflections and agglutinative structures inflate n-gram mismatches, leading to artificially low scores even for high-quality translations. It also struggles with unsegmented languages such as Chinese and Japanese without proper preprocessing, as tokenization assumptions favor space-separated scripts and reduce precision for word-boundary ambiguities. Empirically, BLEU exhibits low correlation with human judgments in specialized domains, such as speech translation, where Pearson correlations range from 0.1 to 0.2 due to disfluencies and prosodic elements not captured by n-gram matching. It also fails to reliably detect adequacy improvements in outputs exceeding 30–40 BLEU points, where scores saturate and small semantic gains go unmeasured. Early critiques highlighted BLEU's extreme sensitivity to reference translations, with score variations of up to 10 points from minor reference changes, undermining its reliability for system comparisons. Reproducibility concerns arose from inconsistent implementation details, such as tokenization and normalization, leading to non-comparable scores across studies until standardized reporting was advocated. In the era of neural machine translation, post-2020 analyses revealed BLEU's saturation at high quality levels, where top systems yield scores above 40 with minimal differentiation, prompting a shift toward learned metrics like COMET and BLEURT that better correlate with human assessments in these regimes. This transition underscores 2020s critiques of BLEU's inadequacy for evaluating fluent, context-aware neural outputs.

Variants

SacreBLEU

SacreBLEU is a standardized implementation of the BLEU metric designed to address inconsistencies in BLEU score reporting arising from variations in tokenization, smoothing techniques, and other preprocessing choices in machine translation evaluation. Developed by Matt Post in 2018, it enforces a fixed reference tokenizer and built-in handling of references and test sets to ensure reproducibility and comparability across studies. Key features of SacreBLEU include the use of the '13a' tokenizer, which mimics the WMT standard mteval-v13a.pl script for English-language evaluation, and exponential ('exp') smoothing as the default method to handle zero n-gram precision counts. It also generates a unique signature string that encapsulates all computation parameters for exact reproduction, such as "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0". This signature allows researchers to verify and replicate scores without ambiguity. Unlike the original BLEU implementation, which relies on user-provided tokenized references and allows flexible weights, SacreBLEU operates exclusively at the corpus level, requires plain text inputs (with internal tokenization), and fixes the n-gram weights to the standard uniform 0.25 for 1- to 4-grams. It accommodates subword units like those produced by Byte-Pair Encoding (BPE) by expecting detokenized input, ensuring consistent scoring after subword merging. SacreBLEU has been the standard for official evaluations in the Conference on Machine Translation (WMT) since its release, with the toolkit automatically downloading relevant test sets such as WMT 2014 en-de, where it reports a representative score of BLEU = 22.0 for baseline systems under default settings. Available as an open-source Python package installable via pip, it has facilitated widespread adoption in academic and industrial research. The primary benefits of SacreBLEU include significant reductions in score variance—up to 1.8 BLEU points across configurations—stemming from standardized preprocessing, thereby promoting fairer comparisons between models and minimizing discrepancies that could otherwise reach several points due to preprocessing differences. This enhances the reliability of BLEU as a benchmark metric in the field.
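A brief usage sketch of the sacrebleu Python API (as written for version 2.x; exact output and defaults may vary across versions), which takes detokenized plain-text input and exposes the reproducibility signature:

from sacrebleu.metrics import BLEU

bleu = BLEU()  # defaults: 13a tokenization, exponential smoothing, corpus-level scoring
hypotheses = ["The cat is on the mat."]
# sacrebleu expects one reference *stream* per inner list, each parallel to the hypotheses.
references = [["The cat is on the mat."],
              ["There is a cat on the mat."]]

result = bleu.corpus_score(hypotheses, references)
print(result.score)          # corpus BLEU on the 0-100 scale
print(bleu.get_signature())  # settings string for reproducible reporting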

Other Extensions

iBLEU, introduced in 2011, extends the BLEU metric through an interactive framework designed for debugging and scoring statistical machine translation systems. It visualizes n-gram precision breakdowns and enables users to incorporate judgments on adequacy (preservation of meaning) and fluency (naturalness of the output), allowing for targeted refinements in evaluation that align automated scores more closely with human assessments. This approach facilitates iterative improvements by highlighting discrepancies between system outputs and references, particularly useful for developers analyzing failure cases without altering the core BLEU computation. Hybrid metrics combining elements of BLEU and METEOR address some of BLEU's limitations in handling synonyms and semantic variations, though they represent integrations rather than pure BLEU extensions. These hybrids leverage BLEU's n-gram precision alongside METEOR's explicit word ordering and synonym matching to better capture lexical flexibility and improve correlation with human judgments on adequacy. For instance, comparative analyses demonstrate that such combinations yield higher agreement with fluency and adequacy ratings in diverse language pairs compared to standalone BLEU. Domain-specific adaptations of BLEU include case-sensitive variants, which preserve letter case in n-gram matching to better evaluate translations involving proper nouns or stylistic elements where case matters. Standard BLEU's case-insensitivity can overlook such errors, but case-sensitive implementations penalize these mismatches accordingly, leading to more precise scores in tasks such as biomedical or legal text translation. Additionally, enhancements to multi-reference handling optimize aggregation across multiple human references, such as using maximum clipping per n-gram to reduce bias from single references and improve robustness in varied scenarios. Direct non-learned modifications to BLEU often target computational stability, particularly through alternative smoothing techniques for handling zero n-gram matches in short or sparse outputs. Add-1 (Laplace) smoothing adds a count of 1 to all n-grams, providing a simple baseline that prevents zero precision but can over-smooth longer sequences; a minimal sketch appears below. In contrast, pyramid smoothing employs a hierarchical distribution of probability mass from observed higher-order n-grams to unobserved lower-order ones, preserving more accurate precision estimates for sentence-level evaluation. A systematic comparison showed pyramid and related methods providing improved correlation with human judgments for low-count sentences on WMT datasets. These adjustments maintain BLEU's n-gram core while enhancing reliability without relying on external models. Recent non-learned extensions include spBLEU (2022) and spBLEU-1K (2024), which standardize tokenization using SentencePiece models trained on multilingual corpora (over 1,000 sources for spBLEU-1K) to improve comparability across languages, particularly benefiting low-resource settings by reducing tokenizer-induced variance. While learned evolutions like BLEURT (2020) use embeddings for context-aware scoring and achieve superior correlation (up to 0.3 higher Pearson r with humans than BLEU on WMT), they diverge from BLEU's lineage by prioritizing neural representations over surface n-grams.
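A minimal sketch of the add-1 idea described above (an illustrative sentence-level variant, not any specific library's implementation; here smoothing is applied only to orders above unigrams):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu_add1(candidate, references, max_n=4):
    """Sentence-level BLEU with add-1 (Laplace) smoothing on higher-order
    precisions, so a single missing 4-gram does not zero out the score."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if n == 1:
            precisions.append(clipped / total if total else 0.0)
        else:
            precisions.append((clipped + 1) / (total + 1))  # add-1 smoothing
    if precisions[0] == 0.0:
        return 0.0
    c = len(candidate)
    r = min((len(x) for x in references), key=lambda l: abs(l - c))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the cat sat on the mat".split()
references = ["the cat is on the mat".split()]
print(round(100 * sentence_bleu_add1(candidate, references), 1))  # nonzero despite no 4-gram match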
