
BLEU

BLEU (Bilingual Evaluation Understudy) is an automatic metric for evaluating the quality of machine translation outputs by measuring their similarity to one or more human-generated reference translations, primarily through modified n-gram precision for sequences of 1 to 4 words combined with a brevity penalty to account for translation length differences. Developed to address the high cost and time demands of human evaluations, BLEU provides a quick, language-independent score ranging from 0 (no match) to 1 (perfect match), enabling efficient assessment of translation systems during development. Introduced in 2002 by researchers Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, BLEU was presented at the 40th Annual Meeting of the Association for Computational Linguistics (ACL) as a solution to the evaluation bottleneck in machine translation research. The method computes precision by counting overlapping n-grams between the candidate translation and references, clipping counts to avoid overcounting, and then applies a geometric mean across n-gram orders weighted equally, multiplied by the brevity penalty factor of \min(1, \exp(1 - r/c)), where r is the reference length and c is the candidate length. Tested on a corpus of approximately 500 sentences from news stories with 2–4 references per sentence, BLEU demonstrated high correlation with human adequacy and fluency judgments, achieving coefficients of 0.99 for monolingual evaluators and 0.96 for bilingual ones. BLEU's impact on natural language processing (NLP) has been profound, serving as a foundational metric that accelerated advancements in machine translation by allowing researchers to iterate rapidly without extensive human annotation. With nearly 20,000 citations by 2022, it remains a standard metric in shared tasks like the Conference on Machine Translation (WMT) and has influenced the development of related metrics such as ROUGE for summarization. Its simplicity and reliability across diverse languages have made it indispensable for comparing system performance, though scores can vary based on the number of references used. Despite its strengths, BLEU has limitations, such as sensitivity to reference diversity and potential underpenalization of fluent but reordered translations, prompting the emergence of complementary metrics like METEOR and BERTScore in modern evaluation practice. Nonetheless, BLEU continues to be widely adopted for its corpus-level reliability and role in driving progress toward more accurate multilingual translation systems.

Background

Introduction

BLEU, or Bilingual Evaluation Understudy, is an automatic metric for evaluating the quality of machine-generated translations by computing their overlap with reference translations using n-grams. This metric enables rapid assessment of translation performance without relying on extensive human judgment, which is time-consuming and expensive. The score ranges from 0 to 1, with higher values indicating greater similarity to the references, and it is designed to be language-independent, making it applicable across diverse linguistic pairs. Developed in 2001 and introduced in 2002 by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu at IBM, BLEU correlates well with human evaluations and has become a standard benchmark in the field. BLEU is extensively used in machine translation research, including the development of neural models, and in competitions like the annual Workshop on Machine Translation (WMT), where it serves as a primary automatic metric. While effective, it can be sensitive to the number of available reference translations, potentially underestimating quality when references are limited.

Historical Development

The BLEU metric was developed in 2001 at IBM's Thomas J. Watson Research Center by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. This work addressed the growing complexity of machine translation (MT) systems in the early 2000s, where manual human evaluations—assessing aspects like adequacy and fluency—were becoming increasingly time-consuming and costly due to the scale of data involved. The researchers sought an automated alternative that could approximate human judgments efficiently, drawing inspiration from fluency and adequacy but operationalizing them through n-gram overlap statistics to enable quick, language-independent assessments. The metric was first detailed in the seminal paper "BLEU: a Method for Automatic Evaluation of Machine Translation," presented at the 40th Annual Meeting of the Association for Computational Linguistics (ACL) in 2002. Initially employed internally at IBM for evaluating statistical MT prototypes in 2001, BLEU's public release marked a pivotal moment, providing a standardized, automated measure that correlated strongly with human evaluations while requiring minimal resources. Post-publication, BLEU rapidly gained traction as the de facto standard for MT evaluation. In 2002, the National Institute of Standards and Technology (NIST) adopted it as the official metric for its MT evaluation series under the DARPA TIDES program, enabling consistent benchmarking across systems and languages. This widespread use facilitated the field's shift from rule-based to statistical MT approaches, as BLEU's automated scoring supported rapid experimentation and progress tracking in data-driven models. By the mid-2000s, it had been integrated into key open-source tools like the Moses statistical MT toolkit, released in 2007, further solidifying its role in research and development workflows.

Mathematical Formulation

Setup and Notation

The BLEU (Bilingual Evaluation Understudy) metric evaluates the quality of a machine-generated translation by comparing it to one or more human-produced reference translations. In the standard setup, a single candidate translation C is assessed against m \geq 1 reference translations R_1, R_2, \dots, R_m, where each translation is treated as an unsegmented sequence of words following tokenization appropriate to the language. BLEU operates under key assumptions that ensure its broad applicability. It is designed to be language-independent, relying solely on the ability to tokenize text into words or subword units, without dependence on specific linguistic features like morphology or syntax. The metric focuses on n-gram overlaps, considering sequences of contiguous tokens up to order 4—that is, unigrams (n=1), bigrams (n=2), trigrams (n=3), and 4-grams (n=4)—as higher-order n-grams capture longer-range correspondences between the candidate and references. An n-gram is defined as a contiguous sequence of n tokens from a translation. The notation used in BLEU's formulation is as follows: C denotes the candidate translation, with |C| representing its total number of words (often denoted as c); r is the effective reference length, computed as the closest reference length to |C| for sentence-level evaluation, or as the sum over all sentences of the closest reference length to each candidate sentence for corpus-level evaluation. For any word sequence w (an n-gram), \mathrm{Count}(w) is the number of times w appears in C, while \mathrm{Count}_{\mathrm{clip}}(w) is the clipped count, defined as the minimum of \mathrm{Count}(w) and the maximum count of w across all reference translations, \max_i \mathrm{Count}_{R_i}(w). The vocabulary V encompasses all unique word sequences appearing in the candidate and reference translations relevant to the n-gram orders considered. Although BLEU can be computed at the sentence level, it is typically aggregated at the corpus level for greater statistical stability, summing clipped n-gram counts across all sentences in the corpus before normalization by the total number of n-grams in the candidate corpus. This corpus-level approach mitigates variability from short sentences and provides a more reliable overall score.
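To make the notation concrete, the following minimal Python sketch (using hypothetical example sentences and illustrative helper names, not a library API) extracts n-grams from tokenized text and computes the clipped counts \mathrm{Count}_{\mathrm{clip}}(w) defined above.

from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical candidate and references, already tokenized on whitespace.
candidate = "the the the cat sat on the mat".split()
references = ["the cat sat on the mat".split(),
              "there is a cat on the mat".split()]

# Clipped count of each candidate unigram: its count in the candidate,
# capped by the maximum count observed in any single reference.
n = 1
cand_counts = ngrams(candidate, n)
max_ref_counts = Counter()
for ref in references:
    for gram, count in ngrams(ref, n).items():
        max_ref_counts[gram] = max(max_ref_counts[gram], count)

clipped = {gram: min(count, max_ref_counts[gram])
           for gram, count in cand_counts.items()}
print(clipped)  # ('the',) is clipped from 4 in the candidate to 2, its maximum in any single reference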

N-gram Precision

The n-gram precision in BLEU is a modified precision metric that evaluates the overlap between n-grams in a candidate translation C and those in one or more reference translations R_1, R_2, \dots, R_m, focusing on content adequacy while penalizing overgeneration. For unigrams (n=1), the precision p_1 is defined as the ratio of the sum of clipped matching unigrams to the total number of unigrams in the candidate: p_1 = \frac{\sum_{\text{word} \in C} \text{Count}_{\text{clip}}(\text{word})}{\sum_{\text{word} \in C} \text{Count}(\text{word})}, where \text{Count}(\text{word}) is the number of times the word appears in C, and the clipping ensures that matches are not overcounted beyond what appears in the references. The clipping mechanism, \text{Count}_{\text{clip}}(\text{ngram}) = \min(\text{Count}_C(\text{ngram}), \max_i \text{Count}_{R_i}(\text{ngram})), limits the contribution of any n-gram in the candidate to the maximum count it attains in any single reference, thereby preventing the metric from rewarding excessive repetitions or hallucinations in the output. This modification addresses a key limitation of standard precision, which would otherwise inflate scores for translations that reuse the same words or phrases far more often than justified by the references. For instance, consider a candidate translation containing the word "the" three times, while the maximum occurrence in any reference is two; the clipped count for "the" would be two, reducing the numerator accordingly and lowering p_1 to penalize the overgeneration. This approach ensures that precision reflects faithful content overlap without bias toward verbose or repetitive outputs. The unigram precision generalizes to higher-order n-grams (n=2 to n=4) in a similar fashion: p_n = \frac{\sum_{\text{ngram} \in C} \text{Count}_{\text{clip}}(\text{ngram})}{\sum_{\text{ngram} \in C} \text{Count}(\text{ngram})}, where clipping is applied per n-gram type across all candidates and references, capturing both local word order (via bigrams and trigrams) and longer-range fluency (via 4-grams). The rationale for this clipping across multiple n-gram orders is to balance adequacy (unigrams) with fluency (higher n-grams), as empirical analysis showed that even single n-gram precisions distinguish good from poor translations, but combining them enhances robustness against overgeneration.
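A minimal sketch of the modified precision p_n, using the classic overgeneration example of a candidate that repeats "the" far beyond the references (the helper is illustrative, not a library function):

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram matches divided by total candidate n-grams."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Clip ceiling for each n-gram: its maximum count in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))  # 2/7 ≈ 0.286 rather than 7/7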

Brevity Penalty

The brevity penalty (BP) in BLEU is a multiplicative factor designed to penalize candidate translations that are shorter than the reference translations, addressing the tendency of raw n-gram precision to favor concise outputs with fewer opportunities for errors. It is defined as BP = \begin{cases} 1 & \text{if } |c| > r \\ \exp\left(1 - \frac{r}{|c|}\right) & \text{if } |c| \leq r \end{cases} where |c| is the length of the candidate translation (in words) and r is the effective length of the reference translation. The rationale for the penalty stems from the observation that unmodified precision scores can be artificially high for very short candidates, as they contain fewer n-grams that might mismatch the references, thus rewarding incomplete translations over those that better capture fluency and adequacy. By applying the penalty, BLEU balances content overlap with overall translation completeness, mimicking human evaluators' preference for outputs that are neither too terse nor excessively verbose. In practice, r is determined at the corpus level by selecting, for each candidate sentence, the reference sentence length closest to the candidate's length, then summing these values across all sentences to yield the effective reference length r; this "best match" approach avoids over-penalizing isolated short sentences while ensuring the penalty reflects aggregate brevity. The exponential form of the penalty for short candidates provides a smooth, non-zero reduction, yielding a factor of approximately 0.37 when |c| is half of r and preventing extreme suppression of scores for moderately concise corpora. For example, consider a candidate translation of length |c| = 10 words against a reference effective length r = 15; here, BP = \exp(1 - 15/10) = \exp(-0.5) \approx 0.607, reducing the overall score to discourage excessive brevity. This component was introduced in the original BLEU formulation to align automatic scores more closely with human adequacy judgments, achieving high correlation (e.g., 0.81–0.99) in empirical tests on English-French translations.
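The piecewise definition translates directly into a few lines of code; the sketch below reproduces the worked example from the text (the function name is illustrative):

import math

def brevity_penalty(candidate_len, effective_ref_len):
    """1 if the candidate is longer than the effective reference length,
    otherwise exp(1 - r/c), a smooth penalty for overly short output."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > effective_ref_len:
        return 1.0
    return math.exp(1.0 - effective_ref_len / candidate_len)

print(brevity_penalty(10, 15))  # exp(-0.5) ≈ 0.607, as in the example above
print(brevity_penalty(20, 15))  # 1.0: longer candidates are not penalized by BP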

Final Score Calculation

The final BLEU score aggregates the modified n-gram precisions p_n (for n = 1 to 4) and the brevity penalty BP into a single metric using the formula: \text{BLEU} = BP \times \exp\left( \sum_{n=1}^{4} w_n \log p_n \right) where w_n = 1/4 are uniform weights, p_n is the modified precision for n-grams (computed via clipping to avoid overcounting, as detailed in the n-gram precision section), and BP penalizes overly short translations (as derived in the brevity penalty section). This formulation employs a geometric mean of the p_n values, obtained by exponentiating the weighted average of their logarithms, to ensure balanced contribution across n-gram orders; the logarithmic aggregation produces a multiplicative combination that strongly penalizes weakness in any single p_n, reflecting the roughly exponential decay of n-gram precision as n increases. The uniform weighting scheme treats all n-gram orders equally in the standard implementation, though alternative non-uniform weights (e.g., higher w_n for longer n-grams to emphasize fluency) have been explored in extensions. BLEU scores are inherently in the range [0, 1] but are conventionally multiplied by 100 and reported as percentages (0–100) for interpretability in published evaluations. The score is computed at the corpus level, aggregating statistics across all sentences in the test set to provide a stable system-level assessment.
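As a sketch of the aggregation step (with hypothetical corpus-level statistics), the precisions and brevity penalty combine as follows; note that the unsmoothed score is zero whenever any p_n is zero, since \log 0 is undefined.

import math

def bleu_from_stats(precisions, candidate_len, effective_ref_len):
    """Combine modified precisions p_1..p_4 with the brevity penalty using
    uniform weights w_n = 1/4 (geometric mean via logarithms)."""
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed BLEU is zero if any p_n is zero
    if candidate_len > effective_ref_len:
        bp = 1.0
    else:
        bp = math.exp(1.0 - effective_ref_len / candidate_len)
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_avg)

# Hypothetical corpus-level precisions for n = 1..4 and corpus lengths.
print(round(100 * bleu_from_stats([0.75, 0.5, 0.4, 0.3], 110, 115), 1))  # ≈ 44.0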

Computation

Algorithm Overview

The computation of the BLEU score involves a multi-step process that measures n-gram precision between a candidate translation C and a set of reference translations \{R_1, \dots, R_m\}, adjusted by a brevity penalty to account for translation length. This procedure is applied at the corpus level for robust evaluation, aggregating statistics across multiple sentences. The first step entails tokenizing the candidate and reference texts into sequences of words. This includes normalization steps such as case folding, and for languages without explicit word boundaries (e.g., those using logographic scripts), appropriate segmentation is applied to identify word units. Multiple reference translations are prepared in the same way to enable comparison. Next, for each n-gram order n from 1 to 4, all contiguous n-grams are extracted from the candidate C and from each reference R_i. The count of each n-gram in C is then clipped to the maximum count of that n-gram observed in any single reference translation, preventing overcounting of n-grams that appear more frequently in the candidate than in the references. This clipping uses the highest frequency across the m references for each n-gram. The modified n-gram precision p_n for each n is calculated as the ratio of the sum of clipped n-gram counts to the total number of n-grams of order n in the candidate, aggregated over the entire corpus. Specifically, p_n = \frac{\sum_{\text{ngram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{ngram})}{\sum_{\text{ngram} \in C} \mathrm{Count}(\text{ngram})}. These precisions are computed separately for n=1,2,3,4. The brevity penalty (BP) addresses the tendency of unpenalized n-gram matching to favor short candidates. For each sentence, the effective reference length r is the length of the reference translation closest in length to the candidate; the corpus-level r is the sum of these per-sentence values, while c is the total candidate length across the corpus. The BP is then BP = \begin{cases} 1 & \text{if } c > r, \\ \exp(1 - r/c) & \text{if } c \leq r. \end{cases} This penalizes candidates that are shorter than the references, while overly long candidates are constrained by the precision terms rather than by BP. Finally, the BLEU score is aggregated as the brevity penalty multiplied by the geometric mean of the precisions, using equal weights for the four n-gram orders: \text{BLEU} = BP \cdot \exp\left( \frac{1}{4} \sum_{n=1}^{4} \ln(p_n) \right). For corpora with multiple sentences, all statistics (precisions and lengths) are pooled before this computation to yield a single score.
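The procedure above can be summarized in a self-contained corpus-level sketch (not a library implementation; ties in reference length are broken toward the shorter reference, a common but not universal convention):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu_sketch(candidates, references_per_candidate, max_n=4):
    """Pool clipped counts and lengths over all sentences before combining."""
    clipped_totals = [0] * max_n    # numerators of p_1..p_4, pooled over the corpus
    candidate_totals = [0] * max_n  # denominators of p_1..p_4
    cand_len, eff_ref_len = 0, 0

    for cand, refs in zip(candidates, references_per_candidate):
        cand_len += len(cand)
        # Effective reference length: the reference closest in length to the candidate.
        eff_ref_len += min((len(r) for r in refs),
                           key=lambda rl: (abs(rl - len(cand)), rl))
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for ref in refs:
                for gram, c in ngrams(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], c)
            clipped_totals[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            candidate_totals[n - 1] += sum(cand_counts.values())

    precisions = [c / t if t else 0.0 for c, t in zip(clipped_totals, candidate_totals)]
    if min(precisions) == 0.0:
        return 0.0
    bp = 1.0 if cand_len > eff_ref_len else math.exp(1.0 - eff_ref_len / cand_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyps = ["the cat is on the mat".split()]
refs = [["the cat is on the mat".split(), "there is a cat on the mat".split()]]
print(round(100 * corpus_bleu_sketch(hyps, refs), 1))  # 100.0 for an exact match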

Implementation Details

Computing the BLEU score requires careful preprocessing, particularly tokenization, which is language-specific and critical for accurate n-gram extraction. For English, standard tokenization splits text on whitespace and punctuation, while languages like Japanese may use tools such as MeCab for morphological analysis. In neural machine translation systems, subword units (e.g., via Byte-Pair Encoding) are common during training, but BLEU evaluation typically applies detokenization followed by standardized word-level tokenization to ensure comparability, as subword segmentation can inflate scores if not normalized. Several software libraries facilitate BLEU computation, building on the original implementation by the IBM researchers. The NLTK library in Python provides a flexible corpus_bleu function that handles multiple references and smoothing options. The Moses toolkit includes the multi-bleu.perl script for efficient scoring against multiple references. Modern implementations like the SacreBLEU package standardize the process, incorporating WMT tokenization and producing reproducible signatures to avoid discrepancies. For corpus-level evaluation, n-gram counts are summed across all sentences before computing precisions, rather than averaging sentence-level scores, to prevent length bias and ensure stable estimates. This aggregation aligns with BLEU's design as a corpus-level metric, where individual sentence scores can be unreliable due to n-gram sparsity. Reliable results typically require corpora of at least 1000 sentences, as smaller sets amplify variance from outliers. Edge cases must be handled explicitly to avoid undefined or misleading scores. An empty candidate translation yields a brevity penalty of 0 and thus a BLEU score of 0, as no n-grams can match. With a single reference (m=1), n-gram clipping is based solely on that reference's counts, potentially leading to harsher penalties for rare terms compared to multiple references, where the maximum count across references is used for clipping, allowing higher clipped counts. Reproducibility challenges arise primarily from tokenization variations, with different schemes causing score differences of up to 1.8 BLEU points, as reported in a 2018 study on preprocessing impacts. To mitigate this, best practices recommend standardized tokenizers (e.g., WMT's 13a scheme) and tools like SacreBLEU, which enforce consistent normalization and report full computation signatures.
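As a usage sketch with NLTK (assuming the package is installed; the example sentences are hypothetical), corpus_bleu takes one list of tokenized references per hypothesis and an optional smoothing function:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

hypotheses = ["the cat is on the mat".split()]
# One list of references per hypothesis; each reference is a list of tokens.
references = [["the cat is on the mat".split(),
               "there is a cat on the mat".split()]]

score = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(round(100 * score, 2))  # 100.0 here, since the hypothesis matches a reference exactly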

Evaluation

Correlation with Human Judgments

The original validation of BLEU demonstrated strong correlation with human judgments of translation quality. In the seminal 2002 study, BLEU achieved a correlation coefficient of 0.96 with bilingual human assessments emphasizing adequacy on a Chinese-to-English task involving 500 sentences from news articles, using multiple reference translations. For monolingual judgments focused on fluency and adequacy, the correlation reached 0.99, indicating BLEU's ability to closely track expert evaluations across different aspects of translation quality. Subsequent empirical studies have confirmed BLEU's consistent performance in correlating with human judgments, particularly at the system level in large-scale benchmarks. In the Workshop on Machine Translation (WMT) evaluations, BLEU has shown system-level agreement of around 0.7 with human ratings across various language pairs, demonstrating reliability in relative system ranking. For instance, in the NIST 2005 machine translation evaluation, BLEU generally ranked systems in the same order as human assessors for Arabic-to-English and Chinese-to-English tasks, providing a robust indicator of comparative quality despite limitations in fine-grained scoring. BLEU's strengths lie in its reliability for evaluating statistical machine translation systems, where it effectively detects improvements in n-gram overlap that align with enhanced translation quality. It excels at capturing modifications that boost local sequence matching, such as better word ordering and phrase reproduction, which human judges often prioritize in adequacy and fluency assessments. Additionally, BLEU is language-independent when applied to tokenized inputs, allowing consistent comparison across diverse linguistic structures without heavy reliance on morphology-specific adjustments. In the modern context of neural machine translation, including Transformer-based models, BLEU remains a standard baseline metric, showing moderate to high correlation with learned neural evaluators like BLEURT and COMET while serving as a quick proxy for human-aligned progress. For example, in WMT benchmarks involving neural systems, BLEU's system-level correlations with human judgments continue to support its use for tracking iterative improvements, even as neural metrics achieve superior absolute agreement.

Limitations and Criticisms

BLEU's reliance on n-gram matching primarily measures lexical overlap between candidate translations and references, but it largely disregards semantics, synonyms, and grammatical correctness. For instance, ungrammatical phrases like "cat the" could receive a high unigram score if they match a reference containing those words, even though such outputs lack grammaticality or meaning. This theoretical flaw leads to BLEU rewarding surface-level matches over meaningful translations, as synonyms or paraphrases are only accounted for if explicitly present in the reference set. The brevity penalty (BP) component introduces further issues by harshly penalizing translations shorter than their references, potentially undervaluing concise yet accurate outputs that align closely with human preferences. Additionally, BLEU's insensitivity to paraphrasing means equivalent translations using different wording receive lower scores unless the references include those variants, exacerbating its limitations in capturing adequacy. BLEU performs poorly on morphologically rich languages, where inflections and agglutinative structures inflate n-gram mismatches, leading to artificially low scores even for high-quality translations. It also struggles with unsegmented languages such as Chinese and Japanese without proper preprocessing, as tokenization assumptions favor space-separated scripts and reduce precision for word-boundary ambiguities. Empirically, BLEU exhibits low correlation with human judgments in specialized domains, such as speech translation, where Pearson correlations range from 0.1 to 0.2 due to disfluencies and prosodic elements not captured by n-gram matching. It also fails to reliably detect adequacy improvements in outputs exceeding 30–40 BLEU points, where scores saturate and small semantic gains go unmeasured. Early critiques highlighted BLEU's extreme sensitivity to reference translations, with score variations of up to 10 points from minor reference changes, undermining its reliability for system comparisons. Reproducibility concerns arose from inconsistent implementation details, such as tokenization and normalization, leading to non-comparable scores across studies until standardized reporting was advocated. In the era of neural machine translation, post-2020 analyses revealed BLEU's saturation at high quality levels, where top systems yield scores above 40 with minimal differentiation, prompting a shift toward learned metrics like COMET and BLEURT that better correlate with human assessments in these regimes. This transition underscores 2020s critiques of BLEU's inadequacy for evaluating fluent, context-aware neural outputs.

Variants

SacreBLEU

SacreBLEU is a standardized implementation of the BLEU metric designed to address inconsistencies in BLEU score reporting arising from variations in tokenization, smoothing techniques, and other preprocessing choices in machine translation evaluation. Developed by Matt Post in 2018, it enforces a fixed reference tokenizer and built-in handling of references and test sets to ensure reproducibility and comparability across studies. Key features of SacreBLEU include the use of the '13a' tokenizer, which mimics the WMT standard mteval-v13a.pl script for English-language evaluation, and exponential ('exp') smoothing as the default method to handle zero n-gram precision counts. It also generates a unique signature string that encapsulates all computation parameters for exact reproduction, such as "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0". This signature allows researchers to verify and replicate scores without ambiguity. Unlike the original BLEU implementation, which relies on user-provided tokenized references and allows flexible weights, SacreBLEU operates exclusively at the corpus level, requires plain text inputs (with internal tokenization), and fixes the n-gram weights to the standard uniform 0.25 for 1- to 4-grams. It accommodates subword units like those produced by Byte-Pair Encoding (BPE) by expecting detokenized input, ensuring consistent scoring after subword merging. SacreBLEU has been the standard for official evaluations in the Conference on Machine Translation (WMT) since its release, with the toolkit automatically downloading relevant test sets such as WMT 2014 en-de, where it reports a representative score of BLEU = 22.0 for baseline systems under default settings. Available as an open-source Python package installable via pip, it has facilitated widespread adoption in academic and industrial research. The primary benefits of SacreBLEU include significant reductions in score variance—up to 1.8 BLEU points across configurations—stemming from standardized preprocessing, thereby promoting fairer comparisons between models and minimizing discrepancies that could otherwise reach several points due to preprocessing differences. This enhances the reliability of BLEU as a benchmark metric in the field.
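A brief usage sketch of the sacrebleu Python API (as written for version 2.x; exact output and defaults may vary across versions), which takes detokenized plain-text input and exposes the reproducibility signature:

from sacrebleu.metrics import BLEU

bleu = BLEU()  # defaults: 13a tokenization, exponential smoothing, corpus-level scoring
hypotheses = ["The cat is on the mat."]
# sacrebleu expects one reference *stream* per inner list, each parallel to the hypotheses.
references = [["The cat is on the mat."],
              ["There is a cat on the mat."]]

result = bleu.corpus_score(hypotheses, references)
print(result.score)          # corpus BLEU on the 0-100 scale
print(bleu.get_signature())  # settings string for reproducible reporting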

Other Extensions

iBLEU, introduced in 2011, extends the BLEU metric through an interactive framework designed for debugging and scoring statistical machine translation systems. It visualizes n-gram precision breakdowns and enables users to incorporate judgments on adequacy (preservation of meaning) and fluency (naturalness of the output), allowing for targeted refinements in evaluation that align automated scores more closely with human assessments. This approach facilitates iterative improvements by highlighting discrepancies between system outputs and references, particularly useful for developers analyzing failure cases without altering the core BLEU computation. Hybrid metrics combining elements of BLEU and METEOR address some of BLEU's limitations in handling synonyms and semantic variations, though they represent integrations rather than pure BLEU extensions. These hybrids leverage BLEU's n-gram precision alongside METEOR's explicit word ordering and synonym matching to better capture lexical flexibility and improve correlation with human judgments on adequacy. For instance, comparative analyses demonstrate that such combinations yield higher agreement with fluency and adequacy ratings in diverse language pairs compared to standalone BLEU. Domain-specific adaptations of BLEU include case-sensitive variants, which preserve letter case in n-gram matching to better evaluate translations involving proper nouns or stylistic elements where case matters. Standard BLEU's case-insensitivity can overlook such errors, but case-sensitive implementations penalize these mismatches accordingly, leading to more precise scores in tasks such as biomedical or legal text translation. Additionally, enhancements to multi-reference handling optimize aggregation across multiple human references, such as using maximum clipping per n-gram to reduce bias from single references and improve robustness in varied scenarios. Direct non-learned modifications to BLEU often target computational stability, particularly through alternative smoothing techniques for handling zero n-gram matches in short or sparse outputs. Add-1 (Laplace) smoothing adds a count of 1 to all n-grams, providing a simple baseline that prevents zero precision but can over-smooth longer sequences; a minimal sketch appears below. In contrast, pyramid smoothing employs a hierarchical distribution of probability mass from observed higher-order n-grams to unobserved lower-order ones, preserving more accurate precision estimates for sentence-level evaluation. A systematic comparison showed pyramid and related methods providing improved correlation with human judgments for low-count sentences on WMT datasets. These adjustments maintain BLEU's n-gram core while enhancing reliability without relying on external models. Recent non-learned extensions include spBLEU (2022) and spBLEU-1K (2024), which standardize tokenization using SentencePiece models trained on multilingual corpora (over 1,000 sources for spBLEU-1K) to improve comparability across languages, particularly benefiting low-resource settings by reducing tokenizer-induced variance. While learned evolutions like BLEURT (2020) use embeddings for context-aware scoring and achieve superior correlation (up to 0.3 higher Pearson r with humans than BLEU on WMT), they diverge from BLEU's lineage by prioritizing neural representations over surface n-grams.
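A minimal sketch of the add-1 idea described above (an illustrative sentence-level variant, not any specific library's implementation; here smoothing is applied only to orders above unigrams):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu_add1(candidate, references, max_n=4):
    """Sentence-level BLEU with add-1 (Laplace) smoothing on higher-order
    precisions, so a single missing 4-gram does not zero out the score."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if n == 1:
            precisions.append(clipped / total if total else 0.0)
        else:
            precisions.append((clipped + 1) / (total + 1))  # add-1 smoothing
    if precisions[0] == 0.0:
        return 0.0
    c = len(candidate)
    r = min((len(x) for x in references), key=lambda l: abs(l - c))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the cat sat on the mat".split()
references = ["the cat is on the mat".split()]
print(round(100 * sentence_bleu_add1(candidate, references), 1))  # nonzero despite no 4-gram match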
