Word error rate
Word error rate (WER) is a standard evaluation metric used to assess the performance of automatic speech recognition (ASR) systems and, to a lesser extent, machine translation systems.[1] It quantifies the accuracy of a system's output by measuring the minimum number of single-word edits—substitutions, insertions, or deletions—required to match the hypothesized transcription to a reference transcript, typically expressed as a percentage ranging from 0% (perfect accuracy) to values exceeding 100% in cases of severe errors.[2] This metric originated in the context of DARPA-funded speech recognition evaluations starting in the 1970s, where it became the primary benchmark for tracking progress in the field.[3]
The calculation of WER relies on the Levenshtein distance (edit distance) algorithm, which employs dynamic programming to find the optimal alignment between the reference and hypothesis sequences at the word level.[2] Specifically, WER is computed as:
\text{WER} = \frac{S + D + I}{N} \times 100\%
where S represents the number of substitutions (words incorrectly recognized), D the deletions (words omitted from the hypothesis), I the insertions (extraneous words added), and N the total number of words in the reference transcript.[2] Lower WER values indicate higher accuracy, with state-of-the-art ASR systems achieving rates below 5% on clean, read speech in controlled environments as of 2025, though performance degrades significantly on noisy or spontaneous speech, often exceeding 20%.[4]
WER plays a crucial role in ASR research and development, serving as a key indicator in benchmarks like those from NIST and influencing applications in voice assistants, transcription services, and dialogue systems.[5] However, it has notable limitations: it treats all errors equally regardless of semantic impact, ignores word order beyond edits, and can be sensitive to segmentation variations in continuous speech.[2] Despite these drawbacks, WER remains the de facto standard due to its simplicity and correlation with overall system reliability, driving ongoing innovations in acoustic modeling and language processing.[3]
Fundamentals
Definition
Word error rate (WER) is a fundamental metric in natural language processing that quantifies the accuracy of a system's output by aligning a hypothesis sequence against a reference sequence—representing the ground truth—and calculating the percentage of word-level errors between them.[2] This measure focuses on discrepancies such as substitutions (incorrect words), deletions (missing words), and insertions (extraneous words), providing an overall assessment of how closely the generated output aligns with the expected text.[2]
WER emerged in the early 1990s as a standardized evaluation metric for speech recognition systems, with its initial formal applications documented in DARPA-funded projects, including assessments of systems like SRI's DECIPHER on 1989 test sets.[6] Prior to this, speech recognition evaluations relied on related but less standardized error measures; WER's adoption facilitated consistent benchmarking across research efforts, particularly in large-vocabulary continuous speech recognition tasks sponsored by DARPA.[3]
At its core, WER captures the intuition of textual dissimilarity by treating words as atomic units and computing a ratio that reflects the minimum operations needed to transform the hypothesis into the reference, drawing from the broader concept of edit distance.[2] A WER of 0% indicates perfect alignment, while values of 100% or higher signify complete mismatch or worse performance, as the metric can exceed 100% due to numerous insertions.[7] This offers a scalable way to gauge system performance without requiring semantic interpretation.
To illustrate, consider a reference sequence "the cat sat" compared to a hypothesis "a cat sits": the word "the" is replaced by "a" (substitution), "cat" matches exactly, and "sat" is altered to "sits" (substitution), highlighting how WER identifies specific types of word-level mismatches without delving into meaning.[2]
The word error rate (WER) is computed using the formula WER = (S + D + I) / N × 100%, where S represents the number of substitutions (words incorrectly replaced), D the number of deletions (reference words missing from the hypothesis), I the number of insertions (extraneous words added in the hypothesis), and N the total number of words in the reference transcript.[8] Prior to alignment, the reference and hypothesis sequences are typically preprocessed by converting to lowercase and removing punctuation to standardize tokenization.[7] This formula adapts the Levenshtein edit distance to the word level, treating each operation as equally costly to quantify overall recognition accuracy.[8]
Normalization by dividing the total errors (S + D + I) by the reference length N and multiplying by 100 yields a percentage, enabling direct comparability across transcripts of varying lengths, such as short utterances versus extended dialogues. Without this normalization, raw error counts would favor shorter texts, skewing evaluations in automatic speech recognition (ASR) benchmarks.[8]
To compute WER, the reference and hypothesis sequences are first aligned using dynamic programming, such as the Viterbi algorithm, to identify the minimum-cost path of edit operations (substitutions, deletions, insertions) between them.[8] Counts for S, D, and I are then tallied from this alignment, and the formula is applied to derive the final rate.
The full equation is presented as:
\text{WER} = \frac{S + D + I}{N} \times 100\%
where S is the count of substitutions, D deletions, I insertions, and N the reference word count.[8]
For a numerical example, consider a reference transcript "This is a test case" (N = 5 words) and hypothesis "This is test case now" (aligning to one deletion of "a" and one insertion of "now," with S = 0, D = 1, I = 1). The total errors are 2, so WER = (2 / 5) × 100% = 40%, indicating moderate recognition inaccuracy.[8]
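The same result can be reproduced with standard tooling. The snippet below is a minimal sketch that assumes the third-party Python package jiwer is installed and uses its wer() function, which performs this word-level edit-distance calculation and returns the rate as a fraction rather than a percentage; the strings are given here in already-normalized (lowercased, punctuation-free) form.

```python
import jiwer  # third-party package (pip install jiwer), assumed available

reference = "this is a test case"       # N = 5 reference words
hypothesis = "this is test case now"    # one deletion ("a") and one insertion ("now")

error_rate = jiwer.wer(reference, hypothesis)  # returns a fraction, not a percentage
print(f"WER = {error_rate:.0%}")               # WER = 40%
```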
Underlying Mechanisms
Edit Distance Operations
In the computation of word error rate (WER), three fundamental edit operations are used to align and compare a reference transcript with a hypothesis transcript: substitution, insertion, and deletion. A substitution occurs when a word in the hypothesis replaces a different word in the reference, counting as one error. An insertion happens when an extra word appears in the hypothesis that is absent from the reference, also counting as one error. A deletion is recorded when a word present in the reference is omitted from the hypothesis, likewise contributing one error.[2][9]
These operations play a central role in WER by providing an unweighted measure of discrepancies at the word level, where each type contributes equally to the total error count without regard to semantic or contextual impact. The goal is to find the alignment that minimizes the combined number of these operations, ensuring the most accurate representation of differences between sequences. This minimal alignment is determined through dynamic programming, as formalized in the Levenshtein distance algorithm.[10][11]
To illustrate, consider aligning the reference sequence "hello world" with the hypothesis "hello there world". The optimal alignment inserts "there" in the hypothesis, resulting in one insertion error and no substitutions or deletions:
| Reference | hello |  | world |
|---|---|---|---|
| Hypothesis | hello | there | world |
This example demonstrates how the operations are applied sequentially to achieve the lowest total error count of 1.[2]
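As a quick illustration of how an alignment decomposes into these operations, the sketch below uses Python's standard-library difflib to label aligned word spans. Note that SequenceMatcher uses a heuristic matching strategy and is not guaranteed to produce a minimum-edit-distance alignment in general, although it does for this simple case.

```python
from difflib import SequenceMatcher

reference = "hello world".split()
hypothesis = "hello there world".split()

# get_opcodes() labels aligned spans as 'equal', 'replace' (substitution),
# 'delete', or 'insert', indexed against the reference and hypothesis lists.
for tag, i1, i2, j1, j2 in SequenceMatcher(None, reference, hypothesis).get_opcodes():
    print(tag, reference[i1:i2], hypothesis[j1:j2])
# equal  ['hello'] ['hello']
# insert []        ['there']
# equal  ['world'] ['world']
```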
Levenshtein Distance Adaptation
The Levenshtein distance, introduced by Vladimir Levenshtein in 1966, measures the difference between two strings as the minimum number of single-character edits—insertions, deletions, or substitutions—required to transform one into the other, with each operation assigned a unit cost. This distance is computed via dynamic programming, which constructs an (m+1) × (n+1) cost matrix for strings of lengths m and n, where each entry D[i][j] represents the minimum number of edits needed to transform the length-i prefix of one string into the length-j prefix of the other.[12]
In the context of word error rate (WER), the Levenshtein algorithm is adapted by first tokenizing the reference and hypothesis transcripts into sequences of words using spaces as delimiters, thereby applying the edit operations at the word level rather than the character level and treating individual words as indivisible units.[13] The dynamic programming table is then populated with rows corresponding to positions in the reference word sequence and columns to the hypothesis word sequence, yielding the minimum number of word-level insertions, deletions, or substitutions required for alignment.[14] This word-granularity approach preserves the unit cost structure of the original algorithm while scaling the computation to sequence lengths measured in words.
The adapted algorithm can be expressed in the following pseudo-code for reference word sequence ref of length m and hypothesis word sequence hyp of length n:
Initialize (m+1) × (n+1) matrix D
For i from 0 to m:
    D[i][0] = i                      // deletions for empty hypothesis prefix
For j from 0 to n:
    D[0][j] = j                      // insertions for empty reference prefix
For i from 1 to m:
    For j from 1 to n:
        cost = 0 if ref[i-1] == hyp[j-1] else 1
        D[i][j] = min(
            D[i-1][j] + 1,           // deletion
            D[i][j-1] + 1,           // insertion
            D[i-1][j-1] + cost       // match or substitution
        )
Return D[m][n]
This pseudo-code explicitly handles boundary conditions for empty sequences through the initialization, where the distance equals the length of the non-empty sequence.[15]
The time complexity of this dynamic programming approach remains O(mn), where m and n denote the number of words in the reference and hypothesis sequences, respectively, making it computationally feasible for most evaluation datasets in automatic speech recognition tasks.[12]
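A runnable Python rendering of this adaptation is sketched below (the function name word_error_counts is illustrative and not from any particular toolkit). In addition to filling the matrix, it backtracks through it to recover the substitution, deletion, and insertion counts needed for the WER formula.

```python
def word_error_counts(ref_words, hyp_words):
    """Word-level Levenshtein alignment; returns (substitutions, deletions, insertions)."""
    m, n = len(ref_words), len(hyp_words)
    # D[i][j] = minimum edits aligning the first i reference words with the first j hypothesis words
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                                  # deletions against an empty hypothesis prefix
    for j in range(n + 1):
        D[0][j] = j                                  # insertions against an empty reference prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,           # deletion
                          D[i][j - 1] + 1,           # insertion
                          D[i - 1][j - 1] + cost)    # match or substitution

    # Backtrack through the matrix to tally one optimal set of operations.
    n_sub = n_del = n_ins = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref_words[i - 1] == hyp_words[j - 1] and D[i][j] == D[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # exact match, no error
        elif i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + 1:
            n_sub += 1; i, j = i - 1, j - 1          # substitution
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            n_del += 1; i -= 1                       # deletion
        else:
            n_ins += 1; j -= 1                       # insertion
    return n_sub, n_del, n_ins


ref = "this is a test case".split()
hyp = "this is test case now".split()
n_sub, n_del, n_ins = word_error_counts(ref, hyp)
wer = 100.0 * (n_sub + n_del + n_ins) / len(ref)
print(n_sub, n_del, n_ins, f"{wer:.0f}%")            # 0 1 1 40%
```

When several optimal alignments exist, the individual S, D, and I counts recovered by the backtrace may differ between them, even though their sum (the edit distance, and hence the WER) is identical.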
Applications
Automatic Speech Recognition
Word error rate (WER) serves as the de facto standard metric for evaluating the accuracy of automatic speech recognition (ASR) systems, particularly in benchmarks organized by the National Institute of Standards and Technology (NIST) since the early 2000s.[16] These evaluations assess how closely machine-generated transcripts match human-verified references, providing a consistent measure of progress in speech-to-text conversion across diverse audio conditions.[5]
In the ASR evaluation process, audio recordings are first transcribed by the system to produce a hypothesis text output, which is then aligned and compared to manually created reference transcripts using WER to quantify discrepancies.[16] This alignment relies on edit distance operations to identify substitutions, insertions, and deletions at the word level. Tools such as SCLITE from the NIST Scoring Toolkit (SCTK) facilitate batch computation of WER in ASR pipelines, enabling scalable analysis of large datasets.[17]
A notable example of WER's application is the DARPA HUB-5 evaluation in 1997, which demonstrated significant advancements in ASR performance; during the 1990s, WER for clean speech tasks dropped from around 30% to under 10%, reflecting improvements in acoustic modeling and language processing.[3]
Specific challenges in ASR, such as handling homophones or accents, often elevate substitution errors, as these factors introduce ambiguities in phonetic interpretation. For instance, on real-world conversational datasets like Switchboard, state-of-the-art models as of 2020 achieve WER as low as 4.7%, though performance can degrade to 15-20% or higher under accented or noisy conditions that mimic varied real-world usage.[18]
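When WER is reported over a whole test set, error counts are conventionally pooled across utterances and divided by the total number of reference words, rather than averaging the per-utterance percentages (which would over-weight short utterances). The sketch below illustrates this aggregation, reusing the word_error_counts helper sketched in the previous section.

```python
def corpus_wer(pairs):
    """pairs: iterable of (reference_text, hypothesis_text) strings, one entry per utterance."""
    total_errors = total_ref_words = 0
    for ref_text, hyp_text in pairs:
        ref, hyp = ref_text.split(), hyp_text.split()
        s, d, i = word_error_counts(ref, hyp)        # helper sketched earlier
        total_errors += s + d + i
        total_ref_words += len(ref)
    return 100.0 * total_errors / total_ref_words    # pooled, not averaged, error rate

pairs = [("this is a test case", "this is test case now"),
         ("hello world", "hello there world")]
print(f"{corpus_wer(pairs):.1f}%")                   # 3 errors over 7 reference words ≈ 42.9%
```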
Machine Translation
In machine translation (MT), word error rate (WER) serves as a key automatic evaluation metric, particularly valued for its focus on word-level fidelity in assessing output quality. Since the inception of the Workshop on Machine Translation (WMT) conferences in 2005, WER has been employed alongside metrics like BLEU to provide complementary insights into translation accuracy, emphasizing exact lexical matches over n-gram overlaps.[19][20] This makes WER especially useful for diagnosing substitution errors that affect semantic precision in translated text.[21]
The evaluation process involves aligning a machine-generated translation hypothesis against one or more human-provided reference translations and computing WER based on the minimum edit distance required to transform the hypothesis into the reference. This captures lexical errors such as substitutions (e.g., incorrect word choices), while insertions and deletions account for length mismatches between the hypothesis and reference in a single sentence.[21][22] For instance, given a hypothesis "the house is red" and a reference "the building is crimson," WER would reflect a high substitution rate due to mismatches in "house/building" and "red/crimson," highlighting fidelity issues at the word level.[21]
In the neural MT era following 2016, WER has demonstrated improved correlation with human judgments compared to its performance on statistical MT systems, achieving Spearman correlations around 0.52 on datasets like WMT16 direct assessments.[23] However, unlike BLEU, which allows partial credit through n-gram matching, WER more severely penalizes paraphrasing or synonymous rephrasings that deviate from exact word matches in the reference.[23][24]
WER finds common application in benchmark datasets such as the International Workshop on Spoken Language Translation (IWSLT) and WMT corpora, where it helps evaluate performance across language pairs. For low-resource languages in these corpora, WER typically averages 20-30%, underscoring persistent challenges in achieving high fidelity without abundant training data.[21][25]
Comparative Metrics
Character Error Rate
The character error rate (CER) is a performance metric for evaluating the accuracy of text transcription systems, analogous to word error rate but computed at the character level rather than the word level. It quantifies the minimum number of single-character edits—substitutions (S), deletions (D), and insertions (I)—required to transform the hypothesized output into the reference text, expressed as a percentage: CER = (S + D + I) / N × 100%, where N denotes the total number of characters in the reference.[26] This metric provides a granular assessment of errors, focusing on individual character mismatches without relying on word boundaries.[27]
A primary distinction from word error rate lies in CER's finer granularity, which generally produces lower values for equivalent errors since it penalizes only specific character discrepancies rather than entire words. For instance, the American English reference "color" and British English hypothesis "colour" yield a word error rate of 100% (treating them as fully mismatched words) but a CER of 20% (one insertion amid five reference characters).[28] Similarly, for the reference "cat" and hypothesis "cot", CER registers 33% (one substitution out of three characters), in contrast to a 100% word error rate.[29] This sensitivity to partial similarities makes CER valuable for detecting subtle transcription issues that coarser metrics overlook.[30]
CER finds particular utility in domains where word segmentation is ambiguous or absent, such as optical character recognition (OCR) and handwriting recognition, enabling evaluation without predefined word boundaries.[30][31] It is less prevalent in automatic speech recognition or machine translation, where word-level metrics better align with semantic evaluation needs.[32] Furthermore, CER's independence from tokenization renders it advantageous for logographic languages like Chinese, which lack inter-word spaces and thus complicate word-based measures.[33]
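Because CER applies the same edit-distance machinery to characters rather than words, a word-level implementation can be reused by passing character sequences. The sketch below reuses the word_error_counts helper sketched earlier; whether spaces count as characters is a convention that varies between toolkits.

```python
def cer(reference, hypothesis):
    # Treat each character as a token and apply the same edit-distance alignment.
    s, d, i = word_error_counts(list(reference), list(hypothesis))  # helper sketched earlier
    return 100.0 * (s + d + i) / len(reference)

print(f"{cer('color', 'colour'):.0f}%")  # one insertion over 5 reference characters -> 20%
print(f"{cer('cat', 'cot'):.0f}%")       # one substitution over 3 reference characters -> 33%
```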
Sentence Error Rate
Sentence error rate (SER) is a binary evaluation metric in automatic speech recognition (ASR) that quantifies the percentage of sentences in a transcript that contain at least one recognition error, providing a coarse-grained assessment of output quality at the sentence level.[34] Unlike word-level metrics, SER treats any deviation from the reference transcript—such as substitutions, deletions, or insertions—as rendering the entire sentence erroneous.
To compute SER, the word error rate (WER) is first calculated for each individual sentence in the corpus; a sentence is then flagged as erroneous if its WER exceeds 0%, and the overall SER is the ratio of erroneous sentences to the total number of sentences, multiplied by 100 to express it as a percentage. This aggregation method is simpler and less granular than WER, as it does not differentiate between minor and major errors within a sentence, focusing instead on complete sentence failures.[35]
SER offers advantages in scenarios where sentence-level coherence is paramount, such as dialogue systems and conversational AI, by highlighting systemic recognition failures that could disrupt user interactions more starkly than averaged word errors. It is particularly useful in evaluating chatbots and spoken language understanding tasks, where even a single error can invalidate the entire response.[36] For instance, in a corpus of 10 sentences of which 3 have a WER greater than 0%, the SER would be 30%.
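A minimal sketch of this aggregation, reusing the word_error_counts helper sketched earlier and flagging a sentence as erroneous whenever its total edit count is nonzero:

```python
def sentence_error_rate(pairs):
    """pairs: iterable of (reference_text, hypothesis_text), one entry per sentence."""
    sentences = erroneous = 0
    for ref_text, hyp_text in pairs:
        s, d, i = word_error_counts(ref_text.split(), hyp_text.split())  # helper sketched earlier
        sentences += 1
        if s + d + i > 0:
            erroneous += 1                           # any error makes the whole sentence count
    return 100.0 * erroneous / sentences

pairs = [("the cat sat", "the cat sat"),             # correct sentence
         ("this is a test", "this is test"),         # one deletion -> erroneous
         ("hello world", "hello there world")]       # one insertion -> erroneous
print(f"{sentence_error_rate(pairs):.1f}%")          # 2 of 3 sentences -> 66.7%
```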
Introduced in the late 1980s for continuous speech recognition systems employing grammars, SER has been less standardized than WER but is gaining traction in modern end-to-end ASR models for its ability to capture holistic performance.[34]
Limitations and Extensions
Common Criticisms
One primary criticism of word error rate (WER) is its equal treatment of all error types—substitutions, deletions, and insertions—without considering their semantic or contextual impact. For instance, inserting a minor function word like "the" may incur the same penalty as deleting a crucial content word such as a key noun, potentially misrepresenting the overall quality of the output in applications like automatic speech recognition (ASR).[37] This limitation arises because WER focuses solely on surface-level lexical matches, ignoring how errors affect meaning or downstream tasks like comprehension.[37]
WER is also highly sensitive to tokenization choices, particularly in languages without explicit word boundaries, such as Japanese or Chinese, where segmentation can vary significantly and lead to inflated error scores. In Japanese speech recognition, inconsistent orthographic representations—such as variations between hiragana and katakana (e.g., "だめ" vs. "ダメ") or ambiguous kanji readings—cause standard WER to penalize equivalent expressions as errors, even when they convey identical meaning.[38] This dependency on predefined tokenization rules can distort evaluations, making WER less reliable for multilingual or non-alphabetic scripts compared to languages with clear spacing conventions.[38]
Another key issue is WER's tendency to penalize semantically equivalent paraphrases as full substitutions, failing to capture human-like fluency or intent. For example, a reference sentence "I am happy" and a hypothesis "I'm glad" yield a WER of 100% due to mismatches in all words, despite expressing nearly identical sentiment.[37] Such cases highlight how WER prioritizes exact lexical overlap over contextual equivalence, which can undervalue outputs that are functionally correct but phrased differently.[39]
Empirical studies further underscore WER's limitations by demonstrating its poor correlation with human judgments, especially when error rates are low. In evaluations of ASR outputs, WER achieves only moderate alignment (correlation coefficient of 0.43) with human assessments of error severity, as it does not account for semantic nuances like substituting "love" with "loathe."[39] Similarly, in speech translation tasks, optimizing for WER can degrade end-to-end performance, with experiments showing negligible or negative correlation between WER reductions and improvements in human-evaluated translation quality like BLEU scores.[37] These findings indicate that below approximately 20-25% error rates, WER becomes less discriminative for subtle quality differences perceived by humans.[37]
Weighted Variants
Weighted variants of the word error rate (WER) address limitations in the standard metric by assigning differential weights to words or errors, reflecting their varying importance in specific contexts such as semantic understanding or information retrieval (IR). Unlike uniform WER, which treats all errors equally, these variants prioritize critical elements, enabling more nuanced evaluations of automatic speech recognition (ASR) systems. Early developments focused on weighting based on word significance for IR tasks, while recent approaches incorporate semantic models like BERT for deeper contextual weighting.[40][41][42]
A seminal weighted variant is the Weighted Word Error Rate (WWER), introduced to generalize WER by incorporating word importance in speech-to-text applications, particularly for IR systems. In WWER, each word w_i is assigned a weight v_{w_i} based on its contribution to overall task performance, such as retrieval accuracy. The metric is computed as:
\text{WWER} = \frac{V_I + V_D + V_S}{V_N}
where V_N = \sum v_{w_i} sums the weights of all words in the reference (correct) transcript, V_I = \sum_{\hat{w}_i \in I} v_{\hat{w}_i} sums the weights of inserted words, V_D = \sum_{w_i \in D} v_{w_i} sums the weights of deleted words, and V_S = \sum_{\text{seg}_j \in S} v_{\text{seg}_j} sums the weights of substituted segments (using the maximum of the weights of the correct and recognized words in each segment). Weights are often derived from IR metrics such as tf-idf for keywords (e.g., nouns excluding proper nouns, pronouns, and numbers), with non-keywords weighted at zero in variants like the Weighted Keyword Error Rate (WKER). This formulation correlates strongly with IR degradation (e.g., a supervised correlation of 0.969 with IRDR) and supports minimum Bayes-risk (MBR) decoding to minimize WWER, reducing errors from 25.57% to 24.96% in experiments on presentation speech.[41][40][43]
WWER has been extended for automatic weight estimation in ASR, aligning error penalties with downstream tasks like IR without manual annotation. Unsupervised methods achieve 0.712 correlation with IR performance, enabling practical deployment in speech-based search systems. In IR-oriented ASR, WWER guides decoding by penalizing errors on high-impact words more heavily, improving key sentence indexing F-measure to 0.548 compared to 0.545 baselines.[41][43][40]
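To make the weighting concrete, the sketch below is a simplified, hypothetical rendering of the WWER formula rather than the original authors' implementation. It assumes per-word importance weights (e.g., tf-idf values) are supplied in a dictionary with a unit default, and that a word-level alignment is available as (operation, reference word, hypothesis word) tuples, for example by extending the backtrace sketched earlier; substituted segments are scored with the maximum of the two words' weights, as described above.

```python
def weighted_wer(alignment, reference_words, weights, default_weight=1.0):
    """Weighted WER: alignment is a list of (op, ref_word, hyp_word) tuples with
    op in {'match', 'sub', 'del', 'ins'}; weights maps words to importance values."""
    def w(word):
        return weights.get(word, default_weight)

    v_n = sum(w(word) for word in reference_words)                        # total reference weight
    v_i = sum(w(h) for op, r, h in alignment if op == 'ins')              # inserted-word weight
    v_d = sum(w(r) for op, r, h in alignment if op == 'del')              # deleted-word weight
    v_s = sum(max(w(r), w(h)) for op, r, h in alignment if op == 'sub')   # substituted segments
    return (v_i + v_d + v_s) / v_n

# Hypothetical example: the content word "forecast" is weighted more heavily than
# function words (the weights here are invented purely for illustration).
reference = "the weather forecast is good".split()
alignment = [('match', 'the', 'the'), ('match', 'weather', 'weather'),
             ('sub', 'forecast', 'forecasts'), ('del', 'is', None),
             ('match', 'good', 'good')]
weights = {'forecast': 3.0, 'weather': 2.0}
print(f"WWER = {weighted_wer(alignment, reference, weights):.2f}")        # WWER = 0.50
```

With these illustrative weights the keyword error raises the score to 0.50, whereas the unweighted WER for the same alignment would be 2/5 = 0.40.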
More recent semantic-weighted variants, such as the Semantic-Weighted Word Error Rate (SWWER), leverage pre-trained language models like BERT to assign weights dynamically based on a word's semantic contribution to the utterance. SWWER processes reference and hypothesis texts through BERT (or domain-adapted variants like PPBERT) to compute contextual embeddings, deriving weights from attention scores or similarity metrics that quantify semantic impact. This addresses WER's insensitivity in low-error regimes, where standard metrics show weak correlation with human judgments (e.g., near-zero WER for top systems like Whisper Large-v3), by emphasizing errors in semantically pivotal words. Evaluated on Vietnamese ASR models from providers like Vais and FPT, SWWER provides finer-grained assessments, highlighting perceptual differences in mid-performing systems like HuBERT.[42][44]
Other weighted forms include the duration-weighted WER (\hat{\text{WER}}_{dur}), used in reference-free WER estimation, in which per-utterance WER estimates are weighted by utterance duration: \hat{\text{WER}}_{dur} = \sum_i (\hat{\text{WER}}_i \times \text{Duration}_i) / \sum_i \text{Duration}_i. This variant, estimated via self-supervised representations from models like HuBERT and XLM-R, achieves a Pearson correlation of 0.8900 with the true WER, outperforming baselines by 14.10% in RMSE for fast ASR evaluation. Such adaptations underscore weighted WER's role in tailoring metrics to application-specific needs, from IR to real-time transcription.[45]
In 2024, further extensions addressed WER's limitations in human perception, such as the Human Evaluation Word Error Rate (HEWER), which focuses on errors impacting readability and semantics rather than all edits equally. HEWER demonstrates stronger alignment with human judgments in ASR tasks, particularly for conversational and noisy speech.[46] As of 2025, emerging metrics like the Word Diarization Error Rate (WDER) adapt WER principles to evaluate speaker attribution in multi-speaker scenarios, measuring the fraction of words with incorrect labels.[47]