
Speech recognition

Speech recognition, also known as automatic speech recognition (ASR), is a technology that enables computers to identify and transcribe spoken language into text by analyzing audio signals, modeling acoustic patterns, and applying linguistic constraints to decode words and sentences.^[1] This process typically involves three core components: acoustic modeling to map audio features to phonetic units, language modeling to predict probable word sequences, and search algorithms to find the most likely transcription from possible hypotheses.^[1] ASR systems must handle variability in speech due to accents, noise, speaking rates, and context, making it a challenging intersection of signal processing, machine learning, and natural language processing.^[2]

The field originated in the mid-20th century with early acoustic-phonetic approaches in the 1960s, which segmented speech into phonemes using rule-based spectral analysis, but these were limited by high error rates and computational demands.^[2] By the 1970s and 1980s, pattern-matching techniques, particularly Hidden Markov Models (HMMs) combined with Gaussian Mixture Models (GMMs), became dominant, enabling speaker-independent recognition for small to medium vocabularies (e.g., 1,000–5,000 words) with word error rates (WER) as low as 3–5% in controlled tasks like resource management dialogues.^[2] These hybrid GMM-HMM systems powered initial commercial applications, such as dictation software from companies like Dragon and IBM in the 1990s.^[1]

Advancements in the 2010s shifted ASR toward data-driven deep learning paradigms, replacing traditional hybrid models with end-to-end neural architectures like recurrent neural networks (RNNs), connectionist temporal classification (CTC), and attention-based models, which directly optimize transcription from raw audio to text.^[3] The introduction of Transformer-based models, such as those in wav2vec 2.0 and Conformer architectures (as of 2020), further reduced WER by up to 36% relative to prior baselines on benchmarks like LibriSpeech (e.g., achieving 2.6% WER on clean test sets), even in noisy or low-resource scenarios, with state-of-the-art now below 1.2% as of 2025.^[4] Recent innovations incorporate federated learning for privacy-preserving training on distributed data and deep reinforcement learning to refine decoding, enabling robust performance across dialects, accents, and spontaneous speech with WERs below 5% in many real-world applications. Models like OpenAI's Whisper (2022) have advanced multilingual and robust ASR, achieving near-human performance on diverse datasets.^[4]^[5]

Today, ASR underpins diverse applications, including virtual assistants like Siri and Alexa, real-time captioning for accessibility, medical transcription, and multilingual translation systems, though challenges persist in handling out-of-vocabulary words, code-switching, and adverse environments. Ongoing research focuses on multimodal integration (e.g., combining audio with visual cues) and efficient deployment on edge devices, promising broader adoption in healthcare, automotive, and education sectors.^[4]

Fundamentals

Definition and core concepts

Speech recognition, also known as automatic speech recognition (ASR) or speech-to-text (STT), is an interdisciplinary field in computer science and signal processing that enables machines to identify and interpret spoken language from audio signals, converting them into readable text or executable commands.^[6] This process mimics human auditory perception by analyzing acoustic patterns to transcribe speech accurately, often integrating with natural language processing (NLP) to derive meaning or intent from the recognized text.^[7] At its core, speech recognition deals with the fundamental units of spoken language: phonemes, which are the smallest distinct sound units (typically lasting 80 ms on average, with variations from 10–200 ms); words, formed by sequences of phonemes that convey meaning; and utterances, which are complete segments of continuous speech carrying semantic content.^[8] These elements form the building blocks for modeling how speech is produced and perceived, allowing systems to map audio inputs to linguistic outputs.^[9]

A key distinction in speech recognition systems lies between speaker-dependent and speaker-independent approaches. Speaker-dependent systems are trained on data from a specific individual, achieving higher accuracy for that user but requiring personalized enrollment, whereas speaker-independent systems generalize across multiple speakers using diverse training data, though they demand larger datasets to account for variations in accents, pitch, and speaking styles.^[8] Similarly, systems differ in handling isolated versus continuous speech: isolated speech recognition processes discrete words or phrases separated by pauses, simplifying the task by avoiding overlaps; continuous speech recognition, in contrast, manages natural, fluid speech where words blend due to co-articulation, posing greater challenges in segmenting and decoding boundaries.^[9] These classifications influence system design, with isolated and speaker-dependent setups often serving as entry points for simpler applications like command interfaces.^[10]

The basic workflow of a speech recognition system begins with audio capture, where a microphone records the raw speech signal as a time-varying waveform.^[9] This is followed by feature extraction, which transforms the signal into compact representations suitable for analysis; a common technique uses Mel-frequency cepstral coefficients (MFCCs), derived by applying a Fourier transform to short frames (e.g., 20 ms) of audio, passing the resulting power spectrum through Mel-scale filter banks that mimic human hearing, taking the logarithm of the band energies, and applying a discrete cosine transform to obtain cepstral coefficients that summarize the spectral envelope. A typical configuration stacks 13 static coefficients with their first- and second-order time derivatives to form 39-dimensional vectors.^[8] Finally, decoding integrates these features with acoustic models (mapping sounds to phonemes) and language models (predicting word sequences) to output the most probable transcription, often employing search algorithms to navigate possible interpretations efficiently.^[9] This pipeline establishes the foundational mechanism for converting spoken utterances into actionable text, underpinning applications from voice assistants to transcription services.^[6]
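As a rough illustration of the MFCC feature-extraction step described above, the following Python sketch computes 39-dimensional vectors from a synthetic tone. Several details are assumptions chosen for the example rather than taken from the text: the 8 kHz sample rate, 10 ms hop, Hamming window, 20 Mel filters, the common 2595·log10(1 + f/700) Mel formula, the logarithm and discrete cosine transform steps, and central-difference derivatives. All function names are illustrative, and a naive DFT stands in for an FFT library so the code stays self-contained.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hamming-windowed frame via a naive DFT."""
    n = len(frame)
    win = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
           for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(win))) ** 2 / n
            for k in range(n // 2 + 1)]

def log_mel_energies(spec, sample_rate, frame_len, num_filters):
    """Log energies of triangular filters spaced evenly on the Mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    out = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        energy = 0.0
        for k, p in enumerate(spec):
            f = k * sample_rate / frame_len          # frequency of DFT bin k
            w = min((f - lo) / (mid - lo), (hi - f) / (hi - mid))
            energy += max(0.0, w) * p                # triangular weight, zero outside
        out.append(math.log(energy + 1e-10))         # small floor avoids log(0)
    return out

def dct2(values, num_coeffs):
    """First num_coeffs coefficients of the (unnormalized) DCT-II."""
    m = len(values)
    return [sum(v * math.cos(math.pi * n * (i + 0.5) / m) for i, v in enumerate(values))
            for n in range(num_coeffs)]

def deltas(seq):
    """Central-difference time derivative, clamping indices at the edges."""
    last = len(seq) - 1
    return [[(a - b) / 2.0 for a, b in zip(seq[min(t + 1, last)], seq[max(t - 1, 0)])]
            for t in range(len(seq))]

def mfcc_39(signal, sample_rate=8000, frame_ms=20, hop_ms=10,
            num_filters=20, num_coeffs=13):
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    static = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = power_spectrum(signal[start:start + frame_len])
        energies = log_mel_energies(spec, sample_rate, frame_len, num_filters)
        static.append(dct2(energies, num_coeffs))
    d1 = deltas(static)
    d2 = deltas(d1)
    # 13 static + 13 delta + 13 delta-delta coefficients per frame
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# 0.1 s of a 440 Hz tone sampled at 8 kHz (synthetic test input)
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
features = mfcc_39(signal)
print(len(features), len(features[0]))
```

With these settings the 800-sample signal yields 9 overlapping frames, each described by a 39-dimensional vector. Production systems would typically rely on an optimized signal-processing library instead of this hand-rolled version.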

System architecture and components

Speech recognition systems generally operate through a modular pipeline that transforms raw audio input into textual output, encompassing signal processing, acoustic modeling, language modeling, and decoding stages. This architecture enables the system to handle the variability in spoken language by breaking down the recognition process into interdependent components. The pipeline begins with capturing audio via a microphone or other input device, followed by preprocessing to prepare the signal for analysis, and proceeds through modeling and search mechanisms to generate the most likely transcription.^[11]

The initial stage involves audio input and preprocessing, where raw speech signals from a microphone are digitized and cleaned. Preprocessing includes noise reduction techniques to suppress background interference, endpoint detection for segmenting speech from silence, and normalization to adjust for volume variations, ensuring robust handling of real-world audio conditions. Feature extraction then converts the preprocessed waveform into compact representations, such as spectral coefficients, that capture phonetic content while reducing dimensionality for efficient modeling.^[12]

Subsequent components focus on modeling the acoustic and linguistic aspects of speech. Acoustic or phonetic models estimate the probability of phonetic units given the extracted features, often using statistical frameworks like hidden Markov models to represent temporal sequences of sounds. A pronunciation lexicon or dictionary maps words to their phonetic transcriptions, bridging acoustic outputs to vocabulary items and accommodating variations in pronunciation. The language model, typically an n-gram or neural network-based probabilistic model, incorporates grammatical and contextual constraints to score word sequences, favoring fluent and semantically coherent hypotheses.^[13]^[14]^[11]

The final decoding stage integrates these models to search for the optimal transcription, employing algorithms like the Viterbi method to find the most probable word sequence by maximizing the joint probability from acoustic, lexicon, and language scores. This search often uses dynamic programming to efficiently explore hypotheses within a finite-state transducer framework, balancing accuracy and computational cost.^[13]

Traditional hybrid architectures separate these components for independent training and optimization, allowing specialization but requiring careful integration during decoding. In contrast, pure end-to-end architectures map raw audio directly to text using a single neural network, simplifying the pipeline by jointly learning acoustic, lexical, and linguistic features, though they may demand larger datasets for comparable performance. Hidden Markov models serve as a foundational tool in acoustic modeling within hybrid systems.^[11]^[15]

For real-time applications, hardware acceleration plays a critical role, with graphics processing units (GPUs) enabling parallel computation of neural network layers in both hybrid and end-to-end models, achieving transcription speeds thousands of times faster than real-time on large-scale audio. This GPU utilization supports both low-latency processing in devices like smart assistants and scalable deployment in cloud environments.^[16]
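The decoding stage described in this section, where acoustic and language-model scores are combined to pick the most probable word sequence, can be illustrated with a toy rescoring sketch. Everything concrete here is an assumption made up for the example: the bigram probabilities, the acoustic log-likelihoods, the floor value for unseen bigrams, and the `lm_weight` scaling factor. A real decoder searches a far larger hypothesis space than two hand-written candidates.

```python
import math

# Hypothetical bigram probabilities; "<s>" marks the sentence start.
BIGRAM_LOGP = {
    ("<s>", "recognize"): math.log(0.02),
    ("recognize", "speech"): math.log(0.30),
    ("<s>", "wreck"): math.log(0.001),
    ("wreck", "a"): math.log(0.20),
    ("a", "nice"): math.log(0.05),
    ("nice", "beach"): math.log(0.01),
}
UNSEEN_LOGP = math.log(1e-6)  # assumed floor for bigrams missing from the table

def lm_log_prob(words):
    """Log-probability of a word sequence under the toy bigram model."""
    prev, total = "<s>", 0.0
    for w in words:
        total += BIGRAM_LOGP.get((prev, w), UNSEEN_LOGP)
        prev = w
    return total

def best_hypothesis(candidates, lm_weight=1.0):
    """Pick the (words, acoustic_log_likelihood) pair with the highest combined score."""
    return max(candidates, key=lambda c: c[1] + lm_weight * lm_log_prob(c[0]))

# Two acoustically similar candidates with made-up acoustic scores.
candidates = [
    (["wreck", "a", "nice", "beach"], -118.0),
    (["recognize", "speech"], -120.0),
]
best_words, _ = best_hypothesis(candidates)
acoustic_only, _ = best_hypothesis(candidates, lm_weight=0.0)
print(" ".join(best_words))
```

In this example the acoustic score alone slightly prefers the first candidate, while adding the language-model score favors the more plausible word sequence, which is the role the language model plays during decoding.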

History

Early foundations (pre-1970)

The foundations of speech recognition trace back to 19th-century innovations in sound recording and visualization, which enabled the scientific study of speech acoustics. In 1857, French inventor Édouard-Léon Scott de Martinville developed the phonautograph, a device that captured sound waves as graphical traces on smoked glass or paper, producing the first known visualizations of human speech without playback capability.^[17] This instrument laid groundwork for later acoustic analysis by representing speech as waveforms. Two decades later, in 1877, Thomas Edison invented the phonograph, the first practical device to both record and reproduce sound using a tinfoil-wrapped cylinder, allowing researchers to capture and replay spoken words for repeated examination.^[18] Edison's phonograph shifted focus toward practical audio preservation, influencing early experiments in speech transmission and pattern study at institutions like Bell Laboratories.^[19] By the 1930s and 1940s, advancements in acoustic instrumentation propelled speech analysis forward, particularly through the invention of the sound spectrograph at Bell Laboratories. In 1941, Ralph K. Potter and colleagues developed this device, which converted audio signals into time-frequency spectrograms—visual displays showing speech energy distribution across frequencies over time—initially for military applications during World War II.^[20] The spectrograph, refined in subsequent publications, revealed key speech features like formants, the resonant frequencies that distinguish vowels and consonants.^[21] Pioneering researchers such as Harvey Fletcher, a physicist at Bell Labs, advanced formant analysis in his seminal 1929 book Speech and Hearing, where he described formants as critical acoustic cues for speech intelligibility, based on experiments measuring vowel resonances.^[22] Fletcher's work emphasized how formants could be isolated for transmission, influencing early conceptual models of speech decoding. 
Similarly, Franklin S. Cooper at Haskins Laboratories contributed to spectrogram-based research in the 1950s, developing the Pattern Playback synthesizer to test human perception of hand-painted spectrograms mimicking speech sounds, demonstrating that formant transitions could convey phonetic information.^[23] A landmark experimental system emerged in 1952 with Bell Laboratories' AUDREY (Automatic Digit Recognizer), the first functional speech recognition device. Designed by K. H. Davis, R. Biddulph, and S. Balashek, AUDREY used analog electronics and pattern-matching techniques—drawing from signal processing methods akin to those in radar for waveform comparison—to identify spoken digits (0–9) from a single speaker at normal rates over telephone lines.^[24] It achieved 98–99% accuracy in quiet conditions by correlating input signals against stored templates of the speaker's utterances, segmented into phonetic components like formants and bursts.^[24] However, AUDREY's scope was severely limited by hardware constraints: it required a room-sized rack of vacuum tubes and relays, operated offline without real-time processing, and performed poorly (dropping to 70–80% accuracy) with unfamiliar speakers or noisy environments.^[25] These early efforts highlighted emerging acoustic modeling concepts, where speech was treated as analyzable patterns of frequency and amplitude, though practical recognition remained confined to isolated digits or words.^[26]

Development era (1970–1990)

The 1970s marked a pivotal shift in speech recognition from rule-based acoustic analysis to statistical pattern-matching methods, driven by advances in computing power and substantial government funding amid Cold War-era priorities for military and intelligence applications. The U.S. Department of Defense, through the Advanced Research Projects Agency (ARPA), initiated the Speech Understanding Research (SUR) program in 1971, allocating millions to develop systems capable of understanding continuous speech with a 1,000-word vocabulary at 90% accuracy for a specific speaker.^[27] This five-year effort funded multiple research teams at institutions like Carnegie Mellon University (CMU), Bolt Beranek and Newman (BBN), and Stanford Research Institute, fostering interdisciplinary collaboration on acoustic modeling, linguistic constraints, and search algorithms.^[27]

A landmark outcome of the SUR program was CMU's Harpy system, completed in 1976, which achieved the program's ambitious goals by recognizing 1,011 words in connected speech using a network of 500 phoneme-like units and innovative beam search techniques to prune computational complexity.^[27] Harpy employed template matching with the Itakura distance metric for acoustic comparison, demonstrating feasibility for practical deployment in constrained domains like air traffic control.^[27]

Key technical advancements during this era included Dynamic Time Warping (DTW), a nonlinear alignment algorithm for comparing variable-length speech patterns against templates, originally proposed in the late 1960s but widely adopted in the 1970s for isolated word recognition.^[27] By the early 1980s, Hidden Markov Models (HMMs) emerged as a foundational statistical framework, pioneered by IBM researchers like Lalit Bahl and Frederick Jelinek, to model the probabilistic sequences of acoustic states underlying speech sounds.^[27] DARPA's continued sponsorship in the 1980s built on SUR successes, funding projects that scaled to larger vocabularies and speaker-independent recognition, though persistent challenges with continuous speech (such as coarticulation effects, speaker variability, and environmental noise) limited real-world robustness.^[27]

Commercial efforts paralleled these initiatives. IBM, building on its 1962 Shoebox prototype (a discrete-command recognizer), established a dedicated Continuous Speech Recognition Group in the early 1970s, leading to speaker-dependent systems for dictation by the mid-1980s.^[28] Internationally, Japan advanced pattern recognition techniques, with institutions like Kyoto University and NEC developing hardware-based vowel and phoneme recognizers in the 1970s, emphasizing segment-based approaches that influenced global standards for isolated utterance processing.^[27] These efforts highlighted the era's emphasis on statistical rigor over deterministic rules, laying groundwork for broader adoption despite computational constraints.^[29]

Commercialization and expansion (1990–2010)

The 1990s marked a pivotal shift in speech recognition from research prototypes to viable commercial products, driven by advancements in computational power and statistical modeling that enabled continuous speech dictation for general consumers. Dragon NaturallySpeaking, released in 1997 by Dragon Systems, became the first widely accessible consumer dictation software, allowing users to speak naturally into a microphone and convert speech to text with a vocabulary of up to 30,000 words and accuracy rates approaching 95% after user training.^[30] Similarly, IBM ViaVoice, launched in 1997, offered speaker-independent recognition for personal computers, supporting dictation and command control with improved handling of continuous speech, though it required initial enrollment for optimal performance.^[31] These tools democratized speech input, transitioning the technology from specialized hardware to software integrated with Windows operating systems, and spurred market growth as processing speeds allowed real-time transcription.^[30] Entering the 2000s, speech recognition expanded into mobile and multilingual applications, leveraging hybrid models and government-funded initiatives to enhance robustness and scalability. 
Google's voice search feature, introduced in 2008 on Android devices and the iPhone Google Mobile App, enabled hands-free querying by transmitting audio to cloud servers for processing, marking an early integration of speech recognition with mobile ecosystems and achieving functional accuracy for short phrases in English.^[32] Concurrently, the DARPA Global Autonomous Language Exploitation (GALE) program, initiated in 2006, advanced speech-to-text translation for Arabic and Chinese, aiming to process broadcast news and conversational audio with integrated recognition and machine translation pipelines to support military intelligence needs.^[33] These developments highlighted the technology's potential beyond desktops, fostering investments in portable and cross-lingual systems. A key technical milestone during this era was the widespread adoption of hidden Markov model-Gaussian mixture model (HMM-GMM) hybrids, which became the dominant acoustic modeling approach by the mid-1990s, combining probabilistic state transitions with density estimation to better capture phonetic variations in large-vocabulary continuous speech recognition.^[34] This led to significant word error rate (WER) reductions, with systems achieving approximately 10% WER on clean, read speech by the late 2000s, a marked improvement from over 30% in the early 1990s, primarily through refined feature extraction and larger training corpora.^[35] However, persistent challenges included vocabulary constraints for domain-specific terms, often limited to 50,000-100,000 words in commercial systems, and difficulties in handling accents and dialects, which could increase WER by 20-50% due to insufficient diverse training data.^[34] Early cloud-based services, such as those pioneered by Nuance Communications in the early 2000s for automated call centers, began addressing these by offloading computation to servers, enabling scalable recognition for telephony applications like customer service dialogues.^[36]

Modern era (2010–present)

The modern era of speech recognition, beginning in the 2010s, marked a paradigm shift driven by deep neural networks (DNNs), which largely supplanted Gaussian mixture models (GMMs) in acoustic modeling due to their superior ability to capture complex patterns in speech data. Early breakthroughs included DNN-hybrid systems that achieved substantial error rate reductions on benchmarks like Switchboard, with relative improvements of 10–30% over GMM-HMM baselines.^[37]^[38] A pivotal advancement came in 2014 when Baidu introduced Deep Speech, an end-to-end deep learning system that mapped audio spectrograms directly to text, attaining word error rates (WER) competitive with human transcribers on English datasets and demonstrating scalability through massive GPU training.^[39] This period also saw widespread commercialization, exemplified by Apple's launch of Siri in 2011 as an integrated voice assistant on the iPhone 4S, leveraging cloud-based speech recognition to enable natural language interactions and sparking consumer adoption of voice interfaces.^[40] Similarly, Amazon's Alexa debuted in 2014 with the Echo device, incorporating far-field speech recognition for hands-free control in home environments and rapidly expanding to millions of users.^[41]^[42]

Entering the 2020s, speech recognition evolved toward end-to-end architectures powered by transformer models, which enabled direct mapping from audio to text without intermediate phonetic representations, further reducing latency and errors. A landmark was OpenAI's Whisper in 2022, a multilingual model trained on 680,000 hours of diverse audio that achieved robust performance across 99 languages, with WERs as low as 3–5% on clean English benchmarks like LibriSpeech.^[43] Integration with large language models (LLMs) enhanced contextual understanding, allowing systems to correct ASR errors through semantic reranking and generate more coherent transcripts, as seen in hybrid frameworks that improved accuracy by 15–20% on ambiguous utterances.^[44] Real-time multilingual models also advanced, with open benchmarks showing systems like those on Hugging Face's ASR Leaderboard supporting low-latency transcription in over 50 languages, often under 100 ms delay for streaming applications.^[45] Key milestones included achieving WER below 5% on challenging English corpora such as Switchboard, even in oracle-free setups, and improved handling of noisy and accented speech through data augmentation and self-supervised learning, yielding 20–30% relative gains in adverse conditions like restaurants or diverse accents.^[46]^[47]

From 2023 to 2025, innovations in speech language models (SpeechLMs) introduced direct tokenization of audio waveforms into discrete units compatible with LLMs, enabling generative approaches for tasks like zero-shot synthesis and recognition, as in models like AudioLM that preserved long-range dependencies in audio sequences.^[48]^[49] Improvements in disordered speech recognition addressed accessibility, with specialized training on dysarthric datasets reducing WER by up to 30% for conditions like Parkinson's, making voice interfaces viable for users with motor speech impairments.^[47] Open-source efforts, such as Mozilla's DeepSpeech, first released in 2017 with active development winding down in the early 2020s, democratized access by providing embeddable end-to-end models trainable on custom data, influencing community-driven advancements in offline ASR.^[50] The market for speech recognition technologies expanded rapidly, projected to reach $23.11 billion by 2030, fueled by integrations in consumer devices, enterprise automation, and AI assistants.^[51]

Technologies and Methods

Traditional statistical models

Traditional statistical models in speech recognition primarily rely on probabilistic frameworks to model the temporal and acoustic variations in spoken language, predating the dominance of deep learning approaches. These methods treat speech as a sequence of observable features generated by hidden processes, using algorithms to align sequences, estimate probabilities, and decode the most likely utterance. Key techniques emerged in the 1970s and 1980s, forming the backbone of systems for isolated word recognition and evolving into frameworks for continuous speech.^[52]^[13] Dynamic Time Warping (DTW) was an early algorithm for aligning speech sequences of varying lengths, essential for comparing an input utterance against reference templates in isolated word recognition. It computes the optimal nonlinear alignment by minimizing the cumulative distance between feature sequences s and t, allowing for time compressions or expansions to handle speaking rate differences. The DTW distance is defined recursively as:

\text{DTW}(i,j) = \min \left[ \text{DTW}(i-1,j), \ \text{DTW}(i,j-1), \ \text{DTW}(i-1,j-1) \right] + \text{dist}(s_i, t_j)

with boundary conditions \text{DTW}(i,0) = \text{DTW}(0,j) = \infty and \text{DTW}(0,0) = 0, where \text{dist} is typically Euclidean distance between feature vectors. This dynamic programming approach enabled robust matching despite duration variability, achieving practical performance in early systems like those for digit recognition.^[52] Hidden Markov Models (HMMs) extended these ideas by modeling speech as a Markov process with hidden states representing phonetic units, such as phonemes, and observable acoustic features emitted from those states. Each state has transition probabilities to subsequent states, capturing the sequential nature of speech, while emission probabilities model the likelihood of observing a feature vector given the state. For decoding, the Viterbi algorithm finds the most probable state sequence Q^* = \arg\max_Q P(Q|O, \lambda) using dynamic programming:

\delta_t(j) = \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = j, o_1, \dots, o_t | \lambda) = \left[ \max_i \delta_{t-1}(i) a_{ij} \right] b_j(o_t),

with backtracking to recover the path, where a_{ij} are transition probabilities and b_j(o_t) is the emission probability. Training involves the forward-backward algorithm to compute posterior probabilities, followed by Baum-Welch re-estimation to iteratively maximize the likelihood P(O|\lambda) via expectation-maximization, updating transitions, emissions, and initial probabilities. HMMs proved foundational for handling continuous speech with left-to-right topologies modeling phoneme durations.^[13] To model the continuous acoustic features more flexibly, Gaussian Mixture Models (GMMs) were integrated as emission densities in HMMs, representing the probability distribution of feature vectors (e.g., mel-frequency cepstral coefficients) for each state as a weighted sum of Gaussians. The likelihood is given by:

p(\mathbf{x}|\lambda) = \sum_{k=1}^K w_k \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),

where w_k are mixture weights (\sum_k w_k = 1), \mathcal{N} is the Gaussian density with mean \boldsymbol{\mu}_k and covariance \boldsymbol{\Sigma}_k, and K (typically 8–32) is the number of mixture components, which lets each state capture multimodal distributions even with limited training data. Parameters are re-estimated using the joint state-and-component posteriors \gamma_t(j,k) = P(q_t = j, m_t = k | O, \lambda). For a fixed state j, abbreviating \gamma_t(j,k) as \gamma_t(k), the updates are \bar{w}_k = \frac{\sum_t \gamma_t(k)}{\sum_t \sum_k \gamma_t(k)}, \bar{\boldsymbol{\mu}}_k = \frac{\sum_t \gamma_t(k) \mathbf{x}_t}{\sum_t \gamma_t(k)}, and similarly for the covariances, which are often constrained to be diagonal to reduce complexity. This combination addressed the limitations of discrete HMMs, improving accuracy on real-world acoustic variability.^[13]

The hybrid HMM-GMM architecture became the standard for large vocabulary continuous speech recognition (LVCSR), scaling to thousands of words by chaining context-dependent phoneme models with n-gram language models for disambiguation. In LVCSR systems, triphone HMMs (modeling phonemes influenced by neighbors) with GMM emissions enabled recognition of fluent speech, as demonstrated in benchmarks achieving word error rates below 20% on read news corpora with 64,000-word vocabularies in the 1990s. These models laid the groundwork for subsequent neural network-based approaches that enhanced feature extraction and modeling capacity.^[13]
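The DTW recurrence and the Viterbi recursion given earlier in this section can both be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production decoder: the DTW example uses scalar features with absolute difference as the distance (the one-dimensional case of Euclidean distance), the Viterbi example works in the log domain to avoid underflow, and the two-state left-to-right model with its probabilities is invented for the demonstration. The function names are illustrative.

```python
import math

def dtw_distance(s, t, dist=lambda a, b: abs(a - b)):
    """Cumulative DTW distance with DTW(0,0)=0 and infinite borders."""
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(s[i - 1], t[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def viterbi(obs, log_init, log_trans, log_emit):
    """Most probable state path; log_trans[i][j] is log a_ij, log_emit[j][o] is log b_j(o)."""
    n_states = len(log_init)
    delta = [log_init[j] + log_emit[j][obs[0]] for j in range(n_states)]
    backptrs = []
    for o in obs[1:]:
        prev = delta
        # Best predecessor i for each state j: argmax_i delta_{t-1}(i) + log a_ij
        ptr = [max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
               for j in range(n_states)]
        delta = [prev[ptr[j]] + log_trans[ptr[j]][j] + log_emit[j][o]
                 for j in range(n_states)]
        backptrs.append(ptr)
    state = max(range(n_states), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backptrs):   # backtracking to recover the path
        state = ptr[state]
        path.append(state)
    return path[::-1], max(delta)

def safe_log(p):
    return math.log(p) if p > 0 else -math.inf

# Time-stretched copy of the same pattern aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))

# Hypothetical two-state left-to-right HMM with observation symbols 0 and 1.
log_init = [safe_log(p) for p in (1.0, 0.0)]
log_trans = [[safe_log(p) for p in row] for row in ((0.6, 0.4), (0.0, 1.0))]
log_emit = [[safe_log(p) for p in row] for row in ((0.9, 0.1), (0.2, 0.8))]
path, log_p = viterbi([0, 0, 1, 1], log_init, log_trans, log_emit)
print(path)
```

Both routines fill a table one cell at a time from previously computed cells, which is the dynamic programming structure the text describes; the Viterbi version additionally stores back-pointers so the best state sequence can be read off at the end.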

Neural network-based approaches

Neural network-based approaches in speech recognition leverage deep learning architectures to learn hierarchical representations from acoustic features, surpassing the limitations of traditional statistical models by automatically extracting relevant patterns without explicit hand-crafted features. These methods emerged prominently in the early 2010s, integrating neural networks with existing hidden Markov model (HMM) frameworks to improve acoustic modeling. Deep feedforward neural networks (DNNs) were among the first to demonstrate substantial gains, serving as classifiers that map input features to phonetic states or senones in hybrid systems.^[53] In DNNs, bottleneck layers play a crucial role in feature compression, where a narrow hidden layer—typically with fewer neurons than the input or output layers—forces the network to learn compact, discriminative representations of the acoustic data. This dimensionality reduction aids in mitigating overfitting and enhancing generalization, particularly when tandem-connected with HMMs in the hybrid HMM-DNN architecture. The hybrid setup treats the DNN as a probabilistic classifier over context-dependent HMM states, replacing Gaussian mixture models (GMMs) for emission probabilities, which led to relative word error rate (WER) reductions of up to 30% on benchmarks like Switchboard in initial implementations.^[53] Recurrent neural networks (RNNs) extend feedforward architectures to handle the sequential nature of speech, processing time-dependent inputs through recurrent connections that maintain a hidden state across frames. Standard RNNs, however, suffer from vanishing gradients during backpropagation through time, hindering learning of long-range dependencies in utterances. 
To address this, long short-term memory (LSTM) units incorporate gating mechanisms: the forget gate (sigmoid activation) decides information retention from prior states, the input gate (sigmoid) and candidate values (tanh activation) control new information addition, and the output gate (sigmoid) modulates the cell state for the hidden output. These components enable LSTMs to preserve gradients over extended sequences, making them suitable for modeling temporal dynamics in speech.^[54] A key advancement for training RNNs on unaligned speech data is connectionist temporal classification (CTC), which enables alignment-free optimization by marginalizing over all possible monotonic paths between input sequences and label outputs. The CTC loss function is defined as

\mathcal{L} = -\log P(\mathbf{y} | \mathbf{x}) = -\log \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} P(\pi | \mathbf{x}),

where \mathbf{x} is the input acoustic sequence, \mathbf{y} is the target label sequence, \pi represents a path over extended labels (including blanks), and \mathcal{B} collapses repeated labels and removes blanks to yield valid alignments. This approach, often paired with LSTMs, eliminates the need for forced alignments during training, simplifying the pipeline for sequential labeling in speech recognition.^[54] Bidirectional RNNs (BRNNs), particularly bidirectional LSTMs, further enhance context utilization by processing input sequences in both forward and backward directions, allowing each time step to access information from the entire utterance. This bidirectional context proves especially effective for acoustic modeling, as it captures dependencies across the full audio span without assuming causality. In the early 2010s, BRNN-based systems achieved notable WER reductions, such as 10–15% on challenging datasets like TIMIT and Switchboard, outperforming unidirectional counterparts by incorporating global utterance information.^[55]^[56]
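To make the CTC definition above concrete, the following Python sketch implements the collapsing map \mathcal{B} and computes P(\mathbf{y} | \mathbf{x}) in two ways: by brute-force enumeration of every path, and by the standard forward recursion over the blank-extended target. The per-frame probability table is made up for illustration, label index 0 is assumed to be the blank, and the function names are hypothetical. A real system would compute this in the log domain over network outputs.

```python
import itertools
import math

BLANK = 0  # assumed index of the blank label

def collapse(path):
    """The mapping B: merge repeated labels, then remove blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            out.append(p)
        prev = p
    return out

def ctc_prob_bruteforce(probs, target):
    """Sum P(pi | x) over every path pi with B(pi) == target (exponential cost)."""
    num_labels = len(probs[0])
    total = 0.0
    for path in itertools.product(range(num_labels), repeat=len(probs)):
        if collapse(path) == list(target):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total

def ctc_prob_forward(probs, target):
    """Same quantity via the forward recursion over the blank-extended target."""
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][BLANK]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s, label in enumerate(ext):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the intermediate blank is allowed only between different labels.
            if s >= 2 and label != BLANK and label != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][label]
        alpha = new
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)

# Made-up per-frame distributions over (blank, label 1, label 2) for T = 3 frames.
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.5, 0.3],
         [0.3, 0.2, 0.5]]
target = [1, 2]
p_brute = ctc_prob_bruteforce(probs, target)
p_forward = ctc_prob_forward(probs, target)
loss = -math.log(p_forward)
print(p_brute, p_forward, loss)
```

The brute-force version mirrors the sum in the loss definition directly, while the forward recursion obtains the same value in time proportional to the number of frames times the target length, which is what makes CTC training practical.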

End-to-end and transformer models

End-to-end automatic speech recognition (ASR) systems represent a paradigm shift by directly mapping raw audio waveforms or acoustic features to text sequences without relying on intermediate phonetic or pronunciation models, enabling joint optimization of all components during training. These models typically employ recurrent neural networks (RNNs) combined with connectionist temporal classification (CTC) loss to handle variable-length inputs and alignments implicitly. A seminal example is Deep Speech, introduced in 2014, which uses a deep RNN architecture trained end-to-end on large-scale audio-text pairs to achieve competitive performance on English speech recognition tasks.^[15] Transformer architectures have since revolutionized end-to-end ASR by replacing recurrent layers with self-attention mechanisms, allowing for parallel processing of sequences and better capture of long-range dependencies in speech signals. The core self-attention operation computes weighted sums of values based on query-key similarities, formulated as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

where Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the keys; dividing by \sqrt{d_k} keeps the dot products from growing with that dimension. Transformers incorporate positional encodings to inject sequence order information into the input embeddings, enabling an encoder-decoder structure that processes audio features through stacked self-attention and feed-forward layers. This non-recurrent design facilitates efficient training on GPUs and has been adapted for ASR in models like Speech-Transformer, which applies the full encoder-decoder framework directly to acoustic sequences for sequence-to-sequence prediction.

In the 2020s, hybrid extensions of transformers have further enhanced end-to-end ASR performance. Conformer models integrate convolutional modules with transformer blocks to model both local spectral patterns and global temporal contexts, stacking feed-forward, self-attention, and convolution sublayers within each encoder layer for improved representation learning from acoustic features. Meanwhile, OpenAI's Whisper, released in 2022, employs a transformer-based encoder-decoder that takes log-Mel spectrograms as input and uses a multilingual tokenizer, processing diverse languages through multitask training on weakly supervised web-scale data to generate transcriptions directly from audio.^[57]^[43]

One key advantage of end-to-end and transformer-based models is the simplification of system design, reducing the need for hand-engineered components like pronunciation lexicons or separate acoustic and language models, which streamlines development and deployment. Recent advances from 2023 to 2025 have leveraged transfer learning from high-resource to low-resource languages, enabling robust ASR in underrepresented languages through pre-training on massive multilingual datasets and fine-tuning with limited target data, as demonstrated in projects scaling to over 1,000 languages.^[58]
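The scaled dot-product attention formula in this section can be sketched in plain Python. This is a minimal single-head illustration using lists of rows. The toy matrices are made up for the example, and real systems use batched tensor libraries and multiple heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q is (n_queries x d_k), K is (n_keys x d_k), V is (n_keys x d_v),
    each given as a list of rows.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return output


# Toy example: 2 query frames attending over 3 key/value frames.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
```

Each output row is a convex combination of the value rows, so every frame's representation can draw on the entire sequence in one step. This is what allows parallel processing in place of recurrence.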

Multilingual and robust recognition techniques

Multilingual speech recognition systems leverage shared embedding spaces to enable zero-shot learning, allowing models to generalize to unseen languages without explicit training data for each one. This approach involves pre-training on high-resource languages to create language-agnostic representations that capture phonetic and semantic similarities across languages. For instance, the SENSE framework uses shared embeddings for multilingual speech and text processing, demonstrating effective zero-shot performance on diverse language pairs.^[59] Similarly, fine-tuning models like Whisper in a shared embedding space has shown zero-shot cross-lingual transfer capabilities in speech translation tasks. These techniques are particularly valuable for low-resource languages, where direct training data is scarce.

Handling code-switching, where speakers alternate between languages within utterances, is crucial for natural multilingual interactions, such as in English-Spanish bilingual conversations. Unified models that incorporate concatenated tokenizers and linguistic constraints, like part-of-speech labeling, improve recognition accuracy by modeling intra- and inter-sentence switches. For English-Spanish code-switching, end-to-end approaches have been developed to generate transcripts that preserve the mixed-language structure, achieving robust performance on conversational datasets. These methods typically adapt an end-to-end base model to handle such variability.

To enhance robustness against noise and acoustic variability, data augmentation techniques such as speed perturbation and noise injection are widely employed during training. Speed perturbation resamples audio so that it plays slightly faster or slower (e.g., by ±10%), simulating variations in speaking rate; because it works by resampling, it shifts pitch along with tempo. This helps models generalize to real-world speech rates.
Noise injection adds environmental sounds like background chatter or echoes to clean audio, fostering noise-invariant representations that reduce error rates in adverse conditions. These augmentations have been shown to significantly improve model performance on noisy benchmarks by increasing dataset diversity without requiring additional real recordings. Recent advances from 2023 to 2025 have focused on disordered speech recognition, particularly for dysarthria, using fine-tuned transformer models to accommodate atypical articulation patterns. Personalized fine-tuning with speaker-specific vectors and synthetic speech augmentation has reduced character error rates from over 36% in zero-shot settings to as low as 7.3% on dysarthric datasets.^[60] Transformer-based frameworks like Swin transformers and UTran-DSR, when fine-tuned on artificially generated dysarthric speech, have achieved up to 81.8% word error rate reductions by capturing idiosyncratic speech characteristics.^[61] These developments emphasize iterative pseudo-labeling and controllable synthesis to address data scarcity in clinical applications. Accent adaptation techniques, such as adversarial training, mitigate performance drops due to speaker accents by learning domain-invariant features. Domain adversarial neural networks train the model to minimize accent-specific discrepancies while maximizing recognition accuracy, effectively transferring knowledge from neutral to accented speech. For example, adversarial transfer learning has been applied to end-to-end systems, enforcing intermediate representations that are invariant across accents like Indian-English or non-native variants. This approach has demonstrated substantial improvements in word error rates on accented test sets without requiring accent-specific data. Transfer learning from high-resource to low-resource languages bridges data gaps by pre-training on abundant corpora and fine-tuning on limited target data. 
Strategies like transliterating high-resource text to match low-resource phonetics enable effective knowledge transfer, outperforming traditional methods on unseen languages. Cross-lingual approaches, including multilingual meta-transfer learning, further enhance this by optimizing for rapid adaptation, achieving notable gains in automatic speech recognition for under-resourced scenarios.

Federated learning supports privacy-preserving updates in multilingual and robust recognition by training models across decentralized devices without sharing raw audio data. This technique aggregates model updates (such as gradients) from user devices, protecting sensitive speech patterns while enabling continuous improvement. In dysarthric and elderly speech contexts, regularized federated learning has been applied to maintain performance while preserving privacy, reducing risks associated with centralized data collection. Such methods are increasingly adopted in commercial systems to handle diverse, user-specific variations securely.
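The speed perturbation and noise injection augmentations described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: the linear-interpolation resampler, the function names, the 16 kHz rate, and the 10 dB target are all choices made for the example, and production pipelines typically use a dedicated resampling library and recorded noise.

```python
import math
import random


def speed_perturb(samples, factor):
    """Resample by linear interpolation so the clip plays `factor` times faster.

    Plain resampling like this shifts pitch along with tempo.
    """
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                      # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    def power(x):
        return sum(v * v for v in x) / len(x)

    noise = [noise[i % len(noise)] for i in range(len(speech))]  # loop noise to length
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Toy signals: a 440 Hz tone at an assumed 16 kHz sampling rate, plus Gaussian noise.
rate = 16000
speech = [math.sin(2 * math.pi * 440 * t / rate) for t in range(1600)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(800)]

faster = speed_perturb(speech, 1.1)   # roughly 10% shorter
slower = speed_perturb(speech, 0.9)   # roughly 10% longer
noisy = mix_at_snr(speech, noise, snr_db=10.0)
print(len(speech), len(faster), len(slower), len(noisy))
```

Applying such transforms with randomly drawn factors and SNRs at training time yields several acoustic variants of each utterance without collecting new recordings.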

Applications

Everyday consumer tools

Speech recognition technology has become integral to virtual assistants, enabling seamless voice interactions for routine tasks in consumer settings. Apple's Siri, launched on October 4, 2011, with the iPhone 4S, supports voice commands for playing music, setting reminders, and managing calendars through natural language processing.^[62] Google's Assistant, introduced on May 18, 2016, at Google I/O, extends these capabilities to devices like smartphones and speakers, allowing users to request weather updates, control playback of podcasts, or schedule appointments via conversational queries.^[63] Amazon's Alexa, debuted on November 6, 2014, with the Echo speaker, similarly handles voice instructions for streaming audio content, creating shopping lists, and integrating with third-party services for reminders.^[42] In mobile devices and wearables, speech recognition facilitates real-time text input and multimedia accessibility features. Google's Gboard keyboard incorporates voice typing, which converts spoken words to text in messaging apps and documents, supporting over 60 languages for efficient dictation on the go.^[64] Android's Live Caption, available since October 2019 in Android 10, provides on-device subtitles for videos, podcasts, and audio calls without internet connectivity, enhancing comprehension during playback.^[65] Home devices leverage speech recognition for intuitive control of Internet of Things (IoT) ecosystems, transforming living spaces into responsive environments. 
Users can command Alexa-enabled systems to adjust lighting, such as dimming Philips Hue bulbs or turning on switches, through simple phrases like "Alexa, turn off the living room lights," integrating with thousands of compatible devices.^[66] In the 2020s, enhancements like Amazon's Alexa+ (announced February 26, 2025) have advanced conversational AI, enabling multi-turn dialogues for complex smart home routines, such as sequencing lights, thermostats, and security cameras in natural speech flows.^[67] These tools deliver significant consumer benefits, including hands-free operation and broader accessibility in daily applications. For instance, voice commands in navigation apps support safer driving by allowing route queries without manual input, with extensions to in-car systems like Android Auto for verbal directions.^[68] Otter.ai, a consumer transcription service, uses speech recognition to generate real-time notes from meetings or lectures, automatically identifying speakers and key action items for personal productivity.^[69]

Professional and enterprise uses

In professional and enterprise settings, speech recognition is widely adopted for enhancing productivity through automated documentation and interaction. In healthcare, tools like Nuance Dragon Medical enable clinicians to dictate patient notes directly into electronic health records, achieving high accuracy (up to 99% in optimal conditions) and reducing documentation time by up to 50% compared to traditional typing or manual transcription.^[70]^[71]^[72] This efficiency allows physicians to spend more time on patient care, with studies showing it can be 3-5 times faster than keyboard entry, while integrating specialized medical vocabularies to handle complex terminology accurately.^[73] Customer service operations leverage speech recognition in interactive voice response (IVR) systems to automate call routing and query handling. These systems use natural language processing alongside speech-to-text to interpret spoken requests, such as directing callers to billing or support departments, thereby reducing wait times and agent workload.^[74] In modern implementations, speech-to-text powers AI chatbots for voice-enabled support, converting customer speech into text for real-time responses across channels like phone and web, improving resolution rates in telecommunications and retail sectors.^[75]^[76] In legal and journalism fields, speech recognition facilitates real-time transcription and subtitling for proceedings and broadcasts. 
Enterprise platforms like Microsoft Azure Speech Services provide customizable speech-to-text models that generate instant transcripts during court proceedings or depositions, ensuring accurate records with support for legal jargon and multiple speakers.^[77] For journalism, it enables live captioning of news events and interviews, allowing reporters to produce voice-to-text reports efficiently without post-production delays.^[78]

Since the 2020s, AI-driven speech recognition has expanded to meeting summarization in enterprise collaboration tools. Platforms like Zoom integrate automatic transcription and AI analysis to generate concise summaries, action items, and speaker attributions from spoken discussions, incorporating domain-specific vocabularies for industries like finance and consulting.^[79] This capability streamlines post-meeting follow-ups, with tools processing audio in real time to highlight key insights and decisions.^[80]

Accessibility for disabilities

Speech recognition technology plays a pivotal role in empowering individuals with motor disabilities by enabling hands-free control of mobility aids and environmental devices. Voice-controlled wheelchairs, for instance, utilize speech recognition integrated with microcontrollers and sensors to interpret commands like "forward" or "stop," allowing users with severe physical limitations to navigate obstacles independently and safely. These systems often incorporate auditory feedback and emergency overrides to enhance reliability. Similarly, smart home platforms such as Google Home leverage built-in speech recognition to manage appliances, lighting, and security systems through simple voice instructions, promoting greater autonomy for those with limited manual dexterity. Recent regulations like the EU AI Act (in force since 2024, with obligations phasing in from 2025) emphasize transparency in AI-based accessibility tools.^[81]^[82]^[83]^[84]

For people with speech disorders like dysarthria and aphasia, customized speech recognition models address the challenges of atypical articulation by training on small, speaker-specific datasets of disordered speech. These personalized approaches, which employ statistical or deep learning techniques such as hidden Markov models or neural networks, significantly outperform general models, achieving word error rates as low as 10-20% for severe dysarthria in controlled settings. Such adaptations enable more accurate transcription for communication aids and therapy tools. Complementing this, applications like Ava provide real-time captioning to augment lip-reading during interactions, helping users with speech impairments follow and participate in conversations by displaying transcribed text from others' speech.^[85]^[86]^[87]^[88]

In supporting sensory impairments, speech recognition facilitates real-time captioning for those with hearing loss, converting ambient or conversational audio to on-screen text for immediate comprehension.
Google's Live Transcribe app, for example, uses on-device processing to deliver low-latency transcriptions in over 70 languages, making everyday dialogues accessible without external hardware. For visual impairments, hybrid systems integrate speech recognition for input—such as dictating notes or issuing commands—with text-to-speech output to provide audible responses, allowing users to navigate digital content or environments through voice interaction alone. These combined modalities, often powered by AI frameworks, achieve up to 92% accuracy in object and text processing tasks tailored for low-vision users.^[89]^[90]^[91] Between 2023 and 2025, inclusive AI developments in speech recognition have emphasized low-latency processing for real-time assistive feedback, with models like those in Google's Project Euphonia collecting over 1,000 hours of disordered speech data to train robust systems that reduce transcription delays to under 500 milliseconds in interactive scenarios.^[92] This enables seamless integration into therapy and daily tools, such as adaptive communication devices. Emerging integrations with prosthetics incorporate speech recognition for intuitive control, where AI interprets voice commands to adjust limb movements or mobility aids, drawing on machine learning to personalize responses for users with neuromotor challenges.^[93]^[94]^[95]

Specialized and emerging domains

In military applications, speech recognition enables pilots to maintain focus on flight operations by allowing hands-free control of aircraft systems. The F-35 Lightning II Joint Strike Fighter incorporates a speech recognition system as the first U.S. fighter aircraft to process spoken commands for managing subsystems like communications and displays, reducing manual interactions in high-stress environments.^[96] In helicopters, voice-activated systems facilitate hands-free communication and data entry, keeping pilots' hands on controls during missions, as explored in early NASA evaluations of voice technology for cockpit integration.^[97] For air traffic control (ATC) training, simulation tools like UFA's ATVoice® and Adacel's ICE use speech recognition to provide realistic phraseology practice, enabling controllers and pilots to rehearse interactions with high accuracy in immersive scenarios.^[98]^[99] In education, speech recognition supports interactive language learning and assessment by evaluating spoken responses in real time. Duolingo integrates AI-driven speech recognition in its mobile app to score pronunciation during speaking exercises, offering immediate feedback on how closely users match native-like articulation across multiple languages.^[100] For automated grading of oral exams, systems like Pearson's Versant employ advanced speech recognition to assess fluency, pronunciation, and content in non-native speech, providing objective scores that correlate with human evaluations and enabling scalable proficiency testing.^[101] Research on automatic speech recognition for oral proficiency further demonstrates its utility in scoring tasks like sentence repetition and read-alouds, with models achieving reliable results on standardized tests such as Linguaskill.^[102] Emerging domains leverage speech recognition for seamless, context-aware interactions in dynamic environments. 
Real-time translation via devices like Google Pixel Buds uses onboard speech recognition paired with machine learning to convert spoken languages during conversations, supporting over 40 languages through conversation or transcribe modes for immediate audio output.^[103] In augmented reality (AR) and virtual reality (VR) interfaces, speech recognition enhances user immersion by enabling natural voice commands for navigation and object manipulation, as seen in medical training simulations where it outperforms traditional controllers in task efficiency and presence.^[104] By 2025, AI advancements in speech emotion recognition (SER) integrate deep learning models, such as LSTM networks, to detect emotions like stress or joy from vocal features with accuracies exceeding 90% in controlled settings, supporting applications in mental health monitoring through platforms that analyze real-time audio.^[105]^[106] In telephony, advanced interactive voice response (IVR) systems incorporate speech recognition and natural language processing to handle open-ended dialogues, allowing users to speak freely rather than following rigid menus, which improves resolution rates in customer service calls.^[74] Conversational IVR, as in solutions from VoiceSpin, processes natural speech for tasks like account inquiries, reducing hold times by up to 50% compared to traditional touch-tone systems.^[107] Similarly, in gaming, speech recognition powers dynamic interactions with non-player characters (NPCs), enabling players to engage in free-form dialogue that influences narratives, as demonstrated in prototypes using natural language understanding to generate contextually relevant responses.^[108] This approach fosters deeper immersion, with systems like those employing real-time transcription and AI response generation handling varied player inputs in single-player environments.^[109]

Performance and Challenges

Evaluation metrics and benchmarks

The primary metric for evaluating the accuracy of automatic speech recognition (ASR) systems is the Word Error Rate (WER), which measures the percentage of errors in the transcribed output compared to a ground-truth reference transcript. WER is calculated using the Levenshtein distance algorithm to align the hypothesis and reference texts, accounting for substitutions (S), deletions (D), and insertions (I) relative to the total number of words (N) in the reference:

\text{WER} = \frac{S + D + I}{N} \times 100\%

A WER of 0% indicates perfect transcription, and lower values reflect better performance; because insertions are counted, WER can exceed 100%. Human transcription of clean read speech achieves near 0% WER and serves as the ground truth for benchmarks like LibriSpeech. This metric is widely adopted because it captures the practical impact of recognition errors on downstream tasks like information retrieval or machine translation.^[110]^[111]

For languages without explicit word boundaries, such as Chinese or Japanese, the Character Error Rate (CER) serves as a more appropriate alternative to WER, evaluating errors at the character level using a similar edit-distance approach. CER is computed analogously as the ratio of character substitutions, deletions, and insertions to the total number of characters in the reference, providing finer-grained assessment of spelling and segmentation accuracy in non-whitespace-separated scripts. It is particularly valuable in multilingual ASR evaluations where word-level tokenization is unreliable.^[112]^[113]

Beyond accuracy, the Real-Time Factor (RTF) assesses the computational efficiency of ASR systems, defined as the ratio of the processing time required for transcription to the duration of the input audio. An RTF below 1 indicates faster-than-real-time performance, essential for interactive applications like live captioning or voice assistants, where delays can degrade user experience. RTF evaluations often consider hardware constraints, with values below 0.5 desirable for mobile or edge devices.

Standard benchmarks for ASR rely on well-curated datasets to ensure reproducible comparisons. LibriSpeech, comprising approximately 1,000 hours of English read speech from audiobooks, is a cornerstone for evaluating clean and more challenging recording conditions, with its "test-clean" and "test-other" subsets used to report WER across models.
Switchboard, a corpus of about 300 hours of conversational telephone speech, tests performance on spontaneous, accented, and overlapping dialogue, simulating real-world telephony scenarios. These datasets have become de facto standards since the 2010s, enabling consistent tracking of ASR progress.^[114]^[115] Leaderboards provide ongoing comparisons of state-of-the-art models, such as the Open ASR Leaderboard hosted on Hugging Face, which benchmarks systems like OpenAI's Whisper on metrics including WER across English and multilingual tasks. For instance, Whisper large-v3 achieves ~2% WER on LibriSpeech test-clean as of 2025, highlighting advancements in zero-shot multilingual recognition. To evaluate subjective qualities like the naturalness of transcribed or synthesized speech in ASR pipelines, the Mean Opinion Score (MOS) is employed, where human raters score outputs on a 1-5 scale for fluency and intelligibility, often complementing objective metrics in hybrid systems.^[45]^[116] In the 2020s, multilingual benchmarks like Mozilla's Common Voice dataset—crowdsourced with over 33,000 hours of speech across 130+ languages as of 2025—have emerged as key standards for assessing inclusivity and low-resource performance, reporting CER and WER stratified by speaker demographics. Robustness tests, such as those in the Speech Robust Bench (SRB), evaluate ASR under corruptions like additive noise (e.g., babble or factory sounds) and accents (e.g., non-native English variants), using augmented versions of LibriSpeech to quantify degradation; for example, top models like Whisper large achieve around 40% WER in moderate noise but 11-14% WER in accented speech conditions. These frameworks emphasize equitable evaluation, prioritizing diverse real-world conditions over controlled settings.^[117]^[118]
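The WER, CER, and RTF metrics defined in this section can be computed with a short Python sketch. The function names and example sentences are illustrative only. RTF here follows the common convention of processing time divided by audio duration, and the sketch assumes a non-empty reference and applies no text normalization, which real evaluations usually perform first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion of a reference token
                            curr[j - 1] + 1,          # insertion of a hypothesis token
                            prev[j - 1] + (r != h)))  # substitution, or match at no cost
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent: (S + D + I) / N over whitespace tokens."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, for scripts without word boundaries."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


def rtf(processing_seconds, audio_seconds):
    """Real-time factor: processing time divided by audio duration."""
    return processing_seconds / audio_seconds


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
print(cer("你好世界", "你好世节"))  # 25.0
print(rtf(2.5, 10.0))              # 0.25, i.e. four times faster than real time
print(wer("a", "a b c"))           # 200.0, insertions can push WER above 100%
```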

Accuracy limitations and improvements

Speech recognition systems face several inherent limitations that degrade their accuracy across diverse real-world scenarios. Variability in accents and dialects poses a significant challenge, as models trained predominantly on standard varieties, such as American English, exhibit higher word error rates (WER) when encountering regional or non-native pronunciations due to differences in phonetic realization and prosody. Background noise, including environmental sounds like traffic or crowds, further distorts the audio signal, reducing the signal-to-noise ratio and leading to misinterpretations of speech features. Homophones, words with the same or similar pronunciations but different meanings (e.g., "there," "their," and "they're"), cannot be told apart from the audio alone and are often misrecognized without sufficient contextual cues. Out-of-vocabulary (OOV) words, terms not present in the training lexicon, result in substitutions or deletions, especially in rapidly evolving domains like technology or slang. Domain mismatch, where the acoustic or linguistic characteristics of test data differ from training data (e.g., medical versus conversational speech), causes performance drops of up to 20-30% in WER due to inadequate generalization.

Improvements in accuracy have been driven by scaling training data to massive volumes, with modern models like Google's Universal Speech Model (USM) leveraging 12 million hours of multilingual speech to enhance robustness and reduce WER by capturing broader acoustic patterns. Self-supervised learning methods, such as wav2vec 2.0, pretrain representations on unlabeled audio via contrastive tasks, enabling fine-tuning with minimal labeled data. They achieve WERs as low as 1.8% on clean benchmarks like LibriSpeech and remain competitive in low-resource scenarios with up to 100 times less labeled data.
Ensemble methods further boost performance by combining outputs from hybrid (e.g., DNN-HMM) and end-to-end models, such as integrating Kaldi with wav2vec 2.0 via voting mechanisms, yielding 14-20% relative WER reductions by compensating for complementary error types.

Over time, these advancements have markedly lowered error rates: in the 2000s, state-of-the-art systems on conversational English benchmarks like Switchboard achieved around 30% WER, whereas by 2025, leading models attain under 5% WER on similar conditions through deep learning and data scaling. However, challenges persist in low-resource languages, where WER often ranges from 20% to 50% due to limited training data and linguistic diversity.

Looking ahead, continual learning techniques promise further gains by enabling models to adapt incrementally to individual user speech patterns, such as evolving accents or health-related changes, without catastrophic forgetting of prior knowledge, as demonstrated in frameworks fusing multi-layer features for dynamic task adaptation.
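Because this section quotes both absolute WER figures and relative WER reductions, a small worked example may help keep the two apart. The numbers below are hypothetical and are not results from the cited systems.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent; both inputs are WER percentages."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


baseline, combined = 10.0, 8.0  # hypothetical WER values, for illustration only
print(baseline - combined)                         # 2.0 absolute percentage points
print(relative_wer_reduction(baseline, combined))  # 20.0 percent relative
```

A 20% relative reduction therefore means removing one fifth of the baseline's errors, which is far smaller than lowering the WER by 20 percentage points.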

Security, privacy, and ethical issues

Speech recognition systems are vulnerable to security threats, including adversarial attacks that introduce subtle audio perturbations to mislead models. These attacks can cause automatic speech recognition (ASR) systems to misinterpret commands, such as altering "turn off the lights" to "turn on the lights," by adding imperceptible noise that exploits the vulnerabilities in deep neural networks.^[119] Similarly, spoofing attacks leverage voice synthesis to impersonate users, enabling unauthorized access to biometric authentication systems; for instance, synthetic speech generated from short audio samples can fool speaker verification with high success rates in real-world scenarios.^[120] Privacy concerns arise prominently in always-listening devices like Amazon's Alexa, where voice data is continuously captured and stored in the cloud for processing, raising risks of unauthorized access or data breaches that expose sensitive personal information.^[121] Voice biometrics, used for authentication, further amplify these issues by treating audio as personally identifiable information, potentially revealing traits like accent or health conditions without explicit user awareness.^[122] Compliance with regulations such as the General Data Protection Regulation (GDPR) mandates that organizations processing audio logs obtain informed consent, minimize data retention, and implement encryption to protect against misuse, though many systems struggle with full adherence due to the volume of incidental recordings.^[123] Ethical challenges in speech recognition include biases that disproportionately affect certain demographics, such as higher word error rates (WER) for non-native speakers and women compared to native English male speakers, leading to exclusion in applications like virtual assistants.^[124] Additionally, the deployment of speech recognition in surveillance applications, such as public monitoring or workplace tracking, often lacks robust consent mechanisms, 
infringing on individual autonomy and enabling discriminatory profiling based on linguistic patterns.^[125] From 2023 to 2025, regulatory efforts like the EU AI Act have classified certain speech recognition uses, particularly real-time biometric identification in public spaces, as high-risk or prohibited practices, requiring transparency, risk assessments, and human oversight to mitigate harms.^[126]

Mitigation strategies include differential privacy techniques, which add calibrated noise during training (typically to clipped gradients) to prevent inference of individual voices while preserving overall model accuracy in ASR tasks.^[127] For deepfake voice threats, detection methods employing spectrogram analysis and convolutional neural networks have emerged to identify synthetic audio, achieving over 90% accuracy in distinguishing fakes from genuine speech in controlled evaluations.^[128]
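One common way to realize the differential privacy idea mentioned above is the DP-SGD style step sketched below: each example's gradient is clipped to a fixed norm, and Gaussian noise scaled to that norm is added before averaging. This is an assumption about which variant to illustrate, not the method of the cited work. The function names and parameter values are hypothetical, and the sketch omits the privacy accounting a real system would need.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    n = len(clipped)
    dim = len(clipped[0])
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    return [(sum(g[j] for g in clipped) + rng.gauss(0.0, sigma)) / n
            for j in range(dim)]


# Two toy per-example gradients; the first exceeds the clipping norm of 1.0.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single speaker's utterance can move the model, and the added noise masks what remains of that contribution.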

References

[1]
[PDF] Introduction to Automatic Speech Recognition
Speech Recognition: Where Are We Now? • High performance, speaker-independent speech recognition is now possible. – Large vocabulary (for cooperative speakers ...
[2]
[PDF] Speech Recognition in Machines - ece.ucsb.edu
Over the past several decades, a need has arisen to enable humans to communicate with machines in order to control their actions or to obtain information.
[3]
https://ieeexplore.ieee.org/document/8639627
[4]
https://arxiv.org/pdf/2403.01255.pdf
[5]
Automatic Speech Recognition - an overview | ScienceDirect Topics
Automatic speech recognition is a high-tech that makes machine turn the speech signal to the corresponding text or command after recognizing and understanding.
[6]
https://www.sciencedirect.com/topics/engineering/automatic-speech-recognition
[7]
https://www.sciencedirect.com/science/article/pii/B0122272404001647
[8]
8.3. Speech Recognition - Introduction to Speech Processing
Speaker Dependence: Speaker dependent speech recognition system requires the user to be involved in its development whereas speaker independent systems do not.
[9]
https://speechprocessingbook.aalto.fi/Recognition/Speech_Recognition.html
[10]
[PDF] End-to-End Speech Recognition: A Survey - arXiv
Mar 3, 2023 · All relevant aspects of E2E ASR are covered in this work: modeling, training, decoding, and external language model integration, accompanied by ...
[11]
A Comprehensive Review of Face, Speech, and Text Modalities - arXiv
Feb 1, 2025 · The primary goals of preprocessing in speech systems are noise reduction, normalisation, segmentation, and feature extraction from raw audio ...
[12]
https://arxiv.org/html/2502.06803v1
[13]
Bob: A lexicon and pronunciation dictionary generator - IEEE Xplore
This paper presents Bob, a tool for managing lexicons and generating pronunciation dictionaries for automatic speech recognition systems.
[14]
Deep Speech: Scaling up end-to-end speech recognition - arXiv
Dec 17, 2014 · We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional ...
[15]
NVIDIA Accelerates Real Time Speech to Text Transcription 3500x ...
Mar 18, 2019 · This means 24 hours worth of human speech can be transcribed in 25 seconds. We tested a variety of GPUs – from the 30W Jetson AGX Xavier to the ...
[16]
The Origins of Sound Recording - Thomas Edison National ...
Sound recording was invented twice: first by Edouard-Léon Scott in 1857, then by Thomas Edison in 1877. Scott's phonautograph graphed sound waves.
[17]
Early Sound Recording Collection and Sound Recovery Project
In 1877, Thomas Edison invented the phonograph, the first machine that could record sound and play it back. On the first audio recording Edison recited, “Mary ...
[18]
Inventing Sound Recording - Thomas A. Edison Papers
Edison initially used a diaphragm on paraffin paper, then a tinfoil cylinder on a cylinder, and named it the "Phonograph" for recording speech.
[19]
[PDF] a short history of acoustic phonetics in the us - Haskins Laboratories
1970. Formant concentration positions in the speech of children at two levels of linguistic development. Journal of the. Acoustical Society of America, 48, 1404 ...
[20]
[PDF] from visible speech to voiceprints – the missing link - ISCA Archive
R. K. Potter invented the sound spectrograph” [7, p. 1]. The job was finished by the end of 1941 [7, p. 7]. More precisely, “[e]arly in 1941, a rough ...
[21]
[PDF] Harvey Fletcher's role in the creation of communication acousticsa)
He helped develop the vacuum tube hearing aid, the commercial audiometer, the artificial larynx, and stereophonic sound. His first book Speech and Hearing, ...
[22]
[PDF] Speaker Identification by Speech Spectrograms
Four spectrograms of the spoken word "science." The vertical scale represents frequency, the horizontal dimension is time, and darkness represents intensity.
[23]
Automatic Recognition of Spoken Digits - AIP Publishing
The recognizer discussed will automatically recognize telephone‐quality digits spoken at normal speech rates by a single individual, with an accuracy ...
[24]
The machines that learned to listen - BBC
Feb 15, 2017 · But all those baby steps kept machines passive – until “Audrey”, the Automatic Digit Recognition machine, came along in 1952. Made by Bell Labs ...
[25]
Audrey, Alexa, Hal, and More - CHM - Computer History Museum
Jun 9, 2021 · We start our story in 1952 at Bell Laboratories. It's a modest start: The machine, known as AUDREY—the Automatic Digit Recognizer—can ...
[26]
[PDF] Automatic Speech Recognition – A Brief History of the Technology ...
Oct 8, 2004 · In this article, we review some major highlights in the research and development of automatic speech recognition during the last few decades so ...
[27]
Speech recognition - IBM
The world's first speech-recognition system, capable of understanding the numbers zero through nine and six command words, was the size of a shoebox.
[28]
Status on Speech Recognition in Japan - ResearchGate
Aug 9, 2025 · This paper provides the review of developments in speech recognition in Japan. Attention is paid to research activities in 1980's which ...
[29]
9 Development in Artificial Intelligence | Funding a Revolution
... Dragon brokered a deal whereby Seagate Technologies bought 25 percent of Dragon's stock. By July 1997, Dragon had launched Dragon Naturally Speaking, a ...
[30]
Comparative Evaluation of Three Continuous Speech Recognition ...
The following continuous speech recognition packages were evaluated in this study: IBM ViaVoice 98 with IBM General Medicine vocabulary (IBM, Armonk, New ...
[31]
Google Search by Voice: A Case Study
Sep 9, 2010 · Our first foray in search by voice was doing local searches with GOOG-411. Then, in November 2008, we launched Google Search by Voice. Now ...
[32]
Defense Department funds massive speech recognition and ...
Nov 9, 2006 · The program, called the Global Autonomous Language Exploitation (GALE), attempts to address the lack of qualified linguists and analysts who ...
[33]
A Historical Perspective of Speech Recognition
Jan 1, 2014 · Early methods of speech recognition aimed to find the closest matching sound label from a discrete set of labels. In non-probabilistic ...
[34]
A historical perspective of speech recognition
Jan 2, 2014 · historical progress of speech recognition word error rate on more and more difficult tasks.10 The latest system for the switchboard task is ...
[35]
THE MARKETS: Market Place; Nuance, despite falling shares ...
Nov 29, 2000 · Automated call centers are only the most obvious way speech recognition will be used. The software is now becoming sophisticated enough to ...
[36]
[PDF] Deep Neural Networks for Acoustic Modeling in Speech Recognition
Apr 27, 2012 · The previous section reviewed experiments in which GMMs were replaced by DBN-DNN acoustic models to give hybrid DNN-HMM systems in which the ...
[37]
[PDF] ACHIEVEMENTS AND CHALLENGES OF DEEP LEARNING
Better optimization criteria and methods are another area where significant advances have been made over the past several years in applying DNNs to ASR. In 2010 ...
[38]
[PDF] Deep Speech: Scaling up end-to-end speech recognition - arXiv
Dec 19, 2014 · In this paper, we describe an end-to-end speech system, called “Deep Speech”, where deep learning supersedes these processing stages. Combined ...
[39]
Apple's Next Big Thing Already Here: Siri More Than Speech ...
Oct 7, 2011 · Siri is unique because it meshes voice recognition capabilities with both sophisticated artificial intelligence capabilities and tight ...
[40]
The Secret Origins of Amazon's Alexa - WIRED
May 11, 2021 · Amazon was anything but embarrassed. By 2014 it had increased its store of speech data by a factor of 10,000 and largely closed the data gap ...
[41]
Alexa at five: Looking back, looking forward - Amazon Science
A few months back, we announced that we'd trained a speech recognition system on a million hours of unlabeled speech using the teacher-student paradigm of deep ...
[42]
Robust Speech Recognition via Large-Scale Weak Supervision - arXiv
Dec 6, 2022 · We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.
[43]
Contextualization of ASR with LLM Using Phonetic Retrieval-Based ...
In this work, we start with a speech recognition task and propose a retrieval-based solution to contextualize the LLM.
[44]
Open ASR Leaderboard - a Hugging Face Space by hf-audio
This application displays benchmark results for speech recognition models across various datasets and languages. Users can view leaderboards, multilingual
[45]
[PDF] Toward Zero Oracle Word Error Rate on the Switchboard Benchmark
In this more detailed and reproducible scheme, even commercial ASR systems can score below 5% WER and the established record for a research system is lowered ...
[46]
Improving Voice Recognition for People with Speech Disabilities
Sep 27, 2024 · A new study shows that automatic speech recognition (ASR) systems trained on speech from people with Parkinson's disease are 30% more accurate.
[47]
Recent Advances in Speech Language Models: A Survey - arXiv
Feb 6, 2025 · Speech tokenizer is the first component in SpeechLMs, which encodes continuous audio signals (waveforms) into tokens. Speech tokenizer aims to ...
[48]
AudioLM: A Language Modeling Approach to Audio Generation
Jun 21, 2023 · We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens.
[49]
mozilla/DeepSpeech - GitHub
Jun 19, 2025 · DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high ...
[50]
Speech and Voice Recognition Industry worth $23.11 billion by 2030
Aug 26, 2025 · Speech and Voice Recognition Market value is projected to be USD 23.11 billion by 2030, growing from USD 9.66 billion in 2025, at a Compound ...
[51]
Dynamic programming algorithm optimization for spoken word ...
This paper reports on an optimum dynamic programming (DP) based time-normalization algorithm for spoken word recognition.
[52]
[PDF] Deep Neural Networks for Acoustic Modeling in Speech Recognition
Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech ...
[53]
[PDF] Connectionist Temporal Classification: Labelling Unsegmented ...
Connectionist Temporal Classification (CTC) uses RNNs to label unsegmented sequences by interpreting outputs as a probability distribution over label sequences ...
[54]
[PDF] Speech Recognition with Deep Recurrent Neural Networks
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks ...
[55]
Speech Recognition with Deep Recurrent Neural Networks - arXiv
Mar 22, 2013 · Speech Recognition with Deep Recurrent Neural Networks. Authors: Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton.
[56]
Convolution-augmented Transformer for Speech Recognition - arXiv
May 16, 2020 · In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and ...
[57]
Scaling Speech Technology to 1000+ Languages - arXiv
May 22, 2023 · The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
[58]
Apple Launches iPhone 4S, iOS 5 & iCloud
Oct 4, 2011 · Apple today announced iPhone 4S, the most amazing iPhone yet, packed with incredible new features including Apple's dual-core A5 chip for blazing fast ...
[59]
I/O: Building the next evolution of Google - The Keyword
May 18, 2016 · It's designed to fit your home with customizable bases in different colors and materials. Google Home will be released later this year.
[60]
Use advanced voice typing features - Gboard Help
To activate advanced voice typing features, open any app that you can type with and tap on the Keyboard mic Microphone. Say a command.
[61]
If it has audio, now it can have captions - The Keyword
Oct 16, 2019 · Live Caption automatically captions videos and spoken audio on your device (except phone and video calls). It happens in real time and completely on-device.
[62]
Alexa Smart Home - Learn about Home Automation | Amazon.com
From lights and plugs to thermostats and cameras, Alexa can help make your home smarter and more automated by simplifying your everyday routines.
[63]
Introducing Alexa+, the next generation of Alexa - About Amazon
Feb 26, 2025 · With these experts, Alexa+ can control your smart home with products from Philips Hue, Roborock, and more; make reservations or appointments ...
[64]
Drive with Android Auto. The best of Android, on your in-car display.
Android Auto lets you use your voice to do more in your car. No need to download anything, simply connect your phone and go. Explore Android Auto.
[65]
Convert Speech to Text: Free, Instant, and Accurate - Otter.ai
Otter is powered by advanced AI speech-to-text software that delivers highly accurate speech recognition, even in noisy environments or with multiple speakers.
[66]
8 ways AI medical transcription is transforming global healthcare in ...
Jan 13, 2025 · AI medical transcription reduces admin time, enhances patient care, reduces documentation time by up to 50%, and allows doctors to focus on ...
[67]
Dragon Medical One Speech Recognition - Philips dictation
Accent adjustments and microphone calibration are automatic, providing even greater accuracy up to 99%, and an optimal clinician experience from the start.
[68]
Nuance Voice Recognition - Dictation for Physicians - ModMed
EMA® EHR With Dragon Medical Speechkit by Nuance · Amazing Speech-to-Text Functionality in the. Palm of Your Hand · 3-5 times faster than typing · Get Up and ...
[69]
What Is IVR? - Interactive Voice Response Explained - Amazon AWS
Advanced IVR systems use speech recognition and natural language processing to understand user requests. For example, the system prompt could ask, “What can I ...
[70]
How Speech Recognition Improves Customer Service in ...
May 2, 2023 · With speech-to-text enabled AI applications, companies can accurately identify customer needs and promptly address them.
[71]
Enhancing Customer Interactions with Speech Recognition 1
Oct 10, 2024 · AI-powered virtual assistants and chatbots that use speech recognition can answer questions, place orders, or help out with other tasks anytime.
[72]
Speech to text overview - Azure AI services - Microsoft Learn
Azure AI Speech service offers advanced speech to text capabilities. This feature supports both real-time and batch transcription.
[73]
How can the news media industry use speech recognition ...
Aug 6, 2025 · 2. Real-Time Subtitling & Captioning ... Speech recognition enables real-time captioning for live broadcasts, news streams, or video content, ...
[74]
Enabling or disabling meeting summary with AI Companion
Account owners/admins can enable/disable the AI meeting summary in the Zoom web portal under Account Settings, AI Companion tab, under Meeting.
[75]
Don't Forget: Zoom's AI Companion Can Enhance Meetings with AI ...
Sep 15, 2025 · Keep track of key takeaways from Zoom-based class sessions and meetings with Meeting Summary, which captures essential points from discussions ...
[76]
Development of Voice Controlled Wheelchair for Persons with ...
The voice-controlled wheelchair uses speech recognition, a microphone, Arduino, ultrasonic sensors for obstacle detection, and an emergency stop button.
[77]
[PDF] Voice Controlled Wheelchair for Physically Disabled People and ...
Jan 28, 2025 · The wheelchair uses voice commands, speech recognition, obstacle detection, and auditory feedback for navigation, and has GPS. It also has ...
[78]
Accessibility features on Google Nest or Home devices
Google Nest or Home speakers or displays, and the Google Home app include features that can be helpful for users with accessibility needs.
[79]
Google Home: smart speaker as environmental control unit - PubMed
Such system can be utilized by clients with physical and/or functional disability to enhance their ability to control their environment, to promote independence ...
[80]
Personalized Automatic Speech Recognition Trained on Small ...
In contrast, personalized models trained using samples from the end-user speaker, can be highly accurate -even for severe dysarthria [2,13,14] under some ...
[81]
Assessment of Dysarthria Using One-Word Speech Recognition with ...
We developed an automatic speech recognition based software to assess dysarthria severity using hidden Markov models (HMMs).
[82]
A Comparative Investigation of Automatic Speech Recognition ...
This paper evaluated and compared custom machine learning (ML) speech recognition algorithms against off-the-shelf platforms using healthy and aphasic speech ...
[83]
Professional & AI-Based Captions for Deaf & HoH | Ava
Empowering Deaf & hard-of-hearing people and inclusive organizations with the best live captioning solution for any situation.
[84]
Use Live Transcribe - Android Accessibility Help
You can use Live Transcribe on your Android device to capture speech and sound and see them as text on your screen. Download and turn on Live Transcribe ...
[85]
Live Transcribe & Notification - Apps on Google Play
Live Transcribe & Sound Notifications makes everyday conversations and surrounding sounds more accessible among people who are deaf and hard of hearing.
[86]
A Hybrid Artificial Intelligence System for the Visually Impaired
The hybrid AI system enhances independence for the visually impaired with object recognition, text-to-speech, and speech-to-text, achieving 92% accuracy.
[87]
Project Euphonia: advancing inclusive speech recognition through ...
Jun 19, 2025 · Project Euphonia, a Google Research initiative, is tackling this challenge by building the world's largest dataset of disordered speech.
[88]
The Interspeech 2025 Speech Accessibility Project Challenge - arXiv
Jul 29, 2025 · Automatic Speech Recognition (ASR) has witnessed remarkable advancements in recent years, primarily driven by the development of deep neural ...
[89]
[PDF] Artificial Intelligence in Prosthetics and Orthotics - IJFMR
Through advanced machine learning, neural networks, and pattern recognition algorithms, AI enables prosthetic and orthotic systems to interpret bio signals, ...
[90]
Researchers fine-tune F-35 pilot-aircraft speech system
Oct 11, 2007 · The F-35 will be the first US fighter aircraft with a speech recognition system able to "hear" a pilot's spoken commands to manage various aircraft subsystems.
[91]
[PDF] the role of voice technology in advanced helicopter cockpits
Abstract. This paper describes the status of voice output and voice recognition technology in relation to helicopter cockpit applications.
[92]
Speech Recognition - UFA Inc | ATC Simulation Systems
Explore UFA's advanced speech recognition technology, ATVoice®, offering unmatched accuracy for ATC training and real-time voice control in simulation systems.
[93]
Intelligent Communications Environment ICE - Adacel
ICE is an aviation phraseology training tool for air traffic controllers and pilots. This easy-to-use application features an accent-tolerant speech ...
[94]
Covering all the bases: Duolingo's approach to speaking skills
Oct 29, 2020 · Speaking exercises use AI voice recognition (neat!) to grade how close your pronunciation is to the goal, so you get real-time feedback about ...
[95]
Automated scoring for speaking tests - Pearson Support
This article explains the automated scoring process for speaking tests in Versant by Pearson. It details the use of advanced speech recognition technology ...
[96]
Developing an Automatic Pronunciation Scorer: Aligning Speech ...
Jul 14, 2025 · Finally, CASE is the automatic speech scorer used to score the Linguaskill General Speaking test (Linguaskill) by Cambridge Assessment English.
[97]
Translate with Google Pixel Buds
Google Pixel Buds help you translate easily with your Pixel or Android 6.0+ phone. Use Conversation Mode to talk directly or Transcribe Mode to follow along ...
[98]
A Study of NLP-Based Speech Interfaces in Medical Virtual Reality
Our research explored the potential of intelligent speech interfaces to enhance user interaction while conducting complex medical tasks.
[99]
https://www.adacel.com/intelligent-communications-environment-ice
[100]
Speech Emotion Recognition in Mental Health: Systematic Review ...
Sep 30, 2025 · Background: The field of speech emotion recognition (SER) encompasses a wide variety of approaches, with artificial intelligence ...
[101]
Conversational IVR vs Traditional IVR vs AI Voice Bots - VoiceSpin
Aug 4, 2025 · Conversational IVR is an advanced form of traditional IVR that uses speech recognition and natural language processing to let callers interact ...
[102]
Let's Talk Games: An Expert Exploration of Speech Interaction with ...
This work investigates the potential and challenges of using speech interaction in single-player video games, particularly for interactions with NPCs.
[103]
Real-time NPC Interaction and Dialogue Systems in Video Games
Nov 14, 2024 · Speech Recognition: Converts player speech into text. Natural Language Understanding (NLU): Interprets the meaning of the text. Natural ...
[104]
[PDF] Master of Science in Computer Science Thesis May 2023
May 1, 2023 · The WER can be calculated by counting the number of words that need to be substituted (S), deleted (D), and inserted (I) to go from a ground- ...
[105]
Decoding disparities: evaluating automatic speech recognition ... - NIH
Dec 10, 2024 · Calculating ASR errors using WER: Word error rate is a standard metric for assessing ASR system accuracy by comparing ASR-generated text with ...
[106]
Advocating Character Error Rate for Multilingual ASR Evaluation
Oct 9, 2024 · Our work documents the limitations of WER as an evaluation metric and advocates for the character error rate (CER) as the primary metric in multilingual ASR ...
[107]
Metrics for ASR Performance: WER and CER - ApX Machine Learning
Character Error Rate (CER) ... While WER is the default metric, it is less suitable for languages that are not whitespace-segmented, such as Mandarin or Japanese.
[108]
LibriSpeech ASR corpus - openslr.org
LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey.
[109]
[PDF] How Might We Create Better Benchmarks for Speech Recognition?
Aug 6, 2021 · These benchmark sets cover a range of speech use cases, including read speech (e.g. Librispeech), and spontaneous speech (e.g. Switchboard).
[110]
openai/whisper-large-v3 - Hugging Face
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large- ...
[111]
Mozilla Common Voice datasets
Common Voice. New Common Voice datasets are now available to download exclusively through our sister platform, Mozilla Data Collective.
[112]
[2202.10594] Adversarial Attacks on Speech Recognition Systems ...
Feb 22, 2022 · This paper reviews speech recognition techniques, investigates adversarial attacks and defenses, and outlines research challenges for mission- ...
[113]
[PDF] Deep Learning-based Speech Synthesis Attacks in the Real World
Sep 16, 2021 · However, our work focuses on the “shadow side” of these uses – generating synthetic speech with malintent to deceive both humans and machines.
[114]
Listening In: Privacy Concerns of Voice Assistants
Aug 5, 2024 · The FTC argued Amazon deceived users and kept years of data obtained by their Alexa voice assistant despite deletion requests.
[115]
GDPR, CCPA and Voice Recognition Privacy - Picovoice
Nov 2, 2022 · GDPR considers voice as Personally Identifiable Information (PII) as voice recordings provide information on gender, ethnic origin or potential diseases.
[116]
How Does GDPR Compliance Apply to Speech Datasets?
Oct 31, 2025 · This article explores how GDPR applies to speech datasets, and the compliance procedures required to ensure responsible and lawful handling ...
[117]
Voice Recognition Still Has Significant Race and Gender Biases
May 10, 2019 · Voice Recognition Still Has Significant Race and Gender Biases ... In 2017, Google announced that their speech recognition had a 95% accuracy rate ...
[118]
Briefing note on the ethical issues arising from the public sector use ...
Sep 9, 2025 · Obtaining clear and informed consent from all users is fundamental to the ethical use of voice recognition technology, as well as ensuring ...
[119]
Biometrics under the EU AI Act - IAPP
Oct 18, 2023 · Finally, the Council of the EU defines "general purpose AI," which covers image and speech recognition systems that could constitute biometric ...
[120]
https://people.cs.uchicago.edu/~ravenben/publications/pdf/voiceml-ccs21.pdf
[121]
Deepfake Voice Detection Using Convolutional Neural Networks
This paper proposes a CNN-based approach using spectrogram analysis to detect deepfake audio, trained on real and deepfake voice records.