
Speech processing

Speech processing is the computational analysis and manipulation of speech signals, encompassing the study of these signals and the methods used to process them digitally for various applications. It involves extracting features from audio signals, such as spectral characteristics and temporal patterns, to enable tasks like understanding, generating, or enhancing speech. Key components of speech processing include speech analysis, which examines signal properties like pitch and formants; coding for efficient storage or transmission; synthesis to produce artificial speech; recognition to convert spoken words into text; speaker recognition for verification and identification; and enhancement to reduce noise. Common techniques range from traditional signal processing methods, such as mel-frequency cepstral coefficients (MFCC) for feature extraction, to statistical models like hidden Markov models (HMM) and modern deep learning approaches like deep neural networks (DNN) and convolutional neural networks (CNN). These methods have evolved with advances in computational power, following principles like Moore's Law, and the availability of large datasets, such as the Switchboard corpus, enabling more accurate models. Recent advances as of 2025 include end-to-end Speech Language Models (SpeechLMs) and transformer-based systems for more natural, context-aware processing. Applications of speech processing span telecommunications for voice compression, consumer devices like virtual assistants, healthcare for assistive technologies, and security systems for biometric authentication. Research in the field dates back to the mid-20th century, with significant progress in the 1970s through initiatives like the DARPA speech understanding project, leading to the integration of deep learning for real-time, context-aware processing.

Fundamentals

Acoustic Properties of Speech

Speech is an acoustic signal produced by the human vocal tract, originating from airflow modulated by the vibration of the vocal folds in the larynx and shaped by the resonances of the supralaryngeal vocal tract. This signal serves as the foundational medium for speech processing tasks, encompassing both periodic vibrations for voiced sounds and aperiodic noise for unvoiced elements. Key properties include a typical fundamental frequency range of 80–450 Hz, with higher harmonics and formants extending up to approximately 8 kHz to capture the full spectral content relevant to human audition.

The waveform of speech exhibits variations in intensity, frequency, pitch, and periodicity that reflect articulatory dynamics. Intensity corresponds to the amplitude or energy of the signal, with higher values during vowel production and transient bursts in stop consonants, distinguishing speech segments from silence based on energy thresholds. Frequency content is characterized by the fundamental frequency (F0), determined by the rate of vocal fold oscillation, while pitch represents the perceptual correlate of F0, influenced by both the source and vocal tract filtering. Periodicity varies across phonemic units, with glottal cycles consisting of an open phase and a closed phase, where the period length equals the sum of these phases, typically 4–8 ms for adult speakers.

Spectral content of speech is analyzed through representations like the short-time Fourier transform (STFT), which yields spectrograms displaying time-varying frequency and energy distributions. In spectrograms, dark bands indicate energy concentrations at harmonics and formants, providing a visual map of the signal's spectral components over time; for instance, horizontal striations reveal F0 periodicity in voiced segments. This analysis highlights the quasi-periodic nature of voiced speech and the broadband noise in unvoiced portions.

The source-filter model, pioneered by Gunnar Fant, describes speech production as the convolution of a source signal (typically glottal airflow pulses from vocal fold vibration) with a linear time-invariant filter representing the vocal tract's transfer function. The source provides the basic excitation, often modeled as a half-wave rectified pulse train rich in harmonics, while the filter selectively amplifies certain frequencies to form the spectral envelope. This model underpins acoustic phonetics by separating excitation from resonance effects. Formants are the resonant frequencies of the vocal tract, with the nth formant frequency approximated by F_n \approx \frac{n \cdot c}{4L}, where c is the speed of sound (approximately 343 m/s), L is the vocal tract length (around 17.5 cm for adult males), and n is an odd integer in the quarter-wave approximation of a uniform tube model. For example, the first formant (F1) typically falls near 500–800 Hz, varying with vowel height. The signal-to-noise ratio (SNR), a measure of speech quality, is given by \text{SNR} = 10 \log_{10} \left( \frac{P_s}{P_n} \right) in decibels, where P_s is the signal power and P_n is the noise power; values above 20 dB are generally required for clear intelligibility in noisy environments.

Examples of acoustic variability include voiced sounds, such as vowels, which exhibit periodic waveforms with clear harmonic structure and well-defined formants, versus unvoiced sounds like fricatives, characterized by random, noise-like spectra lacking periodicity. Diphones, short segments spanning the transition between two adjacent phonemes, capture coarticulation effects in which anticipatory or carryover articulation from neighboring sounds alters formant trajectories and spectral envelopes, introducing context-dependent variability into the signal.
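The quarter-wave formant approximation and the SNR definition above can be illustrated with a short Python/NumPy sketch; the vocal tract length, tone frequency, and noise level below are arbitrary illustrative values rather than measurements from any particular speaker.

```python
import numpy as np

def tube_formants(length_m=0.175, c=343.0, count=3):
    """Quarter-wave resonances of a uniform tube closed at one end:
    F = n*c/(4L) for odd n, matching the approximation in the text."""
    odd_n = np.arange(1, 2 * count, 2)          # 1, 3, 5, ...
    return odd_n * c / (4.0 * length_m)

def snr_db(signal, noise):
    """SNR = 10*log10(Ps/Pn), with powers taken as mean squared amplitude."""
    p_s = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_n = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_s / p_n)

print(tube_formants())          # roughly [490, 1470, 2450] Hz for a 17.5 cm tract
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 150 * np.arange(8000) / 8000)   # toy voiced tone
noise = 0.05 * rng.standard_normal(8000)
print(round(snr_db(clean, noise), 1), "dB")
```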

Phonetic and Linguistic Elements

Speech processing at the phonetic and linguistic levels involves the decomposition of the speech signal into discrete symbolic units that capture its structural and meaningful components. Phonemes represent the smallest units of sound that distinguish meaning in a language, such as /p/ and /b/ in English words like "pat" and "bat," while allophones are non-contrastive variants of a phoneme produced in specific phonetic contexts, like the aspirated [pʰ] in "pin" versus the unaspirated [p] in "spin." Syllables organize phonemes into rhythmic units, typically consisting of a nucleus (often a vowel) flanked by optional onsets and codas, providing the foundational structure for speech flow. Prosody encompasses supra-segmental features such as stress (emphasis on certain syllables), intonation (pitch contours signaling questions or statements), and rhythm (timing patterns), which overlay these segmental units to convey additional layers of meaning.

The International Phonetic Alphabet (IPA) serves as a standardized system for transcribing these sounds, enabling precise representation across languages. Consonants in the IPA are classified by manner of articulation (e.g., stops like /t/ or fricatives like /s/) and place of articulation (e.g., bilabial /p/ or alveolar /t/), with acoustic correlates including burst releases for stops or turbulent noise for fricatives. Vowels are charted by tongue height and frontness/backness, such as high front /i/ in "see" or low back /ɑ/ in "father," acoustically linked to formant frequencies, where lower first formants indicate higher tongue positions. These phonetic categories bridge the physical acoustics of speech signals to their perceptual and linguistic roles.

Linguistic hierarchies in speech processing extend from sub-phonemic features, such as nasality (airflow through the nasal cavity, as in /m/ versus /b/) or voicing (vibration of the vocal folds), to segmental units like phonemes and syllables, and upward to supra-segmental elements. Supra-segmental features include prosodic patterns and, in tonal languages like Mandarin Chinese, lexical tones where pitch variations distinguish word meanings (e.g., high tone /mā/ for "mother" versus rising tone /má/ for "hemp"). This hierarchy structures speech from fine-grained articulatory details to broader intonational phrases, facilitating the interpretation of syntax and meaning.

Speech exhibits significant variability influenced by speaker-specific factors, including age (e.g., children's higher fundamental frequencies and simpler articulations), gender (e.g., women's generally higher pitch), and accents (regional variations like rhoticity in American versus British English). Context-dependent variations, such as coarticulation, where adjacent sounds influence each other (e.g., /n/ becoming [ŋ] before /k/ in "ink"), further introduce phonetic diversity that speech processing systems must account for to achieve robustness. English, for instance, comprises approximately 44 phonemes, including 24 consonants and 20 vowels (12 monophthongs and 8 diphthongs). Prosody plays a crucial role in disambiguating syntax (e.g., stress placement distinguishing "record" as noun versus verb) and conveying emotion (e.g., rising intonation for surprise or falling for sadness).

Historical Development

Early Foundations (Pre-1950)

The foundations of speech processing emerged from early attempts to understand, visualize, and replicate human speech through mechanical and acoustic means, laying the groundwork for later scientific inquiry. In 1791, Wolfgang von Kempelen constructed an acoustic-mechanical speech machine that simulated vocal tract functions using a bellows for airflow, a reed for vibration, and adjustable resonators to produce vowels, consonants, and short phrases, marking one of the first successful efforts at mechanical speech synthesis. This device, detailed in Kempelen's treatise Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, demonstrated that speech could be generated by mimicking the physiological processes of the larynx and oral cavity, influencing subsequent inventors despite its manual operation and limited intelligibility.

Building on physiological acoustics, Hermann von Helmholtz advanced the resonance theory of hearing in his 1863 work On the Sensations of Tone as a Physiological Basis for the Theory of Music, proposing that the inner ear functions like a series of resonators tuned to specific frequencies, decomposing complex sounds into their harmonic components for pitch perception. Helmholtz's model, supported by experiments with his invented Helmholtz resonators (glass spheres connected to tubes that selectively amplify tones), provided a theoretical framework for analyzing speech as a composite of resonant frequencies, bridging auditory physiology and sound decomposition. In the 1870s, Alexander Graham Bell applied his father Alexander Melville Bell's Visible Speech system, a phonetic notation representing articulatory positions of the tongue, lips, and vocal organs, to teach speech to the deaf, emphasizing visual cues for accurate pronunciation and influencing early speech therapy methods.

Key inventions in the late 19th century enabled the visualization of speech waveforms. Thomas Edison's 1877 phonograph, using a tinfoil-wrapped cylinder to record and reproduce sound via a stylus that etched and retraced vibrations, was the first device to capture audible speech mechanically, though primarily for playback rather than analysis. Around the same period, Karl Rudolph Koenig developed the manometric flame apparatus in the 1860s, refined through the early 1900s, which converted sound waves into visible oscillations of a gas flame projected onto a rotating drum, producing early graphical representations of speech spectra akin to precursors of the modern spectrogram. Koenig's tools, including vowel analyzers with tunable resonators, allowed precise measurement of formant-like resonances in speech sounds, advancing empirical study of acoustic properties.

In the early 20th century, theoretical models of speech production gained traction. In the late 1940s, researchers at Haskins Laboratories, including Franklin S. Cooper, developed the Pattern Playback synthesizer, completed in 1950, which generated speech sounds by electrically controlling harmonic frequencies through filtered oscillators driven by painted spectrographic patterns, enabling experiments on how spectral patterns contribute to speech perception. This device synthesized isolated speech sounds by varying resonance patterns, providing empirical validation for acoustic theories of speech timbre. Meanwhile, Tsutomu Chiba and Masato Kajiyama's 1941 book The Vowel: Its Nature and Structure introduced a foundational acoustic theory of vowel production, using three-dimensional vocal tract models and area functions to calculate the first two formant frequencies for vowels, demonstrating that vowel quality arises from resonant cavities in the vocal tract. Their work, based on X-ray imaging and mathematical approximations, established formants as key spectral peaks defining vowel identity.
World War II spurred acoustic analysis through military applications, particularly in code-breaking efforts where voice identification required dissecting speech signals into frequency components, influencing the development of early spectrum analyzers for detecting phonetic patterns in intercepted communications. These analog techniques, reliant on mechanical and optical methods, were constrained by the absence of digital computation, preventing automated real-time processing and limiting analysis to static, labor-intensive recordings.

Post-War Advances (1950-2000)

Following World War II, speech processing shifted toward electronic and early digital techniques, emphasizing bandwidth-efficient transmission and rudimentary recognition systems. Homer Dudley's channel vocoder, initially developed in 1928 at Bell Laboratories, gained widespread adoption post-war for compressing speech signals by analyzing amplitude envelopes across 10 bandpass filters (250–3000 Hz) and synthesizing them using a buzz-hiss source, reducing bandwidth requirements for transmission while preserving intelligibility. This innovation, detailed in Dudley's 1939 paper, enabled secure voice communications during the war and influenced subsequent analog-to-digital transitions in the following decades.

In the early 1950s, pattern recognition emerged as a foundational approach, exemplified by Bell Laboratories' Audrey system, the first automatic digit recognizer, completed in 1952 by K. H. Davis, R. Biddulph, and S. Balashek. Audrey achieved 90–98% accuracy for isolated digits (0–9) spoken by a single user at normal rates over telephone-quality channels, using formant-based features and zero-crossing detection, though it struggled with variability beyond its trained speaker. This marked the inception of automated voice input, paving the way for voice-responsive systems, albeit limited to isolated words due to challenges in detecting speech endpoints and handling coarticulation in continuous utterances.

The 1960s introduced advanced signal modeling, with linear predictive coding (LPC) emerging as a cornerstone for speech analysis. Pioneered by B. S. Atal at Bell Laboratories in 1966 and independently by F. Itakura at NTT, LPC modeled speech as an all-pole filter excited by a periodic or noise source, enabling efficient parameter estimation for compression at rates like 16 kb/s. By the late 1960s, LPC facilitated formant tracking and synthesis, influencing vocoders like LPC-10, standardized at 2.4 kb/s for secure communications. Precursors to hidden Markov models (HMMs) also appeared, with L. E. Baum and colleagues at the Institute for Defense Analyses developing probabilistic frameworks in 1966-1967 for sequential data, initially applied to noisy observation sequences outside speech. T. K. Vintsyuk's 1968 dynamic programming for time alignment further anticipated HMMs by addressing temporal variations in speech signals.

During the 1970s and 1980s, techniques for alignment and probabilistic modeling matured, tackling the limitations of isolated-word systems. Dynamic time warping (DTW), formalized by H. Sakoe and S. Chiba in 1978, optimized pattern matching for variable speaking rates using dynamic programming, becoming essential for word-level recognition with endpoint constraints to reduce computation. HMMs, refined via the Baum-Welch algorithm for parameter estimation, were adapted for speech by J. K. Baker at Carnegie Mellon in the early 1970s, enabling speaker-independent modeling of phonetic sequences. These advances shifted focus from isolated digits to continuous speech, though challenges persisted: isolated recognition achieved near-perfect accuracy for small vocabularies by the 1980s, while continuous systems grappled with segmentation errors, disfluencies, and error rates exceeding 20% for large-vocabulary tasks due to coarticulation and noise.

The 1990s saw milestones in large-vocabulary continuous speech recognition through DARPA-funded initiatives, which standardized evaluations and corpora. Projects like the 1990 Resource Management task and the 1995 Air Travel Information System (ATIS) evaluations benchmarked HMM-based systems, with CMU's Sphinx-II achieving word error rates under 10% for 5,000-word vocabularies by integrating neural nets for feature enhancement.
Commercial viability emerged with Dragon Systems' NaturallySpeaking, released in June 1997, offering continuous dictation for general use with a 23,000-word active vocabulary and 95%+ accuracy after user training, marking the first widely accessible continuous speech-to-text software. These developments, driven by statistical methods, laid the groundwork for practical applications while highlighting ongoing hurdles in robust continuous processing across accents and environments.

Modern Era (2000-Present)

The modern era of speech processing, from 2000 onward, has been defined by exponential growth in computational power, massive datasets, and the dominance of deep learning, shifting the field from hybrid statistical models to end-to-end neural architectures that handle raw audio directly. This period marks a departure from earlier reliance on hidden Markov models (HMMs) as baselines, with neural networks enabling scalable, data-driven solutions for recognition, synthesis, and beyond. Seminal works in the late 2000s laid the groundwork, such as the 2009 application of deep belief networks (DBNs) for phone recognition, which used pretraining to achieve error rate reductions of up to 20% over Gaussian mixture models on TIMIT datasets. The 2012 success of deep convolutional networks in image classification further catalyzed adaptations of convolutional neural networks (CNNs) to mel-spectrogram inputs, improving acoustic modeling in speech systems by the mid-2010s.

A pivotal advancement came in 2015 with Baidu's Deep Speech 2, an end-to-end recurrent neural network (RNN) model that bypassed traditional hand-engineered pipelines, attaining a 7.7% word error rate on a development set from an English corpus including 500 hours of read speech, comparable to human transcription levels at the time. This era's progress accelerated with the availability of large corpora like LibriSpeech, a 2015 dataset of 1,000 hours of English audiobooks, which standardized benchmarking and trained models robust to real-world variability. Cloud-based APIs, such as Google Cloud Speech-to-Text launched in 2016, further democratized access, supporting real-time transcription in over 120 languages via scalable backends.

The 2020s introduced transformer architectures, enabling self-supervised learning and multilingual capabilities. Facebook AI's Wav2Vec 2.0 (2020) pretrained on 960 hours of unlabeled audio to learn phonetic representations, reducing word error rates by 50% or more on low-resource languages through fine-tuning and transfer learning. Similarly, Microsoft's SpeechT5 (2022) integrated speech recognition, synthesis, and translation in a unified transformer framework, achieving state-of-the-art results across tasks with a single model pretrained on 960 hours of labeled speech data from LibriSpeech, along with large-scale text data. These methods addressed longstanding challenges, including support for low-resource languages via zero-shot transfer, where models pretrained on high-resource data adapt to new languages with minimal supervision, as demonstrated in benchmarks showing 30-40% relative improvements.

Recent innovations focus on expressiveness and efficiency, with 2023 advancements in emotional speech synthesis incorporating variational autoencoders and prosody predictors to generate affect-infused audio, enhancing naturalness in applications like virtual assistants, evidenced by mean opinion scores exceeding 4.0 on emotional expressivity scales. Meanwhile, edge processing has advanced through lightweight models, such as distilled transformers, enabling on-device inference with latencies under 100 ms. Integration with large language models (LLMs) has fostered conversational AI, where speech inputs are transcribed and contextualized by models like OpenAI's Whisper (2022), which achieves 5-10% lower error rates than predecessors on diverse accents, feeding into generative text systems for seamless voice interaction. Post-2023 developments have further integrated speech processing with multimodal AI.
In May 2024, OpenAI released GPT-4o, enabling real-time voice conversations with low-latency speech-to-speech processing, supporting natural interruptions and emotional tone detection, and achieving word error rates under 5% in controlled multilingual settings. By 2025, advancements in self-supervised models like Meta's MMS (Massively Multilingual Speech) have expanded zero-shot capabilities to over 1,100 languages, reducing error rates by up to 60% in low-resource scenarios through massive pretraining on diverse audio corpora.

Analysis Techniques

Feature Extraction Methods

Feature extraction in speech processing involves transforming raw audio signals into compact representations that capture essential acoustic characteristics, facilitating tasks such as recognition and analysis. These methods aim to mimic human auditory perception while reducing dimensionality and sensitivity to noise. Traditional approaches focus on spectral features derived from short-time analysis, while advanced techniques incorporate speaker-specific or prosodic information.

One of the most widely adopted core methods is the computation of Mel-Frequency Cepstral Coefficients (MFCCs), which provide a perceptually scaled cepstral representation of the speech spectrum. The process begins with pre-emphasis and windowing of the signal, followed by application of the short-time Fourier transform (STFT) to obtain the power spectrum. This spectrum is then filtered through a set of triangular filters spaced according to the mel scale, which approximates the nonlinear frequency resolution of the human ear. The mel scale is defined as m(f) = 2595 \log_{10} \left(1 + \frac{f}{700}\right), where f is the frequency in Hz. The log energies from these filters are transformed via the discrete cosine transform (DCT) to yield the cepstral coefficients: c_m = \sum_{k=1}^K \log(S_k) \cos \left[ m (k - 0.5) \frac{\pi}{K} \right], with S_k the filter-bank energies, K the number of filters, and m the coefficient index. Typically, the first 12-13 coefficients, along with their deltas and delta-deltas, form the feature vector. MFCCs excel in capturing formant structures and have been foundational in speech recognition systems since their introduction.

Another core method is Perceptual Linear Prediction (PLP), which models the auditory system's psychophysical response more explicitly than MFCCs by incorporating equal-loudness curves and intensity-to-loudness power laws. The signal is processed through critical-band filtering, compressed via a cubic-root operation to simulate loudness perception, and then linear prediction coefficients are derived from an all-pole model of the warped spectrum. PLP features, often 12-16 coefficients, demonstrate robustness to variations in speaking rate and noise, outperforming MFCCs in certain adverse conditions.

Time-frequency analysis methods address the non-stationary nature of speech by providing localized representations. The short-time Fourier transform (STFT) divides the signal into overlapping frames (typically 20-40 ms) and applies the discrete Fourier transform to each, yielding a spectrogram that balances time and frequency resolution via window choice, such as Hamming or Hanning. This approach underpins many feature extraction pipelines, including MFCC computation, and reveals formant and pitch trajectories essential for phonetic analysis. For enhanced handling of transient events like plosives or pitch variations, the Discrete Wavelet Transform (DWT) decomposes the signal into multi-resolution subbands using scalable orthogonal bases, avoiding the fixed resolution trade-off of the STFT. The DWT employs quadrature mirror filters to iteratively split the spectrum, preserving time locality at high frequencies and frequency locality at low ones, making it suitable for denoising and segmentation in speech signals.

Advanced features extend beyond spectral envelopes to include prosodic elements, such as fundamental frequency (F0) contours and energy profiles, which encode intonation, stress, and rhythm. F0, estimated via autocorrelation or cepstrum methods, traces pitch variations critical for prosody modeling, while short-term energy contours capture amplitude modulations indicative of emphasis.
These are often extracted frame-wise and smoothed to form continuous trajectories, aiding in emotion detection and language identification. For speaker identification, low-dimensional embeddings like i-vectors and x-vectors provide utterance-level representations. I-vectors model total variability in a factor-analysis framework, projecting high-dimensional GMM supervectors into a compact subspace (e.g., 400 dimensions) that disentangles speaker and channel effects, achieving low equal error rates, such as approximately 1% on certain NIST SRE conditions. X-vectors, derived from time-delay neural networks, directly learn embeddings from frame-level features, offering improved robustness to short utterances and noise through data augmentation. Recent advancements in end-to-end models bypass handcrafted features by processing raw waveforms directly, using convolutional layers to learn hierarchical representations akin to filter banks. This approach, demonstrated on corpora like LibriSpeech, yields word error rates competitive with traditional pipelines (e.g., around 6% on clean evaluation sets) while eliminating preprocessing mismatches. Further progress includes self-supervised models, such as wav2vec 2.0, which learn robust representations from unlabeled raw audio data, achieving state-of-the-art results on downstream tasks like automatic speech recognition as of 2020.
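The MFCC pipeline described above (pre-emphasis, framing and windowing, power spectrum, mel-spaced triangular filters, log energies, and DCT) can be sketched in Python with NumPy and SciPy; the frame length, hop size, and filter counts below are common illustrative choices rather than values prescribed by a specific standard.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    # Pre-emphasis and framing with a Hamming window (illustrative settings).
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)

    # Power spectrum of each frame.
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft

    # Triangular filters spaced uniformly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log filter-bank energies followed by a DCT give the cepstral coefficients.
    energies = np.log(power @ fbank.T + 1e-10)
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

features = mfcc(np.random.default_rng(0).standard_normal(16000))
print(features.shape)   # (frames, 13)
```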

Signal Representation

Speech signals are fundamentally represented in the time domain as continuous or discrete waveforms, capturing amplitude variations over time. In digital speech processing, analog signals are sampled according to the Nyquist-Shannon sampling theorem, which requires a sampling rate at least twice the highest frequency component to avoid aliasing; for speech limited to about 4 kHz of bandwidth, a common rate is 8 kHz, while higher-quality applications like wideband speech use 16 kHz. This discrete-time representation, denoted as s(n) where n is the sample index, forms the basis for subsequent analysis, with the waveform directly visualizing phonetic events like plosives or vowels through amplitude envelopes.

Frequency-domain representations transform the time signal into the spectral domain to reveal harmonic structures and formant frequencies inherent to speech. The short-time Fourier transform (STFT) yields spectrograms, which plot frequency against time with intensity encoded by color or grayscale, providing a spectro-temporal image that highlights time-varying spectral content such as vowel transitions. Periodograms, as non-parametric spectral estimates, offer power spectral density views but are less common for dynamic speech due to their stationarity assumption. Parametric models like linear predictive coding (LPC) approximate the signal as an all-pole filter driven by excitation, expressed as: s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n), where a_k are the predictor coefficients modeling vocal tract resonances, p is the prediction order (typically 10-12 for speech at 8-16 kHz), G is the gain, and u(n) is the excitation (quasi-periodic for voiced speech or noise for unvoiced). This compact representation reduces dimensionality while preserving perceptual qualities, with LPC coefficients often serving as inputs to further processing stages.

Visual and compact forms enhance interpretability and efficiency. Formant tracks trace the frequency loci of vocal tract resonances over time, illustrating phonetic contrasts like the rising F2 in diphthongs, while pitch contours delineate fundamental frequency (F0) variations crucial for prosody and speaker identity. Vector quantization (VQ) further compresses representations by mapping high-dimensional vectors, such as spectral frames, to a finite codebook of prototypes, enabling dimensionality reduction for storage or machine learning embeddings without significant perceptual loss; seminal work by Linde, Buzo, and Gray established VQ optimality via the generalized Lloyd algorithm. Multi-dimensional aspects extend to embedding spaces, where speech segments are projected into low-dimensional manifolds via techniques like autoencoders, facilitating tasks like similarity search. Hybrid time-frequency representations, such as the constant-Q transform, address spectrogram limitations by using logarithmically spaced frequencies better suited to the perceptual scale of pitch, though they are less prevalent than the STFT in standard speech pipelines. Representations like mel-frequency cepstral coefficients (MFCCs), derived from spectral features, exemplify how these formats interface with downstream analysis.
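A minimal NumPy sketch of LPC estimation via the autocorrelation method and the Levinson-Durbin recursion is shown below; the prediction order, window, and test signal are illustrative assumptions, and the returned coefficients correspond to the a_k in the equation above.

```python
import numpy as np

def lpc(frame, order=10):
    """Estimate LPC coefficients a_1..a_p (and gain) from one frame using
    the autocorrelation method and the Levinson-Durbin recursion."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]

    a = np.zeros(order + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this model order.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)
    gain = np.sqrt(err)
    return -a[1:], gain    # a_k as in s(n) = sum_k a_k s(n-k) + G u(n)

# Toy example: a decaying 500 Hz resonance sampled at 8 kHz.
n = np.arange(400)
frame = np.sin(2 * np.pi * 500 * n / 8000) * np.exp(-n / 200)
coeffs, g = lpc(frame, order=10)
print(coeffs[:4], round(g, 4))
```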

Recognition and Modeling Techniques

Statistical Models

Statistical models form the backbone of early speech recognition systems by providing probabilistic frameworks to account for the inherent variability in speech signals, such as acoustic noise, speaker differences, and coarticulation effects. These models treat speech as a Markov process, where observable acoustic features are generated from hidden states representing phonetic or subword units, enabling the estimation of likely transcriptions from audio inputs derived from feature extraction techniques like mel-frequency cepstral coefficients.

The primary statistical model in speech processing is the hidden Markov model (HMM), which models speech sequences as transitions between hidden states, typically corresponding to phonemes or subphonemic units. Each state has associated transition probabilities to model temporal dependencies and emission probabilities to generate observed acoustic features. For continuous speech features, emission probabilities are often parameterized using Gaussian Mixture Models (GMMs), where the probability density of an observation \mathbf{o}_t in state i is given by a mixture of M Gaussians: b_i(\mathbf{o}_t) = \sum_{m=1}^M c_{im} \mathcal{N}(\mathbf{o}_t; \boldsymbol{\mu}_{im}, \boldsymbol{\Sigma}_{im}), with mixture weights c_{im}, means \boldsymbol{\mu}_{im}, and covariances \boldsymbol{\Sigma}_{im}. This GMM-HMM hybrid effectively captures the multimodal nature of speech distributions.

The likelihood of an observation sequence O = \{\mathbf{o}_1, \dots, \mathbf{o}_T\} given the HMM parameters \lambda (including transition matrix A, emission probabilities B, and initial state probabilities \pi) is: P(O|\lambda) = \sum_{Q} P(O|Q,\lambda) P(Q|\lambda), where the sum is over all possible state sequences Q = \{q_1, \dots, q_T\}. Direct computation is intractable due to the exponential number of paths, but the forward-backward algorithm efficiently calculates this via dynamic programming. The forward variables \alpha_t(i) = P(\mathbf{o}_1, \dots, \mathbf{o}_t, q_t = i | \lambda) are computed recursively as: \alpha_1(i) = \pi_i b_i(\mathbf{o}_1), \quad \alpha_{t+1}(j) = \left[ \sum_{i=1}^N \alpha_t(i) a_{ij} \right] b_j(\mathbf{o}_{t+1}), and the backward variables \beta_t(i) = P(\mathbf{o}_{t+1}, \dots, \mathbf{o}_T | q_t = i, \lambda) as: \beta_T(i) = 1, \quad \beta_t(i) = \sum_{j=1}^N a_{ij} b_j(\mathbf{o}_{t+1}) \beta_{t+1}(j). The total likelihood is then P(O|\lambda) = \sum_{i=1}^N \alpha_T(i). These recursions also provide the posterior probabilities essential for parameter estimation.

During recognition, the Viterbi algorithm approximates the maximum likelihood sequence by finding the most probable path through the trellis, using dynamic programming to avoid exhaustive search and incorporating language model scores for whole-utterance decoding. Beam-search variants of this decoding efficiently handle large vocabularies in continuous speech recognition. HMM parameters are estimated using the Baum-Welch algorithm, an expectation-maximization procedure that iteratively maximizes the likelihood by computing expected state occupancies from forward-backward variables and updating transitions, emissions, and initial probabilities accordingly. For discrete emissions, updates involve counts normalized by posteriors; for continuous GMMs, they include re-estimation of mixture components via k-means-like clustering. To handle the context-dependent variability in continuous speech, triphone models extend monophone HMMs by conditioning states on preceding and following phonemes, reducing modeling errors from coarticulation; early implementations clustered thousands of triphones into shared states for parameter tying.
Complementary statistical techniques include vector quantization (VQ), which clusters high-dimensional acoustic feature vectors into a finite codebook to reduce the complexity of HMM emission modeling, using algorithms like Linde-Buzo-Gray for codebook design based on minimizing distortion. Additionally, n-gram language models provide word sequence probabilities for word-level predictions, estimating P(w_i | w_{i-n+1}, \dots, w_{i-1}) from corpora via maximum likelihood with smoothing to handle sparse data, integrating seamlessly with HMM decoding to improve recognition accuracy on fluent speech.
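The forward recursion for P(O|\lambda) can be illustrated with a small NumPy sketch; the two-state model, emission table, and observation sequence below are toy values chosen only to demonstrate the computation.

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: alpha[t, i] = P(o_1..o_t, q_t = i | lambda).
    obs: observation indices; pi: initial probs (N,); A: transitions (N, N);
    B: discrete emission probs (N, M)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction step
    return alpha, alpha[-1].sum()                     # total likelihood P(O | lambda)

# Toy 2-state model with 3 discrete observation symbols (illustrative numbers).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])
alpha, likelihood = forward([0, 1, 2, 1], pi, A, B)
print(round(likelihood, 6))
```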

Neural Network Approaches

Neural network approaches in speech processing, particularly for automatic speech recognition (ASR), have shifted the paradigm from hybrid hidden Markov model (HMM)-based systems to direct, data-driven mappings from audio waveforms or features to textual outputs. These methods leverage deep learning to capture complex acoustic patterns, temporal dynamics, and contextual dependencies, enabling scalable training on large datasets without explicit phonetic modeling. By the 2010s, neural architectures had demonstrated superior performance over statistical predecessors, reducing reliance on hand-crafted features like mel-frequency cepstral coefficients.

Recurrent Neural Networks (RNNs), especially long short-term memory (LSTM) units, address the sequential nature of speech by maintaining hidden states that propagate information across time steps, effectively modeling variable-length utterances. Deep bidirectional LSTMs, which process audio forward and backward, achieved a 17.7% phone error rate on the TIMIT dataset, outperforming prior deep feedforward networks. Convolutional Neural Networks (CNNs) treat spectrograms as two-dimensional images, applying filters to extract local spectral and temporal features while reducing dimensionality through pooling, which proved effective in hybrid NN-HMM setups with over 10% relative error rate reductions on standard tasks. Transformers, introduced to ASR via self-attention mechanisms, eliminate recurrence to better capture long-range dependencies in audio sequences, as seen in the Speech-Transformer model, which matched LSTM performance on English datasets such as Switchboard using positional encodings adapted for continuous inputs.

End-to-end neural models further simplify ASR by bypassing intermediate alignments, with connectionist temporal classification (CTC) enabling direct training of RNNs on unsegmented audio-label pairs. The CTC loss marginalizes over all possible monotonic alignments, formulated as L = -\log \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} P(\pi | \mathbf{x}), where \mathbf{x} is the input sequence, \mathbf{y} the target labels, \pi a path including blanks, and \mathcal{B} the function collapsing repeats and blanks to recover \mathbf{y}. Attention-based sequence-to-sequence frameworks, exemplified by Listen, Attend and Spell (LAS), extend this by using an encoder to produce audio representations and a decoder that attends to relevant portions during text generation, computing attention weights as \alpha_{ti} = \mathrm{softmax}_i(\mathrm{score}(h_t, s_i)), where h_t is the decoder hidden state at time t and s_i the encoder states; LAS achieved competitive character-level error rates on English datasets without pronunciation dictionaries.

Recent advances emphasize self-supervised learning and scalability, with models like HuBERT pre-training representations on unlabeled audio through iterative clustering and masked prediction, yielding fine-tuned ASR WERs as low as 2.0% on LibriSpeech clean subsets when transferred to downstream tasks. Multilingual systems, trained on massive diverse corpora, leverage cross-lingual transfer to handle low-resource languages, driving 2020s benchmarks below 5% WER on clean English speech, as evidenced by models like Whisper on LibriSpeech test-clean. Emerging diffusion models, while primarily generative, are being adapted to speech enhancement tasks that indirectly boost recognition robustness by denoising inputs, bridging gaps in handling noisy real-world audio. As of 2024-2025, Speech Language Models (SpeechLMs) have emerged, combining speech representations with large language models to enhance contextual and multilingual speech recognition.
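The CTC objective can be illustrated by brute-force enumeration of all alignment paths \pi whose collapse \mathcal{B}(\pi) equals the target, which is feasible only at toy sizes; practical systems use the efficient CTC forward-backward recursion instead. The vocabulary, frame posteriors, and target below are arbitrary illustrative values.

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    """B(pi): merge repeated labels, then remove blanks."""
    merged = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return tuple(p for p in merged if p != blank)

def ctc_loss_bruteforce(log_probs, target, blank=0):
    """-log sum over all paths pi with B(pi) == target of prod_t P(pi_t | x_t).
    log_probs: (T, V) per-frame log posteriors over labels including the blank."""
    T, V = log_probs.shape
    total = -np.inf
    for path in itertools.product(range(V), repeat=T):
        if collapse(path, blank) == tuple(target):
            total = np.logaddexp(total, sum(log_probs[t, p] for t, p in enumerate(path)))
    return -total

# Toy example: 4 frames, vocabulary {blank, 'a', 'b'}, target "ab".
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=4)        # (T=4, V=3), rows sum to 1
loss = ctc_loss_bruteforce(np.log(probs), target=[1, 2])
print(round(float(loss), 4))
```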

Phase and Time-Domain Methods

Phase and time-domain methods in speech processing emphasize the manipulation and preservation of temporal structure and phase information in audio signals, which are crucial for tasks like alignment, pitch estimation, and time-scale modification. These techniques operate directly on the waveform or its short-time Fourier transform (STFT), focusing on geometric matching or derivative-based analysis rather than probabilistic modeling. By addressing nonlinear variations in timing and discontinuities, they enable robust handling of variable-rate speech without relying on frequency-domain magnitude alone.

Dynamic Time Warping (DTW) is a foundational algorithm for aligning sequences with differing durations, such as speech utterances or musical performances, by finding an optimal nonlinear path that minimizes the cumulative distance between corresponding points. The DTW distance between two sequences x and y of lengths m and n is computed via dynamic programming, where the cost matrix entry is given by D(i,j) = d(x_i, y_j) + \min \left\{ D(i-1,j),\ D(i,j-1),\ D(i-1,j-1) \right\}, with boundary conditions D(0,0) = 0 and D(i,0) = D(0,j) = \infty for i,j > 0, and d typically being the Euclidean distance. To improve computational efficiency, particularly for long sequences in speech recognition, the Sakoe-Chiba band constrains the warping path within a diagonal band of width 2r+1, limiting admissible alignments to |i - j| \leq r and reducing time complexity from O(mn) to O(mr) when r \ll n. DTW has been widely applied in isolated-word recognition and alignment tasks, achieving mean alignment errors of around 50 ms on benchmark speech datasets.

Phase-aware processing leverages the phase component of the STFT to capture temporal information often discarded in magnitude-based methods, enabling modifications like time-stretching while preserving perceptual qualities. The group delay function, defined as the negative derivative of the unwrapped phase, \tau_g(\omega) = -\frac{d\phi(\omega)}{d\omega}, quantifies the time delay of signal envelopes at each frequency \omega, providing insights into excitation event locations and vocal tract resonances in speech. For instance, in voiced speech, peaks in the group delay spectrum correspond to glottal closures, aiding in segmentation tasks with accuracy improvements of up to 20% over magnitude-only features. Phase vocoding extends this by resampling the phase trajectory in the STFT domain to achieve time-stretching without altering pitch: the instantaneous frequency f_i(\omega) = \frac{\phi'(\omega)}{2\pi} is scaled by a stretch factor \alpha, allowing the signal to be elongated or compressed while maintaining harmonic structure, as demonstrated in applications yielding mean opinion scores above 4.0 for naturalness in speech modification.

Time-domain approaches directly analyze the waveform to extract temporal features, bypassing frequency transformations for simplicity and low latency. Autocorrelation, computed as R(\tau) = \sum_n s(n) s(n+\tau), detects pitch periods by identifying the lag \tau of the first maximum peak beyond zero lag, robust to noise in voiced segments with fundamental frequency estimation errors under 2% on clean speech databases. This method preprocesses the signal with nonlinearities like center clipping to suppress formant ripples, enhancing peak sharpness for reliable detection in real-time systems.
Zero-phase filtering, achieved by filtering the signal forward and then backward in time so that the phase responses of the two passes cancel, eliminates phase distortion while preserving the original waveform timing, making it ideal for preprocessing in speech analysis, where it reduces group delay to zero across the band and improves pitch tracking precision by 15-25% in filtered segments. Recent advancements in phase reconstruction for neural vocoders incorporate these principles by explicitly estimating phase derivatives from spectrograms, enhancing fidelity in time-domain waveform generation.
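A compact NumPy sketch of the DTW recursion with a Sakoe-Chiba band is given below; the feature sequences are random stand-ins for MFCC frames, and the band radius must be wide enough to reach the end point when the sequences differ in length.

```python
import numpy as np

def dtw_distance(x, y, radius=None):
    """DTW between feature sequences x (m, d) and y (n, d) using Euclidean
    frame distances; `radius` applies the Sakoe-Chiba band |i - j| <= radius."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        lo = 1 if radius is None else max(1, i - radius)
        hi = n if radius is None else min(n, i + radius)
        for j in range(lo, hi + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

# Toy example: two "utterances" of MFCC-like frames with different lengths.
rng = np.random.default_rng(0)
a = rng.standard_normal((40, 13))
b = np.vstack([a[::2], rng.standard_normal((15, 13))])   # time-warped variant
print(round(dtw_distance(a, b, radius=10), 2))
```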

Synthesis and Generation Techniques

Rule-Based Synthesis

Rule-based synthesis generates speech through explicit algorithmic rules that map symbolic inputs, such as phonemes or text, to acoustic parameters, predating data-driven methods and relying on hand-crafted models of speech production. These systems typically involve linguistic processing to derive phonetic representations, followed by rules for acoustic realization, including formant frequencies, source excitation, and prosodic features like pitch and duration. Seminal implementations emphasize modularity, allowing independent control over vocal tract modeling and glottal source parameters to produce intelligible speech from unrestricted text.

Formant synthesis, a cornerstone of rule-based approaches, simulates the vocal tract as a series of resonators to produce spectral peaks known as formants. The Klatt synthesizer, introduced in 1980, combines cascade and parallel resonator configurations to generate up to five formants (F1 through F5), with rules specifying steady-state frequencies and bandwidths derived from acoustic measurements of natural speech. For vowels, these rules often position F1 (related to tongue height) and F2 (related to tongue advancement) within a triangular acoustic space, where formant values for intermediate vowels are interpolated from corner vowels like /i/, /a/, and /u/ based on context. Formant transitions between phonemes are modeled via linear interpolation, ensuring smooth coarticulatory effects; for instance, the frequency of a formant F over time t in a segment of duration T is given by F(t) = F_{\text{start}} + (F_{\text{end}} - F_{\text{start}}) \cdot \frac{t}{T}, where F_{\text{start}} and F_{\text{end}} are target values at segment boundaries. This approach, while computationally efficient, produces speech with a characteristic buzz-like quality due to simplified source-filter assumptions.

Diphone-based rule synthesis extends formant methods by concatenating minimal units, typically the transition between two adjacent phonemes (diphones), selected via rules that account for phonetic context. Allophone selection rules determine the appropriate variant of a phoneme based on neighboring sounds, minimizing discontinuities at join points through techniques like pitch-synchronous overlap-add (PSOLA) for smoothing. Prosody is imposed post-concatenation using rules for duration (e.g., lengthening in stressed syllables) and intonation (fundamental frequency, F0), often drawing from phonetic principles to model stress and emphasis. Coarticulation, the influence of adjacent sounds on articulation, is handled by rule-driven adjustments to formant trajectories or diphone boundaries.

Intonation in rule-based systems is generated through models like the Tilt model, which decomposes F0 contours into sequential rise-fall shapes parameterized by amplitude, duration, and tilt (the relative proportion of rise to fall). These parameters are set by rules tied to linguistic features, such as phrase boundaries or pitch accents, with coarticulatory effects modeled by overlapping events. Alternatively, pitch contours can be constructed via superposition of accent and phrase components, as in models where the global F0 is the sum of local accent pulses and a slower phrase curve, enabling rule-based control over declarative or interrogative patterns. Such methods rely on phonetic rules for event placement and scaling to approximate natural variability. Despite their foundational role, rule-based synthesizers often yield robotic-sounding output due to overly simplistic prosodic rules that fail to capture subtle human variations in timing and intonation.
This limitation is particularly pronounced for non-English languages, where hand-crafted rule sets for tonal or stress-based prosody remain underdeveloped compared to English-focused systems, leading to unnatural rhythm and emphasis.
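The linear formant transition rule F(t) = F_start + (F_end − F_start)·t/T can be illustrated with a few lines of Python; the vowel targets below are rough textbook values for /a/ and /i/ used only as an example, not parameters taken from the Klatt synthesizer.

```python
import numpy as np

def formant_track(f_start, f_end, duration_ms, frame_ms=5.0):
    """Linear formant transition F(t) = F_start + (F_end - F_start) * t / T,
    sampled every frame_ms, as used for rule-based inter-phoneme transitions."""
    t = np.arange(0.0, duration_ms + frame_ms, frame_ms)
    return f_start + (f_end - f_start) * t / duration_ms

# Illustrative /a/ -> /i/ transition for the first two formants.
f1 = formant_track(f_start=730.0, f_end=270.0, duration_ms=100.0)
f2 = formant_track(f_start=1090.0, f_end=2290.0, duration_ms=100.0)
print(f1[::5])   # F1 falls as the tongue rises
print(f2[::5])   # F2 rises as the tongue fronts
```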

Concatenative and Statistical Synthesis

Concatenative speech synthesis generates speech by selecting and joining pre-recorded units, such as diphones or phonemes, from a large speech database to minimize discontinuities and maximize naturalness. This approach emerged in the 1990s as a data-driven alternative to rule-based methods, relying on extensive databases to capture speaker-specific variations in prosody and timbre. The Festival Speech Synthesis System, initially released in 1996, exemplifies this paradigm by providing an open-source framework for unit selection-based synthesis, enabling developers to build voices through corpus labeling and search algorithms. In practice, unit selection involves constructing a lattice of candidate units and optimizing a path that balances fidelity to the target with seamless concatenation.

A core component of concatenative synthesis is the cost function used to evaluate unit candidates during selection. The total cost C for a sequence is typically formulated as a weighted sum: C = w_t \cdot TC + w_c \cdot CC, where TC is the target cost measuring how closely a unit matches linguistic and prosodic specifications (e.g., spectral similarity via mel-cepstral distortion), CC is the concatenation cost assessing join quality (e.g., via spectral alignment or perceptual metrics), and w_t, w_c are empirically tuned weights. This formulation, introduced in early unit selection systems, ensures selected units align with the desired sequence while minimizing audible artifacts at boundaries. Smoothing techniques, such as pitch-synchronous overlap-add (PSOLA) or prosodic modifications, are often applied post-selection to refine pitch, duration, and energy for smoother output.

Statistical parametric synthesis, prominent since the 2000s, models speech as sequences of acoustic parameters predicted from text inputs, contrasting with direct waveform concatenation. Hidden Markov models (HMMs) form the backbone, where context-dependent HMMs jointly estimate spectral envelopes, fundamental frequency (F0), and durations from training data. The HTS (HMM-based Speech Synthesis) system, version 2.0 released in 2007, advanced this by incorporating multi-stream modeling for spectrum and excitation, along with decision tree-based parameter generation to handle linguistic contexts efficiently. Synthesis proceeds by sampling parameters from the trained HMMs and reconstructing the waveform via vocoders such as the MLSA filter or STRAIGHT, yielding compact representations suitable for limited-resource devices. Unlike concatenative methods, parametric approaches allow flexible prosody control but may introduce over-smoothing, reducing expressiveness.

Duration modeling in statistical synthesis is critical for natural rhythm, often employing explicit distributions to predict state occupancy in HMMs. Gamma distributions are commonly used for their flexibility in capturing skewed, positive-valued durations: the probability density function is f(d; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} d^{\alpha-1} e^{-\beta d}, where d is duration, \alpha shapes the distribution, and \beta scales it. This modeling, refined in HMM frameworks, jointly optimizes state-level and higher-unit (e.g., phoneme or syllable) durations to better fit empirical data, improving timing accuracy over implicit geometric distributions. Studies show gamma fits outperform Gaussian assumptions, enhancing perceived fluency in generated speech.

Hybrid approaches integrate concatenative unit selection with statistical modeling to leverage the strengths of both, such as natural waveform quality from recorded databases and flexibility for unseen contexts. In these systems, statistical models generate target specifications to guide the unit search, followed by signal processing for modifications such as pitch or duration adjustment.
A 2011 hybrid framework demonstrated improved mean opinion scores by selecting natural units when available and falling back to statistically generated parameters otherwise, reducing buzziness in concatenative joins while preserving speaker identity. These methods, building on HMM parameter prediction, address limitations of pure concatenative systems for expressive or low-resource synthesis.
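The weighted unit selection cost C = w_t·TC + w_c·CC is typically minimized with a Viterbi-style dynamic programming search over candidate units; the sketch below uses random placeholder costs and illustrative weights rather than costs from a real voice database.

```python
import numpy as np

def select_units(target_costs, concat_costs, w_t=1.0, w_c=1.0):
    """Viterbi search over candidate units.
    target_costs: list of arrays, target_costs[i][k] = TC of candidate k at position i.
    concat_costs: list of matrices, concat_costs[i][k, l] = CC of joining
    candidate k at position i to candidate l at position i+1."""
    n = len(target_costs)
    best = w_t * target_costs[0]
    back = []
    for i in range(1, n):
        # total[k, l]: best path cost ending in candidate l via predecessor k.
        total = best[:, None] + w_c * concat_costs[i - 1] + w_t * target_costs[i][None, :]
        back.append(np.argmin(total, axis=0))
        best = np.min(total, axis=0)
    # Trace back the lowest-cost unit sequence.
    path = [int(np.argmin(best))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(np.min(best))

# Toy database: 3 target positions, 4 candidate units each (random costs).
rng = np.random.default_rng(0)
tc = [rng.random(4) for _ in range(3)]
cc = [rng.random((4, 4)) for _ in range(2)]
units, cost = select_units(tc, cc, w_t=1.0, w_c=0.5)
print(units, round(cost, 3))
```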

Neural Synthesis Methods

Neural synthesis methods represent a paradigm shift in text-to-speech (TTS) systems, leveraging deep learning architectures to generate high-fidelity speech waveforms directly from text inputs, often in an end-to-end manner without relying on traditional parametric modeling. Pioneered in the late 2010s, these approaches integrate sequence-to-sequence (seq2seq) models for acoustic feature prediction with neural vocoders for waveform synthesis, achieving naturalness comparable to human speech in controlled domains. Unlike earlier concatenative or statistical techniques, neural methods learn hierarchical representations, enabling scalable training on large datasets and improved generalization across speakers and languages.

A foundational example is the combination of Tacotron and WaveNet, introduced in 2017. Tacotron employs a seq2seq architecture with an encoder-decoder framework and attention mechanism to map input text to mel-spectrogram representations, which are then used to condition WaveNet for autoregressive raw waveform generation. WaveNet models audio as a sequence of samples, using stacked dilated causal convolutions to capture long-range dependencies in the waveform; the output at time t is computed as y_t = g\left( \sum_k h_k \cdot x_{t - d_k} \right) + b, where h_k are filter coefficients, d_k denote dilation factors, g is a gating function, and b is a bias term, allowing an exponentially expanding receptive field without excessive parameters. This autoregressive process, while producing highly coherent speech, incurs high computational cost due to sequential generation.

To address inference latency, parallel generation techniques emerged, such as WaveGAN in 2018, which applies generative adversarial networks (GANs) to waveform synthesis. WaveGAN trains a generator to produce audio samples from noise, adversarially against a discriminator that distinguishes real from synthetic audio; the objective is formulated as \min_G \max_D \mathbb{E}[\log D(\mathbf{x})] + \mathbb{E}[\log(1 - D(G(\mathbf{z})))], where \mathbf{x} is real data and \mathbf{z} is noise, enabling non-autoregressive, parallel generation at speeds orders of magnitude faster than WaveNet while maintaining perceptual quality. Subsequent variants like Parallel WaveGAN further optimize this by distilling knowledge from teacher models, achieving a real-time factor (RTF) below 0.1 on standard hardware.

In the 2020s, diffusion and flow-based models advanced neural vocoders for even higher fidelity and efficiency. Denoising diffusion probabilistic models (DDPMs), such as Grad-TTS (2021), iteratively refine noise into mel-spectrograms or waveforms by reversing a forward diffusion process, offering stable training and superior prosody control compared to GANs. Flow-based approaches, exemplified by WaveGlow (2018), model exact likelihoods via invertible transformations, allowing parallel generation of waveforms from conditioned spectrograms with RTF near 0.01 and mean opinion scores (MOS) exceeding 4.0 on naturalness. These methods surpass earlier autoregressive vocoders in scalability, with diffusion models particularly excelling in diverse acoustic conditions.

Expressive synthesis in neural TTS incorporates emotion and speaking style through conditioning vectors, enabling control over prosody, voice characteristics, and affective attributes. Techniques like global style tokens (GSTs), integrated into Tacotron frameworks since 2018, learn embeddings from reference audio to capture speaking styles, which condition the decoder for transferring expressiveness without explicit labels.
For emotion-specific control, emotional Tacotron variants (2017) embed categorical or continuous emotion vectors into the input sequence, adjusting pitch, energy, and duration to synthesize affective speech, as demonstrated by improvements in emotional intelligibility scores. Efficient systems like FastSpeech 2 (2020), a non-autoregressive acoustic model, further enable low-latency expressive TTS by predicting durations and variances explicitly, achieving MOS above 4.2 and RTF under 0.01 when paired with parallel vocoders. Since 2022, neural TTS has integrated large language models (LLMs) for enhanced contextual understanding and zero-shot synthesis, allowing high-fidelity speech generation from minimal speaker data. Models like VALL-E (2023) and NaturalSpeech 2 (2023) achieve naturalness scores above 4.5 by leveraging in-context learning from short audio clips, enabling voice cloning with just seconds of reference material. By 2025, advancements in neural codec language models and diffusion-based systems, including systems reporting quality scores up to 5.53, have further improved efficiency and multilingual support, with real-time factors under 0.05 on consumer hardware.
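The exponentially growing receptive field produced by stacked dilated causal convolutions can be illustrated with a NumPy sketch; the filter taps are random (this is not a trained vocoder), and the dilation schedule 1, 2, 4, ..., 512 mirrors a commonly cited WaveNet-style configuration.

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation], with left zero-padding (causal)."""
    k = len(weights)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return sum(w * xp[pad - i * dilation : pad - i * dilation + len(x)]
               for i, w in enumerate(weights))

dilations = [2 ** d for d in range(10)]          # 1, 2, 4, ..., 512
kernel_size = 2
receptive_field = 1 + (kernel_size - 1) * sum(dilations)
print("receptive field:", receptive_field, "samples")   # 1024 for one stack

# Pass a toy signal through the stack with random 2-tap filters and a tanh gate.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
for d in dilations:
    x = np.tanh(causal_dilated_conv(x, rng.standard_normal(kernel_size), d))
print(x.shape)
```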

Enhancement and Coding Techniques

Noise Reduction and Enhancement

Noise reduction and enhancement in speech processing aim to suppress unwanted distortions such as additive background noise and reverberation while preserving the integrity of the target speech signal. These techniques are essential for improving speech intelligibility in adverse acoustic environments, particularly in applications like telecommunications, hearing aids, and assistive devices. Traditional methods rely on statistical estimation of noise characteristics, while modern approaches integrate deep learning to achieve more robust performance.

Spectral subtraction is a foundational single-channel technique that estimates the noise spectrum from non-speech segments and subtracts it from the noisy speech spectrum to recover the clean signal, often using |\hat{S}(\omega)| = \max(0, |X(\omega)| - \alpha |N(\omega)|), where X(\omega) is the noisy spectrum, N(\omega) is the estimated noise spectrum, and \alpha is an over-subtraction factor typically between 1 and 5 to account for estimation errors. Introduced in 1984 and refined in subsequent work, the Ephraim-Malah method uses a minimum mean-square error (MMSE) short-time spectral amplitude (STSA) estimator, modeling the speech and noise spectral components as conditionally independent Gaussian random variables given their variances. This approach mitigates the musical artifacts common in basic subtraction by deriving a soft-decision gain function based on the a priori signal-to-noise ratio (SNR), involving modified Bessel functions: \hat{A}_k = \Gamma(1.5) \frac{\sqrt{v_k}}{\gamma_k} \exp\left(-\frac{v_k}{2}\right) \left[ (1 + v_k) I_0\left(\frac{v_k}{2}\right) + v_k I_1\left(\frac{v_k}{2}\right) \right] R_k, \quad v_k = \frac{\xi_k}{1 + \xi_k} \gamma_k, where \xi_k is the a priori SNR, \gamma_k is the a posteriori SNR, and R_k is the noisy spectral amplitude.

Wiener filtering complements spectral subtraction by providing an optimal linear estimator that minimizes the mean square error between the clean and enhanced signals. It computes the gain as the ratio of the a priori signal-to-noise ratio (SNR) to 1 plus the a priori SNR, effectively attenuating frequency bins dominated by noise. This method assumes additive noise and is particularly effective for stationary conditions, achieving up to 10 dB improvement in segmental SNR on benchmark datasets.

Time-frequency masking represents an advancement over direct spectral methods, treating speech enhancement as a masking problem in the time-frequency domain. The ideal binary mask (IBM) assigns each time-frequency unit to the target speech if the local SNR exceeds 0 dB, otherwise to noise, yielding near-perfect separation in low-SNR scenarios with intelligibility gains of 10-15 dB. Recent hybrids leverage deep learning, such as recurrent neural networks (RNNs), to predict soft or ratio masks from noisy spectrograms, outperforming traditional estimators by incorporating temporal dependencies and reducing phase distortions. For instance, deep recurrent neural network-based masking has shown signal-to-distortion ratio (SDR) improvements of 2.3-5 dB over baselines in speech denoising tasks.

Multi-microphone techniques exploit spatial diversity to enhance speech, with beamforming being a cornerstone method. The minimum variance distortionless response (MVDR) beamformer minimizes output noise power while maintaining unity gain toward the target direction. The optimal weights are given by \mathbf{w} = \frac{\mathbf{R}_n^{-1} \mathbf{a}}{\mathbf{a}^H \mathbf{R}_n^{-1} \mathbf{a}}, where \mathbf{R}_n is the noise covariance matrix and \mathbf{a} is the steering vector for the desired source. This achieves 5-10 dB noise reduction in reverberant settings when combined with accurate direction-of-arrival estimation.
Post-filtering extends MVDR by applying a single-channel suppressor to the beamformer output, further addressing residual noise and reverberation through covariance subtraction or Kalman-based dereverberation, improving overall SNR by an additional 3-5 dB. Evaluation of these methods often uses datasets like the NOIZEUS corpus, which provides diverse noisy speech scenarios for benchmarking perceptual quality and intelligibility. As of 2025, research has increasingly emphasized low-latency implementations for hearing aids, targeting delays under 5 ms to avoid perceptual artifacts, with deep filtering and hybrid signal-processing/neural approaches achieving real-time performance while improving speech reception thresholds by 4-6 dB in complex noise, as demonstrated in studies like the URGENT 2024 Challenge.
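Magnitude spectral subtraction with an over-subtraction factor, as described above, can be sketched with SciPy's STFT and inverse STFT; the assumption that the leading few hundred milliseconds contain only noise, and all parameter values, are illustrative choices.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, sr=16000, noise_ms=300, alpha=2.0, nperseg=512):
    """|S_hat| = max(0, |X| - alpha*|N|); the phase is taken from the noisy signal.
    The noise spectrum is estimated from an assumed speech-free leading segment."""
    f, t, X = stft(noisy, fs=sr, nperseg=nperseg)
    noise_frames = max(1, int((noise_ms / 1000) * sr / (nperseg // 2)))
    noise_mag = np.mean(np.abs(X[:, :noise_frames]), axis=1, keepdims=True)
    clean_mag = np.maximum(np.abs(X) - alpha * noise_mag, 0.0)
    S_hat = clean_mag * np.exp(1j * np.angle(X))
    _, enhanced = istft(S_hat, fs=sr, nperseg=nperseg)
    return enhanced[:len(noisy)]

# Toy demo: a 200 Hz tone buried in white noise, preceded by a noise-only segment.
rng = np.random.default_rng(0)
sr = 16000
tone = np.concatenate([np.zeros(sr // 2), np.sin(2 * np.pi * 200 * np.arange(sr) / sr)])
noisy = tone + 0.3 * rng.standard_normal(len(tone))
enhanced = spectral_subtraction(noisy, sr=sr)
print(noisy.shape, enhanced.shape)
```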

Speech Coding Standards

Speech coding standards define algorithms and protocols for compressing speech signals to enable efficient storage and transmission while maintaining perceptual quality. These standards, primarily developed by organizations like the International Telecommunication Union (ITU-T) and the 3rd Generation Partnership Project (3GPP), balance bitrate reduction with reconstruction fidelity, typically employing lossy techniques that exploit human auditory perception rather than the lossless methods used for archival audio.

One of the earliest and most foundational standards is ITU-T G.711, which uses pulse-code modulation (PCM) to encode speech at 64 kbps, providing logarithmically companded or linear representations suitable for narrowband telephony without significant perceptual loss at that rate. For lower bitrates, analysis-by-synthesis techniques emerged, particularly linear predictive coding (LPC) variants. Code-excited linear prediction (CELP), introduced in the mid-1980s, models speech as an LPC-filtered excitation signal, where the excitation is selected from a codebook to minimize the error between the original and synthesized speech: \min_{e} \| s - \hat{S}(\hat{a}, e) \|, with s as the input speech frame, \hat{a} as LPC coefficients, and e as the codebook excitation. This approach achieves high-quality coding at 4.8–16 kbps, forming the basis for standards like ITU-T G.729, a conjugate-structure algebraic CELP (CS-ACELP) codec operating at 8 kbps for voice over IP and digital networks.

Building on CELP, the Adaptive Multi-Rate (AMR) codec, standardized by 3GPP in the late 1990s for mobile communications, supports variable bitrates from 4.75 to 12.2 kbps, adapting to channel conditions in GSM and UMTS networks. Similarly, the Enhanced Voice Services (EVS) codec, introduced by 3GPP in 2014, integrates super-wideband coding up to 20 kHz at bitrates from 5.9 to 128 kbps and is integrated into 5G systems for immersive voice services. Another versatile standard is Opus, defined in IETF RFC 6716 in 2012, which combines SILK (a linear prediction-based codec) and CELT (a modified discrete cosine transform-based codec) for variable bitrates from 6 to 510 kbps, excelling in real-time applications like WebRTC. Transform coding methods, such as the modified discrete cosine transform (MDCT) in AAC-ELD (Advanced Audio Coding - Enhanced Low Delay), enable low-latency encoding at 12.8–64 kbps for speech and music in VoIP, prioritizing delay under 30 ms.

Recent advancements incorporate neural networks for ultra-low-bitrate coding; for instance, SoundStream (2021) uses a convolutional autoencoder with residual vector quantization to achieve high-fidelity reconstruction at bitrates targeted by speech codecs (around 3-32 kbps), outperforming traditional codecs in perceptual metrics. As of 2025, ongoing efforts like the LRAC Challenge focus on ultra-low-bitrate (1-6 kbps), low-complexity codecs for everyday hardware. Optimal speech coding adheres to rate-distortion theory, where the minimum bitrate R(D) for a distortion level D equals the minimum mutual information I(X; \hat{X}) between the input X and reconstruction \hat{X} over all encodings meeting the distortion constraint, guiding the design of lossy codecs that discard inaudible components. These standards emphasize lossy approaches for speech due to its structured, redundant nature, enabling the bandwidth savings critical for mobile and packet networks.
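Continuous μ-law companding, which G.711 approximates with a piecewise-linear segmented scheme, can be sketched in a few lines of NumPy; the helper names and test tone are illustrative, and 8 bits per sample at 8 kHz corresponds to the 64 kbps rate cited above.

```python
import numpy as np

MU = 255.0  # mu-law constant used in the North American / Japanese G.711 variant

def mu_law_encode(x):
    """Compress samples in [-1, 1] and quantize to 8 bits (256 levels):
    F(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((compressed + 1.0) / 2.0 * 255.0).astype(np.uint8)

def mu_law_decode(codes):
    """Invert the companding to recover approximate sample values."""
    y = codes.astype(float) / 255.0 * 2.0 - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

sr = 8000                                   # narrowband telephony rate
t = np.arange(sr) / sr
speech_like = 0.5 * np.sin(2 * np.pi * 220 * t)
codes = mu_law_encode(speech_like)          # 8 bits/sample * 8 kHz = 64 kbps
restored = mu_law_decode(codes)
print(codes.dtype, np.max(np.abs(restored - speech_like)).round(4))
```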

Applications

Human-Computer Interfaces

Human-computer interfaces leverage speech processing to enable natural, voice-based interactions between users and computing systems, primarily through automatic speech recognition (ASR) and synthesis technologies that facilitate seamless command execution and conversational exchanges. These interfaces have evolved from early command-response systems to sophisticated virtual assistants, where speech serves as the primary input modality for tasks like information retrieval, device control, and entertainment. Seminal developments include Apple's Siri, launched on October 4, 2011, with the iPhone 4S, which integrated ASR to handle user queries via natural language processing. Similarly, Amazon's Alexa debuted on November 6, 2014, with the Echo smart speaker, employing cloud-based speech recognition to support home automation and multimedia control. A critical component of these voice assistants is wake-word detection, implemented through keyword spotting algorithms that continuously monitor audio streams for activation phrases like "Hey Siri" or "OK Google" without full transcription until triggered. These systems often use deep neural networks (DNNs) combined with hidden Markov models for efficient, low-power detection on devices, achieving high accuracy while minimizing false positives in noisy environments. In dialogue systems, spoken language understanding (SLU) pipelines process recognized speech to extract semantic intent, enabling multi-turn conversations where the system maintains context across exchanges. For instance, SLU modules parse utterances into structured representations, such as intents and entities, to guide responses in task-oriented dialogues. Error handling in these multi-turn interactions involves techniques like user confirmation prompts or self-correction mechanisms to mitigate ASR inaccuracies, ensuring robust conversation flow even with ambiguous or erroneous inputs. Speech processing enhances accessibility in human-computer interfaces by supporting real-time captioning, which transcribes live audio into text for deaf or hard-of-hearing users during video calls or presentations, adhering to standards like WCAG 2.1 Success Criterion 1.2.4 for live captions. Voice control systems integrated with eye-tracking further empower users with motor disabilities; for example, Tobii Dynavox's eye gaze-enabled devices combine text-to-speech synthesis with gaze selection to generate spoken output, allowing nonverbal individuals to communicate via synthesized voice. Apple's 2024 accessibility updates (iOS 18) enable eye-tracking on iPhone and iPad devices for hands-free navigation, including voice command selection, reducing physical barriers to interaction. Key challenges in these interfaces include achieving low latency, ideally under 500 milliseconds for end-to-end response times to mimic natural conversation pacing, as delays beyond this threshold degrade the user experience in conversational applications. Privacy concerns arise from always-on listening modes, in which devices continuously monitor ambient audio for wake-words and may capture unintended sensitive data; mitigation strategies include on-device processing and user-configurable controls to limit cloud uploads. In the 2020s, multimodal interfaces have emerged, integrating speech with gestures for more intuitive HCI, as seen in systems combining voice commands with hand-tracking for enhanced expressiveness in immersive environments. Neural and multimodal fusion methods underpin these advancements, enabling fluid interaction across modalities. As of 2025, further integrations include AI-enhanced emotion-aware assistants using prosody analysis for more empathetic interactions.
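A simplified sketch of the posterior-smoothing logic often used in DNN-based keyword spotting is shown below: frame-level wake-word posteriors (random placeholders here, standing in for a small neural network's outputs) are smoothed over a short window, and a sliding-window confidence score triggers a detection when it crosses a threshold. The window lengths and threshold are assumed values, not those of any production wake-word system.

```python
import numpy as np

def smoothed_confidence(posteriors, smooth_win=10, search_win=100):
    """posteriors: (T, K) frame-level wake-word class probabilities.
    Returns a per-frame confidence score after moving-average smoothing."""
    T, K = posteriors.shape
    smoothed = np.empty_like(posteriors)
    for t in range(T):
        lo = max(0, t - smooth_win + 1)
        smoothed[t] = posteriors[lo:t + 1].mean(axis=0)   # moving average per class
    conf = np.empty(T)
    for t in range(T):
        lo = max(0, t - search_win + 1)
        # confidence = geometric mean of per-class maxima in the sliding window
        conf[t] = np.prod(smoothed[lo:t + 1].max(axis=0)) ** (1.0 / K)
    return conf

def detect_wake_word(posteriors, threshold=0.8):
    """Fire a detection whenever the smoothed confidence crosses the threshold."""
    return np.where(smoothed_confidence(posteriors) > threshold)[0]

# Toy run with random posteriors for a two-unit wake phrase
rng = np.random.default_rng(0)
fake_posteriors = rng.uniform(0, 1, size=(300, 2))
hits = detect_wake_word(fake_posteriors, threshold=0.95)
```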

Telecommunications and Media

In telecommunications, speech processing plays a crucial role in enabling high-quality voice transmission over networks, particularly through techniques that mitigate distortions and optimize bandwidth usage. Acoustic echo cancellation (AEC) is essential for VoIP and telephony systems, where it subtracts echoes caused by acoustic coupling between speakers and microphones, ensuring full-duplex communication without feedback. Wideband codecs, such as G.722, further enhance telephony by supporting high-definition (HD) voice with a frequency range of 50 Hz to 7 kHz, doubling the bandwidth of traditional narrowband codecs like G.711 to deliver clearer, more natural-sounding calls. In media production, speech processing facilitates content localization and personalization. Automatic dubbing leverages AI-driven speech-to-speech pipelines to transcribe, translate, and synthesize audio while preserving the original speaker's voice and emotional tone, enabling seamless multilingual adaptations for global audiences. Voice cloning technologies, exemplified by Adobe's Project VoCo demonstrated in 2016, allow for the manipulation of recorded speech to generate new phrases in a speaker's voice from short audio samples, raising possibilities for media production while sparking ethical discussions on audio authenticity. Streaming services for podcasts and live audio rely on speech processing to maintain quality amid variable network conditions. Adaptive bitrate streaming dynamically adjusts encoding rates based on available bandwidth, ensuring uninterrupted playback by switching between lower and higher bitrates without perceptible artifacts in speech-heavy content like podcasts. Real-time speech translation, as integrated into Google Translate since the rollout of its speech features in the 2010s, processes live audio input to provide instant multilingual output, supporting conversational flows in streaming applications. Standards integration has standardized speech processing for modern networks. WebRTC, defined by IETF and W3C specifications, incorporates audio processing requirements including noise suppression and echo cancellation, along with mandated codec support such as Opus, for low-latency browser-based voice communication. In 5G networks, Ultra-Reliable Low-Latency Communication (URLLC) targets end-to-end latencies of 5-50 ms with up to 99.999% reliability, enabling immersive speech calls for applications like remote collaboration. These standards often reference established methods, such as those in ITU-T recommendations, to ensure interoperability. Interactive Voice Response (IVR) systems in customer service have evolved from early touchtone-based prompts to AI-enhanced platforms incorporating speech recognition and natural language understanding for natural interactions. Early IVR relied on dynamic call routing with pre-recorded announcements, but advancements in the 1990s introduced automatic speech recognition (ASR), allowing voice commands to navigate menus efficiently. By the 2020s, integration of natural language processing and machine learning has enabled conversational IVR, reducing call abandonment rates by handling complex queries without human intervention. Emerging applications include AI-driven speech synthesis for live sports commentary, where generative models analyze game data in real time to produce dynamic, natural-sounding narratives. Systems like those from CAMB.AI use proprietary speech models to translate and synthesize multilingual commentary, enhancing global accessibility for live sporting events. This technology, powered by voice cloning and text-to-speech advancements, delivers broadcast-quality output with emotional nuance, transforming how international audiences experience live sports.
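The adaptive-filtering idea behind acoustic echo cancellation can be sketched with a normalized least-mean-squares (NLMS) filter that learns the loudspeaker-to-microphone echo path and subtracts the predicted echo. The filter length, step size, and simulated echo path below are assumptions for illustration; deployed AEC additionally handles double-talk detection and nonlinear residual echo suppression.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, taps=128, mu=0.5, eps=1e-8):
    """Estimate the echo path from the far-end (loudspeaker) signal and
    subtract the predicted echo from the microphone signal."""
    w = np.zeros(taps)                 # adaptive FIR estimate of the echo path
    out = np.zeros_like(mic)
    buf = np.zeros(taps)               # most recent far-end samples
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_hat = w @ buf             # predicted echo
        e = mic[n] - echo_hat          # error = echo-free (near-end) estimate
        w += mu * e * buf / (buf @ buf + eps)   # NLMS update
        out[n] = e
    return out

# Toy example: microphone picks up a delayed, attenuated copy of the far end
rng = np.random.default_rng(0)
far = rng.standard_normal(8000)
echo = 0.6 * np.concatenate([np.zeros(32), far[:-32]])   # assumed echo path
near = 0.1 * rng.standard_normal(8000)                   # local speech/noise
cleaned = nlms_echo_canceller(far, echo + near)
```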

Medical and Assistive Technologies

Speech processing plays a crucial role in medical diagnostics by enabling objective assessment of speech impairments associated with neurological conditions. In dysarthria evaluation, articulation metrics such as speech rate and rhythm profiles are derived from acoustic analysis of speech samples to quantify motor speech disorders. These metrics, including syllable duration and pause ratios, help speech-language pathologists differentiate dysarthria subtypes from healthy speech with high reliability. For Parkinson's disease detection, voice tremor analysis extracts features like jitter and shimmer from sustained vowels or diadochokinetic tasks, achieving diagnostic accuracies above 90% using machine learning models. Such non-invasive techniques facilitate early identification through remote voice recordings, reducing the need for in-clinic visits. In speech therapy, processing technologies provide real-time feedback to improve articulation in patients with motor speech disorders. Visual-acoustic biofeedback systems display formant trajectories (the resonant frequencies of the vocal tract) on screens during therapy sessions, allowing users to adjust tongue positioning for accurate production. This approach has shown significant gains in speech sound accuracy for residual errors. Augmentative and alternative communication (AAC) devices leverage text-to-speech synthesis with predictive algorithms to support individuals with severe impairments; for instance, the Predictable app uses dynamic word prediction to generate natural-sounding speech from typed input, aiding users with conditions such as motor neuron disease or cerebral palsy. The Constant Therapy app, an FDA-designated breakthrough device since 2020, delivers personalized exercises targeting speech, language, and cognitive deficits, with clinical trials demonstrating measurable improvements in naming and sentence repetition tasks. Prosthetic applications of speech processing restore auditory and vocal functions post-surgery or injury. In cochlear implants, sound-coding strategies decompose incoming audio into spectral bands via fast Fourier transforms, mapping them to electrical pulses that stimulate the auditory nerve to support speech perception in noise. Advanced coding schemes like continuous interleaved sampling (CIS) enhance temporal resolution, improving consonant recognition in quiet environments. For laryngectomized patients, esophageal speech enhancement employs voice conversion techniques, such as Gaussian mixture models, to reduce noise and stabilize the fundamental frequency fluctuations inherent in air-insufflation-based phonation. These methods boost intelligibility by aligning esophageal acoustics to normal voice spectra, with perceptual tests showing enhanced naturalness ratings. Recent advancements integrate AI-driven speech processing into telehealth and remote monitoring. In the 2020s, AI models for aphasia therapy analyze utterance patterns to adapt exercises, promoting generalization of skills in post-stroke patients through gamified apps. Telehealth platforms use voice analysis for remote monitoring, extracting articulation and prosody features to track Parkinson's progression via recordings, enabling timely interventions with 85-90% accuracy in symptom detection. However, the application of voice deepfakes in therapy raises ethical concerns, including consent for synthetic voice replication and risks of deception in patient simulations, potentially undermining trust in clinical outcomes. Speech enhancement techniques for noisy medical settings, such as adaptive filtering, further support these tools by improving signal clarity during remote sessions.
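Jitter and shimmer, the perturbation features mentioned above, can be illustrated with a minimal computation over per-cycle period and amplitude estimates. The sketch below assumes those cycle-level measurements are already available (clinical tools such as Praat derive them with robust pitch tracking) and uses synthetic values purely for demonstration.

```python
import numpy as np

def jitter_local(periods_s):
    """Local jitter: mean absolute difference of consecutive glottal periods
    divided by the mean period."""
    periods = np.asarray(periods_s, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer_local(peak_amplitudes):
    """Local shimmer: mean absolute difference of consecutive cycle peak
    amplitudes divided by the mean amplitude."""
    amps = np.asarray(peak_amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amps))) / np.mean(amps)

# Toy sustained-vowel measurements (assumed): ~8 ms cycles with slight perturbation
rng = np.random.default_rng(0)
periods = 0.008 + 0.00005 * rng.standard_normal(200)   # seconds per glottal cycle
amps = 1.0 + 0.03 * rng.standard_normal(200)           # per-cycle peak amplitude
features = {"jitter": jitter_local(periods), "shimmer": shimmer_local(amps)}
# These features could then feed a classifier for screening, as described above.
```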

References

  1. [1]
    Speech Processing - an overview | ScienceDirect Topics
    Speech processing is the computational analysis and manipulation of spoken language, involving tasks such as speech recognition, speech synthesis, and speaker ...
  2. [2]
    2.2. Speech production and acoustic properties
Signal amplitude or intensity over time is another important characteristic and in its most crude form can be the difference between speech and silence (see ...
  3. [3]
    Speech Acoustics
    ### Summary of Voiced vs. Unvoiced Sounds, Spectral Content, and Basic Acoustics of Speech
  4. [4]
    3.10. Fundamental frequency (F0) - Introduction to Speech Processing
    Typically fundamental frequencies lie roughly in the range 80 to 450 Hz, where males have lower voices than females and children. The F0 of an individual ...
  5. [5]
    Generating and understanding speech - Ecophon
    To be able to understand speech clearly, it is therefore important to have good hearing across the entire range of frequencies from 125 – 8,000 Hz, but ...
  6. [6]
    [PDF] The Lowdown on the Science of Speech Sounds - UT Dallas ...
    ... speech examples. Gunnar Fant and the source-filter theory. The source-filter theory of speech production was the brainchild of Gunnar Fant (1919–2009), a ...
  7. [7]
    What is Signal to Noise Ratio and How to calculate it?
    Jul 17, 2024 · SNR is the ratio of signal power to the noise power, and its unit of expression is typically decibels (dB).
  8. [8]
    Diphone - speech.zone
    Diphones have about the same duration as a phone, but their boundaries are in the centres of phones. They are the units of co-articulation.
  9. [9]
    Practice and experience predict coarticulation in child speech - PMC
    Coarticulation is not simply noise in the speech signal. It conveys important auditory-acoustic information for speakers and listeners alike.
  10. [10]
    [PDF] L3: Organization of speech sounds
• Phonemes, phones, and allophones. • Taxonomies of phoneme classes. • Articulatory phonetics. • Acoustic phonetics. • Speech perception. • Prosody.
  11. [11]
    [PDF] Phonetics - Stanford University
Phones can be described by how they are produced articulatorily by the vocal organs; consonants are defined in terms of their place and manner of articulation ...
  12. [12]
    [PDF] Sounds of Language: Phonetics and Phonology
    Speech sounds are divided into two main types, consonants and vowels. Consonants involve a constriction in the vocal tract, obstructing the flow of air; the ...
  13. [13]
    [PDF] Prosody, Tone, and Intonation - University College London
    Introduction: Prosody refers to all suprasegmental aspects of speech, including pitch, duration, amplitude and voice quality that are used to make lexical ...
  14. [14]
    The 44 Phonemes in English - Academia.edu
    There are approximately 44 unique sounds, also known as phonemes. The 44 sounds help distinguish one word or meaning from another.
  15. [15]
    [PDF] The social life of phonetics and phonology - UC Berkeley Linguistics
    In this article we define and illustrate sociophonetic variation within speech, highlighting both its pervasiveness and also the relatively minor role it ...
  16. [16]
    Introduction to Prosody: A Mini-Tutorial and a Short Course
    Prosody is essential in human interaction, enabling people to show interest, establish rapport, efficiently convey nuances of attitude or intent, and so on.
  17. [17]
    Von Kempelen Builds the First Successful Speech Synthesizer
    "The machine consisted of a bellows that simulated the lungs and was to be operated with the right forearm (uppermost drawing). A counterweight provided for ...
  18. [18]
    Wolfgang von Kempelen
    The machine was able to produce connected speech. He published a detailed description of his device and experience with it in a 1791 volume Mechanismus der ...
  19. [19]
    Sound Control: The Ubiquitous Helmholtz Resonator - audioXpress
    May 31, 2023 · His invention of the Helmholtz resonator, described in his book On the Sensations of Tone (which was first published in German in 1863) grew out ...
  20. [20]
    Hermann von Helmholtz - Sound and Science
In acoustics, he contributed the theory of air velocity in open tubes and the resonance theory of hearing, and invented the Helmholtz resonator, which can be ...
  21. [21]
    Studying Sound: Alexander Graham Bell (1847–1922)
    In 1864 Bell's father, Alexander Melville Bell, had invented visible speech, a symbol-based system to help deaf people learn to speak.
  22. [22]
    Manometric Apparatus | National Museum of American History
Description: In 1862, Rudolph Koenig, an acoustic instrument maker in Paris, devised a manometric apparatus in which the flame of a burning gas jet vibrates ...
  23. [23]
    Rudolph Koenig's Instruments for Studying Vowel Sounds
    Aug 6, 2025 · This article describes the origins of instruments used to study vowel sounds: synthesizers for production, resonators for detection, and ...
  24. [24]
    [PDF] Speech synthesis - Bell System Memorial
    The "Speech Synthesis" experiment is intended to advance the student's understanding of speech production and recognition. The electronic circuit, if assembled ...
  25. [25]
    What Tsutomu Chiba Left Behind - J-Stage
Dec 3, 2016 · In the early 1940's, Tsutomu Chiba and his associate, Masato Kajiyama, published the classic book, The vowel: Its nature and structure ...
  26. [26]
    [PDF] The Replication of Chiba and Kajiyama's Mechanical Models of the ...
Chiba and Kajiyama (1941) was foundational in the establishment of the modern acoustic theory of speech production (Fant, 1960; Stevens, 1998). Chiba ...
  27. [27]
    The Secret Military Origins of the Sound Spectrograph
    Jul 26, 2018 · This meant that the source signal could be compressed before coding in order to disguise speech cadences and then re-expanded after decoding ...
  28. [28]
    [PDF] a short history of acoustic phonetics in the us - Haskins Laboratories
1 Chiba and Kajiyama in Japan had made this point in The vowel - Its nature and structure (1941/1958). But most copies of this book were lost during the war ...
  29. [29]
    Dudley's Channel Vocoder - Stanford CCRMA
    The first major effort to encode speech electronically was Homer Dudley's channel vocoder (``voice coder'') [68] developed starting in October of 1928.
  30. [30]
    [PDF] The Origins of DSP and Compression - Audio Engineering Society
    Dudley's 1928 VOCODER was the first successful electronic speech analyzer and synthesizer. Modern speech and signal processing and compression began with ...
  31. [31]
    Automatic Recognition of Spoken Digits - Semantic Scholar
The recognizer discussed will automatically recognize telephone‐quality digits spoken at normal speech rates by a single individual, with an accuracy ...
  32. [32]
    Audrey, Alexa, Hal, and More - CHM - Computer History Museum
    Jun 9, 2021 · The machine, known as AUDREY—the Automatic Digit Recognizer—can recognize the digits zero to nine, with 90% accuracy, but only if spoken by its ...
  33. [33]
    [PDF] Automatic Speech Recognition – A Brief History of the Technology ...
Oct 8, 2004 · In 1952, Davis, Biddulph, and Balashek of Bell Laboratories built a system for isolated digit recognition for a single speaker [9], using the ...
  34. [34]
    [PDF] The History of Linear Prediction
My story, told next, recollects the events that led to proposing the linear prediction coding (LPC) method, then the multi-pulse LPC and the code-excited LPC.
  35. [35]
    Part I of Linear Predictive Coding and the Internet Protocol
    Mar 1, 2010 · Linear prediction has long played an important role in speech processing, especially in the development during the late 1960s of the first ...
  36. [36]
    [PDF] Hidden Markov Models
    A Hidden Markov Model (HMM) is based on Markov chains, dealing with hidden events like part-of-speech tags, and observed events like words.
  37. [37]
    [PDF] Dynamic programming algorithm optimization for spoken word ...
Abstract - This paper reports on an optimum dynamic programming (DP) based time-normalization algorithm for spoken word recognition. First, a general ...
  38. [38]
    [PDF] A tutorial on hidden Markov models and selected applications in ...
Although initially introduced and studied in the late 1960s and early 1970s, statistical methods of Markov source or hidden Markov modeling have become ...
  39. [39]
    Dragon Systems Introduces Dragon NaturallySpeaking Speech ...
In June 1997 Dragon Systems of Newton, Massachusetts introduced Dragon NaturallySpeaking speech recognition software.
  40. [40]
    Modeling prosodic differences for speaker recognition - ScienceDirect
    In this work, we propose the use of the rate of change of F0 and short-term energy contours to characterize speaker-specific information.
  41. [41]
    [PDF] X-Vectors: Robust DNN Embeddings for Speaker Recognition
    In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to ...
  42. [42]
    [PDF] End-to-End Speech Recognition From the Raw Waveform
State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform ...
  43. [43]
    [PDF] A Tutorial on Hidden Markov Models and Selected Applications in ...
    The basic theory was published in a series of classic papers by Baum and his colleagues [1]-[5] in the late 1960s and early 1970s and was implemented for speech ...
  44. [44]
    A Maximization Technique Occurring in the Statistical Analysis of ...
    February, 1970 A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. Leonard E. Baum, Ted Petrie, ...
  45. [45]
  46. [46]
    Speech Recognition with Deep Recurrent Neural Networks - arXiv
    Mar 22, 2013 · This paper investigates deep recurrent neural networks for speech recognition, achieving a 17.7% error on the TIMIT benchmark.
  47. [47]
    Applying Convolutional Neural Networks concepts to hybrid NN ...
    In this paper, we propose to apply CNN to speech recognition within the framework of hybrid NN-HMM model. ... Ossama Abdel-Hamid; Abdel-rahman Mohamed; Hui Jiang; ...
  48. [48]
    Speech-Transformer: A No-Recurrence Sequence-to ... - IEEE Xplore
    In this paper, we present the Speech-Transformer, a no-recurrence sequence-to-sequence model entirely relies on attention mechanisms to learn the positional ...
  49. [49]
    [PDF] Connectionist Temporal Classification: Labelling Unsegmented ...
    Connectionist Temporal Classification (CTC) uses RNNs to label unsegmented sequences by interpreting outputs as a probability distribution over label sequences ...
  50. [50]
    [1508.01211] Listen, Attend and Spell - arXiv
    Aug 5, 2015 · Abstract:We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters.
  51. [51]
    HuBERT: Self-Supervised Speech Representation Learning ... - arXiv
    Jun 14, 2021 · We propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide ...
  52. [52]
    [PDF] Robust Speech Recognition via Large-Scale Weak Supervision
Sep 1, 2022 · We tested the noise robustness of Whisper models and 14 LibriSpeech-trained models by measuring the WER when either white noise or pub noise ...
  53. [53]
    Investigating the Design Space of Diffusion Models for Speech ...
    Dec 7, 2023 · Abstract:Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature.
  54. [54]
    Group delay functions and its applications in speech technology
    Nov 22, 2011 · Applications of group delay functions for speech processing are discussed in some detail. They include segmentation of speech into syllable ...
  55. [55]
    [PDF] 4 Dynamic Time Warping
    Dynamic time warping (DTW) is a technique to find an optimal alignment between two time-dependent sequences by warping them nonlinearly.
  56. [56]
    Speech processing using group delay functions - ScienceDirect.com
    We propose a technique to extract the vocal tract system component of the group delay function by using the spectral properties of the excitation signal.
  57. [57]
    [PDF] new phase-vocoder techniques for pitch-shifting, harmonizing and
    The phase-vocoder is a well-established tool for the time- scale modification of audio and speech signals. Introduced over 30 years ago [2], the phase vocoder ...
  58. [58]
  59. [59]
    Pitch detection based on zero-phase filtering - ScienceDirect.com
    The algorithm is based on the iterative use of a linear filter with zero phase and monotonically decreasing frequency response (low pass). The results show that ...
  60. [60]
    A Neural Vocoder with Hierarchical Generation of Amplitude and ...
    Jun 23, 2019 · This paper presents a neural vocoder named HiNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra ...
  61. [61]
    Software for a cascade/parallel formant synthesizer
A software formant synthesizer is described that can generate synthetic speech using a laboratory digital computer. A flexible synthesizer configuration ...
  62. [62]
    [PDF] speech synthesis by rule - Haskins Laboratories
    The values for the parameter during a transition are calculated by linear interpolation between the boundary values and the steady-state values. With the ...
  63. [63]
    Diphone speech synthesis - ScienceDirect.com
    Text-to-speech synthesis requires two steps: linguistic processing (to convert text into phonemes and intonation parameters) and simulation of speech ...
  64. [64]
    Degas: a system for rule-based diphone speech synthesis
Diphone segment assembly is a technique for synthesizing a potentially unlimited variety of continuous utterances under computer control.
  65. [65]
    [PDF] THE TILT INTONATION MODEL - ISCA Archive
The tilt intonation model facilitates automatic analysis and synthesis of intonation. The analysis algorithm detects intonational ...
  66. [66]
    [PDF] Decomposition of Pitch Curves in the General Superpositional ...
The core goal of this paper was to describe an algorithm for decomposition of pitch contours into accent curves and phrase curves while making minimal ...
  67. [67]
    (PDF) Speech synthesis systems: Disadvantages and limitations
    Aug 6, 2025 · The aim of this paper is to present the current state of development of speech synthesis systems and to examine their drawbacks and limitations.
  68. [68]
    [PDF] UNIT SELECTION IN A CONCATENATIVE SPEECH SYNTHESIS ...
ABSTRACT. One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database.
  69. [69]
    [PDF] The HMM-based Speech Synthesis System (HTS) Version 2.0
Aug 22, 2007 · This paper described the details of the HMM-based speech synthesis system (HTS) version 2.0. This version includes a number of new ...
  70. [70]
    [PDF] Duration Refinement by Jointly Optimizing State and Longer Unit ...
    We propose a refined duration model which jointly optimizes the likelihoods of state, phone and syllable durations. The joint optimization procedure is ...
  71. [71]
  72. [72]
    [1703.10135] Tacotron: Towards End-to-End Speech Synthesis - arXiv
    Mar 29, 2017 · In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters.
  73. [73]
    Natural TTS Synthesis by Conditioning WaveNet on Mel ... - arXiv
    Dec 16, 2017 · The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, ...
  74. [74]
    [1609.03499] WaveNet: A Generative Model for Raw Audio - arXiv
    Sep 12, 2016 · This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive.
  75. [75]
    [1802.04208] Adversarial Audio Synthesis - arXiv
    Feb 12, 2018 · Abstract:Audio signals are sampled at high temporal resolutions, and learning to synthesize audio requires capturing structure across a ...
  76. [76]
    Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech - arXiv
    May 13, 2021 · In this paper we introduce Grad-TTS, a novel text-to-speech model with score-based decoder producing mel-spectrograms by gradually transforming noise.
  77. [77]
    WaveGlow: A Flow-based Generative Network for Speech Synthesis
    Oct 31, 2018 · WaveGlow is a flow-based network for generating high-quality speech from mel-spectrograms, combining insights from Glow and WaveNet.
  78. [78]
    FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
    Jun 8, 2020 · In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS.
  79. [79]
    [PDF] Square Error Short-Time Spectral Amplitude Estimator - David Malah
    EPHRAIM AND MALAH: SPEECH ENHANCEMENT USING A SPECTRAL AMPLITUDE ESTIMATOR ... “spectral subtraction” estimator. Case III: Using the MMSE amplitude ...
  80. [80]
    [PDF] Speech Enhancement Using a-Minimum Mean-Square Error ...
    This paper derives a minimum mean-square error STSA estimator, based on modeling speech and noise spectral components as statistically independent Gaussian ...
  81. [81]
    [PDF] Joint Optimization of Masks and Deep Recurrent Neural Networks ...
    Denoising-based approaches: These methods utilize deep learning based models to learn the mapping from the mixture signals to one of the sources among the ...
  82. [82]
    Enhanced MVDR Beamforming for Arrays of Directional Microphones
    In this paper we propose an improved MVDR beamformer which takes into account the effect of sensors (e.g. microphones) with arbitrary, potentially directional ...
  83. [83]
    Post-Filtering Techniques | SpringerLink
    In the context of microphone arrays, the term post-filtering denotes the post-processing of the array output by a single-channel noise suppression filter.
  84. [84]
    [PDF] A Categorization of Robust Speech Processing Datasets
Sep 5, 2014 · Each dataset may be used for one or more applications: automatic speech recognition, speaker identification and verification, source ...
  85. [85]
    How Siri got on the iPhone - CNBC
Jun 29, 2017 · Siri launched on the iPhone on Oct. 4, 2011. Jobs died the next day.
  86. [86]
    Alexa at five: Looking back, looking forward - Amazon Science
    With that mission in mind and the Star Trek computer as an inspiration, on November 6, 2014, a small multidisciplinary team launched Amazon Echo, with the ...
  87. [87]
  88. [88]
    [PDF] arXiv:2106.15919v3 [cs.CL] 25 Jul 2022
Jul 25, 2022 · A key component of any spoken dialog system is its spoken language understanding (SLU) system that extracts semantic information ...
  89. [89]
    Towards Preventing Overreliance on Task-Oriented Conversational ...
    Rather than self-correcting, the conversational agent can also confirm the detected errors with the user through a conversation turn. For example in Fig. 1 ...
  90. [90]
    Understanding Success Criterion 1.2.4: Captions (Live) | WAI - W3C
    The intent of this Success Criterion is to enable people who are deaf or hard of hearing to watch real-time presentations.
  91. [91]
    Eye Tracking Drives Innovation and Improves Healthcare - Tobii
    Our eye tracking technology helps the healthcare sector and researchers to develop new and inventive ways to diagnose and detect illnesses and disabilities.
  92. [92]
    Apple unveils powerful accessibility features coming later this year
May 13, 2025 · Eye Tracking users on iPhone and iPad will now have the option to use a switch or dwell to make selections. · With Head Tracking, users will be ...
  93. [93]
    Voice AI agents compared on latency: performance benchmark
    Sep 29, 2025 · In real-world deployments, Telnyx consistently delivers sub-200ms audio round-trip time across standard voice AI workloads, including customer ...
  94. [94]
    [PDF] Privacy Controls for Always-Listening Devices - People @EECS
In all of these form factors, the voice assistant operates by always listening for "wake-words" (such as "Hey Siri" or "Ok Google"), then recording and ...
  95. [95]
    Generative AI in Multimodal User Interfaces: Trends, Challenges ...
    Nov 15, 2024 · The 2020s introduced multimodal interfaces, combining text, voice, and video for richer interaction, exemplified by platforms like ChatGPT ...
  96. [96]
    [PDF] All You Wanted to Know About Acoustic Echo Cancellation
    In order to ensure that the users of VoIP enabled phones have an overall echo-free experience, there are three major aspects that need to be understood. These ...
  97. [97]
    Acoustic Echo Cancellation: All you need to know - EE Times
    1. Direct path between the speaker and microphone, if any · 2. Reflections from the surface where the VoIP phone is kept · 3. Reflections from the walls and other ...
  98. [98]
    What Are VoIP Codecs & How Do They Affect Call Sound Quality?
    Feb 14, 2024 · 2. Wideband codecs · G.722 – An HD voice codec with improved audio quality due to a wider bandwidth of 50 Hz to 7 kHz compared to narrowband ...
  99. [99]
    HD VoIP and HD Voice Codecs - OnSIP
    Wideband audio codecs expand the sound frequencies that narrowband codecs transmit, enabling HD VoIP calls.
  100. [100]
    Automatic Dubbing - AppTek.ai
    AppTek.ai's automatic dubbing uses AI to transcribe, translate, and replicate the source speaker's voice and emotion, using a full speech-to-speech pipeline.
  101. [101]
    AI Audio Translation & Dubbing for Broadcasting - AI-Media
    AI-Media and ElevenLabs offer real-time audio translation and dubbing, including LEXI Voice, which uses ElevenLabs' Text to Speech Turbo model.
  102. [102]
    Adobe demos “photoshop for audio,” lets you edit speech as easily ...
Nov 7, 2016 · Adobe has demonstrated tech that lets you edit recorded speech so that you can alter what that person said or create an entirely new sentence from their voice.
  103. [103]
    #VoCo. Adobe Audio Manipulator Sneak Peak with Jordan Peele
    Nov 4, 2016 · Visit Adobe Creative Cloud for more information: https://www.adobe.com/creativecloud.html #VoCo is an audio manipulator that allows you to ...
  104. [104]
    The cycle of satisfied listeners and profitable publishers - SoundStack
May 22, 2024 · Adaptive bitrate streaming (ABR) solves the problem by enabling streams to adjust automatically based on a listener's bandwidth.
  105. [105]
    All About Adaptive Audio Streaming | Telos Alliance
    May 25, 2016 · Adaptive audio streaming works to deliver the highest bitrate for the currently available bandwidth, switching bitrates as networks conditions change.
  106. [106]
    The History of Google Translate (2004-Today): A Detailed Analysis
    Jul 9, 2024 · Real-time translation – Machine learning has enabled real-time translation capabilities. You can now actively hold a conversation in ...
  107. [107]
    RFC 7874 - WebRTC Audio Codec and Processing Requirements
    This specification will outline the audio processing and codec requirements for WebRTC endpoints.
  108. [108]
    WebRTC: Real-Time Communication in Browsers - W3C
    Mar 13, 2025 · This document defines a set of ECMAScript APIs in WebIDL to allow media and generic application data to be sent to and received from another browser or device.
  109. [109]
    Ultra Reliable and Low Latency Communications - 3GPP
    Jan 2, 2023 · URLLC requires high reliability (e.g., 99.9999%) and low latency (e.g., 50ms) simultaneously, achieved through 5G and edge computing, and is a ...
  110. [110]
    [PDF] Ultra-Reliable Low-Latency Communication - 5G Americas
    A prime example is Ultra-Reliable Low-Latency Communication (URLLC), a set of features designed to support mission-critical applications such as industrial ...
  111. [111]
    History of IVR & Its Evolution Through the Years
Aug 29, 2023 · The history of IVR began in the 1930s with the synthesis of human speech. Find out how it developed and what's in store for IVR technology.
  112. [112]
    The Evolution of IVR Systems - Speech Technology
    Jun 1, 2008 · Over the years, IVR technology has evolved in four major phases: Generation 1: Touchtone input and voice output Systems presented ...
  113. [113]
    Evolution of IVR building techniques: from code writing to AI ... - arXiv
    Nov 16, 2024 · This paper explores the evolution of IVR building techniques, highlighting the industry's revolution and shaping the future of IVR systems.
  114. [114]
    IVR Systems: The Past, Present, and Future - CX Today
    Jan 3, 2024 · The Evolution of the IVR System · Improved efficiency: IVR systems are one of the most common forms of contact center automation. · Enhanced ...
  115. [115]
    AI Commentary in Sports Transformation
    AI commentary in sports works by analyzing real-time game data, converting it into natural-sounding commentary, and delivering it through text-to-speech ...
  116. [116]
    CAMB.AI, a solution for multilingual sports commentary - TM Broadcast
    Apr 25, 2025 · At CAMB.AI, we utilize two proprietary AI models for live sports commentary translation: MARS and BOLI. MARS is our speech model, while BOLI ...
  117. [117]
    Voice Cloning Technology: Enhancing Sports Content Creation
    May 18, 2023 · A form of speech synthesis in sports that powers AI voice generators for sports content, enabling personalized sports commentary, real-time AI ...
  118. [118]
    Generative AI technologies revolutionizing live sports coverage and ...
    Nov 15, 2024 · Generative AI technologies have been developed that automatically add coverage and commentary when watching sporting events.
  119. [119]
    Speech and Nonspeech Parameters in the Clinical Assessment of ...
    Jan 7, 2023 · The articulation rate (parameter RATE) was calculated by dividing the number of spoken syllables by the duration of the speech sample minus ...
  120. [120]
    Quantifying Speech Rhythm Abnormalities in the Dysarthrias - PMC
    Conclusions: This study confirms the ability of rhythm metrics to distinguish control speech from dysarthrias and to discriminate dysarthria subtypes. Rhythm ...
  121. [121]
    Voice analysis in Parkinson's disease - a systematic literature review
    Voice analysis for the diagnosis and prognosis of Parkinson's disease using machine learning techniques can be achieved, with very satisfactory performance ...
  122. [122]
    Explainable artificial intelligence to diagnose early Parkinson's ...
    Apr 5, 2025 · Recent advancements in AI and ML have demonstrated significant potential in diagnosing Parkinson's disease using voice analysis. Various studies ...
  123. [123]
    Tutorial: Using Visual–Acoustic Biofeedback for Speech Sound ...
Jan 9, 2023 · This tutorial summarizes current practices using visual–acoustic biofeedback (VAB) treatment to improve speech outcomes for individuals with speech sound ...
  124. [124]
    Traditional and Visual–Acoustic Biofeedback Treatment via ...
    This study examined telepractice treatment for /ɹ/ using visual-acoustic biofeedback and motor-based therapy, with six of seven participants showing a ...
  125. [125]
    Predictable - Therapy Box
    Giving a voice to people. A text-to-speech app, with smart word prediction, designed for people who have difficulty speaking. AppStore. PlayStore.
  126. [126]
    FDA Grants Constant Therapy Health Breakthrough Device ...
    Apr 14, 2020 · Constant Therapy Health's Speech Therapy (ST) App is a digital therapeutic designed to provide accessible cognitive, speech and language therapy to stroke ...
  127. [127]
    Signal processing & audio processors - PubMed
    Signal processing algorithms are the hidden components in the audio processor that converts the received acoustic signal into electrical impulses.
  128. [128]
    A Hundred Ways to Encode Sound Signals for Cochlear Implants
    May 1, 2025 · The field of cochlear implant coding investigates interdisciplinary approaches to translate acoustic signals into electrical pulses transmitted ...
  129. [129]
    [PDF] Enhancement of esophageal speech using voice conversion ...
This paper presents a novel approach for enhancing esophageal speech using voice conversion techniques. Esophageal speech (ES) is an alternative voice that ...
  130. [130]
  131. [131]
    Effectiveness of AI-Assisted Digital Therapies for Post-Stroke ... - NIH
Sep 18, 2025 · Recent research on AI-assisted aphasia assessment and treatment has provided crucial insights into the mechanisms underlying generalization.
  132. [132]
    Comprehensive real time remote monitoring for Parkinson's disease ...
    Jul 27, 2024 · A comprehensive connected care platform for Parkinson's disease (PD) that delivers validated, quantitative metrics of all motor signs in PD in real time.
  133. [133]
    Promising for patients or deeply disturbing? The ethical and legal ...
Jul 9, 2024 · Can using deepfakes be part of good care? ... In the following, we consider how deepfake therapy relates to principles of good care, in relation ...
  134. [134]
    Enhancing speech perception in challenging acoustic scenarios for ...
    Sep 5, 2024 · This clinical study investigated the impact of the Naída M hearing system, a novel cochlear implant sound processor and corresponding hearing aid.