
Speech coding

Speech coding is the process of converting analog speech signals into a compact digital representation for efficient storage, transmission, or processing, while preserving the perceptual quality, intelligibility, and naturalness of the original voice. This involves techniques that exploit the redundancies in speech signals, such as correlations between samples and perceptual irrelevancies, to reduce the bit rate without significantly degrading the listener's experience. Most speech coders are lossy, producing synthesized speech that is perceptually similar but not identical to the input, and they typically operate on short frames of audio data.

The primary goals of speech coding are to minimize the bit rate for bandwidth conservation, lower computational complexity and delay for real-time applications, and maximize subjective quality metrics like mean opinion score (MOS). These objectives balance trade-offs inherent in speech, which is non-stationary and generated by a source-filter model involving the vocal tract and excitation. Speech coding enables applications in resource-constrained environments, such as digital cellular networks where bit rates must be reduced for efficient spectrum use.

Key techniques in speech coding include waveform coders, which preserve the exact shape of the speech waveform at higher bit rates (16–64 kbps), such as pulse code modulation (PCM) and adaptive differential PCM (ADPCM). Source coders, like linear predictive coding (LPC), model the speech production process to achieve lower bit rates (2–16 kbps) by parameterizing spectral envelopes and excitation. Hybrid approaches, such as code-excited linear prediction (CELP) and subband coding, combine elements of both for improved quality at rates around 4–12 kbps, often incorporating features like post-filtering and variable-rate adaptation.

Standards for speech coding, developed by organizations like the International Telecommunication Union (ITU) and the European Telecommunications Standards Institute (ETSI), ensure interoperability across systems. Notable examples include ITU-T G.711 for toll-quality PCM at 64 kbps, G.729 for CELP-based coding at 8 kbps in VoIP, and ETSI's Adaptive Multi-Rate (AMR) codec for mobile networks at 4.75–12.2 kbps. These standards support diverse applications, from public switched telephone networks (PSTN) and satellite telephony to voice over IP (VoIP) and wideband audio in videoconferencing.

Fundamentals

Definition and Objectives

Speech coding is the process of converting an analog speech signal into a compact digital representation suitable for efficient storage or transmission over band-limited channels, thereby minimizing the required bandwidth while preserving essential speech information. This involves exploiting the inherent redundancies in the speech signal to reduce data volume without significant loss of perceptual fidelity.

The primary objectives of speech coding include achieving substantial bit rate reduction—from the standard 64 kbps of pulse code modulation (PCM) used in digital telephony to lower rates such as 2–8 kbps—while maintaining high perceptual quality, intelligibility, and naturalness of the decoded speech. Additional goals encompass low computational complexity to enable real-time processing on resource-constrained devices, minimal algorithmic delay for interactive applications, and robustness against transmission errors like bit flips in noisy channels. These objectives often require trade-offs, as lower bit rates typically demand more sophisticated algorithms that balance quality and efficiency.

Unlike general audio coding, which treats sound signals agnostically, speech coding specifically leverages the source-filter model of human speech production to exploit redundancies such as pitch periodicity in voiced segments and formant structures that define vowel resonances. This speech-specific approach allows for greater compression efficiency tailored to human vocal tract characteristics, rather than to broad-spectrum music or noise.

The motivation for speech coding arose in the 20th century from the bandwidth limitations of early telephony systems, which restricted transmission to narrowband signals in the 300–3400 Hz range to conserve scarce spectrum resources in analog and emerging digital networks.

Speech Signal Characteristics

Human speech signals exhibit a quasi-periodic nature in voiced segments, characterized by a fundamental frequency typically ranging from 100 to 300 Hz (corresponding to pitch periods of approximately 3.3 to 10 ms), which is produced by the vibrations of the vocal cords during phonation. These segments are modulated by formants, which are resonant frequencies of the vocal tract primarily concentrated between approximately 500 and 3000 Hz, shaping the spectral envelope that defines vowel quality and phonetic content. In contrast, unvoiced segments, such as fricatives, appear noise-like due to turbulent airflow without periodic vibration, lacking distinct pitch structure but contributing broadband energy to the signal.

In the time domain, speech signals display significant amplitude variations driven by the intensity of phonemes and prosodic features like stress and intonation, with rapid onsets and offsets marking transitional sounds such as plosives and diphthongs. Silence gaps, often lasting tens to hundreds of milliseconds between utterances or words, represent periods of low energy that facilitate natural pauses and background noise separation in processing.

The frequency-domain representation of speech reveals a concentration of energy in lower bands, particularly within the 300–3400 Hz range for narrowband applications like telephony, where higher frequencies contribute less to intelligibility. This spectral tilt, with most power below 4 kHz, underscores the signal's suitability for bandlimited transmission while preserving perceptual quality.

Statistically, speech is non-stationary, with rapid spectral changes necessitating short-time analysis in frames of 10–30 ms to approximate stationarity for modeling. This non-stationarity arises from the underlying redundancy in the source-filter model, where the glottal source provides excitation—periodic pulses for voiced speech or noise for unvoiced—and the vocal tract acts as a linear time-varying filter that imparts the spectral envelope.
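The short-time analysis described above can be illustrated with a small sketch. The following Python fragment is illustrative only: the frame and hop lengths and the energy/zero-crossing heuristic are conventional choices rather than part of any standard. It frames a signal into 25 ms windows and computes two classic per-frame features often used to separate voiced from unvoiced segments:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25.0, hop_ms=10.0):
    """Split a speech signal into overlapping short-time frames."""
    frame_len = int(fs * frame_ms / 1000)   # e.g. 200 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def frame_features(frames):
    """Short-time energy and zero-crossing rate per frame.
    Voiced frames tend to show high energy and low ZCR; unvoiced the reverse."""
    energy = np.sum(frames ** 2, axis=1)
    zcr = np.mean(np.diff(np.sign(frames), axis=1) != 0, axis=1)
    return energy, zcr

# Example: 1 s of a 150 Hz tone as a stand-in for a voiced segment at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 150 * t)
energy, zcr = frame_features(frame_signal(x, fs))
```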

Historical Development

Early Analog Techniques

Early analog techniques for speech coding emerged in the mid-20th century to address the challenges of transmitting speech signals over limited-bandwidth telephony channels, primarily through methods that compressed the dynamic range without digitization. Companding, combining signal compression at the transmitter and expansion at the receiver, was introduced in the 1920s and gained prominence in the 1930s and 1940s to mitigate distortion in analog transmission systems by applying nonlinear amplification that allocated greater resolution to lower-amplitude sounds via logarithmic curves. This approach effectively extended the usable dynamic range of telephony links, which were constrained by noise and fading in long-distance copper wire or radio circuits. These techniques were essential for early frequency-division multiplexed (FDM) telephony, where multiple voice channels shared bandwidth, but they operated entirely in the analog domain using vacuum-tube circuits for compression and expansion. For instance, analog companding emphasized quieter speech components, reducing perceived distortion in conversational audio over noisy lines.

A pioneering parametric analog method was the channel vocoder, invented by Homer Dudley at Bell Laboratories beginning in October 1928 and publicly demonstrated in 1936. This device analyzed speech by passing the signal through a bank of 10 contiguous bandpass filters (covering 250–3000 Hz) to extract spectral envelopes, while separately detecting pitch frequency and voicing (voiced or unvoiced). The parameters—envelope amplitudes, pitch, and voicing flags—were transmitted at low rates and used at the receiver to modulate a synthetic source (buzz for voiced sounds, hiss for unvoiced) through matching filters, reconstructing intelligible speech with significantly reduced bandwidth requirements, equivalent to about 2.4 kbps in modern terms. Dudley's design, detailed in his 1936 publication, marked the first practical electronic encoding of speech based on a source-filter model.

Despite their innovations, these early analog techniques had notable limitations, including high susceptibility to channel noise and interference, which degraded the reconstructed speech quality, particularly in the parametric vocoder where errors in envelope or pitch extraction caused unnatural artifacts. Additionally, the lack of digital processing prevented easy storage or manipulation of encoded signals, and systems were confined to fixed bandwidths around 4 kHz to match telephone standards, limiting versatility. These constraints spurred post-World War II advancements toward fully digital methods.

Digital Coding Evolution

Early efforts in digital speech coding began during World War II with systems like SIGSALY, developed by Bell Laboratories in 1943 for secure voice communications. SIGSALY digitized parameters from an 8-channel vocoder using pulse-code modulation, transmitting speech at around 12 kbps over high-frequency radio links, marking the first practical digital speech transmission system.

The digital evolution of speech coding saw widespread adoption in the 1960s with Pulse Code Modulation (PCM) in telephony systems. PCM digitizes analog speech by sampling at the Nyquist rate of 8 kHz—sufficient for the typical 4 kHz bandwidth of telephone speech—and applying 8-bit logarithmic quantization to each sample, yielding a constant bit rate of 64 kbps. This approach, formalized in the ITU-T G.711 recommendation, enabled reliable digital transmission over networks like the T1 carrier system introduced in 1962, marking the transition from analog to digital voice processing. Specific companding standards like μ-law, developed at Bell Laboratories for North American systems, and A-law, adopted in Europe, utilized piecewise linear approximations of logarithmic functions to compress 12–13 bits of linear dynamic range into 8 bits, improving signal-to-noise ratios in digital multiplexing setups.

The 1970s brought refinements focused on prediction to achieve bit rate reductions without significant quality loss. Differential PCM (DPCM) emerged as a key development, encoding the difference between the actual sample and a predicted value based on prior samples, which exploited the correlation in speech signals to lower redundancy. Concurrently, Linear Predictive Coding (LPC), first conceptualized in the late 1960s by researchers like Fumitada Itakura at NTT and Bishnu Atal at Bell Labs, underwent digital implementation and refinement during this decade; LPC modeled speech production through an all-pole filter driven by excitation, enabling parametric representation at rates below 64 kbps. These techniques laid the groundwork for more sophisticated compression by emphasizing speech's predictable structure over raw waveform fidelity.

By the 1980s and 1990s, the incorporation of vector quantization (VQ)—a method introduced by Allen Gersho, Robert M. Gray, and colleagues in seminal work around 1980—allowed joint quantization of parameter vectors, further optimizing efficiency for multidimensional speech features like LPC coefficients. This innovation facilitated international standards tailored for bandwidth-constrained applications, including ITU-T G.721, which specified Adaptive DPCM (ADPCM) at 32 kbps for improved quality over PCM at half the rate, approved in 1984. Subsequent advancements produced ITU-T G.728, employing Low-Delay Code-Excited Linear Prediction (LD-CELP) at 16 kbps with a minimal 0.625 ms algorithmic delay, standardized in 1992 to support real-time communications. In mobile telephony, the GSM full-rate codec, based on Regular Pulse Excitation with Long-Term Prediction (RPE-LTP) at 13 kbps, was ratified by ETSI in 1990, enabling efficient voice over digital cellular networks. These standards were driven primarily by the demands of satellite communications, where transponder bandwidth was scarce, and emerging mobile systems like GSM, which required robust low-bit-rate coding to handle multipath fading and limited spectrum allocation.

Classification of Speech Coders

Waveform Coders

Waveform coders represent a class of speech coding techniques that directly quantize samples of the speech waveform in the time domain, aiming to preserve the shape of the original signal through uniform or non-uniform quantization without incorporating models of speech production. These methods exploit redundancies in the waveform, such as short-term correlations between adjacent samples, to reduce the bit rate while minimizing quantization error between the input and reconstructed signals. The speech signal's short-term correlation and quasi-stationarity over individual frames enable effective prediction in differential variants, enhancing efficiency.

A foundational example is pulse code modulation (PCM), which uniformly samples the speech signal at a rate like 8 kHz and applies non-uniform 8-bit logarithmic quantization (e.g., μ-law) to handle the signal's wide dynamic range. The bit rate for PCM is given by R = f_s \times b, where f_s is the sampling frequency and b is the number of bits per sample; for standard telephony speech, this yields 64 kbps. Adaptive differential PCM (ADPCM) improves upon PCM by predicting the next sample from previous ones and quantizing the prediction error, often using backward adaptation for the predictor and quantizer; a common implementation operates at 32 kbps with 4 bits per sample. Delta modulation (DM), a simpler differential approach, quantizes the first-order difference between consecutive samples using just 1 bit, typically at oversampled rates of 16–32 kHz for speech, achieving bit rates around 16–32 kbps in continuously variable slope variants (CVSD).

Waveform coders offer advantages including low algorithmic delay (often under 1 ms), simplicity in implementation with modest computational requirements, and robustness to transmission errors since they do not rely on fragile parametric models. They also provide high fidelity for both speech and non-speech signals, making them versatile for general audio applications. However, their efficiency is limited by the need for higher bit rates—typically 16–64 kbps—to maintain perceptual quality, as they do not exploit speech-specific redundancies like vocal tract modeling, leading to larger data requirements compared to parametric approaches.
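As a concrete illustration of the bit rate relation R = f_s \times b at its simplest (b = 1), the sketch below implements plain delta modulation with a fixed step size. This is a toy version: real CVSD coders adapt the step size to the signal slope, which is omitted here.

```python
import numpy as np

def delta_modulate(x, step=0.05):
    """1-bit delta modulation: transmit only the sign of the difference
    between each input sample and the decoder's running estimate."""
    bits = np.zeros(len(x), dtype=np.int8)
    est = 0.0
    for n, sample in enumerate(x):
        bits[n] = 1 if sample >= est else 0
        est += step if bits[n] else -step   # decoder tracks the same estimate
    return bits

def delta_demodulate(bits, step=0.05):
    """Reconstruct the waveform by integrating the transmitted +/- steps."""
    return np.cumsum(np.where(bits == 1, step, -step))

# Oversampled 16 kHz input; the resulting bit rate is f_s * 1 = 16 kbps
fs = 16000
t = np.arange(fs // 10) / fs
x = 0.5 * np.sin(2 * np.pi * 200 * t)
y = delta_demodulate(delta_modulate(x))
```

A fixed step size exhibits the classic DM failure modes: slope overload on fast transients and granular noise on near-constant segments, which is precisely what CVSD's slope adaptation addresses.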

Source and Hybrid Coders

Source coders represent a class of parametric speech coding techniques that exploit the source-filter model of speech production, treating the vocal tract as an all-pole linear filter excited by either quasi-periodic pulses for voiced segments or random noise for unvoiced segments. This approach encodes only the filter parameters—typically 10–12 coefficients per frame—along with excitation details such as pitch period, voicing decisions, and gain, enabling significant data reduction by focusing on physiological rather than raw signal characteristics. Linear Predictive Coding (LPC), introduced in the early 1970s, exemplifies this method, where the all-pole filter approximates the spectral envelope derived from short-time autocorrelation analysis of the speech signal.

Parameter estimation in source coders varies: formant-based methods explicitly track resonant frequencies and bandwidths of the vocal tract (typically 3–5 formants), transmitting these directly for synthesis via parallel resonators, as in early formant vocoders operating below 1.2 kbps. In contrast, LPC implicitly captures formants through the roots of its predictor polynomial, offering robustness to noise but assuming a minimum-phase all-pole structure that may not fully model nasal or fricative sounds. Cepstral analysis provides an alternative parameterization by inverse-transforming the log spectral envelope into the quefrency domain, where low-order cepstral coefficients represent the smooth vocal tract response, though it is less common in traditional source coders due to higher sensitivity to spectral fine structure. Operating at bit rates of 1–4.8 kbps, source coders trade perceptual naturalness for compression efficiency, often resulting in synthetic-sounding output with buzziness in transitions or muffled unvoiced speech.

Hybrid coders bridge source and waveform paradigms by retaining the LPC filter model for spectral shaping while adopting a codebook-based search for excitation sequences that closely match the waveform residual, thereby enhancing perceptual quality at low rates. Code-Excited Linear Prediction (CELP), proposed in 1985, is the foundational hybrid technique, using stochastic codebooks of innovation vectors filtered through adaptive and fixed predictors to minimize synthesis error. Variants such as Algebraic CELP (ACELP) employ structured codebooks with fixed pulses for reduced complexity, while Mixed Excitation Linear Prediction (MELP) incorporates multiband excitation for better handling of mixed voicing. These hybrids dominate modern speech coding standards, including the Adaptive Multi-Rate (AMR) codec for GSM at 4.75–12.2 kbps and the Enhanced Variable Rate Codec (EVRC) for CDMA, due to their balance of intelligibility, low delay, and robustness in noisy channels. At 4–8 kbps, hybrid coders achieve near-toll quality by preserving transient details absent in pure source models, though at the expense of increased computational demands for codebook searches.
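The decode side of a source coder follows directly from the source-filter model described above: drive an all-pole filter with a pulse train (voiced) or noise (unvoiced). A minimal sketch, where the filter coefficients, gain, and pitch period are illustrative values rather than parameters from any particular standard:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc_coeffs, gain, voiced, pitch_period, frame_len):
    """Drive an all-pole LPC synthesis filter 1/A(z) with either a pulse
    train (voiced) or white noise (unvoiced), per the source-filter model."""
    if voiced:
        excitation = np.zeros(frame_len)
        excitation[::pitch_period] = 1.0      # quasi-periodic glottal pulses
    else:
        excitation = np.random.randn(frame_len)
    # A(z) = 1 - sum_k a_k z^-k, so the filter denominator is [1, -a_1, ..., -a_p]
    a = np.concatenate(([1.0], -np.asarray(lpc_coeffs)))
    return lfilter([gain], a, excitation)

# Hypothetical stable 2nd-order filter (resonance near 400 Hz at fs = 8 kHz);
# pitch_period = 64 samples corresponds to a 125 Hz fundamental
frame = synthesize_frame(lpc_coeffs=[1.8, -0.9], gain=0.1,
                         voiced=True, pitch_period=64, frame_len=160)
```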

Key Techniques

Pulse Code Modulation (PCM)

Pulse Code Modulation (PCM) is a foundational digital technique for representing analog speech signals through uniform sampling and quantization, forming the basis of early digital telephony standards. It operates by directly digitizing the waveform without relying on speech production models, making it robust but bandwidth-intensive compared to later parametric methods. Developed in the mid-20th century, PCM achieved widespread adoption due to its simplicity and reliable reproduction of speech quality at standard rates.

The PCM process begins with sampling the continuous-time speech signal at a rate sufficient to capture its frequency content; for narrowband telephony speech limited to 4 kHz bandwidth, an 8 kHz sampling rate is used to satisfy the Nyquist criterion. This produces discrete-time pulse amplitude modulated (PAM) samples. These samples then undergo quantization, mapping continuous amplitudes to a finite set of discrete levels—either via uniform (linear) steps for higher precision or non-uniform logarithmic companding such as μ-law (North America/Japan) or A-law (Europe/international) to optimize for 8-bit encoding. Finally, the quantized values are encoded into binary pulse code for transmission or storage, typically using natural binary or Gray coding to minimize errors.

The μ-law and A-law companding functions apply logarithmic compression to expand the effective dynamic range, achieving the equivalent of a 14-bit linear quantizer within 8 bits by using finer steps for small amplitudes and coarser steps for large ones. This reduces quantization noise in quiet speech segments, where human hearing is most sensitive, improving overall perceptual quality for signals with 40–50 dB dynamic range typical in telephony. The μ-law curve is defined as F(x) = \sgn(x) \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}, with \mu = 255 standard, followed by uniform quantization and binary coding.

At 8 kHz sampling and 8 bits per sample, PCM delivers a constant bit rate of 64 kbps, supporting toll-quality speech with a signal-to-quantization-noise ratio (SQNR) around 35–40 dB for typical voice levels. The mean-square quantization noise power for uniform quantization is approximated by \sigma_q^2 \approx \frac{\Delta^2}{12}, where \Delta is the step size, assuming the error is uniformly distributed over [-\Delta/2, \Delta/2]. This noise becomes more prominent at lower bit rates, limiting PCM's efficiency without modifications.

PCM remains the baseline codec for the Public Switched Telephone Network (PSTN) and Integrated Services Digital Network (ISDN), ensuring compatibility in digital trunks and switches worldwide. Its limitations at rates below 64 kbps—manifesting as audible distortion from elevated quantization noise—necessitated subsequent developments for bandwidth-constrained applications.
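A brief sketch of the companding pipeline described above, applying the μ-law formula directly to normalized samples. Note this is a simplified continuous-curve version: deployed G.711 hardware instead uses an 8-segment piecewise-linear approximation of the same curve.

```python
import numpy as np

MU = 255.0  # standard North American mu-law constant

def mu_law_compress(x):
    """F(x) = sgn(x) * ln(1 + mu*|x|) / ln(1 + mu), for x in [-1, 1]."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Inverse companding applied at the receiver."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

def encode_pcm(x, bits=8):
    """Compand, then uniformly quantize the companded value to 2^bits levels."""
    levels = 2 ** bits
    y = mu_law_compress(np.clip(x, -1.0, 1.0))
    return np.round((y + 1.0) / 2.0 * (levels - 1)).astype(np.int32)

def decode_pcm(codes, bits=8):
    """Map codes back to [-1, 1] and expand through the inverse curve."""
    levels = 2 ** bits
    return mu_law_expand(codes / (levels - 1) * 2.0 - 1.0)

# At fs = 8000 Hz and bits = 8, the bit rate is 8000 * 8 = 64 kbps
x = 0.3 * np.sin(2 * np.pi * 300 * np.arange(800) / 8000)
x_hat = decode_pcm(encode_pcm(x))
```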

Linear Predictive Coding (LPC)

Linear Predictive Coding (LPC) is a parametric speech coding method that represents the speech signal as the output of a linear time-invariant all-pole filter driven by an excitation signal, enabling efficient compression by estimating the vocal tract filter characteristics from short speech frames. This approach assumes that each speech sample can be approximated as a linear combination of previous samples, capturing the spectral envelope of the signal while discarding fine waveform details. Developed in the 1970s, LPC forms the foundation of many source-filter model-based vocoders, achieving low bit rates suitable for bandwidth-constrained applications like telephony and secure communications.

The core LPC model is given by the difference equation: s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n), where s(n) is the speech sample at time n, a_k are the predictor coefficients defining the all-pole filter, p is the model order, G is the gain factor scaling the excitation, and u(n) represents the excitation signal—typically quasi-periodic pulses for voiced speech or noise for unvoiced speech. The predictor coefficients are derived by minimizing the mean-squared prediction error over a short frame (typically 20–30 ms), using the autocorrelation method to solve the Yule-Walker equations. For speech sampled at 8 kHz, the model order p is commonly set to 10–12 to adequately represent the first few formants without overfitting noise. The Levinson-Durbin recursion efficiently computes these coefficients in O(p^2) time, leveraging the Toeplitz structure of the autocorrelation matrix for numerical stability.

To transmit the LPC parameters, the predictor coefficients must be quantized efficiently while ensuring filter stability. Line spectral pairs (LSPs), introduced as an alternative representation, map the coefficients to frequencies on the unit circle, guaranteeing stability if the LSPs are ordered and lie within [0, \pi]. LSPs exhibit desirable interpolation properties, allowing smooth transitions between frames with minimal spectral distortion, and are quantized using vector quantization (VQ) techniques to achieve bit rates of 2–4 kbps for the spectral parameters alone when combined with pitch and gain encoding. For example, the U.S. Department of Defense's FS1015 LPC-10 standard operates at 2.4 kbps using scalar quantization of 10 reflection coefficients (PARCOR), along with voicing, pitch, and gain parameters, for a total of 54 bits per 22.5 ms frame.

LPC originated in the early 1970s through work at Bell Laboratories, where adaptive predictive coding was proposed to exploit speech predictability for differential encoding, reducing redundancy in transmission. It became widely adopted in vocoders for its ability to synthesize intelligible speech at low rates, though early implementations suffered from perceptual artifacts such as buzziness during voiced segments at rates below 4 kbps, due to simplistic excitation modeling and sensitivity to parameter errors. These issues prompted refinements like LSP quantization and improved excitation strategies, but core LPC remains a benchmark for parametric coding efficiency.
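The autocorrelation method and Levinson-Durbin recursion can be sketched compactly. The function below is a straightforward textbook implementation (the variable names and example frame are illustrative), returning coefficients in the a_k convention of the difference equation above:

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate predictor coefficients a_k by the autocorrelation method,
    solving the Yule-Walker equations with the Levinson-Durbin recursion."""
    n = len(frame)
    # Autocorrelation lags r[0..order]
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order)              # predictor coefficients a_1..a_p
    err = r[0]                       # prediction-error power E_0
    for i in range(order):
        # Reflection (PARCOR) coefficient for stage i+1
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[i - 1::-1]   # update lower-order coefficients
        a = a_new
        err *= (1.0 - k * k)          # residual error shrinks each stage
    return a, err                     # err relates to the squared gain G^2

# Usage on a windowed 30 ms frame (240 samples at 8 kHz)
frame = np.hamming(240) * np.random.randn(240)
a, err = lpc_coefficients(frame, order=10)
```

The O(p^2) cost mentioned above comes directly from this structure: each of the p stages performs O(p) work, exploiting the Toeplitz autocorrelation matrix instead of a general O(p^3) solve.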

Code-Excited Linear Prediction (CELP)

Code-Excited Linear Prediction (CELP) represents a significant advancement in hybrid speech coding, extending linear predictive coding (LPC) by incorporating stochastic excitation modeling through codebook searches to produce natural-sounding synthesized speech at low bit rates. Originally proposed in 1985, CELP employs an analysis-by-synthesis framework where the vocal tract is modeled by an LPC synthesis filter, and the excitation signal is generated by combining contributions from an adaptive codebook—capturing periodic pitch components—and a fixed codebook—providing stochastic innovation to the signal. This dual-codebook structure allows for efficient representation of both voiced and unvoiced speech characteristics, enabling bit rates as low as 4.8 kbps while preserving perceptual quality.

The core optimization in CELP occurs through a closed-loop search process that minimizes the perceptual error between the input speech frame and the synthesized output. Specifically, the algorithm selects codebook entries by minimizing the weighted squared error E = \| x - \hat{s} \|_W^2, where x is the preprocessed input signal, \hat{s} is the synthesized signal, and W denotes a perceptual weighting filter that emphasizes formant regions to align with human auditory perception. The adaptive codebook typically uses past synthesized speech segments delayed by the pitch period, while the fixed codebook consists of predefined stochastic vectors, with gains quantized to further compress the representation. This iterative search ensures robust excitation matching but contributes to the method's computational demands.

A key variant, Algebraic Code-Excited Linear Prediction (ACELP), optimizes the fixed codebook by representing it as a sparse set of algebraic pulses with fixed positions and signs, reducing storage and search complexity without sacrificing quality. ACELP gained prominence in the ITU-T G.729 standard, ratified in 1996, which operates at 8 kbps using a conjugate-structure codebook for efficient indexing. CELP-based coders, including ACELP implementations, deliver near-toll-quality speech at 4–8 kbps, bridging the gap between higher-rate waveform coders and lower-rate parametric methods.

Despite its effectiveness, CELP exhibits high computational complexity, often requiring several million arithmetic operations per frame for exhaustive codebook searches and error minimization, though post-2000 optimizations have mitigated this for real-time applications. Typical algorithmic delays range from 5 to 20 ms, influenced by frame lengths of 10–30 ms and look-ahead processing for pitch analysis.
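A toy version of the closed-loop fixed-codebook search conveys the analysis-by-synthesis idea. This sketch deliberately omits the adaptive (pitch) codebook and zero-input-response handling of real CELP coders, and uses a placeholder first-order weighting filter rather than the usual A(z/\gamma_1)/A(z/\gamma_2) form; all coefficients and sizes are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def best_codevector(target, codebook, a_lpc, weight_b, weight_a):
    """For each candidate excitation vector, synthesize through 1/A(z),
    apply the perceptual weighting filter, and keep the entry (with its
    optimal gain) minimizing the weighted squared error to the target."""
    a = np.concatenate(([1.0], -np.asarray(a_lpc)))   # denominator of 1/A(z)
    tw = lfilter(weight_b, weight_a, target)          # weighted target
    best = (None, 0.0, np.inf)
    for idx, c in enumerate(codebook):
        synth = lfilter([1.0], a, c)                  # LPC synthesis filter
        sw = lfilter(weight_b, weight_a, synth)       # weighting filter W(z)
        gain = np.dot(tw, sw) / (np.dot(sw, sw) + 1e-12)  # closed-form gain
        err = np.sum((tw - gain * sw) ** 2)
        if err < best[2]:
            best = (idx, gain, err)
    return best  # (codebook index, gain, weighted error)

# Hypothetical 40-sample subframe searched against 128 stochastic vectors
rng = np.random.default_rng(0)
codebook = rng.standard_normal((128, 40))
target = rng.standard_normal(40)
idx, gain, err = best_codevector(target, codebook, a_lpc=[1.8, -0.9],
                                 weight_b=[1.0, -0.9], weight_a=[1.0, -0.6])
```

The exhaustive loop over all candidates is exactly the cost driver noted above; ACELP's sparse algebraic pulse structure exists to make this inner loop cheap.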

Modern and Neural Methods

Neural Network-Based Codecs

Neural network-based speech codecs have emerged since the mid-2010s as a data-driven alternative to classical methods, employing deep learning architectures to learn compact representations directly from raw audio signals, thereby addressing limitations in modeling complex speech variations. Early advancements leveraged autoencoders for end-to-end compression and generative adversarial networks (GANs) to enhance perceptual quality at low bitrates. For instance, the WaveNet vocoder, introduced by DeepMind in 2016, pioneered autoregressive neural generation of raw audio waveforms, achieving natural-sounding speech synthesis that outperformed traditional parametric systems in mean opinion score tests. Subsequent works, such as the 2018 end-to-end optimized speech coding framework using deep neural networks, demonstrated scalable compression for wideband speech by jointly optimizing encoding and decoding processes.

Key techniques in these codecs revolve around neural waveform processing, where encoder-decoder networks compress audio into latent codes, often quantized for transmission, and a decoder reconstructs the signal with high fidelity. A prominent example is SoundStream, developed by Google in 2021, which integrates convolutional encoders, residual vector quantization (RVQ) with multiple codebook layers, and GAN-based discriminators for training. This approach enables variable bitrates from 3 to 18 kbps while maintaining streamable, low-latency operation suitable for real-time applications on mobile devices. At 3 kbps, SoundStream surpasses the quality of the Opus codec at 12 kbps and approaches the enhanced voice services (EVS) codec at 9.6 kbps in subjective MUSHRA evaluations for 24 kHz audio, including speech and music. Similarly, the Neural End-to-End Speech Codec (NESC) from 2022 employs a dual-path convolutional recurrent network encoder paired with a GAN-based decoder like Streamwise-StyleMelGAN, achieving robust wideband speech coding at 3 kbps even under noisy conditions.

Recent advancements emphasize efficiency and adaptability in resource-constrained settings, as highlighted by the 2025 Low-Resource Audio Codec (LRAC) Challenge, which fosters neural and hybrid codecs optimized for edge devices with ultralow bitrates, low latency, and integrated enhancement. These systems demonstrate superior robustness to non-speech elements, such as background noise and reverberation, compared to traditional CELP-based coders, filling gaps in real-world deployment scenarios through objective metrics and crowdsourced evaluations. Emerging research post-LRAC explores diffusion-based neural codecs for further quality gains at extreme low bitrates.

Despite these gains, neural network-based codecs face challenges including the need for large-scale training datasets to generalize across speakers and environments, which can limit accessibility for low-resource development. Real-time latency remains a concern due to computational demands, though techniques like structured dropout in quantization layers and efficient architectures mitigate this for applications under 250 ms delay thresholds. Ongoing research prioritizes lightweight models to balance quality and deployability without sacrificing perceptual superiority.
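Residual vector quantization, the discretization scheme at the heart of SoundStream-style codecs, is easy to sketch. In the fragment below the codebooks are random stand-ins for learned ones and the stage/codebook sizes are illustrative; note how truncating the list of stages directly lowers the bitrate, which is the behavior that training with structured dropout exploits.

```python
import numpy as np

def rvq_encode(latent, codebooks):
    """Residual VQ: each stage quantizes the residual left by the previous
    stages, so the bitrate scales with the number of stages actually used."""
    residual = latent.copy()
    indices = []
    for cb in codebooks:                      # cb has shape (num_codes, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        i = int(np.argmin(dists))             # nearest codeword this stage
        indices.append(i)
        residual = residual - cb[i]           # pass the residual onward
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected codewords across stages."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Hypothetical: 8 stages of 1024 codes over a 64-dim latent per frame gives
# 8 * 10 = 80 bits per frame; sending only the first k stages lowers the rate
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((1024, 64)) * 0.5 ** k for k in range(8)]
latent = rng.standard_normal(64)
approx = rvq_decode(rvq_encode(latent, codebooks), codebooks)
```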

Recent Standards

The Adaptive Multi-Rate Wideband (AMR-WB) codec, standardized as ITU-T G.722.2 in 2003, received post-2010 updates including a fixed-point C-code implementation in Annex C (December 2017) and a corrigendum for corrections in August 2018, enhancing its implementability for wideband speech coding at around 16 kbit/s.

A significant advancement came with the Enhanced Voice Services (EVS) codec, standardized by 3GPP in Release 12 (2014), providing backward compatibility with AMR-WB (ITU-T G.722.2) for wideband modes. EVS operates as a hybrid codec supporting bitrates from 5.9 to 128 kbit/s, integrating switched linear predictive coding for speech with transform coding for general audio to achieve superwideband quality up to 20 kHz. It incorporates channel-aware coding for improved error resilience in packet loss scenarios, outperforming AMR-WB in narrowband and wideband conditions under network impairments.

In 2023, the IEEE 1857.3 standard introduced a Real-Time Communication (RTC) speech codec within its advanced audio/video coding framework, targeting low-bitrate operation below 8 kbit/s for applications like video conferencing and VoIP. This codec supports wideband (16 kHz) and superwideband (32 kHz) speech with built-in error resilience and stereo capabilities, enabling neural-capable enhancements for efficient real-time processing.

EVS was further integrated into 5G networks via 3GPP Release 15 (2018) and subsequent updates, supporting Voice over New Radio (VoNR) with enhanced adaptive multi-rate modes for variable network conditions in ultra-reliable low-latency communications. These evolutions maintain backward compatibility with legacy systems while optimizing for 5G's higher bandwidth and lower latency.

Neural integrations emerged in WebRTC standards during the 2020s, particularly through extensions to the Opus codec (RFC 6716, 2012 baseline). Opus version 1.5 (2024) incorporated machine learning for deep packet loss concealment and deep redundancy, improving low-bitrate speech quality down to 6 kbit/s and robustness in noisy wireless environments. These AI-driven post-filters and enhancements, proposed in IETF drafts, enable adaptive bitrate adjustments and error recovery without altering the core Opus structure, facilitating deployment in browser-based real-time communication.

The 2025 Low-Resource Audio Codec (LRAC) Challenge, hosted by Cisco and the IEEE Signal Processing Society, advanced neural standards for edge devices by evaluating codecs under compute, latency, and bitrate constraints (e.g., <10 kbit/s). Outcomes highlighted top transparency codecs like teamwzqaq's entry and speech enhancement solutions from nju-aalab, demonstrating perceptual quality rivaling uncompressed audio in reverberant, noisy settings via crowdsourced MUSHRA tests with over 186,000 ratings.

Recent trends emphasize adaptive multi-rate capabilities, as in EVS's variable bitrate switching for bandwidth efficiency, alongside enhanced error resilience through techniques like in-band forward error correction in wireless channels. These features address 5G and IoT demands for robust, low-latency coding in error-prone environments, with neural methods increasingly standardized for hybrid resilience.

Applications and Evaluation

Communication Systems

In traditional telephony systems, particularly the Public Switched Telephone Network (PSTN), the G.711 codec serves as the foundational standard for speech encoding. This Pulse Code Modulation (PCM)-based algorithm operates at 64 kbps, delivering toll-quality narrowband audio suitable for circuit-switched environments where uncompressed transmission ensures minimal latency and high fidelity. G.711's μ-law and A-law variants accommodate regional companding needs, making it ubiquitous in legacy infrastructure for interconnecting analog and digital voice signals.

For bridging the PSTN with IP-based networks, VoIP gateways frequently employ the ITU-T G.729 codec, a low-bitrate conjugate-structure algebraic CELP (CS-ACELP) coder standardized at 8 kbps. This enables efficient compression of speech for transport over bandwidth-constrained links while maintaining near-toll quality, reducing the load on gateways handling mixed circuit- and packet-switched traffic. Implementations in such gateways often include extensions like Annex B for voice activity detection and comfort noise generation, and G.729.1 for scalable variable-rate coding (8–32 kbps) to optimize bandwidth use.

Mobile communication systems have evolved to incorporate specialized speech coders tailored to wireless constraints. The GSM Full Rate (FR) codec, deployed in 2G networks during the 1990s, uses Regular Pulse Excitation with Long Term Prediction (RPE-LTP) at a fixed 13 kbps bitrate, processing 20 ms frames to balance quality and spectrum efficiency in early digital cellular deployments. In contrast, LTE and 5G networks utilize the Enhanced Voice Services (EVS) codec, which supports variable bitrates from 5.9 kbps to 128 kbps for adaptive encoding, enabling super-wideband audio up to 20 kHz in Voice over LTE (VoLTE) and Voice over New Radio (VoNR) scenarios. EVS integrates source-controlled variable bitrate modes to dynamically adjust to channel conditions, enhancing user experience in packet-switched mobile IMS architectures.

Voice over IP (VoIP) and real-time streaming platforms rely on versatile coders like Opus, a hybrid algorithm combining the SILK (linear-predictive, speech-oriented) and CELT (transform-based, music-oriented) coding layers, standardized in 2012 with bitrates ranging from 6 kbps to 510 kbps. Opus is mandated in WebRTC for browser-based communications, supporting seamless switching between narrowband and fullband modes to handle diverse applications from telephony to video conferencing. By 2025, neural network-based codecs have emerged in cloud VoIP services, achieving high perceptual quality at low bitrates (around 3–6 kbps) through end-to-end learning models optimized for latency-sensitive environments.

These deployments encounter key challenges from IP network variability, including jitter—variations in packet delay that disrupt speech continuity and cause audible artifacts in VoIP and mobile streams. Packet loss, common in congested or error-prone 5G links, is mitigated via integrated concealment mechanisms, such as those in EVS, which employ predictive synthesis to reconstruct missing frames without user-perceptible degradation. In 5G integrations, advanced jitter buffer management further addresses these issues by dynamically adapting to fluctuating network conditions in ultra-reliable low-latency communications.
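A minimal sketch of the playout-side jitter buffering and loss concealment mentioned above. Repetition-based concealment is a deliberate simplification here; deployed codecs like EVS synthesize the missing frame from decoder state instead, and real buffers adapt their depth to measured jitter.

```python
import numpy as np

class JitterBuffer:
    """Toy playout buffer: reorders packets by sequence number and conceals
    gaps by repeating an attenuated copy of the last played frame."""
    def __init__(self, frame_len):
        self.pending = {}                 # sequence number -> decoded frame
        self.next_seq = 0
        self.last = np.zeros(frame_len)

    def push(self, seq, frame):
        """Store a frame arriving from the network, possibly out of order."""
        self.pending[seq] = frame

    def pop(self):
        """Release one frame per playout tick (e.g. every 20 ms)."""
        frame = self.pending.pop(self.next_seq, None)
        if frame is None:                 # lost or late packet: conceal
            frame = 0.7 * self.last       # attenuated repeat of prior frame
        self.next_seq += 1
        self.last = frame
        return frame

# Usage: frames 0 and 2 arrive, frame 1 is lost and gets concealed
jb = JitterBuffer(frame_len=160)
jb.push(0, np.ones(160))
jb.push(2, np.ones(160))
out = [jb.pop() for _ in range(3)]
```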

Performance Metrics

Performance metrics for speech coding evaluate the trade-offs between quality, efficiency, and robustness, encompassing both objective measures derived from signal processing and subjective assessments based on human perception. Objective metrics provide quantifiable assessments without requiring listeners, while subjective ones capture perceptual nuances. These metrics are essential for comparing coders across applications, ensuring they meet requirements for bandwidth, computational resources, and transmission reliability.

A fundamental objective metric is the signal-to-noise ratio (SNR), which quantifies the ratio of the power of the clean speech signal to the power of the noise or distortion introduced by coding, typically expressed in decibels (dB). Higher SNR values indicate better fidelity, with values above 20 dB often corresponding to good quality in clean conditions; however, SNR correlates imperfectly with perceived quality due to its lack of perceptual modeling. In speech coding, SNR is commonly used to benchmark waveform coders like PCM, where it helps assess quantization noise.

For more perceptually aligned evaluation, the Perceptual Evaluation of Speech Quality (PESQ) metric simulates human auditory perception by mapping the difference between a reference and coded signal onto a scale from -0.5 to 4.5, where scores above 3.0 approximate good quality. Standardized by ITU-T Recommendation P.862, PESQ analyzes time-aligned signals through auditory filtering and psychoacoustic modeling to predict listening quality, making it suitable for narrowband codecs in telephone networks. PESQ has been widely adopted for end-to-end assessment, correlating highly (up to 0.94) with subjective scores in codec testing.

Subjective metrics, such as the Mean Opinion Score (MOS), rely on human listeners rating speech quality on a 1–5 scale (1: bad, 5: excellent) via listening tests, as defined in ITU-T Recommendation P.800. MOS provides the gold standard for perceptual validation, often used to calibrate objective metrics; for instance, toll-quality speech typically scores 4.0–4.5 MOS. These tests involve multiple raters to compute an arithmetic mean, ensuring reliability through controlled conditions like absolute category rating.

Bit rate versus quality trade-offs are central to speech coding design, balancing compression with fidelity. At 64 kbps, pulse code modulation (PCM) as in ITU-T G.711 achieves toll-quality speech indistinguishable from the original for narrowband telephony, yielding MOS scores around 4.2. Lower rates, such as 2.4–4 kbps in mixed-excitation linear prediction (MELP) variants, produce intelligible speech suitable for secure communications but with synthetic artifacts, resulting in MOS scores of 3.2–3.8 and reduced naturalness. These trade-offs highlight how parametric coders enable bandwidth savings at the cost of perceptual realism.

Computational complexity is measured in millions of instructions per second (MIPS), reflecting processing demands on hardware like DSPs; low-complexity coders target under 15 MIPS for mobile devices, while advanced ones may exceed 50 MIPS. Algorithmic delay, the time from input to output (typically 10–40 ms), combines frame size and buffering, with total end-to-end delay including network latency ideally below 150 ms for interactive applications. Robustness to bit errors is assessed via bit error rate (BER), where coders maintain quality up to BER of 10^{-3} through error concealment or soft decoding, preventing audible glitches in noisy channels like wireless links.

Emerging metrics like ViSQOL (Virtual Speech Quality Objective Listener) address limitations in traditional measures, particularly for neural codecs, by computing spectro-temporal similarity between signals to predict MOS on a 1–5 scale with high correlation (up to 0.92). Developed for full-reference evaluation, ViSQOL excels in wideband and neural-generated speech, capturing envelope and phase distortions better than PESQ, and is increasingly used in modern codec benchmarks.
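The SNR-based measures above are simple to compute; a common refinement is segmental SNR, which averages frame-level SNRs so that quiet frames are not swamped by loud ones. A minimal sketch, where the clamping range and 20 ms frame length are conventional choices rather than part of a standard:

```python
import numpy as np

def snr_db(clean, coded):
    """Global SNR in dB between the clean reference and the coded output."""
    noise = clean - coded
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def segmental_snr_db(clean, coded, frame_len=160):
    """Segmental SNR: mean of per-frame SNRs over ~20 ms frames at 8 kHz.
    Clamping each frame to [-10, 35] dB keeps silence and near-perfect
    frames from dominating the average."""
    snrs = []
    for i in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[i:i + frame_len]
        n = c - coded[i:i + frame_len]
        s = 10.0 * np.log10(np.sum(c ** 2) / (np.sum(n ** 2) + 1e-12))
        snrs.append(np.clip(s, -10.0, 35.0))
    return float(np.mean(snrs))

# Usage: compare a reference against a coded version with additive error
rng = np.random.default_rng(0)
clean = rng.standard_normal(8000)
coded = clean + 0.05 * rng.standard_normal(8000)
print(snr_db(clean, coded), segmental_snr_db(clean, coded))
```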
