
Code-excited linear prediction

Code-excited linear prediction (CELP) is a linear predictive speech coding algorithm that employs an analysis-by-synthesis approach: the spectral envelope of the speech signal is modeled with linear prediction filters, while the excitation is chosen as the optimal sequence from a predefined codebook so as to minimize the perceptual error between the original and synthesized speech, thereby achieving high-quality compression at low bit rates such as 4.8 kbit/s. Introduced in 1985 by Manfred R. Schroeder and Bishnu S. Atal, CELP builds on earlier techniques by incorporating a codebook of excitation vectors, typically containing hundreds to thousands of entries, which are filtered through short-term and long-term predictors to reconstruct speech waveforms with natural-sounding quality even under constrained bandwidth. The method's efficiency stems from its ability to exploit both short-term correlations and long-term periodicity in voiced speech, using perceptual weighting to prioritize the most audible frequency bands during codebook searches.

CELP has become foundational to numerous international speech coding standards, influencing telephony and multimedia applications worldwide. Key examples include the ITU-T G.728 recommendation for low-delay CELP (LD-CELP) at 16 kbit/s, standardized in 1992 for voice communications with a minimal algorithmic delay of 0.625 ms; the G.729 standard using conjugate-structure algebraic CELP (CS-ACELP) at 8 kbit/s, adopted in 1996 for efficient VoIP and digital telephony; and the GSM Enhanced Full Rate (EFR) codec based on algebraic CELP (ACELP) at 12.2 kbit/s, released in 1995 to enhance mobile network speech quality. These variants and others, such as those in wideband standards like G.722.2, demonstrate CELP's adaptability to diverse transmission environments, from cellular networks to packet-based voice services, where it balances compression efficiency with robust performance against transmission errors.

History

Invention and Early Development

Code-excited linear prediction (CELP) emerged as a significant advancement in low-bitrate speech coding, building on earlier innovations in excitation modeling for linear predictive coding (LPC). In 1982, Bishnu S. Atal and Joel R. Remde at Bell Laboratories introduced multipulse excitation as a method to generate more natural-sounding speech by using multiple pulses per pitch period to approximate the residual signal, rather than relying on simplistic quasi-periodic impulses or white noise. This approach improved speech quality at rates around 9.6 kbps but required determining optimal pulse locations and amplitudes iteratively, which increased complexity while still demanding substantial bitrate for encoding multiple parameters.

Subsequent developments addressed these limitations by structuring the excitation more efficiently. In 1986, Peter Kroon, Ed F. Deprettere, and Rob J. Sluyter proposed regular-pulse excitation (RPE), which arranged pulses at fixed intervals within subframes to reduce the search space for optimal amplitudes and lower computational demands compared to arbitrary multipulse configurations. RPE evolved the multipulse approach by imposing regularity on pulse positions, enabling good speech quality at bitrates near 13 kbps and paving the way for vector-based quantization techniques that could capture complex excitation patterns more compactly. These precursors highlighted the need for methods that balanced perceptual quality, bitrate efficiency, and computational feasibility in hardware-constrained environments.

The CELP algorithm was formally proposed in 1985 by Manfred R. Schroeder and Bishnu S. Atal at Bell Laboratories as an enhancement over multipulse LPC, employing an analysis-by-synthesis framework to select excitation vectors from a codebook so as to minimize a perceptually weighted error. By quantizing the excitation via codebook indices rather than explicit pulse parameters, CELP achieved a more efficient representation of the speech residual, targeting very low bitrates suitable for digital communications.

A key early challenge in CELP was the high computational cost of exhaustively searching large stochastic codebooks, typically containing 1024 or more random vectors per subframe, to identify the optimal excitation match. This exhaustive search, involving weighted error minimization through synthesis filtering, demanded significant processing power, limiting initial implementations to offline or high-end systems despite the algorithm's promise. Initial demonstrations nevertheless showcased CELP's potential, delivering high speech quality at bitrates of 4.8 to 9.6 kbps, with the innovation sequence coded at approximately 2 kbps and additional overhead for LPC coefficients and gains. These results marked a breakthrough for voice transmission over bandwidth-limited channels, such as those used in military communications.

Standardization Milestones

In 1990, the U.S. Department of Defense adopted Federal Standard 1016 (FS1016), specifying a 4.8 kbps CELP coder for analog-to-digital conversion of radio voice in secure communications applications. This standard marked the first formal governmental endorsement of CELP technology, enabling efficient low-bitrate encoding for military voice transmission over constrained channels.

The International Telecommunication Union Telecommunication Standardization Sector (ITU-T) advanced CELP standardization in 1992 with Recommendation G.728, which defined Low-Delay CELP (LD-CELP) operating at 16 kbps. LD-CELP reduced algorithmic delay to 0.625 ms through backward-adaptive prediction and a fixed codebook, supporting applications in digital circuit multiplication equipment and early packet-switched networks. A further milestone occurred in 1996 when ITU-T Recommendation G.729 standardized Conjugate-Structure Algebraic CELP (CS-ACELP) at 8 kbps, providing toll-quality speech for international telephony. CS-ACELP employed a sparse algebraic codebook structure, using four signed pulses positioned across 40-sample subframes to represent the excitation efficiently while minimizing audible distortion through perceptual weighting.

Building on these foundations, the European Telecommunications Standards Institute (ETSI) incorporated Algebraic CELP (ACELP) in the 1995 GSM Enhanced Full Rate (EFR) codec, operating at 12.2 kbps to enhance speech quality in second-generation mobile networks. EFR, detailed in GSM 06.60, improved upon the original GSM full-rate codec by leveraging ACELP's structured codebook for better robustness against channel errors in cellular environments. In 2014, 3GPP finalized the Enhanced Voice Services (EVS) codec in Release 12, integrating advanced CELP modes, including enhanced ACELP, for applications across bitrates from 5.9 to 128 kbps. EVS extended CELP principles to support super-wideband audio up to 20 kHz, with its CELP modes ensuring backward compatibility and low-latency performance in modern packet-based networks.

These standardization milestones facilitated CELP's widespread adoption in global telecommunications, enabling low-bandwidth mobile speech transmission that supported the proliferation of digital cellular systems and reduced spectral demands while maintaining intelligible voice quality for billions of users.

Fundamentals

Linear Predictive Coding Principles

Linear predictive coding (LPC) models speech production by representing the vocal tract as an all-pole filter excited by an input signal that approximates the glottal source. For voiced speech, the excitation consists of quasi-periodic pulses modeling glottal airflow, while unvoiced speech is modeled as filtered white noise; the combined effects of glottal pulse shaping and lip radiation are folded into the filter response. This all-pole approximation effectively captures the spectral envelope, particularly the formants, enabling efficient representation of speech spectra. The core LPC equation predicts the current speech sample \hat{s}(n) as a linear combination of p previous samples: \hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k), where a_k are the predictor coefficients and p is the model order, typically 10 for speech sampled at 8 kHz to adequately represent formants up to about 4 kHz. The prediction error e(n) = s(n) - \hat{s}(n) serves as the excitation signal; in code-excited linear prediction (CELP), this excitation is quantized by selecting from a codebook to minimize perceptual distortion.

Predictor coefficients are estimated using the autocorrelation method, which minimizes the mean-squared prediction error over a short speech segment by solving the normal equations \mathbf{R} \mathbf{a} = \mathbf{r}, where \mathbf{R} is the p \times p symmetric Toeplitz autocorrelation matrix with elements R_{ij} = r(|i-j|), \mathbf{a} = [a_1, \dots, a_p]^T, and \mathbf{r} = [r(1), \dots, r(p)]^T derived from the windowed speech autocorrelation r(k) = \sum_m s(m) s(m+k). Due to the Toeplitz structure, the Levinson-Durbin recursion efficiently solves these equations in O(p^2) time, yielding stable filters by ensuring that the reflection coefficients satisfy |k_i| < 1. For transmission in speech coding, LPC coefficients are often converted to reflection coefficients, which naturally enforce stability and facilitate interpolation between frames, or to line spectral pairs (LSPs), which represent the roots of symmetric and antisymmetric polynomials derived from the predictor and provide a stable parameterization with good quantization and interpolation properties for preserving the spectral envelope. LPC relies on key assumptions: the speech spectrum is stationary over short intervals (typically 5-20 ms), justifying frame-based analysis, and the all-pole model validly approximates the vocal tract for formants in non-nasal sounds, with poles inside the unit circle ensuring filter stability.
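
As a concrete illustration of the autocorrelation method and the normal equations above, the following Python sketch estimates a 10th-order predictor for a synthetic windowed frame and computes the prediction residual. The helper names, frame length, and test signal are assumptions made for the example rather than part of any standard.

```python
# Minimal sketch (not from the article): estimate LPC coefficients by solving
# the normal equations R a = r for one windowed frame, then compute the
# prediction residual e(n) = s(n) - sum_k a_k s(n-k).
import numpy as np

def autocorr(x, max_lag):
    """r(k) = sum_n x(n) x(n-k) for k = 0..max_lag over one windowed frame."""
    return np.array([np.dot(x[k:], x[:len(x) - k]) for k in range(max_lag + 1)])

def lpc_normal_equations(frame, order=10):
    """Solve R a = r (Toeplitz autocorrelation system) for the predictor a_1..a_p."""
    r = autocorr(frame, order)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])   # prediction: sum_k a[k-1] * s(n-k)

def residual(frame, a):
    """e(n) = s(n) - sum_k a_k s(n-k); samples before the frame are taken as zero."""
    p = len(a)
    e = np.copy(frame)
    for n in range(len(frame)):
        for k in range(1, p + 1):
            if n - k >= 0:
                e[n] -= a[k - 1] * frame[n - k]
    return e

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = np.arange(240)                                 # 30 ms frame at 8 kHz
    s = (np.sin(2 * np.pi * 500 / 8000 * n)
         + 0.3 * np.sin(2 * np.pi * 1500 / 8000 * n)
         + 0.05 * rng.standard_normal(len(n)))          # small noise keeps R well conditioned
    frame = s * np.hamming(len(s))                      # windowing as described above
    a = lpc_normal_equations(frame, order=10)
    e = residual(frame, a)
    print("residual energy / frame energy:", np.sum(e**2) / np.sum(frame**2))
```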

Analysis-by-Synthesis Framework

Code-excited linear prediction (CELP) operates within an analysis-by-synthesis framework, a paradigm in which the encoder iteratively simulates the decoder's process to evaluate and select excitation parameters that produce the highest-fidelity reconstruction of the input speech signal. This closed-loop approach ensures that encoding decisions are optimized directly on the synthesized signal, minimizing a perceptually weighted error rather than relying on the open-loop approximations common in earlier methods. Introduced in the seminal work on low-bit-rate speech coding by Schroeder and Atal, this framework enables high-quality speech at rates as low as 4.8 kbit/s by searching exhaustively for the best-matching excitation sequence.

At the core of the CELP structure lies the excitation generation mechanism, comprising an adaptive codebook for capturing long-term pitch periodicity via long-term prediction (LTP) and a fixed codebook for introducing the innovation needed to model the signal's aperiodic components. The adaptive codebook, derived from past excitation segments, represents periodic voiced speech through a delayed and scaled version of previous excitations, while the fixed codebook provides random-like or structured vectors to account for the residual elements. These codebook outputs are scaled by their respective gains and summed to form the overall excitation signal, which is then passed through the LPC synthesis filter, as described in the preceding section, to generate the synthesized speech. This dual-codebook design enhances modeling accuracy for both voiced and unvoiced speech segments.

The optimization process centers on minimizing the weighted mean-squared error between the original input speech x(n) and the synthesized output \hat{y}(n). The error signal is computed as e(n) = x(n) - \hat{y}(n) and filtered through a perceptual weighting filter W(z) to produce the weighted error e_w(n) = W(z) [e(n)], which emphasizes the error in bands where the ear is more sensitive to noise, such as the spectral valleys between formants, while de-emphasizing it in formant regions where noise is masked by the speech itself. The encoder selects the codebook entry (or entries) and gains that minimize \sum e_w^2(n) over the analysis frame or subframe, ensuring the synthesized speech aligns closely with the original in a psychoacoustically relevant manner. This perceptual weighting is crucial for achieving transparent quality at constrained bit rates.

The exhaustive nature of the codebook search in this framework imposes significant computational demands, with complexity scaling linearly as O(N) for a codebook of size N, often requiring evaluation of thousands of entries per subframe. Early implementations of the original CELP prototype demanded substantial processing power, on the order of 125 seconds of Cray-1 computation per second of speech, highlighting the need for algorithmic efficiencies such as sequential searches or structured codebooks in practical systems. These trade-offs have driven ongoing innovations in search strategies to balance quality and feasibility in later standards such as G.729 and AMR.
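
The closed-loop principle can be made concrete with a toy search over a random codebook: each candidate is passed through a simple synthesis filter, the error against the target is weighted with W(z) = A(z)/A(z/\gamma), and the entry with the smallest weighted squared error wins. Codebook size, filter order, and function names below are illustrative assumptions, not taken from any particular standard.

```python
# Illustrative analysis-by-synthesis search over a random Gaussian codebook.
import numpy as np

def synth_filter(exc, a):
    """All-pole synthesis 1/A(z): y(n) = exc(n) + sum_k a_k y(n-k)."""
    p = len(a)
    mem = np.zeros(p)
    y = np.zeros(len(exc))
    for n in range(len(exc)):
        y[n] = exc[n] + np.dot(a, mem)
        mem = np.concatenate(([y[n]], mem[:-1]))
    return y

def weight(x, a, gamma=0.9):
    """Crude perceptual weighting W(z) = A(z)/A(z/gamma)."""
    p = len(a)
    e = np.copy(x)
    for n in range(len(x)):                     # FIR part A(z)
        for k in range(1, p + 1):
            if n - k >= 0:
                e[n] -= a[k - 1] * x[n - k]
    return synth_filter(e, a * gamma ** np.arange(1, p + 1))   # 1/A(z/gamma)

def search_codebook(target, codebook, a):
    """Return (best index, gain) minimizing the weighted squared error."""
    tw = weight(target, a)
    best = (None, 0.0, np.inf)
    for i, c in enumerate(codebook):
        yw = weight(synth_filter(c, a), a)
        g = np.dot(tw, yw) / max(np.dot(yw, yw), 1e-12)   # optimal gain for this entry
        err = np.sum((tw - g * yw) ** 2)
        if err < best[2]:
            best = (i, g, err)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = np.array([1.2, -0.5])                    # toy stable 2nd-order predictor
    codebook = rng.standard_normal((64, 40))     # 64 entries, 40-sample subframe
    target = synth_filter(2.0 * codebook[17], a)
    print(search_codebook(target, codebook, a)[:2])   # expect index 17, gain ~2
```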

Encoding Process

LPC Coefficient Estimation

In the CELP encoding process, LPC coefficient estimation begins with preprocessing the input speech signal to enhance its suitability for analysis. A second-order high-pass filter is applied to remove low-frequency noise, typically with a cutoff around 80 Hz, mitigating DC offset and hum. Following this, the signal is segmented into analysis frames of 20-30 ms duration and windowed, commonly with a Hamming window, to minimize spectral leakage and ensure smooth transitions between frames. This windowing reduces discontinuities at the frame boundaries, preparing the signal for spectral-envelope modeling via linear prediction.

The core of LPC estimation involves computing the autocorrelation function from the windowed speech frames. The autocorrelation coefficients r(k) are calculated as r(k) = \sum_{n=k}^{N-1} x(n) x(n-k), where x(n) is the windowed speech sample, N is the frame length, and k = 0, 1, \dots, p, with p denoting the predictor order. These coefficients form the basis for solving the Yule-Walker equations to obtain the LPC coefficients a_k. The Levinson-Durbin recursion is employed for efficient computation, iteratively deriving the coefficients while ensuring stability through reflection coefficients bounded by 1 in magnitude. This algorithm converges in O(p^2) operations, yielding the analysis filter A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}, whose inverse 1/A(z) is the all-pole synthesis filter.

To transmit the LPC coefficients efficiently at low bit rates, they are transformed into line spectral pairs (LSPs), which offer better quantization properties due to their well-behaved distribution and simple stability check. Quantization typically uses vector quantization (VQ) of the LSP vector or split VQ, where the LSPs are divided into subvectors and quantized separately to reduce complexity and bit allocation, often 20-38 bits per frame in CELP systems. For smoothness across frames, interpolated LSPs are generated by linearly combining consecutive frame LSPs, preventing abrupt spectral changes during synthesis. Filter stability is enforced so that all roots of A(z) lie inside the unit circle, avoiding instability in the synthesis filter; an additional stability margin is provided by bandwidth expansion, where the quantized LPC coefficients are modified as a_k' = a_k \gamma^k with \gamma < 1 (e.g., 0.8). In CELP, a 10th-order predictor (p = 10) is standard for 8 kHz sampling, with coefficients updated every 5 ms subframe via interpolation from 20-30 ms analysis frames to track vocal tract variations. These estimated coefficients model the short-term spectral envelope and drive the subsequent excitation search in the analysis-by-synthesis loop.
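
A minimal sketch of the Levinson-Durbin recursion and the bandwidth expansion a_k' = a_k \gamma^k described above follows; the regularization-free recursion, the toy AR(2) test signal, and the helper names are assumptions made for the example.

```python
# Levinson-Durbin recursion on autocorrelation values, plus bandwidth expansion.
import numpy as np

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for predictor coefficients a_1..a_p.

    r : autocorrelation values r(0)..r(order).
    Returns (a, reflection_coefficients, prediction_error_energy).
    """
    a = np.zeros(order)
    k = np.zeros(order)
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])   # r(i), r(i-1), ..., r(1)
        k[i] = acc / err                             # reflection coefficient
        a_new = a.copy()
        a_new[i] = k[i]
        a_new[:i] = a[:i] - k[i] * a[:i][::-1]       # update lower-order terms
        a = a_new
        err *= (1.0 - k[i] ** 2)                     # residual energy shrinks each step
    return a, k, err

def bandwidth_expand(a, gamma=0.8):
    """a_k' = a_k * gamma**k, pulling poles toward the origin for extra stability margin."""
    return a * gamma ** np.arange(1, len(a) + 1)

if __name__ == "__main__":
    # Autocorrelation of a toy AR(2) process with known coefficients (illustrative).
    rng = np.random.default_rng(1)
    x = np.zeros(2000)
    for n in range(2, len(x)):
        x[n] = 1.2 * x[n - 1] - 0.5 * x[n - 2] + rng.standard_normal()
    r = np.array([np.dot(x[m:], x[:len(x) - m]) for m in range(11)])
    a, k, err = levinson_durbin(r, 10)
    print("a1, a2:", np.round(a[:2], 2), " all |k| < 1:", bool(np.all(np.abs(k) < 1)))
    print("bandwidth-expanded a1, a2:", np.round(bandwidth_expand(a)[:2], 3))
```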

Codebook Excitation Selection

In code-excited linear prediction (CELP), the adaptive codebook models the long-term correlation in speech due to pitch periodicity by storing segments of the past excitation signal. It is indexed by a lag L (typically ranging from 20 to 120 samples at an 8 kHz sampling rate) and a gain G_p, allowing the encoder to select and scale a delayed version of prior excitation to predict the current subframe's periodic component. This approach enhances efficiency at low bit rates by exploiting the quasi-periodic nature of voiced speech.

The fixed codebook provides stochastic or structured innovation to capture the non-periodic residual after long-term prediction, representing the random fluctuations in the excitation signal. Entries consist of predefined vectors, such as random sequences in early CELP designs or algebraic pulse patterns (e.g., four signed unit pulses in ACELP variants), approximating the residual with sparse representations. The encoder searches this codebook to select the vector c_k(n) that best fits the remaining target signal after the adaptive contribution.

The search procedure operates sequentially, as sketched in the example below: first, the adaptive codebook is searched by maximizing the normalized correlation between the target signal and filtered past-excitation candidates over the possible lags, yielding the optimal L and G_p. The target is then updated by subtracting the adaptive contribution, and the fixed codebook is searched to find the entry c_k(n) and gain G_c that minimize the squared weighted error \|W(z)\,e(n)\|^2, where W(z) is a perceptual weighting filter that shapes the error spectrum according to auditory masking. The gains G_p and G_c are computed as correlations between the target and the filtered codebook outputs, normalized by the corresponding energies, and are quantized (often jointly) to scale the periodic and innovative components. The resulting excitation for the current subframe is formed as:

u(n) = G_p \, u(n - L) + G_c \, c_k(n), \quad n = 0, \dots, 39

where u(n - L) is the (possibly interpolated) past excitation at lag L and the subframe spans 40 samples. Processing occurs in 5 ms subframes to track rapid pitch variations, with the adaptive codebook updated with the newly formed excitation after each subframe. This granular approach ensures accurate modeling of pitch changes, particularly in voiced segments, while keeping computational demands manageable through efficient correlation-based searches.
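
The sketch below mimics this sequential search on toy data: a lag is chosen by maximizing the normalized correlation with the past excitation, the target is updated, and a fixed codevector and gain are then selected. It omits the synthesis and weighting filters inside the loop, restricts lags to at least one subframe (real coders repeat the short segment when L is smaller), and uses invented buffer sizes and codebooks, so it should be read as an illustration of the search logic only.

```python
# Toy sequential adaptive/fixed codebook search and excitation construction.
import numpy as np

SUBFRAME = 40

def adaptive_search(target, past_exc, lag_range=(40, 120)):
    """Pick lag L maximizing <target, x_L>^2 / <x_L, x_L>, x_L = past excitation at lag L."""
    best_lag, best_score = lag_range[0], -np.inf
    for L in range(*lag_range):
        xL = past_exc[len(past_exc) - L:len(past_exc) - L + SUBFRAME]
        score = np.dot(target, xL) ** 2 / max(np.dot(xL, xL), 1e-12)
        if score > best_score:
            best_lag, best_score = L, score
    xL = past_exc[len(past_exc) - best_lag:len(past_exc) - best_lag + SUBFRAME]
    gp = np.dot(target, xL) / max(np.dot(xL, xL), 1e-12)
    return best_lag, gp, xL

def fixed_search(target2, codebook):
    """Pick the fixed codevector and gain minimizing the remaining squared error."""
    best = (0, 0.0, np.inf)
    for i, c in enumerate(codebook):
        gc = np.dot(target2, c) / max(np.dot(c, c), 1e-12)
        err = np.sum((target2 - gc * c) ** 2)
        if err < best[2]:
            best = (i, gc, err)
    return best[0], best[1]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    past_exc = rng.standard_normal(160)           # excitation history buffer
    codebook = rng.standard_normal((128, SUBFRAME))
    L_true = 57                                   # build a target from known components
    target = 0.8 * past_exc[160 - L_true:160 - L_true + SUBFRAME] + 0.4 * codebook[9]
    L, gp, xL = adaptive_search(target, past_exc)
    k, gc = fixed_search(target - gp * xL, codebook)
    u = gp * xL + gc * codebook[k]                # u(n) = Gp u(n-L) + Gc c_k(n)
    # Typically recovers lag 57, index 9, and gains near 0.8 / 0.4.
    print(L, round(float(gp), 2), k, round(float(gc), 2))
```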

Noise Weighting Optimization

In code-excited linear prediction (CELP), noise weighting optimization employs a perceptual weighting filter to shape the quantization error spectrum so that distortion is introduced mainly where it is least audible. Because the ear tolerates relatively more noise under the high-energy formant peaks, where the speech signal itself masks it, and is far more sensitive to noise in the low-energy valleys between formants, the weighting steers the quantization noise toward the formant regions and away from the valleys, improving subjective quality on psychoacoustic grounds.

The core of this optimization is the perceptual weighting filter W(z), defined as

W(z) = \frac{A(z/\gamma)}{A(z/\gamma')}

where A(z) is the LPC analysis filter (the inverse of the synthesis filter), \gamma typically ranges from 0.8 to 1.0, and \gamma' < \gamma (often around 0.4 to 0.6). The bandwidth expansion applied to the numerator and denominator gives W(z) a magnitude response with dips at the formant frequencies, so the error is de-emphasized near formants and emphasized in the valleys; minimizing the weighted error therefore forces the actual noise spectrum to follow the speech spectrum and remain below the masking threshold. For efficient implementation, the filter can be realized either in the frequency domain with fast transforms or, more commonly, as a pole-zero structure built directly from bandwidth-expanded copies of the LPC coefficients, which keeps it stable and computationally inexpensive. The weighting is applied to both the target signal x_w(n) = W(z)[x(n)], where x(n) is the pre-processed input (minus the adaptive-codebook contribution during the fixed search), and the synthesized candidate y_w(n) = W(z)[y(n)], with the codebook search minimizing the mean-squared weighted error \| x_w(n) - y_w(n) \|^2.

To adapt to varying speech characteristics, the parameters \gamma and \gamma' are often adjusted based on the spectral shape of the frame; for unvoiced or spectrally flat segments, the two factors are brought closer together to flatten the weighting and avoid over-shaping the noise. This adaptation, typically driven by measures such as spectral tilt or open-loop pitch correlation, improves robustness across voiced and unvoiced frames. By concentrating quantization noise where it is perceptually masked, this optimization enables more efficient bit allocation, tolerating coarser quantization where the resulting noise is inaudible while preserving fidelity where noise would be exposed, which is critical for low-bitrate operation in standards like G.729.
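
A small sketch of the weighting filter is shown below: the LPC coefficients are bandwidth-expanded with \gamma and \gamma' and applied as the numerator and denominator of a pole-zero filter. The specific coefficient values and helper names are assumptions chosen for the example.

```python
# Sketch of W(z) = A(z/gamma) / A(z/gamma') built from a given LPC coefficient set.
import numpy as np

def expand(a, g):
    """Bandwidth expansion: replace a_k by a_k * g**k."""
    return a * g ** np.arange(1, len(a) + 1)

def pole_zero_filter(x, num_a, den_a):
    """Apply A_num(z)/A_den(z) where A(z) = 1 - sum_k a_k z^{-k}."""
    y = np.zeros(len(x))
    fir = np.copy(x)
    for n in range(len(x)):                      # fir(n) = x(n) - sum num_a[k] x(n-k)
        for k in range(1, len(num_a) + 1):
            if n - k >= 0:
                fir[n] -= num_a[k - 1] * x[n - k]
    for n in range(len(x)):                      # y(n) = fir(n) + sum den_a[k] y(n-k)
        y[n] = fir[n]
        for k in range(1, len(den_a) + 1):
            if n - k >= 0:
                y[n] += den_a[k - 1] * y[n - k]
    return y

def perceptual_weight(x, a, gamma=0.92, gamma_prime=0.6):
    return pole_zero_filter(x, expand(a, gamma), expand(a, gamma_prime))

if __name__ == "__main__":
    a = np.array([1.2, -0.5])                    # toy stable 2nd-order LPC coefficients
    rng = np.random.default_rng(7)
    x = rng.standard_normal(160)
    xw = perceptual_weight(x, a)
    print(len(xw), round(float(np.std(xw)), 3))
```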

Decoding Process

Signal Synthesis

In the CELP decoder, the process begins with the reception of quantized parameters transmitted from the encoder, including indices for the LPC coefficients, pitch lag L, adaptive and fixed codebook gains G_p and G_c, and the fixed codebook vector index k. These parameters are dequantized to reconstruct the excitation signal u(n), which combines the periodic component from the adaptive codebook and the stochastic component from the fixed codebook:
u(n) = G_p \, u(n - L) + G_c \, c_k(n),
where u(n - L) is derived from the buffer of previously synthesized excitation, and c_k(n) is the selected entry from the fixed codebook. This reconstruction occurs per subframe, typically within frames of 40 to 160 samples (corresponding to 5-20 ms at an 8 kHz sampling rate), with buffers maintaining the past excitation history for the adaptive codebook update.
The synthesized speech signal \hat{s}(n) is then generated by passing the excitation u(n) through the all-pole LPC synthesis filter 1/A(z), where A(z) is the analysis (inverse) filter formed from the dequantized LPC coefficients, often interpolated across subframes for smoothness. In the time domain this corresponds to the difference equation

\hat{s}(n) = u(n) + \sum_{k=1}^{p} a_k \hat{s}(n-k).
The LPC coefficients, typically 10th order for speech, model the spectral envelope and are updated every frame to adapt to changing vocal tract characteristics. Buffer management ensures seamless processing, with the synthesis state updated using the newly generated \hat{s}(n) to prepare for the next subframe.
To handle frame erasures due to transmission errors, the decoder incorporates robustness mechanisms such as muting the output for severe losses or predictive fill-in using parameters extrapolated from prior frames, preventing abrupt discontinuities in the speech output. This maintains perceptual continuity without requiring additional side information. The final output is 8 kHz pulse-code modulated (PCM) speech, suitable for narrowband telephony, with low-delay variants such as LD-CELP achieving algorithmic latencies under 5 ms to support real-time interactive applications.
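
The decoder operations described above reduce to a short loop per subframe, sketched below with toy parameters: rebuild the excitation from the adaptive-codebook history and a fixed codevector, run the synthesis difference equation, and update the excitation and filter-memory buffers. Buffer lengths, codebook contents, and parameter values are illustrative assumptions.

```python
# Toy decoder-side synthesis of consecutive 40-sample subframes.
import numpy as np

SUBFRAME = 40

def decode_subframe(params, fixed_cb, exc_hist, synth_mem, a):
    """params = (L, gp, k, gc); exc_hist and synth_mem are carried between calls."""
    L, gp, k, gc = params
    u = gp * exc_hist[len(exc_hist) - L:len(exc_hist) - L + SUBFRAME] + gc * fixed_cb[k]
    s_hat = np.zeros(SUBFRAME)
    for n in range(SUBFRAME):                    # s_hat(n) = u(n) + sum_j a_j s_hat(n-j)
        acc = u[n]
        for j in range(1, len(a) + 1):
            prev = s_hat[n - j] if n - j >= 0 else synth_mem[-j]
            acc += a[j - 1] * prev
        s_hat[n] = acc
    exc_hist = np.concatenate((exc_hist[SUBFRAME:], u))   # adaptive codebook update
    synth_mem = s_hat[-len(a):]                           # synthesis filter memory
    return s_hat, exc_hist, synth_mem

if __name__ == "__main__":
    rng = np.random.default_rng(11)
    fixed_cb = rng.standard_normal((128, SUBFRAME))
    exc_hist = np.zeros(160)                     # past excitation (covers toy lags up to 120)
    synth_mem = np.zeros(2)
    a = np.array([1.2, -0.5])                    # dequantized toy LPC coefficients
    out = []
    for params in [(60, 0.0, 17, 1.0), (60, 0.7, 23, 0.5), (60, 0.8, 5, 0.4)]:
        s_hat, exc_hist, synth_mem = decode_subframe(params, fixed_cb, exc_hist, synth_mem, a)
        out.append(s_hat)
    print(np.round(np.concatenate(out)[:5], 3))
```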

Post-Filtering Techniques

Post-filtering techniques in code-excited linear prediction (CELP) decoders serve as optional enhancements applied after signal synthesis to mitigate perceptual artifacts, chiefly the quantization noise that remains audible in the spectral valleys between formants, thereby improving the subjective quality of reconstructed speech. These methods shape the output spectrum to align more closely with human auditory perception, emphasizing formants and pitch harmonics while suppressing inter-formant noise. Unlike encoder-side perceptual weighting, which minimizes the weighted error during the codebook search, post-filtering operates solely on the decoder side to refine the final output without requiring additional bit allocation.

The core of post-filtering is the adaptive short-term (formant) post-filter, usually cascaded with a tilt-compensation stage and an adaptive gain control. The formant post-filter typically takes the form

H_f(z) = \frac{A(z/\gamma_n)}{A(z/\gamma_d)}, \quad 0 < \gamma_n < \gamma_d < 1,

where A(z) is the LPC analysis filter and the factors \gamma_n and \gamma_d (values in the region of 0.55-0.75 in G.729 and EFR) control how strongly the formant peaks are emphasized relative to the valleys. Because this filter introduces a low-pass spectral tilt, a first-order tilt-compensation filter of the form H_t(z) = 1 - \mu z^{-1}, with \mu derived from the first reflection coefficient, is applied to counteract it, and an adaptive gain control normalizes the output energy to that of the unfiltered synthesis. The post-filter is updated per subframe using the decoded LPC coefficients, ensuring adaptation to varying speech characteristics.

A long-term post-filter complements the short-term one by enhancing periodicity, particularly for voiced speech, through a structure analogous to the long-term predictor in CELP encoding. It uses the received pitch lag to apply a comb-like filter that reinforces the harmonic structure, reducing noise between harmonics without altering the fundamental period; its gain is limited, or the filter disabled, for unvoiced segments to avoid over-smoothing. A simple high-pass output filter is often applied as well to remove DC components and low-frequency bias.

These techniques add little algorithmic delay but yield notable perceptual gains, with mean opinion score (MOS) improvements on the order of 0.2-0.5 points reported in subjective tests, particularly under noisy conditions. In standards, the adaptive post-filter is part of the G.729 decoder for 8 kbit/s CS-ACELP coding, helping to ensure toll-quality performance, and similar post-filters are included in the decoders of standards such as GSM Enhanced Full Rate (EFR) and Adaptive Multi-Rate (AMR), where they contribute to quality without affecting the core delay budget.
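
The following sketch applies a generic formant post-filter of the form A(z/\gamma_n)/A(z/\gamma_d), first-order tilt compensation, and gain normalization to a synthesized subframe. The parameter values and structure follow the generic description above rather than any specific standard, and the helper names are assumptions.

```python
# Sketch of a short-term (formant) post-filter with tilt compensation and gain control.
import numpy as np

def expand(a, g):
    return a * g ** np.arange(1, len(a) + 1)

def formant_postfilter(s_hat, a, gamma_n=0.55, gamma_d=0.70, mu=0.5):
    num_a, den_a = expand(a, gamma_n), expand(a, gamma_d)
    p = len(a)
    # Numerator A(z/gamma_n): residual-like FIR stage.
    r = np.copy(s_hat)
    for n in range(len(s_hat)):
        for k in range(1, p + 1):
            if n - k >= 0:
                r[n] -= num_a[k - 1] * s_hat[n - k]
    # Denominator 1/A(z/gamma_d): all-pole stage.
    y = np.zeros(len(s_hat))
    for n in range(len(s_hat)):
        y[n] = r[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                y[n] += den_a[k - 1] * y[n - k]
    # First-order tilt compensation 1 - mu*z^{-1}.
    t = np.copy(y)
    t[1:] -= mu * y[:-1]
    # Adaptive gain control: restore the input energy level.
    g = np.sqrt(np.sum(s_hat ** 2) / max(np.sum(t ** 2), 1e-12))
    return g * t

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    a = np.array([1.2, -0.5])                    # toy decoded LPC coefficients
    s_hat = rng.standard_normal(80)
    print(np.round(formant_postfilter(s_hat, a)[:5], 3))
```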

Variants and Extensions

Algebraic CELP

Algebraic code-excited linear prediction (ACELP) is a variant of the CELP framework that employs a structured fixed codebook composed of sparse pulse patterns to achieve computational efficiency and high speech quality at low bit rates. In this approach, the fixed excitation is generated algebraically rather than read from a store of precomputed vectors, allowing for a compact description that facilitates rapid searching during encoding. This structure is particularly suited to modeling the impulsive character of speech excitation.

The core of the ACELP fixed codebook consists of a small number of non-zero pulses distributed across a subframe, with their positions and signs optimized to best match the target signal. For instance, in the G.729 standard, each 40-sample subframe uses exactly four pulses, where each pulse has an amplitude of +1 or -1 and is placed at one of several allowed positions organized into interleaved tracks to ensure even coverage and avoid pulse collisions. The codevector c_k(n) for the k-th entry is thus defined as c_k(n) = \sum_{i=1}^{P} s_i \delta(n - p_i), where P = 4 is the number of pulses, s_i = \pm 1 are the signs, p_i are the integer positions within the subframe (0 to 39), and \delta(\cdot) is the Kronecker delta function. This formulation ensures sparsity, with only P non-zero samples per 40-sample vector, enabling the codebook to be fully described by the 17 bits allocated for positions and signs in G.729 rather than by storing thousands of full vectors.

The search for the optimal ACELP codevector follows the analysis-by-synthesis paradigm, maximizing the correlation between the target signal and the filtered codevector while minimizing the perceptually weighted distortion. To avoid the cost of an exhaustive search over a combinatorial space that can reach on the order of 2^{40} candidate configurations in larger codebooks, efficient techniques such as focused tree searches and sign pre-selection are employed. In G.729, for example, pulse signs are first fixed from the sign of the backward-filtered target, after which a focused search prunes unlikely position combinations using correlation thresholds, reducing the number of tested candidates to a small fraction of the full space. This lowers the computational load dramatically compared to brute-force methods while preserving near-optimal performance.

ACELP offers key advantages over traditional stochastic codebooks, including substantially lower memory requirements, since no full codevectors need to be stored and only the algebraic parameters are quantized and transmitted. Additionally, the sparse structure excels at capturing sharp transients in speech, such as plosives and onsets, by concentrating energy at precise locations, which enhances perceptual quality during dynamic signal segments. These benefits make ACELP well suited to resource-constrained environments like mobile communications.

ACELP forms the foundation of several international speech coding standards, notably serving as the core excitation mechanism in ITU-T G.729 for 8 kbit/s toll-quality coding. It is also integral to the Adaptive Multi-Rate (AMR) codec (3GPP TS 26.090), where algebraic pulse codebooks enable robust performance across varying channel conditions in GSM and UMTS networks. Furthermore, enhanced ACELP variants underpin the speech coding modes of the Enhanced Voice Services (EVS) codec (3GPP TS 26.445), which provides an AMR-WB-interoperable mode for backward compatibility while supporting super-wideband audio up to 20 kHz.
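
The algebraic codevector construction can be illustrated directly: the sketch below builds a 40-sample vector from four signed unit pulses placed on G.729-like interleaved tracks and counts the bits needed to index positions and signs. The track layout follows the common description of G.729 but should be treated as illustrative rather than normative.

```python
# Sparse algebraic codevector with a G.729-like interleaved track layout.
import numpy as np

SUBFRAME = 40
TRACKS = [
    list(range(0, 40, 5)),                               # pulse 0: 0, 5, ..., 35
    list(range(1, 40, 5)),                               # pulse 1: 1, 6, ..., 36
    list(range(2, 40, 5)),                               # pulse 2: 2, 7, ..., 37
    list(range(3, 40, 5)) + list(range(4, 40, 5)),       # pulse 3: 3,8,...,38 and 4,9,...,39
]

def algebraic_codevector(positions, signs):
    """c_k(n) = sum_i s_i * delta(n - p_i); positions[i] indexes into TRACKS[i]."""
    c = np.zeros(SUBFRAME)
    for trk, (pos_idx, s) in enumerate(zip(positions, signs)):
        c[TRACKS[trk][pos_idx]] += s                     # s is +1 or -1
    return c

def index_bits():
    """Bits needed for positions and signs (matches the 13 + 4 = 17 quoted for G.729)."""
    pos_bits = sum(int(np.ceil(np.log2(len(t)))) for t in TRACKS)
    return pos_bits + len(TRACKS)                        # one sign bit per pulse

if __name__ == "__main__":
    c = algebraic_codevector(positions=[2, 0, 7, 9], signs=[+1, -1, +1, -1])
    print("pulse positions:", np.nonzero(c)[0], "signs:", c[np.nonzero(c)[0]])
    print("codebook index size:", index_bits(), "bits")
```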

Low-Delay CELP

Low-delay code-excited linear prediction (LD-CELP) is a variant of the CELP framework optimized for real-time speech communication, where minimizing algorithmic delay is paramount for interactive applications such as telephony and video conferencing. Unlike standard CELP, which relies on buffering a full frame plus lookahead for forward linear prediction coefficient (LPC) estimation, LD-CELP employs backward-adaptive techniques to process incoming speech with minimal buffering, achieving an algorithmic delay as low as 0.625 ms. This approach was standardized in ITU-T G.728 for 16 kbit/s coding, providing toll-quality performance suitable for digital circuit multiplication equipment and other low-latency systems.

A key innovation in LD-CELP is backward LPC estimation, in which the predictor coefficients are derived from previously synthesized (decoded) speech rather than from the input signal, eliminating the need for lookahead and for transmitting the coefficients. The LPC analysis uses a high-order filter, typically 50th order, updated every 2.5 ms (20 samples at an 8 kHz sampling rate) via hybrid windowing of the quantized synthesis signal and Durbin's recursion for coefficient computation. Processing is organized into vectors of 5 samples (0.625 ms each), with the predictor coefficients adapted every 4 vectors to track the signal without accumulating errors. The excitation codebook is small and gain-shape structured, comprising 128 shape vectors (7 bits) and 8 gain levels (3 bits) for a 10-bit index per vector, and the coder omits the long-term prediction (LTP) loop that introduces additional delay in standard CELP. Backward gain adaptation further ensures that only the codebook index needs to be transmitted, streamlining the encoding process. A highly simplified sketch of this backward-adaptive, vector-by-vector operation appears below.

In terms of performance, LD-CELP at 16 kbit/s in G.728 delivers speech quality comparable to or exceeding 32 kbit/s ADPCM (ITU-T G.726), with a one-way algorithmic delay under 2 ms, making it well suited to interactive voice. However, this comes at the cost of a higher bitrate than many standard CELP variants optimized for bandwidth efficiency, as the small vectors and frequent adaptation leave fewer opportunities for removing long-term redundancy. Trade-offs include slightly reduced coding efficiency due to the lack of lookahead and of long-term prediction, though post-filtering and the backward-adaptive structure keep it reasonably robust; overall, LD-CELP excels in delay-sensitive scenarios but trades some compression for responsiveness.
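
The sketch below (referred to above) caricatures the backward-adaptive operation: 5-sample vectors are quantized with a small gain-shape codebook, and a low-order predictor is re-estimated from previously synthesized samples only, so a decoder could repeat the same adaptation without receiving any coefficients. The order, random codebook, update schedule details, and all helper names are toy assumptions; G.728 itself uses a 50th-order backward predictor, trained codebooks, and additional gain-adaptation and windowing machinery.

```python
# Highly simplified backward-adaptive, vector-by-vector coding sketch.
import numpy as np

VEC, ORDER, UPDATE = 5, 4, 4

def backward_lpc(history, order=ORDER):
    """Autocorrelation LPC on past synthesized samples (available at the decoder too)."""
    r = np.array([np.dot(history[m:], history[:len(history) - m]) for m in range(order + 1)])
    if r[0] <= 0:
        return np.zeros(order)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R + 1e-6 * np.eye(order), r[1:order + 1])

def encode(signal, shapes, gains):
    a = np.zeros(ORDER)
    synth = np.zeros(ORDER)                      # synthesized history (grows as we go)
    indices = []
    for v in range(len(signal) // VEC):
        if v % UPDATE == 0 and len(synth) >= 8 * VEC:
            a = backward_lpc(synth[-8 * VEC:])   # adapt from decoded past only
        x = signal[v * VEC:(v + 1) * VEC]
        best, best_err = (0, 0), np.inf
        for si, shape in enumerate(shapes):
            for gi, g in enumerate(gains):
                y = np.zeros(VEC)                # candidate synthesis for this vector
                for n in range(VEC):
                    past = np.concatenate((synth, y[:n]))[-ORDER:][::-1]
                    y[n] = g * shape[n] + np.dot(a, past)
                err = np.sum((x - y) ** 2)
                if err < best_err:
                    best, best_err, best_y = (si, gi), err, y
        indices.append(best)                     # only this index would be transmitted
        synth = np.concatenate((synth, best_y))
    return indices

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    shapes = rng.standard_normal((16, VEC))      # toy gain-shape codebook: 16 shapes x 4 gains
    gains = np.array([0.5, 1.0, 2.0, 4.0])
    signal = np.sin(2 * np.pi * 400 / 8000 * np.arange(200))
    print(encode(signal, shapes, gains)[:6])
```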

Applications and Evaluation

Integration in Standards

Code-excited linear prediction (CELP) has been integrated into numerous telephony standards to enable efficient voice compression while maintaining quality. In Voice over Internet Protocol (VoIP) systems using the Session Initiation Protocol (SIP), G.729, a conjugate-structure algebraic CELP variant operating at 8 kbit/s, serves as a common fallback codec when the primary G.711 pulse-code modulation would exceed bandwidth limits, ensuring reliable transmission over packet-switched networks. Similarly, the Adaptive Multi-Rate (AMR) codec, based on algebraic CELP with bit rates from 4.75 to 12.2 kbit/s, is mandated for circuit-switched voice in Global System for Mobile Communications (GSM) and Universal Mobile Telecommunications System (UMTS) networks, providing adaptive quality based on channel conditions.

In wireless standards, the Enhanced Voice Services (EVS) codec, standardized by 3GPP for Long-Term Evolution (LTE) and Voice over LTE (VoLTE), incorporates CELP-based modes at bit rates starting from 5.9 kbit/s, enabling high-fidelity audio up to super-wideband quality in mobile environments with robustness to packet loss. The Opus codec, defined in RFC 6716, employs a hybrid structure in which its SILK component, a linear-prediction-based speech coder related to CELP, handles narrowband to wideband voice and is combined with the CELT transform layer for higher bandwidths, making it suitable for real-time applications such as WebRTC conferencing.

Secure and military communications have also leveraged CELP for its balance of intelligibility and low bit rate. The FS1016 standard, a 4.8 kbit/s CELP coder developed for U.S. government secure voice applications, was used with Secure Telephone Unit III (STU-III) equipment for encrypted voice over analog lines. Later military systems moved to mixed-excitation linear prediction (MELP), as in the 2.4 kbit/s MELPe standard used in tactical radios, which pushes intelligible speech to even lower bit rates.

Open-source and commercial implementations further embed CELP in modern devices. Speex, an open-source codec library from the Xiph.Org Foundation, directly implements CELP for narrowband to wideband speech compression at roughly 2-44 kbit/s, supporting VoIP and embedded applications without licensing fees. In commercial telephony stacks, mobile platforms such as Android and iOS ship CELP-based codecs from the AMR family for carrier voice calls, ensuring interoperability across networks.

Performance Characteristics

Code-excited linear prediction (CELP) coders typically operate at bitrates from 4.8 to 16 kbps to achieve near-toll-quality speech, corresponding to mean opinion scores (MOS) of roughly 3.5 to 4.0 on a scale of 1 to 5, where higher values indicate better perceived quality. For example, the Federal Standard 1016 CELP at 4.8 kbps yields an MOS of approximately 3.2, while G.729 at 8 kbps achieves around 3.9, reflecting natural-sounding output due to the perceptual weighting in the analysis-by-synthesis loop that minimizes audible distortion. Wideband variants require higher bitrates, often exceeding 16 kbps, to maintain similar MOS levels across extended frequency ranges up to 7 kHz. Objective metrics such as segmental signal-to-noise ratio (segSNR) for CELP typically fall between 10 and 15 dB, providing a quantitative measure of fidelity, though perceived quality often exceeds what segSNR alone suggests.

CELP exhibits good robustness to packet loss: with appropriate concealment techniques and post-filtering, decoders can cope with loss rates of 15-20% while keeping MOS degradation below about 0.5 points. This resilience stems from the parametric nature of the excitation and prediction coefficients, allowing extrapolation or interpolation of parameters for lost frames without severe artifacts. Additionally, the perceptual optimization in CELP's codebook search contributes to natural-sounding quality, bridging the gap between the synthetic artifacts of classical vocoders and the bitrate cost of waveform preservation.

Limitations include significant computational demands, with typical implementations requiring 10-50 million instructions per second (MIPS), though optimized variants such as algebraic CELP reduce this to under 10 MIPS. Sensitivity to frame errors can cause audible glitches if losses exceed concealment capabilities, and at bitrates below 4 kbps speech often sounds muffled or robotic because the excitation can no longer be represented with sufficient resolution. Complexity is also measured in weighted million operations per second (WMOPS), with common CELP standards registering on the order of 8-14 WMOPS for an encoder-decoder pair.

Compared to waveform coders such as PCM or ADPCM, CELP offers superior compression efficiency at low bitrates, achieving quality comparable to 32 kbps ADPCM using only 8 kbps or less, thanks to its source-filter modeling. Against modern neural coders, CELP is a legacy approach, but it provides deterministic performance without dependence on training data, ensuring consistent quality in resource-constrained environments.
