Code-excited linear prediction
Code-excited linear prediction (CELP) is a linear predictive speech coding algorithm that employs an analysis-by-synthesis approach: linear prediction filters model the spectral envelope of the speech signal, while an excitation sequence is selected from a predefined codebook to minimize the perceptual error between the original and synthesized speech. This combination achieves high-quality compression at low bit rates such as 4.8 kbit/s.[1] Introduced in 1985 by Manfred R. Schroeder and Bishnu S. Atal, CELP builds on earlier linear prediction techniques by incorporating a codebook of stochastic excitation vectors, typically containing hundreds to thousands of entries, which are filtered through short-term and long-term predictors to reconstruct speech waveforms with natural-sounding quality even under constrained bandwidth.[1] The method's efficiency stems from its ability to exploit both short-term spectral correlations and long-term pitch periodicity in voiced speech, using perceptual weighting to prioritize audible frequency bands during codebook searches.[2]

CELP has become foundational to numerous international speech coding standards, influencing telecommunications and multimedia applications worldwide. Key examples include the ITU-T G.728 recommendation for low-delay CELP (LD-CELP) at 16 kbit/s, standardized in 1992 for real-time voice communications with a minimal algorithmic delay of 0.625 ms;[3] the ITU-T G.729 standard using conjugate-structure algebraic CELP (CS-ACELP) at 8 kbit/s, adopted in 1996 for efficient VoIP and digital telephony;[4] and the GSM Enhanced Full Rate (EFR) codec based on algebraic CELP (ACELP) at 12.2 kbit/s, released in 1995 to enhance mobile network speech quality.[5] These variants and others, such as those in wideband standards like G.722.2, demonstrate CELP's adaptability to diverse environments including cellular networks, satellite links, and internet protocols, where it balances computational complexity with robust performance against transmission errors.[2]

History
Invention and Early Development
Code-excited linear prediction (CELP) emerged as a significant advancement in low-bitrate speech coding, building on earlier innovations in excitation modeling for linear predictive coding (LPC). In 1982, Bishnu S. Atal and Joel R. Remde at Bell Laboratories introduced multipulse excitation as a method to generate more natural-sounding speech by using multiple pulses per pitch period to approximate the residual signal, rather than relying on simplistic quasi-periodic impulses or noise.[6] This approach improved speech quality at rates around 9.6 kbps but required determining optimal pulse locations and amplitudes iteratively, which increased complexity while still demanding substantial bitrate for encoding multiple parameters.[6]

Subsequent developments addressed these limitations by structuring the excitation more efficiently. In 1986, Peter Kroon, Ed F. Deprettere, and Rob J. Sluyter proposed regular-pulse excitation (RPE), which arranged pulses at fixed intervals within subframes to reduce the search space for optimal excitation and lower computational demands compared to arbitrary multipulse configurations.[7] RPE evolved the multipulse concept by imposing regularity on pulse positions, enabling better modeling of speech excitation at bitrates near 13 kbps with toll-quality output, and paving the way for vector-based quantization techniques that could capture complex residual patterns more compactly.[7] These precursors highlighted the need for excitation methods that balanced perceptual quality, bitrate efficiency, and feasibility in hardware-constrained environments.

The CELP algorithm was formally proposed in 1985 by Manfred R. Schroeder and Bishnu S. Atal at Bell Laboratories as an enhancement over multipulse LPC, employing an analysis-by-synthesis framework to select excitation vectors from a stochastic codebook that minimized perceptual distortion.[1] By quantizing the innovation sequence via codebook indices rather than explicit pulse parameters, CELP achieved more efficient representation of the speech residual, targeting very low bitrates suitable for digital communications.[1]

A key early challenge in CELP was the high computational complexity associated with exhaustively searching large stochastic codebooks (typically containing 1024 or more random vectors per subframe) to identify the optimal excitation match.[1] This exhaustive search, involving weighted error minimization through synthesis filtering, demanded significant processing power, limiting initial implementations to offline or high-end systems despite the algorithm's promise.[1]

Initial demonstrations in the 1985 proposal showcased CELP's potential, delivering speech quality comparable to toll standards at bitrates of 4.8 to 9.6 kbps, with the core innovation sequence coded at approximately 2 kbps and additional overhead for LPC coefficients and gains.[1] These results marked a breakthrough for secure voice transmission in bandwidth-limited channels, such as military applications.[1]

Standardization Milestones
In 1990, the U.S. Department of Defense adopted Federal Standard 1016 (FS1016), specifying a 4.8 kbps CELP coder for analog-to-digital conversion of radio voice in secure communications applications.[8] This standard, developed jointly by the DoD and AT&T, marked the first formal governmental endorsement of CELP technology, enabling efficient low-bitrate encoding for military voice transmission over constrained channels.[9]

The International Telecommunication Union (ITU-T) advanced CELP standardization in 1992 with Recommendation G.728, which defined Low-Delay CELP (LD-CELP) operating at 16 kbps for real-time telephony. LD-CELP reduced algorithmic delay to 0.625 ms through backward adaptive linear prediction and a fixed codebook, supporting applications in digital circuit multiplication equipment and early packet-switched networks.[3]

A significant milestone occurred in 1996 when ITU-T Recommendation G.729 standardized Conjugate-Structure Algebraic CELP (CS-ACELP) at 8 kbps, providing toll-quality speech for international telephony. CS-ACELP employed a sparse algebraic codebook structure, utilizing four signed pulses positioned across 40 samples to represent excitation efficiently while minimizing perceptual distortion through perceptual weighting.[10]

Building on these foundations, the European Telecommunications Standards Institute (ETSI) incorporated Algebraic CELP (ACELP) in the 1995 GSM Enhanced Full Rate (EFR) codec, operating at 12.2 kbps to enhance speech quality in second-generation mobile networks.[11] EFR, detailed in GSM 06.60, improved upon the original GSM full-rate codec by leveraging ACELP's structured codebook for better robustness against channel errors in cellular environments.[5]

In 2014, the 3rd Generation Partnership Project (3GPP) finalized the Enhanced Voice Services (EVS) codec in Release 12, integrating advanced CELP modes, including enhanced ACELP, for Voice over LTE (VoLTE) applications across bitrates from 5.9 to 128 kbps.[12] EVS extended CELP principles to support super-wideband and fullband audio up to 20 kHz, with CELP modes ensuring backward compatibility and low-latency performance in modern packet-based networks.[13]

These standardization milestones facilitated CELP's widespread adoption in global telecommunications, enabling low-bandwidth mobile speech transmission that supported the proliferation of digital cellular systems like GSM and LTE, thereby reducing spectral demands while maintaining intelligible voice quality for billions of users.[2]

Fundamentals
Linear Predictive Coding Principles
Linear predictive coding (LPC) models speech production by representing the vocal tract as an all-pole filter excited by an input signal that approximates the glottal source. For voiced speech, the excitation consists of quasi-periodic pulses modeling glottal airflow, while unvoiced speech is modeled as filtered white noise; the combined effects of glottal pulse shaping and radiation are incorporated into the filter response.[14] This all-pole approximation effectively captures the spectral envelope, particularly the formants, enabling efficient representation of speech spectra.[15]

The core LPC equation predicts the current speech sample \hat{s}(n) as a linear combination of p previous samples: \hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k), where a_k are the predictor coefficients and p is the model order, typically 10 for speech sampled at 8 kHz to adequately represent formants up to about 4 kHz. The prediction error e(n) = s(n) - \hat{s}(n) serves as the excitation signal; in code-excited linear prediction (CELP), this excitation is quantized by selecting from a codebook to minimize perceptual distortion.[14][15][16]

Predictor coefficients are estimated using the autocorrelation method, which minimizes the mean-squared prediction error over a short speech segment by solving the normal equations \mathbf{R} \mathbf{a} = \mathbf{r}, where \mathbf{R} is the p \times p symmetric Toeplitz autocorrelation matrix with elements R_{ij} = r(|i-j|), \mathbf{a} = [a_1, \dots, a_p]^T, and \mathbf{r} = [r(1), \dots, r(p)]^T derived from the windowed speech autocorrelation r(k) = \sum_m s(m) s(m+k). Due to the Toeplitz structure, the Levinson-Durbin recursion efficiently solves these equations in O(p^2) time, yielding stable filters by ensuring reflection coefficients |k_i| < 1.[14][15]

For transmission in speech coding, LPC coefficients are often converted to reflection coefficients, which naturally enforce stability and facilitate interpolation between frames, or to line spectral pairs (LSPs), which represent the roots of polynomials derived from the predictor and provide a stable, quantized parameterization with good spectral sensitivity for formant preservation.[15][17]

LPC relies on key assumptions: the speech spectrum is stationary over short intervals (typically 5-20 ms), justifying frame-based analysis, and the all-pole model validly approximates the vocal tract transfer function for formants in non-nasal sounds, with poles inside the unit circle ensuring filter stability.[14][15]
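The recursion is compact enough to sketch directly. The following Python fragment (an illustrative sketch; the function and variable names are not taken from any cited standard) computes the autocorrelation of one windowed frame and runs the Levinson-Durbin recursion to obtain the predictor coefficients a_1, ..., a_p:

```python
import numpy as np

def levinson_durbin(frame, order=10):
    """Estimate LPC predictor coefficients a_1..a_p from one windowed frame
    via the autocorrelation method (sketch; names are illustrative)."""
    n = len(frame)
    # Autocorrelation r(k) = sum_m s(m) s(m+k) for k = 0..p
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])

    a = np.zeros(order + 1)      # a[1..p] hold the predictor coefficients
    err = r[0]                   # prediction-error energy E^(0)
    for i in range(1, order + 1):
        # Reflection coefficient k_i; |k_i| < 1 guarantees a stable filter
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] - k * a_prev[i - 1:0:-1]
        err *= (1.0 - k * k)
    return a[1:]                 # so that s_hat(n) = sum_k a_k s(n-k)
```

Because each reflection coefficient of a valid autocorrelation sequence satisfies |k_i| < 1, the synthesis filter built from the returned coefficients is stable by construction.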
Analysis-by-Synthesis Framework

Code-excited linear prediction (CELP) operates within an analysis-by-synthesis framework, a paradigm where the encoder iteratively simulates the decoder's synthesis process to evaluate and select excitation parameters that produce the highest-fidelity reconstruction of the input speech signal. This closed-loop approach ensures that the encoding decisions are optimized directly in the synthesis domain, minimizing perceptual distortion rather than relying on open-loop approximations common in earlier predictive coding methods. Introduced in the seminal work on low-bit-rate speech coding, this framework enables high-quality speech at rates as low as 4.8 kbit/s by exhaustively searching for the best-matching excitation sequence.[18]

At the core of the CELP structure lies the excitation generation mechanism, comprising an adaptive codebook for capturing long-term pitch periodicity via long-term prediction (LTP) and a fixed stochastic codebook for introducing the necessary innovation to model the speech signal's aperiodic components. The adaptive codebook, derived from past excitation segments, represents periodic voiced speech through a delayed and scaled version of previous excitations, while the fixed codebook provides random-like vectors to account for the residual stochastic elements. These codebook outputs are scaled by their respective gains and summed to form the overall excitation signal, which is then passed through the LPC synthesis filter, as detailed in the principles of linear predictive coding, to generate the synthesized speech. This dual-codebook design enhances modeling accuracy for both voiced and unvoiced speech segments.[19][20]

The optimization process centers on minimizing the weighted mean-squared error between the original input speech x(n) and the synthesized output \hat{y}(n). The error signal is computed as e(n) = x(n) - \hat{y}(n), and then filtered through a perceptual weighting filter W(z) to produce the weighted error e_w(n) = W(z)[e(n)], which emphasizes frequency bands more sensitive to human perception, such as formant regions, while de-emphasizing others. The encoder selects the codebook entry (or entries) and gains that minimize \sum e_w^2(n) over the analysis frame or subframe, ensuring the synthesized speech aligns closely with the original in a psychoacoustically relevant manner. This perceptual weighting is crucial for achieving transparent quality at constrained bit rates.[18][20]

The exhaustive nature of the codebook search in this framework imposes significant computational demands, with complexity scaling linearly as O(N) for a codebook of size N, often requiring evaluations of thousands of entries per subframe. Early implementations, such as the original CELP prototype, demanded substantial processing power, equivalent to 125 seconds of Cray-1 supercomputer time per second of speech, highlighting the need for algorithmic efficiencies like sequential searches or approximations in practical systems. These trade-offs have driven ongoing innovations in search strategies to balance quality and real-time feasibility in standards like G.729 and AMR.[18][20]
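A toy closed-loop search makes the framework concrete. The sketch below uses assumed names throughout and a zero-state simplification (real encoders first subtract the synthesis filter's zero-input response from the target); it filters each candidate through a weighted synthesis filter 1/A(z/\gamma) and keeps the entry and gain with the smallest weighted squared error:

```python
import numpy as np
from scipy.signal import lfilter

def codebook_search(target_w, a, codebook, gamma=0.9):
    """Pick the codebook vector and gain minimizing the weighted error.
    `target_w` is the perceptually weighted target; `a` holds a_1..a_p."""
    p = len(a)
    # Weighted synthesis filter 1/A(z/gamma): scale a_k by gamma^k
    den = np.concatenate(([1.0], -a * gamma ** np.arange(1, p + 1)))

    best_idx, best_gain, best_err = -1, 0.0, np.inf
    for i, c in enumerate(codebook):
        y = lfilter([1.0], den, c)              # zero-state synthesis of candidate
        g = np.dot(target_w, y) / np.dot(y, y)  # closed-form optimal gain
        e = np.sum((target_w - g * y) ** 2)     # weighted squared error
        if e < best_err:
            best_idx, best_gain, best_err = i, g, e
    return best_idx, best_gain

# Example: a 1024-entry Gaussian codebook for one 40-sample subframe
# codebook = np.random.randn(1024, 40)
```

The loop makes the O(N) cost visible: every one of the N entries is synthesized and scored, which is what motivated structured (e.g., algebraic) codebooks with fast search rules.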
Encoding Process

LPC Coefficient Estimation
In the CELP encoding process, LPC coefficient estimation begins with preprocessing the input speech signal to enhance its suitability for analysis. A second-order high-pass filter is applied to remove low-frequency noise and emphasize higher spectral components, typically with a cutoff around 80 Hz to mitigate DC offset and respiratory sounds.[20] Following this, the signal is segmented into analysis frames of 20-30 ms duration and windowed using a Hamming window to minimize spectral leakage and ensure smooth transitions between frames.[21] This windowing reduces edge effects in the time-domain signal, preparing it for frequency-domain modeling via linear prediction.

The core of LPC estimation involves computing the autocorrelation function from the windowed speech frames. The autocorrelation coefficients r(k) are calculated as r(k) = \sum_{n=0}^{N-1-k} x(n) x(n+k), where x(n) is the windowed speech sample, N is the frame length, and k = 0, 1, \dots, p, with p denoting the predictor order. These coefficients form the basis for solving the Yule-Walker equations to obtain the LPC coefficients a_k. The Levinson-Durbin recursion is employed for efficient computation, iteratively deriving the coefficients while ensuring numerical stability through reflection coefficients bounded by 1 in magnitude.[22] This algorithm converges in O(p^2) operations, yielding the all-pole filter A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}.

To transmit the LPC coefficients efficiently at low bit rates, they are transformed into line spectral pairs (LSPs), which offer better quantization properties due to their uniform distribution and inherent stability.[21] Quantization typically uses vector quantization (VQ) of the LSP vector or split-VQ, where the LSPs are divided into subvectors and quantized separately to reduce complexity and bit allocation, often 20-38 bits per frame in narrowband CELP systems.[20] For smoothness across frames, interpolated LSPs are generated by linearly combining consecutive frame LSPs, preventing abrupt spectral changes during synthesis.[21]

Filter stability is enforced to ensure all roots of A(z) lie inside the unit circle, avoiding instability in the synthesis filter. This is achieved through bandwidth expansion, where quantized LPC coefficients are modified as a_k' = a_k \gamma^k with \gamma slightly below 1.[23] In narrowband CELP, a 10th-order predictor (p=10) is standard for 8 kHz sampling, with coefficients updated every 5 ms subframe via interpolation from 20-30 ms analysis frames to track vocal tract variations.[20] These estimated coefficients model the short-term spectral envelope, driving the subsequent excitation search in the analysis-by-synthesis loop.
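Tying the preprocessing steps together, a per-frame analysis front end might look like the following sketch (the Butterworth pre-filter, the \gamma = 0.994 expansion factor, and all names are illustrative choices, not taken from any specific standard); scipy's solve_toeplitz runs the Levinson recursion on the Toeplitz normal equations:

```python
import numpy as np
from scipy.signal import butter, lfilter
from scipy.linalg import solve_toeplitz

def analyze_frame(speech, order=10, gamma=0.994, fs=8000.0):
    """Sketch of per-frame LPC analysis: high-pass pre-filter, Hamming
    window, autocorrelation, Toeplitz solve, bandwidth expansion."""
    # Second-order high-pass near 80 Hz to remove DC offset and rumble
    b, a_hp = butter(2, 80.0, btype="highpass", fs=fs)
    x = lfilter(b, a_hp, speech)

    # Hamming window to limit spectral leakage at frame edges
    xw = x * np.hamming(len(x))

    # Autocorrelation r(0..p): r(k) = sum_n x(n) x(n+k)
    n = len(xw)
    r = np.array([np.dot(xw[:n - k], xw[k:]) for k in range(order + 1)])

    # Solve the Toeplitz normal equations R a = r (Levinson recursion inside)
    a = solve_toeplitz(r[:order], r[1:order + 1])

    # Bandwidth expansion a_k' = a_k * gamma^k keeps synthesis poles
    # safely inside the unit circle after quantization
    return a * gamma ** np.arange(1, order + 1)
```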
Codebook Excitation Selection

In code-excited linear prediction (CELP), the adaptive codebook models the long-term correlation in speech due to pitch periodicity by storing segments of past synthesized excitation signals. It is indexed by a pitch lag L (typically ranging from 20 to 120 samples at an 8 kHz sampling rate) and a pitch gain G_p, allowing the encoder to select and scale a delayed version of prior excitation to predict the current frame's periodic component. This approach enhances efficiency at low bit rates by exploiting the quasi-periodic nature of voiced speech.

The fixed codebook provides stochastic or structured innovation to capture the non-periodic residual after long-term prediction, representing random fluctuations in the excitation signal. Entries consist of predefined vectors, such as random Gaussian noise sequences in early CELP designs or algebraic pulse patterns (e.g., four signed pulses at ±1 in ACELP variants) to approximate the speech innovation with sparse representations. The encoder searches this codebook to select the vector c_k(n) that best fits the remaining target signal after the adaptive contribution.[1]

The search procedure operates sequentially: first, the adaptive codebook is searched by maximizing the normalized cross-correlation between the target signal and filtered past excitation candidates over possible lags, yielding the optimal L and G_p. The target is then updated by subtracting the adaptive contribution, and the fixed codebook is searched to find the entry c_k(n) and gain G_c that minimize the squared weighted prediction error \|W(e)\|^2, where W(z) is a perceptual weighting filter emphasizing formant regions. The gains G_p and G_c are computed as correlations between the target and the filtered codebook outputs, normalized by the filtered codebook energies, and jointly quantized to balance periodic and innovative components. The resulting excitation for the current subframe is formed as:

u(n) = G_p \, u(n - L) + G_c \, c_k(n), \quad n = 0, \dots, 39

where u(n - L) is the interpolated past excitation at lag L, and the subframe spans 40 samples. Processing occurs in 5 ms subframes to track rapid pitch variations, with the adaptive codebook updated using the newly synthesized excitation after each subframe. This granular approach ensures accurate modeling of pitch changes, particularly in voiced segments, while keeping computational demands manageable through efficient correlation-based searches.
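The sequential search can be sketched as follows. This is illustrative code under simplifying assumptions: integer lags only (real coders interpolate for fractional lags), periodic extension of lags shorter than the subframe, and hypothetical names throughout; `h_w` is the impulse response of the weighted synthesis filter over one subframe, and `past_exc` must hold at least the maximum lag of history:

```python
import numpy as np

def adaptive_codebook_search(target_w, past_exc, h_w, lag_range=(20, 120)):
    """Integer-lag pitch search: maximize the normalized correlation between
    the weighted target and each filtered past-excitation candidate."""
    n = len(target_w)
    best = (lag_range[0], 0.0, -np.inf)           # (lag L, gain G_p, score)
    for lag in range(lag_range[0], lag_range[1] + 1):
        # Candidate u(n - L); lags shorter than n are extended periodically
        seg = past_exc[-lag:]
        cand = np.tile(seg, n // lag + 1)[:n]
        y = np.convolve(cand, h_w)[:n]            # zero-state weighted synthesis
        num = np.dot(target_w, y)
        den = np.dot(y, y) + 1e-12
        if num * num / den > best[2]:
            best = (lag, num / den, num * num / den)  # G_p = <x,y>/<y,y>
    return best[0], best[1]

def build_excitation(past_exc, lag, g_p, fixed_vec, g_c):
    """Form u(n) = G_p * u(n - L) + G_c * c_k(n) for one subframe."""
    n = len(fixed_vec)
    seg = past_exc[-lag:]
    adaptive = np.tile(seg, n // lag + 1)[:n]
    return g_p * adaptive + g_c * fixed_vec
```

After the adaptive contribution is subtracted from the target, the fixed codebook is searched the same way, and the new u(n) is appended to `past_exc` for the next subframe.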
Noise Weighting Optimization

In code-excited linear prediction (CELP), noise weighting optimization employs a perceptual weighting filter to shape the quantization error spectrum, ensuring that noise is primarily introduced in spectral regions where the human auditory system is less sensitive, such as the valleys between formants, thereby enhancing subjective speech quality based on psychoacoustic principles.[1] This approach leverages the masking properties of the ear: quantization noise in formant peaks (high perceptual importance) is minimized, while noise in less audible areas is tolerated.[24]

The core of this optimization is the perceptual weighting filter W(z), defined as

W(z) = \frac{A(z/\gamma)}{A(z/\gamma')}

where A(z) is the linear prediction analysis filter (the inverse of the synthesis filter), \gamma typically ranges from 0.8 to 1.0 to emphasize formants by broadening their spectral peaks, and \gamma' < \gamma (often around 0.4 to 0.6) dampens the weighting in those regions to control overall emphasis.[25] This formulation results in a filter that amplifies the error signal near formants while attenuating it elsewhere, aligning the noise spectrum with auditory masking thresholds.[26]

For efficient implementation, the filter can be realized in the frequency domain using fast Fourier transforms for subband processing or as a lattice filter structure derived from the LPC coefficients, which ensures stability and low computational complexity. The weighting is applied to both the target signal x_w(n) = W(z)[x(n)], where x(n) is the pre-processed input minus the adaptive codebook contribution, and the synthesized signal y_w(n) = W(z)[y(n)], with the codebook search minimizing the mean-squared weighted error \| x_w(n) - y_w(n) \|^2.[24]

To adapt to varying speech characteristics, the parameters \gamma and \gamma' are often adjusted based on voicing analysis; for unvoiced segments, which exhibit a flatter spectrum, higher values of \gamma (closer to 1.0) are used to reduce formant emphasis and flatten the weighting, preventing over-amplification of noise in broadband regions.[26] This adaptation, typically derived from measures like spectral tilt or open-loop pitch correlation, improves robustness across voiced and unvoiced frames.[27]

By concentrating quantization noise in perceptually masked spectral valleys, this optimization enables more efficient bit allocation, allowing coarser quantization (fewer bits) in low-sensitivity areas while allocating higher precision to formant regions, which is critical for low-bitrate operation in standards like G.729.[28]
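In code, applying W(z) amounts to two coefficient scalings and one pole-zero filter. A minimal sketch, assuming the predictor-coefficient convention used above and illustrative \gamma values:

```python
import numpy as np
from scipy.signal import lfilter

def perceptual_weighting(signal, a, gamma1=0.9, gamma2=0.6):
    """Apply W(z) = A(z/gamma1) / A(z/gamma2) to a signal.
    `a` holds predictor coefficients a_1..a_p of A(z) = 1 - sum a_k z^-k."""
    k = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], -a * gamma1 ** k))   # zeros: A(z/gamma1)
    den = np.concatenate(([1.0], -a * gamma2 ** k))   # poles: A(z/gamma2)
    return lfilter(num, den, signal)
```

Both the target and each synthesized candidate pass through this filter before the error energy is measured, so the search criterion itself is perceptually weighted.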
Decoding Process

Signal Synthesis
In the CELP decoder, the process begins with the reception of quantized parameters transmitted from the encoder, including indices for the LPC coefficients, pitch lag L, adaptive and fixed codebook gains G_p and G_c, and the fixed codebook vector index k. These parameters are dequantized to reconstruct the excitation signal u(n), which combines the periodic component from the adaptive codebook and the stochastic component from the fixed codebook:

u(n) = G_p \, u(n - L) + G_c \, c_k(n),

where u(n - L) is derived from the buffer of previously synthesized excitation, and c_k(n) is the selected entry from the fixed codebook. This reconstruction occurs per subframe, typically within frames of 40 to 160 samples (corresponding to 5-20 ms at an 8 kHz sampling rate), with buffers maintaining the past excitation history for the adaptive codebook update.[29][30]

The synthesized speech signal \hat{s}(n) is then generated by passing the excitation u(n) through the all-pole LPC synthesis filter with transfer function 1/A(z), where A(z) is the inverse filter polynomial formed from the dequantized LPC coefficients, often interpolated across subframes for smoothness. In the time domain this is the recursion

\hat{s}(n) = u(n) + \sum_{k=1}^{p} a_k \hat{s}(n-k).
The LPC coefficients, typically 10th order for narrowband speech, model the spectral envelope and are updated every frame to adapt to changing vocal tract characteristics. Buffer management ensures seamless processing, with the synthesis filter state updated using the newly generated \hat{s}(n) to prepare for the next subframe.[18][30]

To handle frame erasures due to transmission errors, the decoder incorporates robustness mechanisms such as muting the output for severe losses or predictive fill-in using extrapolated parameters from prior frames, preventing abrupt discontinuities in the speech waveform. This maintains perceptual continuity without requiring additional side information. The final output is 8 kHz pulse-code modulation (PCM) speech, suitable for telephony, with low-delay variants like LD-CELP achieving algorithmic latencies under 5 ms to support real-time applications.[3]
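A minimal per-subframe decoder loop, under the same simplifying assumptions as the encoder sketches above (integer lags, periodic extension of short lags, and a hypothetical state layout), mirrors the two reconstruction steps directly:

```python
import numpy as np
from scipy.signal import lfilter

def decode_subframe(state, lag, g_p, g_c, fixed_vec, a, buf_len=147):
    """Reconstruct one subframe from dequantized parameters. `state` is a
    dict carrying the past-excitation buffer and the synthesis-filter
    memory; `a` holds predictor coefficients a_1..a_p. buf_len (147 here)
    is an illustrative history length covering the maximum pitch lag."""
    n = len(fixed_vec)
    # u(n) = G_p * u(n - L) + G_c * c_k(n); extend short lags periodically
    seg = state["past_exc"][-lag:]
    adaptive = np.tile(seg, n // lag + 1)[:n]
    u = g_p * adaptive + g_c * fixed_vec

    # All-pole synthesis 1/A(z); filter memory (zi) carries across subframes
    den = np.concatenate(([1.0], -a))
    s_hat, state["zi"] = lfilter([1.0], den, u, zi=state["zi"])

    # Append the new excitation for the next adaptive-codebook lookup
    state["past_exc"] = np.concatenate((state["past_exc"], u))[-buf_len:]
    return s_hat

# Example initialization for a 10th-order narrowband decoder:
# state = {"past_exc": np.zeros(147), "zi": np.zeros(10)}
```

Carrying the filter memory `zi` across subframes is what the prose calls buffer management: it keeps the synthesis seamless at subframe boundaries, and the appended excitation keeps encoder and decoder adaptive codebooks in step.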