Algebraic code-excited linear prediction
Algebraic code-excited linear prediction (ACELP) is a speech coding algorithm that builds upon the code-excited linear prediction (CELP) framework by employing an algebraic codebook to represent the fixed excitation component as a sparse combination of signed pulses (±1) at selected positions, enabling efficient low-bit-rate compression while maintaining high speech quality.[1][2] In ACELP, the excitation signal is modeled using both an adaptive codebook for pitch periodicity and a fixed algebraic codebook whose vectors are generated on the fly without storage, typically consisting of 4 to 10 non-zero pulses per subframe to approximate the residual after linear prediction filtering.[1][3] This algebraic approach differs from traditional CELP by using structured permutation codes, such as interleaved single-pulse permutation, to optimize bit allocation and reduce computational complexity, making it suitable for real-time applications.[2][4]

ACELP gained prominence through its adoption in international standards, notably as conjugate-structure ACELP (CS-ACELP) in the ITU-T G.729 codec, which operates at 8 kbit/s on 10 ms frames of 8 kHz sampled speech, achieving toll-quality compression for telephony.[4][2] It has also been integral to codecs such as Adaptive Multi-Rate (AMR) for mobile networks and the Enhanced Voice Services (EVS) standard, providing robust performance in noisy environments and at variable bit rates (e.g., 4.75-12.2 kbit/s for AMR narrowband and up to 128 kbit/s for EVS).[1][5]

Key advantages include the sparsity of the codebook vectors for faster searches, perceptual weighting to enhance naturalness, and adaptability to different channel conditions, though careful pulse positioning is required to avoid perceptual artifacts.[6][7]

Fundamentals
Linear Prediction Basics
Linear predictive coding (LPC) models human speech production according to the source-filter theory, in which the vocal tract acts as a time-varying filter shaping an excitation source into the observed speech signal. The vocal tract is approximated by an all-pole filter with transfer function H(z) = \frac{G}{A(z)} = \frac{G}{1 - \sum_{k=1}^p a_k z^{-k}}, where G is a gain factor, p is the model order, and a_k are the prediction coefficients; the excitation consists of quasi-periodic pulses for voiced speech or noise for unvoiced speech, filtered to produce the spectral envelope.[8][9]

The core of LPC is predicting the current speech sample s(n) as a linear combination of p previous samples, yielding the prediction error (or residual) signal

e(n) = s(n) - \sum_{k=1}^p a_k s(n-k).

This error e(n) represents the excitation driving the filter, and the coefficients a_k are chosen to minimize the mean-squared error over a short-time window of the signal, typically with p = 10 for speech sampled at 8 kHz to capture the formant structure effectively.[9][10]

To derive the LPC coefficients, the autocorrelation method is employed, solving the normal equations \sum_{k=1}^p a_k R(i-k) = R(i) for 1 \leq i \leq p, where R(\cdot) is the autocorrelation function of the windowed speech segment; these Toeplitz equations are efficiently solved using the Levinson-Durbin recursion, which computes the coefficients iteratively via reflection coefficients k_m, reducing complexity from O(p^3) to O(p^2).[9][10]

In speech processing, LPC distinguishes short-term prediction, which models the slowly varying spectral envelope of the vocal tract over frames of 10-30 ms using the all-pole filter, from long-term prediction, which captures pitch-induced periodicity by predicting samples separated by the pitch period to remove harmonic redundancies.[10][9]

CELP Framework
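As a concrete illustration of the autocorrelation method and the Levinson-Durbin recursion from the preceding section, the sketch below estimates the coefficients a_k under the convention A(z) = 1 - \sum_k a_k z^{-k}. The function names are illustrative, not taken from any standard codec.

```python
import numpy as np

def autocorrelation(frame, order):
    """R(0..order) of a (windowed) speech frame."""
    n = len(frame)
    return np.array([np.dot(frame[:n - i], frame[i:]) for i in range(order + 1)])

def levinson_durbin(r, order):
    """Solve sum_k a_k R(i-k) = R(i), 1 <= i <= order, in O(order^2).

    Returns the predictor coefficients a_1..a_p (s[n] ~ sum_k a_k s[n-k])
    and the final prediction-error energy."""
    a = np.zeros(order + 1)      # a[k] multiplies s[n-k]; a[0] is unused
    err = r[0]                   # initial error energy E_0 = R(0)
    for m in range(1, order + 1):
        # Reflection coefficient k_m for order m
        k = (r[m] - sum(a[j] * r[m - j] for j in range(1, m))) / err
        new_a = a.copy()
        new_a[m] = k
        for j in range(1, m):    # update the lower-order coefficients
            new_a[j] = a[j] - k * a[m - j]
        a = new_a
        err *= (1.0 - k * k)     # error energy shrinks at each order
    return a[1:], err
```

For example, for the decaying exponential s(n) = 0.9^n (the impulse response of 1/(1 - 0.9 z^{-1})), the recursion recovers a_1 close to 0.9, with the higher-order coefficients near zero.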
Code-excited linear prediction (CELP) is an analysis-by-synthesis speech coding technique introduced by Manfred R. Schroeder and Bishnu S. Atal in 1985, designed to produce high-quality speech at low bit rates by selecting an optimal excitation sequence from a codebook that minimizes the perceptual difference between the input speech and the synthesized output.[11] The method builds on linear predictive coding by incorporating a codebook-driven excitation model, where the speech signal is synthesized by passing the excitation through a time-varying linear filter representing the vocal tract spectral envelope. This approach allows for efficient representation of both deterministic (pitch-related) and stochastic components of the speech excitation, enabling bit rates as low as 4.8 kbit/s with toll-quality performance.[11]

The CELP framework comprises key components including an adaptive codebook for long-term prediction, which models pitch periodicity by selecting delayed versions of past excitation signals, and a fixed codebook for the short-term innovation, which captures the residual stochastic elements after pitch removal.[2] A perceptual weighting filter, typically expressed as W(z) = \frac{A(z)}{ \hat{A}(z) }, where A(z) is the linear prediction filter and \hat{A}(z) is its bandwidth-expanded counterpart (e.g., \hat{A}(z) = A(z/\gamma) with 0.8 \leq \gamma < 1), shapes the error spectrum to emphasize perceptually important formant regions while attenuating others.[2] The synthesis process involves cascading a long-term predictor (from the adaptive codebook) with the short-term LPC filter, scaled by appropriate gains.
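The bandwidth-expanded filter \hat{A}(z) = A(z/\gamma) is obtained simply by scaling each coefficient a_k by \gamma^k. Below is a minimal sketch of the weighting filter W(z) = A(z)/A(z/\gamma) under the convention A(z) = 1 - \sum_k a_k z^{-k}; the function names are illustrative, and filter states are assumed to start at zero.

```python
import numpy as np

def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): each a_k is replaced by a_k * gamma**k."""
    return a * gamma ** np.arange(1, len(a) + 1)

def perceptual_weight(x, a, gamma=0.92):
    """Apply W(z) = A(z) / A(z/gamma) to signal x,
    with A(z) = 1 - sum_k a_k z^-k and zero initial state."""
    a_hat = bandwidth_expand(a, gamma)
    p = len(a)
    y = np.zeros(len(x))                      # FIR stage: A(z)
    for n in range(len(x)):
        y[n] = x[n] - sum(a[k] * x[n - 1 - k] for k in range(min(p, n)))
    w = np.zeros(len(x))                      # IIR stage: 1 / A(z/gamma)
    for n in range(len(x)):
        w[n] = y[n] + sum(a_hat[k] * w[n - 1 - k] for k in range(min(p, n)))
    return w
```

A useful sanity check on the cascade is that \gamma = 1 makes W(z) = 1, so the filter passes the signal through unchanged.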
The core optimization criterion minimizes the perceptually weighted mean-squared error between the original and synthesized speech, equivalently formulated in the frequency domain as \arg\min \int_{0}^{2\pi} |W(e^{j\omega})(S(e^{j\omega}) - \hat{S}(e^{j\omega}))|^2 \, d\omega, where S(e^{j\omega}) and \hat{S}(e^{j\omega}) denote the z-transforms of the input and output signals evaluated on the unit circle.[2]

This is achieved through an analysis-by-synthesis search procedure that exhaustively evaluates codebook entries: first, the adaptive codebook is searched to find the optimal pitch lag and gain, followed by a search over the fixed codebook to select the best innovation vector and its gain, with the process repeated per subframe (typically 5 ms).[11] Algebraic code-excited linear prediction (ACELP) extends this framework with a structured fixed codebook based on sparse pulses for computational efficiency.[2]

ACELP Specifics
Algebraic Codebook Design
The algebraic codebook represents the primary innovation in algebraic code-excited linear prediction (ACELP), serving as a structured mathematical framework for generating sparse excitation vectors in the fixed codebook of the CELP paradigm. Unlike traditional stochastic codebooks that store precomputed vectors, the algebraic codebook constructs entries on the fly using a limited set of signed unit pulses positioned at specific indices within a subframe, enabling a vast effective codebook size while eliminating the need for large memory storage. This approach was first proposed to address the computational and storage inefficiencies of exhaustive codebook searches in early CELP implementations.[12]

The core structure relies on partitioning the subframe positions into multiple tracks to promote sparsity and facilitate efficient enumeration. For instance, in a typical 40-sample subframe, the codebook may divide positions into 4 tracks (track 0: positions 0, 5, 10, 15, 20, 25, 30, 35; track 1: 1, 6, 11, 16, 21, 26, 31, 36; track 2: 2, 7, 12, 17, 22, 27, 32, 37; track 3: 3, 8, 13, 18, 23, 28, 33, 38 and 4, 9, 14, 19, 24, 29, 34, 39), with exactly one pulse selected per track to avoid overlaps. Each pulse has a binary sign (±1), and the excitation vector c_k is formed as the sum

c_k(n) = \sum_{i=1}^{N} p_i \delta(n - pos_i), \quad 0 \leq n < L,

where N is the number of pulses (often 4), p_i = \pm 1 is the sign of the i-th pulse, pos_i is its position from the selected track, \delta is the Kronecker delta function, and L is the subframe length (e.g., 40). This formulation ensures the vector has exactly N non-zero entries of magnitude 1, providing a sparse ternary representation (+1, 0, -1) that approximates glottal pulses effectively.
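The track layout and the codevector formula above can be made concrete in a few lines. The track table below follows the 40-sample, 4-track layout described in the text; the function names and the particular 17-bit packing order are illustrative rather than the normative G.729 bitstream format.

```python
import numpy as np

# 4-track layout for a 40-sample subframe, as listed above
TRACKS = [
    list(range(0, 40, 5)),                           # track 0: 0, 5, ..., 35
    list(range(1, 40, 5)),                           # track 1: 1, 6, ..., 36
    list(range(2, 40, 5)),                           # track 2: 2, 7, ..., 37
    list(range(3, 40, 5)) + list(range(4, 40, 5)),   # track 3: 16 positions
]

def build_codevector(pos_idx, signs, subframe_len=40):
    """c_k(n) = sum_i p_i * delta(n - pos_i): one signed unit pulse per track.

    pos_idx[i] indexes into TRACKS[i]; signs[i] is +1 or -1."""
    c = np.zeros(subframe_len)
    for track, (p, s) in enumerate(zip(pos_idx, signs)):
        c[TRACKS[track][p]] += s
    return c

def pack_index(pos_idx, signs):
    """Pack 3+3+3+4 position bits and 4 sign bits into one 17-bit index."""
    idx = 0
    for bits, p in zip((3, 3, 3, 4), pos_idx):
        idx = (idx << bits) | p
    for s in signs:
        idx = (idx << 1) | (1 if s > 0 else 0)
    return idx
```

Every codevector therefore has exactly four nonzero samples of magnitude 1, and the full codebook of 2^{17} vectors exists only implicitly through the packed index, never in memory.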
In the ITU-T G.729 standard, this design uses 17 bits for the codebook index: 13 bits for positions (3 bits each for the first three tracks with 8 choices, 4 bits for the fourth track with 16 choices) and 4 bits for signs, yielding an effective size of 2^{17} = 131072 vectors per subframe.[2][13]

The track-based partitioning and pulse selection yield significant advantages in memory efficiency and computational complexity. A codebook indexed with M bits (typically 35-40 for high-resolution excitation) can produce 2^M distinct vectors, but only the pulse positions and signs (the index) need to be encoded and transmitted, requiring no dedicated storage for the full codebook, in contrast with stochastic designs that demand O(2^M) space. Generation and evaluation complexity is further reduced from O(2^M) operations for full enumeration to O(M) per candidate via algebraic computation, making real-time implementation feasible on low-power hardware. These benefits have made algebraic codebooks foundational in standards like G.729, where they contribute to toll-quality speech at 8 kbit/s with minimal overhead.[2][12][13]

Pulse Structure and Search
In ACELP, the fixed codebook is constructed using a sparse set of non-zero pulses, typically 3 to 4 per subframe of 40 samples, with each pulse having a magnitude of ±1 and positions restricted to non-overlapping tracks to ensure efficient representation and searchability. For instance, in the CS-ACELP design standardized in ITU-T G.729, four pulses are used, divided into four interleaved tracks (track 0: positions 0, 5, 10, 15, 20, 25, 30, 35; track 1: 1, 6, 11, 16, 21, 26, 31, 36; track 2: 2, 7, 12, 17, 22, 27, 32, 37; track 3: 3, 8, 13, 18, 23, 28, 33, 38 and 4, 9, 14, 19, 24, 29, 34, 39), with the first three tracks having 8 positions each and the fourth having 16. The signs of the pulses are binary (±1), and the code vector c_k(n) is formed as the sum of the signed unit impulses at the selected positions:

c_k(n) = \sum_{i=0}^{3} s_i \delta(n - m_i),

where s_i = \pm 1 is the sign, m_i is the position in the respective track, and \delta is the Kronecker delta. This structure generates 2^{4} \times 8^{3} \times 16 = 2^{17} possible vectors per subframe, with 13 bits allocated to positions (3 bits each for the first three tracks, 4 bits for the fourth) and 4 bits for signs, for a total of 17 bits transmitted per subframe.[2]

The search for the optimal code vector involves an exhaustive or focused enumeration over pulse positions and signs to maximize the perceptual match between the target signal and the synthesized excitation. After subtracting the adaptive codebook contribution, the resulting target signal t(n) is correlated with candidate code vectors using a nested-loop approach: positions for the first three pulses are searched sequentially by maximizing partial correlations, followed by a sign optimization for the fourth pulse.
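A heavily simplified version of this search can be sketched as follows: precompute the backward-filtered target d(n), then greedily pick one pulse per track where |d(n)| is largest, taking the sign from d(n). This is only a first approximation, since it ignores the cross-terms of the codevector energy that the real nested-loop search accounts for; the names and the track table are illustrative.

```python
import numpy as np

def backward_filter(t, h):
    """d(n) = sum_{i >= n} t(i) h(i - n): the target correlated with the
    impulse response h of the weighted synthesis filter."""
    L = len(t)
    return np.array([sum(t[i] * h[i - n] for i in range(n, L))
                     for n in range(L)])

def greedy_pulse_search(d, tracks):
    """Per track, take the position maximizing |d(n)|, with the pulse sign
    taken from sign(d(n)) there. Ignores the Phi cross-terms of ||c_k||^2,
    so it only approximates the full criterion."""
    pos, signs = [], []
    for track in tracks:
        best = max(track, key=lambda n: abs(d[n]))
        pos.append(best)
        signs.append(1 if d[best] >= 0 else -1)
    return pos, signs
```

The real G.729 search instead evaluates the full ratio between the squared correlation and the codevector energy over a pruned set of candidates, as described below.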
The criterion is to select the code vector c_k that maximizes the ratio \frac{ \langle t, c_k \rangle^2 }{ \| c_k \|^2 }, where \langle t, c_k \rangle = \sum_n t(n) c_k(n) is the correlation and \| c_k \|^2 = c_k^T \Phi c_k with \Phi the autocorrelation matrix of the impulse response; the gain is then computed as g_k = \langle t, c_k \rangle / \| c_k \|^2. This process leverages precomputed correlations d(n) = \sum_i t(i) h(i-n) to evaluate \langle t, c_k \rangle = \sum_i s_i d(m_i), reducing complexity during position selection.

To mitigate the cost of a full search over all 2^{17} (about 1.3 \times 10^5) candidates per subframe, pruning techniques such as focused search and depth-first tree search are employed, limiting the exploration to promising candidates. In the G.729 implementation, a focused approximation pre-selects the first pulse at the position of maximum |d(n)|, then iteratively adds subsequent pulses by testing only those that exceed a dynamic threshold (e.g., the average plus 40% of the range of partial correlations), constraining the final loop to about 180 candidates and reducing complexity by over 90% while maintaining near-toll quality. Additional optimizations include sign preselection based on the sign of d(n) at candidate positions and position focusing to avoid redundant evaluations in correlated impulse responses.[2]

Operation
Encoder Workflow
The encoder in algebraic code-excited linear prediction (ACELP) begins with pre-processing of the input speech signal, which typically involves high-pass filtering to remove unwanted low-frequency components (e.g., a cutoff around 50-140 Hz) and scaling to normalize the signal amplitude, ensuring stability and compatibility with the sampling rate (usually 8 kHz for narrowband speech). Windowing is applied during linear predictive coding (LPC) analysis to minimize spectral leakage, often using a Hamming or asymmetric window centered on the current frame with overlap from previous and future samples.

The pre-processed speech is divided into frames of 10 or 20 ms duration (80 or 160 samples at 8 kHz), typically subdivided into 2 or 4 subframes of 5 ms each (40 samples per subframe), depending on the specific codec (e.g., 10 ms with 2 subframes in G.729, 20 ms with 4 in AMR), to allow frequent updates of excitation parameters while maintaining low delay.[14] This structure balances computational efficiency and speech quality, with LPC parameters updated every frame and excitation searches performed per subframe.

LPC analysis is conducted every 10-20 ms (once per frame) to derive the spectral envelope parameters, modeling the vocal tract as a 10th-order all-pole filter. The analysis uses autocorrelation methods on a windowed segment of speech (e.g., 30-40 ms including look-ahead), yielding predictor coefficients that are converted to line spectral pairs (LSPs) for efficient quantization and interpolation across subframes.

The adaptive codebook search follows LPC analysis and targets the pitch periodicity in the residual signal (after short-term prediction). For each subframe, the encoder searches for the optimal pitch lag (typically 20-120 samples, with fractional resolution) and gain by minimizing the mean-squared error in an analysis-by-synthesis loop, using the past excitation as the adaptive codebook to model the long-term prediction component.
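Real codecs run the pitch search in closed loop on the weighted target, but the preceding open-loop idea can be sketched as picking the lag that maximizes a normalized autocorrelation of the speech over the allowed range. The function name and normalization choice here are illustrative simplifications.

```python
import numpy as np

def open_loop_pitch(s, lag_min=20, lag_max=120):
    """Simplified open-loop pitch estimate: the lag T maximizing
    sum_n s[n] s[n-T] / sqrt(sum_n s[n-T]^2) over the allowed range."""
    best_lag, best_score = lag_min, -np.inf
    for lag in range(lag_min, min(lag_max, len(s) - 1) + 1):
        num = np.dot(s[lag:], s[:-lag])          # correlation at this lag
        den = np.dot(s[:-lag], s[:-lag])         # energy of the delayed part
        if den <= 0.0:
            continue
        score = num / np.sqrt(den)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

For a sinusoid with a 50-sample period this returns 50; an actual encoder then refines such an estimate to fractional lag resolution in the closed-loop adaptive codebook search.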
Subsequently, the fixed codebook search employs the ACELP structure to model the stochastic innovation in the long-term predicted residual. This involves selecting an excitation vector from the algebraic codebook that best matches the target signal after perceptual weighting, optimizing for perceptual quality rather than waveform matching. Gain quantization jointly optimizes the adaptive and fixed codebook gains using vector quantization, often in multiple stages to reduce bits while preserving energy balance. The LPC coefficients (as LSPs) are quantized via predictive vector quantization, exploiting inter-frame correlations to achieve high resolution with limited bits. These quantized parameters form the bitstream for transmission. The decoder uses these parameters to reconstruct the speech via synthesis filtering. The total excitation is computed as

u(n) = g_p \cdot I_p(n) + g_c \cdot c(n),
where g_p and g_c are the quantized adaptive and fixed gains, respectively, I_p(n) is the interpolated adaptive codebook vector, and c(n) is the ACELP fixed codebook vector, for n = 0, 1, \dots, 39 in a subframe. The bit allocation for the 8 kbit/s G.729 codec (10 ms frame, two subframes) totals 80 bits: 18 bits for the LPC parameters (LSPs), 13 bits for the pitch delays (8 bits for the first subframe, 5 differential bits for the second) plus 1 pitch-parity bit, 17 bits per subframe for the ACELP codebook indices (34 bits/frame), and 7 bits per subframe for the jointly vector-quantized gains (14 bits/frame).
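The decoder-side reconstruction and the frame budget can be summarized in a short sketch; the dictionary restates the per-frame G.729 bit counts (to my best knowledge of the standard), and the names are illustrative.

```python
import numpy as np

def total_excitation(g_p, v, g_c, c):
    """u(n) = g_p * v(n) + g_c * c(n) for one 40-sample subframe,
    where v is the adaptive codebook vector and c the algebraic one."""
    return g_p * np.asarray(v, dtype=float) + g_c * np.asarray(c, dtype=float)

# Per-frame bit budget of the 8 kbit/s G.729 codec (10 ms frame, 2 subframes)
G729_BITS = {
    "lsp": 18,                  # LPC parameters as line spectral pairs
    "pitch_delay": 13,          # 8 bits (subframe 1) + 5 differential bits
    "pitch_parity": 1,
    "fixed_codebook": 17 * 2,   # 13 position + 4 sign bits per subframe
    "gains": 7 * 2,             # jointly quantized g_p, g_c per subframe
}
```

The budget sums to 80 bits per 10 ms frame, i.e., 8 kbit/s, matching the allocation given above.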