Codec 2
Codec 2 is an open-source speech codec designed for communications-quality speech at low bit rates ranging from 700 to 3200 bits per second, primarily targeting bandwidth-constrained applications such as HF and VHF digital radio.[1] Developed by David Rowe (VK5DGR) and released under the GNU Lesser General Public License (LGPL), it uses sinusoidal coding techniques to compress speech while maintaining intelligibility and naturalness, and in informal subjective listening tests it has outperformed proprietary codecs like MELP at very low bit rates such as 700 bit/s.[2] The codec's architecture supports real-time encoding and decoding on resource-limited devices, making it suitable for amateur radio and emergency communications.[3]
Originating from Rowe's 1997 PhD thesis on speech coding, development of Codec 2 began in 2009 with an initial focus on 2400 bit/s modes, and later added lower rates such as 700 and 1300 bit/s through iterative improvements and community contributions.[2] The project has been supported by grants from Amateur Radio Digital Communications (ARDC), including a $420,000 grant in 2023 to enhance FreeDV integration with commercial radios.[4] It is hosted on GitHub, where ongoing work adds features such as neural network post-filtering to improve quality.[1] Its patent-free status and modular design have facilitated integration with modems such as OFDM and coherent PSK, enabling robust performance over noisy channels.[3]
Codec 2 powers the FreeDV digital voice protocol, which has seen widespread adoption in the amateur radio community since its 2012 launch, with global nets in regions including Australia, the UK, and the USA.[5] Applications extend to software like the FreeDV GUI for Windows, Linux, and macOS, as well as hardware interfaces such as ezDV, supporting activities like monthly worldwide FreeDV days.[5] In controlled evaluations, FreeDV modes using Codec 2 demonstrate speech quality comparable to or better than analog SSB across a range of signal-to-noise ratios, underscoring its role in promoting open-source alternatives to proprietary digital voice systems.[5]
Introduction
Overview
Codec 2 is an open-source speech codec designed for low-bitrate digital voice communications, targeting bit rates from 450 to 3200 bit/s to achieve communications-quality speech in bandwidth-constrained environments.[6] It primarily serves amateur radio, enabling efficient voice transmission over narrow bandwidths in HF and VHF digital modes such as FreeDV.[3] The codec accepts 16-bit linear PCM audio sampled at 8 kHz and encodes it in frames of 20 ms or 40 ms, depending on the selected mode, with internal analysis performed on 10 ms segments.[3][7] Licensed under the GNU Lesser General Public License version 2.1, Codec 2 was developed by David Grant Rowe to provide a patent-free alternative to proprietary low-bitrate codecs.[3]
Codec 2 has received recognition for its innovation, including the 2012 ARRL Technical Innovation Award for advancing digital voice technology in amateur radio and the Linux Australia Conference's Best Presentation Award for Rowe's 2012 talk at linux.conf.au.[8][1]
Development Background
Codec 2 was initiated in 2010 by David Grant Rowe, an Australian electrical engineer specializing in digital signal processing and telecommunications. Rowe earned his PhD in 1997 from the University of Wollongong for a thesis on techniques for harmonic sinusoidal coding of speech signals, which laid the groundwork for efficient low-bitrate representation of voiced speech using sinusoidal oscillators with parametric phase modeling.[9] His experience in digital signal processing includes developing speech codecs and modems for open-source amateur radio projects, such as FreeDV, which integrates Codec 2 with channel modulation techniques for high-frequency (HF) radio transmission.[10]
The primary motivation for Codec 2 stemmed from the limitations of existing open-source speech codecs, such as Speex, which were designed for higher bit rates (typically above 2 kbit/s) and struggled to deliver intelligible speech at the ultra-low rates suited to bandwidth-constrained amateur radio over HF channels. Rowe was particularly inspired by Bruce Perens, a prominent advocate for open-source software and amateur radio, who in 2008 called for the development of patent-free alternatives to proprietary military-grade codecs like MELP, emphasizing the need for accessible, low-complexity solutions for hobbyist and emergency communications.[11] This push aligned with broader efforts to democratize digital voice technology, avoiding the licensing barriers that restricted adoption in non-commercial settings.[5]
Codec 2's foundational research draws heavily on 1980s advances in sinusoidal speech modeling, pioneered by researchers including Robert J. McAulay and Thomas F. Quatieri, whose 1986 work on sinusoidal transform coding introduced methods for decomposing speech into harmonic sinusoids to enable low-bitrate coding while preserving perceptual quality.[9]
Early development benefited from collaboration and support by Jean-Marc Valin, creator of the Speex codec, who provided insights on open-source implementation and integration challenges during initial discussions prompted by Perens.[11] The project's initial goals centered on achieving communications-quality speech (defined as highly intelligible with acceptable distortion) at bit rates below 700 bit/s, while minimizing computational demands so that the codec could run on resource-limited embedded systems such as microcontrollers in radio transceivers.[2]
Technical Specifications
Encoding and Decoding Process
Codec 2 operates on input speech sampled at 8 kHz in PCM format, processing it in frames of 20 ms (160 samples) for the higher bit rate modes (3200 and 2400 bit/s) or 40 ms (320 samples) for the lower bit rate modes. Internal analysis uses shorter windows, such as 10 ms (80 samples) for LPC parameter estimation, to capture the quasi-stationary characteristics of the signal. This allows for efficient parameter estimation while minimizing delay.[9][2]
The process begins with voiced/unvoiced detection for each frame, which classifies the speech segment as periodic (voiced) or aperiodic (unvoiced) to guide subsequent modeling. This classification relies on two primary features: the signal's short-term energy, which is higher in voiced frames due to glottal pulses, and the zero-crossing rate, which is lower for voiced speech owing to its periodic nature compared to the noise-like unvoiced segments. These metrics enable a simple yet effective decision threshold to distinguish frame types without complex computation.[9][2]
For voiced frames, parameter extraction employs a sinusoidal model, representing the speech waveform as a sum of harmonically related sine waves: s(n) = \sum_{m=1}^{M} A_m \cos(\omega_0 m n + \theta_m), where \omega_0 is the fundamental frequency (pitch), A_m are the harmonic amplitudes, \theta_m are the phases, and M is the number of harmonics within the 4 kHz bandwidth. The pitch \omega_0 (typically 50-400 Hz) is estimated using an analysis-by-synthesis approach that minimizes spectral distortion, often via a non-linear pitch detection algorithm for robustness. Harmonic amplitudes are derived from the discrete Fourier transform (DFT) of the windowed frame, averaged over the frequency bins around each harmonic to yield root-mean-square (RMS) magnitudes.
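The harmonic sum above can be evaluated directly. The following is a minimal sketch of that formula only, not Codec 2's actual synthesis code; the function names, the 100 Hz pitch, and the decaying amplitude values are hypothetical choices made for illustration.

```python
import math

FS = 8000  # sample rate in Hz, as stated in the text


def num_harmonics(f0_hz, bandwidth_hz=FS / 2):
    """Number of harmonics M of f0 that lie at or below the bandwidth."""
    return int(bandwidth_hz // f0_hz)


def synthesize_voiced(f0_hz, amplitudes, phases, n_samples):
    """Evaluate s(n) = sum_m A_m cos(w0*m*n + theta_m) for n = 0..n_samples-1."""
    w0 = 2.0 * math.pi * f0_hz / FS  # fundamental in radians per sample
    frame = []
    for n in range(n_samples):
        s = 0.0
        for m, (a, theta) in enumerate(zip(amplitudes, phases), start=1):
            s += a * math.cos(w0 * m * n + theta)
        frame.append(s)
    return frame


# Illustrative parameters (assumptions, not values from Codec 2).
f0 = 100.0
M = num_harmonics(f0)
amps = [1.0 / m for m in range(1, M + 1)]  # made-up decaying spectrum
phs = [0.0] * M                            # zero phases for simplicity
frame = synthesize_voiced(f0, amps, phs, 160)  # one 20 ms frame
```

With a 100 Hz pitch the waveform repeats every 80 samples, so the 160-sample frame holds exactly two pitch periods. A real decoder would also interpolate parameters between frames rather than synthesizing each frame in isolation.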
The spectral envelope, modeling vocal tract resonances, is captured using line spectral pairs (LSPs), which are roots of polynomials derived from linear predictive coding (LPC) coefficients; these provide stable and efficient quantization of the 10th-order filter typically used.[9][2]
Encoding quantizes these extracted parameters into compact fixed-length bit fields, allocating bits to pitch, LSPs, harmonic amplitudes (or energy), and voicing flags without employing entropy coding to maintain low complexity and fixed delay. Vector quantization is applied to LSPs and sometimes amplitude vectors for perceptual optimality, as scalar methods may introduce spectral mismatches; for instance, in the 3200 bit/s mode, parameters are packed into 64 bits per 20 ms frame using multi-stage vector quantizers trained on speech data. Unvoiced frames simplify encoding by modeling noise excitation shaped by the LSP-derived envelope, reducing bit allocation for harmonics.[9][2][3]
Decoding reconstructs the speech by synthesizing the sinusoidal components from the quantized parameters. For voiced frames, the speech is synthesized as a sum of harmonically related sine waves using the quantized pitch, amplitudes (derived from the LSP envelope sampled at harmonics), and phases (modeled continuously across frames via quadratic interpolation or mixed excitation to avoid discontinuities, with overlap-add windowing of adjacent frames). For unvoiced frames, random phase noise is generated and shaped by the spectral envelope derived from LSPs, using an LPC synthesis filter with coefficients a_k obtained by converting LSPs via the relation A(z) = \frac{P(z) + Q(z)}{2}, where P(z) and Q(z) are polynomials with roots at the conjugate pairs of LSP frequencies on the unit circle; transitions between voiced and unvoiced are blended seamlessly.[9][2]
Supported Modes and Bit Rates
Codec 2 operates in several fixed-rate modes tailored to varying bandwidth requirements, ranging from 450 bit/s to 3200 bit/s, each defined by a specific number of bits per frame and frame duration to maintain constant output suitable for channel-constrained applications like HF radio. Higher-rate modes typically employ 20 ms frames, while lower-rate modes extend to 40 ms frames to optimize bit efficiency and reduce synchronization overhead. This structure ensures robust frame synchronization through predictable bit-field packing, where parameters are quantized and arranged in a mode-specific order without variable-length coding. The following table summarizes the supported modes, their bit rates, bits per frame, and frame durations:
| Mode | Bit Rate (bit/s) | Bits per Frame | Frame Duration (ms) |
|---|---|---|---|
| 3200 | 3200 | 64 | 20 |
| 2400 | 2400 | 48 | 20 |
| 1600 | 1600 | 64 | 40 |
| 1400 | 1400 | 56 | 40 |
| 1300 | 1300 | 52 | 40 |
| 1200 | 1200 | 48 | 40 |
| 700 | 700 | 28 | 40 |
| 450 | 450 | 18 | 40 |
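
The columns of the table are tied together by simple arithmetic: bits per frame equals the bit rate multiplied by the frame duration. The sketch below recomputes that column from the other two. The helper names are hypothetical, and the byte count assumes each frame is padded up to a whole number of bytes, which is an illustration rather than a statement about Codec 2's packing.

```python
# Bit rate (bit/s) -> frame duration (ms), copied from the table above.
MODES = {
    3200: 20,
    2400: 20,
    1600: 40,
    1400: 40,
    1300: 40,
    1200: 40,
    700: 40,
    450: 40,
}


def bits_per_frame(bit_rate, frame_ms):
    """Bits carried by one frame at a constant bit rate."""
    return bit_rate * frame_ms // 1000


def bytes_per_frame(bits):
    """Whole bytes needed to hold one frame, rounding up (assumed padding)."""
    return (bits + 7) // 8


def samples_per_frame(frame_ms, sample_rate=8000):
    """PCM samples consumed per frame at the 8 kHz input rate."""
    return sample_rate * frame_ms // 1000


for rate, ms in MODES.items():
    bits = bits_per_frame(rate, ms)
    print(rate, bits, bytes_per_frame(bits), samples_per_frame(ms))
```

For example, the 1300 bit/s mode works out to 1300 × 0.040 = 52 bits for every 320 input samples.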