
MPEG-4 Part 3

MPEG-4 Part 3, officially designated as ISO/IEC 14496-3, is an international standard for the coding of audio within the MPEG-4 framework, providing a comprehensive set of tools for compressing, synthesizing, manipulating, and playing back natural and synthetic audio content. It supports a wide range of applications, from low-bitrate speech coding to high-quality music reproduction and interactive virtual-reality audio, by integrating diverse coding technologies that enable object-based audio, dynamic soundtracks, and flexible synchronization with visual elements. The standard emphasizes perceptual quality and efficiency, making it suitable for streaming, broadcasting, mobile devices, and interactive multimedia environments. Developed by the Moving Picture Experts Group (MPEG) under the Joint Technical Committee ISO/IEC JTC 1/SC 29, the first edition of ISO/IEC 14496-3 was published in 1999, marking a significant advancement over prior MPEG audio standards like MP3 by incorporating more versatile and scalable coding methods. Subsequent editions followed, including the second edition in 2001, the third in 2005, the fourth in 2009, and the current fifth edition in December 2019, which consolidates prior amendments and introduces enhancements for higher bitrates and improved error resilience. These updates have ensured the standard's relevance in evolving multimedia ecosystems, with ongoing support for new audio object types and profiles. At its core, MPEG-4 Part 3 defines the Advanced Audio Coding (AAC) family as its primary perceptual coding toolset, offering superior compression efficiency compared to earlier formats while supporting multichannel and parametric extensions like High-Efficiency AAC (HE-AAC) for low-bitrate scenarios. It also includes specialized codecs for speech and synthetic audio, such as Code Excited Linear Prediction (CELP), Harmonic Vector eXcitation Coding (HVXC), and the Text-to-Speech Interface (TTSI), allowing seamless integration of voice, music, and generated sounds in interactive applications.
Additionally, lossless and scalable options like Audio Lossless Coding (ALS) and Bit-Sliced Arithmetic Coding (BSAC) cater to archival and adaptive streaming needs, respectively. This modular structure—organized around Audio Object Types (AOTs) as specified in subclause 1.5.1.1—facilitates broad interoperability across devices and networks.

Development and Versions

Standardization History

The development of MPEG-4 Part 3, formally known as ISO/IEC 14496-3, began within the broader MPEG-4 standardization effort initiated by the Moving Picture Experts Group (MPEG) under ISO/IEC JTC1/SC29/WG11. The initial proposal for audio components emerged in 1996 from the MPEG Audio subgroup, aligning with the overall MPEG-4 Systems framework outlined in Part 1 (ISO/IEC 14496-1), which emphasized object-based multimedia representation. This proposal responded to a call for technologies supporting natural, synthetic, and hybrid audio coding, including tools for text-to-speech (TTS) synthesis and structured audio representations, to enable interactive and scalable audio delivery in emerging digital environments. Key milestones advanced rapidly following the 1996 call for proposals, which solicited submissions for efficient audio coding with functionalities like error resilience and scalability. Following evaluations, the core specification was finalized, leading to publication of the first edition in December 1999. These steps involved iterative testing and integration of natural audio coders like Advanced Audio Coding (AAC) alongside synthetic tools such as TTS interfaces and precursors to spatial audio object coding (SAOC). Major contributions to the standard came from leading research institutions and companies, including Fraunhofer IIS, which provided foundational perceptual coding expertise from prior MPEG efforts, and Dolby Laboratories, focusing on multichannel and efficiency enhancements, alongside other industry partners contributing scalable and synthetic audio innovations. These collaborators submitted key proposals during the 1996-1999 evaluation phases, ensuring compatibility with diverse applications from streaming to broadcast. The standard integrated synthetic audio tools early, allowing seamless mixing of generated speech and sound objects with natural recordings, as specified in the core experiments for hybrid coding. Subsequent evolution through amendments expanded the scope to include 3D audio elements, addressing immersive soundscapes.
Amendment 1 in 2003 added spectral band replication for HE-AAC, and Amendment 2 in 2004 added parametric stereo and SSC, laying the groundwork for spatial processing, while later updates incorporated precursors to spatial audio object coding for object-based 3D rendering, with full SAOC specified in a separate standard (ISO/IEC 23003-2). These developments maintained backward compatibility with Edition 1 while adapting to growing demands for virtual reality and multichannel playback.

Editions and Amendments

The first edition of ISO/IEC 14496-3, published in 1999, established the core specification for coding natural and synthetic audio within the MPEG-4 framework, supporting a range of applications from speech to music and text-to-speech synthesis. A second edition followed in 2001, technically revising the initial version and incorporating Amendment 1 from 2000 along with corrections to enhance robustness and compatibility. Subsequent amendments to the 2001 edition introduced key enhancements for efficiency and spatial rendering. Amendment 1 in 2003 added Spectral Band Replication (SBR), enabling the High-Efficiency AAC (HE-AAC) toolset for bandwidth extension at low bitrates. Amendment 2 in 2004 further extended this with parametric stereo coding and SinuSoidal Coding (SSC), supporting parametric representations of high-quality audio. The third edition, released in 2005, integrated these advancements while expanding the framework for additional object types and profiles. Amendments to the 2005 edition focused on lossless and scalable coding. Amendment 2 in 2006 introduced Audio Lossless Coding (ALS) for bit-exact reconstruction of high-resolution audio, alongside extensions to Bit-Sliced Arithmetic Coding (BSAC). Amendment 3 in 2006 added Scalable Lossless Coding (SLS), allowing progressive refinement from lossy to lossless quality. The fourth edition in 2009 consolidated these updates, providing a unified structure for diverse audio coding needs, including synthetic and parametric elements. Later developments emphasized unified coding for speech and enhanced spatial features. Amendment 3 to the 2009 edition in 2012 incorporated Unified Speech and Audio Coding (USAC), optimizing performance for low-bitrate speech transmission while maintaining compatibility with higher-quality audio. The fifth edition, published on December 11, 2019, technically revised the 2009 version, integrating all prior amendments and refinements for broader synchronization and post-production capabilities.
As of 2025, no major new editions have been issued beyond the fifth edition, though Amendment 1 to the 2019 edition, addressing media authenticity and immersive interchange formats, is advancing through ISO/IEC ballot stages, with references to synergies with MPEG-H Part 3 (ISO/IEC 23008-3) for audio extensions.

Technical Structure

Subparts

MPEG-4 Part 3, formally known as ISO/IEC 14496-3, is organized into a modular structure comprising 12 subparts that define various audio coding tools for natural, synthetic, and parametric audio signals within a unified framework. This division enables flexible integration and extension of specific functionalities while sharing a common bitstream syntax. Subpart 1 (Main) establishes the core specification, including syntax elements, the bitstream format, audio object types, profiles, levels, multiplexing, error resilience, and interfaces to the MPEG-4 systems layer. Subpart 2 specifies speech coding using Harmonic Vector eXcitation Coding (HVXC) for low-bitrate speech (2–4 kbit/s). Subpart 3 details speech coding using Code Excited Linear Prediction (CELP), supporting scalable and error-resilient speech coding. Subpart 4 covers General Audio Coding tools, including Advanced Audio Coding (AAC) for perceptual coding of music and general audio, TwinVQ for speech-like signals, and Bit-Sliced Arithmetic Coding (BSAC) for scalable coding. Subpart 5 describes Structured Audio (SA) for algorithmic synthesis using SAOL (the Structured Audio Orchestra Language) and SASL (the Structured Audio Score Language), enabling dynamic sound generation. Subpart 6 addresses the Text-to-Speech Interface (TTSI) for integrating synthesized speech from text parameters. Subpart 7 specifies Parametric Audio Coding using Harmonic and Individual Lines plus Noise (HILN) for low-bitrate structured audio. Subpart 8 provides tools for Parametric Coding of High Quality Audio, including SinuSoidal Coding (SSC) and parametric stereo. Subpart 9 integrates MPEG-1 and MPEG-2 audio into the MPEG-4 framework. Subpart 10 describes Lossless Coding of Oversampled Audio using Direct Stream Transfer (DST). Subpart 11 covers Audio Lossless Coding (ALS) for high-fidelity compression without loss. Subpart 12 details Scalable Lossless Coding (SLS), extending lossy coding with scalability up to lossless quality. These subparts interconnect through the shared syntax in Subpart 1, with tools invoked via audio object types for hybrid configurations.

Audio Object Types

MPEG-4 Part 3 defines a set of audio object types (AOTs) as modular coding tools that enable the compression and representation of natural, synthetic, and hybrid audio signals within a unified framework. These object types serve as identifiers for specific encoding methods, allowing decoders to select appropriate tools for rendering diverse audio content, from speech and music to generated sounds and spatial scenes. Each AOT is assigned a numeric identifier encoded in the bitstream using a 5-bit field in the AudioSpecificConfig structure, with an escape mechanism for values exceeding 31, effectively supporting up to 8-bit codes (e.g., 0x02 for AAC LC). This design facilitates interoperability across applications like streaming, broadcasting, and interactive media. Supported sampling rates range from 8 kHz to 96 kHz, accommodating low-bitrate speech up to high-fidelity multichannel audio. Natural audio object types focus on waveform-based coding for speech and general audio signals, emphasizing perceptual quality and efficiency at varying bitrates. Key examples include AAC (AOT 2 for Low Complexity, widely used for its balance of quality and complexity), TwinVQ (AOT 7, a transform-based coder optimized for speech-like signals at low bitrates around 2-8 kbit/s), and BSAC (AOT 22 for Error Resilient Bit-Sliced Arithmetic Coding, providing fine-grained scalability in 1 kbit/s steps per channel for adaptive streaming). These tools integrate perceptual noise substitution and error resilience features to handle transmission errors in networked environments. Synthetic audio object types enable the generation of sounds from structured representations, supporting applications in interactive multimedia and virtual environments. TTSI (AOT 12, Text-to-Speech Interface) interfaces with external synthesizers to produce speech from text parameters, ideal for accessibility features and low-bandwidth narration.
HILN (AOT 26 for Error Resilient Harmonic and Individual Lines plus Noise) parametrically models audio as harmonics, tonal lines, and noise, achieving very low bitrates (e.g., 4-8 kbit/s) for structured synthesis like tones or environmental effects. These types prioritize compactness over waveform fidelity, allowing reconstruction via algorithmic means. Hybrid audio object types combine waveform and parametric elements for advanced spatial and scalable coding, enhancing immersion and flexibility. Parametric coding (AOT 27 for Error Resilient Parametric) decomposes signals into sinusoids, noise, and transients for high-quality low-bitrate representation (e.g., 6-18 kbit/s), suitable for music over narrowband channels. SAOC (AOT 43, Spatial Audio Object Coding) enables interactive rendering of multiple audio objects in a downmix with metadata, supporting user-controlled panning and effects for up to 128 channels. Recent extensions include USAC (AOT 42, Unified Speech and Audio Coding), which unifies speech and audio tools in a scalable framework compatible with AAC, SBR, and parametric enhancements for bitrates from 6.2 kbit/s upward, addressing both conversational and immersive use cases. These hybrids often layer with base types like AAC for bandwidth adaptation. The following table summarizes key audio object types by category, highlighting their primary roles and identifier codes:
| Category  | AOT ID | Name          | Role Overview                                                                                   |
|-----------|--------|---------------|-------------------------------------------------------------------------------------------------|
| Natural   | 2      | AAC LC        | Perceptual coding for general audio and speech, supporting stereo to 5.1 channels at 12-96 kHz. |
| Natural   | 7      | TwinVQ        | Low-bitrate speech coding using frequency-domain vector quantization.                           |
| Natural   | 22     | ER BSAC       | Scalable coding with arithmetic entropy coding for error-prone channels.                        |
| Synthetic | 12     | TTSI          | Parameter interface for text-driven speech synthesis.                                           |
| Synthetic | 26     | ER HILN       | Parametric modeling of harmonics, lines, and noise for synthetic tones.                         |
| Hybrid    | 27     | ER Parametric | Sinusoidal/noise decomposition for scalable high-quality audio.                                 |
| Hybrid    | 42     | USAC          | Unified scalable coding bridging speech and music domains.                                      |
| Hybrid    | 43     | SAOC          | Object-based spatial coding with interactive downmix rendering.                                 |
This catalog assumes familiarity with bitstream syntax from subparts 1-4; object types can be combined in profiles for application-specific configurations, such as HE-AAC using AOT 2 with SBR (AOT 5).
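The 5-bit AOT field with its escape mechanism can be illustrated with a short Python sketch. The helper names (`write_aot`, `read_aot`) are hypothetical, but the rule they implement is the one described above: values below 32 fit in 5 bits, while larger values are signaled by the escape value 31 followed by 6 bits carrying the value minus 32.

```python
def write_aot(aot: int) -> str:
    """Encode an Audio Object Type as a bit string using the 5-bit
    AudioSpecificConfig field plus the escape extension for AOT > 31."""
    if aot < 32:
        return format(aot, "05b")
    # escape value 31, then 6 additional bits carrying (aot - 32)
    return format(31, "05b") + format(aot - 32, "06b")

def read_aot(bits: str) -> int:
    """Decode an AOT from a bit string, honoring the escape value."""
    aot = int(bits[:5], 2)
    if aot == 31:
        aot = 32 + int(bits[5:11], 2)
    return aot
```

For instance, AAC LC (AOT 2) occupies exactly five bits, while USAC (AOT 42) needs the escaped 11-bit form.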

Profiles and Configurations

Audio Profiles

MPEG-4 Part 3 defines audio profiles as predefined combinations of audio object types, levels, and tools tailored to specific applications such as streaming, broadcasting, and communication, ensuring efficient encoding and decoding for various bitrates and quality requirements. These profiles bundle compatible coding technologies to support interoperability across devices, allowing decoders to recognize and process bitstreams based on standardized configurations. Speech profiles in MPEG-4 Part 3 prioritize low-bitrate encoding for voice-centric applications like telephony and voice messaging. The Speech Audio Profile operates at 2-24 kbps, leveraging Code Excited Linear Prediction (CELP) or Harmonic Vector eXcitation Coding (HVXC) object types to achieve narrowband (8 kHz sampling, 100-3800 Hz bandwidth) or wideband (16 kHz sampling, 50-7000 Hz bandwidth) speech quality suitable for mobile and low-bandwidth environments. Additionally, Unified Speech and Audio Coding (USAC)-based profiles, such as the Extended High Efficiency AAC Profile, extend this capability for mobile use cases, providing scalable coding with low delay and error resilience at similar bitrates. Audio profiles focus on natural sound reproduction for music and general multimedia. The Natural Audio Profile employs AAC Low Complexity (AAC-LC) at bitrates of 96-384 kbps, supporting stereo and multichannel configurations for general-purpose streaming and storage. The High Quality Audio Profile supports advanced tools including AAC and extensions like Spectral Band Replication (SBR), as in High-Efficiency AAC (HE-AAC), to deliver near-transparent quality at lower bitrates or higher fidelity, with post-2005 extensions like HE-AAC version 2 enabling efficient stereo coding for broadcasting. Within these profiles, levels from 1 to 6 define capabilities based on channel count, sampling rate, and decoder complexity; for example, Level 1 supports mono audio at low sample rates (e.g., 8 kHz), while Level 6 accommodates multichannel (up to 5.1) high-fidelity audio at 48 kHz or higher.
Profile information, including object types and levels, is signaled in the bitstream through the AudioSpecificConfig structure, which specifies parameters like sampling frequency and channel configuration to guarantee interoperability and seamless playback across diverse devices and networks. The standard defines several audio profiles, including the Speech Audio Profile, Synthetic Audio Profile, Scalable Audio Profile, Main Audio Profile, High Quality Audio Profile, Low Delay Audio Profile, Natural Audio Profile, and Mobile Audio Internetworking Profile, with additional profiles like the AAC Profile, High Efficiency AAC Profile, and High Efficiency AAC v2 Profile added in later amendments.
| Profile Type               | Example Object Types  | Typical Bitrate (kbps) | Levels                     | Use Cases                        |
|----------------------------|-----------------------|------------------------|----------------------------|----------------------------------|
| Speech Audio Profile       | CELP, HVXC            | 2-24                   | 1-2 (mono, low sampling)   | Telephony, voice communication   |
| Natural Audio Profile      | AAC-LC                | 96-384                 | 1-4 (stereo, up to 48 kHz) | Streaming, general multimedia    |
| High Quality Audio Profile | AAC + SBR (HE-AAC v2) | 32-128 (enhanced)      | 3-6 (multichannel)         | Broadcasting, music distribution |
| Extended HE-AAC Profile    | USAC                  | 8-96                   | 1-4 (scalable, low delay)  | Mobile, unified speech/audio     |
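The AudioSpecificConfig signaling described above can be sketched in Python. The layout assumed here is the commonly documented one (5-bit object type, 4-bit sampling-frequency index, 4-bit channel configuration); the escape codes for object type and sampling frequency, and all later extension fields, are deliberately ignored, so this is an illustrative parser rather than a conformant one.

```python
# Sampling frequencies addressed by the 4-bit samplingFrequencyIndex
SAMPLING_FREQS = [96000, 88200, 64000, 48000, 44100, 32000,
                  24000, 22050, 16000, 12000, 11025, 8000, 7350]

def parse_asc(data: bytes):
    """Parse the leading fields of an AudioSpecificConfig:
    5-bit audioObjectType, 4-bit samplingFrequencyIndex,
    4-bit channelConfiguration (simplified; no escape handling)."""
    bits = int.from_bytes(data, "big")
    nbits = len(data) * 8
    aot = (bits >> (nbits - 5)) & 0x1F
    sf_index = (bits >> (nbits - 9)) & 0x0F
    chan_cfg = (bits >> (nbits - 13)) & 0x0F
    return aot, SAMPLING_FREQS[sf_index], chan_cfg
```

Applied to the common two-byte configuration `0x12 0x10`, this yields AAC LC (AOT 2), 44.1 kHz, and a stereo channel configuration.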

Object Type Combinations

MPEG-4 Part 3 enables flexible combinations of audio object types within a single bitstream, allowing the multiplexing of diverse tools to support advanced applications such as immersive and interactive audio experiences. These combinations extend beyond predefined profiles by layering multiple object types, facilitating setups that blend natural and synthetic audio elements while maintaining compliance with the standard's syntax and semantics. One key application involves hybrid natural-synthetic combinations, where natural audio coding like Advanced Audio Coding (AAC, object type 2) is paired with synthetic tools such as the Text-to-Speech Interface (TTSI, object type 12). This setup supports the synthesis of speech from text data alongside natural soundtracks, incorporating prosodic parameters for enhanced quality in scalable bitstreams, suitable for interactive content. Scalable combinations leverage Bit-Sliced Arithmetic Coding (BSAC, object type 22 in its error-resilient form) layered atop AAC to provide fine-grain scalability. BSAC enables graceful degradation by adding enhancement layers at increments of approximately 1 kbit/s per channel for mono and 2 kbit/s for stereo, allowing bitstream extraction for varying network conditions without significant quality loss. For spatial audio, Spatial Audio Object Coding (SAOC, object type 43) is integrated with AAC-based codecs like High-Efficiency AAC (HE-AAC), embedding parametric object metadata into the core AAC stream (object type 2 or extensions). This combination supports object-based rendering for multi-channel or binaural playback, enabling interactive adjustment of audio elements in virtual soundscapes. Examples of such combinations include audio systems using Unified Speech and Audio Coding (USAC, object type 42) combined with SAOC for virtual reality (VR) and augmented reality (AR) applications. This pairing delivers low-latency, immersive sound with efficient bandwidth usage, supporting object manipulation for personalized listening environments.
Constraints on these combinations are governed by the standard's bitstream structure, which supports multiplexing of multiple object types but imposes practical limits based on profile and level definitions and target bitrates, typically ranging from 6 kbit/s to several hundred kbit/s per channel depending on the tools used. Non-standard but compliant mixes are achievable through scene description tools like AudioBIFS, allowing custom assemblies for specialized interactive audio scenes without violating the core syntax.

Core Coding Technologies

Advanced Audio Coding (AAC)

Advanced Audio Coding (AAC) serves as the foundational perceptual audio coding technology within MPEG-4 Part 3, enabling efficient compression of natural audio signals such as music and speech at various bitrates while maintaining high perceptual quality. Developed as an extension of the MPEG-2 AAC standard, it incorporates advanced tools for multichannel audio, block switching, and noise shaping to outperform earlier formats like MP3 in terms of compression efficiency and audio fidelity. In MPEG-4 Part 3, AAC is defined as Audio Object Type 2, supporting sampling rates up to 96 kHz and channel configurations from mono to 48 channels, making it versatile for multimedia applications. The core principles of AAC revolve around a perceptual approach that exploits the limits of human hearing to allocate bits efficiently. Audio input is transformed into the frequency domain using a modified discrete cosine transform (MDCT) filterbank, which provides critical sampling and perfect reconstruction for overlapping windows. The MDCT operates with window sizes of 2048 samples for long blocks, yielding 1024 spectral coefficients, or 256 samples for short blocks, yielding 128 coefficients, allowing adaptive block switching to handle transient signals effectively. A psychoacoustic model analyzes the input signal to compute masking thresholds across critical bands, guiding the quantization and bit allocation process to minimize audible distortion by shaping quantization noise below these thresholds. Quantization in AAC applies a scalar quantizer to the MDCT coefficients, scaled by a global gain and per-band scalefactors to control noise distribution. In simplified form, the quantized value for each coefficient X(k) is given by: \text{quant}(X(k)) = \text{sgn}(X(k)) \cdot \left\lfloor |X(k)| \cdot 2^{\frac{\text{global\_gain} - \text{scf}(n)}{4}} + 0.5 \right\rfloor where \text{sgn} is the sign function, \text{scf}(n) is the scalefactor for band n, and \text{global\_gain} adjusts the overall quantization step size.
This process, combined with Huffman coding for entropy reduction, ensures efficient representation of the spectral data. The filterbank relies on the MDCT for its computational efficiency and critical sampling, with optional mid-side stereo processing to exploit inter-channel correlations. AAC in MPEG-4 Part 3 includes several variants tailored to different complexity and performance needs. The Main profile (Object Type 1) incorporates advanced features such as backward-adaptive prediction alongside temporal noise shaping (TNS) for pre-echo control, enabling higher quality at lower bitrates but increasing encoder/decoder complexity. The AAC Long Term Prediction (LTP) profile (Object Type 4) adds LTP for redundancy reduction in tonal signals on top of the Low Complexity toolset. In contrast, the Low Complexity (LC) profile (Object Type 2) simplifies the toolset by omitting prediction and limiting the TNS order, making it suitable for real-time applications with reduced computational demands while retaining core perceptual coding capabilities. For stereo audio, AAC typically operates in the bitrate range of 64–320 kbps, balancing quality and bandwidth; at these rates, it delivers transparent quality comparable to uncompressed CD audio for most content. This foundational coder forms the base layer for extensions like High-Efficiency AAC in MPEG-4 Part 3.
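The scalefactor-band quantization formula above can be sketched in Python. This follows the simplified expression given in the text; the normative AAC quantizer additionally applies a 3/4-power companding before rounding, which is omitted here. The helper `band_of` (mapping a coefficient index to its scale factor band) is a hypothetical stand-in for the band tables defined in the standard.

```python
import math

def quantize(coeffs, global_gain, scalefactors, band_of):
    """Quantize MDCT coefficients per the simplified formula:
    quant(X(k)) = sgn(X(k)) * floor(|X(k)| * 2**((gg - scf(n))/4) + 0.5),
    where n = band_of(k) selects the scale factor band."""
    out = []
    for k, x in enumerate(coeffs):
        step = 2.0 ** ((global_gain - scalefactors[band_of(k)]) / 4.0)
        out.append(int(math.copysign(math.floor(abs(x) * step + 0.5), x)))
    return out
```

With a global gain equal to the scalefactor, the step collapses to 1 and the quantizer reduces to rounding toward the nearest integer with sign preserved.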

High-Efficiency AAC (HE-AAC)

High-Efficiency AAC (HE-AAC) is an extension of the Advanced Audio Coding (AAC) format within MPEG-4 Part 3, designed to achieve high-quality audio at significantly lower bitrates by incorporating techniques for bandwidth extension and parametric stereo coding. It builds on AAC as the core low-frequency coder, adding Spectral Band Replication (SBR) in HE-AAC version 1 and Parametric Stereo (PS) in version 2 to enable efficient compression for applications like mobile devices and internet streaming. Standardized in ISO/IEC 14496-3 amendments, HE-AAC targets bitrates below 64 kbps while maintaining perceptual quality close to that of higher-bitrate AAC. Spectral Band Replication (SBR) reconstructs the high-frequency content of an audio signal from the low-frequency core encoded by AAC, using a parametric approach that transmits only side information rather than full spectral data. The SBR process employs a 64-band Quadrature Mirror Filterbank (QMF) to analyze the lowband signal and generate highband replicas through transposition, where spectral components are copied and shifted upward. High frequencies are then filled with a combination of noise and tonal elements, shaped by transmitted noise-floor and tonality parameters that adjust the spectral tilt and fine structure. This hybrid filling is governed by a gain factor applied per subband and time slot, expressed as: G(s,f) = \alpha \cdot H(s,f) + (1 - \alpha) \cdot N(s,f) where G(s,f) is the gain for subband s at time slot f, H(s,f) represents the harmonic (transposed) component, N(s,f) the noise component, and \alpha a mixing coefficient (0 ≤ α ≤ 1) determined by the signal's tonal characteristics. SBR data is framed separately from the AAC core, adding 1-3 kbps of overhead per channel, enabling effective bandwidth extension up to 16 kHz or higher at low bitrates. Parametric Stereo (PS) enhances HE-AAC version 2 by efficiently coding stereo signals at ultra-low bitrates, downmixing the stereo input to a mono signal encoded by the AAC core while transmitting compact side information for reconstruction at the decoder.
PS parameterizes the spatial image with inter-channel parameters, including Inter-channel Intensity Difference (IID) for amplitude disparities between left and right channels, and Inter-channel Correlation (ICC) for phase and coherence differences across frequency bands. Additional parameters handle phase differences and all-pass decorrelation filtering to simulate stereo width. This approach adds minimal overhead (typically under 2 kbps), allowing the decoder to upmix the mono stream into a stereo output that preserves spatial cues without encoding full stereo waveforms. HE-AAC achieves near-CD quality for stereo audio at bitrates of 24-48 kbps, representing a 50% or greater reduction compared to AAC alone for equivalent perceptual performance, as validated in listening tests like MUSHRA. For instance, at 32 kbps stereo, HE-AAC v2 delivers quality comparable to 128 kbps MP3, making it suitable for bandwidth-constrained scenarios. Its adoption is widespread in mobile telecommunications (e.g., 3GPP standards for UMTS and LTE), digital broadcasting (DVB-H, DAB+, DRM), and adaptive streaming protocols like MPEG-DASH, where it supports seamless bitrate switching in platforms such as YouTube and iTunes radio.
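The highband mixing rule G(s,f) = α·H(s,f) + (1 − α)·N(s,f) amounts to a per-element blend of the transposed and noise components. The sketch below illustrates that blend over a grid of subbands and time slots; it is not the normative SBR envelope-adjustment math, which derives separate gains from the transmitted noise-floor and tonality data, and a single `alpha` is assumed here for simplicity.

```python
def mix_highband(harmonic, noise, alpha):
    """Blend the transposed (harmonic) and noise components per
    subband s (rows) and time slot f (columns), following
    G = alpha*H + (1 - alpha)*N with 0 <= alpha <= 1."""
    return [[alpha * h + (1 - alpha) * n for h, n in zip(h_row, n_row)]
            for h_row, n_row in zip(harmonic, noise)]
```

At α = 1 the highband is purely the transposed replica (strongly tonal signals); at α = 0 it is pure shaped noise.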

Specialized Coding Methods

Bit-Sliced Arithmetic Coding (BSAC)

Bit-Sliced Arithmetic Coding (BSAC) is a scalable audio coding tool standardized in MPEG-4 Part 3 (ISO/IEC 14496-3) as an alternative noiseless coding method that provides fine-grained bitrate scalability for general audio signals. It enables the creation of layered bitstreams where a base layer delivers basic audio quality, and successive enhancement layers progressively improve fidelity in small increments, facilitating adaptive streaming and quality-of-service adjustments in bandwidth-constrained environments. The core principle of BSAC involves bit-slicing the quantized spectral coefficients derived from the modified discrete cosine transform (MDCT) in AAC, ordering them by significance from most significant bits (MSBs) to least significant bits (LSBs) across frequency bands. These bit-planes are then entropy-coded using an arithmetic coder, which assigns probabilities to bit patterns within each slice to achieve compression efficiency while maintaining scalability; this top-down approach allows partial decoding of higher-importance bits for acceptable audio quality even if lower bits are discarded. Unlike traditional Huffman coding in AAC, BSAC's arithmetic coding replaces the noiseless coding of both spectral data and scale factors, using a universal model that avoids the need for multiple codebooks by treating bits in a progressive manner. BSAC supports fine-grained scalability through multiple enhancement layers, with each layer adding approximately 1 kbit/s per audio channel (e.g., 2 kbit/s for stereo), enabling extraction of lower-bitrate subsets from a single bitstream without re-encoding. The base layer provides core AAC-like functionality at minimal rates, while enhancement layers build upon it via bit-plane additions, supporting up to several dozen slices per frame depending on bitrate and desired granularity. This structure enhances error resilience by partitioning codewords across slices, adding minimal overhead (less than 1%) for robust transmission over error-prone channels.
Built on the AAC core framework, BSAC retains AAC's perceptual model, filterbank, and quantization but modifies the entropy stage for scalability, achieving comparable efficiency to non-scalable AAC at the highest bitrate while enabling seamless bitrate adaptation. It operates effectively in the 16–64 kbit/s per channel range, making it suitable for speech or music content in early mobile and streaming applications. Originally developed for streaming services and quality-adaptive delivery in mobile networks, BSAC saw adoption in early 2000s applications like 3G audio transmission but has become less prevalent following the rise of High-Efficiency AAC (HE-AAC), which offers better efficiency at low bitrates without explicit layering.
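The MSB-first bit-plane idea behind BSAC can be demonstrated with a minimal Python sketch, assuming quantized coefficient magnitudes are sliced plane by plane (signs, arithmetic coding, and the band-wise scan order are omitted). Discarding lower planes models bitstream truncation for scalability.

```python
def bit_slices(quantized, num_planes):
    """Slice magnitudes of quantized spectral values into bit-planes,
    most significant plane first, as done before arithmetic coding."""
    planes = []
    for p in range(num_planes - 1, -1, -1):        # MSB plane first
        planes.append([(abs(q) >> p) & 1 for q in quantized])
    return planes

def reconstruct(planes, num_planes, kept):
    """Rebuild magnitudes from only the first `kept` (most significant)
    planes; kept < num_planes models a truncated, lower-rate stream."""
    values = [0] * len(planes[0])
    for i in range(kept):
        p = num_planes - 1 - i
        values = [v | (b << p) for v, b in zip(values, planes[i])]
    return values
```

Keeping all planes is exact; dropping the LSB plane leaves each magnitude within one quantizer step of its full-rate value, which is the graceful-degradation behavior described above.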

AAC Scalable Sampling Rate (AAC-SSR)

AAC Scalable Sampling Rate (AAC-SSR) is an extension of Advanced Audio Coding (AAC) defined in MPEG-4 Part 3, designed to enable scalable audio delivery by supporting variable sampling rates in environments with fluctuating bandwidth availability. Introduced to facilitate compatibility with legacy telephone networks operating at low sampling rates, such as 8 kHz, and to provide adaptive audio streaming for early 3G mobile applications, AAC-SSR allows decoders to reconstruct audio at selected output sampling rates without requiring full bitrate transmission. This scalability addresses the needs of bandwidth-constrained systems by permitting graceful degradation in audio quality as network conditions vary. The core mechanism of AAC-SSR relies on a polyphase quadrature filter (PQF) bank for multi-rate encoding, which initially splits the input into four uniform subbands to enable efficient downsampling and independent processing of frequency components. The PQF applies gain control across these subbands (mandatory for the SSR toolset) to adjust dynamic range, followed by an inverse PQF (IPQF) at the decoder to recombine subbands in the time domain while mitigating aliasing effects. Further transformation occurs via modified discrete cosine transforms (MDCT) within each subband, supporting block switching for transient handling. The PQF's downsampling factor is defined as M = 2^k, where k represents the number of layers, allowing progressive halving of the effective sampling rate per layer to match decoder capabilities. AAC-SSR organizes encoding into a base layer and multiple enhancement layers, with the base layer typically operating at a low sampling rate of 8 kHz for basic intelligibility, and successive enhancement layers incrementally increasing the rate up to 48 kHz for full-bandwidth audio. Decoders can select layers to output at intermediate rates, such as 12 kHz or 24 kHz, by discarding higher enhancement data, ensuring compatibility with diverse playback devices.
This layered approach supports up to eight layers in some configurations, enabling fine control over reconstruction bandwidth. Operating at bitrates ranging from 12 to 96 kbps per stereo channel, AAC-SSR incorporates signal-to-noise ratio (SNR) scalability by allocating bits preferentially to lower subbands in the base layer, with enhancements improving SNR in higher frequencies. This structure maintains perceptual quality at lower bitrates while allowing incremental improvements, though maximum rates are constrained by channel count and sampling frequency (e.g., up to 288 kbps per channel at 48 kHz). Despite its innovative scalability, AAC-SSR exhibits higher computational complexity than standard AAC due to the multi-band PQF processing and layer management, which increases encoder and decoder overhead. It has been largely phased out in favor of High-Efficiency AAC (HE-AAC), which offers superior efficiency at low bitrates through spectral band replication without the added complexity of sampling rate adjustments. Limited adoption stems from these drawbacks and the evolution of network infrastructures that better support fixed-rate high-efficiency codecs.
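Under the M = 2^k relation given above, each discarded layer halves the effective output sampling rate. The following one-liner is a deliberately simplified sketch of that relation only (real SSR layer selection works on the four PQF subbands and is not a strict halving chain):

```python
def output_sample_rate(full_rate, layers_dropped):
    """Effective output rate after discarding k enhancement layers,
    assuming the uniform halving M = 2**k described in the text."""
    return full_rate // (2 ** layers_dropped)
```

For a 48 kHz full-rate stream, dropping two layers under this model yields a 12 kHz output, matching the intermediate rates mentioned above.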

Advanced Extensions

Audio Lossless Coding (ALS)

Audio Lossless Coding (ALS) is a component of the MPEG-4 Audio standard defined in ISO/IEC 14496-3, designed for the lossless compression of audio signals to enable high-fidelity storage and transmission without any degradation. It supports integer PCM input data ranging from 8 to 32 bits per sample and up to 32-bit floating-point formats, accommodating sample rates from 8 kHz to 384 kHz and channel configurations from mono to multi-channel setups. ALS achieves this through a flexible, forward-adaptive framework that ensures exact reconstruction of the original signal, making it suitable for applications requiring bit-perfect audio fidelity. The core method of ALS relies on linear prediction to model audio signals, followed by entropy coding of the prediction residuals. The prediction residual is computed as e(n) = x(n) - \sum_{k=1}^{p} a_k \cdot x(n-k), where x(n) is the input sample, p is the prediction order (up to 1023), and the coefficients a_k are adapted to the signal statistics on a per-block basis. These residuals are then encoded using Rice codes in a simple mode or Block Gilbert-Moore Codes (BGMC) for higher efficiency, with optional long-term prediction applied to blocks of samples to improve compression for certain signal types. Key features include reversible integer transforms for multi-channel processing and joint stereo coding techniques, such as inter-channel or multi-channel correlation coding with 3-tap or 6-tap filters, which exploit redundancies across channels without introducing loss. In terms of performance, ALS typically achieves compression ratios of 50-70% for standard music content, reducing file sizes while maintaining variable bitrates that align with uncompressed PCM equivalents, such as 1411 kbps for 44.1 kHz 16-bit stereo audio. It supports up to 32 channels, enabling efficient handling of surround sound formats.
For professional audio storage and archiving, ALS is widely used in studio environments, high-resolution disc mastering, and archival systems for preserving original quality in formats like WAVE or BWF. Decoder complexity is notably low, with CPU utilization of roughly 1-25% on mid-2000s hardware for real-time playback, outperforming contemporaries like FLAC in speed for multi-channel decoding. Scalable Lossless Coding (SLS) builds upon ALS techniques by adding an AAC-compatible core for hybrid lossy/lossless streams.

Scalable Lossless Coding (SLS)

Scalable Lossless Coding (SLS) is a backward-compatible extension to MPEG-4 Audio that enables fine-grained scalability from perceptual audio quality up to bit-exact lossless reconstruction, using an Advanced Audio Coding (AAC) compatible core layer augmented by enhancement layers. The core layer provides a lossy representation decodable by standard AAC tools, while the enhancement layers encode the residual signal needed to recover the original PCM data without loss, employing an integer modified discrete cosine transform (IntMDCT) for a precise, reversible transformation. This hybrid approach ensures compatibility with existing AAC decoders for fallback playback when enhancement data is unavailable or truncated.
The scalability in SLS is realized through progressive bit-plane coding in the lossless enhancement (LLE) layer, where each bit-plane represents a refinement level, allowing decoders to truncate the stream at any point for intermediate quality levels without artifacts. Bitrates begin at the AAC core level of approximately 64 kbit/s per channel for stereo signals and scale upward to full lossless rates, typically around 1,000–1,400 kbit/s depending on the audio material, supporting applications where bandwidth varies dynamically. Entropy coding, including bit-plane Golomb coding (BPGC) and context-based arithmetic coding (CBAC), is applied to the LLE for efficient compression of the residual data.
Key tools inherited from the AAC family enhance the core layer's efficiency, including bandwidth extension to synthesize high frequencies from lower-band information and Temporal Noise Shaping (TNS) to control pre-echoes by shaping quantization noise in the time-frequency domain. These elements ensure the base layer maintains high perceptual quality before enhancement, with the LLE operating in a low-energy mode for signals below the core's quantization threshold to minimize overhead. SLS also draws on techniques from Audio Lossless Coding (ALS) for the enhancement layer's prediction and coding, adapting them for scalability.
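The progressive-refinement property of bit-plane coding can be shown with a minimal sketch. This is not the standard's BPGC/CBAC machinery: it codes only non-negative residual magnitudes, most-significant plane first, and omits signs, contexts, and arithmetic coding, but it demonstrates why truncating the LLE stream after any plane still yields a usable approximation.

```python
def encode_bitplanes(magnitudes, num_planes):
    """Emit bit-planes MSB first; plane i holds bit (num_planes - 1 - i) of each value."""
    return [[(v >> p) & 1 for v in magnitudes]
            for p in range(num_planes - 1, -1, -1)]

def decode_bitplanes(planes, num_planes):
    """Reconstruct from however many leading planes were received (truncation-tolerant)."""
    values = [0] * len(planes[0])
    for i, plane in enumerate(planes):
        p = num_planes - 1 - i
        for j, bit in enumerate(plane):
            values[j] |= bit << p
    return values

planes = encode_bitplanes([5, 12, 3, 9], num_planes=4)
exact = decode_bitplanes(planes, 4)        # all planes -> lossless
coarse = decode_bitplanes(planes[:2], 4)   # truncated -> coarse approximation
```

With all four planes the reconstruction is exact ([5, 12, 3, 9]); with only the top two it degrades gracefully to [4, 12, 0, 8], each value within 4 (one unit of the last received plane's weight) of the original.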
In terms of compression efficiency, SLS achieves lossless performance comparable to state-of-the-art non-scalable coders, with the non-core mode (equivalent to pure lossless operation) outperforming established lossless formats by 0.5–2.3% in bitrate savings across test sets. When the AAC core is included, the full lossless bitrate increases by about 1–5% relative to the non-core mode because of the base-layer overhead, but this trade-off enables robust scalability for streaming. The layer scaling in SLS involves quantization thresholds derived from the core's MDCT coefficients, where residuals are mapped as e = c - \lfloor thr(i) + i \rfloor for non-zero indices i, ensuring additive reconstruction. SLS finds its primary application in streaming and content-delivery scenarios, where a single scalable stream can serve diverse receivers: low-bandwidth devices decode the AAC core for lossy playback, while high-end systems access the full enhancement data for lossless audio, facilitating efficient distribution with graceful degradation. This is particularly useful in archiving and emission chains, as SLS supports sample rates up to 192 kHz and multi-channel configurations without requiring separate lossy and lossless encodes.

Implementation Aspects

Storage and Transport Formats

The MPEG-4 Part 3 audio bitstream is structured with an initial AudioSpecificConfig that specifies parameters such as the audio object type, sampling rate, and channel configuration, followed by one or more raw_data_block elements containing the encoded audio frames. This syntax enables flexible decoding across the various MPEG-4 audio profiles and object types, ensuring compatibility in diverse applications.
For storage, MPEG-4 Part 3 audio is commonly encapsulated in the ISO Base Media File Format, which forms the basis of the MP4 container and supports timed media presentation with extensible metadata. The 3GP format, derived from the same base media specification, is optimized for mobile devices and includes provisions for MPEG-4 audio alongside video in low-bandwidth environments. In modern streaming protocols like HTTP Live Streaming (HLS), MPEG-4 Part 3 audio is delivered in fragmented MP4 segments, enabling adaptive-bitrate delivery over HTTP.
In transport scenarios, MPEG-4 Part 3 audio can be carried within MPEG-2 Transport Streams (TS) for broadcast applications, where Sync Layer (SL) packets provide access-unit identification and timing independent of the underlying delivery mechanism. For IP-based transport, RTP payload formats encapsulate the audio bitstreams without requiring the full MPEG-4 Systems layer, supporting efficient packetization of the AudioSpecificConfig and raw data blocks. A lightweight transport for raw streams is the Audio Data Transport Stream (ADTS) format, which prepends each audio frame with a header containing synchronization, profile, and length information to facilitate streaming without a full container. Low-delay variants, such as AAC Enhanced Low Delay (AAC-ELD) from MPEG-4 Part 3, are designed for voice-over-IP (VoIP) and conversational services, achieving algorithmic delays as low as 20-40 ms to minimize end-to-end latency while maintaining high audio quality. This mode integrates seamlessly with RTP transport, enabling real-time applications like teleconferencing.
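As a concrete illustration of ADTS framing, the sketch below decodes the fixed-header fields of a 7-byte ADTS header (syncword, profile, sampling-frequency index, channel configuration, frame length). The field layout follows the ADTS definition in ISO/IEC 14496-3; the sample header bytes are constructed for this example and describe a hypothetical 512-byte AAC LC frame.

```python
# Sampling frequencies addressed by the 4-bit sampling_frequency_index.
ADTS_SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000, 24000,
                     22050, 16000, 12000, 11025, 8000, 7350]

def parse_adts_header(data: bytes) -> dict:
    """Decode the ADTS header at the start of `data`; raise on a missing syncword."""
    if len(data) < 7 or data[0] != 0xFF or (data[1] & 0xF0) != 0xF0:
        raise ValueError("no ADTS syncword (0xFFF) at buffer start")
    protection_absent = data[1] & 0x01
    profile = (data[2] >> 6) & 0x03                      # audio object type - 1
    sf_index = (data[2] >> 2) & 0x0F
    channels = ((data[2] & 0x01) << 2) | (data[3] >> 6)  # channel_configuration
    frame_length = (((data[3] & 0x03) << 11)             # header + payload, bytes
                    | (data[4] << 3) | (data[5] >> 5))
    return {
        "object_type": profile + 1,                      # e.g. 2 = AAC LC
        "sample_rate": ADTS_SAMPLE_RATES[sf_index],
        "channels": channels,
        "frame_length": frame_length,
        "crc_present": protection_absent == 0,
    }

# Hypothetical header: AAC LC, 44.1 kHz, stereo, 512-byte frame, no CRC.
header = bytes([0xFF, 0xF1, 0x50, 0x80, 0x40, 0x1F, 0xFC])
```

parse_adts_header(header) reports object type 2 (AAC LC), 44100 Hz, 2 channels, and a 512-byte frame, mirroring how a demuxer resynchronizes on the 0xFFF syncword and walks the stream frame by frame.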

Standard Bifurcation and Licensing

The Advanced Audio Coding (AAC) technology at the core of MPEG-4 Part 3 was initially standardized as part of MPEG-2 in ISO/IEC 13818-7, with the associated systems layer developed in parallel as ITU-T Recommendation H.222.0 (equivalent to ISO/IEC 13818-1), enabling seamless integration of audio with video streams. This bifurcation arose from collaborative efforts between ISO's MPEG committee and ITU-T to harmonize specifications for audiovisual synchronization in broadcast and multimedia applications, such as digital television transmission, where precise audio-video alignment is essential. In MPEG-4, these elements were unified under ISO/IEC 14496-3, which incorporates AAC as the baseline while adding extensions like parametric coding for enhanced efficiency, eliminating the need for separate tracks for audio-specific advancements.
Licensing for MPEG-4 Part 3 technologies, including AAC, High-Efficiency AAC (HE-AAC), and Audio Lossless Coding (ALS), is administered through the Via Licensing Alliance (Via LA), formed by the 2023 merger of Via Licensing and MPEG LA to consolidate management of essential patents across audio codecs. The pool covers implementations in devices, software, and services, requiring licenses primarily from manufacturers of encoders and decoders rather than content providers. Royalty structures are tiered based on annual unit volumes to accommodate scale, with no fees for distributing AAC-encoded content but per-unit charges for compliant products. Under the standard rate schedule effective in 2025, fees begin at $0.98 per unit for the first 500,000 units, dropping to $0.78 for the next 500,000, $0.68 up to 2 million units, and $0.45 between 2 and 5 million units, with progressive reductions beyond that threshold. An alternative structure provides lower rates starting at $0.64 per unit for the initial 500,000 units, decreasing to $0.51, $0.44, and $0.29 across similar volume bands, while special reduced terms apply to emerging markets to encourage global deployment.
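The standard schedule lends itself to a short worked calculation. The helper below is illustrative only and encodes just the four volume bands quoted above; the "progressive reductions" beyond 5 million units are not specified here, so larger volumes are rejected rather than guessed.

```python
# (units in band, per-unit fee in USD) for the standard 2025 schedule cited above.
STANDARD_TIERS = [
    (500_000, 0.98),     # units 1 - 500,000
    (500_000, 0.78),     # units 500,001 - 1,000,000
    (1_000_000, 0.68),   # units 1,000,001 - 2,000,000
    (3_000_000, 0.45),   # units 2,000,001 - 5,000,000
]

def annual_royalty(units: int) -> float:
    """Total fee for `units` products, filling each volume band in turn."""
    total, remaining = 0.0, units
    for band_size, fee in STANDARD_TIERS:
        taken = min(remaining, band_size)
        total += taken * fee
        remaining -= taken
    if remaining > 0:
        raise ValueError("rates beyond 5,000,000 units are not modeled")
    return round(total, 2)
```

For example, 750,000 units cost 500,000 × $0.98 + 250,000 × $0.78 = $685,000 under this schedule, since only the excess over each band boundary is charged at the next (lower) rate.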
Patent disputes have periodically tested the AAC licensing program, including a 2025 Unified Patent Court action by Fraunhofer IIS against HMD Global for alleged infringement of essential AAC patents, despite HMD having received a license offer from Via Licensing in 2017. Earlier settlements, such as Via Licensing's 2016 resolution with Pegatron over unpaid royalties, underscore the pool's enforcement role. As an alternative, Fraunhofer provides the open-source FDK AAC codec for non-commercial or licensed implementations, explicitly disclaiming patent coverage to direct users toward Via's pool for royalty-bearing uses.
