ISO base media file format
The ISO base media file format (ISOBMFF) is a general-purpose, extensible container format standardized by the International Organization for Standardization (ISO) for storing timed sequences of multimedia data, such as audio-visual presentations, including their timing, structure, and media information.[1] It is specified in ISO/IEC 14496-12, with the seventh edition published in January 2022, and serves as the foundational structure for a range of derived file formats, including MP4 for video, HEIF for images, and others used in streaming and storage applications.[1][2] This format facilitates the interchange, management, editing, and presentation of media content, supporting both local playback and streaming scenarios.[2]
Originally derived from Apple's QuickTime file format, ISOBMFF evolved from the initial MP4 specification in 2001 and was generalized as a standalone standard in 2004 under ISO/IEC 14496-12, allowing broader applicability beyond MPEG-4 video.[2] It is maintained by ISO/IEC JTC 1/SC 29, with ongoing development through the MPEG working group, incorporating enhancements such as support for depth and alpha maps, T.35 metadata, and integration with standards like Common Media Application Format (CMAF) and Dynamic Adaptive Streaming over HTTP (DASH).[3] Over multiple editions, the format has been refined to handle diverse media types, including timed and untimed data, while ensuring backward compatibility through extensible mechanisms.[3]
At its core, ISOBMFF employs an object-oriented, box-based structure where all data—metadata and media samples alike—is encapsulated in self-contained "boxes" that include a length, a four-character type code, and optional version, flags, or user data fields.[2] Mandatory elements include the file-type box ('ftyp') for identifying compatible brands and the movie box ('moov') for presenting timed metadata, while untimed metadata uses the meta box; unrecognized boxes can be skipped for extensibility.[2] Media data may
reside in the primary file or be referenced externally via URLs in secondary files, enabling efficient handling of large-scale content like high-resolution video or panoramic media.[2] This modular design supports subtypes such as 'mp41' for MP4 files and 'mjp2' for Motion JPEG 2000, making ISOBMFF a versatile foundation for modern multimedia ecosystems.[2]
Overview
Definition and purpose
The ISO base media file format, formally specified in ISO/IEC 14496-12 (seventh edition, 2022),[1] serves as an international standard for a general-purpose container designed to encapsulate timed media information, including audio, video, text, and metadata, in a structured and extensible manner. This format enables the storage of multimedia data in a way that supports presentation, interchange, and management without dictating specific encoding methods for the media content itself.[4]
The primary purpose of the ISO base media file format is to act as a flexible, media-independent container that distinctly separates structural metadata—such as timing and organization details—from the raw media streams, thereby promoting seamless interoperability across diverse devices, software applications, and playback systems.[2] By prioritizing this separation, the format facilitates efficient handling of complex multimedia presentations, including synchronization of multiple tracks and support for progressive downloading or streaming scenarios.[5]
At its core, the format embodies modularity through a box-based architecture, where files are composed of hierarchically nested, typed boxes that encapsulate specific data elements, allowing for straightforward parsing, validation, and future extensions without disrupting existing implementations.[4] This design principle ensures the format's adaptability to evolving multimedia needs while maintaining compatibility. Originally evolved from Apple's QuickTime file format, it has been generalized to support broader applications within MPEG-4 systems and subsequent standards.[6]
Key characteristics
The ISO base media file format (ISOBMFF) is characterized by its modular and extensible design, which allows for the inclusion of brand-specific identifiers in the file type box to signal compatibility with particular specifications or extensions without disrupting core functionality.[4] This mechanism, combined with optional boxes and support for unrecognized box types that can be skipped during parsing, enables the addition of custom data or proprietary features while maintaining backward compatibility for conforming players.[2] For instance, incompatible changes in derived specifications require registration of a new brand identifier, ensuring clear delineation of format variants.[7] A core feature is the support for multiple independent tracks within a single file, each handling distinct media types such as audio, video, or subtitles, with their own timing information and synchronization mechanisms.[4] Tracks are time-parallel, allowing for flexible composition of presentations where each track carries spatial and temporal data autonomously, and track references facilitate relationships like hint tracks for streaming protocols.[7] This structure permits alternatives within tracks, such as multiple audio options, selected via embedded track selection data, enhancing adaptability for diverse playback scenarios.[4] The format employs a hierarchical organization based on boxes (also known as atoms), where each box consists of a header specifying its size (32- or 64-bit) and type (four-character code or UUID), followed by data or sub-boxes, enabling efficient random access and parsing.[2] All file content is encapsulated within this box structure, with no external data required for basic navigation, and the object-oriented hierarchy supports decomposition into parent-child relationships for complex media assemblies.[7] Four-character codes for box types are registered to ensure unambiguous identification, promoting interoperability across implementations.[4] ISOBMFF 
is inherently self-describing, embedding all necessary metadata for playback—including track headers, sample descriptions, and timing information—directly within the file, independent of external codec definitions or runtime environments.[2] This includes details on sample dependencies and decoding timelines, allowing players to reconstruct presentations without additional resources, while structural metadata in boxes like 'moov' provides comprehensive details on media organization.[7] The format's logical structure decouples metadata from media data, further supporting self-sufficiency in varied storage or transmission contexts.[4]
Finally, the format is optimized for progressive download and streaming through variants like fragmented MP4 files, which use movie fragment boxes to separate metadata from media data, enabling real-time assembly and incremental playback.[7] This fragmentation allows files to be generated and delivered in sequence without a complete upfront structure, with features like subsegment indexing for efficient byte-range requests in HTTP-based streaming.[2] Hint tracks further aid network delivery by providing packetization instructions for protocols such as RTP, ensuring seamless adaptation to bandwidth variations.[4]
History and development
Origins in QuickTime
The ISO base media file format (ISOBMFF) traces its origins to Apple's QuickTime File Format (QTFF), which emerged in the early 1990s as a foundational container for multimedia content on personal computers.[8] QuickTime was initially released in 1991 for the Mac OS, introducing an innovative atom-based structure that allowed for the modular organization of audio, video, and other time-based media within a single file.[9] These atoms—self-contained units consisting of a size, type identifier, and data payload—served as the building blocks for embedding diverse media streams, enabling synchronized playback and editing capabilities that were groundbreaking at the time.[10] Throughout the 1990s, QTFF evolved to support cross-platform compatibility, including adaptations for Windows in 1994, while influencing broader multimedia standards through its flexible, extensible design.[8] Apple's decision to contribute elements of QTFF to international standardization efforts marked a pivotal shift toward broader adoption. 
In the late 1990s, as the internet and mobile devices gained prominence, Apple collaborated with the Moving Picture Experts Group (MPEG) to generalize QTFF's architecture for web-based and portable multimedia applications, addressing limitations of platform-specific features.[11] This contribution formed the core of MPEG-4 Systems, aiming to create a versatile container decoupled from Macintosh-specific components, such as resource forks used for metadata storage in Mac OS files.[12] By stripping away these proprietary elements, the format became suitable for diverse operating environments, paving the way for its use in streaming and file exchange across devices.[2]
A key early milestone occurred with the formal integration of this evolved structure into MPEG-4 Part 12, published as ISO/IEC 14496-12 in 2004, which defined the initial ISOBMFF specification.[13] This standardization retained the atom-based hierarchy—renamed "boxes" for neutrality—while ensuring compatibility with emerging digital media needs, such as efficient storage and transmission over networks.[2] The result was a robust, open framework that extended QuickTime's legacy beyond Apple ecosystems, influencing subsequent formats like MP4 for widespread multimedia distribution.[12]
Standardization by ISO
The ISO base media file format was first published in 2004 as ISO/IEC 14496-12, forming part of the MPEG-4 suite of standards for coding audio-visual objects and titled "ISO base media file format."[13] This inaugural edition established a flexible, extensible structure for storing timed media data, drawing from earlier proprietary formats while enabling broad interoperability across multimedia applications.[2] Subsequent major revisions have progressively expanded the format's functionality to address evolving multimedia needs. The second edition, released in 2005, introduced file branding mechanisms to specify compatible format variants and ensure interoperability. The third edition in 2008 added support for progressive downloading and streaming, facilitating real-time media delivery over networks.[14] Further advancements came in the fourth edition of 2012, which incorporated fragmented file structures for efficient handling of large or dynamically generated media streams.[15] The fifth edition in 2015 integrated support for High Efficiency Video Coding (HEVC) and refined media encapsulation capabilities.[16] Ongoing development continues through amendments and new editions, reflecting advancements in multimedia technologies. 
Post-2015 updates in the sixth (2020) and seventh (2022) editions have included provisions for spatial audio rendering, point cloud data storage, and byte stream formats compatible with web standards, such as the W3C Media Source Extensions (MSE) integration to enable seamless browser-based media processing.[17][1] The eighth edition was ratified in July 2024 and published in 2025, incorporating further enhancements.[18] The standard is maintained by ISO/IEC JTC 1/SC 29 through its MPEG working groups, with significant contributions from organizations including 3GPP for mobile adaptations and Apple for core structural refinements.[4] To preserve compatibility across versions, the format relies on the major-brand and minor-version fields of the 'ftyp' box, together with the version and flags fields of full boxes, allowing parsers to handle extensions without breaking legacy support.[1]
Core file structure
Box architecture
The ISO base media file format is structured around boxes, which serve as the fundamental object-oriented building blocks for organizing all data within a file. Each box consists of a 32-bit unsigned integer size field indicating the total byte length of the box, including its header and payload; a 32-bit type field, typically a four-character code such as 'moov' for the movie box; an optional 64-bit large size field used when the initial size value is 1 to accommodate files exceeding 4 GiB; and a variable-length data payload that holds the box's content. This design allows boxes to have variable total lengths, enabling flexible encapsulation of metadata and media data.
Boxes support nesting, where one box can contain other boxes as part of its payload, creating a hierarchical tree structure that organizes complex information such as timing, tracks, and media samples. Container boxes primarily hold sub-boxes rather than data of their own, while full boxes extend the basic box header with version and flags fields that qualify their contents. This nesting facilitates modular composition, allowing the format to represent timed sequences of multimedia data in a scalable manner.
Parsing of the file begins sequentially from the start, with readers first interpreting the size and type fields to determine the box's extent and identity before processing the payload. The format supports both 32-bit and 64-bit size representations for broad compatibility, and custom or extended box types can employ a 16-byte UUID in place of the four-character code to ensure uniqueness. If the size field is 0, the box extends to the end of the file, which is particularly useful for media data containers. For robustness, full boxes incorporate an 8-bit version field to indicate the box format version and a 24-bit flags field to control conditional interpretation of data, enabling backward compatibility and optional features.
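The size/type header logic described above can be sketched as a minimal parser. This is an illustrative sketch, not a conforming reader: it assumes a well-formed in-memory buffer and handles only the three size cases from the specification (a normal 32-bit size, size == 1 signaling a 64-bit largesize, and size == 0 meaning the box runs to the end of the file).

```python
import struct

def read_box_header(buf, offset):
    """Parse one box header at `offset` and return
    (box_type, payload_start, payload_end)."""
    size, = struct.unpack_from(">I", buf, offset)
    box_type = buf[offset + 4:offset + 8].decode("ascii")
    header_len = 8
    if size == 1:                      # 64-bit largesize follows the type field
        size, = struct.unpack_from(">Q", buf, offset + 8)
        header_len = 16
    elif size == 0:                    # box extends to the end of the buffer
        size = len(buf) - offset
    return box_type, offset + header_len, offset + size

# A tiny hand-built buffer: an 'ftyp' box (major brand 'isom', minor
# version 0, one compatible brand), followed by an empty 'free' box.
ftyp = struct.pack(">I4s4sI", 20, b"ftyp", b"isom", 0) + b"isom"
free = struct.pack(">I4s", 8, b"free")
data = ftyp + free

offset, types = 0, []
while offset < len(data):
    box_type, start, end = read_box_header(data, offset)
    types.append(box_type)
    offset = end

print(types)   # ['ftyp', 'free']
```

Because every box announces its own length up front, a reader that does not recognize a type can simply jump to `payload_end` and continue, which is exactly the skip-unknown-boxes behavior the format relies on for extensibility.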
Unrecognized box types, versions, or fields are designed to be skipped during parsing, preventing errors from invalidating the entire file and promoting graceful degradation in diverse implementations. At the root level, the file typically comprises top-level boxes such as the 'ftyp' box for declaring the file type and compatibility, the 'moov' box for overall movie metadata, and the 'mdat' box for raw media data, arranged without deep nesting in this initial layer.
Top-level boxes
The ISO base media file format (ISOBMFF) organizes its content into a sequence of top-level boxes that form the root structure of the file, enabling parsers to identify the format, metadata, and media data efficiently. These boxes adhere to the general box architecture, where each begins with a size and type indicator, allowing sequential parsing without an index.[1] The primary top-level boxes include the File Type Box, Movie Box, Media Data Box, Free Space Box, and Skip Box, which collectively ensure compatibility, presentation control, and data storage while supporting optional elements for flexibility.[19]
The File Type Box ('ftyp') is a mandatory top-level box that appears at the beginning of the file to declare its type and compatibility profile. It specifies a major brand identifying the primary specification (e.g., 'isom' for the base ISO format or 'mp41' for MP4 file format version 1), a minor version for minor revisions within that brand, and a list of compatible brands that indicate supported extensions or variants.[1] This box enables parsers to quickly verify whether the file can be processed under a given implementation, preventing errors from incompatible features. For instance, a file branded 'iso2' supports version 2 of the base format, ensuring interoperability across tools like media players and encoders. Without 'ftyp', the file is invalid according to the standard.[19]
The Movie Box ('moov') serves as the mandatory container for all presentation-related metadata at the file root level and must appear before any media data in non-fragmented files to allow immediate access to timing and structure information.
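The layout rules above (brand declaration up front, metadata in 'moov', raw samples in 'mdat') can be illustrated by assembling a minimal skeleton file. This is a hypothetical sketch only: the 'moov' box here is empty, whereas a real file would carry 'mvhd' and 'trak' sub-boxes inside it, and the 'mdat' payload is placeholder bytes.

```python
import struct

def box(box_type, payload=b""):
    """Serialize one box: 32-bit size, four-character type, payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# ftyp payload: major brand 'isom', minor version 0, two compatible brands.
ftyp = box(b"ftyp", b"isom" + struct.pack(">I", 0) + b"isom" + b"mp41")
moov = box(b"moov")                 # would normally contain mvhd, trak, ...
mdat = box(b"mdat", b"\x00" * 16)   # placeholder media payload
skeleton = ftyp + moov + mdat

print(len(skeleton))   # 24 (ftyp) + 8 (moov) + 24 (mdat) = 56
```

Note that nothing in the 'mdat' box itself describes the samples it holds; a real writer would record their offsets and sizes in the 'moov' sample tables, which is why the two boxes can be laid out in either order or even split across files.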
The 'moov' box encapsulates essential sub-elements such as the movie header for overall duration and timescale, along with track definitions that organize media streams, though its internal details are defined elsewhere.[1] Positioned early in the file, 'moov' facilitates efficient seeking and playback initialization, particularly in progressive download scenarios where metadata precedes content. Exactly one 'moov' box is required per file for standard conformance.[19]
The Media Data Box ('mdat') is the optional yet essential top-level box for storing the raw, unstructured media payload, such as video frames, audio samples, or other timed content, and it may appear multiple times to group related data. Unlike metadata boxes, 'mdat' contains no headers beyond its size and type; the actual media samples are referenced by offsets and lengths from the 'moov' metadata.[1] This separation allows flexible file layouts, including interleaving media data after metadata or in separate files for advanced uses, while ensuring parsers can extract samples without interpreting the data itself. 'mdat' is absent in metadata-only files but required for any file with media content.[19]
For managing unused or reserved space, the Free Space Box ('free') and Skip Box ('skip') are optional top-level boxes that parsers ignore during processing, providing mechanisms for padding, alignment, or future extensions without affecting validity. The 'free' box marks irrelevant data that can be safely discarded or overwritten, often used for unused space in edited files, while 'skip' indicates content to be overlooked entirely, such as obsolete padding.[1] Both can occur zero or more times at the root and contain arbitrary data up to their declared size, but they carry no semantic meaning and are not required for file conformance. These boxes help maintain file integrity during incremental updates or storage optimizations.[19]
Media organization
Tracks and track types
In the ISO base media file format, media content is organized into one or more tracks, where each track represents a timed sequence of related samples that form an independent media stream, such as a sequence of video frames, audio samples, or streaming instructions.[7] Tracks are contained within the Movie Box ('moov'), enabling the encapsulation of multiple streams in a single file for synchronized presentation.[7] The Track Box, denoted by the four-character code 'trak', is the primary container for a track's metadata and references to its media data. It includes sub-boxes for track-specific information, such as the header, media details via the Media Box ('mdia'), and sample tables that describe sample timing, dependencies, and decoding parameters.[7] Within each Track Box, the mandatory Track Header Box ('tkhd') defines core track properties, including a unique track ID (a non-zero unsigned 32-bit integer, managed sequentially via the movie header's next_track_ID field), the track's duration expressed in the movie's timescale, visual dimensions (width and height as 16.16 fixed-point values for applicable tracks), a layer value for front-to-back rendering order among visual tracks, and flags controlling track usage (enabled, included in movie, included in preview).[7] The 'tkhd' box also specifies an alternate group identifier for mutually exclusive tracks (e.g., language variants) and a transformation matrix for spatial adjustments like scaling or rotation.[7] Tracks are specialized by type based on the media they handle, identified via the handler type in the Handler Box ('hdlr') within the Media Box. 
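Because each track's metadata lives in a 'trak' sub-box of 'moov' (with 'mdia' nested inside 'trak'), enumerating tracks amounts to recursing into the known container types. The following is a minimal sketch under the assumption of well-formed boxes with 32-bit sizes; the container list and the hand-built test file are illustrative, not exhaustive.

```python
import struct

CONTAINERS = {"moov", "trak", "mdia"}   # illustrative subset of container types

def boxes(buf, start, end):
    """Iterate (type, payload_start, payload_end) over sibling boxes."""
    off = start
    while off < end:
        size, = struct.unpack_from(">I", buf, off)
        yield buf[off + 4:off + 8].decode("ascii"), off + 8, off + size
        off += size

def find(buf, start, end, wanted, found):
    """Collect payload ranges of every `wanted` box, recursing into containers."""
    for btype, p0, p1 in boxes(buf, start, end):
        if btype == wanted:
            found.append((p0, p1))
        if btype in CONTAINERS:
            find(buf, p0, p1, wanted, found)
    return found

def box(btype, payload=b""):
    return struct.pack(">I4s", 8 + len(payload), btype) + payload

# Hypothetical file: a 'moov' holding two 'trak' boxes, each with an empty 'tkhd'.
data = box(b"moov", box(b"trak", box(b"tkhd")) + box(b"trak", box(b"tkhd")))
print(len(find(data, 0, len(data), "trak", [])))   # 2
```

A real reader would then open each track's 'mdia'/'hdlr' box to learn the handler type ('soun', 'vide', 'hint', and so on) before choosing how to process the samples.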
Audio tracks use the handler type 'soun' and contain encoded sound streams, often parameterized by sample entries for decoders like AAC.[7][4] Video tracks employ the handler type 'vide' to store visual streams, such as those compressed with H.264/AVC, with sample entries specifying decoder configurations and synchronization points like keyframes.[7][4] Hint tracks, using the handler type 'hint', hold packetization instructions for streaming protocols (e.g., RTP or HTTP), referencing underlying media tracks without carrying the actual media data.[7][4] Text and subtitle tracks manage timed textual content, such as captions or subtitles, synchronized to the presentation timeline via appropriate sample descriptions.[4] Metadata tracks store descriptive information, either timed (e.g., synchronized annotations) or untimed (e.g., track-level descriptors), often using custom handler types.[4]
Synchronization across multiple tracks relies on a shared timescale from the movie header, ensuring samples from different tracks align temporally based on their decoding and composition timestamps.[7] Edit lists in the Edit Box ('edts') allow per-track timeline adjustments, such as offsets or empty segments, to fine-tune alignment without altering sample data.[7] Track references enable dependencies, such as a hint track linking to media tracks or stereo audio configured as parallel tracks (e.g., left and right channels sharing an alternate group).[7] The format imposes no explicit limit on the number of tracks, though implementations typically support dozens to accommodate complex presentations with multiple audio languages, subtitles, or metadata layers.[7]
Key fields in the Track Header Box ('tkhd'):
| Field | Description | Data Type |
|---|---|---|
| track_ID | Unique 32-bit identifier for the track | unsigned int(32) |
| duration | Track duration in movie timescale units | unsigned int(32) or (64), version-dependent |
| layer | Rendering order (lower values in front) | int(16) |
| alternate_group | Group ID for exclusive tracks (0 if none) | int(16) |
| width/height | Visual dimensions (for video tracks) | unsigned int(32), 16.16 fixed-point |
| flags | Bitmask: enabled (0x1), in movie (0x2), in preview (0x4) | unsigned int(24) |
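The 16.16 fixed-point convention used by the width/height fields above (and by the movie header's rate field) stores a value as a 32-bit integer equal to the value multiplied by 2^16. A small sketch of the conversion:

```python
def to_fixed_16_16(value):
    """Encode a number as a 16.16 fixed-point 32-bit integer."""
    return int(round(value * 65536)) & 0xFFFFFFFF

def from_fixed_16_16(raw):
    """Decode a 16.16 fixed-point integer back to a float."""
    return raw / 65536

print(hex(to_fixed_16_16(1.0)))       # 0x10000 (the default playback rate, 1.0)
print(from_fixed_16_16(0x02D00000))   # 720.0, e.g., a 720-pixel track height
```

The same idea applies to the 8.8 format used for volume, with 2^8 as the scale factor instead of 2^16.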
Samples and decoding times
In the ISO base media file format, a sample represents the basic unit of media data within a track, such as a single video frame or an audio frame, stored contiguously in the media data box ('mdat') and referenced by metadata in the sample tables.[7] These samples are implicitly numbered sequentially starting from 1 and are associated with unique timestamps, enabling precise synchronization during playback.[7]
The Sample Table Box ('stbl'), contained within the media information box ('minf') of a track, serves as the central container for sample metadata, including one or more of the following sub-boxes: Sample Description Box ('stsd') for codec and initialization information, Sample Size Box ('stsz') for individual sample sizes, Sample To Chunk Box ('stsc') together with the Chunk Offset Box ('stco', or 'co64' for large files) for locating samples within chunks in 'mdat', and Time to Sample Box ('stts') for timing details.[7] This structure allows efficient access to samples without parsing the entire media data, supporting variable bit rates and frame sizes.[7]
Decoding times for samples are defined in the Decoding Time to Sample Box ('stts'), which maps sample indices to their durations using a table of entries, each specifying a count of consecutive samples and a shared delta value in the track's media timescale.[7] The decoding timestamp (DTS) for a sample is computed cumulatively from these deltas, accommodating variable frame rates by grouping samples with identical durations, such as in content with mixed frame rates for smooth motion or efficiency.[7] Presentation timestamps (PTS), which determine display order, are derived from DTS values adjusted by offsets in the optional Composition Time to Sample Box ('ctts'), essential for media like video with B-frames where decoding precedes presentation.[7] Each entry in 'ctts' applies an offset to a group of samples, ensuring unique PTS values across the track and enabling reordering without altering the decoding sequence.[7] Version 0 uses unsigned offsets for non-negative
adjustments, while version 1 supports signed offsets for more flexible timing scenarios.[7]
For random access, the Sync Sample Table Box ('stss') lists the indices of sync samples, such as keyframes or intra-coded frames, which can be decoded independently without relying on prior samples.[7] If 'stss' is absent, all samples are treated as sync samples, facilitating seeking and editing by identifying entry points in the track.[7] This mechanism is crucial for applications requiring quick navigation, like streaming, where sync samples mark stream access points.[7]
Metadata and presentation
Movie header and timing
The Movie Header Box, identified by the four-character code 'mvhd', is a mandatory full box contained within the Movie Box ('moov') of the ISO base media file format, providing essential media-independent metadata for the entire presentation, including global timing information and playback parameters.[7] It declares the creation and modification timestamps, the timescale for time measurements, the overall duration, and other settings such as playback rate and audio volume, ensuring a unified temporal framework across all tracks.[7]
The 'mvhd' box supports two versions to accommodate varying file durations and timestamp ranges: version 0 uses 32-bit integer fields for creation time, modification time, and duration, whose timestamps overflow 2^32 seconds after the 1904 epoch (in the year 2040) and whose duration is limited to 2^32 - 1 timescale units, while version 1 employs 64-bit fields for creation time, modification time, and duration to support longer content without overflow.[7] The timescale field, a 32-bit unsigned integer present in both versions, defines the time unit as ticks per second and serves as the common reference for all tracks in the file, such as 90000 ticks per second for high-frame-rate video to enable precise synchronization.[7] Duration is expressed as an integer multiple of the timescale, representing the total length of the presentation based on the longest track; if undetermined, it is set to the maximum value (all 1s in binary).[7] Creation and modification times are recorded as seconds since midnight on January 1, 1904, in UTC, using 32-bit or 64-bit unsigned integers depending on the version.[7]
Following the duration, the preferred rate is a 32-bit fixed-point 16.16 integer (default 1.0, or 0x00010000 in hexadecimal) indicating the desired playback speed relative to normal, and the preferred volume is a 16-bit fixed-point 8.8 integer (default 1.0, or 0x0100) setting the initial audio mix level for the presentation.
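The 1904 epoch and the version-0 overflow can be checked with a little arithmetic; the sketch below converts an 'mvhd' timestamp to a calendar date and shows where the 32-bit field runs out.

```python
from datetime import datetime, timedelta, timezone

EPOCH_1904 = datetime(1904, 1, 1, tzinfo=timezone.utc)

def mp4_time_to_datetime(seconds_since_1904):
    """Convert an 'mvhd' creation/modification timestamp to a datetime."""
    return EPOCH_1904 + timedelta(seconds=seconds_since_1904)

# A version-0 (32-bit) timestamp saturates 2**32 - 1 seconds after the epoch:
overflow = mp4_time_to_datetime(2**32 - 1)
print(overflow.year)   # 2040
```

Files expected to carry timestamps or durations beyond these limits should therefore be written with version 1 of the box, which widens the fields to 64 bits.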
The rate and volume fields are followed by reserved fields: a 16-bit reserved field set to 0 and two 32-bit unsigned integers reserved and set to 0.[7] For spatial positioning, the 'mvhd' box includes a transformation matrix, an array of nine 32-bit fixed-point values (16.16 format, except the offset components u, v, w in 2.30 format), structured as a 3x3 matrix {a, b, u; c, d, v; x, y, w} that applies scaling, rotation, and translation to tracks during presentation, with default values forming an identity matrix (a=d=0x00010000, others 0 except w=0x40000000). This is followed by six 32-bit pre-defined fields reserved and set to 0 (mapping QuickTime preview, poster, selection, and current time fields). The next track ID, a 32-bit unsigned integer that specifies the identifier for the subsequent track to be added, ensuring uniqueness and exceeding any existing track IDs in the file, concludes the box.[7]
Layout of the 'mvhd' box fields:
| Field | Size (Version 0) | Size (Version 1) | Type | Notes |
|---|---|---|---|---|
| Version/Flags | 4 bytes | 4 bytes | Full box header | Version 0 or 1 |
| Creation Time | 4 bytes | 8 bytes | unsigned int | Seconds since 1904-01-01 UTC |
| Modification Time | 4 bytes | 8 bytes | unsigned int | Seconds since 1904-01-01 UTC |
| Timescale | 4 bytes | 4 bytes | unsigned int(32) | Ticks per second |
| Duration | 4 bytes | 8 bytes | unsigned int | In timescale units; all 1s if indeterminate |
| Preferred Rate | 4 bytes | 4 bytes | fixed32(16.16) | Default 1.0 |
| Preferred Volume | 2 bytes | 2 bytes | fixed16(8.8) | Default 1.0 |
| Reserved | 2 bytes | 2 bytes | bit(16) | Set to 0 |
| Reserved | 8 bytes | 8 bytes | unsigned int(32)[2] | Set to 0 |
| Matrix | 36 bytes | 36 bytes | int(32)[9], fixed-point | 3x3 transformation (16.16 except u,v,w in 2.30); default identity |
| Pre-defined | 24 bytes | 24 bytes | bit(32)[6] | Reserved; set to 0 (QuickTime legacy fields) |
| Next Track ID | 4 bytes | 4 bytes | unsigned int(32) | Identifier for the next track to be added |
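The field layout in the table above can be exercised with a short parser for the version-0 case. This is a sketch only: it reads just the timing, rate, and volume fields from a hand-built 'mvhd' body (the bytes after the 8-byte box header) and ignores the matrix and trailing fields.

```python
import struct

def parse_mvhd_v0(body):
    """Parse the leading fields of a version-0 'mvhd' box body."""
    version = body[0]
    assert version == 0, "sketch handles only version 0 (32-bit fields)"
    creation, modification, timescale, duration = struct.unpack_from(">IIII", body, 4)
    rate, = struct.unpack_from(">I", body, 20)
    volume, = struct.unpack_from(">H", body, 24)
    return {
        "timescale": timescale,
        "duration_s": duration / timescale,
        "rate": rate / 65536,    # 16.16 fixed-point
        "volume": volume / 256,  # 8.8 fixed-point
    }

# Hypothetical header: 90000 ticks/s, a 10-second movie, default rate/volume.
body = bytes(4)                                    # version 0 + 24-bit flags
body += struct.pack(">IIII", 0, 0, 90000, 900000)  # times, timescale, duration
body += struct.pack(">IH", 0x00010000, 0x0100)     # rate 1.0, volume 1.0
print(parse_mvhd_v0(body))
# {'timescale': 90000, 'duration_s': 10.0, 'rate': 1.0, 'volume': 1.0}
```

A version-1 parser would follow the same layout but read the three time-related fields as 64-bit integers, shifting every later offset by 12 bytes.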