
ISO base media file format

The ISO base media file format (ISOBMFF) is a general-purpose, extensible container format standardized by the International Organization for Standardization (ISO) for storing timed sequences of multimedia data, such as audio-visual presentations, together with their timing, structure, and media information. It is specified in ISO/IEC 14496-12, whose seventh edition was published in January 2022, and serves as the foundation for a range of derived file formats, including MP4 for video and HEIF for images, used in streaming and storage applications. The format facilitates the interchange, management, editing, and presentation of media content, supporting both local playback and streaming.

Originally derived from Apple's QuickTime File Format, ISOBMFF evolved from the initial MP4 specification of 2001 and was generalized as a standalone standard in 2004 under ISO/IEC 14496-12, allowing broader applicability beyond MPEG-4 video. It is maintained by ISO/IEC JTC 1/SC 29, with ongoing development through MPEG, incorporating enhancements such as support for depth and alpha maps, T.35 metadata, and integration with standards like the Common Media Application Format (CMAF) and Dynamic Adaptive Streaming over HTTP (DASH). Over multiple editions, the format has been refined to handle diverse media types, including timed and untimed data, while preserving backward compatibility through extensible mechanisms.

At its core, ISOBMFF employs an object-oriented, box-based structure in which all data, metadata and media samples alike, is encapsulated in self-contained "boxes" that carry a length, a four-character type code, and optional version, flags, or user data fields. Key elements include the file-type box (ftyp), which identifies compatible brands, and the movie box (moov), which holds the metadata for timed media, while untimed content uses the meta box; unrecognized boxes can be skipped, which provides extensibility. Media data may reside in the primary file or be referenced externally via URLs in secondary files, enabling efficient handling of large-scale content like high-resolution video or panoramic imagery. This modular design supports brands such as 'mp41' for MP4 files and 'mjp2' for Motion JPEG 2000, making ISOBMFF a versatile foundation for modern multimedia ecosystems.

Overview

Definition and purpose

The ISO base media file format, formally specified in ISO/IEC 14496-12 (seventh edition, 2022), defines a general-purpose container designed to encapsulate timed media information, including audio, video, text, and metadata, in a structured and extensible manner. The format stores media data in a way that supports presentation, interchange, and management without dictating specific encoding methods for the media content itself.

The primary purpose of the format is to act as a flexible, media-independent container that separates structural metadata, such as timing and track details, from the raw media streams, which promotes interoperability across diverse devices, software applications, and playback systems. This separation facilitates efficient handling of complex presentations, including synchronization of multiple tracks and support for progressive download or streaming.

At its core, the format achieves modularity through a box-based architecture: files are composed of hierarchically nested, typed boxes that each encapsulate specific elements, allowing straightforward parsing, validation, and future extension without disrupting existing implementations. Originally evolved from Apple's QuickTime File Format, it has been generalized to support broader applications within MPEG-4 systems and subsequent standards.

Key characteristics

ISOBMFF is characterized by a modular and extensible design. Brand identifiers in the file type box signal compatibility with particular specifications or extensions without disrupting core functionality. Combined with optional boxes and the rule that unrecognized box types can be skipped during parsing, this mechanism allows custom data or proprietary features to be added while maintaining backward compatibility for conforming players. Incompatible changes in a derived specification require registration of a new brand identifier, which keeps format variants clearly delineated.

A core feature is support for multiple independent tracks within a single file, each handling a distinct media type such as audio, video, or metadata, with its own timing information and synchronization mechanisms. Tracks are time-parallel, which allows flexible composition of presentations: each track carries its spatial and temporal data autonomously, and track references express relationships such as hint tracks for streaming protocols. Alternatives within a file, such as multiple audio options, can be selected via embedded track selection data, which improves adaptability across playback scenarios.

The format employs a hierarchical structure based on boxes (also known as atoms). Each box consists of a header specifying its size (32- or 64-bit) and type (a four-character code or UUID), followed by data or sub-boxes, enabling efficient navigation and parsing. All file content is encapsulated within this box structure, with no external data required for basic navigation, and the hierarchy supports decomposition into parent-child relationships for complex media assemblies. Four-character codes for box types are registered to ensure unambiguous identification, promoting interoperability across implementations.

ISOBMFF is self-describing: the metadata needed for playback, including headers, sample descriptions, and timing information, is embedded directly within the file, independent of external definitions or runtime environments. This includes details on sample dependencies and decoding timelines, allowing players to reconstruct presentations without additional resources, while structural metadata in boxes like 'moov' describes the overall organization. The format's logical structure decouples metadata from media data, further supporting self-sufficiency in varied storage or transmission contexts.

Finally, the format is suited to progressive download and streaming through variants like fragmented MP4 files, which use movie fragment boxes to deliver metadata and media data in increments, enabling incremental assembly and playback. Fragmentation allows files to be generated and delivered in sequence without a complete upfront structure, with features like subsegment indexing for efficient byte-range requests in HTTP-based streaming. Hint tracks further aid network delivery by providing packetization instructions for protocols such as RTP, which helps delivery adapt to network conditions.

History and development

Origins in QuickTime

The ISO base media file format (ISOBMFF) traces its origins to Apple's QuickTime File Format (QTFF), which emerged in the early 1990s as a container for multimedia content on personal computers. QuickTime was initially released in 1991 for Mac OS, introducing an atom-based structure that allowed the modular organization of audio, video, and other time-based media within a single file. These atoms, self-contained units consisting of a size, a type identifier, and a data payload, served as the building blocks for embedding diverse media streams, enabling synchronized playback and editing. Throughout the 1990s, QTFF evolved to support cross-platform use, including adaptations for Windows in 1994, while influencing broader multimedia standards through its flexible, extensible design.

Apple's decision to contribute elements of QTFF to international standardization marked a pivotal shift toward broader adoption. In the late 1990s, as the internet and mobile devices gained prominence, Apple collaborated with the Moving Picture Experts Group (MPEG) to generalize QTFF's architecture for web-based and portable applications, addressing the limitations of platform-specific features. This contribution formed the core of MPEG-4 Systems, aiming to create a versatile container decoupled from Macintosh-specific components, such as the resource forks used for storage in Mac OS files. With these platform-specific elements stripped away, the format became suitable for diverse operating environments, paving the way for its use in streaming and file exchange across devices.

A key early milestone was the formal integration of this evolved structure into MPEG-4 Part 12, published as ISO/IEC 14496-12 in 2004, which defined the initial ISOBMFF specification. This standardization retained the atom-based hierarchy, with atoms renamed "boxes" for neutrality, while meeting emerging needs such as efficient storage and transmission over networks. The result was a robust, open framework that extended QuickTime's legacy beyond Apple ecosystems, influencing subsequent formats like MP4 for widespread multimedia distribution.

Standardization by ISO

The ISO base media file format was first published in 2004 as ISO/IEC 14496-12, forming part of the MPEG-4 suite of standards for coding audio-visual objects, under the title "ISO base media file format." This inaugural edition established a flexible, extensible structure for storing timed media data, drawing from earlier proprietary formats while enabling broad interoperability across multimedia applications.

Subsequent major revisions have progressively expanded the format's functionality. The second edition, released in 2005, introduced file branding mechanisms to specify compatible format variants and ensure interoperability. The third edition in 2008 added support for progressive downloading and streaming, facilitating real-time media delivery over networks. The fourth edition of 2012 incorporated fragmented file structures for efficient handling of large or dynamically generated media streams. The fifth edition in 2015 integrated support for High Efficiency Video Coding (HEVC) and refined media encapsulation capabilities.

Development continues through amendments and new editions. The sixth (2020) and seventh (2022) editions included provisions for spatial audio rendering, data storage, and byte stream formats compatible with web standards, such as integration with the W3C Media Source Extensions (MSE) to enable browser-based media processing. The eighth edition was ratified in July 2024 and published in 2025, incorporating further enhancements.

The standard is maintained by ISO/IEC JTC 1/SC 29 (MPEG, historically WG 11), with significant contributions from organizations including 3GPP for mobile adaptations and Apple for core structural refinements. To preserve compatibility across versions, the format employs version fields within its structures, allowing parsers to handle extensions without breaking legacy support.

Core file structure

Box architecture

The ISO base media file format is structured around boxes, which serve as the fundamental object-oriented building blocks for organizing all data within a file. Each box consists of a 32-bit unsigned size field giving the total byte length of the box, including its header and payload; a 32-bit type field, typically a four-character code such as 'moov' for the movie box; an optional 64-bit largesize field, used when the initial size value is 1, to accommodate boxes exceeding 4 GiB; and a variable-length payload that holds the box's content. This design allows boxes to have variable total lengths, enabling flexible encapsulation of metadata and media data.

Boxes support nesting: one box can contain other boxes as part of its payload, creating a hierarchical tree that organizes complex information such as timing, tracks, and samples. Container boxes primarily hold sub-boxes without additional data of their own, while full boxes extend the basic header with version and flags fields. This nesting facilitates modular composition, allowing the format to represent timed sequences of data in a scalable manner.

Parsing of the file begins sequentially from the start, with readers first interpreting the size and type fields to determine the box's extent and identity before processing the payload. The format supports both 32-bit and 64-bit size representations, and custom or extended types use the type code 'uuid' followed by a 16-byte UUID to ensure uniqueness. If the size field is 0, the box extends to the end of the file, which is particularly useful for media data containers.

For robustness, full boxes incorporate an 8-bit version field to indicate the box format version and a 24-bit flags field to control conditional interpretation of fields, enabling backward compatibility and optional features. Unrecognized box types, versions, or fields are designed to be skipped during parsing, preventing errors from invalidating the entire file and promoting graceful degradation in diverse implementations.

At the root level, the file typically comprises top-level boxes such as the 'ftyp' box for declaring the file type and compatibility, the 'moov' box for overall movie metadata, and the 'mdat' box for raw media data, arranged side by side rather than nested within one another.
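The header rules above (32-bit size, four-character type, the size values 1 and 0, and the 'uuid' extended type) can be illustrated with a minimal sketch. The function name `iter_boxes` and the in-memory `bytes` input are assumptions made for this example; a production parser would read from a file or stream and would also need to handle truncated input more gracefully.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(buf: bytes, start: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (box_type, payload_start, payload_end) for each box in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        # Compact header: 32-bit big-endian size, then the four-character type.
        size, raw_type = struct.unpack_from(">I4s", buf, pos)
        header_len = 8
        if size == 1:
            # size == 1: the real size is stored in a 64-bit largesize field.
            if pos + 16 > end:
                raise ValueError("truncated largesize field")
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header_len = 16
        elif size == 0:
            # size == 0: the box runs to the end of the enclosing range.
            size = end - pos
        box_type = raw_type.decode("latin-1")
        if raw_type == b"uuid":
            # Extended types carry a 16-byte UUID right after the header.
            box_type = "uuid:" + buf[pos + header_len:pos + header_len + 16].hex()
            header_len += 16
        if size < header_len or pos + size > end:
            raise ValueError(f"invalid size for box {box_type!r} at offset {pos}")
        yield box_type, pos + header_len, pos + size
        pos += size


# Hand-built example: a 16-byte 'ftyp', an empty 'free', and an 'mdat' of size 0.
sample = (
    struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 0)
    + struct.pack(">I4s", 8, b"free")
    + struct.pack(">I4s", 0, b"mdat") + b"abc"
)
boxes = list(iter_boxes(sample))
```

Because the function takes a start and end offset, the same loop can be applied to the payload range of a container box such as 'moov' to walk its children; unknown types are simply yielded and can be skipped by the caller.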

Top-level boxes

The ISO base media file format (ISOBMFF) organizes its content into a sequence of top-level boxes that form the root structure of the file, enabling parsers to identify the file type, metadata, and media data efficiently. These boxes follow the general box architecture, where each begins with a size and type indicator, allowing sequential parsing without an index. The primary top-level boxes include the File Type Box, Movie Box, Media Data Box, Free Space Box, and Skip Box, which collectively provide compatibility signaling, presentation control, and data storage while supporting optional elements for flexibility.

The File Type Box ('ftyp') appears at or near the beginning of the file to declare its type and profile. It specifies a major brand identifying the primary specification (e.g., 'isom' for the base ISO format or 'mp41' for MP4 version 1), a minor version for revisions within that brand, and a list of compatible brands that indicate supported extensions or variants. This box lets parsers quickly verify whether the file can be processed under a given specification, preventing errors from incompatible features. For instance, a file branded 'iso2' conforms to version 2 of the base format, supporting interoperability across tools like media players and encoders. Current editions of the standard require 'ftyp' in newly written files; for compatibility with files that predate the box, readers are expected to treat a file without 'ftyp' as if it declared the 'mp41' brand.

The Movie Box ('moov') serves as the container for all presentation-related metadata at the file root level. It encapsulates essential sub-elements such as the movie header, which gives the overall duration and timescale, along with track definitions that organize the media streams; its internal details are described in later sections. The box may appear before or after the media data. Placing it early in the file gives immediate access to timing and structure information, which facilitates seeking and playback initialization, particularly in progressive download scenarios where metadata must precede content. Exactly one 'moov' box is required per file for standard conformance.

The Media Data Box ('mdat') stores the raw media payload, such as video frames, audio samples, or other timed content, and may appear multiple times to group related data. Unlike metadata boxes, 'mdat' has no internal structure beyond its size and type; the media samples inside it are located through offsets and lengths recorded in the 'moov' metadata. This separation allows flexible file layouts, including placing media data after the metadata or in separate files for advanced uses, while ensuring parsers can extract samples without interpreting the data itself. The box is absent in metadata-only files and in files whose media data is stored entirely externally.

For managing unused or reserved space, the Free Space Box ('free') and Skip Box ('skip') are optional top-level boxes that parsers ignore during processing, providing a mechanism for padding, alignment, or future extension without affecting validity. The 'free' box marks irrelevant data that can be safely discarded or overwritten, often unused space in edited files, while 'skip' indicates content to be passed over entirely, such as obsolete padding. Both can occur zero or more times at the root and contain arbitrary data up to their declared size, but they carry no semantic meaning and are not required for conformance. These boxes help maintain file integrity during incremental updates or storage optimizations.
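As a small illustration of the File Type Box layout described above (major brand, minor version, then a list of compatible brands), the following sketch decodes an 'ftyp' payload. The function name `parse_ftyp` and the returned dictionary keys are choices made for this example, and the payload is assumed to be the bytes that follow the box's size and type fields.

```python
import struct
from typing import Dict, List, Union


def parse_ftyp(payload: bytes) -> Dict[str, Union[str, int, List[str]]]:
    """Decode the payload of an 'ftyp' box (everything after size and type)."""
    if len(payload) < 8 or (len(payload) - 8) % 4 != 0:
        raise ValueError("ftyp payload must be 8 bytes plus a multiple of 4")
    # Major brand is a four-character code; minor version is a 32-bit integer.
    major, minor = struct.unpack_from(">4sI", payload, 0)
    # The remainder is a packed list of four-character compatible brands.
    compatible = [payload[i:i + 4].decode("latin-1")
                  for i in range(8, len(payload), 4)]
    return {
        "major_brand": major.decode("latin-1"),
        "minor_version": minor,
        "compatible_brands": compatible,
    }


# Hand-built payload: major brand 'isom', minor version 512, three compatible brands.
example_payload = b"isom" + struct.pack(">I", 512) + b"isomiso2mp41"
ftyp = parse_ftyp(example_payload)
```

Brands are kept as four-character strings, including any trailing spaces, because padding is significant when brands are compared.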

Media organization

Tracks and track types

In the ISO base media file format, media content is organized into one or more tracks, where each track represents a timed sequence of related samples that form an independent media stream, such as a sequence of video frames, audio samples, or streaming instructions. Tracks are contained within the Movie Box ('moov'), enabling the encapsulation of multiple streams in a single file for synchronized presentation.

The Track Box, denoted by the four-character code 'trak', is the primary container for a track's metadata and references to its media data. It includes sub-boxes for track-specific information, such as the header, media details via the Media Box ('mdia'), and sample tables that describe sample timing, dependencies, and decoding parameters.

Within each Track Box, the mandatory Track Header Box ('tkhd') defines core track properties: a unique track ID (a non-zero unsigned 32-bit integer, managed via the movie header's next_track_ID field), the track's duration expressed in the movie's timescale, visual dimensions (width and height as 16.16 fixed-point values for applicable tracks), a layer value for front-to-back rendering order among visual tracks, and flags controlling track usage (enabled, included in movie, included in preview). The 'tkhd' box also specifies an alternate group identifier for mutually exclusive tracks (e.g., language variants) and a transformation matrix for spatial adjustments like scaling or rotation.

Tracks are specialized by the media they handle, identified via the handler type in the Handler Box ('hdlr') within the Media Box:

Audio tracks use the handler type 'soun' and contain encoded sound streams, parameterized by sample entries for codecs like AAC.
Video tracks use the handler type 'vide' to store visual streams, such as those compressed with H.264/AVC, with sample entries specifying decoder configurations and synchronization points like keyframes.
Hint tracks use the handler type 'hint' and hold packetization instructions for streaming protocols such as RTP, referencing underlying media tracks without carrying the actual media data.
Text and subtitle tracks manage timed textual content, such as captions or subtitles, synchronized to the presentation timeline via appropriate sample descriptions.
Metadata tracks store descriptive information, either timed (e.g., synchronized annotations) or untimed (e.g., track-level descriptors), often using custom handler types.

Synchronization across multiple tracks relies on the common movie timeline defined by the movie header, so that samples from different tracks align temporally based on their decoding and composition timestamps. Edit lists in the Edit Box ('edts') allow per-track timeline adjustments, such as offsets or empty segments, to fine-tune alignment without altering sample data. Track references express dependencies, such as a hint track linking to the media tracks it packetizes. The format imposes no explicit limit on the number of tracks, though implementations typically support dozens to accommodate complex presentations with multiple audio languages, subtitles, or metadata layers.
Key fields in the Track Header Box ('tkhd'):

| Field | Description | Data type |
|---|---|---|
| track_ID | Unique 32-bit identifier for the track | unsigned int(32) |
| duration | Track duration in movie timescale units | unsigned int(32) or (64), version-dependent |
| layer | Rendering order (lower values in front) | int(16) |
| alternate_group | Group ID for exclusive tracks (0 if none) | int(16) |
| width/height | Visual dimensions (for video tracks) | unsigned int(32), 16.16 fixed-point |
| flags | Bitmask: enabled (0x1), in movie (0x2), in preview (0x4) | unsigned int(24) |
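Two of the 'tkhd' fields listed above need a small amount of decoding before they are useful: the flags bitmask and the 16.16 fixed-point width and height. A minimal sketch follows, with the constant and function names chosen for this example rather than taken from any library.

```python
# Bit values for the 24-bit 'tkhd' flags field, as listed in the table above.
TRACK_ENABLED = 0x000001
TRACK_IN_MOVIE = 0x000002
TRACK_IN_PREVIEW = 0x000004


def decode_tkhd_flags(flags: int) -> dict:
    """Split the 'tkhd' flags bitmask into named booleans."""
    return {
        "enabled": bool(flags & TRACK_ENABLED),
        "in_movie": bool(flags & TRACK_IN_MOVIE),
        "in_preview": bool(flags & TRACK_IN_PREVIEW),
    }


def fixed_16_16_to_float(raw: int) -> float:
    """Convert a 16.16 fixed-point value (integer part in the high 16 bits)."""
    return raw / 65536.0


# A track that is enabled and used in the movie, but not in the preview.
flags = decode_tkhd_flags(0x000003)
# Width and height of a 1920x1080 visual track as stored in the box.
width = fixed_16_16_to_float(1920 << 16)
height = fixed_16_16_to_float(1080 << 16)
```

Dividing by 65536 preserves fractional dimensions, which an integer shift would discard.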

Samples and decoding times

In the ISO base media file format, a sample is the basic unit of media data within a track, such as a single video frame or an audio frame. Sample data is stored in the media data box ('mdat') and located through metadata in the sample tables. Samples are implicitly numbered sequentially starting from 1 and are associated with timestamps, enabling precise synchronization during playback.

The Sample Table Box ('stbl'), contained within the media information box ('minf') of a track, serves as the central index for sample metadata. Its sub-boxes include the Sample Description Box ('stsd') for codec and initialization information, the Sample Size Box ('stsz') for individual sample sizes, the Chunk Offset Box ('stco', or 'co64' for large files) for locating samples in 'mdat', and the Time to Sample Box ('stts') for timing details. This structure allows efficient access to samples without parsing the entire media data, supporting variable bit rates and frame sizes.

Decoding times are defined in the Decoding Time to Sample Box ('stts'), which maps sample indices to their durations using a table of entries, each specifying a count of consecutive samples and a shared delta value in the track's media timescale. The decoding timestamp (DTS) of a sample is the cumulative sum of the deltas of all preceding samples. Grouping samples with identical durations keeps the table compact while still accommodating variable frame rates.

Presentation timestamps (PTS), which determine display order, are derived from DTS values adjusted by offsets in the optional Composition Time to Sample Box ('ctts'), essential for media like video with B-frames, where decoding order differs from presentation order. Each entry in 'ctts' applies an offset to a group of samples, ensuring unique PTS values across the track and enabling reordering without altering the decoding sequence. Version 0 uses unsigned offsets for non-negative adjustments, while version 1 supports signed offsets for more flexible timing scenarios.

For random access, the Sync Sample Table Box ('stss') lists the indices of sync samples, such as keyframes or intra-coded frames, which can be decoded independently without relying on prior samples. If 'stss' is absent, all samples are treated as sync samples. This mechanism facilitates seeking and editing by identifying entry points in the track, and is crucial for applications requiring quick navigation, like streaming, where sync samples mark stream access points.
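The timing rules above amount to a running sum over the 'stts' runs, an optional per-sample offset from 'ctts', and a lookup in 'stss' when seeking. The sketch below works on already-parsed tables represented as lists of (sample_count, value) tuples; that representation and all function names are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

Run = Tuple[int, int]  # (sample_count, value), as stored in 'stts' and 'ctts'


def expand_runs(entries: Iterable[Run]) -> Iterator[int]:
    """Expand run-length entries into one value per sample."""
    for count, value in entries:
        for _ in range(count):
            yield value


def decode_times(stts: Iterable[Run]) -> List[int]:
    """DTS of each sample: the sum of the deltas of all preceding samples."""
    times, t = [], 0
    for delta in expand_runs(stts):
        times.append(t)
        t += delta
    return times


def presentation_times(stts: Iterable[Run],
                       ctts: Optional[Iterable[Run]] = None) -> List[int]:
    """PTS of each sample: DTS plus the composition offset, if 'ctts' exists."""
    dts = decode_times(stts)
    if ctts is None:
        return dts
    offsets = list(expand_runs(ctts))
    if len(offsets) != len(dts):
        raise ValueError("'ctts' and 'stts' describe different sample counts")
    return [d + o for d, o in zip(dts, offsets)]


def sync_sample_at_or_before(stss: Optional[Sequence[int]], sample_number: int) -> int:
    """Nearest sync sample (1-based) at or before sample_number, for seeking."""
    if stss is None:
        return sample_number  # no 'stss': every sample is a sync sample
    i = bisect_right(stss, sample_number)
    if i == 0:
        raise ValueError("no sync sample at or before the requested sample")
    return stss[i - 1]


# Four samples in decode order I, P, B, B with a constant delta of 1000 ticks.
stts = [(4, 1000)]
ctts = [(1, 1000), (1, 3000), (2, 0)]
dts = decode_times(stts)
pts = presentation_times(stts, ctts)
```

In this example the P-frame is decoded second but presented last, which is exactly the reordering that 'ctts' exists to express; all four presentation times remain unique.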

Metadata and presentation

Movie header and timing

The Movie Header Box, identified by the four-character code 'mvhd', is a mandatory full box contained within the Movie Box ('moov'). It provides essential media-independent information for the entire presentation, including global timing information and playback parameters. It declares the creation and modification timestamps, the timescale for time measurements, the overall duration, and other settings such as playback rate and audio volume, establishing a unified temporal framework across all tracks.

The 'mvhd' box supports two versions to accommodate varying file durations and timestamp ranges. Version 0 uses 32-bit fields for creation time, modification time, and duration, which limits durations to 2^32 - 1 timescale units and timestamps to dates before early 2040. Version 1 employs 64-bit fields for the same three values to support longer content without overflow. The timescale field, a 32-bit unsigned integer present in both versions, defines the time unit as ticks per second and serves as the common reference for the movie timeline; a value such as 90000 ticks per second allows precise synchronization of video. Duration is expressed as an integer number of timescale units, representing the total length of the presentation based on the longest track; if it cannot be determined, it is set to the maximum value (all 1s in binary). Creation and modification times are recorded as seconds since midnight on January 1, 1904, in UTC, using 32-bit or 64-bit unsigned integers depending on the version.

Following the duration, the preferred rate is a 32-bit fixed-point 16.16 number (default 1.0, or 0x00010000 in hexadecimal) indicating the desired playback speed relative to normal, and the preferred volume is a 16-bit fixed-point 8.8 number (default 1.0, or 0x0100) setting the initial audio level for the presentation. These are followed by reserved fields: a 16-bit reserved field set to 0 and two 32-bit unsigned integers, also set to 0.

For spatial positioning, the 'mvhd' box includes a transformation matrix, an array of nine 32-bit fixed-point values (16.16 format, except the components u, v, w, which use 2.30 format), structured as a 3x3 matrix {a, b, u; c, d, v; x, y, w} that applies scaling, rotation, and translation to tracks during presentation. The default values form an identity matrix (a = d = 0x00010000, w = 0x40000000, all others 0). The matrix is followed by six 32-bit pre-defined fields that are reserved and set to 0 (they correspond to the preview, poster, selection, and current time fields of QuickTime). The box concludes with the next track ID, a 32-bit unsigned integer that specifies the identifier for the next track to be added; it must exceed every track ID already in the file, which guarantees uniqueness.
| Field | Size (version 0) | Size (version 1) | Type | Notes |
|---|---|---|---|---|
| Version/Flags | 4 bytes | 4 bytes | Full box header | Version 0 or 1 |
| Creation Time | 4 bytes | 8 bytes | unsigned int | Seconds since 1904-01-01 UTC |
| Modification Time | 4 bytes | 8 bytes | unsigned int | Seconds since 1904-01-01 UTC |
| Timescale | 4 bytes | 4 bytes | unsigned int(32) | Ticks per second |
| Duration | 4 bytes | 8 bytes | unsigned int | In timescale units; all 1s if indeterminate |
| Preferred Rate | 4 bytes | 4 bytes | fixed32 (16.16) | Default 1.0 |
| Preferred Volume | 2 bytes | 2 bytes | fixed16 (8.8) | Default 1.0 |
| Reserved | 2 bytes | 2 bytes | bit(16) | Set to 0 |
| Reserved | 8 bytes | 8 bytes | 2 x unsigned int(32) | Set to 0 |
| Matrix | 36 bytes | 36 bytes | 9 x fixed32 | 3x3 (16.16 except u, v, w in 2.30); default is the identity matrix |
| Pre-defined | 24 bytes | 24 bytes | 6 x unsigned int(32) | Reserved; set to 0 (QuickTime legacy fields) |
| Next Track ID | 4 bytes | 4 bytes | unsigned int(32) | For next track |
This table outlines the structure of the 'mvhd' box, with sizes in bytes and types as defined in the standard; reserved fields pad the box to ensure compatibility.
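The field layout in the table translates fairly directly into fixed offsets. The following sketch decodes an 'mvhd' payload (the bytes after the box's size and type fields, starting with the version and flags) under the layout given above; the function name `parse_mvhd`, the dictionary keys, and the choice to report an indeterminate duration as `None` are assumptions for this example.

```python
import struct
from datetime import datetime, timedelta, timezone

# Timestamps in 'mvhd' count seconds from midnight, 1 January 1904 (UTC).
EPOCH_1904 = datetime(1904, 1, 1, tzinfo=timezone.utc)


def parse_mvhd(payload: bytes) -> dict:
    """Decode an 'mvhd' payload that starts with the version/flags word."""
    version = payload[0]
    if version == 0:
        creation, modification, timescale, duration = struct.unpack_from(">IIII", payload, 4)
        pos, indeterminate = 20, 0xFFFFFFFF
    elif version == 1:
        creation, modification, timescale, duration = struct.unpack_from(">QQIQ", payload, 4)
        pos, indeterminate = 32, 0xFFFFFFFFFFFFFFFF
    else:
        raise ValueError(f"unsupported mvhd version {version}")
    if timescale == 0:
        raise ValueError("timescale must be non-zero")
    # Rate is 16.16 fixed point and volume is 8.8 fixed point.
    rate, volume = struct.unpack_from(">ih", payload, pos)
    # Skip 2 + 8 reserved bytes, then read the nine matrix values.
    matrix = struct.unpack_from(">9i", payload, pos + 16)
    # Skip the 24 pre-defined bytes that follow the 36-byte matrix.
    (next_track_id,) = struct.unpack_from(">I", payload, pos + 76)
    return {
        "version": version,
        "creation_time": EPOCH_1904 + timedelta(seconds=creation),
        "modification_time": EPOCH_1904 + timedelta(seconds=modification),
        "timescale": timescale,
        "duration_seconds": None if duration == indeterminate else duration / timescale,
        "rate": rate / 65536.0,
        "volume": volume / 256.0,
        "matrix": list(matrix),
        "next_track_id": next_track_id,
    }


# Hand-built version 0 payload: timescale 1000, duration 5000 ticks, defaults elsewhere.
example = (
    bytes(4)                                    # version 0, flags 0
    + struct.pack(">IIII", 0, 0, 1000, 5000)    # creation, modification, timescale, duration
    + struct.pack(">ih", 0x00010000, 0x0100)    # rate 1.0, volume 1.0
    + bytes(10)                                 # reserved fields
    + struct.pack(">9i", 0x10000, 0, 0, 0, 0x10000, 0, 0, 0, 0x40000000)  # identity matrix
    + bytes(24)                                 # pre-defined fields
    + struct.pack(">I", 3)                      # next track ID
)
info = parse_mvhd(example)
```

The version 0 payload built here is 100 bytes long, matching the sum of the version 0 column in the table; a version 1 payload would be 112 bytes.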

Edit lists and composition

The Edit Box ('edts'), contained within each Track Box ('trak'), serves as an optional container for edit lists that enable flexible temporal mapping between the presentation timeline and the media data without modifying the underlying samples. This box allows non-destructive editing operations, such as inserting gaps or offsetting the start of a track, by defining segments of the track's timeline. In its absence, the format assumes a direct one-to-one correspondence between presentation and media times.

The Edit List Box ('elst') holds a table of edit entries that specify the duration, starting media time, and playback rate for each segment. The 'elst' box supports versions 0 and 1: version 0 uses 32-bit fields for segment duration and media time, while version 1 uses 64-bit fields for longer content. Each entry includes a segment duration measured in the movie timescale (from the 'mvhd' box), a media time giving the starting time in the track's media timescale (from the 'mdhd' box), where a value of -1 denotes an empty edit that presents silence or blank frames, and a media rate. In ISOBMFF the media rate is normally 1 for regular playback, while a rate of 0 specifies a "dwell" that holds the media at a single media time for the duration of the segment; other rates, such as 2.0 for fast-forward or negative values for reverse playback, derive from QuickTime usage and should not be assumed to work in ISOBMFF readers. Typical use cases include gap filling to offset track starts, repeating media segments, and switching between alternate media portions, all while leaving the original samples untouched in non-linear editing workflows.

The Track Header Box ('tkhd'), part of the Track Box, incorporates a composition matrix that defines spatial transformations for tracks, particularly visual ones, during assembly. This 3x3 fixed-point matrix, stored as nine 32-bit integers, supports operations such as rotation (via off-diagonal coefficients), scaling (by modifying diagonal elements), and translation (through offset terms), with coordinates referenced from the upper-left corner in pixel units. A separate layer field in the 'tkhd' enables track ordering, where lower values position content closer to the viewer, allowing for composited overlays in multi-track scenarios like picture-in-picture effects.
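To make the edit list semantics more concrete, the sketch below maps a presentation time (in movie timescale units) to a media time (in media timescale units) by walking a list of (segment_duration, media_time, media_rate) entries. The tuple representation, the function name, and the use of `None` for empty edits and out-of-range times are assumptions of this example; it treats the rate as a simple multiplier, which covers the usual values of 1 (normal playback) and 0 (dwell).

```python
from typing import List, Optional, Tuple

# (segment_duration in movie timescale, media_time in media timescale, media_rate)
Edit = Tuple[int, int, float]


def presentation_to_media_time(edits: List[Edit], t_movie: int,
                               movie_timescale: int,
                               media_timescale: int) -> Optional[float]:
    """Return the media time shown at presentation time t_movie, or None."""
    segment_start = 0
    for segment_duration, media_time, media_rate in edits:
        if segment_start <= t_movie < segment_start + segment_duration:
            if media_time == -1:
                return None  # empty edit: nothing from this track is presented
            # Convert the offset within the segment into media timescale units.
            offset = (t_movie - segment_start) * media_timescale / movie_timescale
            return media_time + offset * media_rate
        segment_start += segment_duration
    return None  # t_movie lies beyond the last edit


# Two-second empty edit, then ten seconds of media starting at media time 3000.
edits = [(2000, -1, 1.0), (10000, 3000, 1.0)]
during_gap = presentation_to_media_time(edits, 1000, 1000, 30000)
during_media = presentation_to_media_time(edits, 2500, 1000, 30000)
```

Here the movie timescale is 1000 and the media timescale is 30000, so half a second into the second edit corresponds to 15000 media ticks past its starting media time.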

Extensions and variants

Common branded formats

The ISO base media file format (ISOBMFF) serves as the foundation for several branded variants, each identified by specific major and compatible brands declared in the file type ('ftyp') box. These brands indicate compliance with particular profiles or extensions, enabling interoperability across devices and applications while supporting diverse media types such as video, audio, and images.

MP4 (MPEG-4 Part 14) is the most widely adopted format, standardized in ISO/IEC 14496-14. It uses the 'mp41' brand for version 1 files or 'mp42' for version 2, often combined with the 'isom' brand for ISOBMFF compatibility. It encapsulates audio and video streams, commonly employing codecs like H.264/AVC or AAC, and has become the de facto standard for web streaming, mobile devices, and general multimedia distribution due to its broad device and software support.

3GP, developed by the 3rd Generation Partnership Project (3GPP), and 3G2, developed by 3GPP2, are mobile-optimized formats. 3GP uses the '3gp4' brand (for Release 4) or later variants like '3gp5' and '3gp6', alongside 'isom' for base compatibility. 3G2 uses brands such as '3g2a' and '3g2b'. These formats support low-bandwidth scenarios with codecs such as H.263 video and AMR audio, making them suitable for early cellular networks; 3GP targets GSM-based systems, while 3G2 addresses CDMA.

HEIF (High Efficiency Image File Format), defined in ISO/IEC 23008-12, extends ISOBMFF for still images and image sequences, primarily using 'heic' for HEVC-encoded single images or collections and 'hevc' for sequences, with support for advanced features like layered imaging. It enables compact storage of high-quality photos, often with the .heic extension, and is increasingly used on devices for its superior compression over JPEG.

Other notable variants include QuickTime (.mov), which employs the 'qt  ' brand (padded with two trailing spaces) for Apple's container, supporting a wide range of codecs and timelines; M4V (.m4v), an iTunes-specific video format using the 'M4V ' brand for protected H.264 content; and audio-only M4A (.m4a) files, branded 'M4A ', focused on AAC audio tracks without video. These extensions build on the core ISOBMFF structure for specialized use cases such as editing.

To support cross-playback, files often declare multiple compatible brands in the 'ftyp' box, such as 'isom' alongside 'mp41' or '3gp4', allowing parsers to identify supported features without implementing every specification a file conforms to. This multi-brand approach promotes backward compatibility and ecosystem integration.
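The multi-brand mechanism described above can be sketched as a simple set intersection: a reader compares the brands a file declares with the brands it implements. The function name `matching_brands` and the example brand sets are hypothetical, and treating the major brand as part of the declared set is a simplification adopted for this sketch.

```python
from typing import Iterable, List


def matching_brands(major_brand: str, compatible_brands: Iterable[str],
                    supported: Iterable[str]) -> List[str]:
    """Return the declared brands that a reader supports, in sorted order."""
    # Brands are exact four-character codes, so trailing spaces are significant.
    declared = {major_brand, *compatible_brands}
    return sorted(declared & set(supported))


# A reader that only implements the base format can still open an MP4 file
# that lists 'isom' among its compatible brands.
base_reader = {"isom", "iso2"}
mp4_match = matching_brands("mp42", ["isom", "mp42"], base_reader)
mov_match = matching_brands("qt  ", ["qt  "], base_reader)
```

An empty result means the file declares no brand the reader understands, in which case a cautious reader would decline to process it.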

Advanced features and amendments

The fragmented MP4 format extends the ISO base media file format (ISOBMFF) to support dynamic streaming and low-latency delivery by dividing media content into self-contained movie fragments, each consisting of a Movie Fragment Box ('moof') containing metadata for that segment and a Media Data Box ('mdat') holding the corresponding sample data. This structure allows for progressive downloading and playback without requiring the entire file to be available upfront, enabling efficient adaptation to varying network conditions in streaming scenarios. The Segment Index Box ('sidx') provides an index of subsegments within the file, facilitating random access and efficient seeking by listing offsets, durations, and sizes for quick navigation. Complementing this, the Track Fragment Base Media Decode Time Box ('tfdt') specifies the decode time origin for samples in a fragment, ensuring accurate timing synchronization across fragmented tracks even when fragments are received out of order. These features, introduced in amendments to ISO/IEC 14496-12, enhance the format's suitability for adaptive bitrate streaming by minimizing buffering delays and supporting seamless concatenation of fragments. The ISO BMFF Byte Stream Format, standardized by the W3C in 2024, defines a byte-stream representation of ISOBMFF segments tailored for integration with the Media Source Extensions (MSE) API in web browsers, allowing JavaScript applications to process and append media segments incrementally. This specification structures segments as an optional Segment Type Box ('styp') followed by a single 'moof' box and its associated 'mdat', enabling the parsing of fragmented ISOBMFF data as a continuous byte stream without necessitating a full file download. 
By supporting initialization segments for setup and media segments for content delivery, the byte stream format facilitates low-latency live streaming directly in browsers, where media can be demuxed and decoded on the fly, improving compatibility with web-based video players and reducing startup times for real-time applications.

Extensions for spatial and immersive media have advanced ISOBMFF to handle 3D and volumetric content, with Apple specifying dedicated boxes, documented in a 2025 specification, to support stereoscopic and spatial video within the format. These include the Video Extended Usage Box ('vexu'), which signals stereo properties and contains child boxes such as the StereoViewInformationBox ('stri') to indicate the presence of left and right eye views in a single track, and the StereoViewBox ('eyes') to denote stereoscopic configuration. Additional boxes like the HeroStereoEyeDescriptionBox ('hero') designate a primary eye view, while the StereoCameraInformationBox ('cams') and StereoBaselineBox ('blin') describe camera geometry and inter-ocular distance for accurate 3D rendering. For immersive projections, the ProjectionBox ('proj') and HorizontalFieldOfViewBox ('hfov') enable equirectangular or rectilinear mappings with field-of-view metadata, allowing playback systems to render wide-field or 360-degree stereo content. These Apple extensions, built on Multiview High Efficiency Video Coding (MV-HEVC), integrate with ISOBMFF tracks to deliver immersive experiences, such as spatial video for devices like the Apple Vision Pro, by embedding 3D metadata directly in the file structure.

In parallel, MPEG-I standards extend ISOBMFF for point cloud and volumetric media, particularly through ISO/IEC 23090-18:2024, which specifies the carriage of geometry-based point cloud compression (G-PCC) data and associated metadata within the format. This includes mapping G-PCC samples to ISOBMFF tracks, where geometry and attribute data are encapsulated as timed or non-timed items, supporting sparse dynamic point clouds from sources like LiDAR or 3D mapping.
MPEG-I Part 10 further defines carriage for visual volumetric video-based coding (V3C), integrating point cloud compression (PCC) into ISOBMFF for storage and for transport via protocols like DASH, with boxes for component signaling and extraction of sub-parts during decoding. These features enable efficient delivery of immersive 3D scenes, allowing access to subsets of a scene for rendering in immersive applications, while maintaining compatibility with existing ISOBMFF parsers through extensible sample groups.

Protection schemes in ISOBMFF provide mechanisms for content security, primarily through the Common Encryption (CENC) format defined in ISO/IEC 23001-7, which standardizes encryption parameters for audio and video samples using the Advanced Encryption Standard (AES-128), in counter mode for the 'cenc' scheme. The Protection Scheme Information Box ('sinf') encapsulates the overall protection metadata for a track, including the Original Format Box ('frma') to identify the unencrypted codec and the Scheme Type Box ('schm') to specify the protection scheme, such as 'cenc' or 'cbcs' (AES-CBC pattern encryption). Nested within 'sinf' is the Scheme Information Box ('schi'), a container for scheme-specific data such as key IDs, initialization vector parameters, and rights management information required by intellectual property management and protection (IPMP) tools. These boxes enable interoperability across digital rights management (DRM) systems: encrypted tracks are flagged in the Sample Description Box, allowing decoders to apply keys obtained externally without altering the media structure. Widely adopted in protected streaming, the scheme also supports partial (subsample) encryption, which leaves selected bytes of each sample in the clear while preserving format flexibility.

Recent ISO amendments post-2015 have enhanced ISOBMFF to accommodate emerging codecs and multi-view capabilities, notably through updates to ISO/IEC 14496-15 for carriage of network abstraction layer (NAL) unit structured video. The 2020 edition incorporates support for Versatile Video Coding (VVC, ITU-T H.266 / ISO/IEC 23090-3), defining parameter sets and sample entries for VVC bitstreams in tracks, including extensions for layered coding and scalability.
For multi-view video, amendments introduce profiles for multiview HEVC (MV-HEVC) and multiview VVC (MV-VVC), where multiple views are stored in separate or interleaved tracks with dependency signaling via the View Identifier Box ('vwid'), enabling stereoscopic or free-viewpoint rendering. These updates, integrated into ISO/IEC 14496-12:2020, ensure backward compatibility while adding boxes for operating points and layer hierarchies, facilitating efficient storage and extraction of high-efficiency multi-view content for applications like broadcasting.
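To make the protection boxes discussed earlier in this section more concrete, the following minimal sketch reads the children of a Protection Scheme Information Box. It assumes the caller has already located the 'sinf' payload inside a sample entry; the names `child_boxes`, `parse_sinf`, and the sample bytes are hypothetical, and the 'tenc' child is treated as opaque.

```python
import struct


def child_boxes(payload):
    """Yield (type, body) for each child box with a 32-bit size header."""
    pos = 0
    while pos + 8 <= len(payload):
        size, box_type = struct.unpack_from(">I4s", payload, pos)
        if size < 8 or pos + size > len(payload):
            raise ValueError("malformed child box at offset %d" % pos)
        yield box_type, payload[pos + 8:pos + size]
        pos += size


def parse_sinf(payload):
    """Summarize the 'frma', 'schm', and 'schi' children of a 'sinf' box."""
    info = {}
    for box_type, body in child_boxes(payload):
        if box_type == b"frma":
            # Original Format Box: the four-character code of the clear codec.
            info["original_format"] = body[:4].decode("ascii")
        elif box_type == b"schm":
            # Scheme Type Box is a FullBox: skip 1 byte version + 3 bytes flags.
            scheme_type, scheme_version = struct.unpack_from(">4sI", body, 4)
            info["scheme_type"] = scheme_type.decode("ascii")
            info["scheme_version"] = scheme_version
        elif box_type == b"schi":
            # Scheme Information Box: list its children without decoding them.
            info["schi_children"] = [t.decode("ascii") for t, _ in child_boxes(body)]
    return info


def make_box(box_type, payload=b""):
    """Build a box with a 32-bit size header (for the demo only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Hypothetical 'sinf' payload for an AVC track protected with 'cenc'.
demo_sinf = (
    make_box(b"frma", b"avc1")
    + make_box(b"schm", b"\x00" * 4 + b"cenc" + struct.pack(">I", 0x00010000))
    + make_box(b"schi", make_box(b"tenc", b"\x00" * 24))
)
sinf_info = parse_sinf(demo_sinf)
```

A player could use `scheme_type` to decide whether it supports the declared scheme before requesting keys, and `original_format` to select the decoder once samples are decrypted.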

Applications and usage

Streaming and delivery

The ISO base media file format supports progressive download by recommending placement of the moov box at the beginning of the file, enabling immediate access to metadata for playback without waiting for the entire file to download. This structure allows clients to parse timing and structural information early, facilitating partial file playback as data arrives over HTTP. Additionally, the optional Progressive Download Information box (pdin) provides further guidance on download rates and file portions suitable for progressive rendering.

For adaptive streaming, the format integrates with Dynamic Adaptive Streaming over HTTP (DASH) through fragmented file structures, where the Segment Index box (sidx) indexes movie fragments to enable efficient HTTP fetches of adaptive bitrate segments aligned with DASH periods. These fragmented files, branded with identifiers like 'msdh' for DASH media segments, allow seamless switching between quality levels based on network conditions without requiring full file re-parsing.

Low-latency streaming modes leverage self-initializing fragments, which incorporate Sample to Group (sbgp) and Sample Group Description (sgpd) boxes to embed necessary initialization data within each fragment, reducing dependency on prior segments. This enables live broadcasts with end-to-end latencies that can fall under 1 second, as seen in profiles like CMAF low-latency DASH, where fragments can be independently decoded and presented.

Hint tracks, identified by the 'hint' track type, facilitate real-time streaming protocols by packaging media samples into protocol-specific payloads, such as RTP packets for RTSP/RTP delivery. These tracks include instructions for servers to reconstruct and transmit streams, supporting streaming scenarios without altering the underlying media data.

The hierarchical box structure, with explicit size and offset fields, supports efficient seeking via HTTP byte-range requests, allowing clients to fetch specific portions of a file (such as movie fragments or individual samples) based on calculated positions for quick navigation in large files.
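As an illustration of how a Segment Index box can drive byte-range requests, the sketch below converts 'sidx' entries into HTTP `Range` header values. The function name `sidx_byte_ranges` and the sample data are hypothetical; the sketch assumes the 'sidx' box uses a 32-bit size header and that offsets are counted from the first byte after the box.

```python
import struct


def sidx_byte_ranges(buf, box_offset=0):
    """Return one dict per 'sidx' reference with its byte range and timing."""
    size, box_type = struct.unpack_from(">I4s", buf, box_offset)
    if box_type != b"sidx":
        raise ValueError("expected 'sidx', found %r" % box_type)
    pos = box_offset + 8
    version = buf[pos]
    pos += 4  # skip 1 byte version + 3 bytes flags
    _reference_id, timescale = struct.unpack_from(">II", buf, pos)
    pos += 8
    if timescale == 0:
        raise ValueError("timescale must be non-zero")
    if version == 0:
        earliest, first_offset = struct.unpack_from(">II", buf, pos)
        pos += 8
    else:
        earliest, first_offset = struct.unpack_from(">QQ", buf, pos)
        pos += 16
    _reserved, count = struct.unpack_from(">HH", buf, pos)
    pos += 4

    # The first referenced byte lies first_offset bytes after the end of the box.
    start = box_offset + size + first_offset
    time = earliest
    ranges = []
    for _ in range(count):
        ref, duration, _sap = struct.unpack_from(">III", buf, pos)
        pos += 12
        ref_size = ref & 0x7FFFFFFF  # low 31 bits: referenced_size
        ranges.append({
            "range": "bytes=%d-%d" % (start, start + ref_size - 1),
            "start_time": time / timescale,
            "duration": duration / timescale,
            "is_index": bool(ref >> 31),  # top bit set: points at another sidx
        })
        start += ref_size
        time += duration
    return ranges


# Hypothetical version-0 'sidx' with two 2-second subsegments at 90 kHz.
_payload = struct.pack(">IIIIIHH", 0, 1, 90000, 0, 0, 0, 2)
_payload += struct.pack(">III", 1000, 180000, 0x90000000)
_payload += struct.pack(">III", 1500, 180000, 0x90000000)
demo_sidx = struct.pack(">I4s", 8 + len(_payload), b"sidx") + _payload
demo_ranges = sidx_byte_ranges(demo_sidx)
```

To seek to a given time, a client would pick the entry whose `start_time` and `duration` bracket the target and send its `range` value in an HTTP `Range` header.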

Broadcasting and storage

The ISO base media file format (ISOBMFF) integrates with broadcast standards like ATSC 3.0 and DVB to enable efficient IP-based delivery of media content. In ATSC 3.0, ISOBMFF forms the basis for media encapsulation in MPEG Media Transport (MMT), where Media Processing Units (MPUs) are wrapped as self-contained files using the 'mpuf' brand, supporting both real-time streaming and synchronization via UTC timestamps across broadcast and broadband channels. The DVB File Format extends ISOBMFF to handle recording and playback of RTP streams and MPEG-2 transport streams in systems such as DVB-H, DVB-T, and DVB-IP, incorporating reception hint tracks for synchronization with RTCP Sender Reports. MMT further leverages ISOBMFF for MPU delivery over UDP/IP in broadcast environments, with signaling via elements like the MMT Package Table to map assets and ensure robust session management.

For archival purposes, ISOBMFF is recognized by the Library of Congress as a sustainable format suitable for middle- and final-state moving-image and audio content. Its self-contained structure, embedding all necessary technical and descriptive metadata within boxes such as the Movie Header and meta boxes, minimizes risks from external dependencies and supports long-term accessibility without proprietary tools.

ISOBMFF accommodates long-duration content through 64-bit fields in boxes like the Movie Header (version 1), enabling durations exceeding 2^32 timescale units. This matters for extended recordings that surpass the 32-bit limit (about 13 hours and 15 minutes at a 90 kHz timescale). This extensibility, combined with support for large sample sizes and movie fragments, facilitates handling of high-resolution, long-duration files, such as 8K video archives, in professional storage scenarios.

Error resilience in ISOBMFF for broadcast chains is enhanced by optional features like sample auxiliary information and redundant metadata across movie fragments, allowing recovery from transmission errors without full reconstruction.
While the core format lacks built-in integrity-check boxes, extensions and protocol-level mechanisms (e.g., in ROUTE or MMT) provide such checks, with self-describing boxes aiding partial playback even if segments are damaged.

In professional workflows, ISOBMFF is adopted in editing software that natively imports and exports ISOBMFF-based containers such as MP4, enabling MXF-like operations for media exchange and assembly without the structural complexity of MXF. This support streamlines workflows by allowing seamless integration of timed media tracks and metadata, often as a lighter alternative to MXF in automated broadcast pipelines.
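The 32-bit duration limit cited earlier in this section can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the sketch simply treats 2^bits timescale units as the upper bound of a duration field.

```python
def max_duration_seconds(timescale_hz, field_bits):
    """Upper bound, in seconds, of a duration field of the given width."""
    return 2 ** field_bits / timescale_hz


def to_hms(seconds):
    """Split a duration in seconds into whole (hours, minutes, seconds)."""
    whole = int(seconds)
    return whole // 3600, (whole % 3600) // 60, whole % 60


def header_version_for(duration_units):
    """Pick version 1 (64-bit fields) only when 32 bits cannot hold the value."""
    return 1 if duration_units > 0xFFFFFFFF else 0


# 2^32 units at 90 kHz is roughly 47,722 seconds.
limit_32bit = to_hms(max_duration_seconds(90_000, 32))

# A 24-hour recording at 90 kHz needs the 64-bit (version 1) layout.
day_long_units = 24 * 3600 * 90_000
needed_version = header_version_for(day_long_units)
```

With a 90 kHz timescale the 32-bit bound works out to 13 hours, 15 minutes, and 21 seconds, matching the figure quoted above; a coarser timescale such as 1 kHz pushes the same bound to several weeks.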
