WebVTT
WebVTT (Web Video Text Tracks) is a text-based file format initially developed by the Web Hypertext Application Technology Working Group (WHATWG) and subsequently standardized by the World Wide Web Consortium (W3C) for encoding timed text tracks that can be synchronized with HTML5 <video> and <audio> elements via the <track> element.[1] Primarily designed for providing captions and subtitles to enhance video accessibility, it also supports additional functionalities such as chapter markers for navigation, audio descriptions for visual impairments, and time-aligned metadata for applications like interactive content or analytics.[1]
The format uses a simple, human-readable structure consisting of cues, each defined by start and end timestamps followed by the text or data to display, allowing for precise synchronization with media playback.[1] WebVTT files typically have a .vtt extension and adhere to strict parsing rules, including support for UTF-8 encoded Unicode characters, line breaks, and optional headers for metadata like language or styling information.[1] Styling and positioning of cues can be customized using CSS, with features like regions for grouping related text (e.g., roll-up captions) and voice annotations for identifying speakers.[1]
Published as a W3C Candidate Recommendation on 4 April 2019, the specification remains stable for implementation, though it has not advanced to full Recommendation status as of 2025.[1] WebVTT enjoys broad browser compatibility, with full support in major engines including Chrome since version 23, Firefox since 31, Safari since 6.1, and Edge since 12, enabling seamless integration in web applications without plugins.[2] Its adoption has been driven by the need for accessible web media, aligning with standards like the Web Content Accessibility Guidelines (WCAG).[1]
Introduction
Definition and Purpose
WebVTT, or Web Video Text Tracks, is a format specified by the W3C for marking up external text track resources that synchronize timed text with HTML5 <video> and <audio> elements via the <track> element.[1] It enables the display of time-aligned text overlays, such as subtitles and captions, to enhance media accessibility and user experience on the web.[1]
The primary purposes of WebVTT include providing subtitles and captions for dialogue and non-speech audio, chapter titles for navigation within media, and metadata displays for additional context like video descriptions.[1] These features support diverse applications, from educational videos requiring precise timing to entertainment content needing multilingual support.[3]
Key benefits of WebVTT stem from its plain-text format, which is UTF-8 encoded for broad compatibility and ease of editing.[1] It facilitates internationalization through support for BCP 47 language tags and vertical text layouts, such as those used in Japanese.[1] Additionally, WebVTT integrates with web accessibility standards like WCAG by enabling synchronized captions that meet success criteria for prerecorded media, benefiting users who are deaf or hard-of-hearing.[4] The format uses the MIME type text/vtt and the file extension .vtt.[1]
WebVTT evolved from earlier subtitle formats like SubRip (SRT) to address web-specific needs for timed text integration.[3]
History and Development
WebVTT originated from efforts by the Web Hypertext Application Technology Working Group (WHATWG) in 2010, initially developed under the name WebSRT to provide a simple, web-compatible format for timed text tracks in video content.[3] This work was driven by the need for a native subtitle and captioning solution integrated with emerging HTML5 video capabilities, addressing limitations in existing formats like SubRip (SRT) for browser-based playback.[5] The format was soon renamed WebVTT (Web Video Text Tracks) to better reflect its broader scope beyond SRT compatibility.[5]
On January 13, 2011, WebVTT was formally introduced in the World Wide Web Consortium (W3C) HTML5 Working Draft, alongside the <track> element, marking its integration into the evolving HTML5 specification as a standard for external text track resources.[6] Key early contributions came from Ian Hickson, the primary editor of the WHATWG's HTML Living Standard, who spearheaded the initial design and specification within the WHATWG's iterative process. The W3C became more actively involved starting in 2012, with the Web Media Text Tracks Community Group publishing the first draft of a dedicated WebVTT specification on August 10, 2012, to align it with W3C's standardization goals.[7]
Silvia Pfeiffer played a pivotal role in the W3C's efforts, serving as editor of the WebVTT specification from 2013 onward and ensuring compatibility with web accessibility requirements.[1] The specification advanced to Candidate Recommendation status on April 4, 2019, reflecting broad implementation stability across browsers, and it has not changed significantly since. As of 2025, WebVTT continues to be maintained, and it was selected as a focus area for the Interop 2025 initiative to improve cross-browser compatibility.[8] The format has also been incorporated into the WHATWG's HTML Living Standard, which serves as the normative reference for ongoing evolution, and both organizations continue to maintain it. It has not progressed to full W3C Recommendation, owing to the format's maturity and the preference for living standards.[1]
File Structure
A WebVTT file begins with a mandatory signature: the string "WEBVTT" at the start of the first line, optionally followed by a space or tab character and then arbitrary text, and terminated by two or more line terminators that end the signature line and separate it from the rest of the file. A line terminator is a carriage return (U+000D) followed by a line feed (U+000A), a lone line feed (U+000A), or a lone carriage return (U+000D).[9] This signature identifies the file format and must be present for the file to be recognized as valid WebVTT.[9] An optional byte order mark (BOM) for UTF-8 encoding may precede the signature to indicate the character encoding.[10]
The overall structure of a WebVTT file comprises this signature header, followed by an optional sequence of comment blocks and WebVTT cue blocks.[9] Comment blocks, which start with the word "NOTE" and can span multiple lines, may appear anywhere after the signature except within cue blocks.[9] Cue blocks, which contain the timed text data, form the core content and are separated from other blocks by empty lines.[9] The file must not contain binary data and is encoded exclusively in UTF-8.[9]
Parsing of WebVTT files is line-based, with lines delimited by line feed (U+000A), carriage return (U+000D), or the sequence carriage return followed by line feed (CRLF).[10] Empty lines, consisting solely of line terminators or whitespace, serve to separate blocks and are otherwise ignored during processing.[10] For readability, it is recommended to keep individual lines under 80 characters, though no strict maximum length is enforced by the specification.[9]
If a file lacks the proper signature, parsing is aborted and the file is treated as an empty text track with no cues; malformed cues later in a correctly signed file are skipped individually rather than invalidating the whole file.[10] This error handling ensures robust processing in user agents, preventing crashes from malformed inputs.[10]
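The signature rule can be illustrated with a small check. This is a minimal sketch rather than the specification's parsing algorithm: the function name `hasWebVttSignature` is hypothetical, and it inspects only the first line, accepting optional text after a space or tab as user agents do.

```javascript
// Hypothetical helper: returns true if `text` starts with a WebVTT signature line.
// Simplified sketch; a real parser follows the full algorithm in the specification.
function hasWebVttSignature(text) {
  // Strip an optional byte order mark (U+FEFF once the file has been decoded).
  const body = text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
  // The first line is everything up to the first CRLF, CR, or LF.
  const firstLine = body.split(/\r\n|\r|\n/, 1)[0];
  if (!firstLine.startsWith("WEBVTT")) return false;
  // "WEBVTT" must stand alone or be followed by a space or tab.
  const rest = firstLine.slice(6);
  return rest === "" || rest[0] === " " || rest[0] === "\t";
}
```

A loader could call this before attempting to read any cues and treat a `false` result as a track with no cues.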
Cue Blocks and Timing
A WebVTT cue block forms the fundamental unit for delivering timed text, such as captions or subtitles. It consists of an optional cue identifier on its own line, followed by a timings line containing start and end timestamps separated by "-->" and, optionally, cue settings placed after the end timestamp on that same line. The cue text begins on the next line and spans one or more lines until the next blank line or the end of the file.[11] The cue identifier, if present, is a unique string that must not contain the string "-->" or any line breaks, allowing references to specific cues within the file.[12] The cue text itself represents the visible payload, such as dialogue or descriptions, and can include line breaks that are preserved during rendering.[13]
Timestamps in WebVTT follow a precise format: an optional hours component of two or more digits followed by a colon, then two digits each for minutes (00-59) and seconds (00-59) separated by a colon, a decimal point, and exactly three digits for milliseconds (000-999), resulting in patterns like "MM:SS.mmm" or "HH:MM:SS.mmm".[14] This format provides a default time step value of 0.001 seconds, enabling sub-second precision for synchronization with media playback.[14]
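To make the timestamp pattern concrete, the following sketch converts a timestamp string into seconds. The helper name `parseTimestamp` is an assumption for illustration; it accepts an optional hours component of two or more digits and rejects anything that does not match the pattern described above.

```javascript
// Hypothetical helper: converts a WebVTT timestamp to seconds, or returns null if malformed.
function parseTimestamp(stamp) {
  // Optional hours, then two-digit minutes and seconds (00-59), a dot, and three-digit milliseconds.
  const match = /^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$/.exec(stamp);
  if (!match) return null;
  // An omitted hours component is treated as zero.
  const [, hours = "0", minutes, seconds, millis] = match;
  return Number(hours) * 3600 + Number(minutes) * 60 + Number(seconds) + Number(millis) / 1000;
}
```

Returning `null` instead of throwing lets a caller skip a malformed cue and continue with the rest of the file.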
For timing validation, the start timestamp of a cue must be greater than or equal to the start timestamps of all preceding cues in the file, ensuring cues are ordered non-decreasingly by start time, while the end timestamp must be strictly greater than the cue's start timestamp.[15] Overlaps between cues are permitted, allowing multiple cues to be active simultaneously if their time intervals intersect.[15]
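These two ordering rules are simple enough to check mechanically. The sketch below assumes cues are plain objects with `startTime` and `endTime` in seconds, listed in file order; both that shape and the name `checkCueTimings` are illustrative choices, not part of any API.

```javascript
// Hypothetical validator: returns a list of problems found in cues given in file order.
function checkCueTimings(cues) {
  const problems = [];
  let latestStart = 0;
  cues.forEach((cue, index) => {
    // The end time must be strictly greater than the start time.
    if (cue.endTime <= cue.startTime) {
      problems.push(`cue ${index}: end time must be after start time`);
    }
    // Start times must not decrease from one cue to the next.
    if (cue.startTime < latestStart) {
      problems.push(`cue ${index}: starts before an earlier cue`);
    }
    latestStart = Math.max(latestStart, cue.startTime);
  });
  return problems;
}
```

Overlapping cues produce no problems here, since overlap is permitted.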
Cue text supports automatic line wrapping by user agents to fit within the rendering area, while explicit line breaks in the text are kept as line breaks in the rendered cue.[16] It also accommodates voice spans via the <v> tag (e.g., <v Alice>Hello</v>), which identifies speakers and may omit the closing tag if spanning the entire cue, as well as language spans using <lang> with a BCP 47 language tag (e.g., <lang en>English text</lang>).[17][18] Basic inline formatting is enabled through tags like <i> for italics, <b> for bold, and <u> for underline, along with support for <c> spans for styling classes and embedded timestamps for internal cue timing.[13]
In the processing model, WebVTT cues within a file are sorted by their start time to establish the text track cue order, which governs how they are parsed and applied.[15] A cue becomes active when the media's current playback time is greater than or equal to its start time and less than its end time, at which point it is rendered by converting the cue text into CSS boxes and displaying them overlaid on the video viewport during that interval.[19] Active cues are updated dynamically as playback progresses, with rendering ceasing once the end time is reached.[19]
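The activation rule can be expressed in a few lines. This is a sketch under the assumption that cues are plain objects with `startTime`, `endTime`, and `text`; in a browser, the equivalent information is exposed through a track's `activeCues` list, so the function name `activeCuesAt` is only illustrative.

```javascript
// Returns the cues active at `currentTime`, ordered by start time.
// Cues are assumed to be plain objects { startTime, endTime, text } with times in seconds.
function activeCuesAt(cues, currentTime) {
  return cues
    // Active means: start time <= current time < end time.
    .filter((cue) => cue.startTime <= currentTime && currentTime < cue.endTime)
    // Earlier start first; for equal starts, the later end time comes first.
    .sort((a, b) => a.startTime - b.startTime || b.endTime - a.endTime);
}
```

Because `filter` returns a new array, the input list is left untouched.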
For illustration, a sample cue block might appear as:
1
00:00:01.000 --> 00:00:04.000
<v Sam>Hello, world!</v>
This example demonstrates an identifier "1", timings from 1 second to 4 seconds, and cue text with a voice span.[11]
Header and Settings
The WebVTT file header follows the mandatory "WEBVTT" file signature and consists of optional comment blocks, style blocks, and region definition blocks that appear before the first cue or the end of the file.[20] Comment blocks begin with the keyword "NOTE" followed by zero or more lines of text, serving as ignored annotations for authors or tools; they end at the next blank line.[21] Region definition blocks start with a line containing "REGION", followed by a list of settings separated by spaces, tabs, or line breaks, that configure reusable rendering areas for grouping multiple cues, such as for roll-up captions in live content. These settings include id (a unique identifier string), width (default 100%, as a percentage of viewport width), lines (default 3, for height in lines), regionanchor (default 0%,100%, the anchor point within the region), viewportanchor (default 0%,100%, the position in the viewport), and scroll (up or none, default none).[22]
Global configuration options in WebVTT files are minimal and not formally parsed by the core specification, but common practices include metadata lines immediately after the signature for tool interoperability. For instance, lines like "Kind: captions" indicate the track type (e.g., subtitles, captions, chapters, or metadata), though the actual kind is determined by the HTML <track> element's kind attribute rather than the file itself.[23] Similarly, "Language: en" declares a default BCP 47 language tag (e.g., en for English), but the specification does not parse this globally; language is instead applied per cue via <lang> tags within cue text or externally via the <track> element's srclang attribute.[24] Alignment options, such as global text direction, are not defined in the header but can be influenced by CSS or per-cue settings.
Per-cue settings appear on the same line as a cue's timing information, after the end timestamp, and consist of space- or tab-separated name:value pairs that control the rendering position, size, and alignment of individual cues, overriding the defaults.[25] Valid settings include:
vertical:rl or vertical:lr, which sets the writing direction to vertical with lines growing right-to-left (rl, as commonly used for Japanese) or left-to-right (lr); if unspecified, horizontal layout is used.[26]
line:N (where N is an integer line number or a percentage, e.g., line:80%), positioning the cue box vertically; non-negative line numbers count from the top and negative ones from the bottom, and an optional alignment appended after a comma (start, center, or end, e.g., line:80%,start) specifies which part of the cue box is aligned at that position.[27]
position:N% (N between 0 and 100), placing the cue box horizontally; an optional alignment appended after a comma (line-left, center, or line-right, e.g., position:10%,line-left) specifies which part of the cue box sits at that position.[28]
size:N% (N between 0 and 100, default 100), defining the width of the cue box as a percentage of the viewport.[29]
align:start, align:center, align:end, align:left, or align:right, controlling text alignment within the cue box (default center).[30]
These settings are parsed by splitting the settings string on spaces and tabs, then processing each name:value pair, where name and value are separated at the first colon and names are matched case-sensitively; unrecognized or invalid settings are ignored without error, ensuring robust parsing.[31] Additionally, a cue can be assigned to a region defined in the header by including region:id in its settings, where id matches the identifier given in the region definition.[32]
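The lenient splitting behaviour described above might be sketched as follows. The helper `parseCueSettings` is hypothetical and deliberately incomplete: it recognizes setting names and skips malformed tokens, but it does not validate the values themselves as a conforming parser would.

```javascript
// Hypothetical helper: parses a cue settings string into a { name: value } object.
// Unknown names and malformed pairs are skipped; values are NOT validated in this sketch.
function parseCueSettings(settings) {
  const known = new Set(["vertical", "line", "position", "size", "align", "region"]);
  const result = {};
  for (const token of settings.trim().split(/[ \t]+/)) {
    const colon = token.indexOf(":");
    // Skip tokens with no colon, an empty name, or an empty value.
    if (colon <= 0 || colon === token.length - 1) continue;
    const name = token.slice(0, colon);
    if (known.has(name)) result[name] = token.slice(colon + 1);
  }
  return result;
}
```

For example, passing the text that follows the end timestamp in `00:00:01.000 --> 00:00:04.000 line:80% align:start` yields an object with `line` and `align` entries.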
Advanced Features
Regions and Positioning
Regions in WebVTT allow authors to define named subareas within the video viewport for rendering groups of cues, enabling more precise spatial control over subtitle and caption placement.[33] A region is specified in the file header using the REGION keyword followed by settings such as id, width, lines, regionanchor, viewportanchor, and scroll. For example, a REGION block with the settings id:sidebar lines:3 regionanchor:0%,100% viewportanchor:10%,90% scroll:up creates a three-line high area anchored at the bottom-left of itself and positioned 10% from the left and 90% from the top of the viewport, with existing cues scrolling upward to make room for incoming ones.[34] These settings configure the region's dimensions and anchoring: width as a percentage of the viewport (default 100%), lines as the height in text lines (default 3), regionanchor as the region's internal attachment point (default 0%,100%), viewportanchor as the viewport attachment point (default 0%,100%), and scroll as the behavior for cue overflow (none or up).[35] Cues can then reference a region by its id in their settings list, such as region:sidebar, directing them to render within that bounded area rather than the full viewport.[36]
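The anchoring arithmetic can be worked through numerically. The sketch below assumes lengths in CSS pixels and takes the line height as an input, since the real value depends on the user agent's font metrics; the function `regionBox` and its property names are illustrative and do not correspond to a browser API.

```javascript
// Sketch of region placement: aligns the region's anchor point with the viewport's anchor point.
// `lineHeight` (pixels per text line) is an assumed input, not something the file specifies.
function regionBox(region, viewport, lineHeight) {
  const percentOf = (percent, total) => (percent * total) / 100;
  const width = percentOf(region.width, viewport.width);
  const height = region.lines * lineHeight;
  // Viewport anchor position minus the anchor's offset inside the region.
  const left = percentOf(region.viewportAnchorX, viewport.width) - percentOf(region.regionAnchorX, width);
  const top = percentOf(region.viewportAnchorY, viewport.height) - percentOf(region.regionAnchorY, height);
  return { left, top, width, height };
}
```

With a 640x480 viewport, a 20-pixel line height, and a region of width 40%, 3 lines, regionanchor 0%,100%, and viewportanchor 10%,90%, the box is 256 by 60 pixels with its top-left corner at (64, 372): its bottom-left corner sits 10% from the left and 90% from the top of the viewport.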
Positioning controls for individual cues provide fine-grained layout options independent of or in conjunction with regions, focusing on horizontal and vertical placement.[37] Key settings include line, which specifies the cue box's vertical position as an integer line number (non-negative from the top, negative from the bottom) or a percentage (default auto, placing at the bottom); position, which sets the offset along the direction of the text, horizontal for horizontal cues and vertical for vertical ones, as a percentage (default auto, centered unless the alignment says otherwise); size, defining the cue box width as a percentage (default 100%); and align, controlling text alignment within the box (start, center, end, or left/right).[38] Additionally, vertical switches the cue to a vertical layout whose lines grow right-to-left (rl) or left-to-right (lr), though this interacts with region usage.[39] For instance, a cue setting like line:80% position:10% size:50% align:start would produce a box half the viewport wide, starting 10% in from the edge and placed 80% of the way down from the top.[40]
The rendering model for regions and positioning snaps cue boxes to discrete lines or exact positions based on the video's font metrics, ensuring consistent placement across devices.[41] In regions, cues are stacked vertically from the region's bottom, with overflow handled by the scroll setting: upward scrolling shifts existing cues to reveal new ones, mimicking live caption roll-up, while none causes later cues to overwrite earlier ones.[42] Outside regions, cues use the viewport's line grid, where line values map to row offsets (e.g., line n aligns to the nth line from the top or bottom), and position percentages compute absolute offsets relative to the writing direction.[43] If a cue specifies a line, a size other than 100%, or a vertical writing direction, it cannot be assigned to a region and falls back to viewport rendering to avoid conflicts.[44]
Common use cases for regions and positioning include creating multi-line subtitles that occupy a dedicated bottom area without overlapping video content, or side-by-side translations where one region holds English cues on the left and another holds dubbed text on the right.[45] Roll-up captions for live streams benefit from scrolling regions, allowing a fixed number of recent lines to display while older ones move up and out.[46] Positioning settings enable alignment of cues to specific on-screen elements, such as bottom-center for standard subtitles or edge-aligned for speaker identification.[47]
Despite these capabilities, WebVTT regions have limited browser support, with full implementation only in Firefox (version 59+) and Safari (version 14.1+), but not in Chrome or Edge, leading to fallback rendering in the default viewport position.[48] Additionally, regions support only horizontal cues, and vertical writing modes or explicit positioning settings disable region assignment, restricting their flexibility in diverse layouts.[49]
Styling and Text Formatting
WebVTT supports inline markup tags within cue text to apply basic text formatting, allowing authors to emphasize or annotate content directly in the subtitle or caption payload. The supported tags include <b> for bold text, <i> for italic text, <u> for underline text, <ruby> for ruby annotations (which typically encloses base text and uses an inner <rt> tag for the ruby text overlay), and <v> for voice spans that label speaker identities (e.g., <v Alice>Hello</v>). These tags are parsed as a tree structure during rendering, enabling precise control over visual presentation without altering the underlying timing or positioning.[50]
Styling in WebVTT extends beyond inline tags through integration with CSS, where the ::cue pseudo-element targets all cue content for global modifications such as color: red; or font-family: Arial;. More granular control is possible by selecting specific tags, like ::cue(b) { font-weight: bold; } to customize bold elements, or using descendant selectors such as ::cue(i u) { text-decoration: underline wavy; } for combined italic and underline effects. Additionally, the ::cue-region pseudo-element allows styling of entire regions if defined, though it applies uniformly to cues within that area. This CSS approach ensures consistent theming across cues while respecting the media element's stylesheet cascade.[51]
External and embedded styling options further enhance flexibility. Authors can include a STYLE block after the WebVTT signature and before the first cue to embed CSS rules directly in the file; such a block consists of a line containing STYLE followed by the rules on subsequent lines, such as ::cue { background: rgba(0,0,0,0.8); color: white; }, which are parsed as a CSS stylesheet. Alternatively, CSS supplied by the HTML document through <style> or <link> elements can target the video element, applying rules that propagate to cues (e.g., video::cue { visibility: visible; }). This separation promotes reusability, as styles can be updated independently of the caption track.[52]
Formatting rules in WebVTT enforce robustness during parsing: tags may nest arbitrarily to create complex hierarchies (e.g., <b><i>emphasized bold</i></b>), but malformed or unrecognized tags are simply stripped, preserving the surrounding text without error. Cue text must not contain the sequence -->, which is reserved for timing lines, and bidirectional text is handled natively through Unicode bidirectional algorithm controls, such as left-to-right marks (U+200E) to ensure proper script direction in mixed-language cues. These mechanisms maintain readability across diverse linguistic contexts.[53]
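As a rough illustration of how markup can be discarded while the text survives, the sketch below reduces cue text to plain text, for instance to build a transcript. It is a simplification, not the specification's tokenizer: the name `cueTextToPlain` is hypothetical, every tag-like span is dropped, and only the small set of character references WebVTT cue text uses is decoded.

```javascript
// Character references that can appear in WebVTT cue text.
const CUE_ENTITIES = { amp: "&", lt: "<", gt: ">", lrm: "\u200e", rlm: "\u200f", nbsp: "\u00a0" };

// Hypothetical helper: strips tags (including embedded timestamps) and decodes character references.
function cueTextToPlain(cueText) {
  return cueText
    // Remove anything that looks like a tag, whether recognized or not.
    .replace(/<[^>]*>/g, "")
    // Decode references in a single pass so that "&amp;lt;" becomes "&lt;", not "<".
    .replace(/&(amp|lt|gt|lrm|rlm|nbsp);/g, (match, name) => CUE_ENTITIES[name]);
}
```

Tags are removed before references are decoded so that an escaped `&lt;b&gt;` is kept as literal text rather than being mistaken for markup.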
Accessibility remains a core consideration in WebVTT styling, where inline tags like <v> support accessible captions by clearly attributing dialogue to speakers, aiding screen reader navigation and user comprehension. Styles applied via CSS must prioritize contrast and legibility: the :past and :future pseudo-classes, which match text before or after an embedded timestamp, and default color classes such as .white for bright foregrounds should be used in ways that align with WCAG guidelines and do not obscure essential content, ensuring cues remain perceivable for users with visual or cognitive impairments. Overly decorative effects are discouraged if they compromise clarity.[54]
WebVTT supports non-display functionalities through specific cue kinds and structural elements, enabling the embedding of timed data that enhances media interactivity without visual rendering. When a WebVTT track has its kind set to "metadata" via the HTML <track> element, its cues are interpreted as time-aligned metadata rather than displayed content.[1] These cues are hidden from the user interface and are primarily processed by scripting applications or media controllers to deliver structured information synchronized with the media timeline, such as timed annotations or triggers for dynamic events.[1]
For video navigation, when a WebVTT track has its kind set to "chapters" via the <track> element, its cues provide chapter titles associated with their timestamp ranges.[1] These cues must be non-overlapping and nested within the media duration to function as navigation targets, allowing media players to generate chapter menus or seek points based on the timestamps and titles.[1] The chapter title is derived by concatenating the text content of the cue in a specific traversal order, excluding any non-text elements like ruby annotations.[1]
Comment blocks in WebVTT files begin with the keyword "NOTE" and are entirely ignored during parsing and rendering, serving solely as annotations for file authors or developers.[1] These blocks can include multiple lines of explanatory text, such as timing notes or revision history, and are placed between other elements without affecting the file's functionality.[1]
The cue text in metadata or chapter cues can embed structured data formats, such as JSON objects, to facilitate advanced applications like timed API interactions or data serialization.[1] For instance, a metadata cue might contain {"event": "slide_change", "id": 5} to signal a scripting trigger at a precise moment.[1] This flexibility allows for custom name-value pairs or serialized content that applications can parse without visual dependencies.
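Reading such a payload might look like the sketch below. The helper name `parseMetadataCue` is hypothetical; it assumes the application has chosen JSON as its payload format, which the WebVTT format itself does not require.

```javascript
// Hypothetical helper: parses the text of a metadata cue as JSON.
// Returns null for payloads that are not valid JSON instead of throwing.
function parseMetadataCue(cueText) {
  try {
    return JSON.parse(cueText);
  } catch {
    return null;
  }
}
```

In a browser, this would typically be called from a `cuechange` event handler on a text track whose `mode` is `hidden`, passing the `text` of each cue in the track's `activeCues` list.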
Common use cases for these non-display features include creating interactive transcripts by linking metadata cues to searchable text segments, enabling search indexing of media content through timestamped keywords, and integrating with media players for automated navigation or event handling.[1] In interactive scenarios, metadata cues can drive dynamic overlays or analytics, while chapters support user-friendly seeking in long-form videos.[1]
webvtt
WEBVTT

NOTE
This file contains metadata for annotations and chapters.
(Use with <track kind="metadata"> or <track kind="chapters"> in HTML.)

00:00:00.000 --> 00:00:10.000
{"type": "annotation", "label": "Introduction"}

00:00:10.000 --> 00:01:00.000
Chapter 1: Overview

NOTE End of first section; next chapter at 01:00.
Comparisons
Differences from SubRip
WebVTT introduces several enhancements over the SubRip (.srt) format, primarily to support web-based video integration and advanced rendering capabilities. While both formats are text-based and designed for timed subtitles, WebVTT's structure allows for greater flexibility in positioning, styling, and metadata integration, making it more suitable for HTML5 environments.[1]
One key difference lies in timestamp formatting and precision. WebVTT employs a format of HH:MM:SS.mmm, using a decimal point to separate seconds from milliseconds (e.g., 00:12:34.567), which supports three-digit millisecond precision and is aligned with international time notation standards. In contrast, SubRip uses HH:MM:SS,mmm with a comma separator for milliseconds (e.g., 00:12:34,567), also providing millisecond-level accuracy but following a locale-specific convention that can lead to parsing inconsistencies in global applications.[1][55]
Cue identifiers represent another advancement in WebVTT. Each cue can optionally include a unique alphanumeric identifier immediately before the timestamps (e.g., "cue1\n00:00:01.000 --> 00:00:04.000"), facilitating scripting, CSS targeting, and reference in external documents, which is particularly useful for interactive web media. SubRip, however, relies solely on sequential numeric counters (e.g., "1") for ordering cues, without support for unique identifiers that enable such programmatic access.[1][56]
WebVTT significantly expands on settings and positioning options per cue. It includes configurable attributes in the cue header, such as align:start|center|end for text alignment, position:n% for horizontal placement, line:n% for vertical offset, size:n% for text width, and region:id for predefined areas on the video viewport, allowing subtitles to appear anywhere on screen rather than fixed at the bottom. SubRip lacks these granular controls, offering only basic override tags for left/center/right alignment within the cue text itself, with no native support for vertical positioning or regions, leaving rendering largely to the media player.[1][55]
In terms of styling and text formatting, WebVTT provides robust options beyond plain text. It supports inline HTML-like tags (e.g., <b> for bold, <i> for italics, <u> for underline, <ruby> for ruby text) within cue payloads and allows external CSS styling via embedded STYLE blocks or linked stylesheets, enabling font customization, colors, and animations tailored to web design. SubRip supports limited inline formatting through HTML-derived tags (e.g., <i> or <b>), but these are rudimentary, player-dependent, and do not integrate with CSS or advanced web styling, restricting it to basic emphasis without broader visual enhancements.[1][56]
The file header structure also diverges notably. WebVTT mandates a signature line reading "WEBVTT" at the file's start, followed by optional metadata lines (e.g., Language: en), which declares the format and provides context for parsers. SubRip files have no required header and begin directly with the first cue's numeric identifier, simplifying creation but reducing interoperability cues for automated processing.[1][55]
Finally, WebVTT's extensibility surpasses SubRip's subtitle-centric design. It accommodates metadata tracks through JSON-like cue payloads (e.g., for descriptions or annotations) and chapter markers via specially formatted cues that integrate with HTML5 navigation, broadening its use to non-subtitle purposes like timed web documents. SubRip remains confined to subtitle delivery, with no standardized mechanisms for metadata or chapters, limiting its scope to basic captioning without additional file types or extensions.[1][56]
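The timestamp and header differences are small enough that a basic conversion fits in a few lines. The following is a minimal sketch under stated limits: the function name `srtToVtt` is hypothetical, and it only adds the signature and rewrites the millisecond separator on timing lines, without translating SubRip formatting tags or positioning overrides.

```javascript
// Minimal SRT-to-WebVTT sketch: adds the signature and rewrites timestamp separators.
// SubRip's numeric counters are kept and simply become WebVTT cue identifiers.
function srtToVtt(srt) {
  const body = srt
    // Normalize CRLF and lone CR line endings to LF.
    .replace(/\r\n?/g, "\n")
    // Only touch commas inside timing lines such as "00:00:01,000 --> 00:00:04,000".
    .replace(/^(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})/gm, "$1.$2 --> $3.$4")
    .trim();
  return "WEBVTT\n\n" + body + "\n";
}
```

Restricting the replacement to timing lines leaves commas in the subtitle text itself untouched.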
Relation to Other Standards
WebVTT, as a text-based format, contrasts with the XML-based Timed Text Markup Language (TTML), which is designed for more complex applications in broadcast and digital television environments.[57] TTML supports hierarchical structures and advanced features like animations, while WebVTT emphasizes simplicity with flat cue blocks suitable for web delivery.[1] A subset of TTML, known as SMPTE-TT (based on SMPTE ST 2052-1), is commonly used in professional media workflows, whereas WebVTT prioritizes lightweight parsing for online video.[57]
In terms of integration with HTML5, WebVTT serves as the native format for the <track> element within <video> and <audio> tags, enabling seamless synchronization of captions, subtitles, and chapters without additional processing.[1] TTML, by contrast, lacks this direct support and typically requires conversion to WebVTT or other web-compatible formats for browser rendering.[1]
WebVTT is widely incorporated into adaptive streaming protocols such as HTTP Live Streaming (HLS) and MPEG-DASH for delivering subtitles in manifests. In HLS, it employs segmented WebVTT files aligned with media segments (often 6-30 seconds) to ensure low-latency caption delivery. For MPEG-DASH, WebVTT segments are embedded in Media Presentation Description (MPD) files, allowing out-of-band subtitle tracks that conform to DASH Industry Forum guidelines.[58]
Regarding accessibility, WebVTT aligns with Web Content Accessibility Guidelines (WCAG) 2.1 Success Criterion 1.2.2 by providing synchronized text equivalents for prerecorded audio in video content, using the <track kind="captions"> attribute to meet Level A requirements.[59] It complements WAI-ARIA practices by handling timed media tracks, while ARIA attributes enhance dynamic, non-media elements like live regions or custom controls in video players.[60]
Looking toward future developments, efforts within the W3C Timed Text Working Group explore interoperability between WebVTT and TTML-based profiles like IMSC (TTML Profiles for Internet Media Subtitles and Captions), potentially enabling hybrid formats that bridge web and broadcast ecosystems through mappings and shared features for global subtitle delivery.[61]
Implementation
Browser Compatibility
WebVTT has achieved broad core support for parsing and displaying basic cues across major browsers since the mid-2010s. All contemporary versions of Chrome (23+), Firefox (31+), Safari (6+), and Edge (12+) handle fundamental WebVTT functionality, such as timed text tracks for subtitles and captions integrated with HTML5 <video> and <audio> elements.
| Browser | Basic Support Version | Initial Release Year |
|---|---|---|
| Chrome | 23+ | 2013 |
| Firefox | 31+ | 2014 |
| Safari | 6+ | 2012 |
| Edge | 12+ | 2015 |
| Internet Explorer | 10+ (partial) | 2012 |
Support for advanced features varies, with CSS styling via the ::cue pseudo-element fully implemented in Chrome and Edge from their initial WebVTT versions, allowing custom fonts, colors, and layouts for cues.[62] In Firefox, ::cue support arrived in version 55 (2017), but more granular selectors like ::cue() for specific cue elements were not fully available until version 78 (2020), and ::cue-region remains unsupported across all browsers. Regions, which define spatial areas for cue placement, are supported in Firefox 59+ and Safari 14.1+, but lack implementation in Chrome and Edge as of 2025.[48]
Mobile browsers mirror desktop support closely, with full core and styling capabilities in Chrome for Android (25+) and Safari on iOS (7+). Older versions of Internet Explorer (10+) provide partial support for basic cues but omit advanced styling and positioning features.[62][64][63]
Notable gaps include inconsistent rendering of voice tags (<v>) in Firefox, where speaker identification may not display distinctly without custom CSS, and limited API exposure for metadata cues in certain TextTrack implementations, preventing JavaScript access to non-visual data in some browsers.[65] Developers can test compatibility using the VTTCue interface, which is available in supported browsers for creating and manipulating cues programmatically, and employ polyfills like those from MediaElement.js for legacy environments lacking native support.[66]
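A feature check along these lines can decide whether a polyfill is needed. This is a small sketch: the function name `supportsVttCue` and its optional `scope` parameter are illustrative, with the parameter added only so the check can be exercised outside a browser.

```javascript
// Returns true when the given global scope exposes the VTTCue constructor.
// `scope` defaults to globalThis, which is the window object in browsers.
function supportsVttCue(scope = globalThis) {
  return typeof scope.VTTCue === "function";
}
```

A page could load a polyfill only when this returns `false`. Note that it detects the constructor alone and says nothing about support for regions or `::cue` styling.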
Usage in HTML5
WebVTT files are integrated into HTML5 web pages using the <track> element nested within a <video> or <audio> element to provide synchronized text tracks such as subtitles, captions, chapters, or metadata. The <track> element requires attributes like kind (e.g., "subtitles" or "captions"), src for the WebVTT file URL, srclang for the language code, and optionally label for user-facing descriptions and default to auto-select the track. This enables browsers to fetch, parse, and render the cues automatically during media playback. For instance:
```html
<video controls width="640" height="480">
  <source src="example.mp4" type="video/mp4">
  <track kind="subtitles" src="subtitles.vtt" srclang="en" label="English" default>
</video>
```
In this setup, the browser loads the WebVTT file asynchronously and displays cues based on the current playback time.[1]
The JavaScript API for WebVTT centers on the TextTrack interface, which manages individual text tracks, and the VTTCue constructor for defining dynamic cues. A TextTrack object is obtained from a media element's textTracks collection or created via addTextTrack(), exposing properties like kind, label, language, and mode (which can be set to 'showing' for visible rendering, 'hidden' for internal processing without display, or 'disabled' to ignore the track). Cues are instantiated with new VTTCue(startTime, endTime, text)—where times are in seconds—and added to a track using addCue(cue), allowing runtime modifications to subtitle content. An example of dynamic cue creation is:
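```javascript
const video = document.querySelector('video');
// Create a script-managed track; no <track> element is needed.
const track = video.addTextTrack('subtitles', 'Dynamic Subtitles', 'en');
track.mode = 'showing'; // render cues on screen
// Times are in seconds: show the text from 1.0 s to 4.0 s.
const cue = new VTTCue(1.0, 4.0, 'This is a dynamic subtitle.');
track.addCue(cue);
```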
This API facilitates interactive applications, such as generating captions from live transcripts.[67]
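As an illustration of the transcript use case, the sketch below converts a batch of transcript segments into cues. The segment shape { start, end, text } and the function name addSegmentsAsCues are assumptions made for this example; the cue constructor is passed in so the logic does not depend on a browser global.

```javascript
// Sketch: turn transcript segments into cues and add them to a text track.
// The segment shape { start, end, text } is an assumption for illustration.
function addSegmentsAsCues(track, segments, CueCtor) {
  let added = 0;
  for (const { start, end, text } of segments) {
    // Skip segments that would produce an empty or zero-length cue.
    if (!(end > start) || !text) continue;
    track.addCue(new CueCtor(start, end, text));
    added += 1;
  }
  return added; // number of cues actually added
}

// Browser usage (not executed here):
//   const track = video.addTextTrack('captions', 'Live captions', 'en');
//   track.mode = 'showing';
//   addSegmentsAsCues(track, segments, VTTCue);
```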
Event handling in WebVTT usage revolves around the 'cuechange' event on TextTrack objects, which fires whenever the set of active cues changes—typically as playback advances through timestamps—enabling developers to update UI elements or log active text. For example:
```javascript
// Runs whenever the set of active cues changes during playback.
track.addEventListener('cuechange', () => {
  console.log('Active cues:', track.activeCues);
});
```
Track modes can also be adjusted dynamically in response to user interactions, such as toggling visibility via buttons. Additionally, a <track> element fires an 'error' event when its file cannot be fetched or is not recognized as a supported format, which allows fallback logic.[63]
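A mode toggle of the kind described above might look like the following. This is a sketch under stated assumptions: the helper name toggleTrackVisibility and the button id in the usage comment are hypothetical.

```javascript
// Sketch: flip a text track between visible and hidden.
// 'hidden' keeps cues active, so cuechange still fires; 'disabled' would not.
function toggleTrackVisibility(track) {
  track.mode = track.mode === 'showing' ? 'hidden' : 'showing';
  return track.mode;
}

// Browser usage (not executed here); '#cc-button' is a hypothetical id:
//   document.querySelector('#cc-button').addEventListener('click', () => {
//     toggleTrackVisibility(video.textTracks[0]);
//   });
```

Switching to 'hidden' rather than 'disabled' is a design choice: the track stays loaded and its cues remain available to script while nothing is drawn on screen.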
Dynamic loading of WebVTT tracks occurs through the media element's addTextTrack(kind, label, language) method, which returns a new TextTrack for immediate cue population without relying on static <track> elements. WebVTT files linked via <track> are parsed incrementally as they download, with the browser's internal WebVTT parser constructing cues on the fly. For custom processing, a VTTCue exposes its raw text through the text property and a parsed DocumentFragment through getCueAsHTML(); the parser itself is not exposed to JavaScript, so full control over parsing requires reimplementing the specification's algorithmic steps in script. This approach is well suited to applications that need real-time caption generation, such as streaming services.[1]
Best practices for WebVTT in HTML5 include providing multiple <track> elements for different languages, marking the primary one with default, and ensuring that each srclang value is a valid BCP 47 language tag so that browsers can select the right track. Serve files with the text/vtt MIME type to prevent loading failures, and listen for the 'error' event on each <track> element, which signals that the file could not be loaded because of network or format problems; respond with graceful degradation such as disabling the track or alerting users. Validate that cue identifiers are unique and that each cue's end time follows its start time, and test rendering across devices to confirm synchronization.
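The cue validation step can be sketched as a simple check over cue-like objects. The function name validateCues and the message wording are assumptions for this example; the property names id, startTime, and endTime mirror those of VTTCue. The sketch targets file-based cues, so it treats non-finite times as invalid.

```javascript
// Sketch: collect problems in a list of cue-like objects.
// The { id, startTime, endTime } shape mirrors VTTCue property names.
function validateCues(cues) {
  const problems = [];
  const seenIds = new Set();
  cues.forEach((cue, index) => {
    const { id, startTime, endTime } = cue;
    if (!Number.isFinite(startTime) || !Number.isFinite(endTime) || startTime < 0) {
      problems.push(`cue ${index}: invalid timestamp`);
    } else if (endTime <= startTime) {
      problems.push(`cue ${index}: end time must be after start time`);
    }
    // Identifiers are optional, so only non-empty ones are checked.
    if (id) {
      if (seenIds.has(id)) problems.push(`cue ${index}: duplicate id "${id}"`);
      seenIds.add(id);
    }
  });
  return problems; // an empty array means no problems were found
}
```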
Examples
Basic Cue Example
A basic WebVTT file for standard subtitles consists of the "WEBVTT" signature followed by cue blocks, each defining a timing range and corresponding text to display.[1]
The following is a complete, simple example:
```
WEBVTT

00:00:01.000 --> 00:00:04.000
Hello, world!

00:00:05.000 --> 00:00:10.000
This is a subtitle.
```
This file starts with the required "WEBVTT" signature on the first line, followed by a blank line, and then includes two cue blocks. Each cue specifies start and end timestamps in the format HH:MM:SS.mmm --> HH:MM:SS.mmm, separated by an arrow, with plain text on subsequent lines until the next blank line or end of file. The cues display sequentially as the video reaches their respective time intervals.[68]
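To make the timestamp format concrete, the sketch below converts a timing line into seconds. It is an illustrative helper, not the specification's parsing algorithm: the function names parseVttTimestamp and parseTimingLine are assumptions, and malformed input yields NaN rather than following the specification's error handling.

```javascript
// Sketch: convert a WebVTT timestamp to seconds; returns NaN if malformed.
// Accepts "HH:MM:SS.mmm" and the shorter "MM:SS.mmm" form.
function parseVttTimestamp(stamp) {
  const match = /^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$/.exec(stamp.trim());
  if (!match) return NaN;
  const [, hours = '0', minutes, seconds, millis] = match;
  return Number(hours) * 3600 + Number(minutes) * 60 + Number(seconds) + Number(millis) / 1000;
}

// Split a cue timing line at the arrow and parse both timestamps.
// Anything after the end timestamp (cue settings) is ignored here.
function parseTimingLine(line) {
  const [start, rest = ''] = line.split('-->');
  const end = rest.trim().split(/\s+/)[0];
  return { start: parseVttTimestamp(start), end: parseVttTimestamp(end) };
}
```

For the first cue above, parseTimingLine('00:00:01.000 --> 00:00:04.000') yields a start of 1 and an end of 4.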
To implement this, save the content as a file with a .vtt extension and associate it with an HTML5 <video> element using the <track> tag, such as <track kind="subtitles" src="example.vtt" srclang="en">; this assumes default positioning and rendering by the user agent.
This example fully conforms to the WebVTT specification, using only core elements without advanced settings like regions or styling.[69]
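A file like the one above can also be generated from data. The following is a minimal sketch, assuming cues are given as { start, end, text } objects with times in seconds; the function names formatVttTimestamp and buildVtt are hypothetical, and the sketch does not escape cue text or reject text that contains blank lines or the "-->" sequence.

```javascript
// Sketch: format seconds as a WebVTT timestamp (HH:MM:SS.mmm).
function formatVttTimestamp(totalSeconds) {
  const totalMillis = Math.round(totalSeconds * 1000);
  const pad = (value, width) => String(value).padStart(width, '0');
  const hours = Math.floor(totalMillis / 3600000);
  const minutes = Math.floor((totalMillis % 3600000) / 60000);
  const seconds = Math.floor((totalMillis % 60000) / 1000);
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}.${pad(totalMillis % 1000, 3)}`;
}

// Build file contents: signature, blank line, then blank-line-separated cues.
function buildVtt(cues) {
  const blocks = cues.map(
    ({ start, end, text }) => `${formatVttTimestamp(start)} --> ${formatVttTimestamp(end)}\n${text}`
  );
  return ['WEBVTT', ...blocks].join('\n\n') + '\n';
}
```

Calling buildVtt with the two cues from the basic example reproduces that file, including the blank lines that separate the signature and the cue blocks.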
Advanced Features Example
The following example demonstrates an advanced WebVTT file that incorporates regions for positioning cues, inline tags for text formatting and voice identification, and embedded CSS styling for visual enhancement. This sample is compliant with the WebVTT specification and suitable for captions in HTML5 video elements.[1]
```
WEBVTT

NOTE
Sample demonstrating regions, inline tags, and styling.

REGION
id:bottom
width:100%
lines:3
regionanchor:0%,100%
viewportanchor:0%,90%
scroll:up

STYLE
::cue {
  color: yellow;
  background: rgba(0, 0, 0, 0.5);
}
::cue(b) {
  font-weight: bold;
}
::cue(v) {
  color: cyan;
}

00:00:01.000 --> 00:00:04.000 region:bottom align:center position:50%
<b>Hello</b> world! This is spoken <v Bob>in Bob's voice</v>.
```
In this file, the header begins with "WEBVTT" followed by a "NOTE" block that provides a non-displayed comment for file documentation.[68][70]
The "REGION" block defines a customizable display area named "bottom" that spans the full video width (100%) and is 3 lines tall. Its bottom-left corner (regionanchor:0%,100%) is pinned to the point at the left edge of the viewport and 90% of the way down (viewportanchor:0%,90%), and scroll:up makes earlier lines scroll upward as new ones are added. This allows cues to appear in a dedicated subtitle zone rather than the default position.[71]
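To show how these settings map onto values, the sketch below parses the setting lines of a REGION block into a plain object. The function name parseRegionSettings is an assumption, and the property names are chosen to follow the VTTRegion interface; the sketch does not validate values the way a conforming parser would.

```javascript
// Sketch: parse the setting lines of a REGION block into a plain object.
// Property names follow the VTTRegion interface (regionAnchorX, viewportAnchorY, ...).
function parseRegionSettings(lines) {
  const region = {};
  const percent = (value) => Number.parseFloat(value); // "90%" -> 90
  for (const line of lines) {
    const separator = line.indexOf(':');
    if (separator === -1) continue; // not a name:value setting
    const name = line.slice(0, separator).trim();
    const value = line.slice(separator + 1).trim();
    if (name === 'id' || name === 'scroll') region[name] = value;
    else if (name === 'width') region.width = percent(value);
    else if (name === 'lines') region.lines = Number.parseInt(value, 10);
    else if (name === 'regionanchor' || name === 'viewportanchor') {
      // Anchors are "x%,y%" pairs.
      const [x, y] = value.split(',').map(percent);
      const prefix = name === 'regionanchor' ? 'regionAnchor' : 'viewportAnchor';
      region[`${prefix}X`] = x;
      region[`${prefix}Y`] = y;
    }
  }
  return region;
}
```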
The cue timing "00:00:01.000 --> 00:00:04.000" marks the display window from 1 second to 4 seconds, with settings "region:bottom align:center position:50%" directing the cue to the defined region, centered horizontally at 50%. The cue text employs inline tags: "<b>Hello</b>" renders "Hello" in bold for emphasis, while "<v Bob>in Bob's voice</v>" identifies the speaker as Bob, which supported renderers can use to highlight or style that speaker's text.[72][73]
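The settings string and the voice tag from this cue can be picked apart with a few lines of script. This is a simplified sketch, not the specification's cue text parser: the function names parseCueSettings and extractVoices are assumptions, and the regular expression handles only straightforward voice tags.

```javascript
// Sketch: parse the settings that follow the timestamps on a cue timing line.
function parseCueSettings(settingsText) {
  const settings = {};
  for (const token of settingsText.trim().split(/\s+/)) {
    const separator = token.indexOf(':');
    // Ignore tokens that lack a name or a value.
    if (separator <= 0 || separator === token.length - 1) continue;
    settings[token.slice(0, separator)] = token.slice(separator + 1);
  }
  return settings;
}

// Sketch: list speaker names found in <v Name> voice tags within cue text.
function extractVoices(cueText) {
  return [...cueText.matchAll(/<v(?:\.[^\s>]+)?\s+([^>]+)>/g)].map((m) => m[1].trim());
}
```

For the cue above, the settings parse to region "bottom", align "center", and position "50%", and the only voice found is "Bob".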
The "STYLE" block applies CSS rules using the "::cue" pseudo-element to style all cue content with yellow text on a semi-transparent black background for readability; additionally, "::cue(b)" reinforces bold formatting, and "::cue(v)" colors voice tags cyan to distinguish speakers visually. This embedded styling overrides defaults without external files.[74]
When rendered in a browser, this cue appears in the bottom region with yellow bold text for "Hello," cyan for the voice span, and centered positioning, enhancing accessibility and engagement for captioned video. Features like regions and voice styling depend on browser support; regions in particular are not implemented in every engine, as noted under Browser Compatibility. Unsupported elements fall back to plain text rendering without errors.[75][76]
This structure validates as spec-compliant via the W3C WebVTT test suite, confirming interoperability for extensions like custom styling while adhering to core requirements.[77][1]