Byte order mark
The byte order mark (BOM) is a Unicode character, U+FEFF ZERO WIDTH NO-BREAK SPACE, placed at the start of a text file or data stream to signal the byte order (endianness) of multi-byte encodings such as UTF-16 and UTF-32, and optionally to serve as an encoding signature for UTF-8.[1][2] The character's code point distinguishes big-endian from little-endian representations: in UTF-16 the byte sequence FE FF indicates big-endian and FF FE indicates little-endian, while in UTF-32 the corresponding sequences are 00 00 FE FF (big-endian) and FF FE 00 00 (little-endian).[3][4] In UTF-8, the BOM appears as the byte sequence EF BB BF and serves solely as a signature identifying the encoding; it is neither required nor recommended because of potential compatibility issues such as interference with ASCII processing or file concatenation.[5][6]

U+FEFF was originally defined as a zero-width no-break space for formatting purposes, but it was adopted as a BOM to resolve byte order ambiguities in early Unicode implementations, with the byte-swapped value U+FFFE designated as a noncharacter so that incorrect interpretations can be detected.[2] The BOM is essential for UTF-16 and UTF-32 streams that lack an explicit byte order declaration, enabling processors to interpret the data correctly without prior knowledge of the producing system's endianness.[3] Its presence can nonetheless complicate software handling: some parsers treat it as content rather than a signature, producing invisible characters or parsing errors in protocols that assume plain text.[5]

The Unicode Consortium advises against using a BOM in new UTF-8 protocols, favoring explicit encoding labels instead, while encouraging software to detect and strip the BOM when present for robustness.[5] In web contexts such as HTML and XML, the BOM is permitted but should be ignored by parsers so that it does not affect document structure.[6]

Introduction
Definition
The byte order mark (BOM) is the Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE (also known by the alias BYTE ORDER MARK), which functions as a signature indicating the byte order of a text stream.[1] When used in this capacity, U+FEFF is interpreted not as a visible character but as metadata that tells parsers the endianness of the encoded data.[2] Unlike typical Unicode characters, which contribute to rendered content, the BOM serves a purely structural role, preventing misinterpretation of byte sequences in multi-byte encodings.[7]

The primary purpose of the BOM is to signal whether a text file employs big-endian or little-endian byte ordering, especially in encodings such as UTF-16 and UTF-32 where characters span multiple bytes.[8] Endianness describes the sequential arrangement of bytes within a multi-byte value: big-endian places the most significant byte first, while little-endian places the least significant byte first.[8] For instance, U+FEFF is serialized as the byte sequence FE FF in big-endian UTF-16 but as FF FE in little-endian UTF-16, allowing processors to detect and adjust to the correct interpretation.[8] Although primarily associated with byte-order-sensitive encodings, the BOM may also appear in UTF-8 files as an optional indicator of the encoding scheme; it conveys no endianness information there, since UTF-8 is inherently byte-order agnostic.[7]

History
The character U+FEFF was introduced in the initial release of the Unicode Standard (version 1.0) in October 1991, serving primarily as a byte order mark to indicate the endianness of Unicode text streams in fixed-width encodings such as UCS-2.[9] In the subsequent amendment (Unicode 1.0.1), published in June 1993, it was additionally designated as a zero-width no-break space, allowing it to act as an invisible formatting control that prevents line breaks without adding width.[10] This dual role reflected early efforts to support both structural text processing and encoding identification in emerging international standards.

The byte order mark gained further prominence with the adoption of Unicode in international standards, notably its inclusion in the first edition of ISO/IEC 10646 (1993), which defined the Universal Character Set (UCS) and incorporated U+FEFF for endianness signaling in multi-byte encodings. In parallel, practical implementation accelerated in the 1990s; for instance, Microsoft updated Windows Notepad around 1993 to use UTF-16 little-endian internally and to save Unicode text files with a leading BOM, influencing widespread adoption in Windows-based text editing and localization workflows.[11] Unicode version 2.0 (1996) expanded the character repertoire and formalized UTF-16 as the primary encoding, explicitly repurposing U+FEFF to detect byte order in UTF-16 streams while retaining its no-break space utility.[12]

By the early 2000s, the BOM's application had extended to UTF-8, sparking debates within the Unicode Consortium and the IETF about its necessity, since UTF-8 has no endianness concerns; the IETF's RFC 2781 (2000) solidified BOM usage for UTF-16 serialization over networks, but discussions highlighted potential issues such as misinterpretation in ASCII-compatible contexts.[13] The Unicode Consortium has long emphasized the BOM's optional nature across encodings, clarifying its role as a signature rather than a required element and discouraging routine inclusion to avoid compatibility problems.

Post-2020 clarifications further refined this guidance, particularly in Unicode 14.0 (2021) and 15.0 (2022), where the Consortium explicitly discouraged UTF-8 BOMs except in niche scenarios, such as stream identification in HTML or distinguishing UTF-8 from legacy encodings, prioritizing interoperability in modern protocols and filesystems.[5] This guidance was reaffirmed in subsequent versions, including Unicode 16.0 (2024) and 17.0 (2025), with no changes to the BOM recommendations.[14] These updates built on influences from early text editors and web standards, ensuring the BOM remains a flexible but non-essential tool for Unicode processing.

Technical Details
Byte Sequences
The byte order mark (BOM) consists of the byte sequences that encode the Unicode character U+FEFF (ZERO WIDTH NO-BREAK SPACE) at the start of a text stream or file, serving as a signature for certain Unicode encodings. These sequences vary by encoding form and, for multi-byte forms, by the byte serialization order (endianness). They are derived directly from the UTF-8, UTF-16, or UTF-32 transformation rules applied to the code point U+FEFF, whose hexadecimal value is FEFF and whose binary representation is 1111 1110 1111 1111.

In UTF-8, a variable-length encoding, U+FEFF is represented as the three-byte sequence EF BB BF. This results from the UTF-8 rule for code points in the range U+0800 to U+FFFF, which uses the form 1110xxxx 10yyyyyy 10zzzzzz: the sixteen bits of FEFF fill the x, y, and z positions after the leading bits, yielding the binary bytes 11101111 10111011 10111111, or EF BB BF in hexadecimal.[15]

For UTF-16, which uses 16-bit code units, the BOM is a two-byte sequence matching the code unit for U+FEFF: FE FF in big-endian order (most significant byte first) and FF FE in little-endian order (least significant byte first). These sequences simply serialize the 16-bit value FEFF according to the byte order; no surrogates are involved, since U+FEFF lies in the Basic Multilingual Plane.

In UTF-32, a fixed 32-bit encoding, the BOM is a four-byte sequence in which the value FEFF is zero-extended to 32 bits: 00 00 FE FF in big-endian order and FF FE 00 00 in little-endian order. The leading zeros pad the upper 16 bits of the code point, serialized per the endianness.[15]

The following table summarizes the BOM byte sequences:

| Encoding | Endianness | Byte Sequence (hex) | Length |
|---|---|---|---|
| UTF-8 | N/A | EF BB BF | 3 bytes |
| UTF-16 | Big-endian | FE FF | 2 bytes |
| UTF-16 | Little-endian | FF FE | 2 bytes |
| UTF-32 | Big-endian | 00 00 FE FF | 4 bytes |
| UTF-32 | Little-endian | FF FE 00 00 | 4 bytes |
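The sequences in the table, and the UTF-8 bit derivation above, can be checked directly. The following Python sketch encodes U+FEFF with the standard endianness-specific codecs (which emit no BOM of their own, so the output is the raw serialization of the code point) and then reproduces the UTF-8 three-byte form by hand:

```python
# Serialize U+FEFF with endianness-specific codecs; since these codecs
# do not prepend a BOM themselves, the output is the BOM byte sequence.
bom = "\ufeff"
assert bom.encode("utf-8") == b"\xef\xbb\xbf"
assert bom.encode("utf-16-be") == b"\xfe\xff"
assert bom.encode("utf-16-le") == b"\xff\xfe"
assert bom.encode("utf-32-be") == b"\x00\x00\xfe\xff"
assert bom.encode("utf-32-le") == b"\xff\xfe\x00\x00"

# Reproduce the UTF-8 derivation by hand: 1110xxxx 10yyyyyy 10zzzzzz.
cp = 0xFEFF
utf8 = bytes([
    0xE0 | (cp >> 12),          # 1110 prefix + top 4 bits    -> EF
    0x80 | ((cp >> 6) & 0x3F),  # 10 prefix + middle 6 bits   -> BB
    0x80 | (cp & 0x3F),         # 10 prefix + low 6 bits      -> BF
])
assert utf8 == b"\xef\xbb\xbf"
```

Python's plain `utf-8` codec writes no signature; the separate `utf-8-sig` codec is the one that prepends (and on decoding strips) EF BB BF.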
Endianness Detection
The byte order mark enables endianness detection by serving as an encoded indicator of the byte serialization order at the start of a text stream in multi-byte Unicode encodings such as UTF-16 and UTF-32. When a parser encounters a BOM, it examines the initial bytes to determine whether the encoding uses big-endian order (most significant byte first) or little-endian order (least significant byte first). For UTF-16, this means checking the first two bytes against the known encodings of U+FEFF ZERO WIDTH NO-BREAK SPACE in each byte order: a match with the big-endian form signals UTF-16BE, while the little-endian form indicates UTF-16LE. This ensures that subsequent code units are decoded in the correct byte order, preventing misinterpretation of character data across different hardware architectures.

In UTF-32, endianness detection follows a parallel mechanism but requires reading the first four bytes, as each code unit spans four bytes. The parser compares these bytes to the big-endian or little-endian representation of U+FEFF, confirming UTF-32BE or UTF-32LE accordingly, and verifies that the remaining data aligns on four-byte boundaries to maintain encoding integrity. This four-byte check accounts for the fixed-width nature of UTF-32, allowing reliable order determination even on systems with varying native endianness.

The Unicode Standard specifies that the BOM must appear at byte position zero for proper detection. If no BOM is present, parsers cannot rely on an explicit order indicator, leading to potential ambiguity. The Unicode Standard recommends defaulting to big-endian order for interchange and protocol use to promote interoperability, though many implementations fall back to the system's native endianness for local files to optimize performance.
This fallback strategy balances standardization with practical efficiency but underscores the importance of including a BOM in cross-platform data exchange.

A typical step-by-step algorithm for BOM-based endianness detection in parsers is as follows:

- Read the initial two bytes for presumed UTF-16, or four bytes for presumed UTF-32.
- Compare the read bytes to the big-endian and little-endian encodings of U+FEFF.
- If the bytes match the big-endian form, set the decoding mode to big-endian and consume the BOM.
- If they match the little-endian form, set the decoding mode to little-endian and consume the BOM.
- If no match occurs, apply the default endianness (big-endian per Unicode recommendation for interchange) and proceed without consuming a BOM.
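The steps above can be sketched in Python as follows. The function name `detect_bom` and the fallback return value are illustrative, not part of any standard API; the sketch also adds the UTF-8 signature check from the earlier table, and tests the four-byte UTF-32 patterns first because FF FE is a prefix of FF FE 00 00:

```python
def detect_bom(data: bytes) -> tuple[str, int]:
    """Sniff a leading BOM; return (encoding, number of BOM bytes consumed).

    UTF-32 patterns are tested before UTF-16 because the little-endian
    UTF-16 BOM (FF FE) is a prefix of the little-endian UTF-32 BOM
    (FF FE 00 00).
    """
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be", 4
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le", 4
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8", 3
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be", 2
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le", 2
    # No BOM: fall back to big-endian per the Unicode interchange
    # recommendation, consuming nothing. A real parser might instead
    # use the system's native order or an out-of-band encoding label.
    return "utf-16-be", 0


detect_bom(b"\xff\xfe\x00\x00abcd")  # -> ("utf-32-le", 4)
```

Note that the FF FE 00 00 prefix is inherently ambiguous: it could also be a little-endian UTF-16 BOM followed by the character U+0000, which is why production decoders may apply additional heuristics or rely on declared encodings.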