
Byte order mark

The byte order mark (BOM) is a Unicode character, U+FEFF ZERO WIDTH NO-BREAK SPACE, placed at the start of a text file or stream to signal the byte order (endianness) of multi-byte encodings such as UTF-16 and UTF-32, and optionally to serve as an encoding signature for UTF-8. Because the two bytes of U+FEFF differ, their order distinguishes big-endian from little-endian data: in UTF-16 the byte sequence FE FF indicates big-endian and FF FE indicates little-endian, and UTF-32 follows the same pattern (00 00 FE FF for big-endian and FF FE 00 00 for little-endian). In UTF-8, the BOM appears as the byte sequence EF BB BF and serves solely as a signature identifying the encoding; it is neither required nor recommended, because it can cause compatibility problems such as interference with ASCII-oriented processing or file concatenation.
U+FEFF has also served as a zero-width no-break space for formatting purposes, but its role as a BOM stems from the need to resolve byte order ambiguities in early Unicode implementations. The byte-swapped value U+FFFE is designated a noncharacter, so encountering it signals that the data is being read in the wrong byte order. The BOM is most useful for files and streams that carry no explicit byte order declaration, since it lets processors interpret the data correctly without prior knowledge of the originating system's endianness. Its presence can also complicate software handling: some parsers treat it as content rather than as a signature, leading to invisible characters or errors in protocols that do not expect it.
Recommendations from the Unicode Consortium advise against using a BOM in new protocols, favoring explicit encoding labels instead, while encouraging software to detect and strip it when present for robustness. In web contexts such as HTML and XML, the BOM is permitted but should be ignored by parsers so that it does not affect document structure.

Introduction

Definition

The byte order mark (BOM) is the Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE (previously known as BYTE ORDER MARK), which functions as a signature indicating the byte order of a text stream. When used in this capacity, U+FEFF is interpreted not as a visible character but as metatextual information that lets parsers determine the endianness of the encoded data. Unlike typical characters that contribute to content rendering, the BOM serves a purely structural role, preventing misinterpretation of byte sequences in multi-byte encodings.
The primary purpose of the BOM is to signal whether a text stream employs big-endian or little-endian byte ordering, especially in encodings like UTF-16 and UTF-32 where code units span multiple bytes. Endianness describes the sequential arrangement of bytes within a multi-byte value: big-endian places the most significant byte first, while little-endian places the least significant byte first. For instance, the BOM character U+FEFF appears as the byte sequence FE FF in big-endian UTF-16 but as FF FE in little-endian UTF-16, allowing processors to detect the order and decode accordingly. Although primarily associated with byte-order-sensitive encodings, the BOM may also appear in UTF-8 files as an optional indicator of the encoding scheme. In that case it conveys no endianness information, since UTF-8 is inherently byte-order agnostic.

History

The character U+FEFF was introduced in the initial release of the Unicode Standard (version 1.0) in October 1991, serving primarily as a byte order mark to indicate the byte order of Unicode text streams in fixed-width encodings like UCS-2. In the subsequent amendment (Unicode 1.0.1), published in June 1992, it was additionally designated as a zero-width no-break space, allowing it to function as an invisible formatting control that prevents line breaks without adding width. This dual role reflected early efforts to support both structural text processing and encoding identification in emerging international standards.
The byte order mark gained further prominence with the adoption of Unicode in international standards, notably its inclusion in the first edition of ISO/IEC 10646 (1993), which defined the Universal Character Set (UCS) and incorporated U+FEFF for endianness signaling in multi-byte encodings. In parallel, practical implementation accelerated in the 1990s; for instance, Windows Notepad was updated around 1993 to use UTF-16 little-endian with a leading BOM when saving Unicode text files, influencing widespread adoption in Windows-based text editing and localization workflows. Unicode version 2.0 (1996) expanded the character repertoire and formalized UTF-16 as the primary encoding, explicitly using U+FEFF to detect byte order in UTF-16 streams while retaining its no-break space utility.
By the early 2000s, the BOM's application extended to UTF-8, sparking debates within the Unicode Consortium and the IETF about its necessity, as UTF-8 has no endianness concerns. The IETF's RFC 2781 (2000) solidified BOM usage for UTF-16 serialization over networks, but discussions highlighted potential issues such as misinterpretation in ASCII-compatible contexts. The Unicode Consortium has long emphasized the BOM's optional nature across encodings, clarifying its role as a signature rather than a required element and discouraging routine inclusion to avoid compatibility problems.
Post-2020 clarifications further refined this guidance, particularly in Unicode 14.0 (2021) and 15.0 (2022): a BOM is neither required nor recommended for UTF-8, but remains acceptable in niche scenarios such as identifying otherwise unlabeled streams or distinguishing UTF-8 from legacy encodings, with interoperability in modern protocols and filesystems taking priority. This guidance was reaffirmed in subsequent versions, including Unicode 16.0 (2024) and 17.0 (2025), with no changes to the BOM recommendations. These updates built on influences from early text editors and web standards, leaving the BOM a flexible but non-essential tool for Unicode text processing.

Technical Details

Byte Sequences

The byte order mark (BOM) consists of byte sequences that encode the Unicode character U+FEFF (ZERO WIDTH NO-BREAK SPACE) at the start of a text stream or file, serving as a signature for certain encodings. These sequences vary by encoding form and, for multi-byte forms, by byte order (endianness). They are derived directly from the UTF-8, UTF-16, or UTF-32 transformation rules applied to the code point U+FEFF, which has the hexadecimal value FEFF and the binary representation 1111111011111111.
In UTF-8, a variable-length encoding, U+FEFF is represented as a three-byte sequence: EF BB BF. This results from the UTF-8 algorithm for code points in the range U+0800 to U+FFFF, which uses the form 1110xxxx 10yyyyyy 10zzzzzz. The bits of FEFF (1111 1110 1111 1111) fill the x, y, and z positions after the leading marker bits, yielding the binary 11101111 10111011 10111111, or EF BB BF in hexadecimal.
For UTF-16, which uses 16-bit code units, the BOM is a two-byte sequence matching the code unit for U+FEFF. In big-endian order (most significant byte first), it is FE FF; in little-endian order (least significant byte first), it is FF FE. These sequences simply serialize the 16-bit value FEFF according to the byte order; no surrogate pair is needed, since U+FEFF is in the Basic Multilingual Plane.
In UTF-32, a fixed 32-bit encoding, the BOM is a four-byte sequence with the 16-bit value FEFF zero-extended to 32 bits (00 00 FE FF in big-endian, FF FE 00 00 in little-endian). The zeros pad the upper 16 bits of the code unit, which is then serialized according to the byte order.
The following table summarizes the BOM byte sequences:
Encoding   Endianness      Byte sequence (hex)   Length
UTF-8      N/A             EF BB BF              3 bytes
UTF-16     Big-endian      FE FF                 2 bytes
UTF-16     Little-endian   FF FE                 2 bytes
UTF-32     Big-endian      00 00 FE FF           4 bytes
UTF-32     Little-endian   FF FE 00 00           4 bytes
These sequences appear as the initial bytes in the data stream.
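The derivation above can be checked with a few lines of Python. This is a minimal sketch rather than a reference implementation: the helper name utf8_three_byte is hypothetical, and it hand-applies only the three-byte UTF-8 pattern described in the text (it does not reject surrogate code points).

```python
import codecs

BOM_CHAR = "\ufeff"  # the character U+FEFF

# The explicit-endian codecs never add a BOM of their own, so encoding
# U+FEFF with them yields exactly the signature bytes for each scheme.
sequences = {
    "UTF-8": BOM_CHAR.encode("utf-8"),
    "UTF-16BE": BOM_CHAR.encode("utf-16-be"),
    "UTF-16LE": BOM_CHAR.encode("utf-16-le"),
    "UTF-32BE": BOM_CHAR.encode("utf-32-be"),
    "UTF-32LE": BOM_CHAR.encode("utf-32-le"),
}


def utf8_three_byte(code_point: int) -> bytes:
    """Hypothetical helper: apply the 1110xxxx 10yyyyyy 10zzzzzz pattern."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("pattern only covers U+0800..U+FFFF")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10yyyyyy: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10zzzzzz: low 6 bits
    ])


for name, raw in sequences.items():
    print(f"{name:9} {raw.hex(' ').upper()}")
```

The standard library's codecs module exposes the same values as constants (codecs.BOM_UTF8, codecs.BOM_UTF16_BE, and so on), which is usually preferable to hard-coding the bytes.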

Endianness Detection

The byte order mark (BOM) enables endianness detection by serving as an encoded indicator of the byte serialization order at the start of a text stream in multi-byte Unicode encodings like UTF-16 and UTF-32. When a parser encounters a BOM, it examines the initial bytes to identify whether the encoding uses big-endian (most significant byte first) or little-endian (least significant byte first) order. For UTF-16, this involves checking the first two bytes against the known encoding of the Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE in each byte order; a match with the big-endian form signals UTF-16BE, while the little-endian form indicates UTF-16LE. This process ensures that subsequent code units are decoded in the correct byte order, preventing misinterpretation of character data across different hardware architectures.
In UTF-32, endianness detection follows a parallel mechanism but requires reading the first four bytes, as each code unit spans four bytes. The parser compares these bytes to the big-endian and little-endian representations of U+FEFF, confirming UTF-32BE or UTF-32LE accordingly, and verifies that the remaining data aligns on four-byte boundaries to maintain encoding integrity. This four-byte check reflects the fixed-width nature of UTF-32 and allows reliable order determination regardless of the system's native endianness. The Unicode Standard specifies that the BOM must appear at byte position zero for proper detection.
If no BOM is present, parsers cannot rely on an explicit order indicator, leading to potential ambiguity. The Unicode Standard recommends defaulting to big-endian order for interchange to promote interoperability, though many implementations fall back to the system's native byte order for local files to optimize performance. This fallback strategy balances standardization with practical efficiency but underscores the importance of including a BOM in cross-platform data exchange.
A typical step-by-step algorithm for BOM-based endianness detection in parsers is as follows:
  • Read the initial two bytes for presumed UTF-16 or four bytes for presumed UTF-32.
  • Compare the read bytes to the big-endian and little-endian encodings of U+FEFF.
  • If the bytes match the big-endian form, set the decoding mode to big-endian and consume the BOM.
  • If they match the little-endian form, set the decoding mode to little-endian and consume the BOM.
  • If no match occurs, apply the default endianness (big-endian per Unicode recommendation for interchange) and proceed without consuming a BOM.
This algorithm prioritizes explicit BOM signals while providing a robust fallback. Unlike broader encoding signatures that primarily identify the format (e.g., distinguishing UTF-8 from other schemes), the BOM's role in endianness detection is narrowly focused on resolving byte order ambiguity within fixed-width multi-byte encodings, ensuring accurate reconstruction of Unicode code points.
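The steps above can be sketched in Python as follows. The function names detect_byte_order and decode_with_detection are hypothetical, and the sketch assumes the caller already knows whether the data is UTF-16 (two-byte units) or UTF-32 (four-byte units), as the algorithm does.

```python
from typing import Tuple

U_FEFF = 0xFEFF  # code point used as the byte order mark


def detect_byte_order(data: bytes, unit_size: int,
                      default: str = "big") -> Tuple[str, int]:
    """Return (byte_order, bom_length) for presumed UTF-16 or UTF-32 data.

    unit_size is 2 for UTF-16 and 4 for UTF-32. When no BOM is found,
    the default order is returned and bom_length is 0.
    """
    if unit_size not in (2, 4):
        raise ValueError("unit_size must be 2 (UTF-16) or 4 (UTF-32)")
    head = data[:unit_size]
    if head == U_FEFF.to_bytes(unit_size, "big"):
        return "big", unit_size       # consume the big-endian BOM
    if head == U_FEFF.to_bytes(unit_size, "little"):
        return "little", unit_size    # consume the little-endian BOM
    return default, 0                 # no BOM: fall back, consume nothing


def decode_with_detection(data: bytes, unit_size: int) -> str:
    """Decode using the detected order, skipping the BOM if one was found."""
    order, skip = detect_byte_order(data, unit_size)
    codec = f"utf-{unit_size * 8}-{'be' if order == 'big' else 'le'}"
    return data[skip:].decode(codec)


print(decode_with_detection(b"\xff\xfeh\x00i\x00", 2))  # BOM says little-endian
print(decode_with_detection(b"\x00h\x00i", 2))          # no BOM, big-endian default
```

Returning the BOM length alongside the order keeps detection separate from consumption, so a caller that wants to preserve the signature can simply ignore the second value.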

Usage in Encodings

UTF-8

UTF-8 is a byte-oriented encoding scheme that operates independently of the system's endianness, eliminating the need for a byte order mark to resolve byte ordering issues. Instead, the BOM in UTF-8 serves primarily as an optional signature to signal that the subsequent data is encoded in UTF-8, facilitating format detection in environments where encoding ambiguity exists. The specific byte sequence for the UTF-8 BOM is EF BB BF, corresponding to the code point U+FEFF. This marker is explicitly optional under the Unicode Standard and is not mandatory for valid UTF-8 text.
The UTF-8 standard, as outlined in RFC 3629, acknowledges the BOM's utility in distinguishing UTF-8 from legacy single-byte encodings like ISO-8859-1, particularly in files without an explicit encoding declaration. Since Unicode 5.0, the specification has permitted its use while initially discouraging it because some systems may interpret it as a visible character; more recent versions, such as Unicode 16.0, adopt a neutral stance without recommending for or against it.
Adoption of the UTF-8 BOM remains prevalent in certain ecosystems, notably Windows applications. For instance, Notepad included the BOM by default when saving files in UTF-8 encoding until Windows 10 version 1903 (May 2019 Update), after which the application began saving new UTF-8 files without the BOM by default while retaining the option to include it. In web contexts, the BOM aids HTTP clients in charset detection for HTML or other text resources lacking an explicit Content-Type header with a charset parameter, improving reliability in mixed-encoding environments. This usage underscores the BOM's role in promoting interoperability, while its optional nature leaves implementers free to omit it.
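As one illustration of the signature's optional nature, Python's standard codecs treat the UTF-8 BOM differently depending on the codec name. The sample string below is an arbitrary assumption; the codec names "utf-8" and "utf-8-sig" are part of the standard library.

```python
text = "héllo"  # arbitrary sample text

with_bom = text.encode("utf-8-sig")  # encoder prepends EF BB BF
plain = text.encode("utf-8")         # encoder adds no signature

# "utf-8-sig" accepts input with or without the signature and strips it.
from_signed = with_bom.decode("utf-8-sig")
from_plain = plain.decode("utf-8-sig")

# Plain "utf-8" keeps the signature as a leading U+FEFF character,
# which is how an invisible character can leak into content.
leaked = with_bom.decode("utf-8")

print(with_bom[:3].hex(" "))          # the three signature bytes
print(from_signed == from_plain == text)
print(repr(leaked[0]))
```

A reader that must accept both signed and unsigned UTF-8 can therefore decode with "utf-8-sig" unconditionally, while a writer chooses between the two codecs depending on whether consumers expect the signature.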

UTF-16

UTF-16 is a variable-width encoding scheme for Unicode characters built from 16-bit code units: each code unit consists of two bytes, and characters outside the Basic Multilingual Plane are represented using surrogate pairs. The byte order mark (BOM), the character U+FEFF, plays a critical role in UTF-16 by indicating the byte serialization order, big-endian (UTF-16BE) or little-endian (UTF-16LE), to ensure correct interpretation across systems with differing native endianness. In big-endian order, the BOM appears as the byte sequence FE FF, while in little-endian order it is FF FE.
RFC 2781 defines "UTF-16" as an encoding that uses either big-endian or little-endian byte order, as indicated by the initial BOM, which makes the BOM the means of unambiguous deserialization when the endianness is not externally labeled. Without a BOM, the standard recommends assuming big-endian order, though in practice many implementations default to little-endian due to platform conventions. The Unicode Standard endorses the use of the BOM in UTF-16 streams and files to promote interoperability, particularly where byte order cannot be inferred from context or metadata. For the fixed-endian variants UTF-16BE and UTF-16LE, the byte order is already given by the label, so a BOM should not be prepended; an initial U+FEFF in such data is treated as a ZERO WIDTH NO-BREAK SPACE character rather than as a signature.
UTF-16 with a BOM is the prevalent format in Windows applications and system text files, where little-endian is native, including for source code saved in some development environments. In HTML documents, the BOM facilitates detection when the encoding is declared as UTF-16, ensuring proper rendering across browsers. The necessity of the BOM is evident in handling surrogate pairs: an emoji like U+1F600 (GRINNING FACE), which is encoded as a high surrogate (D83D) and a low surrogate (DE00) in UTF-16, requires the correct byte order for the 16-bit units to pair up, and decoding little-endian data as big-endian yields unrelated characters instead.
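The surrogate-pair example can be reproduced with Python's standard codecs. This is an illustrative sketch only; the variable names are arbitrary, and it relies on the documented behavior that the generic "utf-16" codec consumes a leading BOM while the "-be" and "-le" variants assume a fixed order.

```python
import codecs

face = "\U0001F600"  # GRINNING FACE, outside the Basic Multilingual Plane

be = face.encode("utf-16-be")  # code units D83D DE00, high byte first
le = face.encode("utf-16-le")  # same code units, low byte first

# Decoding little-endian bytes as big-endian does not fail here: the swapped
# units 3DD8 and 00DE happen to be valid BMP characters, so the error is silent.
misread = le.decode("utf-16-be")

# With a BOM in front, the generic codec picks the right order by itself.
labeled = codecs.BOM_UTF16_LE + le
recovered = labeled.decode("utf-16")

print(be.hex(" "), "|", le.hex(" "))
print([hex(ord(c)) for c in misread])
print(recovered == face)
```

The silent misread is the practical argument for the BOM: without it, nothing in the byte stream itself reveals that the wrong order was chosen.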

UTF-32

UTF-32 is a fixed-width encoding form that represents each code point using exactly 32 bits, making it particularly reliant on the byte order mark (BOM) to indicate the byte order of the data. The BOM in UTF-32 consists of the four-byte sequence 00 00 FE FF for big-endian (UTF-32BE) or FF FE 00 00 for little-endian (UTF-32LE), allowing systems to correctly interpret the byte order without prior knowledge of the platform's architecture. Despite its straightforward structure, UTF-32 is less commonly used than UTF-16, primarily because of its larger storage requirements: it allocates four bytes per code point regardless of the code point's value. However, it finds application in specific contexts such as certain XML processing pipelines where consistent fixed-width access is needed, and in internal program representations that prioritize simplicity over efficiency.
The use of the BOM in UTF-32 aligns with guidelines in the Unicode Standard and ISO/IEC 10646, which are kept in harmony to promote interoperability by ensuring that the encoding's byte order is explicitly signaled for portability across diverse systems. This mirrors the approach for UTF-16, while UTF-32's uniform four-byte units eliminate the need for the surrogate pairs required in the 16-bit format. One key advantage of UTF-32 is simpler parsing, as the fixed width allows direct indexing to any code point without variable-length calculations, while the BOM safeguards against misinterpretation when data is exchanged in heterogeneous environments. For instance, in data streams emphasizing memory alignment, such as exports from certain databases, the BOM ensures reliable endianness detection and so helps maintain data integrity.

Processing and Issues

Detection Methods

Detection of the byte order mark (BOM) typically involves scanning the initial bytes of a text stream or file to match against predefined BOM patterns for various encodings. The process begins by reading the first 2 to 4 bytes, depending on the potential encoding: for UTF-8, check for the sequence EF BB BF; for UTF-16 little-endian, FF FE; for UTF-16 big-endian, FE FF; and for UTF-32, the corresponding 4-byte sequences 00 00 FE FF or FF FE 00 00. If a match is found, the BOM is identified and usually stripped from the stream to prevent it from being interpreted as content, ensuring the remaining text is processed correctly without leading invisible characters. This byte-matching approach is efficient and reliable for files that carry a BOM, with one caveat: the UTF-32 little-endian signature FF FE 00 00 begins with the UTF-16 little-endian signature FF FE, so the longer patterns must be tested first.
Programming languages and libraries provide built-in or extensible mechanisms for BOM detection and handling. In Python, the 'utf-8-sig' codec automatically detects and skips the BOM (EF BB BF) during decoding if it is present at the start, and prepends it during encoding for output files. In Java, the standard InputStreamReader does not automatically strip BOMs, so developers need custom wrappers such as UnicodeBOMInputStream, which reads the initial bytes, identifies the BOM type, and skips it before passing the stream to a reader. In .NET, the encoding returned by Encoding.UTF8 emits the UTF-8 BOM as its preamble by default and recognizes it during decoding; constructing a UTF8Encoding with the encoderShouldEmitUTF8Identifier parameter set to false produces BOM-free output.
In files with mixed content or potential embedded U+FEFF characters, heuristics rely on positional context to distinguish a true BOM from a zero-width no-break space (ZWNBSP). If U+FEFF appears at the very beginning of the stream, it is treated as a BOM for encoding and byte order signaling; occurrences later in the file are interpreted as ZWNBSP, a formatting character, to avoid misidentification. This position-based rule aligns with Unicode guidelines, preventing false positives in content where ZWNBSP might legitimately appear for typographic purposes.
Best practices for text parsers emphasize robustness by always inspecting for a BOM at the stream's start, irrespective of any declared encoding, to handle unmarked or ambiguously labeled files. Parsers should support both BOM-prefixed and BOM-absent streams seamlessly, stripping the BOM when detected to normalize input while preserving compatibility with protocols that expect it, such as certain text formats.
Post-2020 developments in libraries have enhanced BOM auto-detection, particularly for web applications. In Node.js 18 and later, the TextDecoder API strips the BOM by default when decoding UTF-8 streams, because its ignoreBOM option defaults to false; setting ignoreBOM to true keeps the BOM in the decoded output. Libraries such as iconv-lite likewise strip the BOM automatically during decoding, aiding cross-platform file processing for modern web apps.
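A byte-matching detector along these lines can be sketched in a few lines of Python. The names sniff_bom and decode_stripping_bom are hypothetical, and the UTF-8 fallback for unsigned input is an assumption, not a rule from any standard. Because the UTF-32LE signature FF FE 00 00 begins with the UTF-16LE signature FF FE, the sketch tests the longer signatures first; it cannot tell UTF-32LE apart from UTF-16LE text that starts with a NUL character, which is rare in practice.

```python
import codecs
from typing import Optional, Tuple

# Ordered longest-first so that UTF-32LE (FF FE 00 00) wins over UTF-16LE (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (codec_name, bom_length), or (None, 0) when no BOM is present."""
    for signature, codec_name in _SIGNATURES:
        if data.startswith(signature):
            return codec_name, len(signature)
    return None, 0


def decode_stripping_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode data, dropping a leading BOM; fallback is an assumed default."""
    codec_name, bom_length = sniff_bom(data)
    return data[bom_length:].decode(codec_name or fallback)


print(sniff_bom(b"\xff\xfe\x00\x00A\x00\x00\x00"))  # 4-byte match, not UTF-16LE
print(sniff_bom(b"\xff\xfeA\x00"))
print(decode_stripping_bom(b"\xef\xbb\xbfhi"))
```

Only the first bytes are inspected, so any U+FEFF appearing later in the data is left in place, consistent with the positional rule described above.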

Common Problems and Solutions

The presence of a byte order mark (BOM) in UTF-8 files can lead to unexpected "invisible" characters or errors when the files are processed by Unix command-line tools, which typically expect files without a leading BOM and may interpret the EF BB BF bytes as literal content, resulting in garbled output or failed matches. A related issue arises from double-encoding errors, where the BOM is misinterpreted as text in a legacy single-byte encoding and re-encoded, producing sequences like ï»¿ that disrupt further processing.
Cross-platform compatibility challenges frequently occur because some Windows-based editors and applications add a BOM when saving files (though Notepad has defaulted to UTF-8 without a BOM since Windows 10 version 1903 in 2019), while Linux and macOS environments generally avoid it. This can lead to display glitches in editors like Vim, where the BOM may appear as a stray leading character or cause misalignment if not explicitly handled. For instance, scripts with shebangs (e.g., #!/bin/) saved with a BOM may fail to execute on Linux, because the BOM bytes precede the #! and the shebang line is no longer recognized, resulting in "executable not found" errors.
In web contexts, a BOM at the start of files can break CSS rules by causing the initial declarations to be ignored in some user agents, or it may be rendered as visible content, such as extra blank lines or characters at the top of pages, particularly in PHP-generated pages where the BOM precedes the output. Mitigation strategies include server-side stripping, such as configuring Apache with mod_filter to remove the BOM from responses, or using PHP functions like preg_replace('/^\xEF\xBB\xBF/', '', $content) before echoing content.
The Unicode Consortium does not recommend using a BOM with UTF-8 except in systems that require it or where an explicit signature is needed, and files without a BOM offer the best portability across tools and platforms. Validation can be performed using Python libraries like chardet, which detects the BOM and identifies the encoding as UTF-8, allowing the BOM to be stripped during decoding.
In AI and text-processing workflows that have emerged since 2020, a BOM can interfere with tokenization by introducing extraneous bytes that fragment tokens or alter subword merges in algorithms like byte-pair encoding, potentially degrading model performance on affected datasets; the usual solution is to strip the BOM in a preprocessing step before tokenization.
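A preprocessing step of this kind can be sketched in Python as follows. The function names strip_utf8_bom and strip_bom_text are hypothetical, and the sample script content is an assumption chosen only to show the effect on a shebang line.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) from raw bytes, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_text(text: str) -> str:
    """Remove a leading U+FEFF from already-decoded text, if present."""
    return text[1:] if text.startswith("\ufeff") else text


# Hypothetical script saved by an editor that prepends a BOM.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho ok\n"

print(script.startswith(b"#!"))                  # shebang hidden behind the BOM
print(strip_utf8_bom(script).startswith(b"#!"))  # visible again after stripping
print(repr(strip_bom_text("\ufefftoken")))
```

Both helpers touch only the first position, so a U+FEFF occurring later in the text is preserved, and applying them to input without a BOM returns it unchanged.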
