UTF-16
UTF-16 is a variable-width character encoding form for the Unicode standard that represents each Unicode code point using one or two 16-bit code units, allowing it to encode the entire repertoire of Unicode characters, including those beyond the Basic Multilingual Plane.[1] Developed as an extension of the earlier UCS-2 encoding, UTF-16 incorporates surrogate pairs—specific 16-bit sequences in the ranges U+D800–U+DBFF (high surrogates) and U+DC00–U+DFFF (low surrogates)—to represent the 1,048,576 code points from U+10000 to U+10FFFF that cannot fit into a single 16-bit unit.[2] This encoding scheme was formally specified in the Unicode Standard version 2.0 and is detailed in RFC 2781, which also addresses serialization of UTF-16 into byte streams using big-endian (UTF-16BE) or little-endian (UTF-16LE) byte orders, often preceded by an optional byte order mark (BOM) at U+FEFF to indicate the endianness.
UTF-16 is one of three primary encoding forms defined by the Unicode Standard alongside UTF-8 and UTF-32, sharing the same repertoire of 159,801 characters and symbols as of Unicode 17.0.[3] Unlike fixed-width encodings such as UTF-32, UTF-16's variable-length design optimizes storage and processing for the majority of characters (those in the BMP range U+0000 to U+FFFF, which account for roughly 95% of text in common use), using two bytes per character and doubling to four bytes only for the rarer supplementary characters.[1] It also interoperates with ISO/IEC 10646, the international standard for the Universal Character Set (UCS), ensuring consistent representation across systems.
Historically, UTF-16 evolved from UCS-2, the initial 16-bit fixed-width encoding in early Unicode versions, which was limited to 65,536 code points; the introduction of surrogates in UTF-16 addressed the need for expansion without breaking legacy UCS-2 implementations that ignored surrogates.[2] Today, UTF-16 remains widely used for internal text processing in major platforms, including Microsoft Windows (where it serves as the native string format since Windows 2000) and Java (since J2SE 5.0), due to its balance of efficiency in memory usage and random access capabilities compared to UTF-8.[4] However, for file storage and network transmission, UTF-8 is often preferred for its ASCII compatibility and variable-byte efficiency with Latin scripts.[1]
History
Origins in UCS-2
UCS-2, or Universal Character Set coded in 2 octets, emerged as a fixed-width 16-bit encoding scheme specifically for the Basic Multilingual Plane (BMP), which encompasses code points from U+0000 to U+FFFF, allowing representation of 65,536 characters. This encoding was formally defined in the initial edition of ISO/IEC 10646-1:1993, the international standard for the Universal Multiple-Octet Coded Character Set (UCS), where it served as the two-octet form limited to the BMP.[5] At the time, UCS-2 was seen as sufficient for encoding the major scripts and symbols in widespread use, aligning closely with the early Unicode standard's scope.[6]
However, as the Unicode Consortium's ambitions grew to encompass a broader repertoire of the world's writing systems, the limitations of UCS-2 became evident: it could not natively handle code points beyond U+FFFF, capping the total addressable characters at 65,536 despite the need to support up to 1,114,112 code points across 17 planes in the full Unicode space. This restriction threatened the standard's goal of universal coverage, prompting calls for a variable-width extension that maintained backward compatibility with existing UCS-2 implementations while enabling encoding of supplementary characters.[7] In the early 1990s, following the Consortium's founding in 1991, initial proposals focused on expanding beyond the 16-bit boundary without disrupting the BMP, laying the groundwork for mechanisms to access the full code space.[6]
By 1996, these efforts culminated in discussions within the Unicode Technical Committee (UTC), which approved the introduction of surrogate pairs in Unicode 2.0, released in July of that year, effectively transforming UCS-2 into the more capable UTF-16 encoding.[6] This evolution addressed UCS-2's shortcomings by reserving specific 16-bit ranges for surrogates, allowing pairs of code units to represent higher code points while ensuring single-unit encoding for BMP characters. The UTC's deliberations emphasized compatibility and efficiency, marking a pivotal shift toward a scalable Unicode architecture.
Standardization and Evolution
UTF-16 was officially introduced as part of the Unicode Standard in version 2.0, released in July 1996, to address the limitations of the earlier fixed-width UCS-2 encoding by incorporating a variable-length mechanism using surrogate pairs for characters beyond the Basic Multilingual Plane.[8] This update expanded Unicode's capacity to over a million code points while maintaining backward compatibility with existing 16-bit implementations. Concurrently, UTF-16 was incorporated into the International Standard ISO/IEC 10646 through Amendment 1 to the 1993 edition, published in October 1996, which defined it as a transformation format for the Universal Character Set.[9]
The evolution of UTF-16 continued with further clarifications in subsequent Unicode versions. In Unicode 3.1, released in March 2001, the standard provided explicit guidance on surrogate pair handling and formally deprecated UCS-2 as an encoding form, emphasizing UTF-16 as the preferred 16-bit encoding to ensure full support for the growing repertoire of characters. Additionally, in February 2000, RFC 2781 from the Internet Engineering Task Force standardized the serialization of UTF-16 as an octet stream, formally designating variants such as UTF-16BE (big-endian) and UTF-16LE (little-endian) for network transmission and interchange.[10]
As of Unicode 17.0, released in September 2025, UTF-16 remains a stable and core encoding form without major structural changes, continuing to support the full range of 159,801 assigned characters while aligning with updates to ISO/IEC 10646.[3] This stability reflects UTF-16's entrenched role in systems like Windows and Java, where it serves as the primary internal representation for Unicode text.
Technical Overview
Basic Encoding Principles
UTF-16 is one of the standard Unicode Transformation Formats (UTFs), designed to encode the full repertoire of Unicode code points, which range from U+0000 to U+10FFFF, using sequences of 16-bit code units.[11] Unlike fixed-width encodings such as UTF-32, where every code point is represented by a single 32-bit unit, UTF-16 employs a variable-width scheme: the majority of code points are encoded with a single 16-bit unit, while a subset requires two units to accommodate the extended range.[12] This approach balances efficiency for common text with support for the complete Unicode space, which encompasses over 1.1 million possible code points.[11]
In UTF-16, a code unit is defined as an unsigned 16-bit integer with values from 0x0000 to 0xFFFF, serving as the basic building block for representing characters in memory or transmission.[12] These code units are typically stored as two consecutive bytes, but their interpretation depends on the system's endianness, which is addressed through byte order marks or explicit specification.[12] The encoding ensures that valid Unicode text can be processed without ambiguity, provided the rules for unit sequences are followed.
Unicode organizes its code points into 17 planes, each containing 65,536 positions, for a total capacity of 1,114,112 code points (though not all are assigned).[13] The Basic Multilingual Plane (BMP), or Plane 0 (U+0000 to U+FFFF), includes most frequently used scripts and symbols, allowing direct single-unit encoding in UTF-16.[11] Supplementary planes (Planes 1 through 16, U+10000 to U+10FFFF) cover additional characters such as historic scripts and emojis, necessitating multi-unit representations like surrogate pairs in UTF-16.[13]
Surrogate Pair Mechanism
In UTF-16, the surrogate pair mechanism enables the encoding of the 1,048,576 code points in the supplementary planes (U+10000 to U+10FFFF) using two consecutive 16-bit code units, extending the encoding beyond the 65,536 code points of the Basic Multilingual Plane (BMP), which are encoded with a single code unit.[14] This variable-width approach maintains backward compatibility with UCS-2 while supporting the full Unicode repertoire.[10]
A surrogate pair comprises a high surrogate code unit in the range U+D800 to U+DBFF (decimal 55,296 to 56,319), followed immediately by a low surrogate code unit in the range U+DC00 to U+DFFF (decimal 56,320 to 57,343).[14] These ranges provide 1,024 possible values for high surrogates and 1,024 for low surrogates, yielding 1,048,576 unique combinations, sufficient for all supplementary code points. To encode a supplementary code point U (where U+10000 \leq U \leq U+10FFFF), subtract 0x10000 from U to obtain a 20-bit value S; the high surrogate is then H = 0xD800 + \lfloor S / 0x400 \rfloor, and the low surrogate is L = 0xDC00 + (S \bmod 0x400).[10] Conversely, a valid surrogate pair (H, L) decodes to U = 0x10000 + ((H - 0xD800) \times 0x400) + (L - 0xDC00).[10]
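For illustration, the following minimal Java sketch applies these formulas directly; Java is used here because its char type is a 16-bit UTF-16 code unit, and the standard library's Character.toChars and Character.toCodePoint perform the same mapping. The class and method names are illustrative, not part of any standard API.

```java
public class SurrogateMath {
    // Encode a supplementary code point (0x10000..0x10FFFF) as a high/low surrogate pair.
    static char[] encode(int codePoint) {
        int s = codePoint - 0x10000;                  // 20-bit offset S
        char high = (char) (0xD800 + (s >>> 10));     // upper 10 bits of S
        char low  = (char) (0xDC00 + (s & 0x3FF));    // lower 10 bits of S
        return new char[] { high, low };
    }

    // Decode a valid surrogate pair back to the supplementary code point.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600);                                   // U+1F600
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);  // D83D DE00
        System.out.printf("U+%X%n", decode(pair[0], pair[1]));           // U+1F600
    }
}
```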
Valid UTF-16 sequences require surrogates to always form complete, properly ordered pairs; an unpaired high surrogate, unpaired low surrogate, or a pair where the low surrogate precedes the high one is ill-formed and must be rejected by conforming processors.[14] This strict pairing ensures unambiguous decoding and prevents interpretation errors. Historically, the code points in the surrogate ranges were explicitly reserved and unassigned in UCS-2—the original 16-bit fixed-width encoding of Unicode—to facilitate this future-proof extension to UTF-16 without invalidating existing UCS-2 implementations.[15]
Code Unit Details
BMP Code Points
In UTF-16, code points within the Basic Multilingual Plane (BMP), which spans U+0000 to U+FFFF, are encoded using a single 16-bit code unit where the value of the code unit directly matches the Unicode code point value.[13][16] This direct mapping applies specifically to non-surrogate code points in the ranges U+0000 to U+D7FF and U+E000 to U+FFFF, ensuring efficient representation without additional processing for the majority of commonly used characters.[13][16]
The BMP provides capacity for 65,536 code points, of which 63,488 (excluding the surrogate range) are available for assignment to characters, encompassing most modern scripts such as Latin, Greek, Cyrillic, Arabic, Devanagari, and Chinese ideographs, among others.[13] This allocation prioritizes the characters needed for global text processing, with the surrogate range U+D800 to U+DFFF explicitly excluded from direct assignment to prevent conflicts in the encoding scheme.[13][16]
Representative examples of BMP coverage include the ASCII subset from U+0000 to U+007F, which encodes basic Latin characters and control codes in a single code unit identical to their legacy 7-bit values, and the CJK Unified Ideographs range from U+4E00 to U+9FFF, supporting thousands of East Asian characters essential for languages like Mandarin and Japanese.[13]
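Because a char in Java is a single UTF-16 code unit, the direct BMP mapping can be observed with a brief sketch using the characters mentioned above (an illustrative example only):

```java
public class BmpDemo {
    public static void main(String[] args) {
        // For BMP characters, the single code unit equals the code point value.
        System.out.printf("U+%04X%n", (int) 'A');   // U+0041, Basic Latin (ASCII subset)
        System.out.printf("U+%04X%n", (int) '中');  // U+4E2D, CJK Unified Ideographs
    }
}
```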
Supplementary Code Points
In UTF-16, supplementary code points range from U+10000 to U+10FFFF and are located in Planes 1 through 16 of the Unicode code space, encompassing a total of 1,048,576 code points. These code points, which extend beyond the Basic Multilingual Plane (BMP) covered by single 16-bit code units, require a pair of 16-bit code units known as surrogates for representation in UTF-16.
A surrogate pair consists of a high surrogate code unit (in the range 0xD800 to 0xDBFF) followed immediately by a low surrogate code unit (in the range 0xDC00 to 0xDFFF). To decode a surrogate pair into the corresponding supplementary code point, the following formula is applied:
\text{code point} = (\text{high surrogate} - 0xD800) \times 0x400 + (\text{low surrogate} - 0xDC00) + 0x10000
This calculation maps the 20 bits of information from the two 10-bit surrogate values (after offset adjustment) to the 21-bit supplementary code point space, starting at U+10000.
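As a quick check of this formula, the following Java sketch decodes the pair 0xD834, 0xDD1E to U+1D11E (MUSICAL SYMBOL G CLEF); it is illustrative only, and Java's built-in Character.toCodePoint applies the same computation.

```java
public class DecodeCheck {
    public static void main(String[] args) {
        // (0xD834 - 0xD800) * 0x400 + (0xDD1E - 0xDC00) + 0x10000 = 0x1D11E
        int cp = Character.toCodePoint((char) 0xD834, (char) 0xDD1E);
        System.out.printf("U+%X%n", cp);   // U+1D11E
    }
}
```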
Examples of supplementary code points include emojis such as U+1F600 (GRINNING FACE, 😀) in Plane 1 and characters from ancient scripts like U+10480 (OSMANYA LETTER ALEF), also in Plane 1, which encode diverse modern and historical writing systems.
For UTF-16 text to be well-formed, surrogate pairs must be valid: a high surrogate must always be paired with a following low surrogate in the specified ranges, and unpaired surrogates are not permitted, as they represent ill-formed sequences that implementations must reject or replace during processing.
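A conforming processor's validity check can be sketched in Java as follows, using the standard Character.isHighSurrogate and Character.isLowSurrogate predicates; the helper name is illustrative.

```java
public class Utf16Validator {
    // Returns true if every high surrogate is followed by a low surrogate
    // and no low surrogate appears without a preceding high surrogate.
    static boolean isWellFormed(char[] units) {
        for (int i = 0; i < units.length; i++) {
            if (Character.isHighSurrogate(units[i])) {
                if (i + 1 >= units.length || !Character.isLowSurrogate(units[i + 1])) {
                    return false;           // unpaired high surrogate
                }
                i++;                        // skip the low half of a valid pair
            } else if (Character.isLowSurrogate(units[i])) {
                return false;               // lone low surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(new char[] { 0xD83D, 0xDE00 })); // true  (U+1F600)
        System.out.println(isWellFormed(new char[] { 0xD83D }));         // false (unpaired)
    }
}
```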
Surrogate Code Units
In UTF-16, the surrogate code units occupy the reserved range from U+D800 to U+DFFF, encompassing 2048 consecutive code points that are explicitly unassigned to any characters in the Unicode standard.[14] This range is evenly divided into two halves: the high surrogates from U+D800 to U+DBFF (1024 values) and the low surrogates from U+DC00 to U+DFFF (another 1024 values).[14] These code units serve solely as components in surrogate pairs to encode supplementary characters beyond the Basic Multilingual Plane (BMP), enabling UTF-16 to extend Unicode support without disrupting legacy UCS-2 implementations that only process BMP code points up to U+FFFF.[14]
The design of surrogates ensures backward compatibility by drawing them from a range of BMP code points that is permanently reserved and never assigned to characters, so UCS-2 decoders—which assume a fixed 16-bit code unit for every character—simply pass these values through as unassigned code points without any change to the handling of other BMP characters. However, surrogates must never be treated as standalone characters in valid UTF-16 sequences; interpreting an isolated surrogate as a single code point renders the text ill-formed and can lead to data corruption or security vulnerabilities, such as injection attacks in converters that fail to reject lone surrogates.[17] For instance, mishandling isolated surrogates in cross-encoding conversions may allow malicious sequences to bypass filters, similar to overlong UTF-8 exploits.[17]
To mitigate such risks and promote full Unicode compliance, modern standards and implementations deprecate UCS-2 in favor of proper UTF-16 processing that recognizes and validates surrogate pairs. The Unicode Standard explicitly advises against UCS-2 usage, emphasizing surrogate-aware APIs to enforce correct handling of extended code points.[14]
Examples and Illustrations
Simple Character Encoding
In UTF-16, characters from the Basic Multilingual Plane (BMP), which encompasses code points U+0000 through U+FFFF, are encoded using a single 16-bit code unit directly corresponding to the character's scalar value.[18] This straightforward mapping ensures that the vast majority of commonly used characters, including those in Latin scripts and many ideographic systems, require only two bytes of storage per character.
For instance, the Latin capital letter 'A' (U+0041) is represented as the single code unit 0x0041.[18] Similarly, the CJK Unified Ideograph '中' (U+4E2D), meaning "middle" or "China," is encoded as the code unit 0x4E2D. These encodings preserve the original 16-bit value without modification, facilitating efficient processing in systems optimized for 16-bit operations.
A practical example is the English string "Hello," which consists entirely of BMP characters and thus requires five 16-bit code units: [0x0048, 0x0065, 0x006C, 0x006C, 0x006F].[18]
To illustrate the bit-level structure, consider the code unit for 'A' (0x0041). In 16-bit binary, it is represented as 0000 0000 0100 0001, where the high byte is 0x00 and the low byte is 0x41. This binary form highlights UTF-16's alignment with 16-bit architectures, though byte order considerations apply during serialization.
Surrogate Pair Encoding
Surrogate pairs in UTF-16 enable the representation of the 1,048,576 supplementary code points beyond the Basic Multilingual Plane (BMP), spanning U+10000 to U+10FFFF.[14] Each such character is encoded using two consecutive 16-bit code units: a high surrogate from the range U+D800 to U+DBFF and a low surrogate from U+DC00 to U+DFFF. The encoding process begins by subtracting 0x10000 from the supplementary code point to obtain a 20-bit offset, which is then split into the upper 10 bits for the high surrogate and the lower 10 bits for the low surrogate.[14]
Consider the example of the grinning face emoji at U+1F600. First, compute the offset: 0x1F600 - 0x10000 = 0xF600. The high surrogate is derived by right-shifting this offset by 10 bits (0xF600 >> 10 = 0x3D) and adding 0xD800, yielding 0xD83D. The low surrogate is obtained by masking the lower 10 bits (0xF600 & 0x3FF = 0x200) and adding 0xDC00, resulting in 0xDE00. Thus, U+1F600 is encoded as the surrogate pair [0xD83D, 0xDE00].[14]
In a mixed string combining BMP and supplementary characters, such as "A😀" (where "A" is U+0041), the UTF-16 sequence is [0x0041, 0xD83D, 0xDE00], with the BMP character using a single code unit followed by the two-unit surrogate pair.[14]
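A brief Java sketch (illustrative only) makes this code-unit sequence visible by printing each unit of the string:

```java
public class MixedStringDemo {
    public static void main(String[] args) {
        String s = "A\uD83D\uDE00";             // "A" followed by U+1F600 as a surrogate pair
        for (char unit : s.toCharArray()) {
            System.out.printf("0x%04X ", (int) unit);   // 0x0041 0xD83D 0xDE00
        }
        System.out.println();
    }
}
```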
Decoding a surrogate pair involves verification and reconstruction. Upon encountering a code unit in the high surrogate range (0xD800–0xDBFF), the next code unit must be in the low surrogate range (0xDC00–0xDFFF) to form a valid pair; otherwise, it is an error. The original code point is then reconstructed using the formula:
\text{U} = 0x10000 + ((\text{H} - 0xD800) \times 0x400) + (\text{L} - 0xDC00)
where H is the high surrogate and L is the low surrogate. For the pair [0xD83D, 0xDE00], this yields 0x10000 + ((0xD83D - 0xD800) × 0x400) + (0xDE00 - 0xDC00) = 0x10000 + (0x3D × 0x400) + 0x200 = 0x1F600.[14]
The structure of a surrogate pair can be visualized textually as follows, showing how the 20-bit supplementary value is divided:
Supplementary code point U (U+10000 to U+10FFFF)
Offset S = U - 0x10000: 20 bits = 10 high bits | 10 low bits
High Surrogate (U+D800–U+DBFF): 0xD800 + high 10 bits
Low Surrogate (U+DC00–U+DFFF): 0xDC00 + low 10 bits
This pairing ensures unambiguous encoding within the 16-bit framework.[14]
Byte Order and Serialization
Endianness Variants
UTF-16 defines two primary variants for serializing its 16-bit code units into byte sequences based on the system's or protocol's endianness: UTF-16BE (big-endian) and UTF-16LE (little-endian). These variants ensure consistent representation when transmitting or storing UTF-16 data across different architectures. The specification in RFC 2781 outlines these encodings to address octet stream serialization for interoperability.[16]
In UTF-16BE, bytes are ordered with the most significant byte (high byte) first, aligning with traditional network byte order conventions, which favor big-endian for multi-byte values. For instance, the code point U+0041 (Latin capital letter A) is represented as the 16-bit value 0x0041 and serialized as the byte sequence 0x00 followed by 0x41. This order facilitates direct compatibility with big-endian systems and protocols like TCP/IP.
Conversely, UTF-16LE serializes bytes with the least significant byte (low byte) first, common on little-endian architectures such as x86 processors. The same code point U+0041 becomes the byte sequence 0x41 followed by 0x00. This variant is prevalent in Windows environments for native processing efficiency.[19]
Network protocols and file formats often specify one variant to avoid ambiguity, with RFC 2781 recommending big-endian as the default in the absence of indicators. Assuming the wrong endianness when decoding UTF-16 data can result in swapped bytes, leading to mojibake—garbled or nonsensical text where characters are misinterpreted. The byte order mark (BOM) provides a mechanism for automatic detection of the correct variant.[16]
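The two serializations can be compared with a short Java sketch using the standard UTF-16BE and UTF-16LE charsets; the class name is illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EndiannessDemo {
    public static void main(String[] args) {
        // U+0041 ('A') serialized under each byte order.
        byte[] be = "A".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "A".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(Arrays.toString(be));   // [0, 65]  -> 0x00 0x41
        System.out.println(Arrays.toString(le));   // [65, 0]  -> 0x41 0x00
    }
}
```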
Byte Order Mark Usage
The Byte Order Mark (BOM) in UTF-16 is the Unicode character U+FEFF placed at the start of a text stream or file to indicate the byte serialization order.[16] When encoded in big-endian format, the BOM consists of the byte sequence 0xFE 0xFF; in little-endian format, it is 0xFF 0xFE.[13]
Detection of endianness relies on examining these initial bytes: a sequence of FE FF signals big-endian UTF-16 (UTF-16BE), while FF FE indicates little-endian UTF-16 (UTF-16LE).[16] This mechanism allows parsers to unambiguously interpret the subsequent code units without prior knowledge of the system's endianness.[13]
The Unicode Standard describes the BOM as optional for UTF-16 but recommends its inclusion in files and streams to facilitate reliable parsing across diverse platforms.[13] Without a BOM, applications must rely on external conventions or defaults, which can lead to errors in multi-endian environments.[16]
A key issue arises if the BOM is not properly detected: it may be treated as the zero-width non-breaking space (ZWNBSP) character, resulting in unintended spacing or formatting artifacts in the output.[16] This problem is especially common in legacy systems or protocols that do not explicitly handle BOMs. The BOM is widely used in Windows text files, where UTF-16LE with a leading FF FE is the default for many applications, enhancing compatibility within the ecosystem.[20]
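BOM sniffing of this kind can be sketched in Java as follows; the method name is illustrative, and the fallback follows RFC 2781's big-endian default for unlabeled streams.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    static Charset detectUtf16(byte[] head) {
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;   // FE FF: big-endian
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;   // FF FE: little-endian
        }
        return StandardCharsets.UTF_16BE;       // no BOM: assume big-endian per RFC 2781
    }

    public static void main(String[] args) {
        byte[] head = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 };
        System.out.println(detectUtf16(head));  // UTF-16LE
    }
}
```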
Properties and Efficiency
Storage Requirements
UTF-16 encodes each character from the Basic Multilingual Plane (BMP), which covers the most commonly used code points including Latin, Cyrillic, and many Asian scripts, using a single 16-bit code unit equivalent to 2 bytes.[1] For characters outside the BMP in the supplementary planes, such as certain rare symbols or historic scripts, UTF-16 employs surrogate pairs consisting of two 16-bit code units, requiring 4 bytes total.[21] This variable-length approach ensures full coverage of the Unicode repertoire while optimizing for frequent characters.
In typical text corpora, where the vast majority of characters reside in the BMP, UTF-16 achieves an average storage efficiency of approximately 2 bytes per character, as supplementary characters represent a small fraction of usage in most languages and documents. Compared to UTF-8, UTF-16 requires more space for ASCII-range text (2 bytes per character versus 1 byte), potentially increasing file sizes for English-heavy content, though its consistent 2-byte units for BMP characters provide aligned memory access advantages in certain implementations.
When serialized to files or streams, UTF-16 often includes a 2-byte Byte Order Mark (BOM) at the start to specify the endianness (big-endian or little-endian), introducing a fixed overhead that does not encode any textual content.[20] This BOM, typically U+FEFF, helps decoders interpret the byte order correctly but adds to the overall storage footprint for short texts.[16]
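These per-character and BOM overheads can be observed in Java, whose generic UTF-16 charset prepends a big-endian BOM when encoding; the sketch below is illustrative.

```java
import java.nio.charset.StandardCharsets;

public class StorageDemo {
    public static void main(String[] args) {
        String s = "Hi";
        System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 4: two bytes per BMP character
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 6: 2-byte BOM plus content
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 2: ASCII text is smaller in UTF-8
    }
}
```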
UTF-16's use of fixed 16-bit code units provides a significant advantage in runtime efficiency for operations requiring random access to string elements, such as indexing into a string to retrieve a specific character. In environments like Java, where strings are internally represented as arrays of 16-bit characters in UTF-16 encoding, methods like charAt(index) achieve constant-time O(1) performance by directly accessing the corresponding code unit without needing to parse variable-length sequences. This alignment with 16-bit memory boundaries also facilitates faster loading and manipulation in legacy systems optimized for UCS-2, the predecessor to UTF-16, enhancing overall processing speed for applications handling primarily Basic Multilingual Plane (BMP) characters.[22]
However, the variable-width nature of UTF-16 introduces drawbacks during sequential iteration or full string traversal, as supplementary characters beyond the BMP must be represented by surrogate pairs consisting of two 16-bit code units. Processing these requires explicit checks to detect high surrogates (in the range D800–DBFF) and pair them with the subsequent low surrogates (DC00–DFFF), adding conditional logic that can degrade performance when supplementary characters are encountered. Because supplementary characters remain a very small fraction of real-world text, this overhead is negligible for BMP-heavy content, but it becomes noticeable in emoji-rich modern text or in scripts confined to the supplementary planes.[22]
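In Java, for example, such surrogate-aware traversal typically takes the following form (an illustrative sketch):

```java
public class CodePointIteration {
    public static void main(String[] args) {
        String s = "A\uD83D\uDE00";                 // "A" followed by U+1F600
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);              // pairs a high surrogate with the following low one
            System.out.printf("U+%04X%n", cp);      // U+0041, then U+1F600
            i += Character.charCount(cp);           // 1 code unit for BMP, 2 for supplementary
        }
    }
}
```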
Benchmarks comparing UTF-16 to other encodings indicate it outperforms UTF-8 in random access and simple BMP processing tasks within 16-bit-oriented legacy environments due to reduced decoding steps for European-language texts. In contrast, UTF-8 is favored in contemporary variable-length processing pipelines for its compact representation of ASCII subsets and lower bandwidth needs, showing better throughput in I/O-bound modern applications despite more complex byte-level parsing. These results stem from tests on basic operations like length calculation and substring extraction across diverse corpora.[23]
The handling of surrogates also influences caching strategies and normalization processes, where improper decoding can lead to fragmented caches or erroneous canonical forms. For instance, normalization algorithms must first resolve surrogate pairs to code points before applying decomposition or composition rules, potentially doubling the effective length for affected segments and increasing memory usage in cached normalized strings; optimized implementations mitigate this by pre-decoding surrogates during ingestion.
Applications and Usage
In Programming Environments
In programming languages and environments, UTF-16 is natively adopted for string representations to efficiently handle Unicode characters, particularly those in the Basic Multilingual Plane (BMP). In Java, strings are internally stored as sequences of 16-bit UTF-16 code units, where supplementary characters beyond the BMP are represented using surrogate pairs.[24] This design allows direct access to characters via the char type, which corresponds to a single UTF-16 code unit, while methods like length() return the number of code units rather than scalar values.[25]
JavaScript similarly treats strings as sequences of UTF-16 code units, as specified in the ECMAScript standard, enabling straightforward indexing and manipulation but requiring careful handling of surrogates for characters outside the BMP, such as many emojis.[26] For instance, the String.prototype.length property counts UTF-16 code units, so a single supplementary character like the emoji "👍" (U+1F44D) occupies two units, leading to a length of 2.
In .NET and C#, strings are implemented as immutable arrays of char values, each representing a UTF-16 code unit, with surrogate pairs used to encode supplementary characters.[27] The framework provides APIs like String.Normalize() to handle Unicode normalization forms, but developers must account for surrogates when iterating or slicing strings to avoid splitting pairs.[28]
Cross-platform support for UTF-16 is facilitated by libraries such as the International Components for Unicode (ICU), which uses UTF-16 as its primary internal format for strings via the UnicodeString class.[29] ICU offers robust handling of Unicode normalization, including conversion to Normalization Form C (NFC, canonical composition) and Normalization Form D (NFD, canonical decomposition), ensuring consistent text processing across environments.[30]
A common pitfall in UTF-16-based environments is the distinction between code units and code points when measuring string length or accessing characters, particularly with emojis that often require surrogate pairs. For example, in JavaScript, attempting to count "characters" via length or charAt() can miscount emojis like "👨👩👧👦" (a family emoji using multiple code points), resulting in inflated lengths and incorrect slicing.[26] Similarly, in Java, String.length() reports code units, potentially leading to errors in user interface rendering or data validation unless surrogate-aware methods like codePointCount() are used.[25] These issues underscore the need for grapheme cluster-aware libraries to accurately process modern text.
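The gap between the two counts is easy to demonstrate in Java (an illustrative sketch):

```java
public class LengthDemo {
    public static void main(String[] args) {
        String thumbsUp = "\uD83D\uDC4D";                                    // U+1F44D THUMBS UP SIGN
        System.out.println(thumbsUp.length());                              // 2: UTF-16 code units
        System.out.println(thumbsUp.codePointCount(0, thumbsUp.length()));  // 1: Unicode code point
    }
}
```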
In Microsoft Windows, UTF-16 little-endian (UTF-16LE) serves as the default encoding for plain text files saved via Notepad when selecting the "Unicode" option, which includes a byte order mark (BOM) for endianness detection. This encoding ensures compatibility with Unicode characters in .txt files, though legacy ANSI remains an alternative for basic ASCII content.[1]
For XML documents on Windows, UTF-16LE with a BOM is commonly used to signal the encoding automatically, as the XML 1.0 specification requires processors to support UTF-16 and detect it via the BOM when no explicit declaration is present. Without the BOM, an explicit encoding='utf-16' attribute in the XML declaration is necessary, but the BOM simplifies parsing in Windows environments. In the NTFS file system, filenames are stored using UTF-16 encoding to support international characters, with a maximum length of 255 UTF-16 code units per name.
In network protocols, HTTP allows UTF-16 as a charset in Content-Type and Accept-Charset headers, enabling servers to transmit responses encoded in UTF-16, though UTF-8 is more prevalent for efficiency. For JSON in JavaScript environments, strings are internally represented as UTF-16, and while the JSON specification recommends UTF-8 for interchange, JavaScript's JSON.stringify and JSON.parse operations handle UTF-16 natively during processing.
HTML5 documents support UTF-16 encoding, detected via the BOM or HTTP headers, but the <meta charset> element cannot declare it directly due to ASCII-compatibility requirements; instead, reliance on external signaling ensures proper rendering. In PDF files, UTF-16 big-endian (UTF-16BE) is optionally used for Unicode text strings and font embedding since PDF 1.2, facilitating international content without altering the core binary structure. Microsoft Office applications, such as Word and Excel, optionally employ UTF-16 for document content and CSV exports, particularly when handling Unicode data to maintain compatibility with Windows internals.
Legacy issues arise from transitions in XML 1.0, where UTF-16 without a BOM can lead to misdetection if the encoding is not explicitly declared, as processors must support UTF-16 but may default to UTF-8 otherwise. Migration from UCS-2 to UTF-16 poses challenges in older systems, as UCS-2 treats the two halves of a surrogate pair as two separate, unassigned characters, potentially corrupting emoji and other supplementary-plane characters during conversion unless surrogates are handled explicitly. This legacy lingers in Windows file paths, which still permit unpaired surrogates even though such sequences are ill-formed in strict UTF-16.