UTF-32
UTF-32 is a fixed-width Unicode encoding form that represents each Unicode code point as a single 32-bit code unit, allowing direct mapping from code points (ranging from U+0000 to U+10FFFF) to their numeric values without the use of surrogates or variable-length sequences.[1] Introduced as part of the Unicode Standard in version 3.1, UTF-32 provides a simple and straightforward alternative to variable-width encodings like UTF-8 and UTF-16, serializing text as a sequence of four bytes per character in either big-endian or little-endian byte order.[2][3] To resolve byte order ambiguities, UTF-32 streams often begin with a byte order mark (BOM) using the Unicode character U+FEFF.[1] This encoding's fixed width facilitates efficient random access and indexing in memory, making it ideal for internal processing in some programming languages, though it consumes significantly more storage space (four bytes per character) compared to UTF-8's roughly one byte per character for Latin-script text.[1][4] Despite its inefficiency for storage and transmission, UTF-32 remains valuable in scenarios prioritizing simplicity over compactness, such as certain file formats or environments requiring uniform character sizing, and it fully supports the Unicode codespace encompassing over 1.1 million code points.[1]
Introduction
Definition and Basics
UTF-32 is a fixed-width Unicode encoding form that represents each Unicode code point using exactly 32 bits, or four bytes, where the code point's numerical value is directly encoded as a single 32-bit integer.[1] This direct mapping ensures that every character occupies a uniform amount of space, distinguishing UTF-32 from variable-width encodings such as UTF-8 and UTF-16, which use differing numbers of bytes per code point and require parsing to determine boundaries.[5] The fixed size facilitates straightforward random access and indexing into strings without the need for sequential decoding.[6] The encoding supports the full range of Unicode code points from U+0000 to U+10FFFF, encompassing the 1,114,112 valid scalar values defined by the Unicode Standard.[1] To align with Unicode's 21-bit code space, the 11 most significant bits of the 32-bit value are always set to zero for valid code points, preventing representation of the reserved or invalid values beyond U+10FFFF.[4] For example, the ASCII character 'A' (U+0041) is encoded as the 32-bit value 0x00000041, while the non-ASCII Euro sign '€' (U+20AC) becomes 0x000020AC.[1]
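As a minimal illustration of this direct mapping, the Python sketch below (relying on the standard struct module and Python's built-in 'utf-32-be' codec, which are not part of the definition above) shows that each code unit is simply the code point value zero-padded to 32 bits:

```python
import struct

# Each UTF-32 code unit is just the code point, zero-padded to 32 bits.
for ch in ("A", "€"):
    cp = ord(ch)                           # Unicode code point as an integer
    unit = struct.pack(">I", cp)           # one 32-bit code unit, big-endian here
    assert ch.encode("utf-32-be") == unit  # Python's codec produces the same bytes
    print(f"U+{cp:04X} -> 0x{cp:08X} -> {unit.hex(' ')}")
    # U+0041 -> 0x00000041 -> 00 00 00 41
    # U+20AC -> 0x000020AC -> 00 00 20 ac
```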
In practice, UTF-32 is identical to the UCS-4 encoding form specified in ISO/IEC 10646 for the Unicode repertoire, as both use the same 32-bit direct mapping within the 21-bit limit.[7]
Comparison to Other Unicode Encodings
UTF-32 is a fixed-width encoding that uses exactly 32 bits (4 bytes) to represent each Unicode code point, in contrast to UTF-8, which employs a variable-width scheme ranging from 1 to 4 bytes per code point.[1] This fixed structure in UTF-32 simplifies operations like random access and indexing, as each character occupies a constant amount of space, whereas UTF-8 requires parsing variable-length sequences to locate boundaries, making it more complex for such tasks.[8] However, UTF-8 offers superior space efficiency for Latin-based scripts, such as ASCII, where it uses only 1 byte per character compared to UTF-32's 4 bytes, resulting in up to four times the storage overhead for UTF-32 in those cases.[1]

Compared to UTF-16, which uses 16-bit code units and can span 2 to 4 bytes per character, UTF-32 directly encodes the full 21-bit Unicode code point range within its 32 bits without the need for surrogate pairs.[1] UTF-16 relies on surrogate pairs (two consecutive 16-bit units) to represent code points beyond the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), which adds complexity to processing and can disrupt binary ordering.[9] In UTF-32, this issue is eliminated, as every code point is self-contained in a single unit, providing a more straightforward mapping to the Unicode scalar values.[8] Regarding space efficiency, UTF-32 consumes 4 bytes uniformly for all characters, making it twice as large as UTF-16 for BMP characters (where UTF-16 uses 2 bytes) and identical to UTF-16's 4 bytes for supplementary characters beyond the BMP.[1] For example, the ASCII character "A" (U+0041) requires 1 byte in UTF-8, 2 bytes in UTF-16, and 4 bytes in UTF-32, while a supplementary character like 𐀀 (U+10000) uses 4 bytes across all three encodings.[1] UTF-8 generally provides the best average compression for European languages but expands for East Asian scripts, whereas UTF-32's fixed size leads to consistent but higher memory usage across the board.[9]

In terms of performance trade-offs, UTF-32 supports constant-time (O(1)) indexing and random access by allowing direct offset calculations, avoiding the variable-time parsing required in UTF-8 (which may need up to three bytes of lookahead) and UTF-16 (which involves surrogate detection).[8] This makes UTF-32 particularly advantageous in internal processing scenarios where speed trumps storage, though its larger footprint can degrade overall system performance in memory-constrained environments.[8]
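The byte counts quoted above can be checked with a short Python sketch; the endian-specific codec names are used here only so that no byte order mark is counted:

```python
# Encoded size in bytes of a BMP character and a supplementary character.
for label, ch in (("U+0041 'A'", "A"), ("U+10000", "\U00010000")):
    sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    print(label, sizes)
# U+0041 'A' {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
# U+10000    {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}
```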
Technical Specifications
Encoding Mechanism
UTF-32 encodes each Unicode code point as a single 32-bit unsigned integer, providing a direct one-to-one mapping between the code point value and the code unit.[10] This fixed-width encoding form uses the full 32 bits to represent code points in the range from U+0000 to U+10FFFF, ensuring simplicity in processing and random access.[3] At the bit level, the code point value occupies the lowest 21 bits of the 32-bit word (bits 0 through 20), while the higher bits (21 through 31) are always set to zero, as the maximum code point U+10FFFF (1,114,111 in decimal) requires only 21 bits.[3] Mathematically, for any valid code point $c$ in the range $0 \leq c \leq \mathrm{0x10FFFF}$, the UTF-32 code unit $u$ is simply $u = c$, padded with leading zeros in the upper bits to form a 32-bit value.[10] Unlike variable-width encodings such as UTF-16, UTF-32 employs no surrogates or multi-unit representations; every code point is encoded in exactly one 32-bit unit, facilitating straightforward indexing and substring operations.[3] Code units representing values outside the valid Unicode codespace (specifically, those greater than U+10FFFF or in the surrogate range U+D800–U+DFFF) are considered ill-formed and invalid.[10] Implementations must treat such sequences as errors, typically replacing them with the Unicode replacement character (U+FFFD) or signaling a decoding failure, depending on the context.[3]

For example, the supplementary character 😀 (U+1F600, hexadecimal 0x0001F600) is encoded in UTF-32 as the 32-bit value 0x0001F600, which in binary is 00000000 00000001 11110110 00000000. Here, the code point bits occupy the low 21 positions, with the upper 11 bits zeroed. When serialized for transmission, the byte order determines the sequence of bytes (e.g., big-endian: 00 01 F6 00), as detailed in the Byte Order and Endianness section.[3]
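The mapping and the well-formedness checks described above can be sketched in a few lines of Python; the encode_utf32_be helper below is an illustrative name, not an API from the Unicode Standard:

```python
def encode_utf32_be(code_points):
    """Encode integer code points as big-endian UTF-32, rejecting ill-formed values."""
    out = bytearray()
    for c in code_points:
        if c > 0x10FFFF or 0xD800 <= c <= 0xDFFF:
            raise ValueError(f"ill-formed UTF-32 value: 0x{c:08X}")
        out += c.to_bytes(4, "big")   # the code unit is the code point itself
    return bytes(out)

print(encode_utf32_be([0x1F600]).hex(" "))   # 00 01 f6 00
```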
Byte Order and Endianness
UTF-32 encodes each Unicode code point as a single 32-bit code unit, but when serializing these units into a byte stream for storage or transmission, the order of the four bytes must be specified to ensure correct interpretation across different systems. This byte order is determined by the endianness convention: big-endian or little-endian. In big-endian order (UTF-32BE), the most significant byte is stored first, which aligns with the network byte order standard used in protocols like TCP/IP. Conversely, little-endian order (UTF-32LE) stores the least significant byte first, which is the native convention on many modern processors, such as those in the x86 architecture family.[11][12]

To resolve ambiguity in untagged data streams, UTF-32 supports an optional Byte Order Mark (BOM), represented by the Unicode character U+FEFF (zero-width no-break space). When used as a BOM at the beginning of a UTF-32 stream, it is encoded according to the endianness: in UTF-32BE, it appears as the byte sequence 00 00 FE FF; in UTF-32LE, as FF FE 00 00. Detection rules require examining the first four bytes of the stream: if they match 00 00 FE FF, the stream is interpreted as big-endian; if FF FE 00 00, as little-endian. If the first four bytes do not match either BOM sequence, no BOM is present, and the endianness must be determined externally, often defaulting to big-endian in standards-compliant contexts. If the decoded first character is U+FFFE, this strongly indicates a byte order mismatch, and the stream should be byte-swapped before interpretation. Such sequences are not inherently ill-formed unless they contain invalid code units (e.g., values greater than U+10FFFF). The BOM itself is not considered part of the text content and should be ignored after detection.[13][11][12]
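The detection rule can be expressed as a small Python helper; detect_utf32_byte_order is a hypothetical name, and the big-endian fallback reflects the default described above:

```python
def detect_utf32_byte_order(stream: bytes):
    """Return (codec name, offset past any BOM) for a UTF-32 byte stream."""
    if stream[:4] == b"\x00\x00\xFE\xFF":
        return "utf-32-be", 4          # big-endian BOM
    if stream[:4] == b"\xFF\xFE\x00\x00":
        return "utf-32-le", 4          # little-endian BOM
    return "utf-32-be", 0              # no BOM: assume big-endian by default

payload = b"\xFF\xFE\x00\x00" + "A".encode("utf-32-le")   # BOM + U+0041
codec, start = detect_utf32_byte_order(payload)
print(codec, payload[start:].decode(codec))               # utf-32-le A
```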
A mismatch in endianness interpretation can result in garbled text, as the bytes of each code unit are reordered incorrectly. For example, the code point U+0041 (LATIN CAPITAL LETTER A, decimal 65) is represented as the 32-bit value 0x00000041. In UTF-32BE, this serializes to the byte sequence 00 00 00 41; in UTF-32LE, to 41 00 00 00. If a UTF-32LE stream is mistakenly read as UTF-32BE, U+0041 would be interpreted as 0x41000000, an invalid code point outside the Unicode range, potentially leading to replacement characters or processing errors. This underscores the importance of correct endianness handling for interoperability in cross-platform environments.[11][3]
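A brief Python demonstration of this failure mode, using the same U+0041 example; note that Python's decoder rejects the resulting out-of-range value rather than silently accepting it:

```python
le_bytes = "A".encode("utf-32-le")                     # 41 00 00 00
print(int.from_bytes(le_bytes, "big") == 0x41000000)   # True: outside U+10FFFF
try:
    le_bytes.decode("utf-32-be")                       # wrong byte order assumed
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```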
History and Development
Origins in UCS-4
UCS-4, the four-octet form of the Universal Character Set, served as the foundational 32-bit fixed-width encoding for the Universal Multiple-Octet Coded Character Set defined in the inaugural edition of ISO/IEC 10646, published in 1993. This encoding scheme utilized four octets to represent each character, enabling a repertoire of up to $2^{31}$ code points, spanning from U+00000000 to U+7FFFFFFF, with the most significant bit constrained to zero to maintain compatibility across diverse systems and avoid sign-extension issues in signed integer representations.[14][15]

In its early form, UCS-4 offered a significantly broader addressable range than the contemporaneous Unicode standard, which from its 1.0 release in 1991 was designed around a 16-bit fixed-width encoding (UCS-2) limited to 65,536 code points (U+0000 to U+FFFF), highlighting an initial misalignment between ISO's expansive vision for global character representation and Unicode's more constrained initial implementation. This gap prompted evolutionary steps, notably with Unicode 2.0 in 1996, which retained 16-bit code units as the baseline while introducing surrogate pairs to access supplementary planes, laying groundwork for encoding forms that could address code points beyond the BMP.[16] UCS-4 found early practical application in the late 1990s amid growing internationalization needs; for example, the World Wide Web Consortium's XML 1.0 specification of 1998 recognized UCS-4 as a supported encoding for document character data alongside UTF-8 and UTF-16, enabling robust handling of multilingual content in structured data exchange before 2000.[17]
Standardization and Evolution
UTF-32 was formally integrated into the Unicode Standard as one of its three primary encoding forms, alongside UTF-8 and UTF-16, starting with version 3.1, released in March 2001.[2] This inclusion provided a fixed-width alternative for representing Unicode code points directly as 32-bit values, building on earlier proposals from 1999.[3] No substantive modifications to UTF-32's core specification occurred after version 3.1, reflecting the encoding's design stability within the evolving Unicode repertoire.[2]

In 2003, the revision of ISO/IEC 10646 limited the Universal Character Set repertoire to code points U+0000 through U+10FFFF, aligning it with Unicode's 21-bit constraint and excluding surrogates as encoding targets. Concurrently, RFC 3629 updated the specification for UTF-8 to match this range.[18][19] This restriction effectively rendered UTF-32 indistinguishable from a constrained form of UCS-4, the original 32-bit encoding defined in ISO/IEC 10646.[18] The update ensured interoperability across protocols while prohibiting encodings for undefined or reserved code points above U+10FFFF.[19]

UTF-32, along with its big-endian (UTF-32BE) and little-endian (UTF-32LE) variants, received official registration as MIME charset names from the Internet Assigned Numbers Authority (IANA) in 2002, following submissions in 2001.[20][21][22] These registrations specify limited use in text/MIME contexts, with byte-order defaults and BOM handling outlined in Unicode Technical Report #19. Since the 2003 alignment, UTF-32 has seen no major evolutionary changes due to the Unicode Consortium's stability policies, which prioritize backward compatibility for encoding forms.[23] Minor clarifications appeared in Unicode 15.0 (2022), particularly around definition D90 (the UTF-32 encoding form) in the core specification, refining code unit sequence rules and error detection for malformed inputs across encoding forms like UTF-32. Unicode 16.0 (2024) introduced no further changes to UTF-32's specification. These updates addressed conformance without altering the fundamental mechanics, underscoring UTF-32's role as a mature, unchanging component of the standard.[24][25]
Advantages and Limitations
Benefits of Fixed-Width Encoding
One key benefit of UTF-32's fixed-width encoding is the ability to perform random access to individual code points efficiently. Each Unicode code point is represented by exactly one 32-bit code unit, allowing direct indexing to the nth code point without the need for parsing variable-length sequences or length prefixes, achieving O(1) time complexity for operations like string indexing.[1] This contrasts with variable-width encodings such as UTF-8 and UTF-16, where locating a specific code point requires sequential scanning.[1]

The fixed size also simplifies various string processing tasks. Substring extraction can be performed by simple byte offsets, as each code point occupies a predictable 4 bytes, and the number of code points in a string is directly calculated as the total byte length divided by 4.[1] Additionally, UTF-32 avoids issues like overlong encodings inherent in some variable-width forms, ensuring a one-to-one mapping between code points and code units that streamlines validation and manipulation.[1] These properties make UTF-32 particularly suitable for internal representations in text processing engines, such as those handling glyph mapping, where isolating individual code points for font rendering is essential, or collation algorithms that compare code points directly for sorting.[1] For instance, in Unicode normalization algorithms, the fixed width facilitates straightforward iteration and decomposition of code points into canonical forms, reducing algorithmic complexity compared to variable encodings that require additional decoding steps.[26]
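A small Python sketch of this property, treating a big-endian UTF-32 buffer as a flat array of 4-byte code units (the sample string and index are arbitrary):

```python
text = "naïve 😀 text"
buf = text.encode("utf-32-be")        # 4 bytes per code point, no BOM

count = len(buf) // 4                 # code-point count by simple division
n = 6                                 # position of '😀' among the code points
unit = buf[4 * n : 4 * n + 4]         # O(1) access: slice the n-th code unit
print(count, chr(int.from_bytes(unit, "big")))   # 12 😀
```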
Drawbacks and Inefficiencies
One of the primary drawbacks of UTF-32 is its storage inefficiency, as it requires a fixed 4 bytes (32 bits) to encode every Unicode code point, regardless of the character's complexity.[1] In contrast, UTF-8 employs a variable-length encoding that uses only 1 byte for the 128 ASCII characters, so UTF-32 files can be up to 4 times larger for English or ASCII-dominant text.[1] For typical multilingual documents with a mix of scripts, this can lead to 2-4 times the storage overhead compared to UTF-8, making UTF-32 impractical for large-scale data storage where space is a concern.[5]

This inefficiency extends to bandwidth usage during transmission, where the fixed-width nature of UTF-32 increases data transfer costs significantly over networks.[1] For instance, transmitting English text in UTF-32 consumes four times the bandwidth of the same content in UTF-8, exacerbating issues in bandwidth-limited environments like the web or mobile applications.[5] Without additional compression layers, this overhead is unavoidable, as UTF-32 offers no inherent compactness for common character ranges.

Compatibility challenges further limit UTF-32's practicality, particularly in web contexts where the HTML standard explicitly prohibits its use to ensure consistent parsing and avoid ambiguity in encoding detection.[27] User agents must not support UTF-32, as the encoding-sniffing algorithms do not distinguish it from UTF-16, potentially leading to misinterpretation of content.[28] Additionally, without a byte order mark (BOM), endianness mismatches between big-endian and little-endian systems can cause severe errors, such as garbled text or crashes, during interchange.[29] UTF-32 also lacks the ASCII transparency provided by UTF-8, where ASCII bytes remain unchanged and compatible with legacy systems, offering no such compactness or backward-compatibility benefits for plain text.[1] Moreover, a 32-bit code unit can hold values that are not valid Unicode scalar values, such as surrogate code points (U+D800–U+DFFF) or values exceeding U+10FFFF; such ill-formed sequences violate Unicode conformance and can introduce security risks or processing failures if not validated.
Usage and Adoption
In Programming Languages
In programming languages, UTF-32 is supported through native types and APIs that facilitate direct handling of Unicode code points as 32-bit integers, enabling straightforward indexing and manipulation without variable-length decoding overhead during processing. This fixed-width approach is particularly useful for algorithms requiring random access to characters, such as string searching or normalization, though most languages convert to more compact encodings like UTF-8 for storage and I/O to optimize memory and performance.[30]

Python provides native support for UTF-32 through its str type, which internally represents Unicode strings using a scheme that can include UCS-4 (equivalent to UTF-32) for strings containing characters beyond the Basic Multilingual Plane; builds prior to version 3.3 could be configured with --enable-unicode=ucs4 to use UCS-4 throughout.[30][31] Since Python 3.3, the internal representation optimizes memory by selecting the narrowest suitable form per string (Latin-1, UCS-2, or UCS-4), falling back to full 32-bit UCS-4 only when supplementary characters are present.[30] Developers can explicitly encode and decode UTF-32 using the str.encode('utf-32') and bytes.decode('utf-32') methods, backed by the built-in codecs machinery, which handles byte order marks and endianness automatically.[32] Best practices include working at the code-point level internally via ord() and chr(), then converting to UTF-8 for file I/O.[32]
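A short sketch of this round trip using CPython's built-in codecs (the exact BOM bytes produced by 'utf-32' depend on the machine's native byte order):

```python
s = "A€😀"
blob = s.encode("utf-32")                # native order, BOM prepended
print(blob.decode("utf-32") == s)        # BOM detected and consumed: True

# Endian-specific codecs omit the BOM; code points via ord()/chr():
print(s.encode("utf-32-be").hex(" "))    # 00 00 00 41 00 00 20 ac 00 01 f6 00
print([f"U+{ord(c):04X}" for c in s])    # ['U+0041', 'U+20AC', 'U+1F600']
```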
C++ introduced explicit UTF-32 support in C++11 with the char32_t type, a fixed 32-bit integer type for Unicode code points, and std::u32string in the <string> header for sequences of such characters. For conversions between UTF-32 and other encodings like UTF-8, the std::codecvt<char32_t, char, std::mbstate_t> facet can be combined with std::wstring_convert for facet-based transformations, though std::wstring_convert and the <codecvt> header have been deprecated since C++17 and are best avoided in new code. Modern best practices recommend libraries like ICU or Boost.Locale for robust UTF-32 handling, avoiding direct byte manipulation to ensure endianness portability, and employing u32string for internal processing before outputting as UTF-8 strings.
Java's String API is based on UTF-16 code units (16-bit char values), but the platform provides UTF-32 access through the java.nio.charset package, where CharsetEncoder and CharsetDecoder support explicit UTF-32 encoding and decoding via Charset.forName("UTF-32").[33] For direct code-point manipulation, methods like String.codePointAt(int index) return a 32-bit int representing the code point at the specified UTF-16 index, handling surrogate pairs transparently to abstract away the internal encoding. This API is useful for algorithms needing fixed-width access, such as collation or normalization, with best practices involving ByteBuffer for bulk conversions and avoiding the assumption that one char equals one character for full Unicode coverage.[33]
In Rust, the char type is a 32-bit Unicode scalar value, effectively providing UTF-32 representation for individual code points, with char::from_u32(u32) offering safe construction that returns None for surrogates and out-of-range values.[34] Strings remain UTF-8 encoded as String or &str, and the standard library has no direct from_utf32 constructor; UTF-32 byte streams are typically decoded by converting each 32-bit unit with char::from_u32 (or via a third-party crate), while the code points of an existing string are obtained by iterating with .chars(), which yields char values.[35] Best practices emphasize using char for processing loops and converting to UTF-8 for I/O, leveraging Rust's zero-cost abstractions for efficient internal UTF-32-like handling without explicit encoding switches.
Go's rune type is an alias for int32 and holds a single Unicode code point, enabling UTF-32-like semantics when a string is converted to a []rune or iterated with range, where each rune directly corresponds to a code point. The golang.org/x/text/encoding/unicode/utf32 package provides explicit encoders and decoders for UTF-32 byte streams, supporting big- and little-endian variants with BOM detection.[36] Common patterns include decoding input to []rune for manipulation (effectively an in-memory UTF-32 array) before re-encoding to UTF-8 for output, as Go's strings are immutable UTF-8 byte sequences optimized for this workflow.
Across these languages, a prevalent best practice is employing UTF-32 internally for algorithmic processing where fixed-width access simplifies implementation, such as in regular expression engines or text analyzers, while converting to UTF-8 or UTF-16 for external interfaces to align with system-level conventions in operating systems and applications.[32]