
Code point

A code point is any of the 1,114,112 numerical values in the codespace, ranging from 0 to 10FFFF in hexadecimal notation (U+0000 to U+10FFFF), each potentially assigned to represent an abstract character in the Unicode Standard. These values form the foundation of Unicode's character encoding model, distinguishing between assigned code points—such as those for graphic characters, format characters, control characters, and private-use areas—and unassigned or reserved ones like the surrogates (U+D800–U+DFFF) and noncharacters. In practice, code points are denoted using the "U+" prefix followed by four to six hexadecimal digits, such as U+0041 for the Latin capital letter A, emphasizing their role in abstracting characters from specific glyphs or visual forms to enable universal text representation across scripts and languages. This notation facilitates precise referencing in standards, software, and documentation, where 159,801 characters have been assigned as of Unicode 17.0, supporting 172 scripts. The encoding model transforms code points into sequences of code units via defined encoding forms: UTF-8 (variable, 1–4 bytes, backward-compatible with ASCII), UTF-16 (1 or 2 16-bit units, using surrogate pairs for code points beyond U+FFFF), and UTF-32 (fixed 32-bit units for direct mapping). This structure allows efficient storage and processing of text, with code points organized into 17 planes—the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) containing most commonly used characters, and supplementary planes for rarer scripts and symbols. Code points may combine in sequences to form grapheme clusters or normalized forms, ensuring compatibility in rendering and collation, while reserved areas prevent conflicts in future expansions of the standard.

Fundamentals

Definition

A code point is a numerical index or value, typically an integer, used in a character encoding scheme to uniquely identify an abstract character from a predefined repertoire. Code points serve as mappings between human-readable characters and machine-readable numbers, enabling the systematic representation and processing of text in computing systems. For example, in the ASCII encoding, the code point 65 represents the uppercase letter 'A'. In Unicode, the same character is identified by the code point U+0041. A code point designates an abstract character, which is a semantic unit independent of its specific encoded form or visual appearance.
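As a minimal sketch of this character-to-number mapping, Python's built-in ord and chr functions convert between a character and its code point:

```python
# ord() returns a character's code point; chr() performs the inverse mapping.
print(ord('A'))        # 65 -> the code point shared by ASCII and Unicode
print(hex(ord('A')))   # 0x41 -> written U+0041 in Unicode notation
print(chr(0x41))       # 'A' -> the character recovered from its code point
print(chr(0x1F600))    # the character assigned at U+1F600 (a grinning face)
```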

Relation to Characters and Glyphs

An abstract character represents a unit of information with semantic value, serving as the smallest component of written language independent of any particular encoding scheme or visual rendering method. For example, the abstract character for the letter 'é' embodies the concept of an 'e' with an acute accent, regardless of how it is stored digitally or displayed. In the Unicode Standard, each such abstract character is uniquely identified by a code point, which is a non-negative integer assigned within the Unicode codespace. In distinction to abstract characters, a glyph is the particular visual image or shape used to depict a character during rendering or display. Glyphs are defined by font technologies and can differ widely; for instance, the abstract character 'A' might be rendered as one glyph in one font and a stylistic variant in another. The Unicode Standard specifies that glyphs are not part of the encoding model itself but result from the interpretation of code points by rendering engines. A code point generally maps to a single abstract character, but the transition from abstract character to glyph introduces variability based on context and presentation rules. One abstract character may correspond to multiple glyphs, such as in the case of positional forms in cursive scripts like Arabic, where the shape adapts to initial, medial, or final positions. Conversely, ligatures can combine multiple abstract characters—each with its own code point—into a single composite glyph, as seen with 'fi' forming a unified shape in many fonts to improve readability. Unicode normalization addresses scenarios where distinct code point sequences encode the same abstract character, enabling consistent text processing across systems. For example, the precomposed 'é' (U+00E9) is canonically equivalent to the sequence 'e' (U+0065) followed by a combining acute accent (U+0301), allowing normalization forms like NFC (which favors precomposed characters) or NFD (which decomposes them) to standardize representations without altering semantic meaning. This equivalence ensures that applications can interchange text reliably while preserving the underlying abstract character.
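As an illustrative check of this equivalence, Python's standard unicodedata module exposes the normalization forms directly:

```python
import unicodedata

precomposed = '\u00E9'   # 'é' as the single code point U+00E9
decomposed = 'e\u0301'   # 'e' (U+0065) followed by combining acute accent (U+0301)

# The two sequences render identically but compare unequal code point by code point.
print(precomposed == decomposed)                                # False

# NFD decomposes the precomposed form; NFC recomposes the decomposed one.
print(unicodedata.normalize('NFD', precomposed) == decomposed)  # True
print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True
```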

Representation in Encodings

Code Units vs. Code Points

In character encodings, a code unit represents the smallest fixed-size unit of storage or transmission for text data, typically defined by the bit width of the encoding form. For instance, UTF-8 employs 8-bit code units (bytes), while UTF-16 uses 16-bit code units. These code units serve as the basic building blocks for representing sequences of text, allowing computers to process and interchange data efficiently across different systems. The primary distinction between code points and code units lies in their roles and granularity: a code point is a single numerical value (from 0 to 10FFFF in hexadecimal) assigned to an abstract character in the Unicode standard, whereas code units are the encoded bits that collectively form one or more code points. In fixed-width encodings like UTF-32, each code point corresponds directly to one code unit (a 32-bit value), simplifying access. However, in variable-width encodings such as UTF-8 and UTF-16, a single code point often requires multiple code units to encode, particularly for characters beyond the Basic Multilingual Plane. This multi-unit representation enables compact storage but introduces complexity in parsing text streams. A concrete example illustrates this difference: the Unicode code point U+1F600, which maps to the grinning face emoji (😀), is encoded as four 8-bit code units in UTF-8 (hexadecimal F0 9F 98 80, or bytes 240, 159, 152, 128) and as two 16-bit code units in UTF-16 (hexadecimal D83D DE00, forming a surrogate pair). In UTF-16, the first unit (D83D) is a high surrogate and the second (DE00) a low surrogate, together representing the full code point; treating them separately would yield invalid or unintended characters. When processing text, algorithms must correctly decode sequences of code units into complete code points to ensure accurate interpretation of abstract characters. Failure to handle multi-unit code points properly—such as by assuming each code unit is an independent character—can result in errors like mojibake, where encoded text is misinterpreted and rendered as garbled symbols during decoding with an incompatible scheme. This underscores the need for encoding-aware software to normalize and validate input, preventing data corruption in applications ranging from web browsers to file systems.
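The emoji example can be reproduced in a few lines of Python, whose strings are sequences of code points and whose code units become visible only upon encoding; a brief sketch:

```python
s = '\U0001F600'   # one code point: U+1F600, the grinning face emoji

print(len(s))                          # 1 -> Python counts code points, not code units
print(s.encode('utf-8').hex(' '))      # f0 9f 98 80 -> four 8-bit code units
print(s.encode('utf-16-be').hex(' '))  # d8 3d de 00 -> two 16-bit units (a surrogate pair)
print(s.encode('utf-32-be').hex(' '))  # 00 01 f6 00 -> one 32-bit code unit
```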

Fixed-Width Encodings

Fixed-width encodings are schemes in which each code point is represented using a consistent number of code units, such as bits or bytes, resulting in sequences of uniform length for all characters. This direct mapping between code points and fixed-size code units simplifies the representation process, as no variable-length sequences are required to encode different characters. The primary advantages of fixed-width encodings include ease of implementation and processing, since there is no need for complex decoding algorithms to determine character boundaries. They also enable efficient random access to individual characters within a text stream, allowing the position of the nth character to be computed in constant time by simple arithmetic on the code unit offsets. These properties make fixed-width encodings particularly suitable for applications with small character repertoires, where simplicity outweighs storage efficiency. However, fixed-width encodings have significant limitations due to their uniform sizing, which caps the total number of representable code points at the power of two corresponding to the width (e.g., 128 for 7 bits or 256 for 8 bits). This restricts their ability to accommodate large or diverse character sets, such as those required for multilingual text, often necessitating multiple incompatible encodings for different languages. For the full Unicode range, UTF-32 is a fixed-width encoding using 32-bit code units, providing direct mapping for all 1,114,112 possible code points without surrogates or variable lengths, though it uses more storage for ASCII-range text compared to variable-width forms. Prominent examples include ASCII, a 7-bit encoding supporting 128 code points from 0x00 to 0x7F, primarily for English-language text and control characters. ISO/IEC 8859-1, an 8-bit extension of ASCII, provides 256 code points for Western European languages, with the first 128 matching ASCII. EBCDIC, another 8-bit scheme developed by IBM, uses a different bit assignment for characters and remains in use on mainframe systems. Windows-1252, a variant of ISO/IEC 8859-1, also employs 8 bits but includes additional printable characters in the upper range for enhanced Western European support.
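The constant-time indexing property can be demonstrated by contrast with a variable-width form; a small sketch in Python:

```python
text = 'A€😀'   # ASCII, BMP, and supplementary code points mixed

# Fixed-width UTF-32: every code point occupies exactly 4 bytes, so the nth
# character sits at byte offset 4*n and is reachable by pure arithmetic.
buf = text.encode('utf-32-le')
n = 2
print(buf[4 * n : 4 * n + 4].decode('utf-32-le'))   # '😀'

# Variable-width UTF-8: byte lengths differ per character, so finding the nth
# character requires scanning from the start of the stream.
print([len(c.encode('utf-8')) for c in text])        # [1, 3, 4]
```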

Variable-Width Encodings

Variable-width encodings represent Unicode code points using a varying number of code units, allowing for more efficient storage of text with predominantly low-range characters while supporting the full range up to U+10FFFF. This approach contrasts with fixed-width encodings like UTF-32, which allocate uniform space regardless of the code point value. By adjusting the number of code units based on the code point's magnitude, these encodings optimize space for common scripts such as Latin and Cyrillic, which fit into fewer units, while extending to rarer or higher-range characters with additional units. UTF-8, a widely used variable-width encoding, employs 8-bit code units and determines the sequence length from the leading bits of the first byte. Code points in the range U+0000 to U+007F (basic Latin and ASCII) are encoded in a single byte, ensuring compatibility with legacy ASCII systems. For U+0080 to U+07FF (e.g., extended Latin, Greek, Cyrillic, Arabic), two bytes are used; U+0800 to U+FFFF (including most of the Basic Multilingual Plane, or BMP) require three bytes; and U+10000 to U+10FFFF (supplementary planes) use four bytes. The encoding algorithm ensures that continuation bytes (always 10xxxxxx in binary) follow the lead byte, which specifies the total length, enabling a self-synchronizing property where parsers can detect sequence boundaries efficiently, even after data corruption, by examining at most four bytes backward.
Code Point Range     Bytes in UTF-8   Example Characters
U+0000–U+007F        1                Basic Latin (A–Z)
U+0080–U+07FF        2                Extended Latin, Greek, Cyrillic, Arabic
U+0800–U+FFFF        3                Devanagari, Thai, BMP Han
U+10000–U+10FFFF     4                Emoji, supplementary CJK
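The range boundaries in the table above translate directly into a length rule; the helper below (utf8_length is a hypothetical name, not a standard API) mirrors it and checks itself against Python's encoder:

```python
def utf8_length(cp: int) -> int:
    """Bytes needed to encode a code point in UTF-8, per the ranges above."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    if cp <= 0x10FFFF:
        return 4
    raise ValueError('outside the Unicode codespace')

# 'A', Greek alpha, Thai ko kai, and an emoji: one example per table row.
for cp in (0x41, 0x3B1, 0x0E01, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode('utf-8'))
    print(f'U+{cp:04X}: {utf8_length(cp)} byte(s)')
```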
UTF-16 uses 16-bit code units and encodes most code points in the BMP (U+0000 to U+FFFF) with a single unit, making it compact for scripts like European languages and many Asian ideographs. For code points beyond U+FFFF, it employs surrogate pairs: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF), forming a two-unit (four-byte) sequence that represents one supplementary code point. This mechanism reserves 2,048 code points in the BMP for surrogates, ensuring reversible mapping without overlap. These encodings balance space efficiency and implementation complexity for large character sets. UTF-8 excels in storage for predominantly ASCII or European text, where over 90% of bytes may be single-unit in English documents, but expands significantly for East Asian scripts, potentially using three or four bytes per character. UTF-16 offers better performance for BMP-heavy content in processing environments like Java or the Windows APIs, as most operations avoid surrogate handling, though its variable length introduces parsing overhead compared to fixed-width alternatives. Both require careful byte-order handling (e.g., via a BOM) to avoid misinterpretation across platforms.
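A sketch of the surrogate-pair construction described above, splitting a supplementary code point into its two 16-bit units (to_surrogate_pair is an illustrative helper, not a library function):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)   # low 10 bits -> U+DC00..U+DFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f'U+{high:04X} U+{low:04X}')    # U+D83D U+DE00
```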

Code Points in Unicode

Unicode Range and Allocation

The Unicode Standard defines a total of 1,114,112 code points, ranging from U+0000 to U+10FFFF, organized into 17 planes each containing 65,536 code points. This expansive codespace accommodates the encoding of characters from all known writing systems while leaving substantial room for expansion. As of Unicode 17.0, released in September 2025, approximately 159,801 code points are assigned to characters, with the remainder categorized as reserved, unallocated, or designated as noncharacters. Reserved code points include areas for surrogates (2,048 code points in the range U+D800–U+DFFF) and private use (137,468 code points in the private use areas across the Basic Multilingual Plane and Planes 15 and 16), ensuring specific functions without assignment to abstract characters. Unallocated code points remain available for future assignments, while noncharacters—totaling 66—such as U+FDD0–U+FDEF (32 points) and the pairs U+FFFE and U+FFFF in each plane (34 points)—are explicitly excluded from open interchange and serve internal purposes like end-of-text markers or byte-order detection. These categories maintain a clear distinction between usable character space and specialized reservations. The allocation principles of Unicode emphasize stability, universality, and future-proofing. Stability is enforced through policies that prohibit the reallocation, removal, or modification of assigned code points, ensuring that once encoded, a character's semantics and properties remain unchanged across versions. Universality aims to encompass characters from every writing system worldwide, harmonized with ISO/IEC 10646 to support global text interchange without bias toward any language or region. Future-proofing is achieved by allocating only a fraction of the codespace initially, preserving the vast majority as unallocated to accommodate unforeseen needs, such as emerging scripts or extensions, without disrupting existing implementations. Recent updates reflect these principles without altering the overall range. Unicode 17.0 added 4,803 new code points, including characters for four new scripts (such as Beria Erfe and Sidetic) and eight new emoji, bringing the total assigned characters to 159,801 while adhering to stability rules and reserving space for future growth. No major changes to the codespace boundaries or allocation categories have occurred by late 2025.

Planes, Blocks, and Surrogates

Unicode code points are organized into 17 planes, each comprising 65,536 consecutive code points, to facilitate the encoding of a vast repertoire of characters while maintaining compatibility with earlier standards. Plane 0, known as the Basic Multilingual Plane (BMP), spans U+0000 to U+FFFF and contains the most commonly used characters from virtually all modern writing systems, including Latin, Greek, Cyrillic, and many others, ensuring that legacy systems can handle a significant portion of Unicode without modification. Planes 1 through 16 extend the codespace for less common or specialized scripts; for instance, Plane 2, the Supplementary Ideographic Plane (SIP), accommodates additional CJK Unified Ideographs from U+20000 to U+2FFFF, supporting expanded needs for East Asian typography. Within these planes, the Unicode codespace is further subdivided into blocks, which group code points thematically by script, symbol category, or functional purpose, aiding in the organization and lookup of characters. Each block typically ranges from 16 to 256 code points, though sizes vary, and unassigned areas exist for future allocations. For example, the Basic Latin block from U+0000 to U+007F includes the 128 characters of the ASCII standard, such as letters, digits, and punctuation, forming the foundation for English and many Western languages. To represent code points beyond the BMP in UTF-16 encoding, surrogate code points in Plane 0 are reserved for pairing to access the supplementary planes (1–16). These consist of 1,024 high surrogates from U+D800 to U+DBFF and 1,024 low surrogates from U+DC00 to U+DFFF, forming 1,048,576 possible pairs that map to code points U+10000 through U+10FFFF. In UTF-16, a supplementary code point is thus encoded as a 32-bit sequence: a 16-bit high surrogate followed by a 16-bit low surrogate, calculated such that the high surrogate's offset (U+D800 subtracted) combined with the low surrogate's offset (U+DC00 subtracted) yields the supplementary value. Software implementations must properly decode these surrogate pairs to retrieve the full code point; unpaired surrogates or mismatched pairs are considered invalid and typically trigger errors or replacement with substitution characters to maintain data integrity.
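The decoding direction of that offset calculation can be sketched the same way (from_surrogate_pair is an illustrative helper); the plane of any code point also falls out of simple arithmetic:

```python
def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a UTF-16 surrogate pair into its supplementary code point."""
    assert 0xD800 <= high <= 0xDBFF, 'not a high surrogate'
    assert 0xDC00 <= low <= 0xDFFF, 'not a low surrogate'
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(from_surrogate_pair(0xD83D, 0xDE00)))   # 0x1f600

# Each plane spans 0x10000 code points, so the plane number is cp >> 16.
print(0x1F600 >> 16)   # 1 -> Plane 1, the Supplementary Multilingual Plane
```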

Historical Development

Pre-Unicode Encodings

The concept of code points originated in early mechanical and electrical systems for representing discrete symbols through numerical assignments. In the late 19th century, punched cards emerged as a medium for encoding data, with Herman Hollerith's system for the 1890 U.S. Census using patterns of holes to denote numerical values from 0 to 9, effectively assigning code points to digits for tabulation purposes. This approach laid foundational principles for mapping symbols to fixed positions in a code set, primarily for numeric data but extending to alphabetic characters for comprehensive census tabulation. Telegraph systems in the same era advanced character encoding further. The Baudot code, patented by Émile Baudot in 1874 and based on his 1870 invention, employed a 5-bit scheme to define 32 distinct code points, primarily for uppercase letters, numbers, and control signals in asynchronous transmission over teleprinters. This 5-bit limitation reflected hardware constraints of the time, such as mechanical keyboards and early electrical relays, but it enabled efficient multiplexing of multiple channels on telegraph lines. By the 1960s, computing demanded broader standardization for text interchange. The American Standard Code for Information Interchange (ASCII), developed through collaboration among telecommunications and computer manufacturers, was published as ASA X3.4-1963 by the American Standards Association (predecessor to ANSI), defining a 7-bit code with 128 code points to encompass English uppercase and lowercase letters, digits, punctuation, and control characters. This fixed-width encoding prioritized compatibility with existing teletype equipment while allocating the upper bit for parity or future extensions, marking a shift toward universal adoption in U.S.-centric systems. The 1970s and 1980s saw a proliferation of 8-bit extensions to address non-English scripts, introducing the notion of code pages as variant mappings within the expanded 256 code points. The ISO/IEC 8859 series, first published in 1987 by the International Organization for Standardization, comprised multiple parts tailored to regional needs, such as ISO/IEC 8859-1 (Latin-1) for Western European languages including accented characters beyond ASCII. These standards extended the 7-bit ASCII subset into the upper 128 positions for language-specific glyphs, facilitating adoption in personal computers and international data exchange. Despite these advances, pre-Unicode encodings suffered from profound incompatibilities, as each system optimized for local languages without global coordination. For instance, Big5, devised in 1984 by Taiwan's Institute for Information Industry, used variable-width bytes to encode over 13,000 Chinese characters but overlapped ambiguously with ASCII ranges, rendering it incompatible with Japanese encodings like Shift JIS. Shift JIS, developed in 1982 by ASCII Corporation and Microsoft for Japanese kanji, katakana, and hiragana, similarly prioritized single- and double-byte sequences tailored to East Asian scripts, leading to data corruption—known as mojibake—when texts were exchanged across platforms. Such fragmentation, driven by national standards and vendor-specific implementations, resulted in hundreds of rival code sets and hindered cross-border digital communication.

Unicode Evolution

The development of the Unicode standard originated in late 1987, when engineers Joe Becker from Xerox and Lee Collins and Mark Davis from Apple initiated discussions to create a universal character encoding capable of supporting multiple writing systems beyond the limitations of ASCII. This effort addressed the fragmentation of existing encodings for different languages and scripts, aiming for a single, unified approach to text representation. By 1988, the project had formalized a character database, and in 1989, it expanded to include collaborators from organizations such as the Research Libraries Group, aligning the scope with emerging international standards. In 1990, additional organizations joined the initiative, and work focused on mapping to the draft ISO/IEC 10646 standard, finalizing Han unification for East Asian ideographs, and establishing a compatibility zone for legacy encodings. The Unicode Consortium was officially incorporated in California on January 3, 1991, to oversee the project. That year, the Consortium merged efforts with ISO/IEC 10646, agreeing to maintain full compatibility between Unicode and the emerging ISO standard, which ensured synchronized growth and avoided divergent paths in global text encoding. Unicode 1.0 was released in October 1991, defining an initial repertoire of over 7,000 code points primarily in the Basic Multilingual Plane (BMP), covering major scripts like Latin, Greek, Cyrillic, Arabic, and Hebrew, while prioritizing compatibility with ASCII in the range U+0000 to U+007F. Subsequent versions built on this foundation, expanding the encoded repertoire to support globalization. Unicode 2.0, released in 1996, significantly broadened the character set by incorporating additional scripts and completing the initial allocation plan for the BMP, which spans 65,536 code points from U+0000 to U+FFFF and serves as the core plane for common usage. This version added support for scripts such as Tibetan and a greatly expanded set of Hangul syllables, enhancing compatibility with regional standards. By Unicode 3.1 in 2001, the standard began populating Plane 1 (the Supplementary Multilingual Plane, U+10000 to U+1FFFF), introducing historic scripts like Gothic, Deseret, and Old Italic, marking the first extensions beyond the BMP to accommodate less frequently used or ancient writing systems. Unicode 6.0, released in 2010, further solidified the standard's maturity amid rising internet globalization, coinciding with the widespread dominance of UTF-8 as the preferred encoding for the web due to its ASCII compatibility and variable-length design. This version added over two thousand characters, including scripts like Batak and Brahmi, while UTF-8's adoption exceeded 90% of web pages by the mid-2010s, driven by browser support and XML standards. Emoji integration emerged around 2007, with initial symbol additions evolving into dedicated emoji support; the first large set of explicitly emoji-intended characters was encoded in Unicode 6.0 for cross-platform interoperability, addressing the need for visual expression in digital communication. In the modern era, Unicode has adopted an annual release cycle to accommodate ongoing demands for script inclusion and cultural representation. For instance, Unicode 14.0 in 2021 added 838 characters, primarily additions to existing scripts and symbols, including new emoji and extensions for historical notations, bringing the total assigned code points to 144,697. Unicode 16.0, released in 2024, introduced 5,185 new characters—such as the West African Garay script and historic Tulu-Tigalari—along with enhancements for emoji and legacy computing symbols, resulting in a total of 154,998 assigned code points.
Unicode 17.0, released in September 2025, added 4,803 characters, including four new scripts such as Beria Erfe and Sidetic, along with emoji and symbol extensions, bringing the total assigned code points to 159,801.
