Code point
A code point is any of the 1,114,112 numerical values in the Unicode codespace, ranging from 0 to 10FFFF in hexadecimal notation (U+0000 to U+10FFFF), each potentially assigned to represent an abstract character in the Unicode Standard.[1] These values form the foundation of Unicode's character encoding model, which distinguishes assigned code points (such as those for graphic characters, format characters, control characters, and the private-use areas) from code points set aside for special purposes, such as the surrogates (U+D800–U+DFFF) and noncharacters, and from reserved code points left unassigned for future use.[2]

In practice, code points are denoted using the "U+" prefix followed by four to six hexadecimal digits, such as U+0041 for the Latin capital letter A, emphasizing their role in abstracting characters from specific glyphs or visual forms to enable universal text representation across scripts and languages.[3] This notation facilitates precise referencing in standards, software, and documentation; as of Unicode 17.0, 159,801 characters have been assigned, supporting 172 scripts.[2]

The Unicode encoding model transforms code points into sequences of code units via defined encoding forms: UTF-8 (variable, 1–4 bytes, backward-compatible with ASCII), UTF-16 (1 or 2 16-bit units, using surrogate pairs for code points beyond U+FFFF), and UTF-32 (fixed 32-bit units mapping each code point directly).[2] This structure allows efficient storage and processing of text, with code points organized into 17 planes: the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) contains most commonly used characters, while the supplementary planes hold rarer scripts and symbols.[3] Code points may combine in sequences to form grapheme clusters or normalized forms, ensuring compatibility in rendering and collation, while reserved areas prevent conflicts in future expansions of the standard.[2]
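
These conventions can be illustrated with a short Python 3 sketch; the describe helper below is hypothetical (it is not part of any standard library) and only shows the "U+" notation and the plane arithmetic for a given code point.

```python
# Illustrative sketch: "U+" notation and plane number for a code point.
# The helper name `describe` is hypothetical, not a standard API.
def describe(cp: int) -> str:
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    plane = cp >> 16                 # each of the 17 planes spans 0x10000 code points
    return f"U+{cp:04X} (plane {plane})"

print(describe(0x0041))    # U+0041 (plane 0)  -> Latin capital letter A, in the BMP
print(describe(0x1F600))   # U+1F600 (plane 1) -> a supplementary-plane character
```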
Fundamentals
Definition
A code point is a numerical index or value, typically an integer, used in a character encoding scheme to uniquely identify an abstract character from a predefined repertoire.[4][5] Code points provide the link between human-readable characters and machine-readable binary data, enabling the systematic representation and processing of text in computing systems.[3] For example, in the ASCII encoding, the code point 65 represents the uppercase letter 'A'.[6] In Unicode, the same character is identified by the code point U+0041.[7] A code point designates an abstract character, which is a semantic unit independent of its specific encoded form or visual appearance.[5]
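
As a concrete illustration, Python's built-in ord and chr functions move between a character and its code point; the snippet below is a minimal sketch of the 'A' example above, not a normative definition.

```python
# ord() yields a character's code point as an integer; chr() is the inverse.
assert ord('A') == 65          # decimal value of the code point for 'A'
assert ord('A') == 0x41        # the same value in hexadecimal, i.e. U+0041
assert chr(0x41) == 'A'        # mapping the code point back to the character
print(f"U+{ord('A'):04X}")     # prints: U+0041
```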
Relation to Characters and Glyphs
An abstract character represents a unit of information with semantic value, serving as the smallest component of written language independent of any particular encoding scheme or visual rendering method. For example, the abstract character for the letter 'é' embodies the concept of an 'e' with an acute accent, regardless of how it is stored digitally or displayed. In the Unicode Standard, each such abstract character is uniquely identified by a code point, which is a non-negative integer assigned within the Unicode codespace.[8][1]

In contrast to abstract characters, a glyph is the particular visual image or shape used to depict a character during rendering or printing. Glyphs are defined by font technologies and can differ widely; for instance, the abstract character 'A' might be rendered as a serif glyph in one typeface or a sans-serif variant in another. The Unicode Standard specifies that glyphs are not part of the encoding model itself but result from the interpretation of code points by rendering engines.[8][5]

A code point generally maps to a single abstract character, but the transition from abstract character to glyph introduces variability based on context and presentation rules. One abstract character may correspond to multiple glyphs, as with the positional forms of cursive scripts such as Arabic, where the shape adapts to initial, medial, or final position. Conversely, ligatures can combine multiple abstract characters, each with its own code point, into a single composite glyph, as seen with 'fi' forming a unified shape in many fonts to improve readability.[8][5]

Unicode normalization addresses scenarios where distinct code point sequences encode the same abstract character, enabling consistent text processing across systems. For example, the precomposed 'é' (U+00E9) is canonically equivalent to the sequence 'e' (U+0065) followed by a combining acute accent (U+0301), allowing normalization forms like NFC (which favors precomposed characters) or NFD (which decomposes them) to standardize representations without altering semantic meaning. This equivalence ensures that applications can interchange text reliably while preserving the underlying abstract character.[9][8]
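
The equivalence described above can be observed with Python's standard unicodedata module; this is a brief sketch of the NFC and NFD forms for 'é', not a complete treatment of normalization.

```python
import unicodedata

precomposed = "\u00E9"      # 'é' as the single code point U+00E9
decomposed  = "e\u0301"     # 'e' (U+0065) followed by U+0301 COMBINING ACUTE ACCENT

assert precomposed != decomposed                                # different code point sequences
assert unicodedata.normalize("NFC", decomposed) == precomposed  # NFC recomposes
assert unicodedata.normalize("NFD", precomposed) == decomposed  # NFD decomposes
```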
Representation in Encodings
Code Units vs. Code Points
In character encodings, a code unit represents the smallest fixed-size unit of storage or transmission for text data, typically defined by the bit width of the encoding form. For instance, UTF-8 employs 8-bit code units (bytes), while UTF-16 uses 16-bit code units.[1] These code units serve as the basic building blocks for representing sequences of text, allowing computers to process and interchange Unicode data efficiently across different systems.[10]

The primary distinction between code points and code units lies in their roles and granularity: a code point is a single numerical value (from 0 to 10FFFF in hexadecimal) assigned to an abstract character in the Unicode standard, whereas code units are the encoded bits that collectively form one or more code points.[1] In fixed-width encodings like UTF-32, each code point corresponds directly to one code unit (a 32-bit value), simplifying access. However, in variable-width encodings such as UTF-8 and UTF-16, a single code point often requires multiple code units to encode, particularly for characters beyond the Basic Multilingual Plane. This multi-unit representation enables compact storage but introduces complexity in parsing text streams.

A concrete example illustrates this difference: the Unicode code point U+1F600, which maps to the grinning face emoji (😀), is encoded as four 8-bit code units in UTF-8 (hexadecimal F0 9F 98 80, or bytes 240, 159, 152, 128) and as two 16-bit code units in UTF-16 (hexadecimal D83D DE00, forming a surrogate pair).[11] In UTF-16, the first unit (D83D) is a high surrogate and the second (DE00) a low surrogate, together representing the full code point; treating them separately would yield invalid or unintended characters.

When processing text, algorithms must correctly decode sequences of code units into complete code points to ensure accurate interpretation of abstract characters. Failure to handle multi-unit code points properly, such as by assuming each code unit is an independent character, can result in errors like mojibake, where encoded text is misinterpreted and rendered as garbled symbols during decoding with an incompatible scheme.[4] This underscores the need for encoding-aware software to normalize and validate input, preventing data corruption in applications ranging from web browsers to file systems.
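
The U+1F600 example can be reproduced with Python's built-in codecs; the snippet below is a small sketch confirming the code unit counts quoted above.

```python
s = "\U0001F600"                       # the single code point U+1F600 (😀)

utf8 = s.encode("utf-8")               # four 8-bit code units
assert utf8 == b"\xF0\x9F\x98\x80"

utf16 = s.encode("utf-16-be")          # big-endian, no byte order mark
assert utf16 == b"\xD8\x3D\xDE\x00"    # two 16-bit code units: D83D DE00
high = int.from_bytes(utf16[:2], "big")
low  = int.from_bytes(utf16[2:], "big")
print(hex(high), hex(low))             # 0xd83d 0xde00, a surrogate pair
```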
Fixed-Width Encodings
Fixed-width encodings are character encoding schemes in which each code point is represented using a consistent number of code units, such as bits or bytes, resulting in sequences of uniform length for all characters. This direct one-to-one mapping between code points and fixed-size code units simplifies the representation process, as no variable-length sequences are required to encode different characters.[12]

The primary advantages of fixed-width encodings include ease of implementation and processing, since there is no need for complex decoding algorithms to determine character boundaries. They also enable efficient random access to individual characters within a text stream, allowing the position of the nth character to be computed in constant time by simple arithmetic on the code unit offsets. These properties make fixed-width encodings particularly suitable for applications with small character repertoires, where simplicity outweighs storage efficiency.[13][14]

However, fixed-width encodings have significant limitations due to their uniform sizing, which caps the total number of representable code points at the power of two corresponding to the width (e.g., 128 for 7 bits or 256 for 8 bits). This restricts their ability to accommodate large or diverse character sets, such as those required for multilingual text, often necessitating multiple incompatible variants for different languages. For the full Unicode range, UTF-32 is a fixed-width encoding using 32-bit code units, providing direct mapping for all 1,114,112 possible code points without surrogates or variable lengths, though it uses more storage for ASCII-range text than variable-width forms.[2]

Prominent examples include ASCII, a 7-bit encoding supporting 128 code points from 0x00 to 0x7F, primarily for English-language text and control characters. ISO/IEC 8859-1, an 8-bit extension of ASCII, provides 256 code points for Western European languages, with the first 128 matching ASCII.[15] EBCDIC, another 8-bit scheme, developed by IBM, assigns characters to code values differently from ASCII and remains in use on mainframe systems.[16] Windows-1252, a Microsoft variant of ISO/IEC 8859-1, also employs 8 bits but adds printable characters in the 0x80–0x9F range, where ISO/IEC 8859-1 places control codes, for enhanced Western European support.[17]
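
A short Python sketch of the random-access property described above: with a fixed width of one byte (ISO/IEC 8859-1, "latin-1") or four bytes (UTF-32), the offset of the nth character is simple arithmetic. The example string and offsets are illustrative only.

```python
text = "naïve"                        # every code point fits in ISO/IEC 8859-1

latin1 = text.encode("latin-1")       # fixed width: 1 byte per character
assert latin1[3:4].decode("latin-1") == "v"           # 4th character at offset 3 * 1

utf32 = text.encode("utf-32-be")      # fixed width: 4 bytes per character (no BOM)
assert utf32[3 * 4:4 * 4].decode("utf-32-be") == "v"  # 4th character at offset 3 * 4
```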
Variable-Width Encodings
Variable-width encodings represent Unicode code points using a varying number of code units, allowing for more efficient storage of text with predominantly low-range characters while supporting the full range up to U+10FFFF.[2] This approach contrasts with fixed-width encodings like UTF-32, which allocate uniform space regardless of the code point value.[2] By adjusting the number of code units based on the code point's magnitude, these encodings optimize space for common scripts such as Latin and Cyrillic, which fit into fewer units, while extending to rarer or higher-range characters with additional units.[2]

UTF-8, a widely used variable-width encoding, employs 8-bit code units and determines the sequence length from the leading bits of the first byte.[2] Code points in the range U+0000 to U+007F (basic Latin and ASCII) are encoded in a single byte, ensuring compatibility with legacy ASCII systems.[2] For U+0080 to U+07FF (e.g., extended Latin, Greek, Cyrillic, Arabic), two bytes are used; U+0800 to U+FFFF (including most of the Basic Multilingual Plane, or BMP) require three bytes; and U+10000 to U+10FFFF (supplementary planes) use four bytes.[2] The encoding algorithm ensures that continuation bytes (always 10xxxxxx in binary) follow the lead byte, which specifies the total length, enabling a self-synchronizing property where parsers can detect sequence boundaries efficiently, even after data corruption, by examining at most four bytes backward.[2]

| Code Point Range | Bytes in UTF-8 | Example Characters |
|---|---|---|
| U+0000–U+007F | 1 | Basic Latin (A–Z) |
| U+0080–U+07FF | 2 | Extended Latin, Greek, Cyrillic, Arabic |
| U+0800–U+FFFF | 3 | Devanagari, Thai, BMP Han |
| U+10000–U+10FFFF | 4 | Emoji, supplementary CJK |
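
A compact, illustrative encoder written against the ranges in the table makes the lead-byte and continuation-byte structure visible. It is a sketch only: it omits the check rejecting surrogate code points (U+D800–U+DFFF) that a conforming encoder must perform, and in practice Python's built-in str.encode("utf-8") is the normal tool; the loop at the end cross-checks the sketch against it with one code point from each row of the table.

```python
def utf8_encode(cp: int) -> bytes:
    """Illustrative UTF-8 encoder; a real one must also reject U+D800-U+DFFF."""
    if cp <= 0x7F:          # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:      # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    raise ValueError("beyond the Unicode codespace")

# One code point per table row: A (1 byte), α (2), ह (3), 😀 (4).
for cp in (0x0041, 0x03B1, 0x0939, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```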