
Precomposed character

A precomposed character, also known as a composite or decomposable character, is a single code point that encodes a base character combined with one or more diacritical marks or other modifiers, such as é (U+00E9 LATIN SMALL LETTER E WITH ACUTE), which is canonically equivalent to the decomposed sequence of e (U+0065) followed by the combining acute accent (U+0301). This design allows Unicode to represent accented or modified letters directly as atomic units, facilitating efficient storage and processing in systems that prefer single-code-point representations over sequences of combining characters. In the Unicode Standard, precomposed characters are integral to handling multilingual text, particularly for languages using Latin scripts with diacritics, such as French, Spanish, or Vietnamese, where characters like à (U+00E0) or ñ (U+00F1) are encoded as distinct code points rather than relying solely on base letters plus combining marks. Their use contrasts with decomposed forms, which break such characters into a base and separate combining characters (e.g., a + combining grave accent for à), enabling flexible rendering but potentially increasing complexity in text comparison and searching.

Unicode normalization addresses equivalences between precomposed and decomposed representations through four standard forms: NFC (Normalization Form C), which applies canonical decomposition followed by canonical composition to produce fully precomposed text where possible; NFD (Normalization Form D), which decomposes precomposed characters into their base and combining components while reordering marks canonically; and the compatibility variants NFKC and NFKD, which additionally handle font-specific or other compatibility mappings. For instance, the string "café" in precomposed form normalizes to the same NFC sequence as its decomposed counterpart (with é broken into e + U+0301), ensuring consistent behavior across applications for comparison, searching, and display. This normalization framework, defined in Unicode Standard Annex #15, promotes interoperability in software handling diverse scripts and is essential for algorithms in databases, web browsers, and text editors.

Definition and Basic Concepts

Definition of Precomposed Characters

A precomposed character, also known as a composite or decomposable character, is a single code point that directly encodes a character formed by combining a base letter with one or more diacritics or other modifiers, allowing for a unified representation of such composites. For instance, é (U+00E9, LATIN SMALL LETTER E WITH ACUTE) represents the base letter "e" (U+0065) combined with an acute accent, providing a fixed, atomic unit that systems can handle without needing to process separate combining marks. This precomposition ensures compatibility with legacy encodings and simplifies rendering in environments where dynamic composition might be inefficient.

Precomposed characters are distinct from atomic characters, such as the basic Latin letters in the Basic Latin block (U+0000–U+007F), which do not incorporate any combinations and represent standalone letters without diacritics or ligatures. Examples abound in the Latin-1 Supplement block (U+0080–U+00FF), including ñ (U+00F1, LATIN SMALL LETTER N WITH TILDE) for the Spanish "eñe" and Å (U+00C5, LATIN CAPITAL LETTER A WITH RING ABOVE) used in Scandinavian languages. Additionally, certain ligatures are encoded as precomposed forms, such as fi (U+FB01, LATIN SMALL LIGATURE FI), which merges "f" and "i" into a single typographic unit for aesthetic and historical compatibility. This encoding approach treats precomposed characters as stable, predefined units within Unicode, facilitating consistent text processing across diverse scripts, in contrast with decomposed forms that use separate code points for each component.

Comparison with Decomposed Characters

Precomposed characters are represented by a single Unicode code point that encodes the complete character, including any diacritical marks, as in the case of é (U+00E9, LATIN SMALL LETTER E WITH ACUTE). In contrast, decomposed characters express the same visual result through a sequence of code points: a base character followed by one or more combining marks, such as e (U+0065, LATIN SMALL LETTER E) combined with the combining acute accent (U+0301, COMBINING ACUTE ACCENT). These two representations are canonically equivalent according to the Unicode Standard, meaning they are intended to be interchangeable and rendered identically in conformant implementations, though they differ in storage and processing.

The use of precomposed characters ensures reliable and consistent rendering, particularly in legacy systems or applications that lack full support for combining sequences, as the single code point avoids the need for complex positioning logic. Decomposed forms, however, offer greater flexibility by allowing dynamic composition of accented characters from a smaller set of base letters and marks, which is useful for scripts requiring many variations not predefined as precomposed glyphs. This flexibility comes with potential risks, such as incorrect visual attachment of combining marks to unintended base characters if the sequence order is disrupted during text editing or input, potentially leading to garbled output without proper handling.

A practical example is the Swedish surname Åström. In precomposed form, it employs Å (U+00C5, LATIN CAPITAL LETTER A WITH RING ABOVE) and ö (U+00F6, LATIN SMALL LETTER O WITH DIAERESIS), each a unified code point for straightforward processing. The decomposed equivalent would instead use A (U+0041, LATIN CAPITAL LETTER A) with combining ring above (U+030A, COMBINING RING ABOVE) for Å, and o (U+006F, LATIN SMALL LETTER O) with combining diaeresis (U+0308, COMBINING DIAERESIS) for ö, relying on the rendering engine to compose the accents correctly. Unicode normalization can interconvert these forms to maintain equivalence across systems.
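The difference is easy to observe programmatically. Below is a minimal Python sketch using the standard library's unicodedata module (discussed later in this article) to compare the two representations of "Åström" and interconvert them:

```python
import unicodedata

# Precomposed: Å (U+00C5) and ö (U+00F6) are single code points.
precomposed = "\u00c5str\u00f6m"          # "Åström", 6 code points
# Decomposed: base letters followed by combining marks.
decomposed = "A\u030astro\u0308m"         # "Åström", 8 code points

print(len(precomposed), len(decomposed))  # 6 8
print(precomposed == decomposed)          # False: different code point sequences

# Normalization interconverts the two canonically equivalent forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```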

Historical Development

Origins in Early Character Encodings

The concept of precomposed characters originated in the 1960s amid the transition from mechanical typewriters to early digital computing, where limited code spaces necessitated efficient representations of accented letters and diacritics for non-English European languages. Typewriters handled accents through dedicated keys or "dead-key" mechanisms—pressing a modifier key followed by a base letter to overstrike and form a combined glyph—which inspired computing encodings to assign single code points to frequent combinations, avoiding cumbersome multi-step input and storage. This approach addressed the typewriter-era challenge of fitting diverse scripts onto fixed keyboards while enabling compact data handling in resource-constrained systems.

One of the earliest implementations appeared in IBM's EBCDIC, introduced in 1963 for mainframe computers, which used 8-bit code pages to include precomposed diacritics tailored to regional needs. For instance, code page 0037 (U.S./Canada) allocated a single byte to characters like é for loanwords, while variants such as code page 0297 (France) further adapted positions for additional accents, replacing less common symbols to prioritize language-specific combinations within the 256-code-point limit. These extensions reflected the motivation to support business and scientific computing without expanding beyond single-byte efficiency.

Parallel developments occurred with ASCII, standardized by the American Standards Association (later ANSI) in 1963 as a 7-bit code primarily for English, but quickly extended through national variants in the 1960s and 1970s to accommodate diacritics via code-point substitutions. Examples include the German DIN 66003 variant, which reassigned positions originally used for symbols such as brackets to precomposed umlauts like ä, ö, and ü, allowing direct keyboard entry and display in limited environments.

By the 1980s, the ISO 8859 family of standards, developed by ECMA and adopted by ISO, built on these foundations with 8-bit encodings that preserved the lower 128 code points for ASCII compatibility while dedicating the upper half to precomposed forms for Western European languages. ISO 8859-1 (Latin-1), for example, included á at code 0xE1 to represent common accented vowels in French, Spanish, and Portuguese, optimizing for scripts where diacritic combinations were prevalent and multi-byte alternatives were impractical. Precomposition thus emerged as a pragmatic legacy solution, enabling single-byte storage and processing of composite glyphs in an era dominated by 8-bit architectures.

Incorporation into Unicode

The incorporation of precomposed characters into the Unicode standard began with version 1.0, released in 1991, which directly integrated precomposed forms from established ISO standards such as ISO 8859-1 to cover the Latin-1 repertoire. This initial set included characters like U+00E9 (é, LATIN SMALL LETTER E WITH ACUTE) as single code points in the Latin-1 Supplement block (U+0080–U+00FF), ensuring seamless representation of accented Latin letters used in Western European languages. Early encodings like ISO 8859 provided the foundational basis for this integration, allowing Unicode to mirror their 256-character repertoires in the Basic Multilingual Plane for immediate compatibility.

Subsequent versions expanded the scope of precomposed characters to support additional scripts while maintaining alignment with ISO/IEC 10646. The Greek and Coptic block (U+0370–U+03FF), for instance, provides precomposed forms such as U+03AF (ί, GREEK SMALL LETTER IOTA WITH TONOS) for monotonic Greek, and Unicode 1.1 (1993) added the Greek Extended block (U+1F00–U+1FFF) with 233 precomposed polytonic characters, including combinations like U+1FD3 (ΐ, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA). These additions reflected ongoing efforts to broaden global script coverage through precomposed allocations, particularly for diacritic-heavy languages.

The primary rationale for including precomposed characters was to facilitate backward compatibility and round-trip conversion with legacy systems that relied on single-byte encodings. By assigning dedicated code points to common composites—rather than relying solely on combining sequences—Unicode ensured that data from older standards could migrate without loss or alteration, a core design principle emphasized in the standard's architecture. This approach also supported efficient processing in environments with limited combining mark support, preventing rendering discrepancies during data exchange.

A pivotal decision by the Unicode Consortium involved balancing the use of precomposed and combining characters to optimize both compatibility and extensibility, leading to the allocation of code points in the Basic Multilingual Plane and supplementary planes for rarer combinations. This policy, formalized in the normalization stability guidelines, froze the addition of new compositions after Unicode 3.1 to preserve consistency across versions. By Unicode 17.0 (2025), this strategy had resulted in over 1,000 precomposed Latin characters across blocks from Latin-1 Supplement through the Latin Extended series, underscoring the standard's commitment to comprehensive legacy support.

Role in Unicode Standardization

Unicode Equivalence and Normalization

In Unicode, canonical equivalence refers to the principle that certain sequences of characters represent the same abstract character, ensuring semantic and visual identity despite differences in encoding form. For example, the precomposed é (U+00E9 LATIN SMALL LETTER E WITH ACUTE) is equivalent to the decomposed e (U+0065 LATIN SMALL LETTER E) combined with the acute accent (U+0301 COMBINING ACUTE ACCENT), allowing flexible representation while preserving meaning. Normalization is the process of transforming text into a standard form to handle such equivalences consistently: sequences are first decomposed into their components, any combining marks are reordered according to their canonical combining classes, and the result is optionally recomposed where possible. This converts mixed representations—such as a string containing both precomposed and decomposed characters—into either a fully decomposed or fully composed form, eliminating variations that could arise from input methods or legacy encodings. The importance of normalization lies in its role in enabling reliable text processing across applications; for example, it ensures that equivalent forms like é and e + ́ are treated as identical during comparisons, searches, sorting, and collation, preventing mismatches in databases or user interfaces. These concepts are formally defined in Unicode Standard Annex #15, which specifies the algorithms and conformance requirements needed to guarantee interoperability.
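In practice, this means string comparison should be performed on normalized text. A small Python illustration using the standard unicodedata module:

```python
import unicodedata

cafe_precomposed = "caf\u00e9"   # "café" with precomposed é (U+00E9)
cafe_decomposed = "cafe\u0301"   # "café" with e + combining acute accent

# A raw comparison fails even though the strings are canonically equivalent.
print(cafe_precomposed == cafe_decomposed)   # False

# Normalizing both operands to the same form makes the comparison reliable.
def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

print(nfc(cafe_precomposed) == nfc(cafe_decomposed))  # True
```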

Normalization Forms

Unicode normalization forms provide standardized ways to transform text into consistent representations, particularly addressing precomposed and decomposed characters. There are four primary forms defined by the Unicode Standard: Normalization Form C (NFC), Normalization Form D (NFD), Normalization Form KC (NFKC, compatibility composition), and Normalization Form KD (NFKD, compatibility decomposition). NFC and NFD rely on canonical equivalence, which preserves the visual appearance and behavior of characters, while NFKC and NFKD use compatibility equivalence, which may discard formatting distinctions carried over from legacy encodings.

NFC applies canonical decomposition followed by canonical composition, resulting in a fully composed form where precomposed characters are used whenever possible. For example, the precomposed character é (U+00E9) remains as is, while a decomposed sequence like e (U+0065) followed by the combining acute accent (U+0301) is recomposed into é. In contrast, NFD performs only canonical decomposition, breaking precomposed characters into their base letter and combining diacritics without recomposition; thus, é becomes e + ◌́ (U+0065 U+0301).

The algorithms for these forms begin with decomposition, which recursively applies mappings from the Decomposition_Mapping property. Canonical decomposition targets characters with strict equivalences, such as accented letters, by replacing them with a sequence of a base character and non-spacing marks sorted by canonical combining class. For NFC and NFKC, this is followed by canonical composition, which pairs adjacent base characters with compatible combining marks to form precomposed characters, excluding cases listed in the composition exclusion table to avoid ambiguous pairings. NFKC and NFKD extend this by using compatibility decomposition, which includes additional mappings for characters like ligatures or variant forms that are visually similar but not canonically equivalent. For instance, the ligature fi (U+FB01) remains intact in NFC and NFD but decomposes to f + i (U+0066 U+0069) in NFKD and NFKC.

A practical example illustrates the difference between canonical and compatibility processing. The German word "Straße," containing the sharp s ß (U+00DF), is unchanged under all four forms, because ß has neither a canonical nor a compatibility decomposition; it is replaced by "ss" only under case folding operations such as NFKC_Casefold, not by NFKC itself. By contrast, superscript two ² (U+00B2) survives NFC and NFD but decomposes to the plain digit 2 (U+0032) under NFKC and NFKD. This highlights how compatibility forms handle broader equivalences, such as those inherited from historical encodings, beyond the stricter canonical rules that focus solely on diacritic attachments and order-independent sequences.
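The behavior of all four forms can be checked directly in Python; the following sketch prints the code points each form produces for a precomposed letter, a combining sequence, a ligature, and a superscript digit:

```python
import unicodedata

samples = {
    "precomposed é": "\u00e9",
    "e + combining acute": "e\u0301",
    "fi ligature": "\ufb01",
    "superscript two": "\u00b2",
}

for label, s in samples.items():
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        result = unicodedata.normalize(form, s)
        codes = " ".join(f"U+{ord(c):04X}" for c in result)
        print(f"{label:22} {form}: {codes}")
```

Running this shows, for example, that the fi ligature survives NFC and NFD as U+FB01 but becomes U+0066 U+0069 under NFKC and NFKD, while the superscript two becomes U+0032 only under the compatibility forms.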

Applications in Writing Systems

Latin-Based Scripts

Precomposed characters play a dominant role in Latin-based scripts, especially for Western languages that incorporate diacritical marks and ligatures as integral parts of their orthographies. In French, for instance, the acute-accented e, é (U+00E9), is a standard precomposed form used to distinguish pronunciation and meaning, such as in "café" versus "cafe". Similarly, German employs the precomposed sharp s, ß (U+00DF), which represents a distinct sound and is encoded as a single character rather than a combination of base letters, ensuring compatibility with legacy systems like ISO 8859-1. These examples highlight how precomposed encoding preserves established conventions in orthography and text processing for languages with historical ties to print traditions.

Unicode allocates specific blocks to accommodate the extensive inventory of precomposed Latin characters, facilitating support across diverse orthographies. The Latin Extended-A block (U+0100–U+017F) includes over 100 such characters, like the Polish ł (U+0142), which combines l with a stroke for phonetic accuracy. This block, along with others such as Latin Extended Additional (U+1E00–U+1EFF), primarily consists of precomposed combinations of Latin letters with diacritics, enabling straightforward rendering without reliance on combining sequences. Beyond Western Europe, languages like Vietnamese leverage these precomposed forms for tonal marks; for example, ả (U+1EA3) integrates the base letter a with a hook above, streamlining input and display in digital environments.

In total, hundreds of precomposed Latin characters exist in Unicode, supporting the tens or even hundreds of languages that employ Latin-based scripts and minimizing the use of combining marks in legacy text corpora. This abundance—spanning Latin-1 Supplement and the Latin Extended series—allows for efficient handling of accented forms in over 100 languages, from French to Vietnamese, while promoting round-trip compatibility with older encodings. However, ambiguities can arise in proper names, such as "naïve," where the precomposed ï (U+00EF) may coexist with decomposed equivalents (i + combining diaeresis), resolvable via normalization to ensure consistent searching and rendering.

Chinese and Han Unification

Han unification is the process by which the Unicode Consortium, in collaboration with the Ideographic Research Group (IRG), merges multiple national and regional standards for Han ideographs used in Chinese, Japanese, and Korean writing systems into a single, unified repertoire of characters. This effort ensures that visually similar or identical glyphs across these languages are assigned to the same code points, abstracting away minor orthographic variations such as those between simplified and traditional forms, or differences in stroke style and typography. The unification criteria, developed since the early 1990s, prioritize compatibility with existing standards like GB 13000 for Chinese and JIS X 0208 for Japanese, while minimizing the total number of code points required.

In Unicode, Han ideographs are encoded exclusively as precomposed characters, meaning each ideograph is represented by a single, atomic code point rather than being decomposed into components such as radicals or strokes. The primary block for these characters is the CJK Unified Ideographs range from U+4E00 to U+9FFF, which contains 20,992 code points, with additional extensions bringing the total to over 100,000 Han characters across blocks like Extension A (U+3400–U+4DBF) and Extension B (U+20000–U+2A6DF) as of Unicode 17.0 (September 2025), which added 4,298 new ideographs in Extension J. For example, the Chinese character 汉 (meaning "Han" as in Han Chinese) is encoded at U+6C49 as a single precomposed unit, preserving its integrity in digital text without breakdown into its constituent parts. This approach contrasts with the analytical decompositions used in lexicographical tools or educational contexts, where radicals and strokes may be separated for study, but such breakdowns are not part of the core text encoding standard.

The implications of this precomposed encoding and unification are significant for encoding efficiency and cross-linguistic compatibility. By treating ideographs as indivisible units, Unicode avoids the proliferation of code points that would result from encoding every variant separately, allowing a compact representation that covers the needs of Chinese, Japanese, and Korean users with shared glyphs. Variants that are not unifiable—due to distinct semantic or orthographic roles—are handled through separate compatibility ideographs or extensions, ensuring that text processing remains straightforward while supporting the diverse scripts' requirements. This strategy has enabled the inclusion of over 100,000 Han code points in total, balancing comprehensiveness with the limitations of 16-bit and 32-bit encoding schemes.
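A short standard-library Python check makes this atomicity concrete: the ideograph occupies exactly one code point and is untouched by normalization, since Han characters carry no canonical decompositions:

```python
import unicodedata

han = "\u6c49"  # 汉, encoded as a single precomposed ideograph
print(len(han))                                  # 1 code point
print(unicodedata.name(han))                     # CJK UNIFIED IDEOGRAPH-6C49
print(unicodedata.normalize("NFD", han) == han)  # True: no decomposition exists
print(len(han.encode("utf-8")))                  # 3 bytes in UTF-8
```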

Other Scripts

In scripts beyond Latin and Chinese, precomposed characters play a role in encoding diacritics and modifications for compatibility with legacy systems, particularly in alphabetic systems like Greek and Cyrillic, where they combine base letters with accents or breathings into single code points. For example, in modern Greek, the character ΰ (U+03B0, GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS) is a precomposed form equivalent to the base upsilon (U+03C5) followed by combining diaeresis (U+0308) and acute accent (U+0301), facilitating simpler input and display in systems supporting monotonic orthography. Polytonic Greek, used for classical texts, relies heavily on precomposed characters in the Greek Extended block (U+1F00–U+1FFF); a representative example is ᾖ (U+1F26, GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI), which integrates the base eta (U+03B7) with a smooth breathing (psili, U+0313) and circumflex (perispomeni, U+0342) as a single unit to represent ancient pronunciations accurately.

Cyrillic scripts similarly employ precomposed characters for letters with diacritics, especially in Russian and extended uses for minority languages. The character ё (U+0451, CYRILLIC SMALL LETTER IO) is a precomposed form of е (U+0435) with a diaeresis (U+0308), essential for Russian where it denotes a distinct sound and is treated as a separate letter in dictionaries and sorting. Further precomposed letters appear in the upper ranges of the Cyrillic block (U+0400–U+04FF), such as Ӑ (U+04D0, CYRILLIC CAPITAL LETTER A WITH BREVE) for Chuvash and Ӧ (U+04E6, CYRILLIC CAPITAL LETTER O WITH DIAERESIS) for Komi, combining base letters with marks to support phonetic distinctions without multiple combining sequences.

In Semitic scripts, precomposed characters are more limited, with a preference for combining marks to allow flexible vowel and consonant modifications. For Hebrew, niqqud (vowel points) are predominantly encoded as combining characters (U+05B0–U+05C7), such as אָ formed by alef (U+05D0) plus qamats (U+05B8); however, a small set of precomposed forms exists in the Alphabetic Presentation Forms block (U+FB1D–U+FB4F) for compatibility, like אַ (U+FB2E, HEBREW LETTER ALEF WITH PATAH), though their use is discouraged in favor of canonical combining sequences for modern processing. Arabic employs even fewer precomposed diacritics, relying on contextual shaping for letters; the tatweel (U+0640, ARABIC TATWEEL) functions as a spacing extender to justify lines or form ligatures, while core letters with marks such as hamza are often precomposed (e.g., أ U+0623, ARABIC LETTER ALEF WITH HAMZA ABOVE), with combining alternatives available for complex vocalization. Overall, Unicode incorporates hundreds of precomposed characters for Greek and Cyrillic combined, primarily for compatibility with existing standards and to minimize dependence on combining marks in these scripts, ensuring robust text handling across diverse applications.
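The multi-mark stacks described above can be verified by decomposing the precomposed forms in Python:

```python
import unicodedata

# ΰ (U+03B0) decomposes into a base letter plus two combining marks.
for ch in unicodedata.normalize("NFD", "\u03b0"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+03C5 GREEK SMALL LETTER UPSILON
# U+0308 COMBINING DIAERESIS
# U+0301 COMBINING ACUTE ACCENT

# ё (U+0451) decomposes into е plus a combining diaeresis.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\u0451")])
# ['U+0435', 'U+0308']
```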

Advantages and Challenges

Benefits of Precomposed Characters

Precomposed characters provide essential compatibility with legacy systems that do not fully support combining marks, ensuring correct rendering in environments with limited Unicode capabilities. For example, in older DOS fonts or similar legacy setups, the precomposed é (U+00E9) displays as a single accented character, while its decomposed equivalent—e (U+0065) followed by combining acute accent (U+0301)—may render as an uncombined 'e' and a floating mark, or fail entirely. This approach maintains visual fidelity when migrating text from pre-Unicode encodings, such as the ISO 8859 series, which predominantly use precomposed forms that align directly with Unicode's Normalization Form C (NFC).

From a processing perspective, precomposed characters streamline comparison and searching by representing each accented character as a single, fixed code point, in contrast to the variable-length sequences of decomposed forms. This simplifies operations like string comparison, indexing, and sorting, as NFC-normalized strings require only a quick verification pass rather than iterative decomposition or reordering, reducing computational overhead in text handling. For instance, searching for "café" yields consistent results across representations, avoiding mismatches due to form differences. Furthermore, precomposed forms contribute to more efficient storage, particularly in texts rich with diacritics, by using one code point per character instead of multiple for base and marks; in UTF-8 encoding, é occupies two bytes, versus three for its decomposed counterpart. In migration workflows, this aids seamless data import from legacy sources to Unicode systems without losses, preserving integrity during transitions.
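The storage difference is easy to confirm; a brief Python check reproduces the UTF-8 byte counts cited above:

```python
precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + combining acute accent

print(len(precomposed.encode("utf-8")))  # 2 bytes
print(len(decomposed.encode("utf-8")))   # 3 bytes
```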

Limitations and Rendering Issues

Precomposed characters in Unicode do not encompass all possible combinations of base letters and diacritics, leaving gaps in coverage for rare or complex stacks. For instance, while common forms like the Proto-Indo-European ḱ (U+1E31 LATIN SMALL LETTER K WITH ACUTE) are provided as precomposed code points, unusual sequences such as n̈̃—n (U+006E) with combining diaeresis (U+0308) and combining tilde (U+0303)—lack dedicated precomposed glyphs and must rely on combining marks. Similarly, in Greek, complex combinations such as alpha with both a macron and a rough breathing (dasia) are not precomposed and must be represented with combining sequences.

A primary reason for these coverage gaps is the combinatorial explosion that would result from precomposing every conceivable variant, which would rapidly exhaust available code points. With over 100 combining diacritics and numerous base characters in the Latin script alone, full precomposition could demand tens of thousands of additional code points, far exceeding practical needs; Unicode thus prioritizes widely used forms while reserving space for other scripts and symbols within its limit of 1,114,112 possible code points, of which 159,801 are assigned as of Unicode 17.0 (2025).

Rendering issues further compound these limitations, particularly when fonts lack a precomposed glyph, forcing rendering engines to fall back to decomposed sequences whose marks may misalign or overlap improperly. For example, the Swedish name "Åström," if stored in decomposed form as A + combining ring above (U+00C5 decomposed to U+0041 U+030A) followed by o + combining diaeresis (U+00F6 decomposed to U+006F U+0308), can display with the ring floating awkwardly above the A and the diaeresis shifted on the o in fonts without optimized combining mark positioning. Such problems arise because many legacy fonts and rendering engines prioritize precomposed forms for simplicity, handling arbitrary combining mark stacks poorly through inadequate anchoring or default gap calculations (e.g., a fixed 1/8 em of spacing). Unicode normalization can partially address these issues by recomposing sequences where precomposed forms exist, but it cannot resolve gaps for combinations that were never encoded.
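The coverage gap is visible through normalization: NFC recomposes a sequence only when a precomposed code point exists. A brief Python demonstration:

```python
import unicodedata

# k + combining acute has a precomposed form, so NFC collapses it.
k_acute = unicodedata.normalize("NFC", "k\u0301")
print(f"U+{ord(k_acute):04X}")  # U+1E31 (ḱ)

# n + diaeresis + tilde has no precomposed form; NFC leaves it decomposed.
n_stack = unicodedata.normalize("NFC", "n\u0308\u0303")
print([f"U+{ord(c):04X}" for c in n_stack])  # ['U+006E', 'U+0308', 'U+0303']
```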

Implementation Considerations

In Fonts and Software

In font formats such as TrueType and OpenType, precomposed characters are supported through direct mapping in the 'cmap' (character to glyph index mapping) table, which associates code points with specific glyphs. For example, the code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is typically mapped to the glyph representing 'é'. If a font does not include a precomposed glyph for a particular character, the 'ccmp' (glyph composition/decomposition) feature allows rendering engines to compose it on the fly by substituting the base glyph and one or more combining marks.

Software applications process precomposed characters by applying Unicode normalization forms, such as NFC, to standardize text input and ensure compatibility with font mappings. Text editors on Windows, such as Notepad, typically default to NFC during input and editing, which favors precomposed forms over decomposed sequences for efficiency in legacy-compatible workflows. In web browsers, CSS font fallback rules handle missing precomposed glyphs by automatically selecting alternative fonts from the specified font-family that supply the required glyphs, preventing rendering failures. Modern font families like Google's Noto Sans provide extensive support for precomposed characters in widely used scripts, including broad coverage for Latin, Greek, and Cyrillic, enabling seamless display without on-the-fly composition. In contrast, for less common scripts, font implementations often depend on decomposition of characters into base forms plus combining marks, which are then synthesized during rendering to achieve the correct visual representation.
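A cmap lookup of this kind can be sketched with the third-party fontTools library; here "MyFont.ttf" is a placeholder path rather than a real file, and the snippet simply asks whether the font maps U+00E9 to a dedicated glyph:

```python
from fontTools.ttLib import TTFont  # pip install fonttools

font = TTFont("MyFont.ttf")          # placeholder font file
cmap = font["cmap"].getBestCmap()    # best available Unicode cmap subtable

glyph_name = cmap.get(0x00E9)        # precomposed é
if glyph_name:
    print(f"U+00E9 maps to glyph '{glyph_name}'")
else:
    print("No precomposed glyph; the renderer must compose e + U+0301")
```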

Handling in Programming and Text Processing

In programming languages and text processing systems, precomposed characters are managed primarily through Unicode normalization, which converts text between decomposed (combining mark) and composed (precomposed) forms to ensure canonical equivalence. The Unicode Standard defines four normalization forms—NFD, NFC, NFKD, and NFKC—with NFC being the most common for producing precomposed characters, applying canonical decomposition followed by canonical composition. This allows developers to standardize representations, preventing issues like duplicate entries or failed comparisons due to variant encodings of the same grapheme cluster, such as "é" (U+00E9 LATIN SMALL LETTER E WITH ACUTE) versus "e" (U+0065 LATIN SMALL LETTER E) + "́" (U+0301 COMBINING ACUTE ACCENT).

Programming languages provide built-in APIs for normalization to handle precomposed characters. In Java, the java.text.Normalizer class implements these forms, enabling conversion to NFC for composition: for example, Normalizer.normalize("e\u0301", Normalizer.Form.NFC) yields the precomposed "é". Similarly, Python's unicodedata module offers unicodedata.normalize('NFC', 'e\u0301'), which recomposes the string into the precomposed form, facilitating equivalence checks like normalized_a == normalized_b after applying the same form to both inputs. These functions are essential for tasks such as deduplication and validation, where precomposed forms reduce storage variability and improve reliability in string comparisons.

In text processing, collation algorithms treat precomposed and decomposed equivalents as identical to support linguistically accurate sorting and searching. The International Components for Unicode (ICU) library implements the Unicode Collation Algorithm (UCA), automatically normalizing inputs to the "fast C or D" (FCD) form during comparison, ensuring that "café" (with precomposed "é") sorts equivalently to "café" (decomposed). ICU's collator, available in languages like Java and C++, ignores decomposition differences and tailors rules from the Common Locale Data Repository (CLDR) for locale-specific behavior. For regular expressions, matching requires prior normalization to NFC or NFD, as many engines (e.g., ICU's regex or Java's java.util.regex) do not inherently equate variants; thus, patterns like /\p{L}+/ match letters but may miss equivalents without preprocessing.

Precomposed characters promote consistency in structured data formats like XML and JSON, where normalization prevents parsing discrepancies from variant representations. In XML processing, the XML Normalization specification recommends NFC to canonicalize attribute values and element content, ensuring that equivalent strings yield identical serialized output. JSON, as a Unicode-based format, benefits similarly; libraries like Python's json module or Java's javax.json preserve equivalence after normalization and avoid key collisions in objects. In databases, MySQL's utf8mb4_unicode_ci collation leverages the UCA to normalize comparisons, treating precomposed and decomposed forms as equal during queries (e.g., SELECT * FROM table WHERE name = 'café' matches both variants), with full Unicode coverage including supplementary characters.
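Because dictionary and JSON object keys compare by exact code point sequence, normalizing on insertion is a standard defense against duplicate keys; a small stdlib-only Python sketch:

```python
import json
import unicodedata

def nfc(s: str) -> str:
    """Normalize to NFC so equivalent spellings share one representation."""
    return unicodedata.normalize("NFC", s)

# Canonically equivalent spellings produce two distinct dictionary keys.
raw = {"caf\u00e9": 1, "cafe\u0301": 2}
print(len(raw))  # 2

# Normalizing keys on insertion collapses them into one entry.
clean = {}
for key, value in raw.items():
    clean[nfc(key)] = value
print(len(clean))                             # 1
print(json.dumps(clean, ensure_ascii=False))  # {"café": 2}
```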
