
Latin script in Unicode

The Latin script in Unicode refers to the standardized encoding of characters used in the Latin alphabet and its numerous extensions, supporting the writing systems of hundreds of languages worldwide, from major European tongues to indigenous and minority languages in Africa, the Americas, and Asia. As of Unicode version 17.0, released on September 9, 2025, the Latin script encompasses 1,492 assigned characters, primarily base letters, letters with diacritics, and phonetic symbols, distributed across dedicated blocks such as Basic Latin (U+0000–U+007F), Latin-1 Supplement (U+0080–U+00FF), Latin Extended-A (U+0100–U+017F), Latin Extended-B (U+0180–U+024F), Latin Extended Additional (U+1E00–U+1EFF), Latin Extended-C (U+2C60–U+2C7F), Latin Extended-D (U+A720–U+A7FF), Latin Extended-E (U+AB30–U+AB6F), Latin Extended-F (U+10780–U+107BF), and Latin Extended-G (U+1DF00–U+1DFFF). The encoding of the Latin script originated in Unicode 1.0 (1991), building on legacy standards like ASCII and ISO Latin-1 to ensure compatibility while allowing expansion for global linguistic diversity. Key principles include the provision of both precomposed characters (e.g., á at U+00E1) for efficient legacy support and combining diacritical marks (e.g., U+0061 a followed by U+0301 COMBINING ACUTE ACCENT) for flexible composition, enabling the representation of accented letters in languages like French, German, and Vietnamese. This dual approach accommodates historical orthographies, phonetic notations such as the International Phonetic Alphabet (IPA), and modern adaptations, with uppercase/lowercase pairings and left-to-right rendering as standard. Notable extensions in later Unicode versions address specific needs, such as the Latin Extended-D block for medieval scribal abbreviations and additional orthographic letters, and Latin Extended-G for additional phonetic transcription letters. 
The script's versatility is evident in its use for transliterations (e.g., Pinyin for Chinese) and in mixed-script environments, like Japanese loanwords, underscoring Unicode's goal of universal text interchange without favoring any single language. Ongoing proposals continue to refine glyphs and add characters, such as subscript forms and variant 'r' shapes, to better support diverse typographic traditions.

Overview

Definition and Scope

The Latin script in Unicode encompasses a family of alphabets derived from the ancient Roman script, with characters assigned the Unicode script property value "Latn" to indicate their primary association with this writing system. This classification supports the encoding of letters, diacritics, and modifiers used in various Latin-based orthographies worldwide. As of Unicode 17.0, released in September 2025, exactly 1,492 characters are classified with the Latin script property, spanning basic alphabetic forms to specialized extensions. This total reflects ongoing expansions, including 5 new characters added to the Latin Extended-D block in version 17.0. The scope of the Latin script covers encodings for modern European languages (such as English, French, and German), African languages (like Swahili and Yoruba), Vietnamese with its diacritic-rich alphabet, indigenous American and Australian scripts adapted to Latin bases, phonetic notations including the International Phonetic Alphabet, and historical variants for medieval or classical texts, while deliberately excluding characters from independent scripts like Cyrillic (assigned "Cyrl") or Greek ("Grek") despite visual similarities. These inclusions ensure comprehensive support for linguistic diversity without overlapping script boundaries. The majority of Latin-script characters are concentrated in dedicated Unicode blocks, such as:
Block Name                        Code Point Range
Basic Latin                       U+0000–U+007F
Latin-1 Supplement                U+0080–U+00FF
Latin Extended-A                  U+0100–U+017F
Latin Extended-B                  U+0180–U+024F
IPA Extensions                    U+0250–U+02AF
Spacing Modifier Letters          U+02B0–U+02FF
Phonetic Extensions               U+1D00–U+1D7F
Phonetic Extensions Supplement    U+1D80–U+1DBF
Latin Extended Additional         U+1E00–U+1EFF
Superscripts and Subscripts       U+2070–U+209F
Latin Extended-C                  U+2C60–U+2C7F
Latin Extended-D                  U+A720–U+A7FF
Latin Extended-E                  U+AB30–U+AB6F
Latin Extended-F                  U+10780–U+107BF
Latin Extended-G                  U+1DF00–U+1DFFF
Some characters with the "Latn" property also appear in auxiliary blocks such as Letterlike Symbols or Alphabetic Presentation Forms; block-by-block details are covered in the later sections on Unicode blocks. Conversely, certain symbols built on Latin letterforms are not classified as Latin. In the Currency Symbols block (U+20A0–U+20CF), for example, the bitcoin sign ₿ (U+20BF) has the script property "Common" because it is not tied to the orthography of any one script, which distinguishes it from core Latin alphabetic encoding.
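As an illustration of how the block ranges listed in this section can be used in practice, the following minimal sketch maps a character to one of those blocks by binary search. The names `LATIN_BLOCKS` and `latin_block_of` are hypothetical, and Latin Extended-G is taken at its full block range U+1DF00–U+1DFFF. Block membership is not the same as the script property: a character can sit in one of these blocks and still be "Common", so this is only a range lookup.

```python
from bisect import bisect_right

# (start, end, name) for the blocks listed in this section, sorted by start.
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
    (0x1D00, 0x1D7F, "Phonetic Extensions"),
    (0x1D80, 0x1DBF, "Phonetic Extensions Supplement"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2070, 0x209F, "Superscripts and Subscripts"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
    (0xAB30, 0xAB6F, "Latin Extended-E"),
    (0x10780, 0x107BF, "Latin Extended-F"),
    (0x1DF00, 0x1DFFF, "Latin Extended-G"),
]
_STARTS = [start for start, _, _ in LATIN_BLOCKS]


def latin_block_of(char):
    """Return the name of the listed block containing char, or None."""
    cp = ord(char)
    # Find the last block whose start is <= cp, then check its end.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        _, end, name = LATIN_BLOCKS[i]
        if cp <= end:
            return name
    return None


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u0142", "\u1ec7", "\u20bf"]:
        print(f"U+{ord(ch):04X}", latin_block_of(ch))
```

The bitcoin sign (U+20BF) returns None here because the Currency Symbols block is not in the list.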

Importance and Usage

The Latin script is the most widely used writing system within the Unicode Standard, supporting hundreds of languages across the globe, including prominent European languages such as English, Spanish, French, and German, as well as non-European ones like Swahili and Turkish. This extensive coverage stems from its adaptability to diverse linguistic needs through basic letters, diacritics, and extensions, making it indispensable for multilingual digital environments. Over 1,500 languages worldwide employ the Latin script as their primary writing system, underscoring its dominance in global text representation. The encoding of the Latin script plays a pivotal role in web content creation, software localization, and compliance with international standards such as ISO/IEC 10646, which defines the Universal Coded Character Set for universal text interchange. Languages using the Latin script account for approximately 70% of internet content, with English alone comprising over 50% of websites, followed by Spanish, German, French, and others, facilitating seamless global communication and application development. In software localization, Unicode's Latin support enables developers to handle text in a single encoding scheme, reducing conversion errors and supporting internationalization for billions of users. The historical evolution of Latin encoding in computing—from the 7-bit ASCII standard, which covered only unaccented English characters, to the 8-bit ISO 8859-1 (Latin-1) that introduced diacritics for Western European languages—directly shaped Unicode's approach, ensuring compatibility while expanding to over 1,000 Latin characters. Unicode resolves legacy challenges in accented languages by offering both precomposed forms (e.g., é as U+00E9) and combining diacritics (e.g., e + acute accent), with normalization algorithms that reconcile variations from older encodings without data loss. 
This framework prevents compatibility issues, such as mismatched representations in legacy systems, and promotes consistent rendering across platforms. The Latin alphabet also underpins specialized repertoires: the Mathematical Alphanumeric Symbols block provides bold, italic, and other styled variants of Latin letters for mathematical notation, and the regional indicator symbols, which are named after the letters A to Z, are paired to form flag emoji. These roles highlight its versatility beyond everyday text, enhancing technical and expressive digital content.

Historical Development

Initial Encoding

The initial encoding of the Latin script in Unicode dates to version 1.0, released in October 1991, which established the foundational blocks for Latin characters to ensure broad compatibility with existing computing standards. This version included the Basic Latin block (U+0000–U+007F), comprising 128 code points that directly mirror the American Standard Code for Information Interchange (ASCII): the unaccented uppercase and lowercase English letters (A–Z and a–z), digits (0–9), basic punctuation, and control codes. Because the design preserves the ASCII values in the first 128 code points, data from legacy 7-bit systems maps onto Unicode without reassignment.

The same version included the Latin-1 Supplement block (U+0080–U+00FF), another 128 code points drawn from the ISO/IEC 8859-1 standard, also known as Latin-1. This block supplies roughly 60 additional Latin letters, most of them accented (such as á, ç, and ñ), along with currency symbols, mathematical operators, and additional punctuation to support Western European languages like French, German, Spanish, and Italian. Together the two blocks cover 256 positions that match the 8-bit ISO 8859-1 encoding widely used in computing at the time. Unicode 1.1, released in June 1993, aligned the repertoire with ISO/IEC 10646.

The rationale for this initial approach prioritized compatibility with established 8-bit encodings to facilitate widespread adoption and minimize disruption in software and data migration. By starting with ASCII and extending to ISO 8859-1, Unicode addressed the fragmentation of character sets in global computing, where different regions used incompatible codes for Latin-based scripts, thereby promoting efficient text interchange without requiring extensive re-encoding of existing files. 
This compatibility was a core principle from the project's inception, as early designers aimed to create a universal standard that could "begin at 0 and add the next character" sequentially, avoiding the complexity of multi-byte legacy encodings.

Key milestones trace back to collaborative efforts in the 1980s by international standards bodies. In 1984, ISO/TC97/SC2 proposed a two-byte international character set as a precursor to broader universal encoding initiatives. By late 1987, discussions within ECMA and ISO working groups, alongside ANSI X3L2, focused on unifying Western character sets, with the term "Unicode" coined by Joe Becker at Xerox to describe a unique, universal, and uniform encoding scheme. These efforts culminated in an initial repertoire whose first 256 code points match ISO 8859-1, formalized through the Unicode Consortium's incorporation in January 1991.

The early versions were not limited to Western European needs: the range now called Latin Extended-A (U+0100–U+017F) already supplied letters with diacritics for languages such as Polish, Czech, and Hungarian. Coverage of many phonetic, historical, and minority-language letters, however, was deferred to later versions.

Subsequent Expansions

Expansion beyond Latin-1 began early. The Latin Extended-A block (U+0100–U+017F), already present in Unicode 1.x, supports additional Latin letters for Central and Eastern European languages, such as the Polish ł (U+0142) and Czech č (U+010D). It addressed the need for characters beyond the Latin-1 Supplement, enabling better representation of accented letters in Polish, Czech, and Hungarian orthographies.

The Latin Extended-B block (U+0180–U+024F), whose earliest characters also date from Unicode 1.x and which was extended in Unicode 3.0 and later versions, incorporates characters for African languages, including click letters like ǂ (U+01C2), and the Vietnamese horn letters ơ (U+01A1) and ư (U+01B0). These inclusions were driven by the need to accommodate orthographic reforms and linguistic diversity in non-European contexts, drawing on standards such as ISO 6438 for African languages.

Subsequent versions built on this foundation with targeted additions for specialized uses. Unicode 5.0 (July 2006) introduced the Latin Extended-C block (U+2C60–U+2C7F) for orthographic and phonetic uses. Latin Extended-D (U+A720–U+A7FF) received its first characters in Unicode 5.0 and was greatly enlarged in Unicode 5.1 (April 2008) with letters for medieval paleography, such as ꝑ (U+A751, LATIN SMALL LETTER P WITH STROKE THROUGH DESCENDER). The IPA Extensions block (U+0250–U+02AF), which includes ɳ (U+0273), and the Spacing Modifier Letters block (U+02B0–U+02FF), which includes ʼ (U+02BC), support phonetic transcription; most of their characters date from the earliest versions, with a few later additions proposed by linguists for broader IPA coverage. 
From Unicode 7.0 (June 2014) through 15.0 (September 2022), further expansions included Latin Extended-E (U+AB30–U+AB6F, added in 7.0), Latin Extended-F (U+10780–U+107BF, added in 14.0), and Latin Extended-G (U+1DF00–U+1DFFF, added in 14.0), incorporating characters for indigenous languages like those in the Anthropos alphabet (e.g., ꭎ U+AB4E) and historical scripts such as Gothic transliterations. These were motivated by submissions from linguists and academic groups to preserve minority and historical orthographies, often addressing deficiencies in legacy encoding systems. Unicode 16.0 (September 2024) and 17.0 (September 2025) continued this trend by adding characters such as additional letters for Gaulish in Latin Extended-D and extensions to Latin Extended-G for phonetic notations, enhancing support for indigenous and historical languages including Sami and Inuktitut orthographies. These expansions stemmed from proposals by linguistic experts and indigenous communities to standardize digital representation of their writing systems. The motivations for these expansions have consistently involved contributions from linguists, academic institutions, and governments, particularly for African orthographies (e.g., proposals for Nguni and Bamileke scripts) and medieval studies, to bridge gaps in digital fonts and promote cultural preservation. As of November 2025, previews for Unicode 18.0 indicate ongoing additions for African and Oceanian Latin variants, such as U+A7E2 LATIN CAPITAL LETTER R WITH LONG LEG for Egyptian Arabic and southern African languages, potentially increasing the total Latin script characters beyond 1,500.

Encoding Principles

Composition and Decomposition

The Latin script in Unicode supports both precomposed characters, which are single code points representing a base letter combined with one or more diacritics, and sequences formed by combining a base character with separate combining diacritics. Precomposed forms are provided for commonly used accented letters to ensure efficient encoding and compatibility with legacy systems; for example, U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) is a single code point canonically equivalent to the base letter e followed by an acute accent. Combining characters allow less common or context-specific forms to be built dynamically by attaching diacritics to base letters, such as U+0065 LATIN SMALL LETTER E (e) followed by U+0301 COMBINING ACUTE ACCENT (◌́), which renders as é. The combining diacritics used with Latin letters are primarily drawn from the Combining Diacritical Marks block (U+0300–U+036F), which is shared across multiple scripts to promote reusability and limit the total number of code points required.

Unicode Normalization Forms standardize the representation of these equivalent sequences to facilitate consistent processing and comparison of text. Normalization Form D (NFD) fully decomposes precomposed characters into base letters and combining marks, placing the marks in canonical order, without recomposition. Normalization Form C (NFC) applies the same canonical decomposition and then recomposes, preferring precomposed forms where they exist. For instance, the word "résumé" in NFC uses the precomposed code point U+00E9 twice (U+0072 U+00E9 U+0073 U+0075 U+006D U+00E9), while its NFD equivalent is U+0072 U+0065 U+0301 U+0073 U+0075 U+006D U+0065 U+0301. 
This dual approach—precomposed for frequent usage and combining for flexibility—enables compact encoding of prevalent Latin characters while accommodating rare combinations without an excessive proliferation of dedicated code points; the Unicode Standard encodes over 1,000 such precomposed Latin letters across its Basic Latin, Latin-1 Supplement, and extended Latin blocks.
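The relationship between the precomposed and decomposed spellings can be checked with a short script. This is a minimal sketch using Python's standard `unicodedata` module; the helper `code_points` is a hypothetical name introduced only for display.

```python
import unicodedata

# "résumé" written with the precomposed letter é (U+00E9).
word = "r\u00e9sum\u00e9"

# NFD splits each é into e (U+0065) + combining acute (U+0301);
# NFC recomposes the sequence back into U+00E9.
nfd = unicodedata.normalize("NFD", word)
nfc = unicodedata.normalize("NFC", nfd)


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    print("NFD:", code_points(nfd))
    print("NFC:", code_points(nfc))
    # The two forms differ as code point sequences but are canonically
    # equivalent, so comparisons should normalize both sides first.
    print(nfd == word, nfc == word)
```

Because the two spellings compare unequal as raw strings, software that matches or deduplicates Latin text typically normalizes both inputs to the same form before comparing.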

Script Classification and Properties

In the Unicode Standard, most characters of the Latin script carry the script property value "Latin", whose four-letter short alias is "Latn"; this applies to basic letters like U+0041 LATIN CAPITAL LETTER A and to extended forms in the various Latin blocks. The assignment is documented in the Unicode Character Database (UCD) file Scripts.txt, which catalogs script memberships to aid in text processing, such as script-specific rendering or regular expression matching. Punctuation and symbols shared across scripts, such as U+0020 SPACE, receive the "Common" value (short alias "Zyyy") instead, reflecting their script-agnostic usage.

Directionality properties for Latin characters are primarily left-to-right, with letters classified under the bidirectional class "L" (Left-to-Right) in the UCD's DerivedBidiClass.txt and UnicodeData.txt files. This ensures that Latin text flows horizontally from left to right by default, as specified in the Unicode Bidirectional Algorithm. Characters without a strong direction, such as spaces and most punctuation, have neutral classes like "WS" (White Space) and "ON" (Other Neutral) and take their direction from the surrounding text. U+00AD SOFT HYPHEN, for instance, has the class "BN" (Boundary Neutral), which the algorithm effectively ignores when resolving embedding levels in mixed left-to-right and right-to-left text.

Additional character properties include the general category, which partitions Latin characters into classes such as "Lu" (Uppercase Letter) for forms like U+0041 A and "Ll" (Lowercase Letter) for U+0061 a, or "Mn" (Nonspacing Mark) for diacritics like U+0301 COMBINING ACUTE ACCENT that attach to base letters without advancing horizontally. Decomposition mappings, stored in field 5 of UnicodeData.txt (counting the code point as field 0), break down precomposed characters (e.g., U+00E9 LATIN SMALL LETTER E WITH ACUTE into U+0065 e + U+0301 ◌́) to support canonical equivalence in text normalization processes. 
These properties are defined in the UCD, enabling consistent handling across implementations. The UCD files, including Scripts.txt for script assignments and UnicodeData.txt for general categories, bidirectional classes, and decompositions, serve as the authoritative source for querying this metadata. They inform critical applications, such as font rendering engines that apply the bidirectional algorithm to resolve display order in multilingual text, and search systems that rely on properties for accurate matching. A distinctive feature of the Latin script's encoding is its high density of paired uppercase and lowercase forms, which underpin the case folding operations defined in CaseFolding.txt and allow locale-independent case-insensitive comparisons. Latin also has titlecase forms in the general category "Lt" (Titlecase Letter) for a small set of digraphs, such as U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (ǅ) used in Serbian and Croatian orthographies, where the initial component is uppercased while the trailing one remains lowercase to preserve digraph integrity in titles; the only other script with "Lt" characters is Greek.
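Several of the properties described in this section can be inspected directly. The sketch below uses Python's standard `unicodedata` module, which exposes the general category, bidirectional class, and decomposition mapping; it does not expose the script property, so that lookup is not shown. The helper name `describe` is hypothetical.

```python
import unicodedata


def describe(char):
    """Return a few UCD-derived properties for a single character."""
    return {
        "name": unicodedata.name(char, "<unnamed>"),
        "category": unicodedata.category(char),          # e.g. Lu, Ll, Lt, Mn
        "bidi": unicodedata.bidirectional(char),         # e.g. L, NSM, BN
        "decomposition": unicodedata.decomposition(char),
    }


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u0301", "\u00ad", "\u01c5"]:
        print(f"U+{ord(ch):04X}", describe(ch))

    # Case folding supports locale-independent caseless comparison.
    print("STRASSE".casefold() == "stra\u00dfe".casefold())

    # The digraph U+01C6 has distinct titlecase (U+01C5) and
    # uppercase (U+01C4) forms.
    print("\u01c6".title() == "\u01c5", "\u01c6".upper() == "\u01c4")
```

The values reported depend on the Unicode version bundled with the Python interpreter (available as `unicodedata.unidata_version`), although the characters used here have been stable for many versions.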

Unicode Blocks

Core Blocks

The core blocks for the Latin script in Unicode provide the foundational encoding for the most common Latin characters, ensuring compatibility with legacy systems while supporting a wide range of Western and Central European languages. These blocks—Basic Latin, Latin-1 Supplement, Latin Extended-A, and Latin Extended-B—collectively encode essential letters, punctuation, and controls, forming the basis for text processing in modern computing environments. The Basic Latin block (U+0000–U+007F) encompasses 128 characters, fully compatible with the 7-bit ASCII standard, and includes the C0 control characters (U+0000–U+001F and U+007F) for device management, as well as 95 graphic characters from U+0020 to U+007E. It features 52 letters (26 uppercase A–Z at U+0041–U+005A and 26 lowercase a–z at U+0061–U+007A), the digits 0–9 (U+0030–U+0039), and common punctuation such as the period (U+002E) and exclamation mark (U+0021). Additionally, it contains spacing clones of diacritics, like the circumflex accent (U+005E), which serve as compatibility characters rather than combining marks. This block supports the core alphabet for English and other basic Latin-script languages, with no unassigned positions in Unicode 17.0. The Latin-1 Supplement block (U+0080–U+00FF) adds another 128 characters, extending Basic Latin to cover Western European languages in alignment with ISO/IEC 8859-1. It includes C1 controls (U+0080–U+009F), precomposed accented letters such as á (U+00E1) for Spanish and Portuguese, ç (U+00E7) for French and Catalan, and ñ (U+00F1) for Spanish, along with symbols like the diaeresis (U+00A8) and acute accent (U+00B4) as spacing forms. Ordinal indicators ª (U+00AA) and º (U+00BA) are also encoded here, with glyph variations possible across fonts. This block enables representation of languages including Danish, Dutch, Finnish, German, Icelandic, Irish, Italian, Norwegian, and Swedish, with all positions assigned. 
Latin Extended-A (U+0100–U+017F) provides 128 characters focused on Central and Eastern European Latin-script languages, building on the previous blocks with precomposed forms featuring diacritics. It supports alphabets for Afrikaans, Basque, Breton, Croatian, Czech, Esperanto, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovenian, Turkish, and Welsh, including characters like ą (U+0105) for Polish, č (U+010D) for Czech and Croatian, and ā (U+0101) for Latvian. Several compatibility characters are included, such as the digraph ĳ (U+0133) for Dutch and ŉ (U+0149) for Afrikaans, though the latter is deprecated in favor of the two-character sequence. Some characters coincide with International Phonetic Alphabet (IPA) symbols, such as ŋ (U+014B), which is also a letter of Northern Sámi. All code points are assigned, ensuring comprehensive coverage compatible with ISO/IEC 6937.

Latin Extended-B (U+0180–U+024F) spans 208 code points, all assigned in Unicode 17.0, and is the first block with a significant non-European focus. It encodes letters for African languages per ISO 6438 (e.g., Ɓ U+0181 and Ɗ U+018A, used in Hausa and Fula), Vietnamese and Pinyin forms (e.g., ơ U+01A1 and ǎ U+01CE), and Sámi letters (e.g., ǥ U+01E5). Other examples include ƣ (U+01A3, LATIN SMALL LETTER OI). The block also features caseless letters such as the click letters ǀ to ǃ (U+01C0–U+01C3) and some phonetic symbols, with characters arranged in a nominal alphabetical order and case pairs placed together where possible.
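The composition of the Basic Latin block described earlier in this section (52 letters, 10 digits, 33 control codes, 95 graphic characters) can be verified from general categories. The following is a minimal sketch using Python's standard library; the variable names are illustrative.

```python
import unicodedata
from collections import Counter

# All 128 code points of the Basic Latin block (U+0000-U+007F).
basic_latin = [chr(cp) for cp in range(0x0000, 0x0080)]

# Tally general categories: Lu/Ll letters, Nd digits, Cc controls, etc.
by_category = Counter(unicodedata.category(ch) for ch in basic_latin)

letters = by_category["Lu"] + by_category["Ll"]
digits = by_category["Nd"]
controls = by_category["Cc"]            # U+0000-U+001F plus U+007F
graphic = len(basic_latin) - controls   # U+0020-U+007E

if __name__ == "__main__":
    print("letters:", letters)
    print("digits:", digits)
    print("controls:", controls)
    print("graphic:", graphic)
```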

Extended Blocks

The extended blocks for the Latin script in Unicode accommodate specialized orthographies, historical variants, and characters for minority and indigenous languages, with many additions occurring after Unicode version 3.0 to address gaps in representation for underrepresented writing systems. These blocks collectively provide around 900 characters, enabling digital support for various African, Nordic, and Southeast Asian orthographies that extend beyond basic Latin encoding. Key expansions include provisions for phonetic notations, medieval paleography, and modern minority orthographies, reflecting ongoing efforts to preserve linguistic diversity. As of Unicode 17.0, the Latin script includes 1,492 assigned characters across all blocks.

The Latin Extended Additional block (U+1E00–U+1EFF) comprises 256 characters designed to support accented letters for languages employing extended Latin alphabets, including Vietnamese letters with tone marks and Irish orthographic forms. Notable examples include Ḃ (U+1E02, LATIN CAPITAL LETTER B WITH DOT ABOVE) for Irish Gaelic and ḫ (U+1E2B, LATIN SMALL LETTER H WITH BREVE BELOW) for phonetic and historical uses. Most of the block dates from the earliest Unicode versions, with a few later additions such as the capital sharp s ẞ (U+1E9E).

Latin Extended-C (U+2C60–U+2C7F) is a compact block of 32 characters primarily encoding orthographic extensions for minority languages, including letters for the Uyghur Latin alphabet and additions for the Uralic Phonetic Alphabet. Characters such as Ⱡ (U+2C60, LATIN CAPITAL LETTER L WITH DOUBLE BAR) and ⱡ (U+2C61, LATIN SMALL LETTER L WITH DOUBLE BAR) were added in Unicode 5.0 to support linguistic documentation and minority orthographies.

The Latin Extended-D block (U+A720–U+A7FF) allocates 224 code points for a diverse array of historical and orthographic characters, emphasizing medieval letters and scribal abbreviations. It includes forms like ꝏ (U+A74F, LATIN SMALL LETTER OO) and ꝑ (U+A751, LATIN SMALL LETTER P WITH STROKE THROUGH DESCENDER) used in medieval manuscripts and paleographic editions, insular letterforms used in Celtic and Old English studies, and letters for modern minority orthographies. First populated in Unicode 5.0 and greatly enlarged in Unicode 5.1 and later versions, this block aids in the digitization of historical texts.

Latin Extended-E (U+AB30–U+AB6F) offers 60 assigned characters out of 64 code points, mainly letters for German dialectology (the Teuthonista system) and other specialist phonetic and ethnographic notations. Examples include ꬰ (U+AB30, LATIN SMALL LETTER BARRED ALPHA) and ꭓ (U+AB53, LATIN SMALL LETTER CHI). The block was added in Unicode 7.0.

Latin Extended-F (U+10780–U+107BF) includes 57 assigned characters out of 64 code points, supporting modifier letters for phonetic transcription, chiefly superscript forms of IPA and extIPA letters. These additions, introduced in Unicode 14.0, enable precise representation of sounds in linguistic research and documentation.

Finally, Latin Extended-G (U+1DF00–U+1DFFF) reserves 256 code points, of which 37 are assigned. Introduced in Unicode 14.0, it contains additional letters for phonetic transcription and illustrates the continued expansion of Latin coverage for linguistic use.

Auxiliary Blocks

Auxiliary blocks in Unicode encompass ranges outside the primary Latin script allocations that nonetheless include a substantial number of characters classified under the Latin script (Script=Latin, or Latin listed in Script_Extensions) for phonetic, modifier, and diacritical purposes. These blocks primarily support linguistic transcription systems, such as the International Phonetic Alphabet (IPA), and enable advanced notations in scholarly and orthographic contexts. While not dedicated exclusively to Latin orthographies, they contain characters that integrate seamlessly with Latin-based writing systems, often through composition with base letters from core blocks. Collectively, these contribute approximately 480 characters to phonetic transcription, linguistic research, and extended orthographies worldwide.

The IPA Extensions block (U+0250–U+02AF) provides 96 characters dedicated to phonetic transcription in the IPA, including symbols like ɐ (U+0250 LATIN SMALL LETTER TURNED A) and ʙ (U+0299 LATIN LETTER SMALL CAPITAL B). These characters represent sounds not adequately covered by basic Latin letters, such as turned or inverted forms for vowels and consonants, and are classified as Latin script despite their specialized phonetic application. This block facilitates precise representation of global language phonologies in linguistic research and documentation.

Spacing Modifier Letters (U+02B0–U+02FF) contains 80 characters used as spacing forms of diacritics, tone marks, and other modifiers, exemplified by ʾ (U+02BE MODIFIER LETTER RIGHT HALF RING) and ˀ (U+02C0 MODIFIER LETTER GLOTTAL STOP). These include superscript letters and symbols for prosodic features like stress or intonation, which are written next to Latin base characters to denote phonetic modifications. The block supports notations in phonology and dialectology, ensuring compatibility with Latin script rendering. 
The Combining Diacritical Marks block (U+0300–U+036F) includes 112 non-spacing marks that attach to base characters, many of which are essential for Latin script compositions, such as ◌̈ (U+0308 COMBINING DIAERESIS). These marks, including acute, grave, and circumflex accents, are shared across scripts but predominantly used with Latin letters to represent vowel qualities or tonal distinctions in languages like French, German, and Vietnamese. Their combining nature allows for normalized decomposition and canonical equivalence in text processing.

Phonetic Extensions (U+1D00–U+1D7F) allocates 128 characters for advanced phonetic notations beyond core IPA, including small capitals like ᴀ (U+1D00 LATIN LETTER SMALL CAPITAL A) and modifier letters like ᵃ (U+1D43 MODIFIER LETTER SMALL A). This block extends support for the Uralic Phonetic Alphabet (UPA) and other systems, with many characters bearing the Latin script property for integration into Latin-based transcriptions.

Phonetic Extensions Supplement (U+1D80–U+1DBF) adds 64 characters, incorporating additional phonetic letters and modifier letters, such as ᶀ (U+1D80 LATIN SMALL LETTER B WITH PALATAL HOOK) and ᶛ (U+1D9B MODIFIER LETTER SMALL TURNED ALPHA). These include hooks and curls for articulatory details, classified under Latin script to support scholarly phonetic work. The block complements earlier phonetic ranges by providing further granularity for rare or specialized sounds.
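The behaviour of the combining marks in U+0300–U+036F can be illustrated with a short check. This is a minimal sketch using Python's standard `unicodedata` module: it confirms that the block's code points are nonspacing marks, then shows that a mark composes into a precomposed letter only where one is encoded. The choice of q as a base letter without a precomposed diaeresis form is an assumption about the repertoire that the script itself tests.

```python
import unicodedata

# Every code point in the Combining Diacritical Marks block.
marks = [chr(cp) for cp in range(0x0300, 0x0370)]
nonspacing = [m for m in marks if unicodedata.category(m) == "Mn"]

DIAERESIS = "\u0308"  # COMBINING DIAERESIS

# u + diaeresis has a precomposed equivalent, U+00FC.
composed_u = unicodedata.normalize("NFC", "u" + DIAERESIS)

# q + diaeresis stays a two-code-point sequence if no precomposed
# form exists; rendering then relies on the font and shaping engine.
composed_q = unicodedata.normalize("NFC", "q" + DIAERESIS)

if __name__ == "__main__":
    print("marks in block:", len(marks), "nonspacing:", len(nonspacing))
    # A nonzero combining class marks a character that reorders
    # canonically relative to other marks on the same base.
    print("combining class:", unicodedata.combining(DIAERESIS))
    print("u + diaeresis ->", [hex(ord(c)) for c in composed_u])
    print("q + diaeresis ->", [hex(ord(c)) for c in composed_q])
```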

Character Categories

Letters and Digraphs

The Latin script in Unicode begins with the 26 base letters of the English alphabet, provided as uppercase and lowercase pairs in the Basic Latin block, ranging from A (U+0041) to Z (U+005A) and a (U+0061) to z (U+007A). These form the foundational alphabetic characters for most Latin-based writing systems worldwide.

Unicode encodes approximately 844 precomposed Latin letters (as of Unicode 17.0), accented and modified forms that combine a base letter with diacritics or other modifications in a single code point for efficiency and compatibility. Examples include Ä (U+00C4) for German and Swedish and Ñ (U+00D1) for Spanish, alongside distinct additional letters such as Ʒ (U+01B7, ezh). These are distributed across various blocks, with over 100 forms supporting European orthographies in blocks like Latin-1 Supplement and Latin Extended-A, such as É (U+00C9) and Ł (U+0141). More than 200 letters cater to African languages, primarily in Latin Extended-B and Latin Extended-D, including Ɖ (U+0189) for Ewe and Ɓ (U+0181) for the implosive sounds of Shona and other African languages.

Digraphs and ligatures, which represent combined letter forms treated as single units in certain languages, are encoded as distinct characters to preserve orthographic integrity, with around 50 such forms across relevant blocks. Notable examples include Æ (U+00C6) for Danish and Old English, Œ (U+0152) for French, and Ĳ (U+0132) for Dutch. These encodings allow accurate representation without relying on decomposition. Old English additionally uses Ð (U+00D0, eth), which is an independent letter rather than a digraph. Unicode also includes specialized variants of Latin letters to support diverse applications. 
Small capital letters, such as ᴀ (U+1D00), appear in phonetic and linguistic notations within the Phonetic Extensions block, providing compact uppercase-like forms for abbreviations and emphasis. Historical variants, like Ꝏ (U+A74E) used in Old English manuscripts for the "oo" ligature, are encoded in Latin Extended-D to revive medieval scribal traditions. Indigenous orthographies draw on unique forms, such as Ŋ (U+014A) in Inuktitut to represent the ng sound. Unicode 18.0 (2025) added further historical variants, such as Latin capital letter R with long leg (U+A7E2), enhancing support for medieval European orthographies. Special case mapping rules apply to certain Latin characters, exemplified by the German ß (U+00DF), which uppercases to SS (U+0053 U+0053) rather than a single uppercase form, reflecting orthographic conventions while a dedicated uppercase ẞ (U+1E9E) exists for precise needs. Unicode avoids encoding every possible digraph as a single character to prevent redundancy, favoring composition with base letters and modifiers where feasible, thus maintaining a balance between compatibility and extensibility.
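The special case mapping for ß can be observed directly in a language whose string functions apply Unicode's full case mappings. The following is a minimal sketch using Python's built-in `str` methods; the sample word "straße" is chosen only for illustration.

```python
sharp_s = "\u00DF"          # ß LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1E9E"  # ẞ LATIN CAPITAL LETTER SHARP S

# Full uppercasing expands ß to the two letters "SS", so the string grows.
word = "stra" + sharp_s + "e"
print(word.upper())        # STRASSE
print(len(word), len(word.upper()))  # 6 7

# The dedicated capital form lowercases back to ß.
print(capital_sharp_s.lower() == sharp_s)  # True

# Case folding, intended for caseless comparison, maps ß to "ss".
print(sharp_s.casefold())  # ss
```

Because uppercasing can change the length of a string, code that assumes a one-to-one mapping between lowercase and uppercase code points may need adjusting for such characters.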

Diacritics and Modifiers

The Latin script in Unicode incorporates a wide array of diacritical marks and modifiers to represent phonetic nuances, tonal distinctions, and orthographic variations across languages that use Latin-based writing systems. These elements are essential for encoding accented letters, such as those in French (é U+00E9), Vietnamese (ệ U+1EC7), or pinyin (nǐ), because base letters can be modified freely even where no dedicated precomposed character exists for a given combination. Combining diacritics, which attach to the preceding base character, form the core of this system and can be stacked, for example to place a vowel-quality mark and a tone mark on the same Vietnamese letter.

Combining diacritics encompass approximately 200 marks (as of Unicode 17.0) distributed across several Unicode blocks, primarily the Combining Diacritical Marks (U+0300–U+036F), Combining Diacritical Marks Supplement (U+1DC0–U+1DFF), and Combining Diacritical Marks Extended (U+1AB0–U+1AFF) blocks. Common examples include the acute accent (◌́ U+0301) for stress in Spanish or rising tones in Vietnamese, and the grave accent (◌̀ U+0300) for falling tones or open syllables in Italian and African languages like Yoruba. These marks support stacking, as seen in Vietnamese orthography, where a base vowel may carry a vowel-quality mark together with a tone mark (e.g., ◌̉ U+0309 hook above for the hỏi tone), allowing representations like ử (u + horn + hook above). Normalization applies canonical ordering to combining marks so that equivalent sequences compare and render consistently across systems.

Spacing modifiers, numbering around 100 characters mainly in the Spacing Modifier Letters block (U+02B0–U+02FF), function as independent glyphs that influence adjacent letters without combining, often in phonetic transcriptions or scripts requiring explicit spacing. For instance, the modifier letter apostrophe ʼ (U+02BC) denotes glottal stops or ejective sounds in African languages like Hausa, while the left half ring ʿ (U+02BF) marks the pharyngeal consonant ayin in Semitic transliterations adapted to Latin. These are crucial for linguistic notations where non-spacing attachment would alter readability, such as in dialectology or African orthographies like those for Igbo.

Phonetic symbols, totaling about 300 characters drawn from IPA Extensions (U+0250–U+02AF), Phonetic Extensions (U+1D00–U+1D7F), and related blocks, extend Latin forms for International Phonetic Alphabet (IPA) usage in linguistic analysis. The glottal stop ʔ (U+0294), a standalone letter, represents a catch in the voice in transcriptions of languages like Danish or Tagalog, while the turned k ʞ (U+029E) was historically used for a velar click. These symbols integrate with Latin bases to transcribe sounds absent in standard alphabets, supporting scholarly and educational applications.

Other specialized modifiers include the double acute (◌̋ U+030B) for Hungarian long vowels (e.g., ő) and the hook above (◌̉ U+0309) for the Vietnamese hỏi tone. Sequencing rules dictate the order of application (typically the base letter first, then non-spacing marks, followed by spacing or enclosing modifiers) to avoid rendering ambiguities. Format characters give control over ligatures in Latin contexts: the zero-width joiner (ZWJ, U+200D) can request that adjacent letters be joined, as in historical texts, and the zero-width non-joiner (ZWNJ, U+200C) can suppress an unwanted ligature, as across the boundary of a compound word. Additionally, compatibility decompositions map precomposed ligatures, such as ﬁ (U+FB01), to f + i (U+0066 U+0069), facilitating searchable plain text while legacy encodings can still round-trip the original character.
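The canonical ordering of stacked marks and the compatibility decomposition of U+FB01 can both be illustrated with Python's standard `unicodedata` module. This is a sketch under the assumption that the interpreter's bundled Unicode data covers these long-established characters; the variable names are illustrative.

```python
import unicodedata

# ử (U+1EED) decomposes canonically to u + horn (U+031B) + hook above (U+0309).
u_horn_hook = "\u1EED"
nfd = unicodedata.normalize("NFD", u_horn_hook)
print([f"U+{ord(c):04X}" for c in nfd])   # ['U+0075', 'U+031B', 'U+0309']

# Typing the marks in the other order still yields the same letter, because
# normalization reorders marks by their canonical combining class first.
swapped = "u\u0309\u031B"
print(unicodedata.normalize("NFC", swapped) == u_horn_hook)  # True
print(unicodedata.combining("\u031B"), unicodedata.combining("\u0309"))  # 216 230

# The fi ligature has only a compatibility decomposition: NFC leaves it alone,
# while NFKD (or NFKC) replaces it with the two letters "fi".
ligature = "\uFB01"
print(unicodedata.normalize("NFC", ligature) == ligature)   # True
print(unicodedata.normalize("NFKD", ligature))              # fi
```

Applying NFKC or NFKD before indexing is one common way to make a search for "fi" match text that contains the ligature, at the cost of discarding the distinction in the stored form.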

Representations

Block Summary Table

The Unicode blocks containing characters classified under the Latin script (Script=Latn) provide a structured allocation of code points for letters, modifier letters, and related symbols used in Latin-based writing systems. As of Unicode version 17.0 (released September 2025), there are 20 such blocks encompassing a total of 1,492 characters. These blocks have evolved over time, starting with 116 characters across the Basic Latin and Latin-1 Supplement blocks in version 1.0, followed by the addition of 1,376 more characters in subsequent versions to support expanded linguistic needs. Unicode 17.0 added characters to Latin Extended-G and Phonetic Extensions Supplement for additional support of indigenous and phonetic notations. The table below summarizes each block, including its code point range, the number of Latin-script characters (Script=Latn, excluding unassigned or reserved points, which are indicated in gray in official charts), the version in which the block was introduced, and a brief description of its purpose. This overview serves as a quick reference for developers and linguists working with Latin script encoding.
| Block Name | Code Point Range | Number of Latin Characters | Version Added | Brief Purpose |
|---|---|---|---|---|
| Basic Latin | U+0000–U+007F | 52 | 1.0 | Core letters for ASCII compatibility |
| Latin-1 Supplement | U+0080–U+00FF | 64 | 1.0 | Western European accented letters |
| Latin Extended-A | U+0100–U+017F | 128 | 1.0 | Accented letters for European languages |
| Latin Extended-B | U+0180–U+024F | 182 | 1.0 | African, indigenous, and phonetic extensions |
| IPA Extensions | U+0250–U+02AF | 96 | 1.1 | Phonetic symbols for linguistics |
| Spacing Modifier Letters | U+02B0–U+02FF | 80 | 1.0 | Modifiers for tone, stress, and phonetics |
| Combining Diacritical Marks | U+0300–U+036F | 112 | 1.0 | Diacritics compatible with Latin letters |
| Latin Extended Additional | U+1E00–U+1EFF | 256 | 1.1 | Additional accented and dotted letters |
| Superscripts and Subscripts | U+2070–U+209F | 15 | 1.1 | Superscript/subscript Latin forms |
| Letterlike Symbols | U+2100–U+214F | 4 | 1.0 | Letterlike Latin variants (e.g., Å) |
| Number Forms | U+2150–U+218F | 41 | 1.0 | Roman numerals and fractions |
| Phonetic Extensions | U+1D00–U+1D7F | 108 | 4.0 | Extended phonetic notation |
| Phonetic Extensions Supplement | U+1D80–U+1DBF | 60 | 4.1 | Supplemental phonetic symbols |
| Latin Extended-C | U+2C60–U+2C7F | 32 | 5.0 | Orthographic, phonetic, and historic letters (e.g., Uyghur, Claudian) |
| Latin Extended-D | U+A720–U+A7FF | 160 | 5.0 | Phonetic, Mayanist, and medievalist forms |
| Latin Extended-E | U+AB30–U+AB6F | 32 | 7.0 | German dialectology and other phonetic letters |
| Latin Extended-F | U+10780–U+107BF | 57 | 14.0 | Modifier letters for IPA and extended IPA |
| Latin Extended-G | U+1DF00–U+1DFFF | 80 | 14.0 | Additional phonetic letters for IPA extensions |
| Alphabetic Presentation Forms (Latin ligatures) | U+FB00–U+FB4F | 7 | 1.1 | Typographic ligatures (e.g., ﬁ) |
| Halfwidth and Fullwidth Forms (Latin) | U+FF00–U+FFEF | 52 | 1.1 | East Asian width variants of Latin letters |
Note that reserved code points within these blocks (e.g., for future allocation) are not counted and appear grayed in the official Unicode charts. Counts reflect only characters with Script=Latn as per the Unicode Character Database.

Character Inventory Table

The character inventory for the Latin script in Unicode encompasses 1,492 assigned characters as of version 17.0, excluding control characters and non-letter elements unless they are explicitly part of Latin orthographies. These characters are distributed across multiple blocks, with properties such as general category (e.g., Lu for uppercase letters, Ll for lowercase letters, Mn for non-spacing marks) derived from the Unicode Character Database. The table below presents representative examples grouped by block, including code points in hexadecimal, glyphs (where renderable in standard fonts), official names, categories, and the Unicode version in which they were first encoded. Reserved code points within Latin blocks are marked as such; compatibility characters (e.g., decomposed forms from legacy encodings) are noted where applicable. For the complete list, refer to the official Unicode charts and data files.
| Block | Code Point | Glyph | Name | Category | Version Added |
|---|---|---|---|---|---|
| Basic Latin (U+0000–U+007F; excludes controls U+0000–U+001F, U+007F) | U+0041 | A | LATIN CAPITAL LETTER A | Lu | 1.0 |
| | U+0061 | a | LATIN SMALL LETTER A | Ll | 1.0 |
| | U+0042 | B | LATIN CAPITAL LETTER B | Lu | 1.0 |
| Latin-1 Supplement (U+0080–U+00FF) | U+00C0 | À | LATIN CAPITAL LETTER A WITH GRAVE | Lu | 1.0 |
| | U+00E0 | à | LATIN SMALL LETTER A WITH GRAVE | Ll | 1.0 |
| | U+00DF | ß | LATIN SMALL LETTER SHARP S | Ll | 1.0 |
| Latin Extended-A (U+0100–U+017F) | U+0106 | Ć | LATIN CAPITAL LETTER C WITH ACUTE | Lu | 1.0 |
| | U+0107 | ć | LATIN SMALL LETTER C WITH ACUTE | Ll | 1.0 |
| | U+0141 | Ł | LATIN CAPITAL LETTER L WITH STROKE | Lu | 1.0 |
| Latin Extended-B (U+0180–U+024F) | U+0181 | Ɓ | LATIN CAPITAL LETTER B WITH HOOK | Lu | 1.0 |
| | U+01BA | ƺ | LATIN SMALL LETTER EZH WITH RETROFLEX HOOK | Ll | 1.0 |
| IPA Extensions (U+0250–U+02AF) | U+0253 | ɓ | LATIN SMALL LETTER B WITH HOOK | Ll | 1.1 |
| | U+025B | ɛ | LATIN SMALL LETTER OPEN E | Ll | 1.1 |
| | U+0283 | ʃ | LATIN SMALL LETTER ESH | Ll | 1.1 |
| | U+0292 | ʒ | LATIN SMALL LETTER EZH | Ll | 1.1 |
| Spacing Modifier Letters (U+02B0–U+02FF) | U+02BC | ʼ | MODIFIER LETTER APOSTROPHE | Lm | 1.0 |
| | U+02BB | ʻ | MODIFIER LETTER TURNED COMMA | Lm | 1.0 |
| Combining Diacritical Marks (U+0300–U+036F) | U+0300 | ◌̀ | COMBINING GRAVE ACCENT | Mn | 1.0 |
| | U+0301 | ◌́ | COMBINING ACUTE ACCENT | Mn | 1.0 |
| | U+0323 | ◌̣ | COMBINING DOT BELOW | Mn | 1.0 |
| Latin Extended Additional (U+1E00–U+1EFF) | U+1E02 | Ḃ | LATIN CAPITAL LETTER B WITH DOT ABOVE | Lu | 1.1 |
| | U+1E03 | ḃ | LATIN SMALL LETTER B WITH DOT ABOVE | Ll | 1.1 |
| | U+1E9B | ẛ | LATIN SMALL LETTER LONG S WITH DOT ABOVE | Ll | 2.0 |
| Latin Extended-C (U+2C60–U+2C7F) | U+2C60 | Ⱡ | LATIN CAPITAL LETTER L WITH DOUBLE BAR | Lu | 5.0 |
| | U+2C61 | ⱡ | LATIN SMALL LETTER L WITH DOUBLE BAR | Ll | 5.0 |
| | U+2C67 | Ⱨ | LATIN CAPITAL LETTER H WITH HOOK | Lu | 5.0 |
| Latin Extended-D (U+A720–U+A7FF) | U+A730 | ꜰ | LATIN LETTER SMALL CAPITAL F | Ll | 5.1 |
| | U+A731 | ꜱ | LATIN LETTER SMALL CAPITAL S | Ll | 5.1 |
| Latin Extended-E (U+AB30–U+AB6F) | U+AB30 | ꬰ | LATIN SMALL LETTER BARRED ALPHA | Ll | 7.0 |
| | U+AB31 | ꬱ | LATIN SMALL LETTER A REVERSED-SCHWA | Ll | 7.0 |
| Latin Extended-F (U+10780–U+107BF) | U+10780 | 𐞀 | MODIFIER LETTER SMALL CAPITAL AA | Lm | 14.0 |
| | U+10781 | 𐞁 | MODIFIER LETTER SUPERSCRIPT TRIANGULAR COLON | Lm | 14.0 |
| Latin Extended-G (U+1DF00–U+1DFFF) | U+1DF1D | 𝼝 | LATIN SMALL LETTER C WITH RETROFLEX HOOK | Ll | 14.0 |
| | U+1DF1E | 𝼞 | LATIN SMALL LETTER S WITH CURL | Ll | 14.0 |
This selection highlights the evolution of the Latin inventory, from core ASCII-compatible characters in version 1.0 to specialized extensions in later versions for linguistic diversity. Compatibility characters, such as certain digraphs in Latin Extended Additional, originate from legacy font mappings and are flagged for decomposition where relevant. Reserved points indicate unassigned code points available for future allocation.
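The name and general category columns can be reproduced for any character with a few lines of code. The sketch below relies on Python's standard `unicodedata` module, whose data reflects the Unicode version bundled with the interpreter, so characters newer than that version will not be found; the helper name `describe` is an illustrative assumption.

```python
import unicodedata

def describe(ch: str):
    """Return (code point, official name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in ["A", "\u00DF", "\u0141", "\u0300"]:
    print(describe(ch))
# ('U+0041', 'LATIN CAPITAL LETTER A', 'Lu')
# ('U+00DF', 'LATIN SMALL LETTER SHARP S', 'Ll')
# ('U+0141', 'LATIN CAPITAL LETTER L WITH STROKE', 'Lu')
# ('U+0300', 'COMBINING GRAVE ACCENT', 'Mn')

# The interpreter's own Unicode version can be inspected as well.
print(unicodedata.unidata_version)
```

The standard module does not expose the Script property itself, so determining Script=Latn membership would require the Unicode Character Database files or a third-party library.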

    Sep 9, 2025 · Unicode 17.0 adds 4803 characters, for a total of 159,801 characters. The new additions include 4 new scripts: Sidetic; Tolong Siki; Beria ...