Latin script in Unicode
The Latin script in Unicode refers to the standardized encoding of characters used in the Latin alphabet and its numerous extensions, supporting the writing systems of over 100 languages worldwide, from major European tongues to indigenous and minority languages in Africa, the Americas, and Asia. As of Unicode version 17.0, released on September 9, 2025, the Latin script encompasses 1,492 assigned characters, primarily letters with diacritics, phonetic symbols, and related punctuation, distributed across dedicated blocks such as Basic Latin (U+0000–U+007F), Latin-1 Supplement (U+0080–U+00FF), Latin Extended-A (U+0100–U+017F), Latin Extended-B (U+0180–U+024F), Latin Extended Additional (U+1E00–U+1EFF), Latin Extended-C (U+2C60–U+2C7F), Latin Extended-D (U+A720–U+A7FF), Latin Extended-E (U+AB30–U+AB6F), Latin Extended-F (U+10780–U+107BF), and Latin Extended-G (U+1DF00–U+1DFFF).[1][2] The encoding of the Latin script originated in Unicode 1.0 (1991), building on legacy standards like ASCII and ISO Latin-1 to ensure compatibility while allowing expansion for global linguistic diversity.[3] Key principles include the provision of both precomposed characters (e.g., á at U+00E1) for efficient legacy support and combining diacritical marks (e.g., U+0061 a + U+0301 acute accent) for flexible composition, enabling the representation of accented letters in languages like French, German, and Vietnamese.[4] This dual approach accommodates historical orthographies, phonetic notations such as the International Phonetic Alphabet (IPA), and modern adaptations, with uppercase/lowercase pairings and left-to-right rendering as standard.[5] Notable extensions in later Unicode versions address specific needs, such as the Latin Extended-D block for Caucasian Albanian and other historical scripts, and Latin Extended-G for advanced diacritic stacking in scholarly and linguistic applications.[2] The script's versatility is evident in its use for transliterations (e.g., Pinyin for Chinese) 
and in mixed-script environments, like Japanese loanwords, underscoring Unicode's goal of universal text interchange without favoring any single language.[4] Ongoing proposals continue to refine glyphs and add characters, such as subscript forms and variant 'r' shapes, to better support diverse typographic traditions.[6]

Overview
Definition and Scope
The Latin script in Unicode encompasses a family of alphabets derived from the ancient Roman script, with characters assigned the Unicode script property value "Latn" to indicate their primary association with this writing system.[7] This classification supports the encoding of letters, diacritics, and modifiers used in various Latin-based orthographies worldwide. As of Unicode 17.0, released in September 2025, exactly 1,492 characters are classified with the Latin script property, spanning basic alphabetic forms to specialized extensions.[1] This total reflects ongoing expansions, including 5 new characters added to the Latin Extended-D block in version 17.0.[8]

The scope of the Latin script covers encodings for modern European languages (such as English, French, and German), African languages (like Swahili and Yoruba), Vietnamese with its diacritic-rich alphabet, indigenous American and Australian scripts adapted to Latin bases, phonetic notations including the International Phonetic Alphabet, and historical variants for medieval or classical texts, while deliberately excluding characters from independent scripts like Cyrillic (assigned "Cyrl") or Greek ("Grek") despite visual similarities.[7] These inclusions ensure comprehensive support for linguistic diversity without overlapping script boundaries.

The majority of Latin-script characters are concentrated in dedicated Unicode blocks, such as:

| Block Name | Code Point Range |
|---|---|
| Basic Latin | U+0000–U+007F |
| Latin-1 Supplement | U+0080–U+00FF |
| Latin Extended-A | U+0100–U+017F |
| Latin Extended-B | U+0180–U+024F |
| IPA Extensions | U+0250–U+02AF |
| Spacing Modifier Letters | U+02B0–U+02FF |
| Latin Extended Additional | U+1E00–U+1EFF |
| Latin Extended-C | U+2C60–U+2C7F |
| Latin Extended-D | U+A720–U+A7FF |
| Latin Extended-E | U+AB30–U+AB6F |
| Latin Extended-F | U+10780–U+107BF |
| Latin Extended-G | U+1DF00–U+1DFFF |
| Phonetic Extensions | U+1D00–U+1D7F |
| Phonetic Extensions Supplement | U+1D80–U+1DBF |
| Superscripts and Subscripts | U+2070–U+209F |
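
The block table above can be turned into a small lookup, which is sometimes handy for sanity-checking where a given character lives. The following is a minimal sketch rather than a reference implementation: the names `LATIN_BLOCKS` and `block_of` are hypothetical, and the ranges are the block boundaries listed in this article (with Latin Extended-G taken as the full block U+1DF00–U+1DFFF). Block membership is not the same as the Script property: Basic Latin, for instance, also holds digits and punctuation whose script is Common.

```python
from typing import Optional

# Block boundaries as listed in this article (inclusive code point ranges).
# Hypothetical helper table; Latin Extended-G uses the full block range.
LATIN_BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Latin Extended-B", 0x0180, 0x024F),
    ("IPA Extensions", 0x0250, 0x02AF),
    ("Spacing Modifier Letters", 0x02B0, 0x02FF),
    ("Phonetic Extensions", 0x1D00, 0x1D7F),
    ("Phonetic Extensions Supplement", 0x1D80, 0x1DBF),
    ("Latin Extended Additional", 0x1E00, 0x1EFF),
    ("Superscripts and Subscripts", 0x2070, 0x209F),
    ("Latin Extended-C", 0x2C60, 0x2C7F),
    ("Latin Extended-D", 0xA720, 0xA7FF),
    ("Latin Extended-E", 0xAB30, 0xAB6F),
    ("Latin Extended-F", 0x10780, 0x107BF),
    ("Latin Extended-G", 0x1DF00, 0x1DFFF),
]


def block_of(ch: str) -> Optional[str]:
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, low, high in LATIN_BLOCKS:
        if low <= cp <= high:
            return name
    return None


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u0142", "\u0250", "\u1eed", "\u0416"]:
        print(f"U+{ord(ch):04X} -> {block_of(ch)}")
```

A character outside every listed range, such as the Cyrillic letter U+0416, yields `None`; a production system would more likely read Blocks.txt and Scripts.txt from the Unicode Character Database than hard-code ranges.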
Importance and Usage
The Latin script is the most widely used writing system within the Unicode Standard, supporting hundreds of languages across the globe, including prominent European languages such as English, Spanish, French, and German, as well as non-European ones like Swahili and Turkish.[3] This extensive coverage stems from its adaptability to diverse linguistic needs through basic letters, diacritics, and extensions, making it indispensable for multilingual digital environments. Over 1,500 languages worldwide employ the Latin script as their primary writing system, underscoring its dominance in global text representation.[9] The encoding of the Latin script plays a pivotal role in web content creation, software localization, and compliance with international standards such as ISO/IEC 10646, which defines the Universal Coded Character Set for universal text interchange. Languages using the Latin script account for approximately 70% of internet content, with English alone comprising over 50% of websites, followed by Spanish, German, French, and others, facilitating seamless global communication and application development.[10] In software localization, Unicode's Latin support enables developers to handle text in a single encoding scheme, reducing conversion errors and supporting internationalization for billions of users.[11] The historical evolution of Latin encoding in computing—from the 7-bit ASCII standard, which covered only unaccented English characters, to the 8-bit ISO 8859-1 (Latin-1) that introduced diacritics for Western European languages—directly shaped Unicode's approach, ensuring compatibility while expanding to over 1,000 Latin characters.[12] Unicode resolves legacy challenges in accented languages by offering both precomposed forms (e.g., é as U+00E9) and combining diacritics (e.g., e + acute accent), with normalization algorithms that reconcile variations from older encodings without data loss.[13] This framework prevents compatibility issues, such as 
mismatched representations in legacy systems, promoting consistent rendering across platforms. Additionally, the Latin script underpins specialized applications, including mathematical notations via derived symbols like italic or bold variants of Latin letters (e.g., in the Mathematical Alphanumeric Symbols block) and serves as a foundational element in emoji, where Latin-based characters form regional indicators and modifier bases for diverse expressions.[14][15] These roles highlight its versatility beyond everyday text, enhancing technical and expressive digital content.

Historical Development
Initial Encoding
The initial encoding of the Latin script in Unicode began with version 1.0, released in October 1991, which established the foundational blocks for Latin characters to ensure broad compatibility with existing computing standards.[16] This version included the Basic Latin block (U+0000–U+007F), comprising 128 characters that directly mirrored the American Standard Code for Information Interchange (ASCII), encompassing unaccented uppercase and lowercase English letters (A–Z and a–z), digits (0–9), basic punctuation, and control codes.[17] The design preserved the exact bit patterns of ASCII within the first 128 code points, allowing seamless integration with legacy systems that relied on 7-bit encodings.[16] Unicode 1.1, released in June 1993, expanded the Latin repertoire by incorporating the Latin-1 Supplement block (U+0080–U+00FF), adding another 128 characters primarily drawn from the ISO/IEC 8859-1 standard, also known as Latin-1.[17] This block introduced 61 accented Latin letters—such as á, ç, and ñ—along with currency symbols, mathematical operators, and additional punctuation to support Western European languages like French, German, Spanish, and Italian.[18] The inclusion of these characters extended the total Latin encoding to 256 positions, aligning Unicode with the 8-bit ISO 8859-1 encoding that had become prevalent in personal computing during the 1980s.[16] The rationale for this initial approach prioritized compatibility with established 8-bit encodings to facilitate widespread adoption and minimize disruption in software and data migration. 
By starting with ASCII and extending to ISO 8859-1, Unicode addressed the fragmentation of character sets in global computing, where different regions used incompatible codes for Latin-based scripts, thereby promoting efficient text interchange without requiring extensive re-encoding of existing files.[16] This compatibility was a core principle from the project's inception, as early designers aimed to create a universal standard that could "begin at 0 and add the next character" sequentially, avoiding the inefficiencies of variable-width legacy encodings.[19]

Key milestones trace back to collaborative efforts in the 1980s by international standards bodies. In 1984, ISO/TC97/SC2 proposed a two-byte international character set as a precursor to broader universal encoding initiatives.[16] By late 1987, discussions within ECMA and ISO working groups, alongside ANSI X3L2, focused on unifying Western character sets, with the term "Unicode" coined by Joe Becker at Xerox to describe a unique, universal, and uniform encoding scheme.[19] These efforts culminated in the initial repertoire emphasizing 256 Western Latin characters, formalized through the Unicode Consortium's incorporation in January 1991.[19]

Despite these advancements, the early Unicode versions had notable limitations, lacking support for Eastern European languages that use extended Latin characters, such as those with diacritics for Polish, Czech, or Hungarian.[17] The focus remained on Western European needs via ISO 8859-1, deferring broader Latin extensions until later versions to maintain initial simplicity and compatibility.[16]

Subsequent Expansions
Following the initial encoding in Unicode 1.x, expansions began with version 2.0 in July 1996, which introduced the Latin Extended-A block (U+0100–U+017F) to support additional Latin letters for Central and Eastern European languages, such as the Polish ł (U+0142) and Czech č (U+010D).[20] This addition addressed the need for characters beyond the Latin-1 Supplement, enabling better representation of accented letters in scripts like those used in Polish, Czech, and Hungarian orthographies.[21]

Unicode 3.0, released in September 2000, expanded the repertoire with the Latin Extended-B block (U+0180–U+024F), incorporating characters for African languages, including click letters like ǂ (U+01C2), and the Vietnamese horn letters ơ (U+01A1) and ư (U+01B0).[22] These inclusions were driven by proposals to accommodate orthographic reforms and linguistic diversity in non-European contexts, filling gaps in earlier standards like ISO 6438 for African scripts.

Subsequent versions built on this foundation with targeted additions for specialized uses. Unicode 5.0 (July 2006) introduced the Latin Extended-C block (U+2C60–U+2C7F) for orthographic and phonetic uses, together with the first characters of Latin Extended-D (U+A720–U+A7FF); Unicode 5.1 (April 2008) then filled out Latin Extended-D with medieval and phonetic letters, such as ꝏ (U+A74F) for historical paleography.[23][24] Versions 5.0 (July 2006) and 6.0 (October 2010) also extended the phonetic repertoire alongside the IPA Extensions block (U+0250–U+02AF), which holds letters such as ɳ (U+0273), and the Spacing Modifier Letters block (U+02B0–U+02FF), which holds modifiers such as ʼ (U+02BC), to support phonetic transcription systems proposed by linguists for broader IPA coverage. 
From Unicode 7.0 (June 2014) through 15.0 (September 2022), further expansions included Latin Extended-E (U+AB30–U+AB6F, added in 7.0), Latin Extended-F (U+10780–U+107BF, added in 14.0), and Latin Extended-G (U+1DF00–U+1DFFF, added in 14.0), incorporating characters for indigenous languages like those in the Anthropos alphabet (e.g., ꭎ U+AB4E) and historical scripts such as Gothic transliterations.[25][26][27] These were motivated by submissions from linguists and academic groups to preserve minority and historical orthographies, often addressing deficiencies in legacy encoding systems.

Unicode 16.0 (September 2024) and 17.0 (September 2025) continued this trend by adding characters such as additional letters for Gaulish in Latin Extended-D and extensions to Latin Extended-G for phonetic notations, enhancing support for indigenous and historical languages including Sami and Inuktitut orthographies.[28] These expansions stemmed from proposals by linguistic experts and indigenous communities to standardize digital representation of their writing systems.

The motivations for these expansions have consistently involved contributions from linguists, academic institutions, and governments, particularly for African orthographies (e.g., proposals for Nguni and Bamileke scripts) and medieval studies, to bridge gaps in digital fonts and promote cultural preservation.[29] As of November 2025, previews for Unicode 18.0 indicate ongoing additions for African and Oceanian Latin variants, such as U+A7E2 LATIN CAPITAL LETTER R WITH LONG LEG for Egyptian Arabic and southern African languages, potentially increasing the total Latin script characters beyond 1,500.[30]

Encoding Principles
Composition and Decomposition
The Latin script in Unicode supports both precomposed characters, which are single code points representing a base letter combined with one or more diacritics, and sequences formed by combining a base character with separate combining diacritics. Precomposed forms are provided for commonly used accented letters to ensure efficient encoding and compatibility with legacy systems; for example, U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) is a single code point canonically equivalent to the base letter e followed by an acute accent.[4][13]

Combining characters allow for the dynamic construction of less common or context-specific forms by attaching diacritics to base letters, such as U+0065 LATIN SMALL LETTER E (e) followed by U+0301 COMBINING ACUTE ACCENT (◌́), which renders as é. The combining diacritics used with Latin letters are primarily drawn from the Combining Diacritical Marks block (U+0300–U+036F), which is shared across multiple scripts to promote reusability and limit the total number of code points required.[13]

Unicode Normalization Forms standardize the representation of these equivalent sequences to facilitate consistent processing and comparison of text. Normalization Form C (NFC) applies canonical decomposition and then recomposes, preferring precomposed forms where they exist; for instance, the word "résumé" in NFC uses the precomposed code point U+00E9 for both accented letters (U+0072 U+00E9 U+0073 U+0075 U+006D U+00E9), while its NFD equivalent decomposes each of them into a base letter plus a combining mark (U+0072 U+0065 U+0301 U+0073 U+0075 U+006D U+0065 U+0301). 
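
The equivalence between precomposed and combining sequences can be checked with Python's standard `unicodedata` module. This is a minimal sketch: the helper `code_points` is a hypothetical name introduced for illustration, and the exact decompositions depend on the Unicode version bundled with the interpreter, although the mapping for é has been stable since the earliest versions.

```python
import unicodedata

# "résumé" written two ways: precomposed é, and e + combining acute accent.
composed = "r\u00e9sum\u00e9"
decomposed = "re\u0301sume\u0301"


def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


nfc = unicodedata.normalize("NFC", decomposed)  # recomposes to precomposed é
nfd = unicodedata.normalize("NFD", composed)    # splits é into e + U+0301

if __name__ == "__main__":
    print(composed == decomposed)  # False: different code point sequences
    print(nfc == composed)         # True: same text after NFC
    print(code_points(nfc))        # U+0072 U+00E9 U+0073 U+0075 U+006D U+00E9
    print(code_points(nfd))        # U+0072 U+0065 U+0301 U+0073 U+0075 U+006D U+0065 U+0301
```

Comparing strings only after normalizing both sides to the same form is the usual way to avoid false mismatches between these representations.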
Conversely, Normalization Form D (NFD) fully decomposes precomposed characters into base and combining components without recomposition.[31][32]

This dual approach (precomposed for frequent usage, combining for flexibility) enables compact encoding of prevalent Latin characters while accommodating rare combinations without an excessive proliferation of dedicated code points; the Unicode Standard encodes over 1,000 such precomposed Latin letters across its Basic Latin, Latin-1 Supplement, and extended Latin blocks.[4]

Script Classification and Properties
In the Unicode Standard, the Latin script is designated with the four-letter code "Latn" as its script property value for the majority of its characters, including basic letters like U+0041 LATIN CAPITAL LETTER A and extended forms in various Latin blocks. This assignment is documented in the Unicode Character Database (UCD) file Scripts.txt, which catalogs script memberships to aid in text processing, such as script-specific rendering or regular expression matching. Punctuation and symbols shared across scripts, such as U+0020 SPACE, receive the "Zyyy" (Common) script value instead, reflecting their script-agnostic usage.[7][33] Directionality properties for Latin characters are primarily left-to-right, with most letters classified under the bidirectional class "L" (Left-to-Right) in the UCD's DerivedBidiClass.txt and UnicodeData.txt files. This ensures that Latin text flows horizontally from left to right by default, as specified in the Unicode Bidirectional Algorithm. Control characters and certain modifiers, however, are assigned "N" (Neutral), allowing them to inherit directionality from surrounding text without imposing their own. For instance, U+00AD SOFT HYPHEN is given the "BN" (Boundary Neutral) bidirectional class, which treats it as a non-directional break opportunity in mixed left-to-right and right-to-left scripts, influencing line layout without altering embedding levels.[34][35] Additional character properties include the general category, which partitions Latin characters into classes such as "Lu" (Uppercase Letter) for forms like U+0041 A and "Ll" (Lowercase Letter) for U+0061 a, or "Mn" (Nonspacing Mark) for diacritics like U+0301 COMBINING ACUTE ACCENT that attach to base letters without advancing horizontally. 
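
The general category and bidirectional class discussed above can be inspected directly with Python's standard `unicodedata` module. This is only an illustrative sketch: `describe` is a hypothetical helper, and the standard library does not expose the Script property itself, so looking up "Latn" versus "Zyyy" would require parsing Scripts.txt or using a third-party package.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (name, general category, bidirectional class) for one character."""
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


if __name__ == "__main__":
    for ch in ["A", "a", "\u0301", "\u00ad", " "]:
        print(f"U+{ord(ch):04X}", describe(ch))
    # U+0041 ('LATIN CAPITAL LETTER A', 'Lu', 'L')
    # U+0061 ('LATIN SMALL LETTER A', 'Ll', 'L')
    # U+0301 ('COMBINING ACUTE ACCENT', 'Mn', 'NSM')
    # U+00AD ('SOFT HYPHEN', 'Cf', 'BN')
    # U+0020 ('SPACE', 'Zs', 'WS')
```

Note that the combining acute accent carries the bidirectional class NSM (Nonspacing Mark) rather than L, which is why it takes its direction from the base letter it follows.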
Decomposition mappings, stored in field 5 of UnicodeData.txt (the sixth semicolon-separated field, counting the code point as field 0), break down precomposed characters (e.g., U+00E9 LATIN SMALL LETTER E WITH ACUTE into U+0065 e + U+0301 ◌́) to support canonical equivalence in text normalization processes. These properties are normative and derived from the UCD, enabling consistent handling across implementations.[36][37]

The UCD files, including Scripts.txt for script assignments and UnicodeData.txt for general categories, bidirectional classes, and decompositions, serve as the authoritative tools for querying these metadata. They inform critical applications, such as font rendering engines that apply the bidirectional algorithm to resolve display order in multilingual text, and search systems that leverage properties for accurate matching.[38]

A distinctive feature of the Latin script's encoding is its high density of paired uppercase and lowercase forms, which underpin robust case folding operations as defined in CaseFolding.txt, allowing for locale-independent case-insensitive comparisons across hundreds of cased letters. Furthermore, Latin is one of the few scripts with titlecase forms in the general category "Lt" (Titlecase Letter), provided for specific digraphs such as U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (ǅ) used in Serbian and Croatian orthographies, where the initial component is uppercased while the trailing one remains lowercase to preserve digraph integrity in titles.[39]

Unicode Blocks
Core Blocks
The core blocks for the Latin script in Unicode provide the foundational encoding for the most common Latin characters, ensuring compatibility with legacy systems while supporting a wide range of Western and Central European languages. These blocks—Basic Latin, Latin-1 Supplement, Latin Extended-A, and Latin Extended-B—collectively encode essential letters, punctuation, and controls, forming the basis for text processing in modern computing environments.[39] The Basic Latin block (U+0000–U+007F) encompasses 128 characters, fully compatible with the 7-bit ASCII standard, and includes the C0 control characters (U+0000–U+001F and U+007F) for device management, as well as 95 graphic characters from U+0020 to U+007E. It features 52 letters (26 uppercase A–Z at U+0041–U+005A and 26 lowercase a–z at U+0061–U+007A), the digits 0–9 (U+0030–U+0039), and common punctuation such as the period (U+002E) and exclamation mark (U+0021). Additionally, it contains spacing clones of diacritics, like the circumflex accent (U+005E), which serve as compatibility characters rather than combining marks. This block supports the core alphabet for English and other basic Latin-script languages, with no unassigned positions in Unicode 17.0.[39][40] The Latin-1 Supplement block (U+0080–U+00FF) adds another 128 characters, extending Basic Latin to cover Western European languages in alignment with ISO/IEC 8859-1. It includes C1 controls (U+0080–U+009F), precomposed accented letters such as á (U+00E1) for Spanish and Portuguese, ç (U+00E7) for French and Catalan, and ñ (U+00F1) for Spanish, along with symbols like the diaeresis (U+00A8) and acute accent (U+00B4) as spacing forms. Ordinal indicators ª (U+00AA) and º (U+00BA) are also encoded here, with glyph variations possible across fonts. 
This block enables representation of languages including Danish, Dutch, Finnish, German, Icelandic, Irish, Italian, Norwegian, and Swedish, with all positions assigned.[39][18]

Latin Extended-A (U+0100–U+017F) provides 128 characters focused on Central and Eastern European Latin-script languages, building on the previous blocks with precomposed forms featuring diacritics. It supports alphabets for Afrikaans, Basque, Breton, Croatian, Czech, Esperanto, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovenian, Turkish, and Welsh, including characters like ą (U+0105) for Polish, č (U+010D) for Czech and Croatian, and ā (U+0101) for Latvian. Five compatibility digraphs are included, such as ĳ (U+0133) for Dutch and ŉ (U+0149) for Afrikaans, though the latter is deprecated in favor of decomposition. Some characters resemble International Phonetic Alphabet (IPA) symbols, such as ŋ (U+014B), used in Northern Sámi. All code points are assigned, ensuring comprehensive coverage compatible with ISO/IEC 6937.[39][20]

Latin Extended-B (U+0180–U+024F) allocates 208 code points for less common Latin extensions, all of them assigned in Unicode 17.0, and is the first block with a significant non-European focus. It encodes letters for African languages per ISO 6438 (e.g., Ɓ U+0181 for implosive sounds), Vietnamese and Pinyin forms from GB 2312 and JIS X 0212 (e.g., ơ U+01A1), and Sami extensions. Examples include Ɗ (U+018A), used in Hausa, and ƣ (U+01A3, LATIN SMALL LETTER OI). The block also features caseless letters, such as the click letters at U+01C0–U+01C3, and some phonetic symbols, with characters arranged alphabetically by uppercase form.[39][22]

Extended Blocks
The extended blocks for the Latin script in Unicode accommodate specialized orthographies, historical variants, and characters for minority and indigenous languages, with most additions occurring after Unicode version 3.0 in 2000 to address gaps in representation for underrepresented writing systems. These blocks collectively provide around 900 characters, enabling digital support for languages such as various African, Nordic indigenous, and Southeast Asian orthographies that extend beyond basic Latin encoding. Key expansions include provisions for phonetic notations, medieval paleography, and modern minority scripts, reflecting ongoing efforts to preserve linguistic diversity. As of Unicode 17.0, the Latin script includes 1,492 assigned characters across all blocks.[4][1] The Latin Extended Additional block (U+1E00–U+1EFF) comprises 256 characters designed to support accented letters and digraphs for languages employing extended Latin alphabets, including Vietnamese tonal marks and Irish orthographic forms. Notable examples include Ḃ (U+1E02, LATIN CAPITAL LETTER B WITH DOT ABOVE) for Irish Gaelic distinctions and ḫ (U+1E2B, LATIN SMALL LETTER H WITH BREVE BELOW) for phonetic and historical uses in various European languages. This block was significantly expanded post-2000 to incorporate compatibility with legacy encodings and to cover additional precomposed forms for efficient text processing in these languages.[41][4] Latin Extended-C (U+2C60–U+2C7F) is a compact block of 32 characters primarily encoding orthographic extensions for minority languages, with a focus on African click consonants and forms analogous to those in Lepcha script. 
Characters such as Ⱡ (U+2C60, LATIN CAPITAL LETTER L WITH DOUBLE BAR) and ⱡ (U+2C61, LATIN SMALL LETTER L WITH DOUBLE BAR) extend the alphabet for such orthographies, while other entries include historic punctuation and letter variants added in Unicode 5.0 to support linguistic documentation.[4]

The Latin Extended-D block (U+A720–U+A7FF) allocates 224 code points for a diverse array of historical and orthographic characters, emphasizing medieval Latin abbreviations and notations related to Irish Ogham traditions. It includes forms like ꝏ (U+A74F, LATIN SMALL LETTER OO) found in medieval manuscripts and ꝕ (U+A755, LATIN SMALL LETTER P WITH SQUIRREL TAIL) used for scribal abbreviations in paleographic work, alongside letters for modern minority orthographies. Introduced in Unicode 5.0 and substantially expanded in 5.1 and later versions, this block aids in the digitization of historical texts and supports revived usages in Irish and other Celtic contexts.[24][4]

Latin Extended-E (U+AB30–U+AB6F) offers 58 assigned characters out of 64 code points, tailored to the orthographies of Sami and other indigenous Nordic languages, incorporating unique letter forms for phonetic accuracy in minority European scripts. Examples include ꬰ (U+AB30, LATIN SMALL LETTER BARRED ALPHA) and ꬶ (U+AB36, LATIN SMALL LETTER SCRIPT G WITH CROSSED-TAIL), letter forms drawn from specialized phonetic and dialectological notation. Added in Unicode 7.0, this block addresses the needs of Uralic and Finno-Ugric languages previously underrepresented in digital formats.[25][4]

Latin Extended-F (U+10780–U+107BF) includes 57 assigned characters out of 64 code points, supporting modifier letters for phonetic transcription, including nearly all IPA and extIPA symbols. 
These additions, introduced in Unicode 14.0, enable precise representation of sounds in linguistic research and documentation.[4][26]

Finally, Latin Extended-G (U+1DF00–U+1DFFF) reserves 256 code points, of which only a small portion is assigned so far. Introduced in Unicode 14.0, it contains additional characters for phonetic transcription to support advanced linguistic notations. This block exemplifies continued Unicode expansions for global linguistic equity.[27]

Auxiliary Blocks
Auxiliary blocks in Unicode encompass ranges outside the primary Latin script allocations that nonetheless include a substantial number of characters classified under the Latin script (either through the Script property value Latin or through a Script_Extensions set that includes Latin) for phonetic, modifier, and diacritical purposes. These blocks primarily support linguistic transcription systems, such as the International Phonetic Alphabet (IPA), and enable advanced notations in scholarly and orthographic contexts. While not dedicated exclusively to Latin orthographies, they contain characters that integrate seamlessly with Latin-based writing systems, often through composition with base letters from core blocks. Collectively, these contribute approximately 480 characters with the Latin script property, playing a crucial role in supporting linguistic research, phonetic transcription, and extended orthographies worldwide.[39]

The IPA Extensions block (U+0250–U+02AF) provides 96 characters dedicated to phonetic transcription in the IPA, including symbols like ɐ (U+0250 LATIN SMALL LETTER TURNED A) and ʙ (U+0299 LATIN LETTER SMALL CAPITAL B). These characters represent sounds not adequately covered by basic Latin letters, such as turned or inverted forms for vowels and consonants, and are classified as Latin script despite their specialized phonetic application. This block facilitates precise representation of global language phonologies in linguistic research and documentation.[42]

Spacing Modifier Letters (U+02B0–U+02FF) contains 80 characters used as spacing forms of diacritics, tone marks, and other modifiers, exemplified by ʾ (U+02BE MODIFIER LETTER RIGHT HALF RING) and ˀ (U+02C0 MODIFIER LETTER GLOTTAL STOP). These include superscript letters and symbols for prosodic features like stress or intonation, which are placed next to Latin base characters to denote phonetic modifications. 
The block supports notations in phonology and dialectology, ensuring compatibility with Latin script rendering.[43]

The Combining Diacritical Marks block (U+0300–U+036F) includes 112 non-spacing marks that attach to base characters, many of which are essential for Latin script compositions, such as ◌̈ (U+0308 COMBINING DIAERESIS). These marks, including acute, grave, and circumflex accents, are shared across scripts but predominantly used with Latin letters to represent vowel qualities or tonal distinctions in languages like French, German, and Vietnamese. Their combining nature allows for normalized decomposition and canonical equivalence in text processing.[44]

Phonetic Extensions (U+1D00–U+1D7F) allocates 128 characters for advanced phonetic notations beyond core IPA, including small capitals like ᴀ (U+1D00 LATIN LETTER SMALL CAPITAL A) and modifier letters like ᵃ (U+1D43 MODIFIER LETTER SMALL A). This block extends support for the Uralic Phonetic Alphabet (UPA) and other systems, with many characters bearing the Latin script property for integration into Latin-based transcriptions. It enables detailed representation of suprasegmental features in linguistic analysis.[45]

Phonetic Extensions Supplement (U+1D80–U+1DBF) adds 64 characters, incorporating additional IPA extensions such as ᶀ (U+1D80 LATIN SMALL LETTER B WITH PALATAL HOOK) and modifier letters such as ᶛ (U+1D9B MODIFIER LETTER SMALL TURNED ALPHA). These include hooks and curls for articulatory details, classified under Latin script to support scholarly phonetic work. The block complements earlier phonetic ranges by providing further granularity for rare or specialized sounds.[46]

Character Categories
Letters and Digraphs
The Latin script in Unicode begins with the 26 base letters of the English alphabet, provided as uppercase and lowercase pairs in the Basic Latin block, ranging from A (U+0041) to Z (U+005A) and a (U+0061) to z (U+007A).[40] These form the foundational alphabetic characters, supporting the core structure for most Latin-based writing systems worldwide.

Unicode encodes approximately 844 precomposed Latin letters (as of Unicode 17.0), including accented and modified forms that combine base letters with diacritics or other modifications into single code points for efficiency and compatibility.[2] Examples include Ä (U+00C4) for German and Swedish, Ñ (U+00D1) for Spanish, and Ʒ (U+01B7) for historical uses. These are distributed across various blocks, with over 100 forms supporting Western European orthographies in blocks like Latin-1 Supplement and Latin Extended-A, such as É (U+00C9) and Ł (U+0141).[18] More than 200 precomposed letters cater to African languages, primarily in Latin Extended-B and Latin Extended-D, including Ɖ (U+0189) for Ewe and Ɓ (U+0181) for Shona and other African languages representing implosive sounds.[22]

Digraphs and ligatures, which represent combined letter forms treated as single units in certain languages, are encoded as distinct characters to preserve orthographic integrity, with around 50 such forms across relevant blocks.[2] Notable examples include Æ (U+00C6) for Danish and Old English, Œ (U+0152) for French, and Ĳ (U+0132) for Dutch, alongside phonetic digraph letters such as ʣ (U+02A3, LATIN SMALL LETTER DZ DIGRAPH).[18] These encodings facilitate accurate representation without relying on decomposition, particularly for languages like Old English, where a ligature letter such as Æ denotes a specific phoneme.

Unicode also includes specialized variants of Latin letters to support diverse applications. 
Small capital letters, such as ᴀ (U+1D00), appear in phonetic and linguistic notations within the Phonetic Extensions block, providing compact uppercase-like forms for abbreviations and emphasis. Historical variants, like Ꝏ (U+A74E) used in medieval manuscripts for the "oo" ligature, are encoded in Latin Extended-D to support the digitization of medieval scribal traditions. Indigenous orthographies draw on unique forms, such as Ŋ (U+014A) in Northern Sámi to represent the ng sound. Proposals under consideration for Unicode 18.0 include further variants, such as LATIN CAPITAL LETTER R WITH LONG LEG (U+A7E2).[6]

Special case mapping rules apply to certain Latin characters, exemplified by the German ß (U+00DF), which uppercases to SS (U+0053 U+0053) rather than a single uppercase form, reflecting orthographic conventions, while a dedicated uppercase ẞ (U+1E9E) exists for contexts that need a single capital letter.[47] Unicode avoids encoding every possible digraph as a single character to prevent redundancy, favoring composition with base letters and modifiers where feasible, thus maintaining a balance between compatibility and extensibility.

Diacritics and Modifiers
The Latin script in Unicode incorporates a wide array of diacritical marks and modifiers to represent phonetic nuances, tonal distinctions, and orthographic variations across languages that use Latin-based writing systems. These elements are essential for encoding accented letters, such as those in French (é U+00E9), Vietnamese (ệ U+1EC7), or pinyin (nǐ), because they allow base letters to be modified without a dedicated precomposed character for every combination. Combining diacritics, which attach to the preceding base character, form the core of this system and can be stacked for complex notations such as tone marking in Southeast Asian languages.[44]

Combining diacritics encompass approximately 200 marks (as of Unicode 17.0) distributed across several Unicode blocks, primarily Combining Diacritical Marks (U+0300–U+036F), Combining Diacritical Marks Supplement (U+1DC0–U+1DFF), and Combining Diacritical Marks Extended (U+1AB0–U+1AFF). Common examples include the acute accent (◌́ U+0301) for stress in Spanish or the rising tone in Vietnamese, and the grave accent (◌̀ U+0300) for final stress in Italian and low tone in African languages like Yoruba. These marks support stacking, as in Vietnamese orthography, where a base vowel can carry both a vowel-quality mark and a tone mark: ử is u + horn (U+031B) + hook above (U+0309, the hỏi tone). Normalization applies canonical ordering to such sequences so that equivalent strings compare and render consistently across systems.[44][48]

Spacing modifiers, numbering around 100 characters, 80 of them in the Spacing Modifier Letters block (U+02B0–U+02FF), are independent spacing characters that modify the reading of adjacent letters without combining with them, often in phonetic transcription. For instance, the modifier letter apostrophe ʼ (U+02BC) denotes glottal stops or ejective sounds in African languages like Hausa, while the modifier letter left half ring ʿ (U+02BF) transliterates the pharyngeal consonant ʿayin in Latin-script renderings of Semitic languages. These are crucial for linguistic notations where non-spacing attachment would alter readability, such as in dialectology or African orthographies like those for Igbo.[43][49]

Phonetic symbols, totaling about 300 characters drawn from IPA Extensions (U+0250–U+02AF), Phonetic Extensions (U+1D00–U+1D7F), and related blocks, extend Latin forms for International Phonetic Alphabet (IPA) usage in linguistic analysis. The glottal stop ʔ (U+0294), a standalone caseless letter rather than a modifier, represents a catch in the voice in languages like Danish or in Tagalog transliterations, while letters like the bilabial click ʘ (U+0298) transcribe click consonants in Khoisan languages. These symbols integrate with Latin bases to transcribe sounds absent from standard alphabets, supporting scholarly and educational applications.[42]

Other specialized modifiers include the double acute (◌̋ U+030B) for Hungarian long vowels (e.g., ő) and the hook above (◌̉ U+0309) for the Vietnamese hỏi tone. Sequencing rules dictate the order of application (typically the base letter first, then non-spacing marks, followed by spacing or enclosing modifiers) to avoid rendering ambiguities. Format characters give control over ligatures in Latin contexts: the zero-width joiner (ZWJ, U+200D) requests a ligature between adjacent letters, as in historical texts, and the zero-width non-joiner (ZWNJ, U+200C) suppresses one, for example across the parts of a compound word. Additionally, compatibility normalization decomposes precomposed ligatures such as ﬁ (U+FB01) into f + i (U+0066 U+0069), facilitating searchable plain text, while the ligature code points themselves remain available for legacy encodings.[44]

Representations
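The stacking, canonical ordering, and compatibility decomposition described under Diacritics and Modifiers can be observed with Python's standard `unicodedata.normalize`. The following is a minimal sketch; the helper `code_points` and the variable names are illustrative assumptions, and the expected values in the comments follow from the canonical combining classes of the horn (216) and the hook above (230).

```python
import unicodedata


def code_points(text: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Vietnamese u with horn and hook above, typed with its marks in
# non-canonical order: u + hook above (U+0309) + horn (U+031B).
typed = "u\u0309\u031b"

# NFD applies canonical ordering, which places the horn before the hook above.
decomposed = unicodedata.normalize("NFD", typed)
# NFC reorders and then composes the sequence into one precomposed letter.
composed = unicodedata.normalize("NFC", typed)

print(code_points(decomposed))  # U+0075 U+031B U+0309
print(code_points(composed))    # U+1EED

# Compatibility normalization (NFKC) replaces the fi ligature U+FB01 with
# the plain letters f and i; canonical normalization (NFC) leaves it alone.
ligature = "\ufb01"
print(unicodedata.normalize("NFKC", ligature))             # fi
print(unicodedata.normalize("NFC", ligature) == ligature)  # True
```

Normalizing both sides to the same form before comparing or indexing strings is what allows differently typed but canonically equivalent sequences to match.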
Block Summary Table
The Unicode blocks containing characters classified under the Latin script (Script=Latn) provide a structured allocation of code points for letters, modifier letters, and related symbols used in Latin-based writing systems. As of Unicode version 17.0 (released September 2025), there are 20 such blocks encompassing a total of 1,492 characters.[50] These blocks have evolved over time, starting with 116 characters across the Basic Latin and Latin-1 Supplement blocks in version 1.0, followed by the addition of 1,376 more characters in subsequent versions to support expanded linguistic needs. Unicode 17.0 added characters to Latin Extended-G and Phonetic Extensions Supplement for additional support of indigenous and phonetic notations. The table below summarizes each block, including its code point range, the number of Latin-script characters (Script=Latn, excluding unassigned or reserved points, which are indicated in gray in official charts), the version in which the block was introduced, and a brief description of its purpose. This overview serves as a quick reference for developers and linguists working with Latin script encoding.[2]

| Block Name | Code Point Range | Number of Latin Characters | Version Added | Brief Purpose |
|---|---|---|---|---|
| Basic Latin | U+0000–007F | 52 | 1.0 | Core letters for ASCII compatibility |
| Latin-1 Supplement | U+0080–00FF | 64 | 1.0 | Western European accented letters |
| Latin Extended-A | U+0100–017F | 128 | 1.0 | Accented letters for European languages |
| Latin Extended-B | U+0180–024F | 182 | 1.0 | African, indigenous, and phonetic extensions |
| IPA Extensions | U+0250–02AF | 96 | 1.1 | Phonetic symbols for linguistics |
| Spacing Modifier Letters | U+02B0–02FF | 80 | 1.0 | Modifiers for tone, stress, and phonetics |
| Combining Diacritical Marks | U+0300–036F | 112 | 1.0 | Diacritics compatible with Latin letters |
| Latin Extended Additional | U+1E00–1EFF | 256 | 1.1 | Additional accented and dotted letters |
| Superscripts and Subscripts | U+2070–209F | 15 | 1.1 | Superscript/subscript Latin forms |
| Letterlike Symbols | U+2100–214F | 4 | 1.0 | Mathematical Latin variants (e.g., Å) |
| Number Forms | U+2150–218F | 41 | 1.0 | Roman numerals and fractions |
| Phonetic Extensions | U+1D00–1D7F | 108 | 4.0 | Extended phonetic notation |
| Phonetic Extensions Supplement | U+1D80–1DBF | 60 | 4.1 | Supplemental phonetic symbols |
| Latin Extended-C | U+2C60–2C7F | 32 | 5.0 | Letters for Uighur Latin orthography, the Uralic Phonetic Alphabet, and Claudian letters |
| Latin Extended-D | U+A720–A7FF | 160 | 5.0 | Phonetic, Mayanist, and medievalist forms |
| Latin Extended-E | U+AB30–AB6F | 32 | 7.0 | German dialectology (Teuthonista) and other phonetic letters |
| Latin Extended-F | U+10780–107BF | 57 | 14.0 | Modifier letters for IPA and extended IPA transcription |
| Latin Extended-G | U+1DF00–1DFFF | 80 | 14.0 | Letters for extended IPA and other phonetic notation |
| Alphabetic Presentation Forms (Latin ligatures) | U+FB00–FB4F | 7 | 1.1 | Typographic ligatures (e.g., fi) |
| Halfwidth and Fullwidth Forms (Latin) | U+FF00–FFEF | 52 | 1.1 | East Asian width variants of Latin letters |
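As a rough illustration of how the ranges in the table can be used, the sketch below maps a character to the listed block that contains it. The ranges are copied from the table; the names `LATIN_BLOCKS` and `block_of` are hypothetical, and the lookup is a plain linear scan rather than anything taken from the Unicode data files.

```python
from typing import Optional

# (block name, first code point, last code point), copied from the table above.
LATIN_BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Latin Extended-B", 0x0180, 0x024F),
    ("IPA Extensions", 0x0250, 0x02AF),
    ("Spacing Modifier Letters", 0x02B0, 0x02FF),
    ("Combining Diacritical Marks", 0x0300, 0x036F),
    ("Phonetic Extensions", 0x1D00, 0x1D7F),
    ("Phonetic Extensions Supplement", 0x1D80, 0x1DBF),
    ("Latin Extended Additional", 0x1E00, 0x1EFF),
    ("Superscripts and Subscripts", 0x2070, 0x209F),
    ("Letterlike Symbols", 0x2100, 0x214F),
    ("Number Forms", 0x2150, 0x218F),
    ("Latin Extended-C", 0x2C60, 0x2C7F),
    ("Latin Extended-D", 0xA720, 0xA7FF),
    ("Latin Extended-E", 0xAB30, 0xAB6F),
    ("Alphabetic Presentation Forms", 0xFB00, 0xFB4F),
    ("Halfwidth and Fullwidth Forms", 0xFF00, 0xFFEF),
    ("Latin Extended-F", 0x10780, 0x107BF),
    ("Latin Extended-G", 0x1DF00, 0x1DFFF),
]


def block_of(char: str) -> Optional[str]:
    """Return the listed block containing char, or None if it lies outside them."""
    cp = ord(char)
    for name, first, last in LATIN_BLOCKS:
        if first <= cp <= last:
            return name
    return None


print(block_of("\u00df"))  # Latin-1 Supplement
print(block_of("\u1e9e"))  # Latin Extended Additional
print(block_of("\u0394"))  # None (Greek capital delta is in none of these blocks)
```

Block membership is only a coarse guide: several of these blocks also contain characters whose Script property is not Latin, such as the digits and punctuation in Basic Latin, so a block lookup does not replace a check of the Script property itself.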
Character Inventory Table
The character inventory for the Latin script in Unicode encompasses 1,492 assigned characters as of version 17.0, excluding control characters and non-letter elements unless they are explicitly part of Latin orthographies.[50] These characters are distributed across multiple blocks, with properties such as general category (e.g., Lu for uppercase letters, Ll for lowercase letters, Mn for non-spacing marks) derived from the Unicode Character Database.[38] The table below presents representative examples grouped by block, including code points in hexadecimal, glyphs (where renderable in standard fonts), official names, categories, and the Unicode version in which they were first encoded. Reserved code points within Latin blocks are marked as such; compatibility characters (e.g., decomposed forms from legacy encodings) are noted where applicable. For the complete list, refer to the official Unicode charts and data files.[2]

| Block | Code Point | Glyph | Name | Category | Version Added |
|---|---|---|---|---|---|
| Basic Latin (U+0000–U+007F; excludes controls U+0000–U+001F, U+007F) | U+0041 | A | LATIN CAPITAL LETTER A | Lu | 1.0 |
| | U+0061 | a | LATIN SMALL LETTER A | Ll | 1.0 |
| | U+0042 | B | LATIN CAPITAL LETTER B | Lu | 1.0 |
| Latin-1 Supplement (U+0080–U+00FF) | U+00C0 | À | LATIN CAPITAL LETTER A WITH GRAVE | Lu | 1.0 |
| | U+00E0 | à | LATIN SMALL LETTER A WITH GRAVE | Ll | 1.0 |
| | U+00DF | ß | LATIN SMALL LETTER SHARP S | Ll | 1.0 |
| Latin Extended-A (U+0100–U+017F) | U+0106 | Ć | LATIN CAPITAL LETTER C WITH ACUTE | Lu | 1.0 |
| | U+0107 | ć | LATIN SMALL LETTER C WITH ACUTE | Ll | 1.0 |
| | U+0141 | Ł | LATIN CAPITAL LETTER L WITH STROKE | Lu | 1.0 |
| Latin Extended-B (U+0180–U+024F) | U+0181 | Ɓ | LATIN CAPITAL LETTER B WITH HOOK | Lu | 1.0 |
| | U+01BA | ƺ | LATIN SMALL LETTER EZH WITH TAIL | Ll | 1.0 |
| IPA Extensions (U+0250–U+02AF) | U+0253 | ɓ | LATIN SMALL LETTER B WITH HOOK | Ll | 1.1 |
| | U+025B | ɛ | LATIN SMALL LETTER OPEN E | Ll | 1.1 |
| | U+0283 | ʃ | LATIN SMALL LETTER ESH | Ll | 1.1 |
| | U+0292 | ʒ | LATIN SMALL LETTER EZH | Ll | 1.1 |
| Spacing Modifier Letters (U+02B0–U+02FF) | U+02BC | ʼ | MODIFIER LETTER APOSTROPHE | Lm | 1.0 |
| | U+02BB | ʻ | MODIFIER LETTER TURNED COMMA | Lm | 1.0 |
| Combining Diacritical Marks (U+0300–U+036F) | U+0300 | ◌̀ | COMBINING GRAVE ACCENT | Mn | 1.0 |
| | U+0301 | ◌́ | COMBINING ACUTE ACCENT | Mn | 1.0 |
| | U+0323 | ◌̣ | COMBINING DOT BELOW | Mn | 1.0 |
| Latin Extended Additional (U+1E00–U+1EFF) | U+1E02 | Ḃ | LATIN CAPITAL LETTER B WITH DOT ABOVE | Lu | 1.1 |
| | U+1E03 | ḃ | LATIN SMALL LETTER B WITH DOT ABOVE | Ll | 1.1 |
| | U+1E9B | ẛ | LATIN SMALL LETTER LONG S WITH DOT ABOVE | Ll | 2.0 |
| Latin Extended-C (U+2C60–U+2C7F) | U+2C60 | Ⱡ | LATIN CAPITAL LETTER L WITH DOUBLE BAR | Lu | 5.0 |
| | U+2C61 | ⱡ | LATIN SMALL LETTER L WITH DOUBLE BAR | Ll | 5.0 |
| | U+2C67 | Ⱨ | LATIN CAPITAL LETTER H WITH DESCENDER | Lu | 5.0 |
| Latin Extended-D (U+A720–U+A7FF) | U+A730 | ꜰ | LATIN LETTER SMALL CAPITAL F | Ll | 5.1 |
| | U+A731 | ꜱ | LATIN LETTER SMALL CAPITAL S | Ll | 5.1 |
| Latin Extended-E (U+AB30–U+AB6F) | U+AB30 | ꬰ | LATIN SMALL LETTER BARRED ALPHA | Ll | 7.0 |
| | U+AB31 | ꬱ | LATIN SMALL LETTER A REVERSED-SCHWA | Ll | 7.0 |
| Latin Extended-F (U+10780–U+107BF) | U+10780 | 𐞀 | MODIFIER LETTER SMALL CAPITAL AA | Lm | 14.0 |
| | U+10781 | 𐞁 | MODIFIER LETTER SUPERSCRIPT TRIANGULAR COLON | Lm | 14.0 |
| Latin Extended-G (U+1DF00–U+1DFFF) | U+1DF07 | 𝼇 | LATIN SMALL LETTER REVERSED ENG | Ll | 14.0 |
| | U+1DF1E | 𝼞 | LATIN SMALL LETTER S WITH CURL | Ll | 14.0 |
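The name and category columns of such an inventory can be looked up programmatically from the character database that ships with Python. The sketch below is illustrative only: the helper name `describe` is an assumption, and the output depends on the Unicode version bundled with the interpreter (reported by `unicodedata.unidata_version`), so recently encoded characters may be unnamed on older installations.

```python
import unicodedata


def describe(char: str) -> tuple:
    """Return (code point, official name, general category) for one character."""
    return (
        f"U+{ord(char):04X}",
        unicodedata.name(char, "<unnamed>"),  # fallback if the database lacks it
        unicodedata.category(char),
    )


# A few long-established characters from the blocks listed above.
for ch in ["A", "\u00df", "\u0141", "\u02bc", "\u0301"]:
    print(*describe(ch), sep=" | ")

print("Unicode data version:", unicodedata.unidata_version)
```

The bundled `unicodedata` module does not expose the Script property or the version in which a character was added; those values come from the `Scripts.txt` and `DerivedAge.txt` data files of the Unicode Character Database.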