Latin Extended-A
Latin Extended-A is a block of the Unicode Standard comprising 128 characters in the code point range U+0100 to U+017F. It extends the Basic Latin and Latin-1 Supplement blocks with precomposed Latin letters carrying diacritical marks such as macrons, breves, ogoneks, and carons, along with ligatures like Œ (U+0152) and special letters like Đ (U+0110).[1][2] The block draws on standards including ISO/IEC 8859 parts 2, 3, 4, and 9, as well as ISO/IEC 6937:1984, to support the representation of text in numerous languages that use Latin-based alphabets beyond the basic set.[1] It includes compatibility characters such as the ligatures IJ (U+0132) and ij (U+0133) for Dutch; the Croatian digraph LJ (U+01C7) belongs to Latin Extended-B rather than to this block.[2] Together with Basic Latin and Latin-1 Supplement, the block covers the orthographies of languages including Afrikaans, Albanian, Basque, Breton, Catalan, Croatian, Czech, Danish, Dutch, Esperanto, Estonian, Finnish, French, Frisian, Galician, German, Hungarian, Icelandic, Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Romany, Sámi, Slovak, Slovenian, Sorbian, Spanish, Swedish, Turkish, Welsh, and others. Its own contribution is the encoding of accented letters like Ā (U+0100) for Latvian and Ć (U+0106) for Polish.[1][2] As part of the Basic Multilingual Plane, it ensures compatibility with legacy systems while supporting modern multilingual computing needs.[1]
Overview
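As a rough illustration of the range U+0100 to U+017F, the following Python sketch tests whether a character falls inside the block. The names `LATIN_EXTENDED_A`, `in_latin_extended_a`, and `extended_a_chars` are hypothetical helpers introduced here for illustration; they are not part of any standard library.

```python
# Half-open range covering U+0100 through U+017F inclusive.
LATIN_EXTENDED_A = range(0x0100, 0x0180)


def in_latin_extended_a(ch: str) -> bool:
    """Return True if ch is a single character located in Latin Extended-A."""
    return len(ch) == 1 and ord(ch) in LATIN_EXTENDED_A


def extended_a_chars(text: str) -> list:
    """Collect the characters of text that belong to the block, in order."""
    return [ch for ch in text if in_latin_extended_a(ch)]


if __name__ == "__main__":
    # Assumes the literal below is stored in precomposed (NFC) form.
    print(extended_a_chars("Łódź"))
```

For the Polish place name "Łódź", the helper picks out Ł (U+0141) and ź (U+017A), while ó (U+00F3) is left out because it belongs to Latin-1 Supplement.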
Block Specifications
The Latin Extended-A Unicode block occupies the code point range from U+0100 to U+017F, encompassing 128 consecutive positions within the standard.[3] This range follows immediately after the Latin-1 Supplement block (U+0080 to U+00FF), extending support for additional Latin-based characters beyond the initial ISO Latin-1 set. The block was introduced in Unicode version 1.0 in October 1991 and was fully allocated with 128 characters in version 1.1 in June 1993.[4] All 128 code points in this block are assigned, with no reserved or unallocated positions. Every character has the Script property value Latin and a General Category of either Lu (uppercase letter) or Ll (lowercase letter). As a component of the Basic Multilingual Plane (BMP), which corresponds to plane 0 in the Unicode code space (U+0000 to U+FFFF), each character in the block fits in a single 16-bit code unit in UTF-16, which eases handling by legacy 16-bit systems. For a visual overview of the characters and their glyphs, refer to the official Unicode code chart U0100.pdf.[3]
Purpose and Scope
The Latin Extended-A block, spanning the code point range U+0100–U+017F, encodes Latin letters derived from the ISO/IEC 8859 series (excluding the Latin-1 subset in Part 1) and the ISO 6937 standard, providing support for extended European alphabets.[1][3] This block complements the Basic Latin (U+0000–U+007F) and Latin-1 Supplement (U+0080–U+00FF) blocks by incorporating precomposed characters with diacritical marks, ensuring compatibility with legacy 8-bit encodings used in European text processing.[1] Its primary aim is to represent the accented and modified Latin characters required by languages that extend beyond the basic ASCII and Latin-1 repertoires, focusing on orthographic needs in European scripts.[1] The block contains 62 uppercase/lowercase pairs at adjacent code points (counting the Turkish letters İ and ı at U+0130 and U+0131, which are not case partners of each other), the capital Ÿ (U+0178), whose lowercase ÿ (U+00FF) lies in Latin-1 Supplement, and three further lowercase letters with no uppercase form in the block: ĸ (U+0138), ŉ (U+0149), and ſ (U+017F).[3] It deliberately excludes phonetic symbols and non-Latin extensions, which are addressed in separate blocks like IPA Extensions (U+0250–U+02AF).[1] As of Unicode 17.0, released in 2025, the block contains no new character assignments and has remained stable since its full establishment in version 1.1, with all 128 code points allocated to maintain consistency in legacy support.[5][3]
Character Categories
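The composition of the block can be checked against the Unicode Character Database that ships with Python. The sketch below is illustrative only: it tallies the General Category of each code point and lists the letters whose default case mapping is not a single other character inside the block. The names `block`, `category_counts`, `has_partner_in_block`, and `exceptions` are hypothetical and introduced here for the example.

```python
import unicodedata
from collections import Counter

FIRST, LAST = 0x0100, 0x017F
block = [chr(cp) for cp in range(FIRST, LAST + 1)]

# Tally the General_Category value of every code point in the block.
category_counts = Counter(unicodedata.category(ch) for ch in block)


def has_partner_in_block(ch: str) -> bool:
    """True if the default case mapping of ch is one other character in the block."""
    other = ch.lower() if ch.isupper() else ch.upper()
    return len(other) == 1 and other != ch and FIRST <= ord(other) <= LAST


# Letters that do not form an ordinary in-block case pair.
exceptions = [ch for ch in block if not has_partner_in_block(ch)]

if __name__ == "__main__":
    print(dict(category_counts))
    print([f"U+{ord(ch):04X}" for ch in exceptions])
```

Run as written, the tally reports 63 uppercase (Lu) and 65 lowercase (Ll) letters. The exceptions are İ (U+0130) and ı (U+0131), whose default case partners are based on ASCII i and I, together with ĸ (U+0138), ŉ (U+0149), Ÿ (U+0178), and ſ (U+017F).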
Diacritic-Equipped Letters
The Latin Extended-A block (U+0100–U+017F) consists largely of uppercase and lowercase letter pairs modified by diacritics, which represent phonetic distinctions in various European languages. These letters support orthographies that mark vowel length, stress, nasalization, or palatalization. Some of the marks, such as the macron and breve, derive from ancient prosodic notation, while others, such as the caron and ogonek, were devised for later vernacular orthographies.[3][2]

Letters with macrons carry a horizontal bar (¯) above the base letter. The name comes from the Ancient Greek makrón ("long"); the mark was first used in Greco-Roman metrics to denote syllable length and was later adopted for vowel duration in languages like Latvian and Sami. In this block, macrons appear on A, E, I, O, and U, distinguishing long vowels from their short counterparts, as in Latvian orthography. Examples include Ā/ā (U+0100/U+0101), Ē/ē (U+0112/U+0113), Ī/ī (U+012A/U+012B), Ō/ō (U+014C/U+014D), and Ū/ū (U+016A/U+016B).[6][2]

The acute accent (´), named from Latin acūtus ("sharp"), a calque of Ancient Greek oxús for high pitch in prosody, marks stress or palatalization. Basic forms like Á/á are encoded in the Latin-1 Supplement (U+00C1/U+00E1); Extended-A extends the accent to consonants for languages such as Polish and Croatian. Here it appears on C, L, N, R, S, and Z, indicating soft or affricate sounds, as in Ć/ć (U+0106/U+0107) for the palatalized /tɕ/ in Polish. Other instances include Ĺ/ĺ (U+0139/U+013A), Ń/ń (U+0143/U+0144), Ŕ/ŕ (U+0154/U+0155), Ś/ś (U+015A/U+015B), and Ź/ź (U+0179/U+017A).[2]

Breves (˘), shaped like a small cup-shaped arc and named from Latin brevis ("short") in contrast to the macron, originated in Latin grammatical texts as a mark for short vowels. In Extended-A, they modify A, E, G, I, O, and U: Romanian uses Ă/ă (U+0102/U+0103) for the vowel /ə/, Turkish uses Ğ/ğ (U+011E/U+011F), and Esperanto uses Ŭ/ŭ (U+016C/U+016D). The remaining breve letters are Ĕ/ĕ (U+0114/U+0115), Ĭ/ĭ (U+012C/U+012D), and Ŏ/ŏ (U+014E/U+014F).[7][2]

Letters with a dot above (˙) denote distinct consonants or dotted vowels, a diacritic tracing to medieval scribal practices for emphasis or to avoid confusion with undotted forms. In this block, it appears on C, E, G, I, and Z for languages including Maltese, Lithuanian, Polish, and Turkish: Ċ/ċ (U+010A/U+010B), Ė/ė (U+0116/U+0117), Ġ/ġ (U+0120/U+0121), İ (U+0130), and Ż/ż (U+017B/U+017C). The code point after İ, U+0131, holds the dotless ı, which in Turkish is the lowercase of plain I rather than of İ.[2]

The ogonek (˛), a hook under the letter whose name means "little tail" in Polish, emerged in 15th-century Polish orthography to represent nasal vowels, inspired by Cyrillic forms and later adopted in Lithuanian. It attaches to A, E, I, and U in Extended-A, marking nasalization as in Polish Ą/ą (U+0104/U+0105) and Ę/ę (U+0118/U+0119), or Lithuanian Į/į (U+012E/U+012F) and Ų/ų (U+0172/U+0173).[2]

The caron (ˇ), also known as háček ("little hook" in Czech), evolved from a supralinear dot introduced by Jan Hus in early 15th-century Czech orthography to simplify digraphs and indicate palatalization in Slavic languages. In Extended-A, it appears on C, D, E, L, N, R, S, T, and Z, as in Č/č (U+010C/U+010D), Ď/ď (U+010E/U+010F), Ě/ě (U+011A/U+011B), Ľ/ľ (U+013D/U+013E), Ň/ň (U+0147/U+0148), Ř/ř (U+0158/U+0159), Š/š (U+0160/U+0161), Ť/ť (U+0164/U+0165), and Ž/ž (U+017D/U+017E).[2]

The block also contains letters with a circumflex, cedilla, tilde, ring above, double acute, diaeresis, or stroke, which appear in the table below. Other forms without a detachable diacritic include the ligatures Œ/œ (U+0152/U+0153), a fusion of O and E from Latin orthography denoting /œ/ in French, and IJ/ij (U+0132/U+0133), a Dutch digraph for /ɛi/.[2]

The following table lists the 63 uppercase/lowercase pairs associated with the block, including the ligatures and stroked letters. Of these, 62 pairs lie entirely within the block, and Ÿ (U+0178) is paired with ÿ (U+00FF) from Latin-1 Supplement. Each row gives the code points, both forms, the primary diacritic, and an example language based on standard usages.

| Code Point (Upper/Lower) | Uppercase | Lowercase | Primary Diacritic | Example Language |
|---|---|---|---|---|
| U+0100 / U+0101 | Ā | ā | Macron | Latvian |
| U+0102 / U+0103 | Ă | ă | Breve | Romanian |
| U+0104 / U+0105 | Ą | ą | Ogonek | Polish |
| U+0106 / U+0107 | Ć | ć | Acute | Polish |
| U+0108 / U+0109 | Ĉ | ĉ | Circumflex | Esperanto |
| U+010A / U+010B | Ċ | ċ | Dot Above | Maltese |
| U+010C / U+010D | Č | č | Caron | Czech |
| U+010E / U+010F | Ď | ď | Caron | Czech |
| U+0110 / U+0111 | Đ | đ | Stroke | Croatian |
| U+0112 / U+0113 | Ē | ē | Macron | Latvian |
| U+0114 / U+0115 | Ĕ | ĕ | Breve | Latin (short-vowel notation) |
| U+0116 / U+0117 | Ė | ė | Dot Above | Lithuanian |
| U+0118 / U+0119 | Ę | ę | Ogonek | Polish |
| U+011A / U+011B | Ě | ě | Caron | Czech |
| U+011C / U+011D | Ĝ | ĝ | Circumflex | Esperanto |
| U+011E / U+011F | Ğ | ğ | Breve | Azerbaijani |
| U+0120 / U+0121 | Ġ | ġ | Dot Above | Maltese |
| U+0122 / U+0123 | Ģ | ģ | Cedilla | Latvian |
| U+0124 / U+0125 | Ĥ | ĥ | Circumflex | Esperanto |
| U+0126 / U+0127 | Ħ | ħ | Stroke | Maltese |
| U+0128 / U+0129 | Ĩ | ĩ | Tilde | Vietnamese |
| U+012A / U+012B | Ī | ī | Macron | Latvian |
| U+012C / U+012D | Ĭ | ĭ | Breve | Latin (short-vowel notation) |
| U+012E / U+012F | Į | į | Ogonek | Lithuanian |
| U+0130 / U+0131 | İ | ı | Dot Above (İ); dotless (ı) | Turkish |
| U+0132 / U+0133 | IJ | ij | Ligature (IJ) | Dutch |
| U+0134 / U+0135 | Ĵ | ĵ | Circumflex | Esperanto |
| U+0136 / U+0137 | Ķ | ķ | Cedilla | Latvian |
| U+0139 / U+013A | Ĺ | ĺ | Acute | Slovak |
| U+013B / U+013C | Ļ | ļ | Cedilla | Latvian |
| U+013D / U+013E | Ľ | ľ | Caron | Slovak |
| U+013F / U+0140 | Ŀ | ŀ | Middle Dot | Catalan |
| U+0141 / U+0142 | Ł | ł | Stroke | Polish |
| U+0143 / U+0144 | Ń | ń | Acute | Polish |
| U+0145 / U+0146 | Ņ | ņ | Cedilla | Latvian |
| U+0147 / U+0148 | Ň | ň | Caron | Czech |
| U+014A / U+014B | Ŋ | ŋ | None (distinct letter eng) | Sami |
| U+014C / U+014D | Ō | ō | Macron | Latvian |
| U+014E / U+014F | Ŏ | ŏ | Breve | Latin (short-vowel notation) |
| U+0150 / U+0151 | Ő | ő | Double Acute | Hungarian |
| U+0152 / U+0153 | Œ | œ | Ligature (OE) | French |
| U+0154 / U+0155 | Ŕ | ŕ | Acute | Slovak |
| U+0156 / U+0157 | Ŗ | ŗ | Cedilla | Latvian |
| U+0158 / U+0159 | Ř | ř | Caron | Czech |
| U+015A / U+015B | Ś | ś | Acute | Polish |
| U+015C / U+015D | Ŝ | ŝ | Circumflex | Esperanto |
| U+015E / U+015F | Ş | ş | Cedilla | Turkish |
| U+0160 / U+0161 | Š | š | Caron | Czech |
| U+0162 / U+0163 | Ţ | ţ | Cedilla | Romanian (legacy; the standard letter is Ț, U+021A) |
| U+0164 / U+0165 | Ť | ť | Caron | Slovak |
| U+0166 / U+0167 | Ŧ | ŧ | Stroke | Sami |
| U+0168 / U+0169 | Ũ | ũ | Tilde | Vietnamese |
| U+016A / U+016B | Ū | ū | Macron | Latvian |
| U+016C / U+016D | Ŭ | ŭ | Breve | Esperanto |
| U+016E / U+016F | Ů | ů | Ring Above | Czech |
| U+0170 / U+0171 | Ű | ű | Double Acute | Hungarian |
| U+0172 / U+0173 | Ų | ų | Ogonek | Lithuanian |
| U+0174 / U+0175 | Ŵ | ŵ | Circumflex | Welsh |
| U+0176 / U+0177 | Ŷ | ŷ | Circumflex | Welsh |
| U+0178 (lower: U+00FF) | Ÿ | ÿ | Diaeresis | French |
| U+0179 / U+017A | Ź | ź | Acute | Polish |
| U+017B / U+017C | Ż | ż | Dot Above | Polish |
| U+017D / U+017E | Ž | ž | Caron | Czech |
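
The "Primary Diacritic" column can be cross-checked through canonical decomposition, since most precomposed letters in the block split under NFD into a base letter plus a combining mark. The following Python sketch is a minimal illustration under that assumption; `split_diacritics` is a hypothetical helper name, and only `unicodedata.normalize` and `unicodedata.name` come from the standard library.

```python
import unicodedata


def split_diacritics(ch: str):
    """Return (base letter, names of combining marks) from the NFD form of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = [unicodedata.name(mark) for mark in decomposed[1:]]
    return base, marks


if __name__ == "__main__":
    # Escapes are used so the sample does not depend on the file's normalization.
    for ch in ("\u010c", "\u0104", "\u0150", "\u0110", "\u0152"):
        base, marks = split_diacritics(ch)
        print(f"U+{ord(ch):04X}", base, marks or "no canonical decomposition")
```

Letters such as Č (U+010C), Ą (U+0104), and Ő (U+0150) split into a base letter and one combining mark. Stroked letters such as Đ (U+0110) and ligatures such as Œ (U+0152) have no canonical decomposition, so they come back unchanged with an empty list of marks.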
Ligatures and Special Forms
The Latin Extended-A block includes several ligatures and special character forms that represent fused or variant glyphs needed for specific languages and historical typography. These characters address orthographic needs beyond simple diacritic additions, such as combining letters into single units for phonetic or aesthetic reasons, or providing modified letter shapes for sounds in languages whose orthographies adapt the Latin alphabet.[3]

Ligatures in this block consist of the IJ and OE combinations. The capital ligature IJ (U+0132) and its lowercase counterpart ij (U+0133) are used in Dutch to represent the digraph "ij," which functions as a single vowel sound and is often treated as a single letter; graphically, it renders as an i and a j set closely together or joined.[3] Similarly, Œ (U+0152) and œ (U+0153) form the OE ligature, employed in French for words like "œuvre" to denote the /œ/ sound, and in Occitan for analogous diphthongs; visually, the two letters are joined into a single glyph, a form reminiscent of medieval scribal practices.[3]

Special forms encompass stroked letters, dotless variants, and historical shapes tailored to linguistic requirements. The D with stroke Đ (U+0110) and đ (U+0111) are used in Serbo-Croatian (Croatian and Serbian) to represent the /dʑ/ sound, as well as in Vietnamese and Sami languages; the stroke through the stem distinguishes the letter from plain D without altering its height.[3] In Polish, the L with stroke Ł (U+0141) and ł (U+0142) denote the /w/ sound, with a short diagonal stroke crossing the stem of the letter.[3] The I with dot above İ (U+0130) and the dotless ı (U+0131) support Turkish and Azerbaijani, which treat dotted and dotless i as separate letters: İ is the uppercase of the dotted i, as in "İstanbul", and ı is the lowercase of the plain capital I. Case conversion for these languages therefore needs language-specific rules.[3]

Additional special forms include the historical long s ſ (U+017F), a variant lowercase s used in printing until around the end of the 18th century, with a tall stem similar to f but without a full crossbar, and still relevant in Fraktur and Gaelic typography.[3] For Sami languages, the T with stroke Ŧ (U+0166) and ŧ (U+0167) represent /θ/, with a horizontal bar through the stem, while the eng Ŋ (U+014A) and ŋ (U+014B) encode the velar nasal /ŋ/, shaped like an n with a tail.[3] The Catalan legacy forms Ŀ (U+013F) and ŀ (U+0140), L with middle dot, combine an l with a centered dot (·) to write the first half of the geminate l·l in words like "col·legi," though modern usage favors the separate characters l and · (U+00B7).[3] A deprecated special form is the small letter n preceded by apostrophe, ŉ (U+0149), once used for the Afrikaans indefinite article but now discouraged in favor of the two-character sequence of the modifier letter apostrophe (U+02BC) followed by n.[3]

| Code Point | Character | Name | Primary Usage |
|---|---|---|---|
| U+0132 | IJ | LATIN CAPITAL LIGATURE IJ | Dutch digraph |
| U+0133 | ij | LATIN SMALL LIGATURE IJ | Dutch digraph |
| U+0152 | Œ | LATIN CAPITAL LIGATURE OE | French, Occitan |
| U+0153 | œ | LATIN SMALL LIGATURE OE | French, Occitan |
| U+0110 | Đ | LATIN CAPITAL LETTER D WITH STROKE | Serbo-Croatian, Vietnamese, Sami |
| U+0111 | đ | LATIN SMALL LETTER D WITH STROKE | Serbo-Croatian, Vietnamese, Sami |
| U+0141 | Ł | LATIN CAPITAL LETTER L WITH STROKE | Polish |
| U+0142 | ł | LATIN SMALL LETTER L WITH STROKE | Polish |
| U+0130 | İ | LATIN CAPITAL LETTER I WITH DOT ABOVE | Turkish |
| U+0131 | ı | LATIN SMALL LETTER DOTLESS I | Turkish |
| U+017F | ſ | LATIN SMALL LETTER LONG S | Historical typography |
| U+0166 | Ŧ | LATIN CAPITAL LETTER T WITH STROKE | Sami |
| U+0167 | ŧ | LATIN SMALL LETTER T WITH STROKE | Sami |
| U+014A | Ŋ | LATIN CAPITAL LETTER ENG | Sami |
| U+014B | ŋ | LATIN SMALL LETTER ENG | Sami |
| U+0140 | ŀ | LATIN SMALL LETTER L WITH MIDDLE DOT | Catalan (legacy) |
| U+0149 | ŉ | LATIN SMALL LETTER N PRECEDED BY APOSTROPHE | Afrikaans (deprecated) |
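
The Turkish casing rule and the legacy status of several special forms can be observed with Python's built-in string methods and the `unicodedata` module. The sketch below is a simplified illustration, not a complete locale-aware implementation: `turkish_lower`, `turkish_upper`, and `legacy_forms` are hypothetical names introduced here, and the helpers assume precomposed input and ignore sequences such as i followed by a combining dot above.

```python
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish rule: dotted I maps to i, plain I maps to dotless i."""
    # Remap the two capitals first, then use the language-neutral str.lower().
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def turkish_upper(text: str) -> str:
    """Uppercase with the Turkish rule: i maps to dotted I, dotless i maps to plain I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


# NFKC replaces the compatibility characters of the block with plain sequences.
legacy_forms = {
    ch: unicodedata.normalize("NFKC", ch)
    for ch in "\u0132\u0133\u013f\u0140\u0149\u017f"
}

if __name__ == "__main__":
    print(turkish_lower("\u0130STANBUL"), turkish_upper("istanbul"))
    # Python's default mappings are language-neutral:
    print(ascii("\u0130".lower()), ascii("\u0131".upper()), ascii("\u017f".upper()))
    for ch, plain in legacy_forms.items():
        print(f"U+{ord(ch):04X}", ascii(plain))
```

The default `str.lower()` turns İ into i followed by U+0307 COMBINING DOT ABOVE, and `str.upper()` turns both ı and ſ into ASCII letters, which is why Turkish and Azerbaijani text needs the explicit remapping shown in the helpers. Under NFKC, ŉ becomes U+02BC followed by n, and ŀ becomes l followed by the middle dot U+00B7.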