Letter frequency
Letter frequency refers to the relative occurrence of each letter in the alphabet within a corpus of written text in a specific language, typically measured as a percentage of the total number of letters used.[1] This distribution varies by language and can reveal patterns in usage, with some letters appearing far more often than others due to phonetic, syntactic, and morphological structures.[2]

In the English language, the letter E is the most frequent, accounting for about 12.02% of all letters in a large sample of prose, followed by T at 9.10%, A at 8.12%, and O at 7.68%.[3] These frequencies are derived from extensive analyses of texts, such as novels or literary works, with slight variations occurring based on genre or specific corpus. For instance, studies of Shakespeare's complete works confirm E's dominance at around 12.5%, highlighting the consistency of these patterns from historical to modern English literature.[4]

One of the primary applications of letter frequency is in cryptanalysis, where it serves as a foundational tool for deciphering monoalphabetic substitution ciphers, such as the Caesar cipher. By comparing the frequency distribution in an encrypted text to known language norms (like the high occurrence of E in English), cryptanalysts can infer mappings between ciphertext and plaintext letters, often achieving decryption without the key.[1] This method, known as frequency analysis, exploits the non-uniform distribution of letters and has been effective since classical times, though modern ciphers mitigate it through polyalphabetic substitutions or larger blocks.[5]

In linguistics, letter frequency analysis aids in examining language evolution, dialectal differences, and text generation models.[2] For example, comparisons across Old, Middle, and Modern English show shifts in frequencies, such as the decline of certain letters like þ (thorn), reflecting phonological changes and orthographic standardization.[2] It also informs computational linguistics, where frequencies help predict word probabilities or optimize compression algorithms.[6] Frequencies are used to design keyboard layouts, such as the Dvorak simplified keyboard, which places common letters on the home row for efficiency.[7] Overall, these analyses underscore how letter frequencies provide insights into both the structure of languages and practical tools for decoding and processing text.

Fundamentals
Definition and Basic Concepts
Letter frequency denotes the statistical distribution of individual letters within a corpus of written text in a given language, quantified either as absolute counts of occurrences or, more commonly, as relative frequencies expressed in percentages. This measure captures how often each letter appears relative to the total number of letters analyzed, providing insight into the structural patterns of language use.[3]

In linguistic analyses, letter frequency pertains specifically to the 26 characters of the Latin alphabet (A through Z) for languages like English, with standard computations treating uppercase and lowercase forms as equivalent to ensure case insensitivity. These analyses focus exclusively on single letters, or monograms, and explicitly differ from studies of digraphs (two-letter sequences) or n-grams (sequences of multiple letters), which examine combinations rather than isolated characters.[8]

A foundational mnemonic for recalling the typical descending order of letter frequencies in English is "etaoin shrdlu," which approximates the sequence E, T, A, O, I, N, S, H, R, D, L, U based on empirical observations of text corpora. Such orders highlight the uneven distribution of letters, with vowels and common consonants dominating.[8]

Letter frequencies differ markedly across languages, influenced by phonetic inventories, orthographic systems that map sounds to symbols, and patterns of word formation and usage in everyday texts. For instance, vowel-heavy languages may exhibit higher frequencies for certain letters compared to consonant-dominant ones.[9]

Conceptually, letter frequencies constitute a discrete probability distribution across the alphabet, where the probability assigned to each letter represents its proportional occurrence, and the sum of all probabilities equals 100% (or 1 in decimal form), reflecting the exhaustive coverage of all letters in any text sample.[3]

Historical Context
The study of letter frequency originated in cryptanalysis, with the earliest known systematic use described in the 9th century by the Arab polymath Al-Kindi in his treatise A Manuscript on Deciphering Cryptographic Messages, where he introduced frequency analysis to break monoalphabetic substitution ciphers by comparing letter occurrences in ciphertext to known language distributions.[10] It evolved into a key tool in linguistics and statistics during the 19th century. Early observations in Western contexts appeared in the 1840s through Edgar Allan Poe's writings on cryptography, where he discussed intuitive rankings of letter commonality to aid in deciphering substitution ciphers. In his 1843 short story "The Gold-Bug," Poe illustrated this approach by having the protagonist match cipher symbols to plaintext letters based on their relative frequencies in English, such as the prevalence of 'e' over rarer letters like 'z'.[11]

Formal advancements followed in the mid-19th century, driven by efforts to break more complex polyalphabetic ciphers. Charles Babbage, known for his mechanical computing designs, applied frequency-based methods in the 1840s and 1850s to cryptanalyze the Vigenère cipher, identifying patterns in repeated letter sequences that revealed key lengths and positional variations in encryption. Independently, Friedrich Kasiski formalized similar techniques in his 1863 book Die Geheimschriften und die Dechiffrir-Kunst, where he examined distances between identical letter groups to determine periodicity, effectively extending single-letter frequency analysis to positional contexts in German and other languages.[12][13]

In the late 19th century, letter frequency data influenced practical technologies beyond cryptography. The QWERTY keyboard layout, patented in 1878 by Christopher Latham Sholes, was arranged, based on 1870s analyses of common English digrams, to separate frequent letter pairs and reduce typewriter key jams, with some consideration given to letter frequencies.[14]

The 20th century saw systematic tabulations beginning around World War I, led by William F. Friedman, who compiled detailed letter frequency tables from English texts and later became chief cryptanalyst of the U.S. Army's Signal Intelligence Service; his work in Military Cryptanalysis (published after the war) included distributions from samples of 40,000 words to support codebreaking. After World War II, early electronic computers made it possible to analyze vast corpora, marking a transition from manual counts to automated processing in linguistics, with foundational efforts in the 1950s and 1960s using machines to derive precise frequencies from texts like the 1961 Brown Corpus.[15][16]

English Language Analysis
Overall Letter Frequencies
In English language analysis, overall letter frequencies represent the relative proportions of each alphabet letter in large samples of text, expressed as percentages of total letter occurrences. These frequencies provide a baseline for understanding linguistic patterns and are derived from representative corpora such as the Brown Corpus, a 1-million-word collection of mid-20th-century American English prose across various genres. Standard values from such analyses show E as the most common letter at 12.02%, followed by T at 9.10% and A at 8.12%, reflecting the aggregate distribution across all word positions and text types.[3] Similar results emerge from larger modern datasets like the Google Books Ngram corpus, where E appears at approximately 12.5%, confirming the stability of these rankings.[8]

Several factors shape these overall frequencies. The inherent vowel-consonant balance in English plays a key role, with vowels (A, E, I, O, U) accounting for roughly 38% of all letters in the table below despite comprising only about 19% of the alphabet, due to their essential role in syllable formation and word structure.[8] High-frequency function words like "the," "of," and "and" disproportionately elevate counts for specific letters: E, T, and H, for instance, benefit significantly from "the" alone, which is the most common English word.[8] Genre and register also introduce variations; prose and narrative texts often exhibit higher E frequencies (around 12-13%) compared to technical or scientific writing, where denser terminology may increase consonants like C and reduce vowels overall.[8]

The following table ranks the 26 letters by frequency based on an analysis of approximately 182,000 letters from a 40,000-word English sample, closely aligning with Brown Corpus proportions:

| Rank | Letter | Frequency (%) |
|---|---|---|
| 1 | E | 12.02 |
| 2 | T | 9.10 |
| 3 | A | 8.12 |
| 4 | O | 7.68 |
| 5 | I | 7.31 |
| 6 | N | 6.95 |
| 7 | S | 6.28 |
| 8 | R | 6.02 |
| 9 | H | 5.92 |
| 10 | D | 4.32 |
| 11 | L | 3.98 |
| 12 | C | 2.78 |
| 13 | U | 2.76 |
| 14 | M | 2.41 |
| 15 | F | 2.23 |
| 16 | W | 2.09 |
| 17 | G | 2.03 |
| 18 | Y | 1.97 |
| 19 | P | 1.93 |
| 20 | B | 1.49 |
| 21 | V | 0.98 |
| 22 | K | 0.77 |
| 23 | J | 0.15 |
| 24 | X | 0.15 |
| 25 | Q | 0.10 |
| 26 | Z | 0.07 |
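
The table above can act as the "known language norm" for the frequency analysis of a Caesar cipher mentioned in the introduction. The following is a minimal sketch, not a reference implementation: it counts letters case-insensitively, converts counts to percentages, and picks the shift whose decryption best matches the table. The chi-squared score, the function names, and the `SAMPLE` sentence are assumptions chosen for illustration; the source does not prescribe a particular scoring method.

```python
from collections import Counter

# Percentages copied from the table above; they sum to roughly 100.
ENGLISH_PCT = {
    "E": 12.02, "T": 9.10, "A": 8.12, "O": 7.68, "I": 7.31, "N": 6.95,
    "S": 6.28, "R": 6.02, "H": 5.92, "D": 4.32, "L": 3.98, "C": 2.78,
    "U": 2.76, "M": 2.41, "F": 2.23, "W": 2.09, "G": 2.03, "Y": 1.97,
    "P": 1.93, "B": 1.49, "V": 0.98, "K": 0.77, "J": 0.15, "X": 0.15,
    "Q": 0.10, "Z": 0.07,
}

# Hypothetical sample text used only to demonstrate the functions below.
SAMPLE = (
    "Letter frequency analysis compares the distribution of letters in an "
    "encrypted message with the known distribution of letters in the "
    "language in which the message was written."
)


def letter_counts(text: str) -> Counter:
    """Count A-Z case-insensitively, skipping digits, spaces, and punctuation."""
    return Counter(ch.upper() for ch in text if ch.isascii() and ch.isalpha())


def letter_percentages(text: str) -> dict[str, float]:
    """Relative frequency of each letter as a percentage of all letters."""
    counts = letter_counts(text)
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()} if total else {}


def caesar_shift(text: str, shift: int) -> str:
    """Rotate A-Z and a-z by `shift` positions; other characters are unchanged."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def chi_squared(text: str) -> float:
    """Distance between observed counts and the counts the table predicts."""
    counts = letter_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    score = 0.0
    for ch, pct in ENGLISH_PCT.items():
        expected = total * pct / 100.0
        score += (counts[ch] - expected) ** 2 / expected
    return score


def guess_caesar_key(ciphertext: str) -> int:
    """Return the shift (0-25) whose reversal looks most like English."""
    return min(range(26), key=lambda k: chi_squared(caesar_shift(ciphertext, -k)))
```

A typical use would be `key = guess_caesar_key(secret)` followed by `caesar_shift(secret, -key)`. The approach needs enough ciphertext for the distribution to emerge; on very short messages the best-scoring shift may be wrong.
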
Positional Variations in English
In English, letter frequencies vary significantly depending on their position within a word: initial (first letter), medial (middle letters), or final (last letter). This positional variation arises from linguistic patterns, including morphological rules, phonetic preferences, and historical influences on word formation. For instance, while the overall frequency of letters is dominated by vowels like E and consonants like T, initial positions favor certain consonants and vowels due to common prefixes and word onsets. Positional frequencies can vary across different corpora due to differences in text genre, size, and era.[8]

Analysis of large English corpora reveals that the most common initial letters are T at approximately 15.9% and A at 15.5%, followed by I (8.2%), S (7.8%), and O (7.1%). These rankings differ markedly from overall frequencies, where E leads at around 12.7% and T at 9.1%, highlighting how word-initial positions prioritize sounds suitable for starting utterances, such as plosives and open vowels. Examples from high-frequency word lists like the Oxford 3000 illustrate this: words beginning with T (e.g., "the," "to," "that") and A (e.g., "and," "are," "as") dominate, reflecting their role in articles, prepositions, and conjunctions. In contrast, consonants like Q and J are rare initially, occurring in less than 0.1% of words each, as they typically require following vowels or specific digraphs (e.g., "queen," "jump").[8][17]

Medial positions, encompassing letters within words, show frequencies closer to overall patterns but with E remaining dominant at about 15%, due to its prevalence in suffixes, inflections, and stressed syllables (e.g., in "letter," "water"). Other frequent medial letters include A (8.5%) and R (7.2%), supporting the internal structure of multisyllabic words. The letter Y exhibits positional variability: as a consonant initially (e.g., "yes," ~2.5% initial frequency), it shifts to a vowel role medially and finally (e.g., "system," "happy"), where its frequency is around 2%, in line with its overall share of about 2%.[8]

Final positions further diverge, with E leading at roughly 19.2% (e.g., in "the" or the silent E of "time"), followed by S at 14.4% for plurals and possessives (e.g., "dogs," "world's"). This contrasts sharply with overall ranks, where S is only seventh at 6.3% but is boosted word-finally by grammatical endings. Letters like Z appear more frequently finally (e.g., "buzz," ~0.5% final vs. 0.07% overall), often in loanwords or onomatopoeia, underscoring how endings favor sibilants and silent E for phonetic closure. Peter Norvig's 2012 analysis of a massive Google Books corpus confirms these biases, showing Z's final frequency is over seven times its initial occurrence.[8][18]

The following table compares selected letter frequencies across positions (percentages rounded; based on corpus analyses of millions of words), illustrating key shifts relative to overall usage:

| Letter | Overall (%) | Initial (%) | Medial (%) | Final (%) |
|---|---|---|---|---|
| E | 12.7 | 1.5 | 15.0 | 19.2 |
| T | 9.1 | 15.9 | 8.0 | 8.6 |
| A | 8.2 | 15.5 | 8.5 | 2.0 |
| S | 6.3 | 7.8 | 6.0 | 14.4 |
| O | 7.5 | 7.1 | 7.8 | 4.7 |
| Z | 0.07 | 0.01 | 0.05 | 0.5 |
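
Positional percentages like those in the table above can be tallied with a short script. The sketch below is illustrative only and rests on several assumptions not stated in the source: words are runs of a-z (after lowercasing), straight apostrophes are dropped so that "world's" counts as "worlds", and a one-letter word contributes to both the initial and the final tally. Different conventions would shift the numbers.

```python
import re
from collections import Counter

POSITIONS = ("initial", "medial", "final")


def positional_percentages(text: str) -> dict[str, dict[str, float]]:
    """Percentage of each letter among word-initial, medial, and final letters."""
    counts = {pos: Counter() for pos in POSITIONS}
    # A word is a run of a-z, optionally joined by straight apostrophes.
    for raw in re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()):
        word = raw.replace("'", "")
        counts["initial"][word[0]] += 1
        counts["final"][word[-1]] += 1
        counts["medial"].update(word[1:-1])  # empty for 1- and 2-letter words
    result = {}
    for pos in POSITIONS:
        total = sum(counts[pos].values())
        result[pos] = (
            {ch: 100.0 * n / total for ch, n in counts[pos].items()} if total else {}
        )
    return result
```

Each position is normalized separately, so the percentages within "initial", "medial", and "final" each sum to 100, matching how the table's columns are meant to be read.
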
Cross-Linguistic Comparisons
Frequencies in Indo-European Languages
Indo-European languages display notable similarities in letter frequencies owing to their common Proto-Indo-European roots, which influence phonological patterns such as vowel-consonant alternation. Across the family, vowels generally account for 40-50% of letters in written texts, a trend rooted in the phonetic structure favoring open syllables and vowel harmony in ancestral forms. However, branches like Romance, Germanic, and Slavic diverge due to orthographic reforms, dialectal influences, and script variations, leading to shifts in the prominence of specific letters. These patterns are derived from large corpora analyses, providing insights into linguistic evolution within the family.[19]

In Romance languages, which evolved from Vulgar Latin, vowels dominate frequency tables, often exceeding 45% combined, with E and A frequently topping the list due to their roles in inflectional endings and common roots. French exhibits a high frequency for E at 14.5%, surpassing A's 7.6%, a pattern attributed to the proliferation of schwa sounds and liaison in spoken French reflected in writing. Spanish, by contrast, emphasizes vowel balance with A at 12.5% and E at 13.2%, stemming from its consistent phonemic orthography that preserves Latin vowel qualities. Italian shows a more even distribution among vowels, with I and O in relative balance (I at approximately 10.2% and O at 10.0%) alongside E (11.5%) and A (10.9%), highlighting the language's melodic prosody and avoidance of diphthongs.[20][21][22]

Germanic languages, including English, German, and Dutch, tend toward consonant-heavy profiles compared to Romance counterparts, with vowels around 40% but E often elevated due to grammatical markers. Compared to English (E 12.1%, A 8.6%), German amplifies E to 16.0% while diminishing A to 6.3%, influenced by umlaut shifts and compound word formations that favor certain vowels. Dutch mirrors this consonant emphasis, with E at 19.3% and A at 7.8%, similar to English but with higher frequencies for IJ digraphs in informal texts, reflecting shared West Germanic traits. These variations underscore how sound changes, like the High German consonant shift, alter frequency distributions.[23][24]

Slavic languages present additional complexities due to the use of Cyrillic script in many cases, complicating direct comparisons with Latin-based systems and requiring transliteration for analysis. In Russian, for instance, the vowel О (transliterated as O) holds the highest frequency at 11.2%, followed by А (A) at 7.6%, with total vowels comprising about 45%, a pattern echoing Indo-European vowel prominence but adapted to palatalization and stress rules.[25] Transliteration challenges arise because Cyrillic letters like Ё (Yo) or Ъ (hard sign) lack exact Latin equivalents, and frequencies shift when mapping to Romanized forms, potentially inflating certain consonants like N from Н. This orthographic divergence highlights how script choice affects perceived frequencies in cross-linguistic studies.

To illustrate comparisons, the following table presents top letter frequencies (in percentages) for representative Indo-European languages using Latin script, benchmarked against English. Data is rounded for clarity and based on large text corpora.

| Letter | English | French | Spanish | German | Italian |
|---|---|---|---|---|---|
| E | 12.1 | 14.5 | 13.2 | 16.0 | 11.5 |
| A | 8.6 | 7.6 | 12.5 | 6.3 | 10.9 |
| I | 7.3 | 7.2 | 6.9 | 7.6 | 10.2 |
| O | 7.5 | 5.4 | 9.0 | 2.8 | 10.0 |
| N | 7.2 | 7.3 | 7.1 | 9.6 | 7.0 |
| S | 6.7 | 8.0 | 7.4 | 6.4 | 5.5 |
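
One way to make the comparison in the table above concrete is to treat each language's column as a small frequency profile and measure how far an observed text lies from each. The sketch below is a rough illustration under stated assumptions: it uses only the six letters listed in the table, a plain sum of squared differences as the distance (a choice made here, not taken from the source), and it does not fold accented letters such as é or ä onto their base letters. With so few letters it should not be expected to identify languages reliably.

```python
# Columns of the table above, keyed by language.
PROFILES = {
    "English": {"E": 12.1, "A": 8.6, "I": 7.3, "O": 7.5, "N": 7.2, "S": 6.7},
    "French": {"E": 14.5, "A": 7.6, "I": 7.2, "O": 5.4, "N": 7.3, "S": 8.0},
    "Spanish": {"E": 13.2, "A": 12.5, "I": 6.9, "O": 9.0, "N": 7.1, "S": 7.4},
    "German": {"E": 16.0, "A": 6.3, "I": 7.6, "O": 2.8, "N": 9.6, "S": 6.4},
    "Italian": {"E": 11.5, "A": 10.9, "I": 10.2, "O": 10.0, "N": 7.0, "S": 5.5},
}
LETTERS = "EAIONS"


def observed_profile(text: str) -> dict[str, float]:
    """Percentage of each tracked letter among all alphabetic characters."""
    letters = [ch.upper() for ch in text if ch.isalpha()]
    if not letters:
        raise ValueError("text contains no letters")
    return {ltr: 100.0 * letters.count(ltr) / len(letters) for ltr in LETTERS}


def squared_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Sum of squared differences over the tracked letters."""
    return sum((p[ltr] - q[ltr]) ** 2 for ltr in LETTERS)


def closest_language(profile: dict[str, float]) -> str:
    """Name of the table column nearest to the given profile."""
    return min(PROFILES, key=lambda lang: squared_distance(profile, PROFILES[lang]))
```

For a real text, `closest_language(observed_profile(text))` gives the nearest column; a more serious comparison would use full 26-letter (or larger) distributions and longer samples.
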
Frequencies in Non-Indo-European Languages
Letter frequency analysis in non-Indo-European languages reveals diverse patterns shaped by unique linguistic structures and writing systems, ranging from consonant-dominant abjads to syllabaries and featural alphabets. Unlike the more uniform alphabetic scripts common in Indo-European languages, these systems often prioritize morphological or phonological units over isolated letters, leading to skewed distributions that reflect root-based morphology or harmony rules.[26]

In Semitic languages such as Arabic and Hebrew, which employ abjad scripts that primarily denote consonants, frequencies underscore a consonant-heavy profile aligned with triconsonantal root systems. For Arabic, corpus-based studies from a 40-million-word collection identify Alif (ا, romanized as "a") as the most frequent letter at approximately 15.7%, followed by Lam (ل, "l") and Yeh (ي, "y"), emphasizing the prevalence of certain consonants in derivational morphology.[27][28] Hebrew exhibits similar trends, with semi-vowels dominating: Yod (י, "y") at 11.06%, He (ה, "h") at 10.87%, Waw (ו, "w") at 10.38%, Aleph (א, glottal stop) at 6.34%, and Bet (ב, "b") at 4.74%, based on a 1.2-million-character literary corpus.[29] These patterns highlight how unwritten vowels in abjads shift focus to consonantal skeletons for frequency counts.[30]

Sino-Tibetan languages present additional complexities due to logographic or mixed scripts, necessitating romanization for alphabetic frequency studies. In Chinese, Pinyin transcription yields vowel-dominant distributions, with "i" at 14.29%, "n" at 11.24%, "a" at 10.78%, and "e" at 8.20%, drawn from extensive text analyses that contrast sharply with the character frequencies of the native hanzi system, where thousands of logograms replace letters.[31][32]

Japanese kana, a syllabary integrated with kanji, shows high vowel usage reflective of its moraic phonology. Measured as shares among the five vowels (the values sum to 100%), /u/ stands at 23.47%, /a/ at 23.42%, /i/ at 21.54%, /o/ at 20.63%, and /e/ at 10.94%, based on a large newspaper corpus.[33] This vowel prominence facilitates smooth syllable formation but differs from alphabetic letter counts.

In other families, scripts like Korean Hangul and the Turkish Latin alphabet illustrate balanced yet phonologically constrained distributions. Korean's featural alphabet is notably vowel-heavy, with vowels accounting for over 50% of occurrences in large corpora, including combined /a/ and /e/ sounds approaching 20% due to the language's syllable-block structure.[34] Turkish vowel harmony, which requires vowels within words to share front/back and rounded/unrounded features, affects frequencies, promoting balanced use of sets like {a, ı, o, u} (back) and {e, i, ö, ü} (front), with top letters including "a" at 11.6%, "e" at 9.4%, and "i" at 8.6% in analyzed texts.[35][36][37]

Script variations pose key challenges for cross-linguistic frequency comparisons: logographic systems like Chinese hanzi lack discrete letters, requiring phonetic proxies like Pinyin that may not capture native usage; abjads omit vowels, skewing counts toward consonants; and syllabaries like kana treat combined units, blurring individual letter roles.[30][26]

The following table summarizes representative frequencies using romanized equivalents for comparability:

| Language | Top Letters (Romanized) | Frequencies (%) | Source Corpus Type |
|---|---|---|---|
| Arabic | a (Alif), l (Lam), y (Yeh) | 15.7, ~10.5, ~9.2 | Multi-million word texts [28] |
| Hebrew | y (Yod), h (He), w (Waw) | 11.06, 10.87, 10.38 | 1.2M-character literary [29] |
| Chinese (Pinyin) | i, n, a | 14.29, 11.24, 10.78 | Large Pinyin-transcribed texts [31] |
| Japanese (Kana) | u, a, i (shares among vowels) | 23.47, 23.42, 21.54 | Newspaper lexical corpus [33] |
| Korean (Hangul) | Vowels (combined a/e) | ~20 (combined) | 85M-character general texts [34] |
| Turkish | a, e, i | 11.6, 9.4, 8.6 | Literary mix texts [35] |
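
Script-specific conventions matter even for Latin-alphabet rows such as Turkish in the table above. Turkish distinguishes dotted i/İ from dotless ı/I, and default case folding in many programming languages merges them, which would distort counts for "i" and "ı". The sketch below is a minimal illustration, not a complete Turkish text normalizer: the function names are hypothetical, and the harmony check covers only the front/back distinction described in the text, not rounding.

```python
from collections import Counter

BACK_VOWELS = set("a\u0131ou")          # a, dotless i, o, u
FRONT_VOWELS = set("ei\u00f6\u00fc")    # e, i, o-umlaut, u-umlaut
VOWELS = BACK_VOWELS | FRONT_VOWELS


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: I -> dotless i (U+0131), U+0130 -> i."""
    # Replace both capitals first so the generic lower() never sees them.
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


def turkish_letter_percentages(text: str) -> dict[str, float]:
    """Percentage of each letter among all alphabetic characters."""
    letters = [ch for ch in turkish_lower(text) if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: 100.0 * n / total for ch, n in counts.items()} if total else {}


def follows_backness_harmony(word: str) -> bool:
    """True if every vowel in the word is back, or every vowel is front."""
    vowels = [ch for ch in turkish_lower(word) if ch in VOWELS]
    return all(v in BACK_VOWELS for v in vowels) or all(
        v in FRONT_VOWELS for v in vowels
    )
```

Loanwords and compounds often break harmony, so a share of non-harmonic words in a corpus is expected rather than a sign of a counting error.
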