Frequency analysis
Frequency analysis is a fundamental technique in cryptanalysis: it studies how often letters, symbols, or groups of symbols occur in a ciphertext in order to infer the underlying plaintext. It is particularly effective against monoalphabetic substitution ciphers such as the Caesar cipher.[1] The method exploits the predictable patterns of natural languages. In English, letters such as 'E', 'T', 'A', and 'O' appear far more often than rare ones such as 'Z', 'Q', or 'X', so a cryptanalyst can map ciphertext symbols to plaintext letters by comparing frequency distributions.[1]

The technique dates to the 9th century, when the Arab polymath Al-Kindi (c. 801–873 CE) described it systematically in his treatise A Manuscript on Deciphering Cryptographic Messages, the earliest known recorded explanation of a cryptanalytic technique.[2] Al-Kindi tallied letter frequencies in both a sample of known plaintext and the encrypted text, then aligned the most common ciphertext symbols with the most frequent letters of the target language to decrypt the message partially or fully, a process that relied on early statistical insights derived from linguistic analysis.[2] This breakthrough weakened simple substitution ciphers and pushed cipher designers toward more complex methods, such as polyalphabetic substitution, that resist it.[2]

In practice, frequency analysis begins with collecting a sufficiently long ciphertext (ideally hundreds of characters) so that the statistics are reliable. The symbols are then ranked by occurrence and mapped to plaintext letters based on language norms. For instance, the most frequent ciphertext letter might correspond to 'E' in English, and trial substitutions then reveal patterns such as common words or digraphs (e.g., 'TH' or 'HE').[3] The technique is highly effective against classical ciphers, but its utility diminishes against polyalphabetic ciphers and modern computationally secure systems. It remains a cornerstone educational tool for understanding cryptographic vulnerabilities and has influenced fields beyond cryptology, including linguistics and data analysis.[4]

Fundamentals
Definition and Basic Principles
Frequency analysis is a cryptanalytic technique that counts and compares the relative frequencies of symbols, letters, or other units within a text or data stream to reveal underlying patterns or structures.[5] It exploits the statistical regularities inherent in natural languages and other datasets, where certain elements occur more often than others, allowing analysts to infer relationships between ciphertext and plaintext without prior knowledge of the encoding key.

At its core, frequency analysis relies on the principle that natural languages exhibit non-uniform distributions of characters: letters do not appear with equal probability. In English, for example, the letters follow an approximate order of frequency remembered by the mnemonic "etaoin shrdlu," where 'e' is the most common, followed by 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', and 'u'.[6] This uneven distribution arises from linguistic patterns, such as the prevalence of common words and grammatical structures. In cryptanalysis, the frequencies observed in an encoded text are compared to the frequencies expected for the source language. Significant matches or deviations help identify mappings or anomalies, because substitution ciphers preserve the original frequency profile even though they obscure individual symbols.[7]

Mathematically, frequency analysis computes relative frequencies as proportions of occurrences. The relative frequency f(x) of a symbol x is given by

$$f(x) = \frac{\text{count of } x}{\text{total count of all symbols}}$$

which yields values between 0 and 1, often expressed as percentages for interpretation.
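
The formula above translates almost directly into code. The following is a minimal Python sketch, not a reference implementation: the function name `relative_frequencies` and the sample sentence are illustrative assumptions, and the choice to fold case and drop non-letters before counting is one convention among several.

```python
from collections import Counter


def relative_frequencies(symbols):
    """Return f(x) = count(x) / total for every distinct symbol x."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing to count
    return {symbol: count / total for symbol, count in counts.items()}


# Hypothetical sample: keep letters only and fold case before counting.
sample = "Attack at dawn"
letters = [ch.upper() for ch in sample if ch.isalpha()]
freqs = relative_frequencies(letters)

# Print symbols from most to least frequent, as percentages.
for symbol, f in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{symbol}: {f:.1%}")
```

Because every value is a proportion of the same total, the values returned for a non-empty input sum to 1.
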
For instance, the letter 'e' in English text has a relative frequency of approximately 12.7% (published estimates vary somewhat with the corpus), making it a key indicator in analysis.[8] By highlighting where anticipated and actual distributions agree, this approach enables pattern recognition in encoded texts and serves as a prerequisite for more advanced cryptanalytic methods, without requiring assumptions about the specific encoding scheme.[9]

Frequency Distributions in Natural Language
In natural languages, letter frequencies exhibit non-uniform distributions shaped by linguistic structures, with vowels and common consonants appearing far more often than rare ones. These patterns are derived from large corpora of written texts and provide a foundation for analyzing textual regularity. For instance, in English, the letter 'E' occurs approximately 12.02% of the time, followed by 'T' at 9.10% and 'A' at 8.12%, based on a sample of 40,000 words.[7] The following table summarizes the relative frequencies of letters in English, highlighting the dominance of a few characters:

| Letter | Frequency (%) |
|---|---|
| E | 12.02 |
| T | 9.10 |
| A | 8.12 |
| O | 7.68 |
| I | 7.31 |
| N | 6.95 |
| S | 6.28 |
| R | 6.02 |
| H | 5.92 |
| D | 4.32 |
| L | 3.98 |
| U | 2.88 |
| C | 2.71 |
| M | 2.61 |
| F | 2.30 |
| Y | 2.11 |
| W | 2.09 |
| G | 2.03 |
| P | 1.82 |
| B | 1.49 |
| V | 1.11 |
| K | 0.69 |
| X | 0.17 |
| Q | 0.11 |
| J | 0.10 |
| Z | 0.07 |
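
To make the "dominance of a few characters" concrete, the table can be loaded into a dictionary and summed. This is a small Python sketch under stated assumptions: the names `ENGLISH_FREQ` and `top_share` are invented for illustration, and the percentages are copied from the table above rather than recomputed from a corpus.

```python
# Letter percentages copied from the table above.
ENGLISH_FREQ = {
    "E": 12.02, "T": 9.10, "A": 8.12, "O": 7.68, "I": 7.31, "N": 6.95,
    "S": 6.28, "R": 6.02, "H": 5.92, "D": 4.32, "L": 3.98, "U": 2.88,
    "C": 2.71, "M": 2.61, "F": 2.30, "Y": 2.11, "W": 2.09, "G": 2.03,
    "P": 1.82, "B": 1.49, "V": 1.11, "K": 0.69, "X": 0.17, "Q": 0.11,
    "J": 0.10, "Z": 0.07,
}


def top_share(freq, n):
    """Combined percentage of the n most frequent letters."""
    ranked = sorted(freq.values(), reverse=True)
    return sum(ranked[:n])


# The rounded table entries should add up to roughly 100%.
print(f"all letters: {sum(ENGLISH_FREQ.values()):.2f}%")
for n in (3, 6, 12):
    print(f"top {n:>2}: {top_share(ENGLISH_FREQ, n):.2f}%")
```

With these figures, the six most frequent letters (E, T, A, O, I, N) account for a little over half of all letters, which is the skew that frequency analysis exploits.
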
Cryptanalytic Applications
Substitution Ciphers
A monoalphabetic substitution cipher encrypts plaintext by replacing each letter with a unique ciphertext letter according to a fixed permutation, thereby preserving the relative frequency distribution of letters from the original language.[17] This preservation occurs because the substitution is a one-to-one mapping: the most frequent plaintext letters remain the most frequent in the ciphertext, only under different symbols.

To break such a cipher using frequency analysis, the cryptanalyst first tallies the frequencies of letters in the ciphertext and compares them to a known plaintext distribution, such as English, where 'E' accounts for roughly 12% of letters, followed by 'T' at about 9%.[7] The most frequent ciphertext letter is then hypothesized to map to 'E', the next to 'T' or 'A', and so on, forming an initial partial key. This mapping is refined iteratively by examining digraphs (pairs of letters) and trigraphs, whose expected frequencies in English (for example, 'TH' at about 2.7%) help resolve ambiguities and confirm substitutions.[18]

Cryptanalysts use frequency charts to visualize these distributions and the index of coincidence (IC) to validate mappings. The IC of a monoalphabetic ciphertext closely matches the English value of around 0.067, well above the roughly 0.038 expected for uniformly random letters, which indicates non-random repetition patterns.[15] Additionally, the chi-squared test quantifies the goodness of fit between observed and expected frequencies in a proposed decryption:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where O_i is the observed count of the i-th letter in the decrypted text and E_i is the expected count based on language frequencies; lower \chi^2 values suggest a better match to natural language.[19]

This method succeeds against monoalphabetic ciphers because the fixed mapping retains detectable frequency patterns. It fails against polyalphabetic ciphers, which use multiple substitutions to diffuse and flatten letter frequencies, approximating a uniform distribution.[20]

Step-by-Step Example
Consider the short ciphertext "URFUA FOBRF MOBYL KFRBF KXDMF XFLBB ZFEUO ZFRKM FEXUO FRKUO LFUAF RBFYA MFURF PMCC", encrypted via a simple substitution cipher in which each plaintext symbol (including the space) is replaced by a unique ciphertext letter.[21] The ciphertext is printed in groups of five for readability; the gaps between groups are not part of the message. Begin by counting the occurrences of each letter to identify patterns matching expected English frequencies, where spaces and letters like E, T, and A appear most often.

| Cipher Letter | Frequency |
|---|---|
| F | 16 |
| R | 7 |
| U | 7 |
| B | 6 |
| M | 5 |
| O | 5 |
| K | 4 |
| A | 3 |
| L | 3 |
| X | 3 |
| C | 2 |
| E | 2 |
| Y | 2 |
| Z | 2 |
| D | 1 |
| P | 1 |
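
The counting step can be reproduced mechanically. The sketch below is illustrative Python rather than part of the original example: it strips the print grouping, tallies the symbols, and also computes the index of coincidence mentioned under Substitution Ciphers, assuming the usual definition IC = Σ n_i(n_i − 1) / (N(N − 1)), which the article itself does not spell out. The function name `index_of_coincidence` is an assumption.

```python
from collections import Counter

CIPHERTEXT = (
    "URFUA FOBRF MOBYL KFRBF KXDMF XFLBB ZFEUO "
    "ZFRKM FEXUO FRKUO LFUAF RBFYA MFURF PMCC"
)

# The gaps only separate the printed groups of five, so drop them first.
symbols = CIPHERTEXT.replace(" ", "")
counts = Counter(symbols)


def index_of_coincidence(counts):
    """Probability that two symbols drawn without replacement are equal."""
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two symbols")
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


# Rank symbols from most to least frequent (ties broken alphabetically).
for symbol, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{symbol}: {count:2d}  ({count / len(symbols):.1%})")

print(f"IC = {index_of_coincidence(counts):.3f}")
```

Because this example enciphers the space as an ordinary symbol, its IC is not directly comparable with the letters-only figure of about 0.067 quoted earlier; a dominant space symbol pushes the value higher.
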
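
F is by far the most frequent symbol, so it is taken to be the space. Splitting the text on F exposes short words such as UR, UA, and RB. R and U are the next most frequent symbols, and reading UR as "IT" gives one plausible first partial key:

| Cipher | Plain |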
|---|---|
| F | (space) |
| R | T |
| U | I |
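
With those letters in place, RB reads as "TO", RKM suggests "THE", OBR suggests "NOT", and MOBYLK fills in as "ENOUGH", which extends the key:

| Cipher | Plain |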
|---|---|
| F | (space) |
| R | T |
| U | I |
| B | O |
| M | E |
| K | H |
| O | N |
| Y | U |
| L | G |
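
The remaining symbols follow from the partly deciphered words: UA is "IS", the one-letter word X is "A", KXDM is "HAVE", LBBZ is "GOOD", EUOZ is "MIND", and PMCC is "WELL". This gives the full key:

| Cipher | Plain |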
|---|---|
| F | (space) |
| R | T |
| U | I |
| B | O |
| M | E |
| K | H |
| O | N |
| A | S |
| Y | U |
| L | G |
| X | A |
| Z | D |
| E | M |
| C | L |
| D | V |
| P | W |
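
As a final check, the completed key can be applied to the ciphertext and the result scored with the chi-squared statistic described under Substitution Ciphers. This is a hedged Python sketch: the names `KEY`, `ENGLISH_FREQ`, `decrypt`, and `chi_squared` are illustrative, the key is copied from the last table, and the expected percentages are copied from the English frequency table earlier in the article. Scoring only the 26 letters (ignoring spaces) is an assumption made here for simplicity.

```python
from collections import Counter

CIPHERTEXT = (
    "URFUA FOBRF MOBYL KFRBF KXDMF XFLBB ZFEUO "
    "ZFRKM FEXUO FRKUO LFUAF RBFYA MFURF PMCC"
)

# Final key from the table above; "F" stands for the space.
KEY = {
    "F": " ", "R": "T", "U": "I", "B": "O", "M": "E", "K": "H", "O": "N",
    "A": "S", "Y": "U", "L": "G", "X": "A", "Z": "D", "E": "M", "C": "L",
    "D": "V", "P": "W",
}

# Letter percentages from the English frequency table earlier in the article.
ENGLISH_FREQ = {
    "E": 12.02, "T": 9.10, "A": 8.12, "O": 7.68, "I": 7.31, "N": 6.95,
    "S": 6.28, "R": 6.02, "H": 5.92, "D": 4.32, "L": 3.98, "U": 2.88,
    "C": 2.71, "M": 2.61, "F": 2.30, "Y": 2.11, "W": 2.09, "G": 2.03,
    "P": 1.82, "B": 1.49, "V": 1.11, "K": 0.69, "X": 0.17, "Q": 0.11,
    "J": 0.10, "Z": 0.07,
}


def decrypt(ciphertext, key):
    """Apply a substitution key; unmapped symbols are shown as '?'."""
    symbols = ciphertext.replace(" ", "")  # gaps are only print grouping
    return "".join(key.get(symbol, "?") for symbol in symbols)


def chi_squared(text, expected=ENGLISH_FREQ):
    """Sum of (O_i - E_i)^2 / E_i over the 26 letters; other characters are ignored."""
    letters = [ch for ch in text.upper() if ch in expected]
    if not letters:
        raise ValueError("text contains no letters")
    observed = Counter(letters)
    total_pct = sum(expected.values())  # normalise, since the table is rounded
    score = 0.0
    for letter, pct in expected.items():
        e = len(letters) * pct / total_pct  # expected count E_i
        score += (observed[letter] - e) ** 2 / e
    return score


plaintext = decrypt(CIPHERTEXT, KEY)
print(plaintext)
print(f"chi-squared, ciphertext symbols:  {chi_squared(CIPHERTEXT):.1f}")
print(f"chi-squared, recovered plaintext: {chi_squared(plaintext):.1f}")
```

Applying the key yields "IT IS NOT ENOUGH TO HAVE A GOOD MIND THE MAIN THING IS TO USE IT WELL", and the recovered plaintext scores a much lower chi-squared value than the raw ciphertext symbols, as expected for a correct decryption. On a text this short the statistic is noisy, so it is best treated as a relative score for comparing candidate keys.
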