Punycode
Punycode is an instance of the Bootstring encoding algorithm tailored for converting Unicode strings into compact, reversible ASCII representations, primarily to support internationalized domain names (IDNs) in applications (IDNA).[1] It enables the representation of domain labels using non-ASCII characters by mapping them to ASCII-safe strings that preserve the original string's length, order, and uniqueness, with Punycode-encoded labels in domain names prefixed by "xn--" to denote ASCII-compatible encoding (ACE).[2] Defined in RFC 3492 by Adam M. Costello and published by the Internet Society in March 2003, Punycode serves as a transfer encoding syntax that ensures compatibility with the ASCII-only Domain Name System (DNS).[1] The core purpose of Punycode is to facilitate the global use of domain names in diverse scripts and languages without requiring changes to the underlying DNS protocol, which historically limited labels to letters, digits, and hyphens in ASCII.[2] At a high level, the encoding process begins by placing all basic code points (the alphanumeric ASCII characters A–Z, a–z, 0–9) at the start of the output string, unchanged.[1] Non-basic code points (Unicode characters beyond ASCII) are then encoded as their delta values—incremental differences from the prior code point—using a generalized variable-length integer scheme based on a radix of 36 (0-9 and a-z), with digits inserted in a way that optimizes for common shorter strings through a dynamic bias adjustment mechanism.[1] This bootstring variant uses fixed parameters, including a base of 36, skew of 38, dampening factor of 700, and initial bias of 72, to balance efficiency and simplicity while guaranteeing round-trip conversion without data loss.[1] As an integral part of the IDNA framework outlined in RFC 3490, Punycode handles the ToASCII conversion of Unicode labels (U-labels) into ACE labels (A-labels) for DNS queries and storage, working alongside application-level normalization (such as to Unicode Normalization Form C) to ensure consistency.[2] Although the IDNA specification was revised in 2010 (RFCs 5890–5895) to address security concerns and expand Unicode support, Punycode itself was retained unchanged as the encoding method, with no subsequent revisions by the concluded IETF's IDNA Bis working group.[3][4] As of 2025, Punycode remains the standard encoding for IDNA without further revisions. Today, it is universally implemented in web browsers, operating systems, and DNS software to enable seamless handling of IDNs, supporting over 100 scripts and promoting internet internationalization.[5]History and Development
Standardization
Punycode was developed in 2003 by Adam M. Costello as part of broader initiatives to support internationalized domain names (IDNs) within the Domain Name System (DNS).[6] This effort addressed the need for encoding Unicode characters into ASCII-compatible strings to enable non-Latin scripts in domain names without modifying the underlying DNS infrastructure. The formal standardization of Punycode occurred through its publication as Request for Comments (RFC) 3492 by the Internet Engineering Task Force (IETF) in March 2003.[6] Titled "Punycode: A Bootstring Encoding of Unicode for Internationalized Domain Names in Applications (IDNA)," the document specifies Punycode as a simple, efficient transfer encoding syntax tailored for IDNA applications.[7] It defines the algorithm's structure, including its use of a bootstring method to represent any string of Unicode code points as a sequence of basic code points.[6] Punycode's role is closely integrated with precursor IETF standards that form the IDNA framework. RFC 3490, published concurrently in March 2003, outlines the overall IDNA mechanism for handling internationalized domain names in applications, designating Punycode as the encoding method for converting Unicode labels to ASCII. Complementing this, RFC 3491 from the same period defines Nameprep, a profile of the Stringprep protocol (RFC 3454) for preprocessing IDN labels to ensure consistency before Punycode encoding. These documents collectively establish Punycode's position as the canonical encoding solution within IDNA.[8] Since its initial publication, RFC 3492 has received minor updates and errata corrections. It was updated by RFC 5891 in August 2010, which provides updated definitions and a protocol framework for IDNA while explicitly preserving the Punycode algorithm unchanged.[5] The RFC Editor maintains a list of verified errata for RFC 3492, including clarifications on edge cases in the encoding process, though no major revisions to the core specification have been issued.[9]Origin of the Name
The term "Punycode" was coined by Adam M. Costello in 2002 as the name for his proposed encoding scheme for internationalized domain names.[10] Costello selected the name to evoke a sense of compactness and efficiency, with "puny" highlighting three key attributes of the encoding: the limited repertoire of characters (restricted to ASCII) used in the output strings, the brevity of those encoded strings, and the minimal size of the implementation required to process them.[10] This playful nomenclature also rhymes with "Unicode," underscoring Punycode's purpose of transforming rich Unicode strings into a constrained ASCII format suitable for domain name systems.[10] Costello introduced the name in an email to the IDN working group, renaming an earlier draft (AMC-ACE-Z) to Punycode without altering the underlying algorithm.[10] Punycode represents a specific instantiation of the more general "bootstring" encoding method, which Costello developed as a flexible framework for representing sequences of integers in a compact, human-readable string format using a small base alphabet; the parameters were tailored for IDNA compatibility.[11]Core Concepts
Bootstring Encoding
Bootstring encoding is a general method for representing strings of code points from a larger set using a smaller alphabet of basic code points, where basic code points are copied directly into the output string, while non-basic code points are encoded as position-dependent deltas following a delimiter.[12] This approach belongs to a family of encodings designed to map arbitrary-length sequences of non-negative integers to shorter strings in a compact, reversible manner, leveraging the smaller alphabet to minimize length while preserving uniqueness.[12] In Punycode, the bootstring parameters are specifically tuned for efficiency: it uses a base of 36, corresponding to the alphanumeric characters a-z and 0-9; a delimiter of U+002D (hyphen-minus) to separate the basic code point prefix from the encoded deltas; and an initial bias of 72 that adapts dynamically to favor the encoding of common code points by adjusting the mapping threshold over successive insertions.[12] These parameters enable Punycode to handle Unicode strings by treating code points as non-negative integers and encoding them into ASCII-compatible output. The bootstring method is particularly suited for Internationalized Domain Names in Applications (IDNA) because it provides a bijective mapping between Unicode strings and ASCII strings, ensuring that each valid Unicode domain label corresponds to exactly one Punycode representation, which facilitates unambiguous resolution and registration under a single authority.[12] This reversibility and compactness make it ideal for embedding internationalized characters in protocols limited to 7-bit ASCII transport.[12]Basic and Non-Basic Code Points
In Punycode, code points are classified into basic and non-basic categories to facilitate the encoding of Unicode strings into an ASCII-compatible format suitable for domain names. Basic code points consist of the 128 ASCII characters with Unicode values from 0 to 127 (0x00 to 0x7F), including letters (A-Z, a-z), digits (0-9), and symbols such as the hyphen (U+002D). These are directly representable in ASCII and form the foundation for the "basic string" in the output, where they are copied literally in the order they appear in the input Unicode string.[13] Non-basic code points encompass all other Unicode characters with values from 128 (0x80) onward, which extend beyond the ASCII range and cannot be directly included in domain labels without encoding. These are not copied literally; instead, they are represented indirectly through a sequence of deltas in the encoded portion of the output, allowing the entire Unicode string to be reconstructed uniquely during decoding. This classification ensures that Punycode outputs remain compliant with ASCII-based protocols while preserving the full expressiveness of Unicode.[14] The hyphen (U+002D) holds a special role as a basic code point that doubles as a delimiter in Punycode output. It separates the literal basic code points at the beginning of the string from the encoded representation of non-basic code points that follow, but only if at least one basic code point is present in the input. This delimiter ensures the output adheres to hostname syntax rules, such as those in RFC 952, by avoiding leading hyphens and maintaining a clear boundary.[15] Punycode encoding assumes that the input Unicode string has already been normalized and prepared according to higher-level protocols, such as Nameprep in the Internationalized Domain Names in Applications (IDNA) framework, which handles case folding, canonicalization, and prohibition of invalid code points like controls or unassigned values. This preprocessing guarantees that the input contains only valid, normalized code points, preventing ambiguities in the encoding process.[16]Encoding Process
Step-by-Step Encoding
The Punycode encoding algorithm transforms a sequence of Unicode code points into a compact ASCII-compatible string, suitable for domain names, by first separating and preserving the basic (ASCII) characters while encoding the non-basic (Unicode) ones efficiently. This process ensures that the output uses only the characters a-z, 0-9, and the hyphen, maintaining compatibility with legacy systems. The algorithm, defined in RFC 3492, operates as a bootstring mechanism where basic code points form the initial "digits" and non-basic ones are represented through incremental deltas.[1] The first step involves identifying all basic code points—those in the range U+0000 to U+007F—from the input sequence and extracting them while preserving their relative order. These basic code points are directly copied to the output as ASCII characters. If any non-basic code points (U+0080 and above) are present in the input, a hyphen (U+002D) is inserted immediately after the sequence of basic code points to serve as a delimiter, signaling the start of the encoded portion; no hyphen is added if the input consists solely of basic code points. This separation allows the decoder to reconstruct the original positions accurately.[17] Subsequently, the non-basic code points are encoded by repeatedly finding the smallest non-basic code point m greater than or equal to the current reference n (initially 128), computing a delta that accounts for the gap (m - n) weighted by the current output position (h + 1, where h is the number of code points already output), and then incrementing the delta for each skipped code point (those less than n or basic) and outputting the current delta each time an instance of m is encountered in the remaining input. After processing all occurrences of m, n is incremented to m + 1, and the process repeats until all input is handled. These deltas capture the incremental adjustments needed to insert the non-basic characters back into the basic sequence during decoding, ensuring the relative order of all code points is maintained.[17] Each computed delta is then encoded as a variable-length integer in base 36, where the digits 0-25 map to a-z and 26-35 to 0-9, producing a sequence of ASCII characters that are appended to the output after the hyphen (or directly after the basic portion if no hyphen is present). This encoding leverages a generalized variable-length scheme to represent large values compactly, with details on the integer representation covered in the relevant section on variable-length integers. After encoding each delta, the bias parameter is adapted to prioritize shorter representations for future non-basic code points, adjusting the reference thresholds based on the number of characters processed so far.[17][18] The resulting output is a single ASCII string constrained to the characters a-z (lowercase), 0-9, and a single hyphen (if applicable), with a maximum length suitable for domain label constraints, typically up to 63 characters excluding the leading ACE prefix in IDNA contexts. This format ensures lossless round-trip conversion when decoded, preserving the original Unicode sequence without exceeding ASCII boundaries.[17]Delta Calculation and Bias Adaptation
In Punycode encoding, the delta represents the weighted difference introduced by inserting a non-basic code point into the sequence of code points, ensuring that the resulting string remains a valid representation in the basic alphabet while minimizing length for common inputs.[17] Specifically, the algorithm starts with delta = 0 and current n = 128. It finds the minimum non-basic code point m ≥ n in the remaining input, then sets delta = delta + (m - n) × (h + 1), where h is the number of code points already output. It then scans the remaining input, incrementing delta for each code point c < n or basic, and when c = n (or = m), encodes the current delta (which includes the weight for the insertion position and skips), resets delta to 0, and increments h. After all occurrences of m, n is set to m + 1, and the process repeats. This scales the contribution based on the insertion's depth in the sequence.[17] Following the delta computation for each insertion, the bias is adapted to dynamically shift the encoding's "expected" range toward recently used values, optimizing for inputs with clustered code points such as those in common scripts or languages.[17] The adaptation begins by damping the delta: if it is the first non-basic, divide by 700; otherwise, divide by 2, then add \delta \div (h + 1) where h + 1 is the position of the current insertion in the output string.[17] This adjusted \delta is then iteratively reduced while \delta > ((base - t_{min}) \times t_{max}) / 2, dividing by base - t_{min} (with base = 36, t_{min} = 1, t_{max} = 26) and accumulating k += base each time, until the condition fails.[17] The new bias is then set as bias = k + ((base - t_{min} + 1) \times \delta) / (\delta + skew), with skew = 38, ensuring the bias converges toward frequently occurring deltas for more compact representations.[17] The purpose of this bias adaptation is to make the variable-length integer encoding of subsequent deltas more efficient by biasing the digit weights toward the range of recent insertions, reducing the average number of digits needed for typical Unicode inputs like domain names in non-Latin scripts.[17] For instance, after encoding characters from a dense script block, the bias shifts to favor similar values, shortening the Punycode output compared to a fixed bias.[17] To prevent computational overflow, the delta must not exceed the maximum representable value for a 26-bit unsigned integer (initially $2^{26} - 1), with the process failing and signaling an error if it does, as this indicates invalid input exceeding Unicode code point limits (0 to 10FFFF) or label length constraints (up to 63 characters).[19] These deltas are subsequently represented in a variable-length base-36 format, but the adaptation ensures they remain bounded for practical implementation.[17]Variable-Length Integer Representation
Punycode serializes delta values from the encoding process into compact ASCII strings using a base-36 variable-length integer scheme. The digit alphabet consists of the 26 lowercase letters 'a' through 'z' (values 0 to 25) and the 10 digits '0' through '9' (values 26 to 35), ensuring all output uses only these 36 safe ASCII characters.[12] The scheme is self-delimiting: the decoder can unambiguously parse the boundaries between consecutive integers in the concatenated digit stream by applying position-dependent thresholds to the digit values, without needing separators or fixed lengths. The encoding of 0 is the empty string; for positive integers, there are no leading zeros in the sense that the most significant digit satisfies the stopping condition relative to its threshold.[12] To encode a positive integer q (a delta), digits are computed iteratively starting from the least significant position (k=0) and appended directly to the output string. For each position k, compute t as: t = tmin (1) if k ≤ bias, t = k - bias if bias < k < bias + tmax (26), t = tmax (26) if k ≥ bias + tmax. Then, if q < t, output the digit for q and stop; otherwise, output the digit for t + ((q - t) mod (base - t)), set q = floor((q - t) / (base - t)), increment k, and repeat. This uses varying effective bases (36 - t) depending on the position relative to bias: 35 for low positions (t=1), decreasing to 10 for high positions (t=26), with the first digit dampened to avoid 0 ('a') and ensure delimitability (digits 1-35, 'b' to '9').[12] The resulting digit sequence appears in little-endian order in the string, with the least significant (dampened) digit first. Small deltas typically encode in 1–2 characters for efficiency, while larger ones extend as needed; the thresholds ensure the most significant digit is always in 0-25 (a-z) for positions where t could be 26 or higher, signaling the end of the integer during decoding.[12]Decoding Process
Step-by-Step Decoding
The Punycode decoding algorithm reconstructs a Unicode string from its ASCII-compatible encoding by separating initial basic code points from subsequent deltas that represent non-basic code points. This process ensures efficient reversal while handling variable-length representations and insertion positions to maintain the original character order. The procedure is defined in RFC 3492, which specifies the exact sequence to avoid ambiguity in internationalized domain name applications.[1] The decoding starts with initialization: set the initial non-basic code pointn to 128 (the first non-ASCII Unicode value), the insertion index i to 0, and the initial bias to 72; create an empty output string to build the result.[20]
First, split the input string at the last hyphen delimiter, which separates the basic portion (consisting of ASCII characters, or basic code points from 0 to 127) from the encoded portion. Copy all code points before the last hyphen directly to the output string, but fail the process if any non-basic code points (Unicode values above 127) appear in this prefix, as they violate the format. If basic code points are present, consume the delimiter itself; otherwise, treat the entire input as basic with no encoded portion. This step ensures that only valid ASCII prefixes are processed without alteration.[20]
Second, while the encoded portion (after the delimiter) has remaining input, parse it into generalized variable-length integers in base-36, where digits are mapped from ASCII letters and digits (a-z for 0-25, 0-9 for 26-35). For each integer: set oldi = i and w = 1; then in a loop starting from k = 36 incrementing by 36, consume a code point, obtain its digit value (fail if invalid or end-of-input), accumulate i = i + digit * w (fail on overflow), compute threshold t as 1 if k <= bias, 26 if k >= bias + 26, else k - bias; break if digit < t, otherwise update w = w * (36 - t) (fail on overflow). This parsing, which starts accumulation from the current i, extracts the delta value integrated with the insertion state, reversing the encoding's compression under the current bias.[20][13]
Third, for the accumulated i after parsing: adapt the bias using the added amount (i - oldi) as input to the adaptation function (detailed below), with numpoints = length(output) + 1 and firsttime if oldi == 0. Compute the new non-basic code point as n = previous n + floor(i / (length(output) + 1)) (fail on overflow or if resulting n < 128, as it would be basic). Set the insertion position to i = i mod (length(output) + 1). Insert n at position i in the output string, then increment i by 1. Repeat for all remaining input in the encoded portion, interleaving non-basic characters among the basic ones in their original sequence.[20]
Finally, output the complete Unicode string from the built output array. The process must error out for invalid inputs, such as the absence of a hyphen when non-ASCII characters are expected, end-of-input during delta parsing, overflow in code point or index calculations, invalid digit values (non a-z0-9), improper variable-length integer termination (e.g., digit >= threshold without break), or a computed n < 128. These checks ensure robustness and prevent malformed reconstructions.[20]
Bias Handling in Decoding
In Punycode decoding, bias handling reverses the dynamic adaptations applied during encoding to accurately recover non-basic code points from the variable-length integer representation of deltas. The process starts with an initial bias value of 72, which influences the digit thresholds (t) in the generalized variable-length integers parsed from the input.[21]
After parsing each variable-length integer to accumulate i (yielding delta = i - oldi), the bias is updated using the reversible adapt function to synchronize with encoding:
The adapt(delta, numpoints, firsttime) function computes:
-
If
firsttime, setdelta = [floor](/page/Floor)(delta / 700)(damp); elsedelta = [floor](/page/Floor)(delta / 2). -
Then
delta = delta + [floor](/page/Floor)(delta / numpoints). -
Set
k = 0; whiledelta > [floor](/page/Floor)(([36](/page/36) - 1) * 26 / 2)(i.e., > 469), setdelta = [floor](/page/Floor)(delta / ([36](/page/36) - 1)),k = k + [36](/page/36). -
Return
k + [floor](/page/Floor)( ([36](/page/36) - 1 + 1) * delta / (delta + 38) )(skew = 38).
i modulo (length(output) + 1) after bias adaptation and n computation, enabling precise interleaving as the output grows; i is then incremented post-insertion.[21]
Decoding fails under error conditions such as a computed code point n < 128 (invalid non-basic) or n exceeding the Unicode range (up to 0x10FFFF, excluding surrogates), preventing malformed outputs.[21]