
Percent-encoding

Percent-encoding, also known as URL encoding, is a mechanism for encoding special or non-ASCII characters in Uniform Resource Identifiers (URIs) by replacing them with a percent sign (%) followed by two hexadecimal digits representing the octet value of the character in its encoding (typically UTF-8 for non-ASCII characters). This process ensures that URIs remain syntactically correct and unambiguous across different systems, preventing reserved characters from being misinterpreted as delimiters or structural elements. It is essential for representing arbitrary data in web addresses, such as spaces in query parameters or non-Latin characters in paths, while maintaining compatibility with the limited US-ASCII character set required for safe transmission over the Internet.

In percent-encoding, only unreserved characters, namely letters (A-Z, a-z), digits (0-9), hyphen (-), period (.), underscore (_), and tilde (~), may appear literally without encoding in most URI components, as they pose no risk of confusion. Reserved characters, divided into generic delimiters (: / ? # [ ] @) and sub-delimiters (! $ & ' ( ) * + , ; =), must be percent-encoded when used as data rather than for their syntactic roles, to avoid altering the URI's structure. For example, a space is encoded as %20, a hash (#) as %23, and a non-ASCII character like é (UTF-8 octet sequence C3 A9) as %C3%A9. Decoding reverses this by converting %HH sequences back to their original octets; the hexadecimal digits are matched case-insensitively, although uppercase (A-F) is preferred when encoding.

The concept originated in the early development of the World Wide Web. It was specified in 1994 in RFC 1630 and, for Uniform Resource Locators (URLs), in RFC 1738, which used percent-encoding to handle unsafe characters. It was refined in RFC 2396 for generic URIs in 1998 and fully standardized in RFC 3986 in 2005, which obsoleted RFC 2396 and clarified how non-ASCII characters, including those in Internationalized Resource Identifiers (IRIs), are represented via UTF-8 encoding. Today, percent-encoding is widely implemented in web technologies, including HTML forms, HTTP requests, and JavaScript APIs, ensuring robust handling of user-generated content in URLs.

Overview

Definition and Purpose

Percent-encoding is a mechanism for representing data octets within components of a Uniform Resource Identifier (URI) or similar contexts by replacing characters outside the allowed set, or those that could interfere with parsing (such as reserved delimiters), with a percent sign (%) followed by two hexadecimal digits corresponding to the octet's value. This method, formally defined as pct-encoded = "%" HEXDIG HEXDIG, ensures that the encoded form adheres to the US-ASCII subset while preserving the original data.

The primary purpose of percent-encoding is to enable the safe and unambiguous transmission of arbitrary or non-standard characters across network protocols that rely on specific characters as structural delimiters, such as URIs, where characters like /, ?, or # separate components. By transforming potentially conflicting characters into a standardized escape sequence, it prevents misinterpretation by parsers, servers, or intermediaries, and so allows spaces, non-ASCII text, or other unsafe elements to appear in web addresses and queries without disrupting protocol syntax.

A common example is the space character (ASCII 32), which is not permitted literally in a URI and becomes %20. For non-ASCII characters, the process involves first converting the character to its UTF-8 byte sequence and then percent-encoding each byte; for instance, the accented letter é (Unicode U+00E9) is encoded as the bytes C3 A9, resulting in %C3%A9. Although frequently called URL encoding in informal contexts, percent-encoding specifically refers to this %-hexadecimal escape mechanism and differs from specialized encodings like Punycode, which handles internationalized domain names by mapping them to ASCII-compatible strings.

History and Development

Percent-encoding emerged in the early 1990s alongside the foundational development of the World-Wide Web, serving as a mechanism to represent non-ASCII and unsafe characters in network addresses while ensuring compatibility with 7-bit ASCII transmission. Initially implemented in WWW software as early as 1990, it addressed the need to avoid corruption of URIs in environments with limited character sets, such as those using only printable US-ASCII characters. This approach drew from broader traditions of encoding binary or 8-bit data for safe transit over text-based protocols, predating the widespread adoption of Unicode.

The technique saw its first informal applications around 1993 with the introduction of the Common Gateway Interface (CGI) by the National Center for Supercomputing Applications (NCSA), where it facilitated the encoding of form data in query strings passed to server-side scripts. As web usage expanded rapidly following the release of graphical browsers like NCSA Mosaic in 1993, variations in how percent-encoding was handled across early implementations highlighted the need for clearer guidelines to prevent interoperability issues. These early uses built on draft specifications for uniform resource locators, which used the percent sign ("%") followed by two hexadecimal digits to escape problematic octets without conflicting with existing syntax like Unix paths or attribute-value pairs.

A key milestone came in 1994 with RFC 1630, authored by Tim Berners-Lee, which formally described percent-encoding for Universal Resource Identifiers (URIs) in the WWW context, defining it as a way to encode reserved and unsafe characters while maintaining hierarchical structure. This was further refined in RFC 1738 later that year, which specified mandatory escaping for certain octets and reserved delimiters to ensure consistent interpretation across protocols. By 1998, RFC 2396 provided a comprehensive generic syntax for URIs, clarifying ambiguities from prior documents like RFC 1738 and RFC 1808, and expanding the set of unreserved characters to include tilde (~) while standardizing the escaping mechanism for broader applicability.

The evolution continued with RFC 3986 in 2005, which updated the syntax to better support internationalization by integrating UTF-8 encoding: non-ASCII characters are first converted to UTF-8 octets, which are then percent-encoded. This revision recommended uppercase hexadecimal digits for consistency, addressed normalization and comparison issues, and replaced terms like "escaped" with "percent-encoded" for precision, reflecting lessons from more than a decade of web deployment and the shift toward global character handling. These IETF efforts, driven by practical inconsistencies in early browser and server implementations, solidified percent-encoding as a core component of web standards.

Encoding Mechanism

Encoding Process

The encoding process for percent-encoding involves transforming input data, typically represented as a sequence of characters, into a sequence of bytes using UTF-8 encoding, then replacing each byte that does not belong to the unreserved set with a percent sign (%) followed by its two-digit uppercase hexadecimal representation. This mechanism ensures that the encoded data can be safely transmitted within URI components without ambiguity, as defined in the URI generic syntax standard. For instance, a space character (Unicode U+0020), which encodes to the byte 0x20 in UTF-8, becomes %20. The algorithmic steps for encoding are as follows:
  1. Convert the input string to a byte sequence using UTF-8 encoding.
  2. For each byte in the sequence, determine if it corresponds to an unreserved character (A-Z, a-z, 0-9, '-', '.', '_', or '~').
  3. If the byte is unreserved, output it directly as a literal character.
  4. If the byte is not unreserved, output a percent sign (%) immediately followed by the uppercase hexadecimal representation of the byte value, padded to two digits (e.g., byte 0xA3 becomes %A3).
This process operates on individual bytes rather than characters to handle international text via UTF-8, ensuring compatibility with ASCII-based protocols.

Decoding reverses this process by scanning the encoded string for percent-encoded triplets (% followed by two hexadecimal digits), converting each such triplet back to its corresponding byte, and then interpreting the resulting byte sequence as characters. To avoid false positives, the decoder must only interpret %XX sequences where both X are valid hexadecimal digits (0-9, A-F, a-f); literal percent signs in the original data are protected by encoding them as %25 during the encoding step. Decoding is typically applied after parsing the URI into components to prevent misinterpretation of delimiters.

Edge cases include the encoding of the percent sign itself, which becomes %25 to distinguish it from escape sequences, and the handling of multi-byte sequences, where each byte is encoded separately (e.g., the character é (U+00E9) encodes to bytes 0xC3 0xA9, resulting in %C3%A9). Well-behaved libraries avoid over-encoding by leaving characters in the unreserved set unescaped.

Many programming languages provide built-in functions to perform this encoding and decoding. In JavaScript, the encodeURIComponent() function implements percent-encoding for URI components by UTF-8 encoding the input and escaping bytes as %HH, per the ECMAScript specification; in addition to the unreserved set, it leaves ! ' ( ) * unescaped. Similarly, Python's urllib.parse.quote() in the standard library applies percent-encoding to strings using UTF-8 bytes, leaving specified safe characters unencoded.
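The encoding steps and the decoding rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names percent_encode and percent_decode are hypothetical, every byte outside the unreserved set is escaped (reserved delimiters included), and malformed triplets are passed through unchanged, which is only one of several possible policies.

```python
import string

# Unreserved bytes per RFC 3986: letters, digits, '-', '.', '_', '~'.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)


def percent_encode(text: str) -> str:
    """Escape every UTF-8 byte that is not unreserved as %HH (uppercase)."""
    out = []
    for byte in text.encode("utf-8"):
        if byte in UNRESERVED:
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


def percent_decode(encoded: str) -> str:
    """Turn valid %HH triplets back into bytes, then decode as UTF-8."""
    raw = bytearray()
    i = 0
    while i < len(encoded):
        triplet = encoded[i:i + 3]
        if (len(triplet) == 3 and triplet[0] == "%"
                and triplet[1] in string.hexdigits
                and triplet[2] in string.hexdigits):
            raw.append(int(triplet[1:], 16))  # hex digits are case-insensitive
            i += 3
        else:
            # Not a valid triplet: keep the character as literal data.
            raw.extend(encoded[i].encode("utf-8"))
            i += 1
    return raw.decode("utf-8", errors="replace")


print(percent_encode("caf\u00e9 100%"))   # caf%C3%A9%20100%25
print(percent_decode("caf%c3%a9%20100%25"))
```

Working on bytes rather than characters is what lets the same loop handle ASCII and multi-byte UTF-8 input uniformly.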

Character Classification

In percent-encoding, characters are classified based on their syntactic roles and compatibility within protocols like URIs, determining whether they can remain literal or must be encoded as a sequence of three octets: a percent sign (%) followed by two hexadecimal digits representing the byte value. This classification ensures unambiguous transmission of data across systems, preventing misinterpretation of delimiters or invalid characters.

Unreserved characters are those that pose no risk of confusion in most contexts and thus can be transmitted literally without percent-encoding. These include uppercase and lowercase letters (A-Z and a-z), digits (0-9), and the symbols hyphen (-), period (.), underscore (_), and tilde (~). For example, the string "example.com" can appear unchanged, as all its characters fall into this set. Although unreserved characters may optionally be percent-encoded (e.g., "A" as %41), such forms are considered equivalent and should be decoded during normalization to avoid redundancy.

Reserved characters, by contrast, have special syntactic purposes in protocols such as URIs and must be percent-encoded when used as data rather than in their delimiting role, to avoid altering the structure. They are divided into two subsets: general delimiters (gen-delims), which include colon (:), slash (/), question mark (?), number sign (#), left and right square brackets ([ and ]), and commercial at (@); and sub-delimiters (sub-delims), which encompass exclamation mark (!), dollar sign ($), ampersand (&), single quote ('), left and right parentheses, asterisk (*), plus sign (+), comma (,), semicolon (;), and equals sign (=). For instance, a slash (/) in a path component normally delimits segments and must be encoded as %2F if intended as literal data within a segment.

Characters outside the unreserved and reserved sets, such as non-ASCII characters, control characters (e.g., NUL, carriage return, or line feed), or any octet not in the US-ASCII range (0-127), must always be percent-encoded for safety and compatibility. Non-ASCII characters are first converted to their UTF-8 byte representation before encoding, ensuring portability across systems. This byte-level approach, rather than a character-level one, allows percent-encoding to handle international text by representing each UTF-8 octet individually (e.g., the accented character "à" becomes %C3%A0), supporting internationalization while adhering to URI octet-sequence assumptions.

The percent sign (%) itself receives special treatment: a literal percent sign in the data must always be encoded as %25 to prevent confusion with the start of a percent-encoded sequence. This rule applies in every component, safeguarding the integrity of the encoding mechanism.
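As a rough illustration of the classification above, the following Python sketch sorts single characters into the categories named in RFC 3986. The function name classify and the returned labels are hypothetical, chosen only for this example.

```python
import string

# Character classes as listed in RFC 3986, section 2.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")


def classify(ch: str) -> str:
    """Return the syntactic class of a single character."""
    if ch in UNRESERVED:
        return "unreserved"      # may always appear literally
    if ch in GEN_DELIMS:
        return "gen-delim"       # encode when used as data
    if ch in SUB_DELIMS:
        return "sub-delim"       # encode when used as data
    if ch == "%":
        return "percent"         # literal use must be written as %25
    return "other"               # always percent-encode (space, controls, non-ASCII)


for ch in "a/&% \u00e0":
    print(repr(ch), classify(ch))
```

Whether a reserved character actually needs encoding depends on the component it appears in, so a real encoder would combine this lookup with a per-component set of allowed characters.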

Applications

In Uniform Resource Identifiers (URIs)

Percent-encoding plays a crucial role in Uniform Resource Identifiers (URIs) by allowing the inclusion of characters that could otherwise interfere with the syntactic structure of URI components. According to RFC 3986, URIs consist of components such as the scheme, authority, path, query, and fragment, where percent-encoding is applied selectively to preserve delimiters and ensure unambiguous parsing. Reserved characters, which include generic delimiters like ":", "/", "?", "#", "[", "]", "@" and sub-delimiters like "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "=", must be percent-encoded when used as data within a component to avoid confusion with their syntactic roles. Unreserved characters, such as alphanumeric characters and "-", ".", "_", "~", may be left unencoded but can be percent-encoded without altering equivalence.

In the path component, percent-encoding is used to encode characters outside the allowed pchar production, which permits unreserved characters, percent-encoded octets, sub-delimiters, ":", and "@". For instance, spaces in a path segment must be encoded so that the path parses correctly. The address http://example.com/path with space/to file would be encoded as http://example.com/path%20with%20space/to%20file, where the forward slash "/" remains unencoded because it serves as the path delimiter. This keeps the two segments intact and distinct.

The query component, following the "?" delimiter, employs percent-encoding more permissively, allowing pchar, "/", and "?" while encoding characters that might conflict with delimiters like "&" and "=". This prevents the query string from being parsed incorrectly; for example, a parameter with an ampersand in its value, such as name=John&Doe, must have the ampersand encoded as name=John%26Doe so that the value is not split into multiple parameters. In practice, implementations often encode all non-alphanumeric characters except those explicitly needed for structure, though the specification advises encoding only when necessary to avoid conflicts.

For the fragment component, introduced by "#", similar rules apply using the same allowed characters as the query, with percent-encoding for any data that could mimic delimiters. Browsers and user agents may automatically apply percent-encoding during navigation to handle fragments safely, but the semantics of fragments are media-type dependent rather than scheme-specific. The scheme and authority components generally prohibit or limit percent-encoding: schemes use no encoding and are case-insensitive, while authority parts like userinfo and registered names allow it for non-ASCII or reserved data via UTF-8 octet encoding.

Common pitfalls in URI percent-encoding include double-encoding, where an already encoded sequence like "%20" is re-encoded to "%2520", leading to incorrect dereferencing. Implementations must avoid encoding or decoding the same string multiple times, as the percent character "%" itself requires encoding as "%25" when used as data. Hexadecimal digits in percent-encodings are case-insensitive per the specification, but uppercase is conventionally preferred for consistency across systems. A complete example is the URI http://example.com/search?q=hello%20world#results, where the space in the query is encoded as "%20" to keep the URI valid, and the fragment remains unencoded because it contains no characters that require escaping.
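The per-component behavior and the double-encoding pitfall described in this section can be reproduced with Python's urllib.parse.quote. This is an illustrative sketch only; the variable names are arbitrary, and safe="" is passed so that "/" and other delimiters inside a single segment or value are escaped as data.

```python
from urllib.parse import quote

# Encode each path segment separately so that "/" stays a delimiter.
segments = ["path with space", "to file"]
path = "/" + "/".join(quote(seg, safe="") for seg in segments)
print("http://example.com" + path)
# http://example.com/path%20with%20space/to%20file

# An ampersand inside a query value is data, so it is escaped.
value = quote("John&Doe", safe="")
print("name=" + value)            # name=John%26Doe

# Double-encoding: encoding an already encoded string escapes the "%" again.
once = quote("hello world", safe="")
twice = quote(once, safe="")
print(once, twice)                # hello%20world hello%2520world
```

Encoding each component exactly once, at the point where the URI is assembled, is the usual way to avoid the %2520 problem.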

In Form Data Submission

In the submission of HTML form data using the application/x-www-form-urlencoded media type, percent-encoding ensures that special characters in user input do not interfere with the structured transmission over HTTP. This format, the default for the form enctype attribute, serializes form fields as a sequence of key-value pairs joined by & delimiters, with each pair formatted as key=value. Keys and values undergo percent-encoding to escape reserved characters, preventing conflicts with the syntax of the encoded string.

The encoding process follows rules akin to those for URI query strings but incorporates a key distinction: spaces are replaced with + symbols, a convention established in early HTML specifications, rather than the %20 used elsewhere. Other special characters, such as &, =, and non-ASCII bytes, are represented as % followed by two hexadecimal digits. Non-ASCII characters are first converted to UTF-8 bytes, which are then individually percent-encoded. The form-urlencoded percent-encode set covers all code points except ASCII alphanumerics and the symbols *, -, ., and _, which ensures safe transmission while minimizing encoding overhead for common characters. The historical use of + for spaces stems from conventions in initial web form handling, as defined in HTML 2.0.

For instance, a form containing fields name with value "John Doe" and age with value "30" serializes to name=John+Doe&age=30. Here, the space in "John Doe" becomes +, while no further encoding is needed for the numeric "30" or the alphanumeric "John". If the name were "John&Doe", it would encode as name=John%26Doe. Upon receipt, servers or parsers decode + back to spaces and expand %HH sequences to their original bytes, reconstructing the form data.

Unlike the multipart/form-data format, which partitions data into labeled parts suitable for binary files and avoids universal percent-encoding, application/x-www-form-urlencoded treats all content as text and mandates encoding of potentially unsafe characters across the entire payload. This makes it more compact for simple textual submissions but less versatile. Web browsers handle this encoding automatically during form submission via GET (appending to the query) or POST (in the request body), ensuring compliance with the standard. However, forms that include file inputs should specify multipart/form-data so that binary uploads are transmitted intact.

A primary limitation of this format is its poor fit for binary data, as percent-encoding can inflate payload sizes significantly (each encoded byte becomes three characters). It is thus recommended only for short, text-only forms, with multipart/form-data preferred for complex or binary-inclusive submissions.
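The serialization and parsing described above can be tried with Python's standard library, as a sketch of the general behavior rather than of any particular browser. urlencode and parse_qs are real functions in urllib.parse; the field names are taken from the example in the text.

```python
from urllib.parse import parse_qs, quote_plus, urlencode

# Serialize form fields: spaces become "+", other special bytes become %HH.
body = urlencode({"name": "John Doe", "age": "30"})
print(body)                                   # name=John+Doe&age=30

# An ampersand inside a value is escaped so it cannot act as a separator.
print(urlencode({"name": "John&Doe"}))        # name=John%26Doe

# Non-ASCII text is UTF-8 encoded first, then each byte is escaped.
print(quote_plus("caf\u00e9 au lait"))        # caf%C3%A9+au+lait

# Parsing reverses both conventions: "+" back to space, %HH back to bytes.
print(parse_qs(body))                         # {'name': ['John Doe'], 'age': ['30']}
```

Note that Python's set of unescaped characters differs slightly from the one in the WHATWG URL Standard: quote_plus leaves "~" unescaped and escapes "*", whereas the standard's form-urlencoded set does the opposite. Decoders accept both forms.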

Standards and Variations

Current Standards

The current standards for percent-encoding are primarily defined in RFC 3986, published in 2005, which specifies the generic syntax for Uniform Resource Identifiers (URIs) and requires percent-encoding for reserved characters when they are used as data rather than as delimiters, and for non-ASCII characters. In this specification, percent-encoding uses the percent sign (%) followed by two hexadecimal digits to encode each octet, and the hexadecimal digits are matched case-insensitively when decoding. For Internationalized Resource Identifiers (IRIs), RFC 3987 extends these rules by requiring UTF-8 encoding of characters prior to percent-encoding, enabling support for non-ASCII scripts in URIs while maintaining compatibility with legacy ASCII-based systems.

In the context of HTTP, RFC 9110, published in 2022, defines HTTP semantics and builds on RFC 3986 for the URIs that appear in HTTP messages, including their percent-encoding. Browser implementations are further guided by the WHATWG URL Standard, a living specification that largely aligns with RFC 3986 for parsing and serializing URLs and defines precise percent-encoding rules to promote interoperability across web agents.

For application/x-www-form-urlencoded data, commonly used in HTML form submissions, the HTML Living Standard specifies that characters must be encoded using UTF-8 followed by percent-encoding, so that delimiters in user input cannot alter the structure of the serialized data. This includes encoding spaces as + and other non-alphanumeric characters as %HH where required.

Since the publication of RFC 3986, there have been no fundamental changes to the core percent-encoding mechanism, but subsequent errata and clarifications emphasize UTF-8 for input characters and address edge cases, such as the handling of IP-literal addresses within URIs to prevent parsing errors. Compliant libraries and implementations must decode hexadecimal digits case-insensitively, so that encoded data generated by different systems is handled consistently.
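The case-insensitivity of decoding, and the related normalization of hexadecimal digits to uppercase, can be illustrated with a short Python sketch. The helper normalize_hex_case is hypothetical and covers only this one normalization step; it is not a full RFC 3986 normalizer.

```python
import re
from urllib.parse import unquote

# Decoding treats %7e and %7E as the same octet.
print(unquote("%7e") == unquote("%7E") == "~")    # True


def normalize_hex_case(uri: str) -> str:
    """Uppercase the hex digits of every percent-encoded triplet."""
    return re.sub(
        r"%[0-9A-Fa-f]{2}",
        lambda match: match.group(0).upper(),
        uri,
    )


print(normalize_hex_case("/caf%c3%a9?q=%7euser"))  # /caf%C3%A9?q=%7Euser
```

Applying this step before comparing two URIs avoids treating /caf%c3%a9 and /caf%C3%A9 as different resources.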

Non-Standard Implementations

In early implementations of web browsers, deviations from standard percent-encoding rules were common. For instance, early versions of Internet Explorer (prior to IE 7) often left the tilde (~) character, which RFC 1738 had classified as unsafe, unencoded in paths, leading to interoperability issues with servers expecting strict compliance. Similarly, Netscape Navigator allowed lowercase hexadecimal digits in percent-encodings (e.g., %7e instead of %7E), which, while functional in many cases, departs from the later RFC 3986 recommendation of uppercase hex digits for consistent normalization across systems.

Extensions of percent-encoding appear in non-URI contexts, though they often conflict with preferred standards. In email headers, RFC 2047 specifies an encoded-word syntax using a "Q" encoding akin to quoted-printable for non-ASCII text, explicitly avoiding percent-encoding to prevent parsing ambiguities; however, some legacy mail clients and custom implementations have misused percent-encoding for header values, resulting in non-compliant messages that may fail delivery or display. In contrast, URL-safe Base64 variants used in JSON Web Tokens (JWTs) and similar formats modify the alphabet (replacing + with - and / with _) to produce output that requires minimal or no additional percent-encoding, distinguishing them from traditional percent-encoding while serving analogous URL-safety goals.

Non-web applications sometimes apply percent-encoding to arbitrary data in ways that extend beyond specifications. For example, certain APIs over-encode fragments (the portion after #), percent-encoding characters that standards leave to application-specific handling, which can cause decoding errors when clients expect raw fragments.

Vendor-specific libraries introduce further variations. In PHP, the urlencode() function is tailored for application/x-www-form-urlencoded data, replacing spaces with + and encoding other reserved characters, whereas rawurlencode() adheres more closely to RFC 3986 by using %20 for spaces, making it suitable for general URI components. Similarly, Java's URLEncoder class produces application/x-www-form-urlencoded output, encoding spaces as +, and its legacy single-argument method relies on the platform's default charset (historically not always UTF-8), which can lead to inconsistent results across environments unless UTF-8 is explicitly specified.

Modern implementations reveal ongoing gaps, particularly in legacy or under-maintained libraries. Older software libraries often provide incomplete support for percent-encoding, truncating or misinterpreting multi-byte sequences and resulting in garbled international characters or data loss during encoding and decoding. These issues heighten security risks, such as CRLF injection, where sequences like %0D%0A (representing carriage return and line feed) are decoded and passed into HTTP headers without validation, allowing attackers to inject malicious headers and enabling HTTP response splitting if input is not strictly sanitized before processing.

In mobile environments, Android's URL handling exhibits quirks that persist into recent versions. Up through Android 14 (2023), the platform's URL-handling classes and related APIs inconsistently handle percent-encoded spaces in query strings, sometimes favoring + over %20 in form-like data, which deviates from strict decoding and can cause mismatches in API calls or web views.
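As a rough illustration of the CRLF concern mentioned above, the following Python sketch decodes a value before checking it for carriage return or line feed characters. The function name contains_crlf is hypothetical, and a single decoding pass is an assumption; inputs that are encoded more than once would need additional handling.

```python
from urllib.parse import unquote


def contains_crlf(value: str) -> bool:
    """Report whether a percent-encoded value hides CR or LF characters."""
    decoded = unquote(value)  # %0D, %0d, %0A, %0a all decode here
    return "\r" in decoded or "\n" in decoded


print(contains_crlf("lang=en%0D%0ASet-Cookie:%20x=1"))  # True
print(contains_crlf("q=hello%20world"))                 # False
```

Checking the decoded form matters because a filter that looks only for literal CR and LF characters would miss their %0D and %0A spellings.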
