UTF-7
UTF-7 (7-bit Unicode Transformation Format) is an obsolete variable-length character encoding designed to represent Unicode characters using only 7-bit ASCII octets, ensuring compatibility with mail-safe transports that restrict data to 7-bit channels.[1] Defined in RFC 2152 and obsoleting the earlier RFC 1642, it supports the Unicode 2.0 standard and ISO/IEC 10646-1:1993, allowing encoding of most world writing systems while remaining human-readable.[1] The format encodes directly printable ASCII characters as themselves and shifts to a modified Base64 representation enclosed in '+' and '-' delimiters for non-ASCII Unicode characters, averaging about 2⅔ octets per character plus overhead.[1]
Although intended primarily for email and similar protocols, UTF-7's use has declined significantly with the adoption of UTF-8, which offers better efficiency and broader support; modern specifications often prohibit or deprecate UTF-7 outside legacy contexts due to security risks from its variable escaping.[2] A modified variant, known as IMAP-modified UTF-7, is specified in RFC 3501 for encoding international mailbox names in the Internet Message Access Protocol (IMAP), using a restricted Base64 alphabet to avoid conflicts with IMAP control characters.[3] This IMAP-specific form, registered as the charset "UTF-7-IMAP" with IANA, remains in limited use for legacy IMAP implementations but is being phased out in favor of UTF-8 extensions like those in RFC 6855.[4] Overall, UTF-7 exemplifies early efforts to balance Unicode's universality with the constraints of 7-bit networks, though it is now largely historical.[1]
Background and Motivation
Historical Development
UTF-7 was developed in the early 1990s by the Internet Engineering Task Force (IETF) to enable the encoding of Unicode characters in a format compatible with 7-bit transport protocols, such as those used in early Internet email systems. The initial proposal for UTF-7 was presented by David Goldsmith in December 1993 on the IETF mailing list, addressing the need for a transformation format that preserved ASCII readability while supporting the emerging Unicode standard (version 1.1 at the time). This led to the publication of RFC 1642 in July 1994, authored by David Goldsmith and Mark Davis, both of Taligent, Inc., which formalized UTF-7 as a mail-safe encoding scheme.[5]
RFC 1642 was later obsoleted by RFC 2152 in May 1997, also authored by Goldsmith and Davis, which refined the encoding rules for better alignment with evolving Unicode specifications and MIME standards. The update addressed minor ambiguities in the original while maintaining the core principles of 7-bit safety and human readability for email applications. This formalization occurred amid growing adoption of Unicode in Internet protocols, with UTF-7 positioned as a specialized solution for environments constrained by 7-bit channels.[6]
UTF-7 found particular application in the Internet Message Access Protocol (IMAP), where RFC 3501 (published in March 2003) recommends a modified variant of UTF-7 for encoding international characters in mailbox names to ensure compatibility with legacy 7-bit systems. This usage persists in some legacy IMAP implementations for folder naming, despite broader shifts toward 8-bit clean encodings like UTF-8. However, by the mid-2010s, UTF-7's role diminished as modern protocols favored more efficient alternatives, reflecting deprecation discussions in standards bodies due to security concerns and the prevalence of UTF-8.[7]
Design Objectives
UTF-7 was developed to encode Unicode characters, initially based on UCS-2 and later adaptable to UTF-16, into a stream of 7-bit US-ASCII octets suitable for transmission over protocols such as SMTP for email and NNTP for network news that assumed 7-bit clean channels.[6] This addressed the challenge of transporting international text in legacy systems limited to 7-bit data, preventing corruption from 8-bit bytes that could occur in non-8-bit-clean networks.[6]
A primary design goal was to maintain human readability, particularly for English text, by preserving US-ASCII characters in their unmodified form while encoding non-ASCII Unicode characters using a modified Base64 scheme enclosed in shift sequences.[6] This approach resembled Base64 to leverage familiarity but modified it—omitting the padding character "="—to avoid ambiguity with email header encodings like those in RFC 2047.[6] Additionally, UTF-7 was engineered for compatibility with existing mechanisms like Quoted-Printable, allowing it to serve as an efficient alternative to double-encoding schemes (e.g., UTF-8 wrapped in Quoted-Printable), which could expand non-ASCII text up to nine times in size.[6]
The design explicitly acknowledged limitations, positioning UTF-7 not as a general-purpose Unicode Transformation Format like UTF-8, but as a specialized "mail-safe" encoding optimized for 7-bit email and news transports.[6] It was not intended for general text files or 8-bit environments, where direct use of UTF-8 or native Unicode representations would be preferable, reflecting the transitional context of early Unicode adoption before widespread UTF-8 support.[6]
Encoding Principles
Character Classification
UTF-7 classifies most Unicode characters into three primary categories to determine how they are handled during encoding: directly encoded characters (Set D), optionally directly encoded characters (Set O), and encoded characters (all remaining characters except for specific whitespace). Directly encoded characters from Set D, which include the uppercase and lowercase letters A-Z and a-z, the digits 0-9, and the nine special characters ' ( ) , - . / : ? (note that + and = are omitted), are output as their ASCII equivalents without modification.[1] This set comprises 71 printable ASCII characters designed to be safe and unambiguous in contexts like email headers.[1]
Optionally directly encoded characters from Set O include additional ASCII symbols such as ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }, excluding the backslash and the tilde (~) for compatibility reasons.[1] These characters may be output directly if they do not conflict with the surrounding context, but they can also be encoded to ensure safety in environments sensitive to certain symbols. Encoded characters encompass all other Unicode code points outside Sets D and O (except whitespace), including non-ASCII characters and most control characters, which must be shifted into modified Base64 blocks for representation.[1] Whitespace characters—the space (U+0020), tab (U+0009), carriage return (U+000D), and line feed (U+000A)—are always directly represented by their ASCII equivalents as a special case, even though not in Sets D or O.[1] However, newlines (carriage return and line feed) terminate any active encoded block, forcing a return to direct encoding mode to maintain readability and structural integrity in text streams.[1]
For edge cases, surrogate code points in UTF-16 are treated as individual 16-bit quantities rather than paired, with no special pairing mechanism applied during classification.[1] Invalid Unicode code points, such as those outside the addressable range or resulting from ill-formed octet sequences (e.g., odd-numbered octets with non-zero discarded bits), are not encoded and render the input sequence ill-formed per the specification.[1] Characters from ISO 10646 beyond the surrogate-addressable range cannot be encoded in UTF-7.[1]
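The classification can be sketched in a few lines of Python. The set literals below follow the Set D, Set O, and whitespace definitions given above; the classify helper and its return labels are illustrative names, not anything defined by RFC 2152.

```python
# Character classification sketch based on RFC 2152's sets (illustrative).
SET_D = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "'(),-./:?"
)
SET_O = set('!"#$%&*;<=>@[]^_`{|}')
WHITESPACE = set(" \t\r\n")          # always passed through directly

def classify(ch: str) -> str:
    """Return how a single character is handled by a UTF-7 encoder."""
    if ch in SET_D or ch in WHITESPACE:
        return "direct"
    if ch in SET_O:
        return "optional-direct"     # may also be Base64-encoded for safety
    return "encoded"                 # shifted into a modified Base64 block

print(classify("A"))   # direct
print(classify("@"))   # optional-direct
print(classify("€"))   # encoded
```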
Modified Base64 Mapping
UTF-7 utilizes a modified form of Base64 encoding to transform sequences of 16-bit Unicode code units into a stream of 7-bit safe ASCII characters, specifically for representing non-ASCII content. This adaptation, outlined in RFC 2152, employs the standard Base64 alphabet of 64 printable ASCII characters: A-Z (values 0-25), a-z (26-51), 0-9 (52-61), + (62), and / (63), while explicitly excluding the padding character = (ASCII 61).[1]
The mapping process begins by converting each 16-bit Unicode code unit into a two-octet sequence, with the most significant octet first, forming a continuous byte stream. This stream is then divided into 6-bit groups for Base64 encoding; to eliminate the need for padding, zero bits are appended to the end of the stream until its length is a multiple of 6 bits. Each 6-bit group is subsequently mapped to its corresponding character in the alphabet. During decoding, the reverse process discards any trailing bits that do not complete a full 16-bit code unit, ensuring well-formed output.[1]
These Base64-encoded blocks are delimited within the overall UTF-7 stream to integrate with direct ASCII representation: a block starts immediately after a + (shift-in) character and continues with Base64 characters until terminated by the first non-Base64 character (such as whitespace) or by a - (shift-out), which is consumed and does not appear in the decoded text. This delimitation prevents ambiguity: a literal ASCII + is represented by the sequence +-. The Base64 characters, including internal + and /, are not escaped or modified within a block, as the delimiters provide clear boundaries.[1]
Key distinctions from conventional Base64 include the prohibition of = to avoid potential issues in 7-bit transports like email, the internal zero-bit padding to maintain block integrity without explicit pads, and the reliance on contextual delimiters rather than fixed-length segments for embedding in mixed ASCII-Unicode text. These modifications prioritize compatibility with legacy 7-bit systems while preserving the efficiency of 6-bit encoding for Unicode data.[1]
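The mapping can be reproduced with the standard Base64 routine from Python's library: stripping the '=' padding that RFC 2152 forbids leaves exactly the zero-bit-padded 6-bit groups described above. The modified_base64 helper below is an illustrative sketch, not code taken from the RFC.

```python
import base64

def modified_base64(code_units):
    """Pack 16-bit code units (most significant octet first) into the
    unpadded, modified Base64 used inside UTF-7 shifted blocks."""
    octets = b"".join(u.to_bytes(2, "big") for u in code_units)
    # Standard Base64 already zero-pads the final 6-bit group; dropping the
    # '=' characters yields the RFC 2152 form.
    return base64.b64encode(octets).decode("ascii").rstrip("=")

print(modified_base64([0x20AC]))                  # 'IKw'  (Euro sign U+20AC)
print(modified_base64([0x65E5, 0x672C, 0x8A9E]))  # 'ZeVnLIqe'
```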
Algorithm Details
Encoding Steps
The UTF-7 encoding algorithm converts a sequence of Unicode characters, each represented as a 16-bit UCS-2 or UTF-16 code unit, into a stream of 7-bit US-ASCII octets suitable for environments like email transport.[6] The process uses a state-based approach, alternating between direct output for compatible ASCII characters and shifted sequences for others, ensuring the output remains human-readable and safe for 7-bit channels.[6] Character classification determines direct encodability; for the purposes of this algorithm, the direct set D comprises the alphanumerics A-Z, a-z, and 0-9, the punctuation '(),-./:?, and the directly represented whitespace characters space, tab, CR, and LF.[6]
The encoding scans the input sequentially and maintains an internal state (direct or encoded mode). It begins in direct mode. For each 16-bit code unit:
- In direct mode, if the current character belongs to set D, append its ASCII representation directly to the output stream.[6] If not, switch to encoded mode. As a special case per RFC 2152, if the current character is U+002B ('+'), append '+' followed by '-', advance to the next character, and remain in direct mode. Otherwise, append the shift character '+' (ASCII 43), and begin buffering the octet representation of this and subsequent non-set-D characters.[6]
- In encoded mode (for non-'+' cases), treat the sequence of non-set-D characters as a stream of 16-bit values, converting each to two octets with the most significant octet first.[6] Concatenate these octets into a bit stream, then encode groups of 6 bits using modified Base64 (the 64-character alphabet A-Z, a-z, 0-9, +, / from RFC 2045, omitting the '=' padding symbol).[6][8] Output the resulting Base64 characters immediately, advancing the buffer as needed to maintain 6-bit boundaries; any trailing bits fewer than 6 are zero-padded to form complete 6-bit groups for encoding.[6]
- Remain in encoded mode until a set-D character is encountered or the input ends. Upon exit, append the terminator '-' (ASCII 45), which is absorbed and does not represent input data.[6] Switch back to direct mode and process the terminating set-D character (if any) as in step 1. If the encoded sequence is empty in non-special cases, output '+-' directly, though such cases are rare outside the '+' special handling.[6]
This process ensures seamless transitions; for instance, a '+' in direct mode triggers encoding as '+-'.[6] Spaces, tabs, carriage returns, and line feeds are always treated as set-D characters and output directly, adhering to MIME transport rules.[6]
The following pseudocode outlines the core algorithm (assuming valid input; ill-formed sequences, such as invalid Unicode code units, are typically dropped or flagged by implementations, though the specification assumes well-formed UCS-2 input):[6]
function encodeUTF7(input: sequence of 16-bit code units) -> string:
    output = ""
    mode = "direct"
    buffer = []  // list of 8-bit octets
    i = 0
    while i < length(input):
        char = input[i]
        if mode == "direct":
            if isSetD(char):
                output += asciiOf(char)
                i += 1
            else:
                if char == 0x002B:  // Special case for '+'
                    output += "+-"
                    i += 1
                else:
                    mode = "encoded"
                    output += "+"
                    // fall through to buffer char
        if mode == "encoded":
            // Add char as two octets (MSB first)
            buffer.append((char >> 8) & 0xFF)
            buffer.append(char & 0xFF)
            i += 1
            // Continue buffering until set D or end
            while i < length(input) and not isSetD(input[i]):
                char = input[i]
                buffer.append((char >> 8) & 0xFF)
                buffer.append(char & 0xFF)
                i += 1
            // Encode buffer as modified Base64
            encoded = modifiedBase64(concat(buffer))
            output += encoded
            output += "-"
            buffer = []
            mode = "direct"
            // If i < length and isSetD(input[i]), it is handled in the next iteration
    return output
In this pseudocode, isSetD checks membership in the direct set, asciiOf maps the code unit to its ASCII byte if applicable, and modifiedBase64 performs the unpadded Base64 encoding per RFC 2045, zero-padding trailing bits to 6-bit multiples as needed.[6][8]
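A compact executable version of the same algorithm is sketched below in Python. It is illustrative only: input is assumed to stay within the BMP, and a literal '+' is always escaped as '+-', which is one of the valid choices the specification allows.

```python
import base64

# Characters emitted directly: Set D plus space, tab, CR, and LF.
DIRECT = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "'(),-./:? \t\r\n"
)

def encode_utf7(text: str) -> str:
    """Minimal UTF-7 encoder mirroring the pseudocode above (BMP input only)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "+":                        # literal '+' becomes '+-'
            out.append("+-")
            i += 1
        elif ch in DIRECT:
            out.append(ch)                   # direct mode: copy as ASCII
            i += 1
        else:                                # shift into a modified Base64 run
            j = i
            while j < len(text) and text[j] not in DIRECT and text[j] != "+":
                j += 1
            octets = b"".join(ord(c).to_bytes(2, "big") for c in text[i:j])
            b64 = base64.b64encode(octets).decode("ascii").rstrip("=")
            out.append("+" + b64 + "-")      # unconditional '-' terminator
            i = j
    return "".join(out)

print(encode_utf7("Hi €"))           # Hi +IKw-
print(encode_utf7("日本語"))          # +ZeVnLIqe-
print(encode_utf7("Hello + world"))  # Hello +- world
```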
Decoding Steps
The decoding process for UTF-7 reverses the encoding to produce a sequence of 16-bit Unicode code units from an input stream of 7-bit US-ASCII octets. It operates via a state machine that alternates between direct ASCII output and encoded Base64 segments, ensuring only valid sequences are processed while rejecting ill-formed input.[1]
The decoder begins in direct mode, parsing the input sequentially and emitting each character as a Unicode code unit if it belongs to the directly encoded set (A-Z, a-z, 0-9, the symbols '(),-./:?, and the whitespace characters space, tab, CR, and LF). This continues until a '+' character is encountered, which signals the start of an optional encoded segment; the '+' itself is not output but initiates a shift to Base64 mode. In this mode, the decoder collects subsequent characters from the Base64 set (A-Z, a-z, 0-9, '+', '/') until a character outside this set appears or the input ends. The collected Base64 string is then decoded into a binary octet stream using the modified Base64 alphabet, where each character contributes six bits, so every complete group of four characters yields three octets. These octets are paired into 16-bit Unicode code units, most significant octet first, and appended to the output. If the terminating character is '-', it is absorbed without output; otherwise the Base64 segment is decoded first and the terminating character is then emitted as a Unicode code unit. As a special case, the empty segment "+-" decodes to U+002B ('+').[1][9]
Validation occurs throughout to ensure well-formedness. The Base64 segment length need not be a multiple of four characters, since padding is implicit and variable-length segments are permitted, but the resulting bit stream must yield complete 16-bit pairs; an odd octet count, or leftover bits that are not zero, renders the sequence ill-formed and triggers rejection. A '-' that does not terminate an encoded segment is simply a directly encoded hyphen. Characters outside the Base64 set terminate the segment, after which they are processed in direct mode; the entire input is rejected only if the padding or bit rules above are violated. Whitespace characters (space, tab, CR, LF) in direct mode are output directly, and if one immediately follows a '+' it terminates the (empty) Base64 block without starting a new one.[1][9]
Edge cases include the sequence "+-", which decodes to a single '+' Unicode character: the '-' terminates the empty Base64 block (yielding no octets) and is absorbed, and the special rule outputs '+'. A '+' appearing within a Base64 segment is treated as a valid Base64 character (value 62) rather than a nested shift, causing no error unless it leads to invalid octet pairing. Unterminated Base64 at the end of the input (no closing non-Base64 character) is decoded as a partial segment if valid, but rejected if it yields an odd octet count or non-zero padding bits. Line breaks do not inherently invalidate segments, although encoded runs are not intended to span lines in mail contexts; the decoder processes the input linearly regardless.[1][9]
The process can be formalized as a state machine with two states: direct (default) and Base64 (shifted). Below is conceptual pseudocode outlining the decoder:
state = DIRECT
output = []
buffer = []
i = 0
while i < input.length:
    char = input[i]
    if state == DIRECT:
        if char == '+':
            state = BASE64
            i += 1
            continue
        else:
            output.append(unicode_from_ascii(char))
            i += 1
    else:  # BASE64
        if is_base64_char(char):
            buffer.append(char)
            i += 1
        else:
            if buffer.length > 0:
                octets = modified_base64_decode(buffer)
                if octets.length % 2 != 0 or has_nonzero_padding(octets):
                    reject("Ill-formed Base64")
                for j = 0 to octets.length - 2 step 2:
                    code_unit = (octets[j] << 8) | octets[j+1]
                    output.append(code_unit)
                buffer = []
            else if char == '-':
                output.append(0x002B)  # Special case: "+-" decodes to '+'
            state = DIRECT
            if char != '-':
                output.append(unicode_from_ascii(char))
            i += 1
if buffer.length > 0:
    octets = modified_base64_decode(buffer)
    if octets.length % 2 != 0 or has_nonzero_padding(octets):
        reject("Ill-formed unterminated Base64")
    else:
        for j = 0 to octets.length - 2 step 2:
            code_unit = (octets[j] << 8) | octets[j+1]
            output.append(code_unit)
return output  # sequence of 16-bit Unicode code units
This yields 16-bit Unicode output for valid input, with errors halting processing on invalid sequences. The modified Base64 decoding maps characters to 6-bit values (A=0, ..., /=63) and converts to octets without standard '=' padding.[1][9][10]
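The state machine translates into Python as follows. This is an illustrative sketch that assumes well-formed input, stays within the BMP, and (unlike a strict decoder) does not verify that leftover pad bits are zero.

```python
import base64

B64_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
)

def decode_utf7(data: str) -> str:
    """Minimal UTF-7 decoder mirroring the state machine above."""
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch != "+":                     # direct mode: ASCII maps to itself
            out.append(ch)
            i += 1
            continue
        # '+' shifts in: collect the run of modified Base64 characters
        j = i + 1
        while j < len(data) and data[j] in B64_CHARS:
            j += 1
        run = data[i + 1:j]
        if run:
            padded = run + "=" * (-len(run) % 4)   # restore '=' padding
            octets = base64.b64decode(padded)
            if len(octets) % 2:
                raise ValueError("ill-formed UTF-7: odd number of octets")
            out.extend(                            # pair octets, MSB first
                chr(int.from_bytes(octets[k:k + 2], "big"))
                for k in range(0, len(octets), 2)
            )
        else:
            out.append("+")               # "+-" escapes a literal '+'
        if j < len(data) and data[j] == "-":
            j += 1                        # the '-' terminator is absorbed
        i = j
    return "".join(out)

print(decode_utf7("Hi +IKw-"))          # Hi €
print(decode_utf7("+ZeVnLIqe-"))        # 日本語
print(decode_utf7("Hello +- world"))    # Hello + world
```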
Practical Examples
Basic ASCII Handling
In UTF-7 encoding, characters from the basic ASCII range that belong to the direct sets (Sets D and O) or specific control characters are represented identically to their original ASCII byte values, ensuring no transformation or overhead for pure ASCII text. This direct mapping applies to alphanumeric characters (A-Z, a-z, 0-9), common punctuation in Set D such as '(),-./:?, and optionally to additional symbols in Set O like !#%&*;<=>@[]^_`{|}. Furthermore, whitespace and line control characters—including the space (ASCII 32), tab (ASCII 9), carriage return (ASCII 13), and line feed (ASCII 10)—are preserved as their ASCII equivalents without any encoding shifts.[6]
This preservation allows seamless handling of standard ASCII content, such as the input string "Hello World", which encodes to exactly "Hello World" since all characters fall within the direct sets. Similarly, a single character like "A" remains "A", and an empty string encodes to an empty string, with no additional bytes introduced. Newlines and spaces in multi-line ASCII text, for instance in a simple message body, are maintained as-is, avoiding any disruption to formatting.[6]
In protocols like email headers and bodies, where 7-bit ASCII compatibility is essential, this zero-overhead approach for basic ASCII enables UTF-7 to integrate Unicode capabilities without altering legacy ASCII data, reducing processing complexity in mail gateways that expect unmodified printable ASCII.[6]
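A quick check with Python's bundled "utf-7" codec illustrates the zero-overhead property for input drawn entirely from the directly encodable characters:

```python
# ASCII text drawn from the direct sets round-trips with no overhead:
# the UTF-7 bytes are byte-for-byte identical to the ASCII bytes.
for text in ["Hello World", "A", ""]:
    encoded = text.encode("utf-7")
    assert encoded == text.encode("ascii")
    print(f"{text!r} -> {encoded!r} ({len(encoded)} bytes)")
```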
Non-ASCII Unicode Encoding
UTF-7 encodes non-ASCII Unicode characters by shifting into a modified Base64 representation enclosed in + and - delimiters, allowing the inclusion of international text within a 7-bit ASCII stream.[6] This mechanism packs 16-bit Unicode code units into groups of 6 bits each, using the Base64 alphabet (A-Z, a-z, 0-9, +, /) without padding, to represent characters outside the directly encodable ASCII range.[6]
Consider the string "Hi €", where the Euro sign (U+20AC) is a non-ASCII character. The UTF-7 encoding is "Hi +IKw-", with "Hi " remaining as direct ASCII and "+IKw-" representing the Euro sign.[6] To derive this, the 16-bit value 0x20AC (binary 0010000010101100) is padded with two zero bits to 18 bits (001000 001010 110000), grouped into 6-bit segments: 001000 (decimal 8, 'I'), 001010 (decimal 10, 'K'), and 110000 (decimal 48, 'w'). These map to the modified Base64 characters, enclosed by + and -.[6] This process introduces overhead, as the single 16-bit character expands to five 7-bit ASCII characters (+IKw-), compared to two bytes in UTF-16BE.
For sequences of consecutive non-ASCII characters, such as the Japanese word 日本語 ("nihongo", U+65E5 U+672C U+8A9E, meaning "Japanese language"), UTF-7 combines them into a single encoded block: "+ZeVnLIqe-".[6] Here, the three 16-bit code units (48 bits total) are concatenated, grouped into exactly eight 6-bit segments, and mapped to Base64: ZeVnLIqe (Z=25, e=30, V=21, n=39, L=11, I=8, q=42, e=30). This efficient packing minimizes delimiters for runs of non-ASCII text, though the overall expansion resembles Base64's approximately 33% overhead for binary data.[6]
When mixing ASCII and special characters, UTF-7 requires escaping to avoid conflicts with shift sequences. For "Hello + world", the literal "+" must be encoded to prevent misinterpretation as a shift, resulting in "Hello +- world".[6] The "+-" directly represents the literal "+", while the rest remains unchanged.
The following table illustrates input-output comparisons for these examples:
| Unicode Input | UTF-7 Output | Notes on Expansion |
|---|---|---|
| Hi € | Hi +IKw- | 8 chars total (from 4 input chars; € expands from 2 UTF-16 bytes to 5 chars) |
| 日本語 (nihongo) | +ZeVnLIqe- | 10 chars total (from 3 input chars, 6 bytes in UTF-16) |
| Hello + world | Hello +- world | 14 chars total (from 13 input chars, +1 for escaping literal +) |
These encodings ensure compatibility with 7-bit transports while preserving Unicode semantics, though at the cost of increased length for non-ASCII content.[6]
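The expansion figures in the table can be reproduced with Python's built-in codec; the comparison against UTF-16BE below is illustrative and assumes the codec emits the same encoded forms shown above.

```python
# Comparing UTF-7 length against UTF-16BE for the table rows.
for text in ["Hi €", "日本語", "Hello + world"]:
    utf7 = text.encode("utf-7")
    utf16 = text.encode("utf-16-be")
    print(f"{text!r}: UTF-7 {len(utf7)} bytes vs UTF-16BE {len(utf16)} bytes")
# Expected: 8 vs 8, 10 vs 6, and 14 vs 26 bytes respectively.
```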
Special Features and Considerations
Byte Order Mark Integration
UTF-7, being an ASCII-compatible encoding, does not inherently require a Byte Order Mark (BOM) and typically omits it, as the format assumes big-endian byte order for its internal 16-bit UCS-2 representation during encoding and decoding.[6] This assumption aligns with network byte order conventions, eliminating the need for endianness signaling in most cases.[6]
However, a BOM can be optionally included by encoding the Unicode character U+FEFF (ZERO WIDTH NO-BREAK SPACE) at the beginning of the text, which serves as a format indicator compatible with UTF-16 big-endian.[11] The encoding of this BOM in UTF-7 produces the sequence "+/v8-" when followed by an ASCII character, as the modified Base64 encoding of the bytes 0xFE 0xFF (big-endian) results in "/v8" within the shifted block, terminated by "-".[12] For instance, the string consisting of a BOM followed by "Hello" encodes as "+/v8-Hello", where the BOM signals the presence of Unicode content while allowing seamless continuation into ASCII.[12] Variations such as "+/v9" can appear when the encoded block continues with a following non-ASCII character whose high bits change the final Base64 group, but "+/v8-" is the usual form when ASCII text follows the BOM.[12]
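The derivation can be checked directly; the round-trip below uses Python's "utf-7" codec, and the expected byte string follows from the Base64 grouping described above.

```python
# U+FEFF packs into the Base64 run "/v8"; with ASCII text following,
# the run is terminated by '-' before the first direct character.
text = "\ufeffHello"
encoded = text.encode("utf-7")
print(encoded)                           # expected: b'+/v8-Hello'
assert encoded.decode("utf-7") == text   # the BOM survives a round trip
```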
UTF-7 provides no mechanism for indicating little-endian order, as the specification mandates big-endian serialization exclusively.[6] In contexts like IMAP, where a modified variant of UTF-7 is used for international mailbox names, a BOM is neither required nor standard, though some implementations may preserve it for compatibility; decoders generally ignore it during processing.[7] The Unicode Standard discourages the use of UTF-7 altogether due to its obsolescence and potential for ambiguity, rendering BOM integration rare and non-recommended in modern applications.[13]
Security Implications
One significant security risk associated with UTF-7 arises from encoding confusion in web browsers and email clients, where content presumed to be in UTF-8 or ASCII is misinterpreted as UTF-7, facilitating cross-site scripting (XSS) attacks. This vulnerability stems from UTF-7's use of ASCII-compatible sequences that can embed executable HTML tags without triggering standard filters; for example, the sequence +ADw-script+AD4-alert(+AC8-XSS+AC8-)+ADw-/script+AD4- decodes to <script>alert(/XSS/)</script>, allowing malicious script injection if auto-detection occurs.[14][15] Such misinterpretation has been exploited when no explicit charset is declared in HTML documents, as older parsers default to probing for UTF-7 signatures like the "+" shift character.[16]
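Decoding the payload with any RFC 2152 decoder makes the risk concrete; the snippet below uses Python's "utf-7" codec purely as a demonstration of how innocuous-looking ASCII shifts back into markup.

```python
# The 7-bit ASCII payload conceals angle brackets and slashes inside
# modified Base64 runs, so naive ASCII filters never see a '<' or '>'.
payload = b"+ADw-script+AD4-alert(+AC8-XSS+AC8-)+ADw-/script+AD4-"
print(payload.decode("utf-7"))   # <script>alert(/XSS/)</script>
```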
Historical exploits highlight UTF-7's dangers in early web applications. Between 2008 and 2010, attackers leveraged UTF-7 in Internet Explorer to bypass input sanitization via malformed HTTP responses, potentially enabling remote code execution through XSS payloads disguised in Base64-like encodings.[17] Similar attacks targeted email services like Hotmail, where UTF-7-encoded scripts evaded filters in webmail interfaces, allowing injection of arbitrary code when viewed in vulnerable browsers. These incidents underscored UTF-7's incompatibility with secure parsing, prompting widespread scrutiny.
The Unicode Consortium ceased recommending UTF-7 for new uses following its omission from the Unicode Standard version 3.0 in 2000, with subsequent versions prioritizing UTF-8, UTF-16, and UTF-32 as the sole transformation formats. In HTML5, parsers disable UTF-7 by default, as the specification explicitly forbids its support to avert XSS risks.[18] Post-2010 browser updates further hardened defenses: Chrome and Firefox removed UTF-7 auto-detection around 2012, treating such content as plain text rather than decoding it.[19]
Mitigations emphasize avoiding UTF-7 entirely in modern systems, coupled with strict decoder validation to reject ambiguous or invalid sequences during input processing. For legacy contexts like IMAP, where modified UTF-7 persists for mailbox names under RFC 3501, RFC 6855 (2013) introduces UTF-8 as a secure replacement via the UTF8=ACCEPT extension, effectively deprecating UTF-7 usage and addressing associated risks through capability negotiation. Developers should enforce explicit charsets (e.g., UTF-8) in web headers and sanitize user inputs against encoding shifts to prevent exploitation.[20]