Email address
An email address is a unique identifier for a mailbox in the Internet's electronic mail system, consisting of a local-part (specifying the recipient on the host), the "@" symbol, and a domain (identifying the host or server).[1] The local-part may include letters, numbers, and certain special characters, either as a dot-atom (e.g., "user.name") or a quoted-string for more complex cases, while the domain follows hostname conventions with subdomains separated by dots (e.g., "example.com").[1] This structure ensures precise routing and delivery of messages across the global network.[1]
The concept of the email address emerged in 1971 when Ray Tomlinson, working on the ARPANET, developed a program to send messages across distributed computers, selecting the "@" symbol to distinguish the user from the host machine.[2] By 1973, email accounted for about 75% of ARPANET traffic, highlighting its rapid adoption among researchers.[2] The format was first standardized in RFC 822 in 1982, which defined the addr-spec syntax as local-part "@" domain and introduced hierarchical domains for broader scalability.[3] Subsequent updates in RFC 2822 (2001) and RFC 5322 (2008) refined the syntax for clarity and compatibility, while relegating legacy elements such as source routes to an obsolete syntax.[1]
In the Internet Mail Architecture, email addresses function as globally unique identifiers that enable spontaneous end-to-end communication without prior setup, appearing in SMTP commands for envelope routing (e.g., MAIL FROM and RCPT TO) and in message headers (e.g., From:, To:) for content association.[4] They extend beyond mere delivery to serve as persistent online identities for services like authentication and notifications.[4]
To accommodate global users, internationalized email addresses supporting non-ASCII characters in both local-parts and domains were specified starting in RFC 6530 (2012), with UTF-8 encoding for broader linguistic inclusion.[5]
Role in Email Communication
Definition and Purpose
An email address is a unique string that identifies the recipient of an electronic mail message within the Internet's messaging framework, serving as a specific identifier for a mailbox on a host computer.[6] It typically follows the format of a local-part followed by an "@" symbol and a domain, enabling precise targeting of messages to individual users or shared mailboxes.[7]
The primary purpose of an email address is to facilitate the routing and delivery of messages across interconnected networks, supporting both one-to-one correspondence and one-to-many distributions such as mailing lists.[8] Beyond message transport, it functions as a foundational digital identity, commonly used for user authentication, account registration, subscription to services like newsletters, and integration with other online systems.[8]
Email addresses originated in the early 1970s as part of the ARPANET, the precursor to the modern Internet, where engineer Ray Tomlinson developed the first networked email system in 1971 by extending existing programs to allow inter-host messaging.[9] This innovation quickly evolved into a global standard for internet-based electronic communication, standardizing user addressing across diverse systems.[10]
Unlike telephone numbers, which primarily enable voice or short message services, or IP addresses, which identify network devices for data routing, email addresses specifically target human users or virtual mailboxes for asynchronous text-based exchange.[11]
Message Transport Usage
Email addresses play a central role in the Simple Mail Transfer Protocol (SMTP), the standard for transporting email messages across the internet. In an SMTP transaction, the sender's email address is specified using the MAIL FROM command, which defines the reverse-path for error notifications and delivery reports.[12] Similarly, each recipient's email address is indicated via the RCPT TO command, establishing the forward-path to guide message delivery.[13] These commands form the SMTP envelope, which encapsulates the routing information separate from the message content itself.[14]
The routing process relies on the domain portion of the email address to determine the appropriate mail server. When an SMTP server receives a message, it resolves the recipient's domain through DNS MX (Mail Exchanger) records to identify the target server for relay or final delivery.[15] The local-part of the address then specifies the individual mailbox on that server, enabling precise delivery.[13]
A key distinction exists between the transport envelope and the message headers. The envelope addresses (from MAIL FROM and RCPT TO) are used exclusively for routing and are not visible to end users, whereas header fields like From: and To: serve display and informational purposes within the email client.[16] This separation ensures that routing remains efficient and independent of the message's visible content, such as in cases of blind carbon copies where recipients are not listed in headers.[17]
If an email address proves undeliverable during transport, the SMTP server generates error responses and bounce messages. For instance, a 550 reply code indicates a permanent failure, such as an invalid or non-existent recipient, prompting the sending server to notify the original sender via the reverse-path.[18] These bounce messages, often containing diagnostic details, are sent back to the MAIL FROM address to inform the sender of the issue.[19]
Syntax and Components
Local-part
The local-part of an email address is the portion preceding the "@" symbol, which specifies the recipient's mailbox or alias on the mail server indicated by the domain.[7] It serves to uniquely identify the user within that specific domain, allowing for flexible naming conventions determined by the receiving server.[20] According to RFC 5322, the syntax for the local-part is defined as a dot-atom, a quoted-string, or an obsolete local-part form (obs-local-part).[7] The dot-atom consists of one or more dot-atom-text elements separated by dots, where dot-atom-text includes letters (a-z and A-Z), digits (0-9), and the special characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~, but it cannot begin or end with a dot, nor contain consecutive dots.[21] The quoted-string format encloses content in double quotes, permitting a broader range of ASCII characters (excluding CR and LF) through escaped quoted-pairs, such as backslash-escaped specials or spaces.[22] Obsolete forms, retained for backward compatibility, allow additional structures like unquoted spaces or other legacy characters, though modern implementations favor the standard dot-atom and quoted-string.[23] The maximum length of the local-part is 64 octets, as specified in RFC 5321 for SMTP compliance, ensuring compatibility across mail transfer agents.[24] Regarding case sensitivity, RFC 5321 mandates that the local-part be treated as case-sensitive, requiring SMTP servers to preserve its casing during transmission.[25] However, many email providers, such as those implementing common extensions, treat it as case-insensitive for delivery purposes to improve user experience and reduce errors.[26] Common formats for the local-part include simple alphanumeric usernames (e.g., user), dotted variants for substructure (e.g., user.name), and plus-addressing extensions (e.g., user+tag), where the plus sign and following tag are valid per RFC 5322 and often used by providers like Gmail for filtering or disposable aliases.[21] 
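The dot-atom rules and the 64-octet limit described above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete RFC 5322 parser: it deliberately ignores quoted-string and obsolete forms, and the names ATEXT, DOT_ATOM, and is_dot_atom_local_part are hypothetical names chosen for this example.

```python
import re

# atext as listed above: letters, digits, and ! # $ % & ' * + - / = ? ^ _ ` { | } ~
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# One or more atext runs separated by single dots: no leading, trailing,
# or consecutive dots can match this pattern.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")

MAX_LOCAL_OCTETS = 64  # local-part length limit cited from RFC 5321


def is_dot_atom_local_part(local_part: str) -> bool:
    """Return True if local_part is an ASCII dot-atom of at most 64 octets.

    Assumption: quoted-string and obs-local-part forms are out of scope here.
    """
    if len(local_part.encode("utf-8")) > MAX_LOCAL_OCTETS:
        return False
    return DOT_ATOM.fullmatch(local_part) is not None
```

Under these assumptions, "user.name" and "user+tag" pass, while ".user", "user..name", and "user name" are rejected; a quoted local-part would need a separate code path.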
Quoting enables inclusion of spaces or other restricted characters, such as "user name" or "user with space", by wrapping the local-part in double quotes and escaping as needed.[22] These formats enhance flexibility while adhering to the core syntax rules.
Domain
The domain part of an email address is the segment following the "@" symbol, which specifies the destination mail server or organization for message delivery. It typically consists of a fully qualified domain name (FQDN), such as "example.com," or an IP address literal, ensuring the email can be routed accurately within the internet mail system.[27] The syntax of the domain adheres to rules outlined in RFC 5321 and aligns with DNS hostname specifications in RFC 1035. It comprises one or more labels separated by periods, where each label includes only letters (a-z, A-Z), digits (0-9), and hyphens (-), with hyphens not permitted at the start or end of a label and no underscores allowed in standard domain names. The entire domain must not exceed 255 octets in length to maintain compatibility with SMTP transport limits.[27][28] To resolve the domain for email routing, the sending SMTP server queries the Domain Name System (DNS) for MX (Mail Exchanger) records associated with the domain, as defined in RFC 5321 and detailed in RFC 974. These records list the preferred mail servers, ordered by a numeric preference value (lower values indicating higher priority), allowing selection of the optimal server for delivery. In the absence of MX records, the server falls back to querying A (IPv4) or AAAA (IPv6) records to obtain the domain's IP address directly.[27][29] Domain literals provide an alternative to FQDNs by embedding IP addresses directly in the email address, enclosed in square brackets to distinguish them from domain names. 
For IPv4, this appears as [192.0.2.1]; for IPv6, it uses the format [IPv6:2001:db8::1], supporting literal resolution without DNS involvement, though such usage is discouraged and often rejected in modern systems for security reasons.[27] Domains incorporating non-ASCII characters, known as Internationalized Domain Names (IDNs), are represented in Punycode (xn-- prefix) to ensure ASCII compatibility during transmission, with the encoding rules originally given in RFC 3490 and now defined by its successors, RFC 5890 and RFC 5891.
Sub-addressing
Sub-addressing, also known as plus-addressing or tagged addressing, is an extension to the local-part of an email address that allows users to append optional tags using specific delimiters, enabling emails to be routed to the same mailbox without requiring a separate account. For instance, an email sent to [email protected] is delivered to the primary mailbox associated with [email protected], as the receiving server interprets the tag after the delimiter and strips it during processing.[30][31] The most common delimiter is the plus sign (+), which is supported by major providers such as Gmail and Microsoft Exchange Online, where it separates the base local-part from the tag. Other delimiters include the hyphen (-), used by some systems like certain spam filtering services, and the pipe (|), which is less commonly implemented across providers. These delimiters are permitted within the local-part syntax as defined by RFC 5322, but their interpretive handling for sub-addressing is implementation-specific and not mandated by the standard.[32][33] Common use cases for sub-addressing include organizing incoming mail by category, such as directing messages to [email protected] for professional correspondence or [email protected] for e-commerce notifications, thereby facilitating automated filtering rules. It also enables tracking the origin of email sign-ups, for example, by using [email protected] to identify which services might be sources of spam or data breaches. Additionally, users create temporary aliases for one-time purposes, like online registrations, to enhance privacy without exposing the primary address.[34][35] Support for sub-addressing varies significantly among email providers and mail transfer agents, as it is not standardized in RFC 5322 and relies on server-side configuration to recognize and process the delimiters by stripping them along with the tag before final delivery. 
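A receiving system that honours sub-addressing has to separate the base local-part from the tag before delivery. The following is a minimal sketch of that step, assuming a single configurable delimiter; the function name split_subaddress and the example addresses are hypothetical, and real mail servers implement this in their own configuration rather than in application code.

```python
def split_subaddress(address: str, delimiter: str = "+") -> tuple[str, str, str]:
    """Split an address into (base local-part, tag, domain).

    Assumptions: the address contains an "@", the local-part is unquoted,
    and the first occurrence of the delimiter starts the tag.
    """
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("address has no '@'")
    # Everything after the first delimiter is treated as the tag.
    base, _, tag = local.partition(delimiter)
    return base, tag, domain


# Hypothetical example: the tag "news" is stripped for delivery to "user".
base, tag, domain = split_subaddress("user+news@example.com")
delivery_address = f"{base}@{domain}"
```

Because the delimiter is a parameter, the same sketch covers systems that use a hyphen instead of a plus sign, although a hyphen is also common inside ordinary usernames and would then be misread as a tag.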
While sub-addressing is widely implemented in consumer services like Gmail, Outlook.com, and Proton Mail, enterprise systems or older infrastructures may not support it, potentially causing delivery failures if the tag is not handled.[32][36]
Limitations of sub-addressing include inconsistent treatment of case: whether the tag and base local-part are compared case-sensitively is domain-dependent, although most modern providers treat the entire local-part as case-insensitive in practice. Furthermore, the feature can be abused in spam filtering scenarios, as attackers might leverage varying provider support to generate multiple aliases and bypass blacklists or rate limits, though it is more commonly employed by legitimate users to detect and mitigate unwanted mail.[30][32][37]
Examples
Valid Email Addresses
Valid email addresses must adhere to the syntax rules defined in RFC 5322, which specifies the permissible structures for the local-part and domain to ensure proper parsing and routing in Internet mail.[1] Basic valid examples demonstrate straightforward formats using alphanumeric characters in the local-part and a simple domain name. For instance, user@domain.com is valid because the local-part "user" consists solely of allowed letters, and the domain "domain.com" follows the dot-atom structure with periods separating label sequences of permitted characters.[1] Similarly, [email protected] is acceptable, as the local-part incorporates dots to separate components without leading, trailing, or consecutive periods, while the domain uses hierarchical labels connected by dots, all within the atext character set (letters, digits, and specific symbols).[1]
The local-part supports quoting to include spaces or other non-standard characters. An example is "user name"@domain.com, where double quotes enclose the local-part to allow the embedded space, adhering to the quoted-string production in the standard.[1] To illustrate the full range of special characters permitted without quoting, !#$%&'*+-/=?^_`{|}~@domain.com is valid, as each symbol belongs to the atext set defined for unquoted local-parts, enabling robust handling of diverse identifiers up to 64 octets in length.[1][38]
Sub-addressing extends functionality within the local-part syntax. For example, [email protected] is syntactically correct, since the plus sign (+) is an allowed atext character, allowing the tag to augment the base local-part without violating length limits or character restrictions.[1]
Domain variations further highlight flexibility in addressing. The address user@[IPv6:2001:db8::1] uses a domain literal enclosed in brackets to specify an IPv6 address directly, bypassing DNS resolution as permitted for transport scenarios.[1] Additionally, [email protected] is valid, with the domain incorporating hyphens within labels, as hyphens are part of the permitted characters and conform to the overall domain length constraint of 255 octets.[1][28]
These examples reflect the core syntax rules for local-parts and domains, providing a foundation for compliant email construction.[1]
Invalid Email Addresses
Invalid email addresses are those that violate the syntactic rules defined for the Internet Message Format, primarily outlined in RFC 5322, which specifies the structure of an addr-spec as a local-part followed by "@" and a domain.[7] These violations prevent proper parsing and transport in email systems, leading to rejection during validation or delivery attempts. Note that some examples below are syntactically valid but practically invalid due to real-world constraints like DNS resolution or system compatibility. Common issues arise from missing components, improper character usage, or exceeding length constraints, as detailed in standards like RFC 3696, which imposes practical limits on address components to ensure compatibility with SMTP protocols.[38] One frequent syntax error is the absence of a domain after the "@" symbol, as in "user@", which fails because the addr-spec requires a non-empty domain following the separator.[7] Similarly, while "user@domain" is syntactically valid as a single-label domain per RFC 5322, it is practically invalid because it lacks a top-level domain (TLD) required for DNS resolution in Internet mail systems.[7] Another basic violation occurs with unquoted spaces in the local-part, such as "user [email protected]", since spaces are not permitted in dot-atom form without enclosing quotes, and quoted-strings must properly escape such characters.[21] More explicit syntax violations include multiple "@" symbols, like "user@@domain.com", which contravenes the single-separator rule in the addr-spec definition, allowing only one "@" between local-part and domain.[7] Consecutive dots in the domain, as in "[email protected]", are prohibited because the dot-atom syntax mandates at least one atext character (letters, digits, or specified specials) between dots.[21] Addresses exceeding 254 characters in total length, such as a contrived local-part of 200 characters followed by a long domain, are invalid due to SMTP command length restrictions 
clarified in errata for RFC 3696 and aligned with RFC 5321's path limits.
Deprecated or non-standard forms further illustrate invalidity under modern rules. For instance, a domain starting with a dot, like "[email protected]", violates the dot-atom requirement that labels begin with atext, not a period.[39] Although "[email protected]" is syntactically valid since digits are allowed in atext, numeric-only local-parts are non-standard in many legacy systems and may fail delivery in contexts enforcing alphanumeric requirements for mailboxes.[21] Additionally, comments, as in "user(comment)@domain.com", are rejected by most validators: the SMTP mailbox syntax in RFC 5321 has no comment production, and although RFC 5322 still tolerates comments and folding whitespace in header addresses, it states that they should not be used around the "@".[39]
| Invalid Example | Reason for Invalidity | Relevant RFC Reference |
|---|---|---|
| user@ | Missing domain after "@" | RFC 5322, Section 3.4.1[7] |
| user@domain | Lacks TLD (syntactically valid but practically invalid for DNS resolution) | RFC 5322, Section 3.4.1[7] |
| user [email protected] | Unquoted space in local-part | RFC 5322, Section 3.2.3[21] |
| user@@domain.com | Multiple "@" symbols | RFC 5322, Section 3.4.1[7] |
| [email protected] | Consecutive dots in domain | RFC 5322, Section 3.2.3[21] |
| [Very long address exceeding 254 chars]@domain.com | Exceeds total length limit | RFC 3696 Errata |
| [email protected] | Leading dot in domain | RFC 5322, Section 4[39] |
| user(comment)@domain.com | Comments are absent from the SMTP mailbox syntax and discouraged around "@" in headers | RFC 5321, Section 4.1.2; RFC 5322, Section 3.4.1[39] |
Internationalized Email Addresses
Internationalized email addresses incorporate non-ASCII characters from various scripts and languages, enabling users worldwide to employ native writing systems in both the local-part and domain components. These addresses conform to standards that extend traditional ASCII-based email syntax, allowing Unicode characters while maintaining compatibility with existing infrastructure. For instance, domains with accented or non-Latin characters are encoded using the Internationalizing Domain Names in Applications (IDNA) protocol, which converts them to Punycode for DNS resolution.[40]
A common example involves an IDNA domain, such as user@exämple.com, where the domain "exämple.com" is represented in Punycode as xn--exmple-cua.com to ensure ASCII compatibility in the Domain Name System (DNS). This format supports internationalized domain names (IDNs) by mapping Unicode labels to ASCII-compatible encoding (ACE) strings prefixed with "xn--". Similarly, an ASCII local-part paired with an internationalized domain, like user@dömäin.tld, uses the Punycode equivalent xn--dmin-moa0i.tld for the domain, demonstrating mixed-language support in email routing.[40]
The local-part can also include Unicode characters when the Simple Mail Transfer Protocol (SMTP) server supports the SMTPUTF8 extension, which permits UTF-8 encoding throughout the email transmission process. For example, café@domain.com or é[email protected] are valid under this extension, as it expands the allowable characters in the local-part beyond ASCII while preserving quoted or bracketed structures from earlier standards. Without SMTPUTF8, such addresses may fail delivery, as legacy systems expect ASCII-only local-parts.[41]
Fully internationalized addresses combine Unicode in both parts, such as π@δóμäïň.com, where the local-part uses the Greek letter pi (π) and the domain mixes accented Latin characters with Greek letters such as delta (δ). For DNS resolution the domain is converted to its "xn--"-prefixed A-label, and the entire address requires SMTPUTF8 for transport to handle the non-ASCII local-part. Another illustration is 您好@example.com, featuring Chinese characters in the local-part (U+60A8 U+597D), which is supported in contexts like X.509 certificates for email verification. These examples highlight how internationalized addresses facilitate global communication but depend on end-to-end UTF-8 support to avoid downgrading or rejection.[41]
Validation and Verification
Syntax Validation
Syntax validation of an email address involves verifying its format against established standards, such as those defined in RFC 5322, without performing any network queries or existence checks. This process ensures the address adheres to syntactic rules for the local-part (before the @ symbol) and the domain (after the @ symbol), focusing on character sets, lengths, and structural elements. The primary goal is to identify malformed addresses early, preventing errors in applications like user registration or data entry forms. Regex-based validation is a common approach, using regular expressions to match the complex patterns outlined in RFC 5322. For the local-part, which can include up to 64 characters of letters, digits, and special symbols like dots (.) and hyphens (-), or a quoted string for unusual characters, a comprehensive regex must handle escaped characters; for the domain, it must also handle domain literals (e.g., [IPv4-address]). The domain portion requires patterns for dot-separated labels, each consisting of 1-63 characters from letters, digits, and hyphens, excluding leading or trailing hyphens. An example regex for basic validation could be ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$, but more robust implementations account for RFC 5322's allowances like comments (in parentheses) and folding whitespace, though these are rarely used in practice. Such patterns are derived directly from the RFC's ABNF (Augmented Backus-Naur Form) grammar for the addr-spec production rule.
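To make the trade-off concrete, the basic pattern quoted above can be compiled and tried against a few inputs. This is only a sketch of how such a check might be wired up in Python; the names BASIC_EMAIL and looks_like_email are hypothetical, and the pattern is the simplified one from the text, not a full RFC 5322 grammar.

```python
import re

# The simplified pattern from the text; it covers common addresses only.
BASIC_EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def looks_like_email(candidate: str) -> bool:
    """Return True if candidate matches the basic pattern (not full RFC 5322)."""
    # fullmatch avoids accepting a trailing newline, which "$" alone would allow.
    return BASIC_EMAIL.fullmatch(candidate) is not None


# The pattern errs in both directions:
accepted_but_malformed = looks_like_email("user..name@example.com")   # consecutive dots slip through
rejected_but_valid = not looks_like_email('"user name"@example.com')  # quoted local-part is refused
```

Both flags evaluate to True, which shows why a pattern of this kind is best treated as a quick plausibility filter rather than a standards-conformance check.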
Algorithmic checks provide an alternative or complementary method, parsing the address step-by-step rather than relying on a single regex. This begins by locating the @ symbol, ensuring exactly one occurrence and that it is neither at the start nor end of the string. The local-part is then validated for length (up to 64 octets) and permissible characters, including checking for properly quoted sections if present (e.g., "user name"@example.com). For the domain, the string is split by dots to verify each label's length (1-63 characters) and composition, confirming it ends with a top-level domain of at least two characters and disallowing consecutive dots or dots at the beginning or end. These checks align with RFC 5321 for SMTP envelope syntax but are applied locally without transmission. Tools implementing this often use state machines or recursive descent parsers for accuracy.
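The step-by-step procedure described above might be sketched as follows. It is an illustrative simplification: quoted local-parts, domain literals, and internationalized addresses are not handled, and the names LABEL, LOCAL_CHARS, and validate_address, as well as the wording of the messages, are assumptions made for this example.

```python
import re

# A DNS label: 1-63 letters, digits, or hyphens, with no leading or trailing hyphen.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")
# Characters allowed in an unquoted local-part (atext plus the dot).
LOCAL_CHARS = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~.-]+")


def validate_address(address: str) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    # Step 1: exactly one "@" separator.
    if address.count("@") != 1:
        return ["expected exactly one '@'"]
    local, domain = address.split("@")
    problems = []

    # Step 2: local-part length and characters (dot-atom form only).
    if not local:
        problems.append("empty local-part")
    elif len(local.encode("utf-8")) > 64:
        problems.append("local-part longer than 64 octets")
    elif (not LOCAL_CHARS.fullmatch(local) or local.startswith(".")
          or local.endswith(".") or ".." in local):
        problems.append("local-part is not a valid dot-atom")

    # Step 3: domain labels, including empty labels caused by stray dots.
    if not domain:
        problems.append("empty domain")
    else:
        labels = domain.split(".")
        if any(not LABEL.fullmatch(label) for label in labels):
            problems.append("domain has an empty or malformed label")
        if len(labels) < 2 or len(labels[-1]) < 2:
            problems.append("domain lacks a top-level label of two or more characters")
    return problems
```

Returning a list of problems instead of a single boolean is a design choice for this sketch: it lets a form report every issue at once, at the cost of a slightly more verbose caller.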
Programming libraries and tools facilitate syntax validation in various languages, balancing strict adherence to standards with practical usability. In Python, the email.utils module's parseaddr function or the validate_email package from PyPI performs checks based on RFC 5322, returning structured components or raising exceptions for invalid formats; it supports both strict mode (rejecting non-ASCII without quoting) and lenient mode (accepting common real-world variations). Similarly, Java's javax.mail.internet.InternetAddress class validates via its constructor, throwing AddressException for syntax errors and offering options for lenient parsing to handle legacy or internationalized addresses. Strict parsing ensures compliance but may reject valid yet uncommon formats like those with comments, while lenient approaches improve user experience by accepting 99% of practical addresses at the cost of potential false positives. Pros of library use include built-in handling of edge cases and updates for standard revisions, whereas cons involve dependency on specific implementations that might not cover all RFC nuances.
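As a small illustration of the library route, Python's standard email.utils.parseaddr splits a header-style string into a display name and an address. The wrapper below is a sketch; extract_addr_spec is a hypothetical helper name, and note that parseaddr signals failure by returning a pair of empty strings rather than by raising.

```python
from email.utils import parseaddr


def extract_addr_spec(header_value: str) -> str:
    """Return the address portion of a header-style string, or "" if none is found."""
    _display_name, addr = parseaddr(header_value)
    return addr


# A display name followed by an angle-bracketed address (example.com is a placeholder).
addr = extract_addr_spec("User <user@example.com>")
```

A function like this only isolates the addr-spec; the result would still need one of the syntax checks discussed in this section before being trusted.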
Common pitfalls in syntax validation arise from oversimplification or misunderstanding of the standards. A frequent error is using basic regex patterns like ^[\w\.-]+@[\w\.-]+\.[\w]{2,}$, which fail to handle quoted local-parts (e.g., "O'Brien"@example.com) or international characters without proper encoding, leading to rejection of valid addresses. Another issue is ignoring domain length limits or allowing invalid top-level domains, as domains must conform to DNS rules where labels avoid certain reserved characters. Additionally, validators might overlook the distinction between display names and actual addresses in full RFC 822-style strings (e.g., User [email protected]), parsing only the addr-spec. These errors can result in high false negative rates for simplistic checkers compared to full RFC compliance, emphasizing the need for comprehensive testing against diverse examples.
Existence Verification
Existence verification refers to methods used to determine whether an email address corresponds to an active mailbox that can receive messages, focusing on deliverability and user activity rather than format alone. A common technique is SMTP probing, which involves initiating a connection to the recipient's mail server and sending the RCPT TO command as defined in the Simple Mail Transfer Protocol (SMTP). This command specifies the recipient address, and the server responds with codes indicating acceptance or rejection; for instance, a 250 OK response signifies that the server will accept mail for the address, while a 550 response (e.g., "User unknown") indicates the address does not exist.[42] The probe simulates the early stages of email transmission (connecting via the domain's MX record, greeting the server, and querying the recipient) without sending a full message or body, thereby testing server-side confirmation of the address.[43]
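How a prober might interpret the reply to RCPT TO can be sketched without any network code. The classification below relies only on the standard SMTP reply classes (2yz success, 4yz transient failure, 5yz permanent failure); the function name classify_rcpt_reply and the returned labels are hypothetical.

```python
def classify_rcpt_reply(code: int) -> str:
    """Map an SMTP reply code received after RCPT TO to a coarse outcome."""
    if 200 <= code < 300:
        return "accepted"           # e.g. 250: the server will take mail for this address
    if 400 <= code < 500:
        return "temporary-failure"  # e.g. 450: retry later; nothing is proven either way
    if 500 <= code < 600:
        return "rejected"           # e.g. 550: permanent failure such as an unknown user
    return "unknown"
```

An "accepted" outcome is weaker evidence than it looks, since a server may accept every recipient at this stage and bounce the message later.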
Callback verification, also known as double opt-in, provides an interactive confirmation by sending a verification email to the address and requiring the recipient to respond, typically by clicking a link or replying with a code. This method verifies not only existence but also the user's intent and control over the mailbox, as the address is not activated until confirmation is received. In practice, after an initial signup, an automated confirmation email is dispatched with clear instructions for action, ensuring compliance with regulations like CAN-SPAM and improving list quality by filtering out invalid or mistyped entries.[44]
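One way to implement the confirmation link is to derive the token from the address with a keyed hash, so the server does not need to store pending tokens. This is a sketch under stated assumptions, not a description of any particular provider's method: make_token and confirm are hypothetical names, the secret is a placeholder, and a real deployment would normally also embed an expiry time.

```python
import hashlib
import hmac


def make_token(secret: bytes, address: str) -> str:
    """Derive a confirmation token for the address exactly as it was submitted."""
    return hmac.new(secret, address.encode("utf-8"), hashlib.sha256).hexdigest()


def confirm(secret: bytes, address: str, token: str) -> bool:
    """Check a token returned via the confirmation link, in constant time."""
    return hmac.compare_digest(make_token(secret, address), token)


# Hypothetical flow: the token is mailed to the address, then echoed back by the link.
SECRET = b"replace-with-a-random-server-side-key"
token = make_token(SECRET, "user@example.com")
is_confirmed = confirm(SECRET, "user@example.com", token)
```

The address is hashed without case folding because, as noted earlier, the local-part may be case-sensitive on the receiving system.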
Third-party services offer automated existence verification through APIs, often combining SMTP probing with proprietary checks to validate addresses at scale. For example, Hunter.io's Email Verifier performs an SMTP test to assess if the address exists by simulating a server handshake, alongside domain and database lookups, achieving high accuracy for business emails.[45] Similarly, NeverBounce integrates SMTP validation within its 20+ step process, conducted from multiple global locations to confirm deliverability and reduce bounces, supporting integrations with over 85 platforms.[46][47] These tools are widely used in marketing to clean lists, but they raise privacy concerns, as probing can inadvertently expose valid addresses to unauthorized parties or facilitate spam if data is mishandled.[48]
Despite their utility, these methods have significant limitations. Catch-all domains, configured to accept emails for any local-part (e.g., *@example.com routes all to a single inbox), produce false positives by returning acceptance codes for non-existent addresses, complicating accurate verification.[49] Anti-spam protections further hinder probing; since the late 1990s many servers have restricted or obscured their responses to recipient queries to prevent address enumeration by spammers, often returning generic errors or temporary failures (e.g., 450 codes). High-volume probes can trigger rate limiting, firewalls, or blacklisting, rendering services unreliable over time and potentially damaging the verifier's IP reputation.[48]
Internationalization
IDNA and Domain Internationalization
The Internationalizing Domain Names in Applications (IDNA) protocol enables the use of non-ASCII characters in domain names by defining a mechanism to map Unicode strings to ASCII-compatible encodings, ensuring compatibility with the Domain Name System (DNS).[50] Specified in RFC 5890 through RFC 5894, IDNA2008 (the current version) replaces the earlier IDNA2003 framework and relies on Punycode for the actual encoding process.[50] Under IDNA, a domain label containing Unicode characters—known as a U-label—is converted to an A-label, which is an ASCII string prefixed with "xn--" and encoded in Punycode, allowing it to be stored and resolved in the DNS without modifications to the underlying infrastructure.[51] Punycode, detailed in RFC 3492, is a bootstring encoding algorithm that transforms a Unicode string into a representation using only ASCII letters, digits, and hyphens, preserving the original string's order and length constraints.[52] The process separates basic ASCII characters (which remain unchanged) from non-ASCII ones, then encodes the latter using a base-36 numbering system with a delimiter ("-") to indicate the insertion point for the encoded portion.[52] For example, the Unicode label "café" (where "é" is U+00E9) encodes to the A-label "xn--caf-dma", which can then be used in DNS queries.[52] This encoding ensures reversibility: decoding an A-label yields the original U-label in Unicode Normalization Form C (NFC).[50] In the context of email addresses, IDNA integration occurs at the DNS level, where MX records for internationalized domains are registered and resolved using A-label forms.[50] SMTP protocols, as defined in RFC 5321, require domain names in commands like MAIL FROM and RCPT TO to be in ASCII, so applications must convert U-labels to A-labels before performing DNS lookups for MX records.[42] This means email servers and clients need IDNA-aware implementations to handle the conversion; otherwise, resolution fails for non-ASCII domains.[51] 
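The U-label to A-label conversion can be illustrated with the punycode codec in Python's standard library. This sketch performs only the raw Punycode step plus the "xn--" prefix; it assumes the input is already lowercase and in NFC, and it skips the IDNA2008 validity and bidirectional checks, so to_a_label and domain_to_ascii (both hypothetical names) are not a substitute for a full IDNA implementation.

```python
def to_a_label(u_label: str) -> str:
    """Encode one label as an A-label; pure-ASCII labels are returned unchanged.

    Assumption: u_label is already normalized (NFC, lowercase) and valid under IDNA.
    """
    if u_label.isascii():
        return u_label
    return "xn--" + u_label.encode("punycode").decode("ascii")


def domain_to_ascii(domain: str) -> str:
    """Convert each dot-separated label of a domain independently."""
    return ".".join(to_a_label(label) for label in domain.split("."))


a_label = to_a_label("caf\u00e9")                  # "café" with é as U+00E9
ascii_domain = domain_to_ascii("ex\u00e4mple.com")  # "exämple.com"
round_trip = a_label[len("xn--"):].encode("ascii").decode("punycode")
```

Here a_label is "xn--caf-dma", matching the example in the text, ascii_domain is "xn--exmple-cua.com", and round_trip recovers the original label, which illustrates the reversibility mentioned above.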
Browser and server support for IDNA has become widespread, with modern systems automatically applying Punycode encoding during domain registration and resolution.[50]
IDNA imposes several limitations to ensure security and stability, including validity checks that prohibit certain Unicode code points classified as DISALLOWED in RFC 5892, such as many punctuation marks and symbols that could lead to confusion or attacks. For right-to-left (RTL) scripts like Arabic or Hebrew, RFC 5893 defines bidirectional rules to mitigate visual spoofing risks: RTL labels must begin and end with specific character types (e.g., starting with R, AL, or L, and ending with R, AL, EN, or AN, optionally followed by non-spacing marks), and they cannot mix certain numeric types or include left-to-right characters inappropriately.[53] These rules prevent unrestricted RTL usage in domains, requiring strict validation during encoding to avoid invalid labels that could be rejected by DNS resolvers.[53]
Local-part Internationalization and SMTPUTF8
The original specification for the Simple Mail Transfer Protocol (SMTP) in RFC 5321 restricts the local-part of email addresses to ASCII characters, explicitly prohibiting non-ASCII octets (those with the high-order bit set to 1) and ASCII control characters (decimal values 0-31 and 127).[26] This limitation confines usernames to the Latin alphabet, numerals, and a limited set of symbols, creating significant challenges for international users who wish to employ native scripts such as Cyrillic, Arabic, or Chinese characters in their email addresses.[26] As global internet usage expands beyond English-speaking regions, this ASCII-only constraint hinders email accessibility, cultural inclusivity, and the ability to create personalized, linguistically appropriate usernames.[41] To overcome these restrictions, RFC 6531 defines the SMTPUTF8 extension, which extends SMTP to support the transport and delivery of email messages containing internationalized addresses and header information encoded in UTF-8.[41] This extension permits UTF-8 characters in the local-part of mailbox addresses (e.g., before the "@" symbol) and in header fields, while domain names remain encoded via Internationalizing Domain Names in Applications (IDNA) for DNS compatibility.[41] Servers implementing SMTPUTF8 must advertise their capability by including the "SMTPUTF8" keyword—without parameters—in the response to the client's EHLO command, informing the sender that non-ASCII content can be transmitted without modification.[41] Without this advertisement, clients are prohibited from sending internationalized messages to avoid delivery failures.[41] Server implementation of SMTPUTF8 involves several key requirements to ensure reliable handling of UTF-8 content. 
Servers must validate UTF-8 syntax in mailbox local-parts and headers, perform IDNA-compliant domain lookups, and store messages using UTF-8 encoding, typically in conjunction with the 8BITMIME extension (RFC 6152) to support 8-bit data in message bodies.[41] No inspection of the message body for non-ASCII content is mandated, but servers should reject invalid UTF-8 sequences with appropriate error codes, such as 553 for mailbox issues.[41] In cases where the receiving server does not support SMTPUTF8, sending clients must not attempt delivery and should either reject the transaction (e.g., with a 550 or 553 response) or, if configured, downgrade the message to an ASCII-compatible form, though the latter risks data loss and is discouraged.[41]
Adoption of the SMTPUTF8 extension remains partial and uneven across the email ecosystem. Major providers such as Google Workspace (including Gmail) have supported SMTPUTF8 since 2014, enabling users to send and receive emails with UTF-8 local-parts. Similarly, Microsoft has integrated support in Exchange Server 2019 and later, as well as in Microsoft 365 environments.[54] However, legacy systems, on-premises deployments of older Exchange versions, and many smaller or regional providers continue to lack compatibility, resulting in bounces and errors for internationalized messages, such as the common "SMTPUTF8 is required, but was not offered" rejection.[55] Recent advancements, including ICANN's achievement of full Email Address Internationalization (EAI) support in its systems in July 2025 and the Universal Acceptance Steering Group's (UASG) FY2025-2029 strategic plan focusing on governments and providers, signal increasing momentum, though global uptake was limited to approximately 10% of domains as of 2021, with ongoing efforts to accelerate deployment in multilingual regions.[56][57]
History and Evolution
Early Development
The development of email addresses began in the context of the ARPANET, the precursor to the modern Internet, where early messaging systems required a way to specify recipients across networked computers. In 1971, Ray Tomlinson, working at Bolt, Beranek and Newman (BBN), implemented the first program to send electronic mail between users on different ARPANET hosts using the TENEX operating system. He introduced the "@" symbol as a separator to denote "user at host," creating the foundational format of user@host to distinguish the recipient's identifier from the destination machine. This choice of the "@" was arbitrary among available non-alphanumeric symbols on the keyboard, but it quickly became the standard delimiter for network email addressing.[58]
Early standardization efforts followed to address inconsistencies in mail headers and formats across ARPANET systems. RFC 561, published in September 1973 by Abhay Bhushan, Ray Tomlinson, and co-authors, proposed uniform network mail headers, defining fields such as "FROM:", which combined a user phrase with a host indicator using "@" or "at" (e.g., "Neuman@BBN-TENEXA"). It introduced support for hierarchical routing paths with multiple "@" signs (e.g., "User@hosta@local-net1@major-net") and explicitly restricted characters to the printable subset of the 128-character ASCII set from TELNET (codes 32-126 decimal), establishing an ASCII-only assumption that persisted in early implementations. By 1973, email had already become dominant on ARPANET, comprising about 75% of network traffic, underscoring its rapid adoption among researchers.[59][60][2]
RFC 822, published in 1982, standardized the address as local-part@domain, where the domain is a dot-separated sequence of sub-domains (e.g., "[email protected]"). It eliminated multi-"@" paths in favor of source routing via separate mechanisms, making addresses more logical and extensible for internetwork use while preserving the local-part's case sensitivity and uninterpreted nature by intermediate systems. RFC 822 retained the ASCII character restriction, focusing on printable US-ASCII for compatibility. Concurrently, email addressing spread beyond ARPANET through systems like UUCP (Unix-to-Unix Copy Protocol), introduced in the late 1970s for dial-up Unix networks, which initially used "bang-path" notation (e.g., "host1!host2!user") but increasingly integrated with Internet-style @ addresses for interoperability in the early 1980s, enabling wider adoption in academic and research communities.[3]