Numeric character reference
A numeric character reference (NCR) is a markup construct used in SGML and SGML-derived languages such as HTML and XML to represent a specific Unicode character by referencing its code point with a numeric value, allowing the inclusion of characters that may be reserved, difficult to input directly, or unavailable in certain character sets.[1][2] In HTML, as defined in the WHATWG HTML Living Standard, an NCR begins with the sequence `&#`, followed either by decimal digits or by `x` and hexadecimal digits, and ends with a semicolon (;), such as `&#65;` for the Latin capital letter A (U+0041) or `&#x41;` for the same character in hexadecimal form.[1] The parser converts the numeric value to the corresponding Unicode code point, emitting it as a character token, while handling invalid references, such as those exceeding U+10FFFF or representing surrogate code points, by substituting the replacement character U+FFFD.[1] Semicolons are required for conformance; a missing semicolon triggers a parse error but does not prevent processing in tolerant parsers.[1]
Similarly, in XML 1.0 as specified by the W3C, NCRs follow the production CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';', where the first alternative is the decimal form and the second the hexadecimal form, ensuring representation of legal XML characters within the defined ranges (e.g., U+0020 to U+10FFFF, excluding surrogates and certain controls).[2] These references must denote valid characters per the XML Char production and are expanded immediately by processors into the referenced character data, facilitating interoperability across diverse encoding environments.[2] Both standards emphasize NCRs as a fundamental mechanism for escaping special characters like < (`&#60;` or `&#x3C;`) and & (`&#38;` or `&#x26;`) in markup, distinct from named character references that use predefined entity names.[1][2]
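The XML productions above can be sketched as a regular expression plus a range check against the Char production. This is a minimal illustration under stated assumptions, not a conforming XML processor; the names `CHAR_REF`, `is_xml10_char`, and `resolve_char_ref` are hypothetical helpers introduced only for this example.

```python
import re

# CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'   (XML 1.0)
CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")


def is_xml10_char(cp: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def resolve_char_ref(ref: str) -> str:
    """Expand one character reference, or raise ValueError if it is not legal."""
    match = CHAR_REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a character reference: {ref!r}")
    decimal, hexadecimal = match.groups()
    cp = int(decimal) if decimal is not None else int(hexadecimal, 16)
    if not is_xml10_char(cp):
        raise ValueError(f"U+{cp:04X} is not a legal XML 1.0 character")
    return chr(cp)


print(resolve_char_ref("&#60;"))    # <
print(resolve_char_ref("&#x26;"))   # &
```

Note that the pattern accepts only a lowercase `x`, so `&#X3C;` is rejected, as is a reference with no terminating semicolon.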
Syntax and Formats
Decimal Form
The decimal form of a numeric character reference begins with an ampersand (&) immediately followed by a number sign (#), then one or more decimal digits (0 through 9) that represent the Unicode code point value in base-10, and ends with a semicolon (;).[2][3]
This form is used to reference Unicode code points in the range from U+0000 to U+10FFFF, subject to the validity rules defined by the markup language specification.[4]
Leading zeros are permitted in the decimal sequence and do not change the interpreted value, allowing flexibility in formatting while maintaining equivalence (e.g., `&#0065;` equals `&#65;`).[2][5]
For example, `&#931;` denotes the Greek capital letter sigma (Σ) at code point U+03A3.[3]
The decimal form serves as the base-10 alternative to the hexadecimal notation for referencing the same Unicode code points.[2][3]
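The decimal form can be sketched in a few lines of Python. The helper names `to_decimal_ncr` and `from_decimal_ncr` are hypothetical, and the decoder only checks the syntax described above; it does not apply any markup-language validity rules.

```python
def to_decimal_ncr(ch: str) -> str:
    """Build the decimal reference for a single character."""
    return f"&#{ord(ch)};"


def from_decimal_ncr(ref: str) -> str:
    """Decode a syntactically well-formed decimal reference such as '&#931;'."""
    if not (ref.startswith("&#") and ref.endswith(";")):
        raise ValueError(f"not a decimal reference: {ref!r}")
    digits = ref[2:-1]
    if not digits or any(d not in "0123456789" for d in digits):
        raise ValueError(f"not a decimal reference: {ref!r}")
    return chr(int(digits))  # int() disregards leading zeros


print(to_decimal_ncr("Σ"))            # &#931;
print(from_decimal_ncr("&#0065;"))    # A
```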
Hexadecimal Form
The hexadecimal form of a numeric character reference provides an alternative to the decimal variant by expressing the Unicode code point in base-16 notation.[6][7] It begins with the sequence `&#x` (the uppercase `&#X` is accepted in HTML only), followed by one or more hexadecimal digits representing the code point value, and terminates with a semicolon (;).[6][7] This format allows for the inclusion of characters whose code points are more succinctly represented in hexadecimal, particularly for higher values beyond the basic ASCII range.[6]
Hexadecimal digits in this reference are case-insensitive, accepting both uppercase (A-F) and lowercase (a-f) letters alongside digits 0-9; for instance, both `&#x3A3;` and `&#x3a3;` resolve to the Greek capital letter sigma (Σ, U+03A3).[6][7] The valid numeric range mirrors that of the decimal form, from U+0000 to U+10FFFF, subject to language-specific restrictions on certain code points such as controls and noncharacters.[6][8] Leading zeros are permitted but not required, as in `&#x000A;` for the line feed character (U+000A), and they do not alter the interpreted value.[6][7]
A practical example is `&#x2665;`, which renders as ♥ (black heart suit, U+2665). Its decimal counterpart, `&#9829;`, uses the same number of digits, so the advantage of the hexadecimal form here is not brevity but that its digits match the U+2665 notation used in Unicode charts directly.[6][7]
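As a small sketch of the hexadecimal form, the following Python builds references in either letter case and checks the equivalences mentioned above. The helper `to_hex_ncr` is a hypothetical name used only for illustration.

```python
def to_hex_ncr(ch: str, upper: bool = True) -> str:
    """Build the hexadecimal reference for a single character."""
    digits = format(ord(ch), "X" if upper else "x")
    return f"&#x{digits};"


# int(..., 16) accepts either letter case and disregards leading zeros.
assert int("3A3", 16) == int("3a3", 16) == 931
assert int("000A", 16) == 0x0A

print(to_hex_ncr("♥"))                # &#x2665;
print(to_hex_ncr("Σ", upper=False))   # &#x3a3;
print(ord("♥"))                       # 9829, the decimal value of U+2665
```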
Applications in Markup Languages
In HTML
Numeric character references (NCRs) in HTML serve to represent characters that cannot be directly entered from a keyboard or that might conflict with markup syntax, such as the less-than sign (<) or ampersand (&). For instance, the reserved ampersand can be encoded as `&#38;` to prevent it from being interpreted as the start of another entity.[9]
In HTML5, both decimal and hexadecimal NCR forms are supported, allowing references to Unicode code points up to U+10FFFF. The decimal form uses `&#` followed by decimal digits, while the hexadecimal form uses `&#x` or `&#X` followed by hexadecimal digits; both end with a semicolon (;) that terminates the reference. The semicolon is required for conformance, but for compatibility parsers still resolve a reference that lacks one and report a parse error. The omission becomes ambiguous when the reference is followed by further digits, which are then read as part of the numeric value.[9][10]
NCRs are resolved to their corresponding Unicode characters during parsing, independent of the document's declared encoding, with UTF-8 assumed as the default if none is specified. Invalid NCRs, such as those referencing code points outside the Unicode range or malformed sequences, are typically treated as literal text by parsers or replaced with the Unicode replacement character U+FFFD.[3][11]
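Python's standard-library `html.unescape` is documented as applying the HTML5 rules for valid and invalid character references, so it offers a convenient way to observe the behavior described here. The sketch below reflects that one implementation and is not a statement about every browser or parser.

```python
import html

# Valid references resolve to the same character in either notation.
print(html.unescape("&#931; and &#x3A3;"))   # Σ and Σ

# Out-of-range, surrogate, and null references yield U+FFFD.
for ref in ("&#x110000;", "&#xD800;", "&#0;"):
    assert html.unescape(ref) == "\ufffd"

# A sequence with no digits is not a reference and stays literal text.
assert html.unescape("&#;") == "&#;"
```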
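HTML5 parsers also accept NCRs in the C1 control range (U+0080 to U+009F): such references are parse errors, and where a mapping exists they are converted during tokenization to the characters that Windows-1252 assigns to the same byte values (for example, `&#150;` becomes the en dash, U+2013). Authors should not rely on this behavior, since these references may not be handled consistently across tools due to varying treatment of control codes.[12][13]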
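The remapping of C1-range references can be observed by comparing `html.unescape` with a Windows-1252 decode of the same byte value. This is an illustrative sketch using three sample values chosen for this example; a few bytes in the range have no Windows-1252 assignment and are deliberately left out.

```python
import html

# 128 (0x80), 150 (0x96), and 156 (0x9C) are C1 positions that
# Windows-1252 assigns to the euro sign, en dash, and small oe ligature.
for number in (128, 150, 156):
    from_reference = html.unescape(f"&#{number};")
    from_cp1252 = bytes([number]).decode("cp1252")
    print(number, repr(from_reference), from_reference == from_cp1252)
```

Each line of output should end in `True`, showing that the reference resolves to the Windows-1252 character rather than to the C1 control code itself.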
A unique aspect of HTML parsing involves resolving ambiguities, such as distinguishing `&#59;` (a reference that encodes a semicolon) from a plain semicolon that terminates a reference. The terminator marks where the number ends: without it, a parser following the WHATWG algorithm reads every subsequent digit as part of the reference, while a `&#` followed by no digits at all is left as literal text. In contrast to XML's stricter requirements, HTML's more lenient approach ensures broader compatibility with legacy content.[14]
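Parsers differ in how lenient they are; CPython's `html.unescape`, which is documented as following the HTML5 rules, resolves unterminated numeric references rather than leaving them literal. The sketch below shows how the presence or absence of the semicolon changes the result in that implementation.

```python
import html

# With the terminator, the reference ends at 65 and "5" is ordinary text.
assert html.unescape("&#65;5") == "A5"

# Without it, all following digits are read as one number (655).
assert html.unescape("&#655") == chr(655)

# The reference still resolves when a non-digit follows.
assert html.unescape("&#65 apples") == "A apples"

# "&#59;" encodes a semicolon character; its own ";" is the terminator.
assert html.unescape("&#59;") == ";"
```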
In XML and SGML
Numeric character references were introduced in the Standard Generalized Markup Language (SGML), defined by ISO 8879:1986, to reference characters by their numeric position within the document character set, a coded character set specified in the SGML declaration that determines the repertoire of allowable characters.[15] This mechanism allowed authors to include characters not directly available in a limited encoding by substituting them with references like `&#N;`, where N denotes the character's code position.[15] In SGML, these references are resolved relative to the declared document character set, providing flexibility for various international character sets, though SGML itself does not mandate Unicode.[16]
XML, as a subset of SGML, adapted numeric character references to align with Unicode (ISO/IEC 10646), requiring them to denote valid Unicode code points while mandating a terminating semicolon in all cases for unambiguous parsing.[17] Both decimal (`&#d;`) and hexadecimal (`&#xh;`) forms have been supported since XML 1.0's initial 1998 recommendation, enabling references to characters across the Unicode range.[18] In XML 1.0, valid code points exclude most control characters (#x1 through #x1F are forbidden both directly and via reference, except for #x9, #xA, and #xD), surrogates (#xD800–#xDFFF), and noncharacters like #xFFFE and #xFFFF, limiting the range to U+0009, U+000A, U+000D, U+0020–U+D7FF, U+E000–U+FFFD, and U+10000–U+10FFFF.[18] XML 1.1, introduced in 2004, widened the set of referenceable characters by permitting numeric references to control characters in the ranges #x1–#x1F and #x7F–#x9F (#x0 remains excluded), while maintaining prohibitions on surrogates and requiring special handling in UTF-16 encodings where surrogates might appear in streams.[19]
XML parsers enforce numeric character references through strict well-formedness validation, expanding valid ones immediately into character data and treating invalid references, such as those to disallowed code points, as fatal errors that halt processing.[17] This rigorous enforcement contrasts with HTML's more permissive approach, which tolerates certain malformed references for practical web authoring.[17] In both XML 1.0 and 1.1, processors must reject documents containing invalid numeric references to ensure conformance to Unicode semantics and document integrity.[19]
Illustrative Examples
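As an illustration of the strict XML handling described in the previous section, the following sketch feeds a few small documents to Python's standard `xml.etree.ElementTree` parser. It assumes the default expat-based parser that ships with CPython, which implements XML 1.0; the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET

# A legal reference is expanded into character data.
assert ET.fromstring("<a>&#931;</a>").text == "Σ"

# Disallowed code points and a missing semicolon are fatal errors.
for doc in ("<a>&#1;</a>", "<a>&#xD800;</a>", "<a>&#65</a>"):
    try:
        ET.fromstring(doc)
    except ET.ParseError as err:
        print(f"{doc!r} rejected: {err}")
    else:
        raise AssertionError(f"{doc!r} was unexpectedly accepted")
```

Unlike the lenient HTML behavior, none of the three malformed documents yields a tree; parsing stops at the offending reference.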
Common ASCII and Latin Characters
Numeric character references (NCRs) provide a standardized way to represent common characters from the ASCII range (Unicode U+0020 to U+007E) and the Latin-1 Supplement (U+0080 to U+00FF), particularly those that are reserved in markup languages or difficult to input on standard keyboards.[20][2] These references are especially useful for symbols like the ampersand (&) and less-than sign (<), which must be escaped in HTML and XML to avoid interpretation as markup delimiters.[20] In practice, NCRs are frequently employed for the characters & (ampersand), < (less-than), > (greater-than), " (quotation mark), and ' (apostrophe) to prevent entity conflicts and ensure document validity.[2]
For the basic ASCII printable characters, NCRs allow direct reference to their Unicode code points. For instance, the ampersand (&, U+0026) is represented as `&#38;` in decimal form, while the less-than sign (<, U+003C) uses `&#60;`.[20] Similarly, the greater-than sign (>, U+003E) is `&#62;`, the quotation mark (", U+0022) is `&#34;`, and the apostrophe (', U+0027) is `&#39;`.[2] Letters like uppercase A (U+0041) can be denoted as `&#65;`, though such usage is rare for alphanumeric characters that are easily typed.[20]
Extending to the Latin-1 Supplement, NCRs facilitate inclusion of accented and symbolic characters common in Western European languages. The copyright symbol (©, U+00A9) is encoded as `&#169;`, and the cent sign (¢, U+00A2) as `&#162;`.[20] For ligatures in legacy contexts, such as the small oe (œ, U+0153 in Unicode, which lies outside Latin-1 and is often mapped from Windows-1252 byte 0x9C), the correct NCR is `&#339;` in decimal, though older systems might reference it via the encoding-specific value `&#156;` for compatibility.[2] These examples highlight how NCRs bridge basic input limitations while adhering to Unicode standards.[20]

| Character | Description | Unicode | Decimal NCR | Hexadecimal NCR |
|---|---|---|---|---|
| & | Ampersand | U+0026 | `&#38;` | `&#x26;` |
| < | Less-than | U+003C | `&#60;` | `&#x3C;` |
| > | Greater-than | U+003E | `&#62;` | `&#x3E;` |
| " | Quotation mark | U+0022 | `&#34;` | `&#x22;` |
| ' | Apostrophe | U+0027 | `&#39;` | `&#x27;` |
| © | Copyright | U+00A9 | `&#169;` | `&#xA9;` |
| ¢ | Cent sign | U+00A2 | `&#162;` | `&#xA2;` |
| œ | Small oe ligature | U+0153 | `&#339;` | `&#x153;` |
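The values in the table can be reproduced from the characters themselves. The sketch below uses a hypothetical helper, `ncr_row`, that omits the description column, and then shows Python's built-in `xmlcharrefreplace` error handler, which emits decimal references for anything the target encoding cannot represent; the sample string is made up for this example.

```python
def ncr_row(ch: str) -> str:
    """Format one row: character, code point, decimal and hexadecimal reference."""
    cp = ord(ch)
    return f"| {ch} | U+{cp:04X} | `&#{cp};` | `&#x{cp:X};` |"


for ch in "&<>\"'©¢œ":
    print(ncr_row(ch))

# Encoding to ASCII with this handler turns non-ASCII text into decimal NCRs.
print("© œuvre".encode("ascii", errors="xmlcharrefreplace"))
# b'&#169; &#339;uvre'
```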