General Punctuation
General Punctuation is a Unicode block that provides a collection of punctuation marks, spacing characters, and format controls designed for universal use across all scripts and writing systems.[1] It spans the code point range from U+2000 to U+206F, encompassing 112 code points of which 111 are assigned characters.[2] This block primarily includes common punctuation elements derived from Latin typography, such as dashes (e.g., U+2010 HYPHEN and U+2014 EM DASH), quotation marks (e.g., U+2018 LEFT SINGLE QUOTATION MARK), and ellipses (e.g., U+2026 HORIZONTAL ELLIPSIS), alongside specialized marks and format controls such as the zero-width space (U+200B).[2] These characters facilitate text organization by separating sentences, phrases, and clauses, while also enabling precise layout adjustments and bidirectional text handling in multilingual contexts.[1] Notable for its versatility, General Punctuation supports semantic functions that vary by language and script, including paired delimiters for nesting structures and invisible controls that influence rendering without visible output.[1] Introduced in early versions of the Unicode Standard and refined over time, it remains essential for digital typography, ensuring consistent punctuation rendering in global computing environments.[3]
Overview
Definition and Scope
The General Punctuation block in the Unicode Standard is defined as the range U+2000–U+206F, encompassing 111 assigned code points out of 112 possible positions, with one unassigned at U+2065.[2] Of these assigned code points, 109 belong to the Common script, while 2 are classified under the Inherited script: the Zero Width Non-Joiner (U+200C) and Zero Width Joiner (U+200D), which take on the script of the surrounding characters to facilitate complex script rendering.[4] This block serves as a foundational collection of characters designed for universal applicability across writing systems, ensuring consistent text processing in multilingual environments.[5]
The scope of the General Punctuation block extends to a diverse array of punctuation marks, spacing characters, invisible operators, and control codes that support typographic formatting and compatibility in global text handling.[2] It includes elements essential for bidirectional text support, such as the Left-to-Right Mark (U+200E) and Right-to-Left Mark (U+200F), invisible characters with strong directionality that influence how adjacent neutral characters are ordered in mixed-script documents, as well as zero-width characters like the Zero Width Space (U+200B), which marks a line-break opportunity without a visible gap.[6] Additionally, it incorporates the explicit directional embedding and override controls (U+202A–U+202E), whose use is discouraged in favor of the newer directional isolates (U+2066–U+2069) because isolates keep their effect from leaking into the surrounding text.[2] These features collectively enable robust handling of text layout, particularly in environments processing Arabic, Hebrew, or other right-to-left scripts alongside left-to-right ones.
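The bidirectional controls mentioned above can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch, not a full implementation of the bidirectional algorithm: the `isolate` helper is a hypothetical name introduced here for illustration, and wrapping a string in FIRST STRONG ISOLATE (U+2068) and POP DIRECTIONAL ISOLATE (U+2069) is assumed as one common way of applying the isolates.

```python
import unicodedata

# Directional marks, two of the legacy embedding/override controls,
# and one isolate pair from the General Punctuation block.
LRM, RLM = "\u200E", "\u200F"
LRE, RLO = "\u202A", "\u202E"
FSI, PDI = "\u2068", "\u2069"


def bidi_class(ch: str) -> str:
    """Return the Bidi_Class short name reported by the local Unicode database."""
    return unicodedata.bidirectional(ch)


def isolate(text: str) -> str:
    """Hypothetical helper: wrap text of unknown direction in an isolate pair."""
    return FSI + text + PDI


for ch in (LRM, RLM, LRE, RLO, FSI, PDI):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {bidi_class(ch)}")

# A Hebrew word embedded in a left-to-right sentence, isolated so that its
# direction does not affect the neighbouring punctuation.
print("Title: " + isolate("\u05E9\u05DC\u05D5\u05DD") + " (draft)")
```

The isolate pair adds two invisible code points around the wrapped text, so any length checks or storage limits should account for them.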
In practice, the block's characters enhance readability in typography by providing flexible spacing options, such as the En Space (U+2002) and Em Space (U+2003), which adjust inter-word gaps proportionally.[2] They also support advanced formatting in software like word processors, where curly quotation marks (e.g., U+2018 left single quotation mark) replace straight quotes for professional typesetting, automatically converting typed inputs to contextually appropriate variants.[5] This versatility ensures seamless integration in digital documents, from web content to printed materials, without script-specific dependencies.[2]
Unicode Allocation and Properties
The General Punctuation block in the Unicode Standard spans the code point range U+2000 to U+206F, comprising 112 positions in total. Of these, 111 are assigned to characters, while one remains reserved.[2][7] Most assigned characters in this block carry the Common script property (Zyyy); two, the Zero Width Non-Joiner (U+200C) and Zero Width Joiner (U+200D), are classified as Inherited, which lets them be used across multiple writing systems without script-specific affiliation.
In terms of general category, the block includes 13 space separators (Zs), such as the en space (U+2002) and narrow no-break space (U+202F); 48 other punctuation marks (Po), including the bullet (U+2022), horizontal ellipsis (U+2026), and prime (U+2032); 25 format controls (Cf), like the zero-width non-joiner (U+200C); 6 dash punctuation marks (Pd), U+2010 through U+2015 (e.g., em dash U+2014); and 2 math symbols (Sm), the fraction slash (U+2044) and commercial minus sign (U+2052). The remaining assigned characters are initial, final, open, close, and connector punctuation (Pi, Pf, Ps, Pe, Pc), plus the line separator (U+2028, Zl) and paragraph separator (U+2029, Zp).[8][9]
Unicode properties for these characters support diverse text processing needs. The East Asian Width property classifies the block's characters as either neutral (e.g., thin space U+2009) or ambiguous (e.g., em dash U+2014 and quotation marks like U+2018); ambiguous characters may be rendered narrow or wide depending on context, which matters for layout in East Asian typography. Bidirectional properties include the left-to-right mark (U+200E) with Bidi_Class L (Left-to-Right), used to influence text directionality. Line Break classes vary: the line separator (U+2028) and paragraph separator (U+2029) force mandatory breaks (BK), most fixed-width spaces such as the hair space (U+200A) allow a break after them (BA), and the horizontal ellipsis (U+2026) is classed as inseparable (IN), all of which influence word wrapping and paragraph formatting.[10][8]
The reserved code point is U+2065, which currently has no character assigned and is held for potential future assignment.
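The allocation and category tally for the block can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch: the names `BLOCK` and `category_counts` are introduced here for illustration, and the results reflect whichever Unicode version the local Python build ships, so they are an assumption about the environment rather than a normative statement.

```python
import unicodedata
from collections import Counter

# All 112 positions of the block: U+2000 through U+206F inclusive.
BLOCK = range(0x2000, 0x2070)


def category_counts() -> Counter:
    """Tally General_Category values across the block ('Cn' means unassigned)."""
    return Counter(unicodedata.category(chr(cp)) for cp in BLOCK)


counts = category_counts()
unassigned = [f"U+{cp:04X}" for cp in BLOCK
              if unicodedata.category(chr(cp)) == "Cn"]

for cat, n in sorted(counts.items()):
    print(cat, n)
print("unassigned:", unassigned)

# East_Asian_Width for a few sample characters from the block.
for ch in ("\u2009", "\u2014", "\u2018"):
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))
```

The standard library does not expose the Script or Line_Break properties, so checking those would require the Unicode Character Database files or a third-party package.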
The six code points U+206A through U+206F, namely inhibit symmetric swapping (U+206A), activate symmetric swapping (U+206B), inhibit Arabic form shaping (U+206C), activate Arabic form shaping (U+206D), national digit shapes (U+206E), and nominal digit shapes (U+206F), have been deprecated since Unicode 3.0 due to their obsolescence in modern text rendering.[2][11]
Character Categories
Whitespace and Separator Characters
The General Punctuation block (U+2000–U+206F) in the Unicode Standard includes a variety of whitespace characters designed primarily for typographic spacing and layout control, distinct from the basic space (U+0020). These characters provide precise widths relative to the em (a unit equal to the font's nominal type size), enabling fine adjustments in text composition. They are classified under the Separator, Space (Zs) general category, which identifies them as space characters; whether a line may break at one of them is governed separately by the Line_Break property.[2]
The block features several fixed-width spaces, each with defined proportions for use in professional typesetting. For example, the en quad (U+2000) and en space (U+2002) both measure half an em, while the em quad (U+2001) and em space (U+2003) span a full em, equivalent to the font's type size in points. Narrower variants include the three-per-em space (U+2004) at one-third em, four-per-em space (U+2005) at one-quarter em, and six-per-em space (U+2006) at one-sixth em. The figure space (U+2007) has the width of a tabular numeral, typically matching the zero digit, and the punctuation space (U+2008) approximates the width of narrow punctuation like a period. The thin space (U+2009) is usually one-fifth to one-sixth em, the hair space (U+200A) is thinner still at roughly one-tenth to one-sixteenth em, the narrow no-break space (U+202F) typically matches the thin space but prevents line breaks, and the medium mathematical space (U+205F) measures four-eighteenths of an em. These nominal metrics give fonts a common target, though the actual widths are set by each font's design.[2][1]

| Code | Name | Width Description | Notes |
|---|---|---|---|
| U+2000 | En Quad | Half em | Equivalent to en space |
| U+2001 | Em Quad | Full em (type size) | Equivalent to em space |
| U+2002 | En Space | Half em | Also called "nut" |
| U+2003 | Em Space | Full em | Also called "mutton" |
| U+2004 | Three-per-em Space | One-third em | Thick space variant |
| U+2005 | Four-per-em Space | One-quarter em | Mid space variant |
| U+2006 | Six-per-em Space | One-sixth em | Often like thin space |
| U+2007 | Figure Space | Tabular numeral width (e.g., zero) | For aligning figures |
| U+2008 | Punctuation Space | Narrow punctuation width (e.g., period) | Follows punctuation |
| U+2009 | Thin Space | One-fifth to one-sixth em | Narrow space |
| U+200A | Hair Space | One-tenth to one-sixteenth em | Thinnest traditional space |
| U+202F | Narrow No-break Space | Same as thin space | Prevents line breaks |
| U+205F | Medium Mathematical Space | Four-eighteenths em | For math spacing |
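The equivalences noted in the table can be observed through Unicode normalization. The sketch below uses Python's standard `unicodedata` module: it collects the block's space separators, prints their decomposition mappings, and folds them to the basic space. `BLOCK_SPACES` and `fold_spaces` are hypothetical names introduced for illustration, and folding is only one possible policy, chosen here as an assumption about what a text-cleaning step might want.

```python
import unicodedata

# Space separators (Zs) found in U+2000..U+206F, per the local Unicode database.
BLOCK_SPACES = [chr(cp) for cp in range(0x2000, 0x2070)
                if unicodedata.category(chr(cp)) == "Zs"]


def fold_spaces(text: str) -> str:
    """Hypothetical helper: replace every General Punctuation space with U+0020."""
    return "".join(" " if ch in BLOCK_SPACES else ch for ch in text)


for ch in BLOCK_SPACES:
    name = unicodedata.name(ch)
    decomp = unicodedata.decomposition(ch)
    print(f"U+{ord(ch):04X} {name:<26} decomposition={decomp!r}")

# The quads are canonically equivalent to the corresponding spaces,
# so even NFC rewrites them.
print(unicodedata.normalize("NFC", "\u2000") == "\u2002")
print(unicodedata.normalize("NFC", "\u2001") == "\u2003")

# Compatibility normalization maps each of these spaces to U+0020.
print(all(unicodedata.normalize("NFKC", ch) == " " for ch in BLOCK_SPACES))
```

Folding discards both the width and the no-break behavior of characters such as the figure space and narrow no-break space, so it suits search or comparison better than typeset output.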