
General Punctuation

General Punctuation is a Unicode block that provides punctuation marks, spacing characters, and format controls intended for use across scripts and writing systems. It spans U+2000 to U+206F, a range of 112 code points of which 111 are assigned. The block primarily contains common punctuation derived from Latin typography, such as dashes (e.g., U+2010 HYPHEN and U+2014 EM DASH), quotation marks (e.g., U+2018 LEFT SINGLE QUOTATION MARK), and ellipses (e.g., U+2026 HORIZONTAL ELLIPSIS), alongside specialized marks and format controls such as the zero width space (U+200B). These characters help organize text by separating sentences, phrases, and clauses, and they also enable precise layout adjustments and bidirectional text handling in multilingual contexts. Their conventions vary by language and context: the block includes paired delimiters for nested structures as well as invisible controls that influence rendering without producing visible output. Introduced in early versions of the Unicode Standard and refined over time, the block remains essential for digital typography and for consistent rendering across computing environments.

Overview

Definition and Scope

The General Punctuation block in the Unicode Standard is defined as the range U+2000–U+206F. It contains 111 assigned code points out of 112 positions, with the single unassigned position at U+2065. Of the assigned code points, 109 have the Script property value Common, while 2 are Inherited: the zero width non-joiner (U+200C) and the zero width joiner (U+200D), which take their script from surrounding characters so that they can be used inside complex-script text. The block serves as a collection of characters intended for use across writing systems, supporting consistent text processing in multilingual environments.

The block covers punctuation marks, spacing characters, invisible operators, and format control codes that support typographic formatting and compatibility in text handling. It includes characters essential for bidirectional text support, such as the left-to-right mark (U+200E) and right-to-left mark (U+200F), which influence how adjacent neutral characters are ordered in mixed-script documents, as well as zero-width characters like the zero width space (U+200B), which marks a line break opportunity without a visible gap. It also contains the older directional embedding and override controls (U+202A–U+202E). These are not formally deprecated, but their use is discouraged in favor of the directional isolates (U+2066–U+2069), which are less prone to unintended side effects. Together these features enable robust layout of Arabic, Hebrew, and other right-to-left scripts alongside left-to-right ones.

In practice, the block's characters improve readability in typography by providing spacing options such as the en space (U+2002) and em space (U+2003), whose widths are defined relative to the font size. They also support formatting in software such as word processors, where curly quotation marks (e.g., U+2018 LEFT SINGLE QUOTATION MARK) replace straight quotes for professional typesetting, often by automatically converting typed input to the contextually appropriate form. Because the characters carry no script-specific dependencies, they can be used in both digital and printed documents.
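The block boundaries and the single unassigned position can be checked with a short script. This is a minimal sketch using Python's standard `unicodedata` module; the helper names `in_general_punctuation` and `unassigned_code_points` are illustrative, and the result depends on the Unicode version bundled with the interpreter (the block has been unchanged since Unicode 6.3).

```python
import unicodedata

BLOCK_START, BLOCK_END = 0x2000, 0x206F  # General Punctuation range


def in_general_punctuation(ch):
    """Return True if the single character ch lies in U+2000..U+206F."""
    return BLOCK_START <= ord(ch) <= BLOCK_END


def unassigned_code_points():
    """List code points in the block whose general category is Cn (unassigned)."""
    return [
        f"U+{cp:04X}"
        for cp in range(BLOCK_START, BLOCK_END + 1)
        if unicodedata.category(chr(cp)) == "Cn"
    ]


total = BLOCK_END - BLOCK_START + 1
missing = unassigned_code_points()
print(total, total - len(missing), missing)
```

The Script property (Common versus Inherited) is not exposed by `unicodedata`, so this sketch only covers block membership and assignment status.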

Unicode Allocation and Properties

The General Punctuation block in the Unicode Standard spans the code point range U+2000 to U+206F, comprising 112 positions in total. Of these, 111 are assigned to characters and one remains reserved. Most assigned characters have the Script property value Common (Zyyy); two are Inherited, the zero width non-joiner (U+200C) and the zero width joiner (U+200D), so that they can be used across writing systems without a script-specific affiliation.

By general category, the block includes 13 space separators (Zs), such as the en space (U+2002) and narrow no-break space (U+202F); 48 other punctuation marks (Po), including the bullet (U+2022), horizontal ellipsis (U+2026), and prime (U+2032); 25 format controls (Cf), such as the zero width non-joiner (U+200C); 6 dash punctuation marks (Pd), U+2010 through U+2015 (e.g., em dash U+2014); and 2 math symbols (Sm), the fraction slash (U+2044) and commercial minus sign (U+2052). The remainder consists of the line and paragraph separators (Zl, Zp) and characters in the initial, final, open, close, and connector punctuation categories (Pi, Pf, Ps, Pe, Pc).

Unicode properties for these characters support diverse text processing needs. For East Asian Width, most of the block is Neutral (N), for example the thin space (U+2009), while a number of characters are Ambiguous (A), for example the em dash (U+2014), the horizontal ellipsis (U+2026), and quotation marks such as U+2018; ambiguous characters may be rendered narrow or wide depending on context in East Asian typography. For bidirectional behavior, the left-to-right mark (U+200E) has Bidi_Class L (Left-to-Right), which lets it act as a strong directional character. Line Break classes vary: the line separator (U+2028) and paragraph separator (U+2029) cause mandatory breaks (BK), the hair space (U+200A) permits a break after it (BA), and the horizontal ellipsis (U+2026) belongs to the inseparable class (IN). These classes influence word wrapping and paragraph formatting.

The reserved code point is U+2065, held for potential future assignment. Additionally, the six code points U+206A through U+206F, namely inhibit symmetric swapping (U+206A), activate symmetric swapping (U+206B), inhibit Arabic form shaping (U+206C), activate Arabic form shaping (U+206D), national digit shapes (U+206E), and nominal digit shapes (U+206F), have been deprecated since Unicode 3.0 because they are obsolete in modern text rendering.
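The category distribution and a few of the properties above can be inspected programmatically. The following is a sketch using Python's standard `unicodedata` module, which exposes General_Category, Bidi_Class, and East_Asian_Width but not Line_Break or Script; the `describe` helper is an illustrative name, and exact tallies depend on the Unicode data shipped with the interpreter.

```python
import unicodedata
from collections import Counter

# All code points in the block, then only those that are assigned.
block = [chr(cp) for cp in range(0x2000, 0x2070)]
assigned = [ch for ch in block if unicodedata.category(ch) != "Cn"]

# Tally General_Category values across the assigned characters.
category_counts = Counter(unicodedata.category(ch) for ch in assigned)
for category, count in sorted(category_counts.items()):
    print(category, count)


def describe(ch):
    """Return (code point, name, category, Bidi_Class, East_Asian_Width)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.east_asian_width(ch),
    )


for ch in ("\u2009", "\u200e", "\u2014", "\u2018"):
    print(describe(ch))
```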

Character Categories

Whitespace and Separator Characters

The General Punctuation block (U+2000–U+206F) includes a variety of whitespace characters designed primarily for typographic spacing and layout control, distinct from the basic space (U+0020). These characters have widths defined relative to the em (a unit equal to the nominal size of the current font), enabling fine adjustments in text composition. They are classified under the Separator, Space (Zs) general category, which identifies them as horizontal spacing characters; whether a line may break at them is governed separately by each character's Line Break class.

The block features several fixed-width spaces, each with a defined proportion for use in professional typesetting. The en quad (U+2000) and en space (U+2002) both measure half an em, while the em quad (U+2001) and em space (U+2003) span a full em, equal to the type size in points. Narrower variants include the three-per-em space (U+2004) at one-third em, the four-per-em space (U+2005) at one-quarter em, and the six-per-em space (U+2006) at one-sixth em. The figure space (U+2007) has the width of a tabular numeral, typically the digit zero, and the punctuation space (U+2008) approximates the width of narrow punctuation such as a period. The thin space (U+2009) is usually one-fifth to one-sixth em, and the hair space (U+200A) is thinner still at one-tenth to one-sixteenth em. The narrow no-break space (U+202F) is about as wide as the thin space but prevents line breaks, and the medium mathematical space (U+205F) measures four-eighteenths of an em. These are nominal metrics; actual widths depend on the font design.
| Code | Name | Width Description | Notes |
|---|---|---|---|
| U+2000 | En Quad | Half em | Equivalent to en space |
| U+2001 | Em Quad | Full em (type size) | Equivalent to em space |
| U+2002 | En Space | Half em | Also called "nut" |
| U+2003 | Em Space | Full em | Also called "mutton" |
| U+2004 | Three-per-em Space | One-third em | Thick space variant |
| U+2005 | Four-per-em Space | One-quarter em | Mid space variant |
| U+2006 | Six-per-em Space | One-sixth em | Often like thin space |
| U+2007 | Figure Space | Tabular numeral width (e.g., zero) | For aligning figures |
| U+2008 | Punctuation Space | Narrow punctuation width (e.g., period) | Follows punctuation |
| U+2009 | Thin Space | One-fifth to one-sixth em | Narrow space |
| U+200A | Hair Space | One-tenth to one-sixteenth em | Thinnest traditional space |
| U+202F | Narrow No-break Space | Same as thin space | Prevents line breaks |
| U+205F | Medium Mathematical Space | Four-eighteenths em | For math spacing |
Complementing these are zero-width and structural separators that influence document flow without visible spacing. The zero width space (U+200B) is invisible and has no width; it indicates a potential line break opportunity and can separate words in scripts written without spaces, such as Thai. It is a format character (Cf) rather than a space separator. The line separator (U+2028) forces a line break within a paragraph, akin to a soft return, while the paragraph separator (U+2029) ends the paragraph, so that paragraph-level formatting such as indentation or spacing applies to what follows. These two characters are the sole members of the Separator, Line (Zl) and Separator, Paragraph (Zp) categories, respectively, and they allow line and paragraph structure to be expressed unambiguously in plain text.

In applications, these characters support fine typography. Fixed-width spaces such as the thin space (U+2009) give precise, predictable gaps, for example between a number and its unit or inside nested quotation marks. The zero width space (U+200B) allows controlled breaks inside long words or phrases, while no-break variants such as the narrow no-break space (U+202F) keep short sequences together on one line. In East Asian typography, where text often lacks explicit word spaces, hair spaces (U+200A) and thin spaces provide subtle adjustments around ideographs in Chinese, Japanese, and Korean text.
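How software treats these characters can differ from how they look. The sketch below shows how Python's built-in string methods and `unicodedata.normalize` handle a few of them; `collapse_unicode_spaces` is a hypothetical helper written for this illustration, not a standard function.

```python
import unicodedata

EN_QUAD, EN_SPACE, EM_SPACE = "\u2000", "\u2002", "\u2003"
ZWSP, LINE_SEP, PARA_SEP = "\u200b", "\u2028", "\u2029"


def collapse_unicode_spaces(text):
    """Replace every Zs character with U+0020 and drop zero width spaces."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Zs":
            out.append(" ")
        elif ch == ZWSP:
            continue  # invisible break opportunity: remove entirely
        else:
            out.append(ch)
    return "".join(out)


# Zs characters count as whitespace; the zero width space (Cf) does not.
print(EM_SPACE.isspace(), ZWSP.isspace())

# str.splitlines() treats U+2028 and U+2029 as line boundaries.
print(f"a{LINE_SEP}b{PARA_SEP}c".splitlines())

# EN QUAD is canonically equivalent to EN SPACE, so NFC replaces it.
print(unicodedata.normalize("NFC", EN_QUAD) == EN_SPACE)

print(collapse_unicode_spaces("1\u20072\u200b3\u20094"))
```

Collapsing to U+0020 is convenient for search or comparison, but it discards width and no-break information, so it is unsuitable for text that will be typeset.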

Dashes, Hyphens, and Horizontal Bars

Dashes, hyphens, and horizontal bars form a category of linear marks used to indicate interruptions, ranges, compounds, and separations in text. These characters, encoded in the General Punctuation block (U+2000–U+206F), provide typographic precision beyond the legacy hyphen-minus (U+002D), which is used ambiguously for several purposes in plain ASCII text.

The hyphen (U+2010) is the unambiguous word divider in compound terms such as well‐known, giving semantic clarity where the hyphen-minus could be read as either a hyphen or a minus sign. The non-breaking hyphen (U+2011) is identical to the hyphen in appearance and meaning but prevents a line break at its position, which makes it useful for compounds that should stay on one line. It is intended for displayed text only: identifiers such as URLs must keep the hyphen-minus, because substituting U+2011 changes the address. The figure dash (U+2012) is designed to match the width of digits, which helps alignment in tabular data and numeric strings such as telephone numbers.

The en dash (U+2013), nominally half an em wide, denotes ranges or connections, such as pages 10–20 or a route between two places. It should be distinguished from the mathematical minus sign (U+2212), which has different spacing properties, although the two are easily confused in casual use; ranges in measurements such as 5–10 km take the en dash. The em dash (U+2014), a full em wide, indicates parenthetical breaks or abrupt interruptions, as in She paused—suddenly realizing her mistake—to continue. In common US English conventions no spaces surround the em dash, though some styles add thin spaces around it.

The horizontal bar (U+2015), also known as the quotation dash, is used in some typographic traditions to introduce quoted dialogue, as in: ―What do you mean? Compatibility with the hyphen-minus remains a key consideration. U+002D is ubiquitous in legacy systems and web URLs (e.g., example.com/well-known), but using it for every dash-like purpose can obscure the intended meaning, so modern typesetting favors the specific Unicode dashes for semantic accuracy and better rendering in international contexts.
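The split between typographic dashes for display and the hyphen-minus for identifiers can be sketched in code. In this illustration the helper names `fold_dashes` and `page_range` are hypothetical; the first maps the six dash characters U+2010 through U+2015 back to U+002D (useful before searching or building a URL slug), and the second formats a numeric range with an en dash.

```python
import unicodedata

HYPHEN_MINUS = "-"   # U+002D
EN_DASH = "\u2013"


def fold_dashes(text):
    """Map U+2010..U+2015 (all general category Pd) to U+002D."""
    return "".join(
        HYPHEN_MINUS if "\u2010" <= ch <= "\u2015" else ch
        for ch in text
    )


def page_range(first, last):
    """Format a numeric range with an en dash, e.g. for page references."""
    return f"{first}{EN_DASH}{last}"


# Every character in the folded range is dash punctuation.
print(all(unicodedata.category(chr(cp)) == "Pd" for cp in range(0x2010, 0x2016)))

print(page_range(10, 20))
print(fold_dashes("well\u2010known pages 10\u201320 \u2014 done"))
```

Folding is lossy, so it belongs at a comparison or export boundary rather than in the stored text.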

Quotation Marks and Apostrophes

Quotation marks and apostrophes in the General Punctuation block (U+2000–U+206F) are typographic characters used to denote dialogue, citations, possession, and other linguistic functions, with curved forms preferred in modern typesetting over the straight ASCII variants. They are the left single quotation mark (U+2018 ‘), right single quotation mark (U+2019 ’), single low-9 quotation mark (U+201A ‚), single high-reversed-9 quotation mark (U+201B ‛), left double quotation mark (U+201C “), right double quotation mark (U+201D ”), double low-9 quotation mark (U+201E „), and double high-reversed-9 quotation mark (U+201F ‟). The left single and double forms (U+2018 and U+201C) resemble turned commas and serve as opening marks in many conventions, while their right counterparts (U+2019 and U+201D) serve as closing marks. The low-9 variants (U+201A and U+201E) sit on the baseline like commas and open quotations in languages such as German and Czech. The high-reversed-9 forms (U+201B and U+201F) are glyph variants with the same semantics as the left marks but a reversed shape. Usage varies by language, and the character names do not prescribe a specific role in every script.

In typographic practice, these curved characters (U+2018–U+201F) are distinguished from the straight, neutral forms U+0022 (") for double quotes and U+0027 (') for single quotes or apostrophes, which originated in ASCII for typewriter compatibility and remain common in plain text and programming contexts. Typographic quotes aid readability because opening and closing marks have different shapes, whereas straight quotes are identical at both ends and are often considered less suitable for printed material. Language-specific styles add further variety; French typography, for instance, favors angled guillemets (U+00AB « and U+00BB ») for primary quotations, with non-breaking spaces inside, and uses the curved marks for nested quotations.

The right single quotation mark (U+2019 ’) doubles as the preferred apostrophe for punctuation, replacing the straight U+0027 in typographic contexts. It is distinct from the modifier letter apostrophe (U+02BC ʼ), which is treated as a letter. The apostrophe indicates possession, as in the dog’s bone for singular ownership or the dogs’ bones for plural, and forms contractions by marking omitted letters, as in don’t for do not. Straight quotes are also typed informally in measurements such as 5'10" for five feet ten inches, though careful usage employs the dedicated primes (U+2032 ′ and U+2033 ″).

International variations differ in how opening and closing marks are paired. English uses top-aligned left and right pairs (“ ” or ‘ ’), while German and some other styles open with the baseline low-9 forms and close with the left marks („ ... “ or ‚ ... ‘). Nesting rules alternate the quote type: in American English, double marks enclose the primary quotation with singles inside (She said, “He replied, ‘Yes.’”), whereas British English commonly reverses this, with singles outside and doubles inside. These conventions keep embedded speech clear; the placement of other punctuation inside or outside the quotation marks varies by style guide.
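Automatic conversion of straight quotes to curly ones is usually driven by context. The following is a deliberately simple sketch of one such heuristic, not the algorithm of any particular word processor: a quote is treated as opening when it follows whitespace, an opening bracket, or another opening quote, and as closing otherwise. The function name `curl_quotes` is hypothetical.

```python
LSQUO, RSQUO = "\u2018", "\u2019"   # left / right single quotation marks
LDQUO, RDQUO = "\u201c", "\u201d"   # left / right double quotation marks


def curl_quotes(text):
    """Convert ASCII quotes to curly quotes using the preceding character."""
    out = []
    for ch in text:
        if ch not in ('"', "'"):
            out.append(ch)
            continue
        # Treat the start of the text like whitespace.
        prev = out[-1] if out else " "
        opening = prev.isspace() or prev in "([{" + LSQUO + LDQUO
        if ch == '"':
            out.append(LDQUO if opening else RDQUO)
        else:
            # Apostrophes inside words fall through to RSQUO (U+2019).
            out.append(LSQUO if opening else RSQUO)
    return "".join(out)


print(curl_quotes("She said, \"He replied, 'Yes.'\""))
print(curl_quotes("don't"))
```

The heuristic mishandles leading apostrophes (as in abbreviated years) and does not implement language-specific pairs such as the German low-9 opening marks; both would need extra rules.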

Other Visible Punctuation Symbols

The General Punctuation block includes a variety of visible symbols used for typographic emphasis, notation, and formatting beyond standard sentence punctuation. These characters support list marking, omissions, measurements, and specialized rhetorical expressions, improving readability and structure in multilingual texts.

Bullets and leaders provide visual cues for lists and alignment. The bullet (U+2022 •) marks list items or emphasizes key points. The triangular bullet (U+2023 ‣) offers an alternative shape for the same purpose. The one dot leader (U+2024 ․) and two dot leader (U+2025 ‥) are dot characters used in tables or indexes to fill the line between entries.

The ellipsis and related symbols indicate omissions or breaks. The horizontal ellipsis (U+2026 …) represents three dots for textual truncation or continuation, and is often preferred in professional typesetting over three separate periods (...) because of its consistent spacing and explicit meaning. The hyphenation point (U+2027 ‧) marks potential word breaks, as in dictionaries (e.g., dic‧tion‧ar‧ies), giving a visible cue for syllabification without implying a hard hyphen.

Primes and per-signs denote measurements and ratios. The prime (U+2032 ′), double prime (U+2033 ″), and triple prime (U+2034 ‴) stand for units such as feet and inches, minutes and seconds, or finer divisions in angular or linear notation. The per mille sign (U+2030 ‰) indicates parts per thousand (0.1%) and is common in statistics and in measures such as alcohol concentration. The per ten thousand sign (U+2031 ‱), or permyriad, indicates parts per 10,000 and is rarely used.

Daggers function as reference markers. The dagger (U+2020 †) and double dagger (U+2021 ‡) mark footnotes or endnotes, typically following the asterisk (*) in sequence when a page has several notes. The fraction slash (U+2044 ⁄) enables inline fractions such as 1⁄2; compared with the ordinary solidus (/), it signals that the surrounding digits should be composed as a fraction.

Combined marks express emphasis or doubt. The double exclamation mark (U+203C ‼) conveys intense surprise or urgency, and the double question mark (U+2047 ⁇) signals heightened doubt. The question exclamation mark (U+2048 ⁈) and exclamation question mark (U+2049 ⁉) blend interrogation and exclamation for questions such as "What?!" in informal writing, a role also served by the interrobang (U+203D ‽), which was introduced in the 1960s. The commercial minus sign (U+2052 ⁒) appears in some commercial and financial contexts. The inverted marks ¡ (U+00A1) and ¿ (U+00BF), which open Spanish sentences, are related in function but are encoded in the Latin-1 Supplement block rather than in General Punctuation.
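Several of these symbols carry compatibility decompositions into sequences of simpler characters, which matters when text is normalized for search or for ASCII-only output. The sketch below uses `unicodedata.normalize` with form NFKC; the wrapper name `compatibility_fold` is illustrative. Characters without such a decomposition, for example the per mille sign and the fraction slash, pass through unchanged.

```python
import unicodedata


def compatibility_fold(text):
    """Apply NFKC, which expands compatibility characters such as U+2026."""
    return unicodedata.normalize("NFKC", text)


samples = {
    "\u2026": "HORIZONTAL ELLIPSIS",
    "\u2025": "TWO DOT LEADER",
    "\u203c": "DOUBLE EXCLAMATION MARK",
    "\u2049": "EXCLAMATION QUESTION MARK",
    "\u2033": "DOUBLE PRIME",
    "\u2030": "PER MILLE SIGN",
}

for ch, name in samples.items():
    folded = compatibility_fold(ch)
    # Show the resulting code points so invisible differences are obvious.
    print(name, [f"U+{ord(c):04X}" for c in folded])

print(compatibility_fold("Wait\u2026 what\u2049"))
```

NFKC also rewrites many characters outside this block (ligatures, fullwidth forms, superscripts), so it should be applied deliberately rather than as a general cleanup step.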

Invisible Format and Control Characters

Invisible format and control characters are non-printing characters that influence text rendering, layout, and interpretation without producing visible output. Many of them are allocated in the General Punctuation block (U+2000–U+206F), where they serve critical roles in handling complex scripts, bidirectional text, and mathematical expressions. They control joining behavior in cursive or conjunct-forming scripts, set directional context in mixed-language documents, and provide invisible operators that make the meaning of formulas explicit. Unlike visible punctuation, these controls exist mainly for rendering engines and text-processing software.

Join control characters manage glyph composition in scripts such as Arabic and the Indic scripts, where letters may connect or form ligatures depending on context. The zero width non-joiner (U+200C, ZWNJ) inhibits joining, allowing a ligature or cursive connection to be broken; in Persian, for example, inserting ZWNJ between two letters prevents them from linking, which keeps the parts of a compound term visually distinct without inserting a space. Conversely, the zero width joiner (U+200D, ZWJ) requests joining, which is used to build emoji sequences and to request particular conjunct forms in Indic scripts. Both characters have zero width and are invisible.

Bidirectional marks address the rendering of text that mixes left-to-right (LTR) and right-to-left (RTL) scripts, such as English embedded in Hebrew. The left-to-right mark (U+200E, LRM) and right-to-left mark (U+200F, RLM) act as invisible strong directional characters, so that neighboring neutral characters resolve to the intended direction; for instance, an LRM placed after trailing punctuation keeps a period on the correct side of RTL text in an LTR context. More advanced bidirectional controls include the embedding and override characters: left-to-right embedding (U+202A, LRE) and right-to-left embedding (U+202B, RLE) open nested directional contexts, which pop directional formatting (U+202C, PDF) terminates, while left-to-right override (U+202D, LRO) and right-to-left override (U+202E, RLO) impose a strict direction that overrides the characters' inherent properties. These controls matter for mixed text such as "English: שלום (hello) world," where they prevent unwanted reordering of segments.

Introduced in Unicode 6.3, the isolate controls provide safer alternatives to the embeddings by confining directional influence to the isolated run, which reduces errors in complex layouts. The left-to-right isolate (U+2066, LRI), right-to-left isolate (U+2067, RLI), and first strong isolate (U+2068, FSI) each begin an isolated segment, with FSI taking its direction from the first strong character it contains; the pop directional isolate (U+2069, PDI) ends the segment. In mixed Hebrew-English text such as a technical report, LRI can isolate an English equation within RTL prose so that its numbers and operators display left to right without affecting the surrounding Hebrew.

Invisible operators support mathematical notation by marking operations that are implied rather than written. The word joiner (U+2060, WJ), which is not a mathematical operator, prevents a line break at its position; it took over the word-joining role formerly assigned to U+FEFF ZERO WIDTH NO-BREAK SPACE. In mathematics, function application (U+2061) marks the application of a function to its argument, as between f and (x); invisible times (U+2062) marks implied multiplication, as in 2π or ab; invisible separator (U+2063) separates adjacent indices, as between i and j in a subscript written ij; and invisible plus (U+2064) marks implied addition, as between 3 and ¼ in the mixed number 3¼. These characters are particularly useful to parsers of plain-text mathematical input.

Certain format characters have been deprecated. The range U+206A through U+206F, covering symmetric swapping, Arabic form shaping, and digit shape selection, has been deprecated since Unicode 3.0 and is no longer recommended. These characters are unrelated to the isolate controls (U+2066–U+2069), which replace the embedding controls for bidirectional text rather than the deprecated range. Separately, directional overrides are a known vector for visual spoofing, which is one reason isolates are preferred. Implementations should treat the deprecated characters as ignorable for compatibility.
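A common defensive pattern is to wrap untrusted or mixed-direction fragments in an isolate and to flag stray embedding or override controls. The following is a minimal sketch of that pattern; the helper names (`isolate`, `find_bidi_controls`, `strip_bidi_controls`) are hypothetical, and a real application would decide per context whether to strip, escape, or merely report such characters.

```python
# Directional formatting characters from the General Punctuation block.
BIDI_CONTROLS = {
    "\u200e": "LRM", "\u200f": "RLM",
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}
FSI, PDI = "\u2068", "\u2069"


def isolate(fragment):
    """Wrap a fragment so its direction cannot affect surrounding text."""
    return f"{FSI}{fragment}{PDI}"


def find_bidi_controls(text):
    """Return (index, abbreviation) for each directional control found."""
    return [(i, BIDI_CONTROLS[ch]) for i, ch in enumerate(text) if ch in BIDI_CONTROLS]


def strip_bidi_controls(text):
    """Remove all directional controls, e.g. before comparing identifiers."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


suspicious = "user\u202etxt.exe"
print(find_bidi_controls(suspicious))
print(strip_bidi_controls(suspicious))
print(find_bidi_controls(isolate("\u05e9\u05dc\u05d5\u05dd")))
```

Stripping is appropriate for identifiers and file names; in running prose the controls may be legitimate and are usually better preserved.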

Advanced Features

Variation Selectors

Variation selectors are invisible characters that follow a base character and request a particular glyph variant for it, giving control over typographic variants without requiring separate code points. For characters in the General Punctuation block, the relevant selectors come from the Variation Selectors block (U+FE00–U+FE0F), which contains sixteen selectors (VS1 through VS16). A variation sequence has a defined effect only if it is listed in the standardized data; other combinations of base character and selector are ignored.

For punctuation, the standardized sequences concern quotation marks used alongside East Asian text, where a mark may need either a proportional or a fullwidth form. The left double quotation mark (U+201C) followed by VS1 (U+FE00) selects a non-fullwidth form suited to Western-style text. Followed by VS2 (U+FE01), it selects a fullwidth form justified to one side of the em box, conforming to East Asian conventions in which punctuation occupies the same cell as an ideograph. VS3 (U+FE02) specifies a Sibe form of the same character for use with Sibe text. Corresponding sequences are described for the right double quotation mark (U+201D), left single quotation mark (U+2018), and right single quotation mark (U+2019), allowing a choice between non-fullwidth forms (VS1), justified fullwidth forms (VS2), and Sibe forms (VS3).

The Variation Selectors Supplement (U+E0100–U+E01EF) provides VS17 through VS256, a further 240 selectors used through the Unicode Ideographic Variation Database. These are intended for CJK ideographs, and no standardized sequences apply them to characters in the General Punctuation block.

Emoji Presentation and Variants

Within the General Punctuation block, two characters are designated as emoji: the double exclamation mark at U+203C (‼) and the exclamation question mark at U+2049 (⁉). Emoji presentation for these symbols was first enabled through variation sequences in Unicode 6.1, released in January 2012, and both were included in the Emoji 1.0 specification published in August 2015. Both characters default to text presentation, rendering as monochrome glyphs in ordinary text, to stay compatible with their traditional use as punctuation.

Presentation style is specified with variation selectors as defined in Unicode Technical Standard #51. Variation selector-16 (VS16, U+FE0F) requests emoji presentation, giving a colorful, stylized glyph; for example, the sequence U+203C U+FE0F produces ‼️. Conversely, variation selector-15 (VS15, U+FE0E) requests text presentation, as in U+203C U+FE0E (‼︎) or U+2049 U+FE0E (⁉︎). These sequences are listed in the emoji-variation-sequences.txt data file, which ensures consistent interpretation across platforms that support Unicode emoji. Skin tone modifiers, which apply to emoji depicting people, do not apply to these symbols.

In practice, the emoji variants of U+203C and U+2049 are common in messaging applications and social media to convey emphasis, surprise, or rhetorical intensity, for example excitement (‼️) or an interrobang-like query (⁉️), while the text variants preserve the characters' role in formal writing.
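Building or removing these sequences amounts to appending or deleting a single selector after the base character. The sketch below illustrates this for the two emoji-capable characters in the block; the helper names `with_presentation` and `strip_presentation` are hypothetical, and whether a colored glyph actually appears depends on the fonts and renderer in use.

```python
VS15 = "\ufe0e"  # VARIATION SELECTOR-15: request text presentation
VS16 = "\ufe0f"  # VARIATION SELECTOR-16: request emoji presentation

# The two General Punctuation characters with emoji variation sequences.
EMOJI_CAPABLE = {"\u203c", "\u2049"}


def with_presentation(ch, emoji):
    """Return ch followed by VS16 (emoji) or VS15 (text)."""
    if ch not in EMOJI_CAPABLE:
        raise ValueError(f"U+{ord(ch):04X} has no emoji variation sequence in this block")
    return ch + (VS16 if emoji else VS15)


def strip_presentation(text):
    """Remove VS15 and VS16 so the bare base characters remain."""
    return text.replace(VS15, "").replace(VS16, "")


emoji_form = with_presentation("\u203c", emoji=True)
text_form = with_presentation("\u2049", emoji=False)
print([f"U+{ord(c):04X}" for c in emoji_form])
print([f"U+{ord(c):04X}" for c in text_form])
print(strip_presentation(emoji_form + text_form) == "\u203c\u2049")
```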

History and Development

Initial Encoding and Early Versions

The General Punctuation block originated in Unicode 1.0.0, released in October 1991 by the Unicode Consortium, as part of the initial effort to create a universal character encoding standard compatible with emerging international norms. This version allocated the block at U+2000–U+206F and encoded 67 characters, drawing on established typographic conventions and existing standards for spaces (such as en space U+2002 and em space U+2003), dashes (including hyphen U+2010 and em dash U+2014), and quotation marks (such as left single quotation mark U+2018). The selection focused on punctuation essential for Western scripts, preserving compatibility with legacy 8-bit encodings while providing dedicated codes for non-breaking and variable-width variants that were absent or ambiguously defined in earlier systems.

The block's creation stemmed from proposals submitted to ISO/IEC JTC1/SC2/WG2, the international working group responsible for coded character set standards, which emphasized typographic requirements for multilingual text processing beyond the limits of ASCII. This alignment eased the integration of Unicode with the nascent ISO/IEC 10646 standard, in which General Punctuation served as a neutral repository of script-agnostic symbols for global document interchange.

Unicode 1.1, released in June 1993, expanded the block to address internationalization needs, adding characters such as further quotation variants (e.g., single low-9 quotation mark U+201A) and bidirectional formatting marks (e.g., U+200E and U+200F) for right-to-left scripts such as Arabic and Hebrew, along with the zero width joiner (U+200D) and horizontal ellipsis (U+2026). These inclusions brought the total to 76 characters, reflecting feedback from early implementers and ISO harmonization efforts. No further characters were added in Unicode 2.0 (July 1996), which focused on other expansions; the total remained at 76, and the block was established as a foundational part of the Basic Multilingual Plane, balancing backward compatibility with room for extension.

Amendments, Deprecations, and Modern Updates

In Unicode 3.0, released in September 1999, the General Punctuation block gained seven characters: U+202F NARROW NO-BREAK SPACE and six punctuation marks (U+2048 through U+204D). At the same time, the characters U+206A–U+206F (INHIBIT SYMMETRIC SWAPPING, ACTIVATE SYMMETRIC SWAPPING, INHIBIT ARABIC FORM SHAPING, ACTIVATE ARABIC FORM SHAPING, NATIONAL DIGIT SHAPES, and NOMINAL DIGIT SHAPES) were deprecated because of defects and limited utility, and their use is strongly discouraged.

Unicode 3.2, released in 2002, expanded the block with 12 new characters, including U+2047 DOUBLE QUESTION MARK (⁇), U+205F MEDIUM MATHEMATICAL SPACE, and the invisible operators U+2061 FUNCTION APPLICATION through U+2063 INVISIBLE SEPARATOR, which support mathematical and formatting needs across scripts. These additions brought the number of encoded characters in the block to 95.

From Unicode 4.0 (2003) to 6.3 (2013), the block received its final expansions, totaling 16 new characters:

- Unicode 4.0: U+2053 SWUNG DASH and U+2054 INVERTED UNDERTIE.
- Unicode 4.1: U+2055 FLOWER PUNCTUATION MARK, U+2056 THREE DOT PUNCTUATION, and U+2058 through U+205E (dot punctuation and related marks, from U+2058 FOUR DOT PUNCTUATION to U+205E VERTICAL FOUR DOTS).
- Unicode 5.1: U+2064 INVISIBLE PLUS.
- Unicode 6.3: the bidirectional isolates U+2066 LEFT-TO-RIGHT ISOLATE through U+2069 POP DIRECTIONAL ISOLATE, which improved handling of mixed-direction text by providing explicit isolation mechanisms.

By Unicode 6.3, the block had stabilized at 112 code points with 111 assigned characters, marking the end of new encodings as the focus shifted to refinement.

From Unicode 7.0 in 2014 through version 17.0 in 2025, no new characters have been assigned to the General Punctuation block, reflecting its maturity and stability. As of November 2025, this remains the case with Unicode 18.0 in development. Updates in this period have instead clarified the Unicode Standard, for example through improved documentation of interactions with variation selectors for glyph selection and of compatibility with emoji presentation styles, supporting consistent rendering of punctuation in modern digital environments.

    The most widely known use for both is within footnotes after the more common asterisk (*). *This is the first footnote. †This is the second footnote. ‡This is ...
  21. [21]
    Interrobang Punctuation: How to Use the Interrobang - MasterClass
    Aug 20, 2021 · The interrobang is a lesser-known punctuation mark that combines the question mark and exclamation point. Learn about the history of the ...
  22. [22]
    Chapter 22 – Unicode 16.0.0
    The General Punctuation block contains several special format control characters known as invisible operators, which can be used to make such operators ...
  23. [23]
  24. [24]
    StandardizedVariants.txt - Unicode
    # Rotations are counter-clockwise when text is mirrored right-to-left. 13012 FE03; rotated approximately 30 degrees; # EGYPTIAN HIEROGLYPH A015 13091 FE00; ...
  25. [25]
    [PDF] L2/18-013 - Proposal to add standardized variation sequences for ...
    Jan 8, 2018 · This document is a proposal for adding 63 standardized variation sequences (SVSes) for 43 characters that use. VS1 through VS3 (aka U+FE00 ...
  26. [26]
    UTS #37: Unicode Ideographic Variation Database
    The Unicode Standard accommodates those circumstances with variation selectors: the code point of a graphic character can be followed by the code point of a ...
  27. [27]
    [PDF] Variation Selectors Supplement - The Unicode Standard, Version 17.0
    Ideographic-specific variation selectors. For documentation about use of these with ideographs, see. UTS #37, Unicode Ideographic Variation Database. E0100 U ...
  28. [28]
    Emoji Versions & Sources, v17.0 - Unicode
    This chart shows when each emoji code point first appeared in a Unicode version, and which sources the character corresponds to.<|control11|><|separator|>
  29. [29]
    Emoji Version 1.0 - Emojipedia
    Emoji Version 1.0. The first release of emoji documentation from Unicode, which includes all emojis approved between 2010—2015. Published in August 2015 ...
  30. [30]
    Emoji Presentation Sequences, v17.0 - Unicode
    This chart lists the emoji presentation sequences and text presentation sequences, which are specified in the emoji-variation-sequences.txt data file.
  31. [31]
    UTS #51: Unicode Emoji
    The character U+FE0F VARIATION SELECTOR-16 (VS16), used to request an emoji presentation for an emoji character. (Also known ...
  32. [32]
    Unicode 1.0
    Jul 15, 2015 · Unicode 1.0 is a major version of the Unicode Standard. It was the first published version of the standard.Missing: Punctuation | Show results with:Punctuation
  33. [33]
    Unicode/Versions - Wikibooks, open books for an open world
    A narrow no-break space and 6 additional punctuation marks (total 7 characters) were added to General Punctuation. (U+202F and U+2048-U+204D); The Kip ...Unicode 1.0 · Unicode 1.1 · Unicode 4.0 · Unicode 5.1
  34. [34]
  35. [35]
    Enumerated Versions - Unicode
    This page lists all Unicode Standard versions, including the latest (17.0.0 in 2025) and older versions, with links to announcements.
  36. [36]
  37. [37]
  38. [38]
    None
    Summary of each segment: