Zero-width space
The zero-width space (ZWSP), designated as Unicode character U+200B, is a non-printing format character that occupies no visual width and serves to indicate invisible word boundaries and line break opportunities in text processing.[1] Introduced in Unicode 1.0 in 1991 as part of the General Punctuation block, the ZWSP was initially classified under the space category (Zs) but was later reclassified as a format character (Cf) to emphasize its role in layout control rather than spacing.[2][3] This character enables proper text rendering in languages and scripts that lack explicit visible spaces between words, such as Thai, Lao, Khmer, Myanmar, and certain uses in Japanese.[4]
In the Unicode Line Breaking Algorithm (UAX #14), the ZWSP functions as a break opportunity: line breaks are prohibited before it (along with regular spaces), but explicitly allowed immediately after it, allowing it to delimit words without altering visual appearance.[5] During text justification, it permits the addition of inter-letter spacing, distinguishing it from fixed-width spaces like the en space (U+2002).[1][4] The ZWSP differs from related zero-width characters, such as the zero-width non-joiner (U+200C) and zero-width joiner (U+200D), which control glyph joining and ligature formation in cursive scripts rather than providing break points.[4] While primarily a tool for typesetting and internationalization, its invisible nature requires careful handling in editing software to avoid unintended insertions or rendering issues.[5]
Overview
Definition
The zero-width space (ZWSP) is a non-printing Unicode character designated as U+200B, which occupies no horizontal space in rendered text but serves to indicate a potential word boundary or line break opportunity.[4] This invisible character is detectable by text processing systems, allowing them to apply breaking rules without altering the visual layout.[5]
Introduced in Unicode 1.0 in October 1991 as part of the General Punctuation block (U+2000 to U+206F), the ZWSP was developed to support international text handling in digital environments.[6] It addresses the needs of scripts that lack explicit visible spaces between words, such as Thai, Lao, Khmer, Myanmar, and Japanese, by providing an invisible separator for word breaks in typesetting and layout algorithms.[4]
In rendering, the ZWSP remains completely invisible to users, with no glyph or advance width, yet it influences line-breaking behavior in applications like word processors and web browsers.[4] This property makes it essential for maintaining readability in complex scripts while preserving the intended structure of the text.[5]
Distinctions from Related Characters
The zero-width space (U+200B), commonly abbreviated as ZWSP, is primarily intended for invisible word separation and line break control, allowing a line break opportunity without visible width, though it may expand slightly in justified text.[4] In contrast, the zero-width non-joiner (U+200C, ZWNJ) serves to separate characters that would otherwise form ligatures or join in cursive scripts, such as Arabic or Indic scripts, but it does not create a line break opportunity or affect word boundaries.[4] Thus, while both are invisible, the ZWSP facilitates potential breaks for formatting, whereas the ZWNJ prevents unwanted glyph joining without altering layout flow.[7]
The zero-width joiner (U+200D, ZWJ) functions oppositely to the ZWNJ by requesting the joining of adjacent characters that would not normally connect, such as combining base emojis into sequences (e.g., family emojis) or linking elements in scripts like Devanagari.[4] Unlike the ZWSP, which permits separation and breaks, the ZWJ enforces visual or semantic unity without influencing line breaking, making it unsuitable for word boundary marking.[7] This distinction is critical in complex text rendering, where misuse could disrupt intended glyph formation or emoji display.
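The three characters compared so far can be inspected programmatically. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe and the sample string are illustrative assumptions, and the reported values depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# The three characters compared above, keyed by their conventional abbreviations.
ZERO_WIDTH = {
    "ZWSP": "\u200b",
    "ZWNJ": "\u200c",
    "ZWJ": "\u200d",
}


def describe(ch: str) -> dict:
    """Return a few Unicode Character Database properties for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "is_space": ch.isspace(),
    }


for label, ch in ZERO_WIDTH.items():
    print(label, describe(ch))

# The characters are invisible when rendered but still count as code points,
# so two strings that look identical on screen can compare unequal.
plain = "input/output"
with_zwsp = "input/\u200boutput"
print(len(plain), len(with_zwsp), plain == with_zwsp)
```

All three share the format category, so the difference between them lies in how line breaking and shaping engines interpret them, not in anything a simple property lookup or string comparison reveals.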
Another related character is the zero-width no-break space (U+FEFF, ZWNBSP), which prohibits line breaks at its position to maintain text integrity, such as preventing unwanted separation in phrases, and is widely used as a byte order mark (BOM) to indicate encoding in files like UTF-8 or UTF-16.[4] In contrast to the ZWSP's breakable nature, the ZWNBSP ensures non-breaking behavior; Unicode deprecates that use of U+FEFF in favor of U+2060 WORD JOINER, leaving U+FEFF primarily for the byte order mark role.[8]
Fundamentally, these differences highlight the ZWSP's role as a neutral, breakable separator for general text processing, while the ZWNJ, ZWJ, and ZWNBSP focus on controlling joining or prohibiting breaks in contexts involving complex scripts or encoding signatures.[4]
Encoding and Standards
Unicode Specification
The zero-width space is encoded in the Unicode Standard at code point U+200B, named ZERO WIDTH SPACE, and resides in the General Punctuation block spanning U+2000 through U+206F.[9] It was first assigned in Unicode Version 1.0.0, released in October 1991, and has remained stable in its core encoding since that initial version, with no subsequent reallocation or deprecation.[10]
In the Unicode Character Database, it is classified with the General_Category property value of Cf (Other, Format), reflecting its role as a formatting control character rather than a visible space; this category was updated from the original Zs (Separator, Space) value in Unicode Version 4.0.1 to better align with its invisible, non-spacing behavior.[9][11] Additional key properties include Bidi_Class=BN (Boundary Neutral), which ensures it does not affect bidirectional text embedding levels, and Line_Break=ZW, a class designated specifically for enabling invisible line break opportunities without contributing to text width.[9][12] The character is treated as invisible in rendering, with no default glyph or advance width, and it is designated as a Default_Ignorable_Code_Point, meaning it should produce no visible rendering even where its line breaking and other formatting effects are not supported.[13][4]
No major changes to its encoding or primary properties have occurred since Unicode 1.0.0 beyond the General_Category adjustment, underscoring its foundational status in the standard.[14] It is referenced in Unicode Standard Annex #9 (Unicode Bidirectional Algorithm) as a boundary neutral format character that preserves text directionality without visual impact, and in Unicode Standard Annex #14 (Unicode Line Breaking Algorithm) as a dedicated class for zero-width break opportunities, distinct from other spaces or joiners.[5]
Representations in Markup and Protocols
In HTML, the zero-width space (U+200B) can be represented using the numeric character references &#8203; (decimal) or &#x200B; (hexadecimal).[15] These references allow the character to be written using only ASCII, which helps where an editor or toolchain does not handle the literal character well.
In XML and related markup languages such as those used in RSS and Atom feeds, the zero-width space is typically inserted directly in UTF-8 (bytes E2 80 8B) or UTF-16 encoding, or via numeric character references like &#8203; or &#x200B; to ensure compatibility across parsers. This approach is common for invisible separators in structured feeds, where the character maintains document integrity without visual impact.
Programming languages provide standard methods to generate the zero-width space in string literals, often for testing or text manipulation. In JavaScript, it is created using String.fromCharCode(8203).[16] In Python, chr(0x200B) or chr(8203) returns the character. In C#, the Unicode escape sequence \u200B embeds it directly in strings.
In network protocols, the zero-width space requires specific encoding for transmission. In URLs, it is percent-encoded in UTF-8 as %E2%80%8B to handle the multi-byte sequence safely. For email via MIME (RFC 2045), it is carried in quoted-printable bodies (as =E2=80=8B when the charset is UTF-8) or in base64-encoded bodies, so that it survives 7-bit transports. In JSON (RFC 8259), U+200B may appear literally in a UTF-8 encoded document or be escaped as \u200b; the escaped form keeps the otherwise invisible character visible in the source.
Core Purposes
Word Boundary Marking
The zero-width space (ZWSP, U+200B) functions as an invisible delimiter to mark word boundaries in languages without visible inter-word spacing, such as Thai, Lao, and Khmer. By inserting ZWSP between words, text processors can accurately segment continuous scripts for tasks like dictionary lookups and search engine indexing, preserving the original visual appearance while enabling precise linguistic analysis.[17]
In computational linguistics, ZWSP aids parsers in identifying morpheme or word boundaries without disrupting layout, particularly useful in non-spaced scripts or for annotating compounds. For instance, in Thai text like "สวัสดี" (hello), placing a ZWSP after "สวัส" distinguishes it as a compound for processing, such as in segmentation algorithms. This approach enhances applications like natural language processing tools that rely on explicit boundaries for tokenization.[18]
The use of ZWSP improves text search accuracy by providing reliable word-level granularity, especially when combined with Unicode text segmentation rules, and supports hyphenation in typesetting systems like LaTeX by allowing breaks at designated points in compounds without visible gaps. For example, in LaTeX, inserting ZWSP after a slash in terms like "input/output" enables proper hyphenation while maintaining compound integrity.[18]
Line Break Facilitation
The zero-width space (ZWSP, U+200B) serves as an invisible delimiter in text layout systems, permitting line wrapping at designated points without altering the visual appearance of the content. According to the Unicode Line Breaking Algorithm outlined in UAX #14, ZWSP is assigned the line breaking property class ZW, which enforces specific rules for break opportunities: breaks are prohibited before ZWSP (rule LB7: × ZW), but allowed after it, even if spaces intervene (rule LB8: ZW SP* ÷). This mechanism positions ZWSP as a non-hyphenating alternative to the soft hyphen (U+00AD), enabling controlled fragmentation of otherwise unbreakable sequences while avoiding the insertion of a hyphen mark.[19]
In practical scenarios, ZWSP facilitates line breaks in extended constructs such as URLs: inserting it within a long address such as "https://example.com/very/long/path" prevents horizontal overflow in constrained viewports without compromising readability.[5] Similarly, it supports wrapping in chemical formulas, like long molecular notations (e.g., C₆₀H₁₂₂), and ideographic scripts such as Chinese, Japanese, or Korean (CJK), where traditional spaces are absent and natural break points are scarce; UAX #14 explicitly notes its utility for indicating potential breaks in non-Latin scripts.[19] These applications ensure text flows adaptively across devices and formats, maintaining semantic integrity.
Within web styling contexts, ZWSP integrates with CSS line-breaking behaviors to allow discretionary wraps where standard spaces would introduce undesired width; for example, browsers treat it equivalently to the HTML <wbr> element.
Practical Applications
In Multilingual Text Processing
The zero-width space (U+200B) plays a key role in East Asian typography, particularly for Chinese, Japanese, and Korean (CJK) languages, where it provides subtle control over spacing and line breaks without introducing visible gaps. In CJK text processing, algorithms may automatically insert proportional spacing between characters for justification, but inserting a zero-width space can override this behavior to prevent unwanted auto-spacing, ensuring precise layout in horizontal or vertical arrangements.[21] For instance, in Japanese typesetting, the zero-width space facilitates break opportunities, as in CSS features for phrase breaking, maintaining aesthetic balance across lines while adhering to monospaced character grids.[22] This application is especially useful in digital typesetting software, where CJK justification relies on distributed spacing rather than word gaps, and the zero-width space acts as an invisible delimiter to fine-tune character distribution.[21]
In bidirectional text processing, the zero-width space supports layouts involving right-to-left scripts like Hebrew and Arabic by serving as a neutral formatting character that does not alter the overall directional flow. According to the Unicode Bidirectional Algorithm (UAX #9), neutral characters such as whitespace (WS class) and boundary neutrals like the zero-width space (U+200B, BN class) are treated as neutral, allowing them to embed boundaries between directional runs, such as isolating left-to-right insertions like numbers or English terms, without forcing reordering or embedding levels that could disrupt the primary right-to-left progression.[23] This neutrality ensures that the zero-width space can mark logical separations in mixed-direction text, such as in Arabic sentences containing Hebrew quotes, while preserving the visual integrity of the right-to-left rendering as defined in the algorithm's resolution phases.
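The Bidi_Class values mentioned in the preceding paragraph can be looked up directly. The sketch below uses Python's standard unicodedata module to list the class of each code point in a short mixed-direction string; the sample text and the helper name bidi_classes are illustrative assumptions. It only reports the input classes and does not implement the Bidirectional Algorithm itself.

```python
import unicodedata

# Illustrative sample: four Hebrew letters, a ZWSP, Latin letters, a space, a digit.
sample = "\u05e9\u05dc\u05d5\u05dd\u200babc 1"


def bidi_classes(text: str) -> list:
    """Pair each code point with its Bidi_Class value from the Unicode Character Database."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


for code_point, bidi_class in bidi_classes(sample):
    print(code_point, bidi_class)
```

In the printed list the zero-width space reports BN while the ordinary space reports WS, which is the distinction the paragraph above draws between boundary neutrals and whitespace.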
For search and indexing in natural language processing (NLP), the zero-width space improves tokenization accuracy in script-mixed multilingual text by explicitly indicating word boundaries where visible spaces are absent or ambiguous. Unicode Standard Annex #29 specifies that U+200B functions as a deliberate word separator, enabling tools to distinguish tokens in languages without inter-word spacing, such as when English words are embedded in Arabic or Thai sentences; for example, inserting it between "hello" and an adjacent Arabic term prevents erroneous merging during indexing. Libraries like the International Components for Unicode (ICU) incorporate these rules in their BreakIterator implementation, supporting precise segmentation for search engines and NLP pipelines handling diverse scripts, thus enhancing retrieval relevance in global corpora.
In multilingual input methods, the zero-width space facilitates the entry of invisible punctuation across language locales, particularly in Linux environments where keyboard configurations allow insertion via modifier keys for non-printing characters essential to script-specific formatting. For example, certain international layouts enable users to produce U+200B through compose sequences or AltGr combinations, aiding typists in adding subtle boundaries during real-time composition of mixed-script documents without visible artifacts.[24]
In Web Development and HTML
In web development, the zero-width space (U+200B) is inserted into HTML using the numeric character reference &#8203;, enabling line breaks within inline elements without introducing visible spacing.[25] This technique is particularly useful for maintaining layout integrity in scenarios where standard spaces would disrupt design, such as in navigation menus where items need to wrap responsively on smaller screens without awkward gaps.[26] For instance, placing &#8203; between menu link text allows the browser to break the line at that point if the viewport narrows, preserving readability without adding width. Similarly, in code snippets displayed inline, &#8203; facilitates natural line wrapping for long identifiers or URLs, ensuring they do not overflow containers while mimicking the original formatting.[27]
When integrated with CSS, the zero-width space enhances text wrapping behaviors, especially in responsive designs. It pairs effectively with the white-space: pre-wrap property, which preserves whitespace and line breaks while still allowing lines to wrap, so developers can embed ZWSP strategically to control where breaks occur without altering the visual flow. Combined with word-break: break-word, ZWSP provides subtle opportunities for hyphenless breaks in long, space-less strings like URLs or compound words, preventing overflow in fluid layouts across devices.[27] This approach is common in mobile-first designs, where precise control over text reflow is essential to avoid horizontal scrolling.
In JavaScript, handling zero-width spaces is crucial for input sanitization to mitigate security risks like injection attacks or hidden payloads in user-submitted data.
Developers often detect and remove ZWSP using regular expressions, such as string.replace(/\u200B/g, ''), which targets the Unicode code point and strips all instances globally.[28] This method ensures clean form data processing, particularly in web applications where malicious actors might embed invisible characters to evade validation filters.[29]
The zero-width space gained prominence in web development following the adoption of HTML5 around 2010, as mobile browsers improved Unicode support and responsive techniques became standard, enabling better handling of invisible formatting in cross-device layouts.
In Typography and Document Formatting
In desktop publishing applications like Adobe InDesign and Microsoft Word, the zero-width space (ZWSP, U+200B) serves as an invisible delimiter to facilitate kerning adjustments in non-Latin scripts, such as Thai or Arabic, where visible spaces are absent between words.[30] By marking word boundaries without adding width, it enables typesetting software to apply appropriate inter-character spacing and optical kerning metrics tailored to the font's design, preventing awkward gaps or overlaps in complex layouts.[31] This is particularly useful for maintaining readability in documents mixing scripts, as the ZWSP informs the engine of logical breaks for justification without altering visual appearance.[7]
As an alternative to soft hyphens, the ZWSP allows line breaks in justified text without introducing visible hyphenation marks, promoting cleaner typography in professional outputs.[32] For instance, in legal documents requiring precise and unobtrusive formatting, inserting a ZWSP at potential break points ensures even line endings across paragraphs while avoiding the aesthetic disruption of hyphens, which can imply fragmentation in formal prose.[7] This approach supports full justification by permitting controlled letter-spacing expansion instead of erratic word gaps, aligning with typographic best practices for high-legibility print media.[33]
During PDF and ePub generation, the ZWSP aids consistent rendering across devices by explicitly signaling allowable line breaks within embedded fonts, which may vary in glyph metrics or justification behavior.[34] This prevents overflow or reflow issues in digital publications, especially for long compounds or non-spaced scripts, ensuring the document's layout integrity regardless of the reader's font substitution or screen size.[35]
In LaTeX environments, packages such as polyglossia incorporate the ZWSP to manage script-specific spacing in multilingual PDFs, inserting it dynamically at language transitions or word boundaries to enforce proper hyphenation and kerning rules per script.[36] This enhances output quality for documents blending Latin and non-Latin content, like academic texts, by leveraging XeLaTeX's fontspec integration for precise, invisible adjustments.[37]
Restrictions and Challenges
Prohibitions in Identifiers
The zero-width space (U+200B) is prohibited in internationalized domain names (IDNs) under ICANN policies to prevent homograph attacks and invisible spoofing that could enable phishing or visual deception. The ICANN briefing on IDN permissible code points discusses U+200B as a non-displayed punctuation character in the context of potential user confusion.[38] Similarly, RFC 5892, which outlines the Unicode code points eligible for IDNA labels, classifies U+200B as disallowed, excluding it from the protocol-valid (PVALID) category and thereby barring its use in registered domain labels.[39]
Among programming languages, Java handles the zero-width space specially in identifiers: as a format character, U+200B is treated as ignorable rather than as a letter or digit under the Java Language Specification's category-based identifier rules, which limits its usefulness for hiding differences between names.[40] In contrast, the C++ ISO standard permits U+200B in identifiers under its Unicode support rules, but this allowance is criticized for enabling invisible variations in code that can lead to subtle bugs or malicious insertions.[41]
Security implications arise from the zero-width space's potential for malicious use in phishing, where it is inserted into URLs to create homographic domains like "example.com" that evade detection while appearing identical to legitimate ones.
This technique, dubbed Z-WASP (zero-width space phishing), has been employed to bypass protections in email systems such as Microsoft Office 365 by obfuscating malicious links without altering their functionality.[42][43] As of January 2025, variants like "shy z-wasp" phishing continue to exploit zero-width characters in email campaigns.[44]
Policy evolution regarding the zero-width space in identifiers reflects growing awareness of Unicode security risks, with Unicode Technical Report #36 (updated in 2010) discouraging the use of invisible characters like U+200B in user-facing identifiers to mitigate confusability and spoofing threats.[45] This guidance influences standards bodies and implementers to prioritize visible, unambiguous characters in contexts like domain registration and code naming.
Compatibility and Rendering Issues
The zero-width space (U+200B) presents several compatibility challenges in web browsers, particularly in older versions. For instance, early implementations in Internet Explorer, such as version 6, did not fully support the character in certain fonts, resulting in it being ignored or rendered incorrectly, which disrupted intended line break opportunities. In Safari, inserting zero-width spaces into HTML for URL wrapping could trigger crashes during PDF generation in versions from around 2006.[46] Modern browsers like Chrome have also been observed to inadvertently insert U+200B into copied code snippets from developer tools, complicating debugging.[47]
Rendering variations occur across fonts and operating systems. Fonts such as Arial Unicode MS provide support for U+200B, but without appropriate fallback mechanisms, WebKit-based browsers may display it as a square or a tiny visible gap if the primary font lacks the glyph, leading to inconsistent visual output. Input methods on macOS and Windows can accidentally insert the character; for example, selecting text (e.g., via Cmd+A) in web applications like Outlook Web App on macOS has been reported to add extraneous U+200B instances.[48] In legacy ASCII-only environments, the Unicode character cannot be represented and is typically substituted with a replacement glyph like "?" or stripped entirely, nullifying its formatting role.
To mitigate these problems, developers commonly employ regular expressions such as /\u200B/g in JavaScript to detect and remove zero-width spaces, a practice increasingly routine in code audits since the early 2020s to eliminate artifacts from web copy-pasting.
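The detect-and-remove step described above can be sketched in Python, used here as a stand-in for the JavaScript pattern mentioned in the text. The function names find_zwsp and strip_zwsp and the sample string are hypothetical; the sketch assumes only U+200B is targeted and leaves other invisible characters alone.

```python
import re

# Assumption: only U+200B is targeted, mirroring the JavaScript pattern /\u200B/g.
ZWSP_PATTERN = re.compile("\u200b")


def find_zwsp(text: str) -> list:
    """Return the indices at which U+200B occurs in text."""
    return [match.start() for match in ZWSP_PATTERN.finditer(text)]


def strip_zwsp(text: str) -> str:
    """Remove every U+200B from text, leaving all other characters untouched."""
    return ZWSP_PATTERN.sub("", text)


# Hypothetical snippet pasted from a web page, carrying two stray zero-width spaces.
pasted = "total\u200b = price\u200b * qty"
print(find_zwsp(pasted))
print(strip_zwsp(pasted))
```

Reporting positions before stripping makes it easier to audit where the characters came from. Blanket removal suits source code and form fields, but it would also discard intentional break opportunities in text written in scripts such as Thai or Khmer, so it is best applied selectively.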