
Unicode control characters

Unicode control characters are a category of non-printing characters in the Unicode Standard, Version 17.0, designed to manage text processing, layout, and formatting without visible rendering. They encompass 65 legacy control codes (the C0 and C1 sets) allocated at code points U+0000–U+001F, U+007F, and U+0080–U+009F for compatibility with ISO/IEC 2022 and earlier standards like ASCII, as well as modern format characters that influence bidirectional ordering, line breaking, and glyph variation. These characters ensure consistent interchange of text across systems; the semantics of the legacy codes are largely determined by applications rather than by Unicode itself, which preserves their original meanings during encoding transformations.

The C0 controls, originating from the 7-bit ASCII set, include fundamental functions such as U+0009 (HORIZONTAL TAB, HT) for spacing and U+000A (LINE FEED, LF) paired with U+000D (CARRIAGE RETURN, CR) for newline operations. The C1 controls, an extension from 8-bit ISO standards, provide additional capabilities like U+0085 (NEXT LINE, NEL) for line termination and U+008D (REVERSE LINE FEED, RI) for cursor movement in terminal emulations.

Beyond these legacy codes, Unicode defines format characters as invisible operators that affect text layout without altering content, categorized into layout controls (e.g., U+200B ZERO WIDTH SPACE for word boundaries and U+2028 LINE SEPARATOR for explicit line breaks), join controls (e.g., U+200D ZERO WIDTH JOINER for cursive connections in scripts like Arabic), and variation selectors (e.g., U+FE0F VARIATION SELECTOR-16 to specify emoji-style glyphs). These control characters enable robust handling of diverse scripts and text directions, such as right-to-left ordering via the U+202A–U+202E bidirectional controls. Deprecated format characters, like U+206A INHIBIT SYMMETRIC SWAPPING, are retained for legacy support but discouraged in new implementations. Overall, control characters facilitate precise text manipulation in computing environments, from document rendering to data interchange.

Overview and Classification

Definition and Role in Text Processing

Unicode control characters are non-printable characters defined in the Unicode Standard that manage the processing, formatting, and interpretation of text streams, distinguishing them from printable glyphs that represent visible symbols. These characters belong primarily to the general categories Cc (Other, Control) and Cf (Other, Format), where Cc encompasses legacy control codes for basic text control, and Cf includes format effectors that influence rendering without visible output. The Unicode code space runs from U+0000 to U+10FFFF, though most control characters reside in the Basic Multilingual Plane (U+0000 to U+FFFF).

The origins of Unicode control characters trace back to the ASCII standard published in 1963, which introduced 33 control codes for hardware and communication control, later extended in international standards like ISO/IEC 646 (first published 1967) and the ISO 8859 series to include additional sets like the C1 controls. Unicode unifies and expands these legacy codes into a global framework, preserving their semantics while adapting them for multilingual text handling across diverse scripts and devices.

In text processing, these characters enable applications and rendering engines, such as those in HTML/CSS or font systems, to handle line breaks, directionality, glyph selection, and semantic annotations without altering the visible content. For instance, they support complex scripts by facilitating bidirectional text in languages like Arabic and Hebrew through the Unicode Bidirectional Algorithm, and they allow emoji variations via variation selectors that specify presentation styles. The Unicode Standard, version 17.0 (released September 9, 2025), details their semantics and processing rules in Chapter 23, emphasizing their role in ensuring consistent text interpretation across platforms.

Unicode General Categories for Control Characters

Unicode assigns a general category to every character in its repertoire as part of the Unicode Character Database, providing a fundamental classification based on usage for purposes such as text processing and rendering. Control characters are primarily grouped into the "Cc" (Other, Control) and "Cf" (Other, Format) categories. The Cc category consists of 65 characters, exemplified by the NULL control U+0000, which represent non-printing signaling codes derived from legacy standards like ASCII and ISO 8859-1. These serve roles in data transmission and device control without visual representation. The Cf category includes 161 characters, such as the ZERO WIDTH SPACE U+200B, functioning as format effectors that alter text layout, bidirectional direction, or character joining without contributing visible glyphs.

Characters in the Cc category originate from Unicode version 1.0, with the full set stabilized by version 1.1 and no subsequent additions. Most have the bidirectional class "BN" (Boundary Neutral), so they do not initiate or alter text directionality in the bidirectional algorithm; the exceptions are the layout-related controls, such as U+0009 and U+000B (class S, Segment Separator), U+000C (class WS, Whitespace), and U+000A, U+000D, and U+0085 (class B, Paragraph Separator). Decomposition mappings are absent for Cc characters, preserving their atomic control nature. Most exhibit the line break class "CM" (Combining Mark), which prohibits breaks before them to maintain attachment to preceding content, though specific ones like carriage return U+000D use "CR" for mandatory breaks.

In contrast, Cf characters are typically invisible in rendering and specialize in modifying interpretive properties of adjacent text, such as enabling ligature joining, embedding directional isolates, or marking invisible operators. They often share the "BN" bidirectional class but can include classes like "PDF" (Pop Directional Format) for scope termination. Examples include the SOFT HYPHEN U+00AD (SHY), which marks discretionary hyphenation points without forcing breaks, so that a line breaking algorithm shows a hyphen only where it actually breaks the line. The Cf category has expanded over versions to accommodate complex script requirements.

Control categories Cc and Cf differ markedly from separator categories like Zs (Space Separator), Zl (Line Separator), and Zp (Paragraph Separator), which actively contribute to whitespace, line advancement, or paragraph boundaries. Controls neither add visual width nor directly form base units in cluster segmentation; instead, they invisibly guide processing rules without participating in visible text flow. As of Unicode 17.0, the combined Cc and Cf categories total 226 characters, covering essential legacy controls and evolving format needs; while Cc remains fixed at 65 since early versions, Cf continues to grow, as seen with additions like the Arabic Letter Mark U+061C for right-to-left letter isolation in Unicode 6.3.
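
The category distinctions above can be checked against whatever version of the Unicode Character Database ships with a given runtime. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `control_class` and its return labels are illustrative choices, not terms from the Standard, and counts for growing categories such as Cf will depend on the database version bundled with the interpreter.

```python
import sys
import unicodedata

def control_class(ch: str) -> str:
    """Bucket a character by general category (labels are illustrative)."""
    cat = unicodedata.category(ch)
    if cat == "Cc":
        return "legacy control"
    if cat == "Cf":
        return "format"
    if cat in ("Zs", "Zl", "Zp"):
        return "separator"
    return "other"

# Count Cc characters across the whole code space; this set is fixed.
cc_count = sum(
    1 for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cc"
)

print(control_class("\u0000"), control_class("\u200b"), control_class("\u2028"))
print("Cc characters:", cc_count)
# Bidirectional classes of a few C0 controls (not all are BN).
print(unicodedata.bidirectional("\t"), unicodedata.bidirectional("\x00"))
```

Because Cc is closed to additions, the Cc count is stable across interpreter versions, whereas the same loop applied to Cf will report different totals as the bundled database is updated.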

Legacy Control Codes from ASCII and ISO Standards

C0 Control Codes

The C0 control codes comprise the original set of 33 non-printable control characters defined in the 7-bit ASCII standard and in ISO 646, subsequently incorporated into Unicode version 1.0 as code points U+0000 through U+001F, along with U+007F for the DELETE character. These codes were designed primarily for controlling data transmission, mechanical printing devices, and early computer terminals, without adding visible symbols to the text stream. In Unicode, they retain their historical semantics but are treated as implementation-defined, allowing applications to interpret them based on context rather than mandating specific behaviors. The following table enumerates the C0 control codes, including their Unicode code points, official names (with common aliases from ISO/IEC 6429:1992), and traditional functions:

| Code Point | Name (Aliases) | Traditional Function |
|---|---|---|
| U+0000 | NULL (NUL) | Serves as a media or time-fill character; often ignored in data streams without affecting content. |
| U+0001 | START OF HEADING (SOH) | Marks the beginning of a message heading in data transmission protocols. |
| U+0002 | START OF TEXT (STX) | Indicates the start of the main text body, terminating any preceding heading. |
| U+0003 | END OF TEXT (ETX) | Signals the end of the text portion, often prompting a response in communications. |
| U+0004 | END OF TRANSMISSION (EOT) | Denotes the conclusion of one or more transmitted texts or files. |
| U+0005 | ENQUIRY (ENQ) | Requests identification or status from a receiving station in data links. |
| U+0006 | ACKNOWLEDGE (ACK) | Confirms successful receipt of data as an affirmative response. |
| U+0007 | BELL (BEL) | Triggers an audible alert or attention signal on terminals and printers. |
| U+0008 | BACKSPACE (BS) | Moves the active position back one character space for overstriking or cursor control. |
| U+0009 | CHARACTER TABULATION (HT) | Advances the position to the next horizontal tab stop for text alignment. |
| U+000A | LINE FEED (LF, NL, EOL) | Advances to the next line, maintaining the horizontal position. |
| U+000B | LINE TABULATION (VT) | Moves to the next predetermined vertical line position. |
| U+000C | FORM FEED (FF) | Ejects the current page or form on printers, advancing to the next sheet. |
| U+000D | CARRIAGE RETURN (CR) | Returns the position to the start of the current line. |
| U+000E | SHIFT OUT (SO) | Invokes an extended graphic character set (e.g., in 8-bit environments as LOCKING-SHIFT ONE). |
| U+000F | SHIFT IN (SI) | Reverts to the standard character set (e.g., LOCKING-SHIFT ZERO in 8-bit modes). |
| U+0010 | DATA LINK ESCAPE (DLE) | Modifies the interpretation of subsequent characters for control purposes in links. |
| U+0011 | DEVICE CONTROL ONE (DC1) | Activates or configures a device, or resumes transmission (e.g., XON flow control). |
| U+0012 | DEVICE CONTROL TWO (DC2) | Triggers a specific device operation or mode setting. |
| U+0013 | DEVICE CONTROL THREE (DC3) | Pauses or stops a device (e.g., XOFF flow control). |
| U+0014 | DEVICE CONTROL FOUR (DC4) | Interrupts or halts a device function. |
| U+0015 | NEGATIVE ACKNOWLEDGE (NAK) | Indicates a negative response or error in data receipt. |
| U+0016 | SYNCHRONOUS IDLE (SYN) | Maintains timing synchronization in synchronous data transmission. |
| U+0017 | END OF TRANSMISSION BLOCK (ETB) | Marks the end of a logical data block. |
| U+0018 | CANCEL (CAN) | Aborts the preceding data, instructing the receiver to ignore it. |
| U+0019 | END OF MEDIUM (EM) | Signals the physical end of a storage medium or data section. |
| U+001A | SUBSTITUTE (SUB) | Replaces erroneous or invalid characters. |
| U+001B | ESCAPE (ESC) | Introduces escape sequences for additional control functions. |
| U+001C | FILE SEPARATOR (FS, IS4) | Delimits files in a hierarchical data structure. |
| U+001D | GROUP SEPARATOR (GS, IS3) | Separates groups within a file. |
| U+001E | RECORD SEPARATOR (RS, IS2) | Separates records within a group. |
| U+001F | UNIT SEPARATOR (US, IS1) | Separates units or fields within a record. |
| U+007F | DELETE (DEL) | Originally used to obliterate characters; in Unicode, treated as a control for padding or erasure. |

These codes found widespread traditional applications in terminal control, such as using LF and CR to advance to a new line (often combined as CRLF, U+000D U+000A, for Windows-style line breaks), HT for columnar formatting, and BEL for alerts. In printing, FF ejected pages, while BS enabled overprinting for effects like bold text or accents. For data communications, sequences like SOH, STX, ETX, and EOT provided framing to structure messages and ensure reliable transmission over serial links.

In Unicode, all C0 control codes belong to the Cc (Other, Control) general category, as defined in the Unicode Character Database. They have no canonical or compatibility decompositions, so applications must handle them explicitly. While their core roles in text processing persist, such as line breaks and tabs, their exact rendering and effects are left to the discretion of implementations, ensuring compatibility with legacy systems.
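
As an illustration of how the information separators in the table can structure data, the sketch below uses RS (U+001E) between records and US (U+001F) between fields. The function names and the flat two-level layout are assumptions made for this example; no particular protocol or file format is implied, and the scheme only works if the field values themselves never contain these two characters.

```python
RS = "\x1e"  # RECORD SEPARATOR: between records
US = "\x1f"  # UNIT SEPARATOR: between fields inside a record

def encode_records(records: list[list[str]]) -> str:
    """Join fields with US and records with RS (hypothetical helper)."""
    return RS.join(US.join(fields) for fields in records)

def decode_records(data: str) -> list[list[str]]:
    """Inverse of encode_records; an empty string decodes to no records."""
    if not data:
        return []
    return [record.split(US) for record in data.split(RS)]

rows = [["id", "name"], ["1", "Ada, Countess of Lovelace"], ["2", "Grace"]]
blob = encode_records(rows)
print(decode_records(blob) == rows)
```

Unlike comma-separated text, this layout needs no quoting for commas or newlines inside fields, which is the property the separators were designed to provide.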

C1 Control Codes

The C1 control codes comprise 32 characters standardized in ISO/IEC 6429:1992, which are mapped in Unicode to the code points U+0080 through U+009F within the C1 Controls and Latin-1 Supplement block. These codes extend the functionality of the C0 controls by providing additional mechanisms for text formatting, device management, and character-set shifts in 8-bit environments, originally designed to support display and printing devices like terminals and printers. In Unicode, all C1 codes belong to the Cc (Control) general category and are preserved primarily for round-trip compatibility with legacy encodings such as ISO/IEC 8859-1:1987, ensuring that data interchange with older systems does not alter semantics. The following table enumerates the C1 control codes, including their Unicode code points, standard abbreviations, and primary names as defined in ISO/IEC 6429:1992:

| Code Point | Abbreviation | Name |
|---|---|---|
| U+0080 | PAD | Padding Character |
| U+0081 | HOP | High Octet Preset |
| U+0082 | BPH | Break Permitted Here |
| U+0083 | NBH | No Break Here |
| U+0084 | IND | Index |
| U+0085 | NEL | Next Line |
| U+0086 | SSA | Start of Selected Area |
| U+0087 | ESA | End of Selected Area |
| U+0088 | HTS | Horizontal Tabulation Set |
| U+0089 | HTJ | Horizontal Tabulation with Justification |
| U+008A | VTS | Vertical Tabulation Set |
| U+008B | PLD | Partial Line Down (Forward) |
| U+008C | PLU | Partial Line Up (Backward) |
| U+008D | RI | Reverse Index |
| U+008E | SS2 | Single-Shift Two |
| U+008F | SS3 | Single-Shift Three |
| U+0090 | DCS | Device Control String |
| U+0091 | PU1 | Private Use One |
| U+0092 | PU2 | Private Use Two |
| U+0093 | STS | Set Transmit State |
| U+0094 | CCH | Cancel Character |
| U+0095 | MW | Message Waiting |
| U+0096 | SPA | Start of Protected Area |
| U+0097 | EPA | End of Protected Area |
| U+0098 | SOS | Start of String |
| U+0099 | SGCI | Single Graphic Character Introducer |
| U+009A | SCI | Single Character Introducer |
| U+009B | CSI | Control Sequence Introducer |
| U+009C | ST | String Terminator |
| U+009D | OSC | Operating System Command |
| U+009E | PM | Privacy Message |
| U+009F | APC | Application Program Command |

Originally developed in the late 1970s as part of standards like ECMA-48 and ISO 6429, the C1 codes were intended for terminal protocols and similar device-control applications, where they enabled advanced features like area selection (SSA/ESA), tabulation setup (HTS/VTS), and control sequence initiation via CSI (U+009B) for commands like cursor positioning or color changes. In 7-bit transmission environments, a C1 code can be represented by the ESC (U+001B) character followed by a single byte in the range 0x40–0x5F, allowing compatibility with ASCII-based systems.

Although many C1 codes are now obsolete in contemporary text processing and web standards due to the adoption of higher-level formatting mechanisms, they remain relevant in legacy systems; for instance, NEL (U+0085) serves as a line-breaking control in some legacy encodings (notably EBCDIC-derived text), moving the active position to the first character of the next line (with semantics equivalent to carriage return followed by line feed). As 8-bit extensions to the C0 controls from ASCII and related 7-bit standards, the C1 set enhances control over text layout and device behavior in environments requiring more granular manipulation than basic transmission protocols.
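
The 7-bit representation mentioned above follows a fixed arithmetic rule in ECMA-48: the C1 code at 0x80 + n corresponds to ESC followed by the byte 0x40 + n. A minimal sketch of that mapping is shown below; the function name `c1_to_7bit` is a hypothetical helper for illustration, not part of any library.

```python
ESC = "\x1b"

def c1_to_7bit(text: str) -> str:
    """Rewrite each C1 control (U+0080..U+009F) as ESC + (code point - 0x40)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            out.append(ESC + chr(cp - 0x40))
        else:
            out.append(ch)
    return "".join(out)

# CSI (U+009B) becomes the familiar two-character "ESC [" introducer.
print(repr(c1_to_7bit("\x9b31m")))
# NEL (U+0085) becomes "ESC E".
print(repr(c1_to_7bit("\x85")))
```

This is why terminal escape sequences are usually written with `ESC [` rather than a raw U+009B: the two forms are defined as equivalent, and the 7-bit form survives channels that do not pass 8-bit control bytes.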

Unicode-Introduced Separators and Invisible Formatters

Line and Paragraph Separators

Unicode introduces two dedicated characters for explicit line and paragraph separation in plain text: the Line Separator (LS, U+2028) and the Paragraph Separator (PS, U+2029). These characters were defined in the Unicode Standard version 1.0 to provide unambiguous structural breaks independent of platform-specific conventions. LS belongs to the General Category Zl (Separator, Line), while PS is in category Zp (Separator, Paragraph). Both have zero width and are not rendered visibly, serving instead as invisible controls that influence text layout and processing.

In terms of properties, LS creates a mandatory line break (Line Break class BK) without initiating a new paragraph, allowing the subsequent text to continue with the same formatting, such as indentation or alignment, akin to an HTML <br> element. PS also enforces a mandatory line break but resets paragraph-level formatting, applying interparagraph spacing, margins, and direction changes as needed. Their bidirectional classes differ: PS has class B (Paragraph Separator), which ends the paragraph for the bidirectional algorithm, whereas LS has class WS (Whitespace) and does not reset the paragraph embedding level. Zl and Zp each contain only this single character, and no additional characters have been added to either category.

These separators are used in applications requiring semantic text structure, such as XML and HTML documents, where they can be inserted via numeric character references like &#x2028; for LS, ensuring consistent rendering across systems without relying on ambiguous sequences. In word processors and international text editors, they support precise line wrapping for multilingual content, particularly in scripts with complex layout needs. They are preferred over legacy sequences like CRLF for semantic breaks because LS and PS carry explicit meaning preserved across processes, whereas CR (U+000D) and LF (U+000A), categorized as Cc (Other, Control), have implementation-defined behaviors that vary by platform.

LS and PS remain stable under Unicode normalization forms such as NFC and NFD, with no canonical or compatibility decomposition mappings, ensuring they are not altered during equivalence checks. This stability enables round-trip compatibility with legacy newline representations through mapping guidelines, such as those in Unicode Technical Report #13, while providing superior handling for diverse scripts like CJK, where consistent break points aid in layout algorithms.
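
When text from mixed sources has to be processed line by line, a common first step is to fold every newline convention into one. The sketch below maps CRLF, CR, NEL, LS, and PS to LF; the function name `normalize_newlines` is an assumption for this example, and note that folding PS into LF deliberately discards the line-versus-paragraph distinction described above, which may not be acceptable in every application.

```python
import re

# CRLF must be tried first so that it collapses to a single LF.
_NEWLINE_FUNCTIONS = re.compile("\r\n|[\r\x85\u2028\u2029]")

def normalize_newlines(text: str) -> str:
    """Replace CRLF, CR, NEL, LS, and PS with LF (lossy for LS vs PS)."""
    return _NEWLINE_FUNCTIONS.sub("\n", text)

sample = "a\r\nb\rc\x85d\u2028e\u2029f"
print(normalize_newlines(sample).split("\n"))
```

Python's own `str.splitlines()` also treats all of these characters as line boundaries, in addition to a few others such as form feed, so it can serve as a quick cross-check when no normalized string is needed.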

Zero-Width Spaces, Joiners, and Invisible Operators

Unicode's zero-width spaces, joiners, and invisible operators are format control characters (General_Category=Cf) that provide fine-grained control over text rendering without visible width, influencing glyph shaping and line breaking. These characters, primarily in the General Punctuation block (U+2000–U+206F), enable precise adjustments in scripts requiring complex joining behaviors or implicit operations, such as preventing or forcing ligatures in cursive scripts or inserting invisible multipliers in equations. They are default ignorable (Default_Ignorable_Code_Point=Yes) and classified as boundary neutral (Bidi_Class=BN) for bidirectional processing, ensuring they do not disrupt layout unless specified.

The zero-width non-joiner (ZWNJ, U+200C), introduced in Unicode 1.1, prevents the formation of ligatures or cursive connections between adjacent characters in scripts such as Arabic and in Indic scripts. For instance, in Arabic, inserting ZWNJ between ل (lam) and ا (alef) inhibits the default lam-alef ligature لا, rendering the two letters separately. This character has a Joining_Type of Non_Joining and a Line_Break property of CM, so it attaches to the preceding character for line breaking while signaling non-connection to shaping engines.

Complementing ZWNJ, the zero-width joiner (ZWJ, U+200D), also from Unicode 1.1, explicitly requests joining or ligature formation where it might not occur by default, such as in emoji sequences or half-consonant forms in Indic scripts. A prominent use is in family emoji compositions, like 👨‍👩‍👧 (man, ZWJ, woman, ZWJ, girl), where ZWJ binds the elements into a single glyph if supported by the font. ZWJ shares the Cf category and BN bidi class but has its own Line_Break class, ZWJ, and fonts typically implement its effects through OpenType GSUB tables for glyph substitution during rendering.

The zero-width space (ZWSP, U+200B), also present since Unicode 1.1, serves as an invisible break opportunity, allowing line breaks in languages without explicit word spaces, such as Thai. It has a Line_Break property of ZW, which provides an optional line break opportunity. Unlike visible spaces, ZWSP has no width of its own.

For non-breaking scenarios, the word joiner (WJ, U+2060), introduced in Unicode 3.2, acts as an invisible no-break space, preventing line breaks within compound words or phrases, where it replaces the deprecated use of U+FEFF ZERO WIDTH NO-BREAK SPACE for that purpose. WJ has a Line_Break property of WJ, explicitly prohibiting breaks, and is particularly useful in plain-text preservation of joined forms without affecting width or spacing.

In mathematical notation, invisible operators provide implicit symbols without visible rendering: U+2061 FUNCTION APPLICATION indicates function application, as in f(x); U+2062 INVISIBLE TIMES marks implied multiplication, as between 2 and x in 2x; U+2063 INVISIBLE SEPARATOR acts as an invisible comma, for example between the adjacent indices i and j of a subscript; and U+2064 INVISIBLE PLUS marks implied addition, as in a mixed number such as 3½. The first three were added in Unicode 3.2 and U+2064 in Unicode 5.1. These characters all have Line_Break=AL and support semantic interpretation by mathematical software and assistive technologies.

Specific to the Mongolian script, the Mongolian vowel separator (U+180E), added in Unicode 3.0, is placed between a word stem and a following final vowel to control the shaping of both, producing the separated vowel forms of traditional Mongolian writing. It has been classified as Cf since Unicode 6.3 (previously Zs), has Line_Break=GL (no break on either side), and relies on font shaping features for accurate glyph placement in vertical writing modes. Related spacing characters such as the hair space (U+200A, category Zs) offer a very narrow space for fine typographic adjustment, though they have visible width, unlike true zero-width operators. As of Unicode 17.0, no further Cf characters for joining or invisible operations have been added. All these characters depend on font and shaping-engine support for script-specific behaviors.
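
Because these characters are invisible, the practical way to see them is to list a string's code points. The sketch below does this for the family emoji sequence discussed above; `describe` is a hypothetical helper name, and character names come from the Unicode database bundled with the Python interpreter.

```python
import unicodedata

def describe(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for every code point, including invisible ones."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

# MAN, ZWJ, WOMAN, ZWJ, GIRL: one glyph in a supporting font, five code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

for line in describe(family):
    print(line)
print("code points:", len(family))
```

The same helper is useful for debugging strings that compare unequal despite looking identical, which is a frequent symptom of a stray ZWSP, ZWNJ, or word joiner.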

Directional and Layout Control Characters

Bidirectional Formatting Controls

Bidirectional formatting controls are a set of Unicode characters designed to explicitly manage text directionality in documents containing mixed left-to-right (LTR) and right-to-left (RTL) scripts, such as English interspersed with Arabic or Hebrew. These controls allow authors to embed directional segments or override the default bidirectional behavior without relying solely on the implicit Unicode Bidirectional Algorithm described in UAX #9. Introduced primarily in early Unicode versions, they are classified under the Format (Cf) general category and function as invisible characters that influence layout during rendering.

The core embedding controls include the Left-to-Right Embedding (LRE, U+202A), which treats subsequent text as embedded LTR by increasing the embedding level to the next even number greater than the current level; the Right-to-Left Embedding (RLE, U+202B), which embeds RTL text by setting the level to the next odd number; the Left-to-Right Override (LRO, U+202D), forcing strong LTR treatment regardless of character types; and the Right-to-Left Override (RLO, U+202E), enforcing strong RTL treatment. These were added in Unicode 1.1 and operate in a stack-based manner, with their effects nested and terminated by the Pop Directional Formatting (PDF, U+202C), which restores the previous embedding level. In practice, they enable precise control in plain text files or via character references in markup (e.g., &#x202A; for LRE), though they can propagate directionality to surrounding text, potentially causing unintended reordering. For security reasons, overrides like LRO and RLO are discouraged in untrusted content due to risks of visual spoofing.

Complementing the embeddings are implicit directional marks: the left-to-right mark (LRM, U+200E), an invisible zero-width character behaving as a strong LTR character to resolve ambiguities for neutrals like spaces or punctuation; the right-to-left mark (RLM, U+200F), behaving as a strong RTL character; and the Arabic Letter Mark (ALM, U+061C), a specialized RTL mark with the Arabic letter (AL) bidirectional class for proper presentation in Arabic contexts, such as controlling how adjacent numbers are treated. LRM and RLM were introduced in Unicode 1.1, while ALM was added in Unicode 6.3 to address specific needs in Arabic without affecting non-Arabic scripts. These marks have localized effects, making them suitable for fine-grained adjustments in mixed-script environments, such as ensuring correct adjacency in email addresses or filenames.

All these controls share the Cf general category and have specific bidirectional classes (e.g., LRE for the LRE character, R for RLM), with no significant property changes since Unicode 4.0. While effective for bidirectional text processing predating Unicode 6.3, embeddings and overrides are now considered legacy mechanisms, with modern applications favoring directional isolates to limit scope and reduce side effects on adjacent content. In HTML and CSS, equivalents like the dir attribute or the unicode-bidi: embed property often supplant direct use of these characters for better maintainability.
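
| Character | Code Point | Name | Bidirectional Class | Function Summary |
|---|---|---|---|---|
| LRE | U+202A | Left-to-Right Embedding | LRE | Embeds LTR segment |
| RLE | U+202B | Right-to-Left Embedding | RLE | Embeds RTL segment |
| LRO | U+202D | Left-to-Right Override | LRO | Overrides to LTR |
| RLO | U+202E | Right-to-Left Override | RLO | Overrides to RTL |
| PDF | U+202C | Pop Directional Formatting | PDF | Terminates embedding/override |
| LRM | U+200E | Left-to-Right Mark | L | Acts as an invisible strong LTR character |
| RLM | U+200F | Right-to-Left Mark | R | Acts as an invisible strong RTL character |
| ALM | U+061C | Arabic Letter Mark | AL | Acts as an invisible strong Arabic-letter (RTL) character |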
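
Since overrides in untrusted content carry a spoofing risk, one defensive measure is to flag any bidirectional control before displaying a string such as a filename. The following is a minimal sketch under that assumption; the set of code points and the name `find_bidi_controls` are choices made for this example, and whether to reject, strip, or merely highlight such characters is a policy decision left to the application.

```python
# Embeddings, overrides, PDF, isolates, and the three directional marks.
BIDI_CONTROLS = frozenset(
    [chr(cp) for cp in range(0x202A, 0x202F)]      # LRE, RLE, PDF, LRO, RLO
    + [chr(cp) for cp in range(0x2066, 0x206A)]    # LRI, RLI, FSI, PDI
    + ["\u200e", "\u200f", "\u061c"]               # LRM, RLM, ALM
)

def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each bidirectional control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

# A filename that hides its real extension behind an RLO.
suspicious = "invoice\u202Efdp.exe"
print(find_bidi_controls(suspicious))
```

A check like this reports where the controls are rather than silently removing them, which keeps legitimate uses (for example, an RLM in genuinely mixed-direction text) visible for review.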

Isolate and Embedding Controls

Isolate and embedding controls in Unicode provide a mechanism for scoped bidirectional formatting, introduced to address limitations in earlier directional controls by isolating directional effects to specific spans of text without influencing surrounding content. These controls, added in Unicode 6.3, enable safer handling of mixed left-to-right (LTR) and right-to-left (RTL) scripts, such as embedding English text within Arabic or Hebrew without causing unintended reordering elsewhere.

The Left-to-Right Isolate (LRI, U+2066) initiates an LTR isolated span, setting the embedding level to the next even level and treating subsequent text as LTR until terminated, while preventing bidirectional interactions with external text. The Right-to-Left Isolate (RLI, U+2067) functions similarly but for RTL spans, raising the level to the next odd value. Both belong to the Format (Cf) general category, have bidirectional classes LRI and RLI respectively, and a line break property of Combining Mark (CM). The First Strong Isolate (FSI, U+2068) starts an isolate whose direction is determined by the first strong directional character within it: RTL if that character has class R or AL, and LTR otherwise, including when no strong character is found. This is useful for automatically adapting to mixed or unknown content directions. Like LRI and RLI, FSI has the Cf general category, the FSI bidirectional class, and the CM line break property. The Pop Directional Isolate (PDI, U+2069) terminates the most recent isolate initiated by LRI, RLI, or FSI, popping it from the directional stack and restoring the previous embedding level, ensuring clean scoping even if isolates nest. PDI shares the Cf category and the CM line break property, and has the PDI bidirectional class.

These controls address flaws in legacy embedding codes like RLE and LRO by terminating at the matching PDI or at the end of the paragraph, preventing runaway directional effects that could disrupt large portions of text. They are particularly preferred in applications such as email clients and web forms, where isolates can wrap dynamic insertions to maintain readability. A PDI also closes any embeddings or overrides left open inside its isolate, so a stray RLE or LRO within the span cannot leak beyond it.

| Control | Code Point | Name | Bidirectional Class | Primary Use |
|---|---|---|---|---|
| LRI | U+2066 | Left-to-Right Isolate | LRI | Isolate LTR spans |
| RLI | U+2067 | Right-to-Left Isolate | RLI | Isolate RTL spans |
| FSI | U+2068 | First Strong Isolate | FSI | Isolate with auto-direction |
| PDI | U+2069 | Pop Directional Isolate | PDI | Terminate any isolate |

Adoption of these isolates is widespread in modern rendering engines, with support in major browsers since around 2013, enabling CSS equivalents such as unicode-bidi: isolate. No additional isolate controls have been introduced since Unicode 6.3, including up to Unicode 17.0, reflecting their sufficiency for scoped bidi needs.
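
For plain-text output where markup such as the dir attribute is unavailable, a dynamic value of unknown direction can be wrapped in FSI and PDI before being inserted into a template. The sketch below shows that pattern; `isolate` and `render_message` are hypothetical helper names, and the template text is invented for illustration.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(value: str) -> str:
    """Wrap value so its direction is resolved independently of its context."""
    return f"{FSI}{value}{PDI}"

def render_message(user_name: str, count: int) -> str:
    """Insert an isolated user-supplied name into an LTR template."""
    return f"{isolate(user_name)} uploaded {count} files"

# A Hebrew name followed by digits: without isolation, the digits and the
# neighbouring punctuation can be visually reordered next to the RTL run.
message = render_message("\u05D3\u05E0\u05D4", 3)
print([f"U+{ord(ch):04X}" for ch in message[:5]])
```

FSI is used rather than LRI or RLI because the direction of the inserted value is not known in advance; when the direction is known, the specific isolate can be substituted.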

Tagging and Annotation Characters

Language Tags

Language tag characters provide a mechanism for embedding language identifiers directly into Unicode plain text streams without relying on external markup languages. Introduced in Unicode 3.1, this feature consists of the base character U+E0001 LANGUAGE TAG followed by tag component characters in the range U+E0020 through U+E007E, which represent the printable ASCII characters shifted by 0xE0000 (for example, U+E0020 TAG SPACE corresponds to a space, and U+E0041 TAG LATIN CAPITAL LETTER A corresponds to 'A'). These tags support up to eight subtags conforming to BCP 47 identifiers, such as "en-Latn-US", enabling applications to switch fonts, scripts, or rendering behaviors based on the embedded information.

The structure begins with U+E0001 to initiate the tag, followed by the encoded subtags separated by U+E002D TAG HYPHEN-MINUS, with U+E007F CANCEL TAG available to reset the tagging state. This allows plain-text files or data streams to carry semantic cues, facilitating tasks like multilingual document processing without HTML or XML attributes.

However, the use of these characters for language tagging has been deprecated since Unicode 5.1 in 2008 due to their complexity, limited adoption, and the prevalence of more robust alternatives like the HTML lang attribute or richer markup systems. In Unicode 8.0 (2015), the tag component characters U+E0020–U+E007E were restored from deprecation for use in emoji tag sequences and potential future uses, while U+E0001 remains strongly discouraged. These characters are classified as format (Cf) with bidirectional class "BN" (Boundary Neutral), have no decomposition mappings, and have no impact on text directionality. As of Unicode 17.0, no new tag characters have been added since 8.0. For modern applications, alternatives such as BCP 47 tags in metadata or structured formats are preferred over these inline controls.
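
The offset relationship between ASCII and the tag characters is simple enough to show directly. The sketch below converts in both directions; the helper names are hypothetical, and the example is meant only to illustrate the encoding, since emitting U+E0001 language tags in new text is discouraged as described above.

```python
TAG_OFFSET = 0xE0000
LANGUAGE_TAG = "\U000E0001"  # deprecated introducer, shown for illustration only

def to_tag_chars(ascii_text: str) -> str:
    """Shift printable ASCII (0x20..0x7E) into the tag range U+E0020..U+E007E."""
    for ch in ascii_text:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
    return "".join(chr(TAG_OFFSET + ord(ch)) for ch in ascii_text)

def from_tag_chars(text: str) -> str:
    """Recover the ASCII form of any tag characters found in text."""
    return "".join(
        chr(ord(ch) - TAG_OFFSET)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E
    )

tagged = LANGUAGE_TAG + to_tag_chars("en-US")
print([f"U+{ord(ch):05X}" for ch in tagged])
print(from_tag_chars(tagged))
```

A decoder like `from_tag_chars` is also a convenient way to inspect text for hidden tag-character payloads, because these characters are default ignorable and normally render as nothing.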

Interlinear Annotation Sequences

Interlinear annotation sequences consist of three format characters designed to insert annotations between lines of base text in plain text environments, functioning as a mechanism for rich-text-like features without relying on markup languages. These characters, introduced in Unicode 3.0, delimit annotated content, such as ruby-style glosses or notes, by marking the start of the annotated text, the boundary between that text and its annotation, and the end of the annotation. They are particularly associated with East Asian typographic practices, where interlinear notes provide phonetic or explanatory glosses above or below the primary text.

The sequence begins with U+FFF9 INTERLINEAR ANNOTATION ANCHOR (Cf category, combining class 0), which marks the start of the text being annotated. That base text is followed by U+FFFA INTERLINEAR ANNOTATION SEPARATOR (Cf, combining class 0), which divides the base text from the annotation that follows it. The sequence concludes with U+FFFB INTERLINEAR ANNOTATION TERMINATOR (Cf, combining class 0), signaling the end of the annotation and resuming the main text flow, giving the overall pattern: U+FFF9, base text, U+FFFA, annotation, U+FFFB, next base text.

All three characters have a line break property of CM (Combining Mark), so that they attach to the preceding character per UAX #14 rule LB9. In bidirectional processing, they are classified as ON (Other Neutral) and resolved according to the surrounding embedding direction under UAX #9 rules N1 and N2; for proper rendering, the bidirectional algorithm is applied to the main text and to the annotation separately, with the annotation characters themselves treated as format controls.

Despite their utility for plain text annotation, these characters are rare in contemporary use due to their reliance on specific sender-receiver agreements and potential misinterpretation if filtered or unsupported. They have seen no significant changes or expansions since Unicode 3.0 and are generally discouraged for new implementations, with modern alternatives like HTML's <ruby> element preferred for structured documents. Applications typically manage their formatting via out-of-band information rather than exposing them to end-users.
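
A receiver that does not support interlinear annotations will often want to keep the base text and drop the annotation. The sketch below does this with a regular expression for the simple, non-nested case; the function name `strip_annotations` is hypothetical, and nested or malformed sequences are intentionally left untouched rather than guessed at.

```python
import re

ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"

# ANCHOR base SEPARATOR annotation TERMINATOR, with no nesting allowed.
_ANNOTATION = re.compile(
    f"{ANCHOR}([^{ANCHOR}{SEPARATOR}{TERMINATOR}]*)"
    f"{SEPARATOR}[^{ANCHOR}{TERMINATOR}]*{TERMINATOR}"
)

def strip_annotations(text: str) -> str:
    """Keep the annotated base text, drop the annotation and the delimiters."""
    return _ANNOTATION.sub(r"\1", text)

annotated = f"read {ANCHOR}kanji{SEPARATOR}kana gloss{TERMINATOR} aloud"
print(strip_annotations(annotated))
```

The pattern keeps the first capture group only, which mirrors the fallback behavior of showing the annotated text without its gloss.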

Variation Selectors

Variation selectors are nonspacing marks in Unicode that specify particular glyph variants or styles for a preceding base character, enabling precise control over rendering without altering the character's identity. These selectors form variation sequences, consisting of a graphic base character followed immediately by a variation selector, to disambiguate ambiguous glyph forms in contexts where font or style variations might otherwise lead to inconsistent presentation. They are essential for scripts and symbols requiring multiple compatible appearances, such as , ideographs, and , while ensuring compatibility with normalization processes. The initial set comprises Variation Selector-1 (VS1) through Variation Selector-16 (VS16), encoded at U+FE00 through U+FE0F with general category (nonspacing mark) and bidirectional class NSM (nonspacing mark). Introduced in Unicode 3.2, these selectors support standardized variation sequences for glyph choices like text versus forms. For instance, the sequence U+3046 (hiragana letter u) followed by U+FE00 selects a small variant. Variation selectors have no standalone visual representation and do not affect spacing or line breaking; they modify only the appearance of the associated base character and any subsequent combining marks. A supplementary set, Variation Selector-17 (VS17) through Variation Selector-256 (VS256), is encoded in the at U+ E0100 through U+E01EF, also with general category Mn and bidirectional class NSM. This block was introduced in Unicode 4.0 to accommodate additional needs, such as ideographic distinctions and regional indicator variants. The full range of 240 selectors (VS17 to VS256) became available from Unicode 4.0, enabling expanded use cases like distinguishing from non-emoji styles; for example, U+1F3F3 () followed by U+FE0F selects a text variant. 
Variation sequences are strictly defined in official Unicode data files to prevent arbitrary use. The primary list, StandardizedVariants.txt, contains over 1,000 entries as of Unicode 17.0, covering categories such as mathematical symbols, CJK ideographs, and script-specific forms like Egyptian hieroglyph rotations. Emoji variation sequences (using VS15 for text style and VS16 for emoji style) number 371 in emoji-variation-sequences.txt, while ideographic variations are registered in the Ideographic Variation Database per UTS #37. Variation selectors also occur inside zero-width joiner (ZWJ) sequences that build complex emoji, though ZWJ functionality is addressed separately.
Expansions to variation selector usage have continued in recent Unicode versions, with Unicode 15.0 and later adding sequences for emerging scripts and enhanced emoji variants. For example, new standardized sequences as of Unicode 17.0 (September 2025) include additional positional variants for East Asian punctuation, Myanmar text forms, and 42 rotations for Egyptian hieroglyphs, ensuring broader compatibility across fonts and platforms. Unrecognized sequences default to the base character's standard glyph, which keeps rendering robust.
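The structure of a variation sequence (base character, then selector) can be illustrated with a minimal Python sketch. The helper names is_variation_selector and strip_variation_selectors are hypothetical, chosen only for this illustration; the code covers the two ranges described above and deliberately ignores the Mongolian free variation selectors, which are not discussed here.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: requests emoji presentation


def is_variation_selector(ch: str) -> bool:
    """True for VS1-VS16 (U+FE00-U+FE0F) and VS17-VS256 (U+E0100-U+E01EF).

    Assumption: Mongolian free variation selectors are out of scope here.
    """
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text: str) -> str:
    """Drop every variation selector, leaving the base characters intact."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


# A variation sequence is simply the base character followed by the selector.
heart_text = "\u2764" + VS15   # HEAVY BLACK HEART, text style requested
heart_emoji = "\u2764" + VS16  # HEAVY BLACK HEART, emoji style requested

# The selectors carry the properties described in the text.
vs16_category = unicodedata.category(VS16)   # general category
vs16_bidi = unicodedata.bidirectional(VS16)  # bidirectional class
```

Whether a renderer honors the requested style depends on the font and platform, so stripping selectors as above changes only the presentation hint, never the base character.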

Visual Representations of Controls

Control Picture Symbols

The Control Pictures block (U+2400–U+243F) consists of graphic symbols designed to provide visible representations of the C0 control codes from the ASCII standard, along with a few additional symbols for space and delete, facilitating their depiction in text for educational, debugging, and other technical purposes. These characters, classified as symbols (general category So), were introduced in Unicode version 1.1 in 1993, and the core set has remained stable since then. The block includes 42 assigned code points out of 64, with the primary focus on rendering otherwise invisible functions as stylized icons, such as diagonal lettering or boxed forms, though designs vary across fonts.
The core of the block comprises 32 symbols corresponding to the C0 controls from U+0000 (NULL) to U+001F (UNIT SEPARATOR), mapped directly to U+2400 through U+241F. For instance, U+2400 (␀, SYMBOL FOR NULL) represents NUL, U+2401 (␁, SYMBOL FOR START OF HEADING) depicts SOH, U+2409 (␉, SYMBOL FOR HORIZONTAL TABULATION) illustrates HT, and U+241F (␟, SYMBOL FOR UNIT SEPARATOR) shows US. Additionally, U+2420 (␠, SYMBOL FOR SPACE) and U+2421 (␡, SYMBOL FOR DELETE) provide visible representations of the space character (U+0020) and the DEL control (U+007F), respectively, extending the utility beyond strict C0 codes. These symbols are not intended for everyday text rendering but serve as a standardized way to visualize non-printing characters without altering their underlying semantics.
Proposals to extend the block with equivalent symbols for the C1 controls (U+0080–U+009F) have been submitted more than once but have not been adopted, which keeps the block stable and avoids redundancy with existing control mechanisms. The block has thus remained focused on C0 and select legacy controls, ensuring compatibility with ASCII-derived systems.
In practice, these symbols are employed in software tools such as hex editors, where they replace raw control bytes with readable icons to aid file inspection. They also appear in protocol analyzers to denote control sequences in data streams, improving clarity during network diagnostics. Font support for the block varies, with comprehensive coverage in fonts like Segoe UI Symbol, which includes glyphs for the assigned characters to ensure consistent rendering in Windows environments.
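Because the 32 C0 controls map to U+2400–U+241F at a fixed offset, the substitution a hex editor or log viewer might perform can be sketched in a few lines of Python. The function name to_control_pictures and its show_space parameter are hypothetical, introduced only for this illustration.

```python
def to_control_pictures(text: str, show_space: bool = True) -> str:
    """Replace C0 controls, DEL, and optionally space with Control Pictures."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x1F:
            # C0 controls U+0000-U+001F sit at a fixed offset of 0x2400.
            out.append(chr(0x2400 + cp))
        elif cp == 0x20 and show_space:
            out.append("\u2420")  # SYMBOL FOR SPACE
        elif cp == 0x7F:
            out.append("\u2421")  # SYMBOL FOR DELETE
        else:
            out.append(ch)
    return "".join(out)


# Example: make a tab and a line feed visible in a short string.
visible = to_control_pictures("a\tb\n", show_space=False)
```

The result is meant for display only; the pictures are ordinary symbols, so the transformed string no longer contains the original control codes.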

Usage in Debugging and Documentation

In software debugging, Unicode control characters are frequently made visible in hexadecimal dumps using escape sequences, such as \u200B for the zero-width space (U+200B), to identify and isolate non-printable elements in text streams. Network analysis tools like Wireshark assist in examining such sequences by decoding protocol payloads that may include directional formatting controls, revealing potential rendering issues in transmitted data.
For documentation purposes, the Unicode charts employ the Control Pictures block (U+2400–U+243F) to provide graphical symbols representing otherwise invisible control characters, facilitating clearer explanations in technical references. Similarly, specifications such as Unicode Standard Annex #9 (UAX #9) use abbreviations like LRE (U+202A) and RLE (U+202B) to denote bidirectional formatting controls, so that their behavior can be described without ambiguity.
In standards development, protocols like HTTP preserve specific control characters, including the line separator (LS, U+2028) and paragraph separator (PS, U+2029), to maintain structural integrity in transmitted text, as discussed in deliberations on text interchange. Normalization testing, such as under Form C (NFC), confirms that control characters remain unchanged, as they fall outside the decomposition and composition rules applied to combining sequences.
Educationally, control characters like the zero-width joiner (ZWJ, U+200D) are taught as essential for constructing complex emoji, such as gendered profession emoji (e.g., woman + ZWJ + microscope for woman scientist), illustrating how invisible operators enable semantic text encoding. The stability policies of the Unicode Standard guarantee that the set of code points with general category Control (Cc) never changes, whereas new Format (Cf) characters can still be encoded; this holds through version 17.0 (2025) and promotes backward compatibility in educational materials and implementations.
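The escape notation, the NFC behavior, and the ZWJ construction described above can be checked with Python's standard unicodedata module. The following is a minimal sketch; the function name escape_invisibles is hypothetical, and the choice to escape the categories Cc, Cf, Zl, and Zp is an assumption made for this illustration rather than a rule from any specification.

```python
import unicodedata

# Assumption: these general categories are the ones treated as "invisible".
INVISIBLE_CATEGORIES = ("Cc", "Cf", "Zl", "Zp")


def escape_invisibles(text: str) -> str:
    """Rewrite invisible characters as \\uXXXX or \\UXXXXXXXX escapes."""
    out = []
    for ch in text:
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            cp = ord(ch)
            out.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
        else:
            out.append(ch)
    return "".join(out)


ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Woman scientist: WOMAN + ZWJ + MICROSCOPE.
woman_scientist = "\U0001F469" + ZWJ + "\U0001F52C"

# A sample containing a zero-width space and a line separator.
sample = "a\u200Bb\u2028c"
escaped = escape_invisibles(sample)

# NFC leaves these characters untouched, since they have no decompositions.
nfc_unchanged = unicodedata.normalize("NFC", sample) == sample
```

Printing escaped instead of sample makes the hidden characters reviewable in a terminal or log, at the cost of no longer being the literal text.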
The invisible nature of these characters often causes trouble in copy-paste operations, where they can silently carry unintended formatting or security vulnerabilities into code. Unicode-aware editors, such as Visual Studio Code, mitigate this by highlighting such characters through settings like "editor.unicodeHighlight.invisibleCharacters" or extensions that render them visibly.
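Outside an editor, a similar check can be run as a small script that reports where hidden characters occur. This is a sketch under stated assumptions: the function name find_hidden_characters is hypothetical, tab and carriage return are treated as harmless, and the flagged categories (Cf, Zl, Zp, and remaining Cc) are a choice made for this illustration, not the rule any particular editor applies.

```python
import unicodedata


def find_hidden_characters(source: str) -> list[tuple[int, int, str, str]]:
    """Return (line, column, code point, name) for each hidden character.

    Lines and columns are 1-based. Columns count code points, not bytes.
    """
    findings = []
    # Split on "\n" only: str.splitlines() would also break on U+2028,
    # U+2029, and U+0085, which would skew the reported line numbers.
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            category = unicodedata.category(ch)
            hidden = category in ("Cf", "Zl", "Zp") or (
                category == "Cc" and ch not in "\t\r"
            )
            if hidden:
                # Cc characters have no formal name, so fall back to a label.
                name = unicodedata.name(ch, "<control>")
                findings.append((line_no, col, f"U+{ord(ch):04X}", name))
    return findings


# Example: a right-to-left override pasted into the second line of a snippet.
report = find_hidden_characters("x = 1\nif a\u202Eb:\n")
```

A check like this could run before committing code, complementing the editor highlighting rather than replacing it.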
