Unicode control characters

Unicode control characters are a category of non-printing characters in the Unicode Standard, Version 17.0, designed to manage text processing, layout, and formatting without visible rendering. They encompass 65 legacy control codes (the C0 and C1 sets) allocated at code points U+0000–U+001F, U+007F, and U+0080–U+009F for compatibility with ISO/IEC 2022 and earlier standards like ASCII, as well as modern format characters that influence bidirectional ordering, line breaking, and glyph variation.^[1] These characters ensure consistent interchange of text across systems; the semantics of the legacy codes are largely determined by applications rather than by Unicode itself, which preserves their original meanings during encoding transformations.^[2]

The C0 controls, originating from the 7-bit ASCII set, include fundamental functions such as U+0009 (HORIZONTAL TAB, HT) for spacing and U+000A (LINE FEED, LF) paired with U+000D (CARRIAGE RETURN, CR) for newline operations.^[3] The C1 controls, an extension from 8-bit ISO standards, provide additional capabilities like U+0085 (NEXT LINE, NEL) for line termination and U+008D (REVERSE LINE FEED, RI) for cursor movement in terminal emulations.^[2]

Beyond these legacy codes, Unicode defines format characters as invisible operators that affect text layout without altering content. They are grouped into layout controls (e.g., U+200B ZERO WIDTH SPACE for word boundaries and U+2028 LINE SEPARATOR for explicit line breaks), join controls (e.g., U+200D ZERO WIDTH JOINER for cursive connections in scripts like Arabic), and the closely related variation selectors (e.g., U+FE0F VARIATION SELECTOR-16 to request emoji-style glyphs).^[4]^[5] These characters play a critical role in internationalization, enabling robust handling of diverse scripts and text directions, such as right-to-left ordering via the U+202A–U+202E bidirectional controls.^[4] Deprecated format characters, like U+206A INHIBIT SYMMETRIC SWAPPING, are retained for legacy support but discouraged in new implementations.^[6]

Overall, control characters facilitate precise text manipulation in computing environments, from document rendering to data serialization, underscoring Unicode's foundation in practical interoperability.^[1]

Overview and Classification

Definition and Role in Text Processing

Unicode control characters are non-printable characters defined in the Unicode Standard that manage the processing, formatting, and interpretation of text streams, distinguishing them from printable glyphs that represent visible symbols.^[7] These characters belong primarily to the general categories Cc (Other, Control) and Cf (Other, Format), where Cc encompasses legacy control codes for basic text control, and Cf includes format effectors that influence rendering without visible output.^[7] Assigned code points span the full Unicode range from U+0000 to U+10FFFF, though most control characters reside in the Basic Multilingual Plane (U+0000 to U+FFFF).^[7] The origins of Unicode control characters trace back to the ASCII standard published in 1963, which introduced 33 control codes for hardware and communication control, later extended in international standards like ISO/IEC 646 (first published 1967) and ISO 8859 standards to include additional sets like C1 controls.^[8] Unicode unifies and expands these legacy codes into a global framework, preserving their semantics while adapting them for multilingual text handling across diverse scripts and devices. In text processing, these characters enable applications and rendering engines—such as those in HTML/CSS or font systems—to handle line breaks, directionality, glyph selection, and semantic annotations without altering the visible content.^[1] For instance, they support complex scripts by facilitating bidirectional text in languages like Arabic and Hebrew through the Unicode Bidirectional Algorithm, and they allow emoji variations via variation selectors that specify presentation styles.^[9] The Unicode Standard, version 17.0 (released September 9, 2025), details their semantics and processing rules in Chapter 23, emphasizing their role in ensuring consistent text interpretation across platforms.^[10]

Unicode General Categories for Control Characters

Unicode assigns a general category to every character in its repertoire as part of the Unicode Character Database, providing a fundamental classification based on usage for purposes such as text processing and rendering. Control characters are primarily grouped into the "Cc" (Other, Control) and "Cf" (Other, Format) categories. The Cc category consists of 65 characters, exemplified by the NULL control U+0000, which represent non-printing signaling codes derived from legacy standards like ASCII and ISO 8859-1. These serve roles in data transmission and device control without visual representation. The Cf category includes 161 characters, such as the Zero Width Space U+200B, functioning as format effectors that alter text layout, bidirectional direction, or character joining without contributing visible glyphs.^[11]

Characters in the Cc category originate from Unicode version 1.0, with the full set stabilized by version 1.1 and no subsequent additions. Most possess the bidirectional class "BN" (Boundary Neutral), so they neither initiate nor alter text directionality in the bidirectional algorithm. The exceptions are the whitespace-like and separator controls: for example, U+0009 has class "S" (Segment Separator), U+000A and U+000D have class "B" (Paragraph Separator), and U+000C has class "WS" (Whitespace). Decomposition mappings are absent for Cc characters, preserving their atomic control nature. Most exhibit the line break class "CM" (Combining Mark), which prohibits breaks before them to maintain attachment to preceding content, though specific ones differ: carriage return U+000D uses "CR" and line feed U+000A uses "LF", both of which force a break.^[12]^[13]

In contrast, Cf characters are typically invisible in rendering and specialize in modifying interpretive properties of adjacent text, such as enabling ligature joining or marking directional embeddings and isolates. They often share the "BN" bidirectional class but can have dedicated classes like "PDF" (Pop Directional Format) for scope termination. Examples include the soft hyphen U+00AD (SHY), which marks discretionary hyphenation points without forcing breaks, so that line breaking algorithms show a hyphen only when a line actually ends there. The Cf category has expanded over versions to accommodate complex script requirements.^[12]^[14]

Control categories Cc and Cf differ markedly from separator categories like Zs (Space Separator), Zl (Line Separator), and Zp (Paragraph Separator), which actively contribute to whitespace, line advancement, or paragraph boundaries. Controls neither add visual width nor directly form base units in grapheme cluster segmentation; instead, they invisibly guide processing rules for layout, normalization, and collation without participating in visible text flow.^[11]^[15]

As of Unicode 17.0, the combined Cc and Cf categories total 226 characters, covering essential legacy controls and evolving format needs. While Cc has remained fixed at 65 since early versions, Cf continues to grow, as seen with additions like the Arabic Letter Mark U+061C, introduced in Unicode 6.3 as an invisible strong character of the Arabic-letter bidirectional class.^[13]
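The category counts above can be checked against whatever Unicode data a given runtime ships. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `count_control_categories` is illustrative, and the Cf figure it reports depends on the Unicode version bundled with the interpreter, so it may not match the number quoted for Unicode 17.0.

```python
import sys
import unicodedata
from collections import Counter


def count_control_categories() -> Counter:
    """Count code points whose General_Category is Cc or Cf."""
    counts: Counter = Counter()
    for cp in range(sys.maxunicode + 1):
        category = unicodedata.category(chr(cp))
        if category in ("Cc", "Cf"):
            counts[category] += 1
    return counts


counts = count_control_categories()
# The Cc count is stable across versions; the Cf count varies with
# unicodedata.unidata_version.
print(unicodedata.unidata_version, counts["Cc"], counts["Cf"])
```

Because Cc has been frozen since the earliest versions, only the Cf result should differ between interpreters.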

Legacy Control Codes from ASCII and ISO Standards

C0 Control Codes

The C0 control codes, together with DELETE, comprise the 33 non-printable control characters of 7-bit ASCII, which was first published in 1963 and later internationalized as ISO 646. They were incorporated into Unicode version 1.0 as code points U+0000 through U+001F (the C0 set proper), along with U+007F for the Delete character.^[16]^[17] These codes were designed primarily for controlling data transmission, mechanical printing devices, and early computer terminals, providing in-band signaling without adding visible symbols to the text stream.^[18] In Unicode, they retain their historical semantics but are treated as implementation-defined, allowing applications to interpret them based on context rather than mandating specific behaviors.^[19] The following table enumerates the C0 control codes, including their Unicode code points, official names (with common aliases from ISO/IEC 6429:1992), and traditional functions:

Code Point	Name (Aliases)	Traditional Function
U+0000	NULL (NUL)	Serves as a media or time-fill character; often ignored in data streams without affecting content.^[17]
U+0001	START OF HEADING (SOH)	Marks the beginning of a message heading in data transmission protocols.^[18]
U+0002	START OF TEXT (STX)	Indicates the start of the main text body, terminating any preceding heading.^[17]
U+0003	END OF TEXT (ETX)	Signals the end of the text portion, often prompting a response in communications.^[18]
U+0004	END OF TRANSMISSION (EOT)	Denotes the conclusion of one or more transmitted texts or files.^[17]
U+0005	ENQUIRY (ENQ)	Requests identification or status from a receiving station in data links.^[17]
U+0006	ACKNOWLEDGE (ACK)	Confirms successful receipt of data as an affirmative response.^[17]
U+0007	BELL (BEL)	Triggers an audible alert or attention signal on terminals and printers.^[18]
U+0008	BACKSPACE (BS)	Moves the active position back one character space for overstriking or cursor control.^[17]
U+0009	CHARACTER TABULATION (HT)	Advances the position to the next horizontal tab stop for text alignment.^[18]
U+000A	LINE FEED (LF, NL, EOL)	Advances to the next line, maintaining the horizontal position.^[17]
U+000B	LINE TABULATION (VT)	Moves to the next predetermined vertical line position.^[18]
U+000C	FORM FEED (FF)	Ejects the current page or form on printers, advancing to the next sheet.^[17]
U+000D	CARRIAGE RETURN (CR)	Returns the position to the start of the current line.^[18]
U+000E	SHIFT OUT (SO)	Invokes an extended graphic character set (e.g., in 8-bit environments as LOCKING-SHIFT ONE).^[19]
U+000F	SHIFT IN (SI)	Reverts to the standard character set (e.g., LOCKING-SHIFT ZERO in 8-bit modes).^[19]
U+0010	DATA LINK ESCAPE (DLE)	Modifies the interpretation of subsequent characters for control purposes in links.^[17]
U+0011	DEVICE CONTROL ONE (DC1)	Activates or configures a device, or resumes transmission (e.g., XON flow control).^[17]
U+0012	DEVICE CONTROL TWO (DC2)	Triggers a specific device operation or mode setting.^[17]
U+0013	DEVICE CONTROL THREE (DC3)	Pauses or stops a device (e.g., XOFF flow control).^[17]
U+0014	DEVICE CONTROL FOUR (DC4)	Interrupts or halts a device function.^[17]
U+0015	NEGATIVE ACKNOWLEDGE (NAK)	Indicates a negative response or error in data receipt.^[17]
U+0016	SYNCHRONOUS IDLE (SYN)	Maintains timing synchronization in synchronous data transmission.^[17]
U+0017	END OF TRANSMISSION BLOCK (ETB)	Marks the end of a logical data block.^[17]
U+0018	CANCEL (CAN)	Aborts the preceding data, instructing the receiver to ignore it.^[17]
U+0019	END OF MEDIUM (EM)	Signals the physical end of a storage medium or data section.^[17]
U+001A	SUBSTITUTE (SUB)	Replaces erroneous or invalid characters.^[19]
U+001B	ESCAPE (ESC)	Introduces escape sequences for additional control functions.^[17]
U+001C	FILE SEPARATOR (FS, IS4)	Delimits files in a hierarchical data structure.^[17]
U+001D	GROUP SEPARATOR (GS, IS3)	Separates groups within a file.^[17]
U+001E	RECORD SEPARATOR (RS, IS2)	Separates records within a group.^[17]
U+001F	UNIT SEPARATOR (US, IS1)	Separates units or fields within a record.^[17]
U+007F	DELETE (DEL)	Originally used to obliterate characters; in Unicode, treated as a control for padding or erasure.^[16]

These codes found widespread traditional applications in terminal control, such as using LF and CR to advance to a new line (often combined as CRLF, U+000D U+000A, for Windows-style line breaks), HT for columnar formatting, and BEL for alerts.^[18] In printing, FF ejected pages, while BS enabled overprinting for effects like bold text or accents.^[18] For data communications, sequences like SOH, STX, ETX, and EOT provided framing to structure messages and ensure reliable transmission over serial links.^[18] In Unicode, all C0 control codes belong to the Cc (Other, Control) general category, as defined in the Unicode Character Database.^[7] They have no canonical or compatibility decompositions, remaining atomic units that applications must handle explicitly.^[19] While their core roles in text processing persist—such as line breaks and tabs—their exact rendering and effects are left to the discretion of implementations, ensuring compatibility with legacy systems.^[19]
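As an illustration of how the information separators in the table nest, the sketch below serializes rows of fields using US (U+001F) between fields and RS (U+001E) between records. The function names are hypothetical, and the scheme assumes the field data never contains these two controls; it is one possible convention rather than a standardized file format.

```python
import unicodedata

# Information separators from the C0 set, from widest to narrowest scope.
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"


def encode_records(records: list[list[str]]) -> str:
    """Join fields with US and records with RS."""
    return RS.join(US.join(fields) for fields in records)


def decode_records(blob: str) -> list[list[str]]:
    """Inverse of encode_records; an empty string decodes to no records."""
    if not blob:
        return []
    return [record.split(US) for record in blob.split(RS)]


rows = [["id", "name"], ["1", "Ada, Countess"], ["2", "Grace"]]
blob = encode_records(rows)
print(decode_records(blob) == rows)                      # True
print({unicodedata.category(c) for c in FS + GS + RS + US})  # {'Cc'}
```

Unlike comma-separated text, the delimiters here cannot collide with ordinary punctuation in the data, which is the original motivation for dedicated separator controls.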

C1 Control Codes

The C1 control codes comprise 32 characters standardized in ISO/IEC 6429:1992, which are mapped in Unicode to the code points U+0080 through U+009F within the C1 Controls and Latin-1 Supplement block. These codes extend the functionality of the C0 controls by providing additional mechanisms for text formatting, device management, and character encoding shifts in 8-bit environments, originally designed to support display and printing devices like CRT terminals and printers.^[20]^[21] In Unicode, all C1 codes belong to the Cc (Control) general category and are preserved primarily for round-trip compatibility with legacy encodings such as ISO/IEC 8859-1:1987, ensuring that data interchange with older systems does not alter control semantics. The following table enumerates the C1 control codes, including their Unicode code points, standard abbreviations, and primary names as defined in ISO/IEC 6429:1992:

Code Point	Abbreviation	Name
U+0080	PAD	Padding Character
U+0081	HOP	High Octet Preset
U+0082	BPH	Break Permitted Here
U+0083	NBH	No Break Here
U+0084	IND	Index
U+0085	NEL	Next Line
U+0086	SSA	Start of Selected Area
U+0087	ESA	End of Selected Area
U+0088	HTS	Horizontal Tabulation Set
U+0089	HTJ	Horizontal Tabulation with Justification
U+008A	VTS	Vertical Tabulation Set
U+008B	PLD	Partial Line Down (Forward)
U+008C	PLU	Partial Line Up (Backward)
U+008D	RI	Reverse Index
U+008E	SS2	Single-Shift Two
U+008F	SS3	Single-Shift Three
U+0090	DCS	Device Control String
U+0091	PU1	Private Use One
U+0092	PU2	Private Use Two
U+0093	STS	Set Transmit State
U+0094	CCH	Cancel Character
U+0095	MW	Message Waiting
U+0096	SPA	Start of Protected Area
U+0097	EPA	End of Protected Area
U+0098	SOS	Start of String
U+0099	SGCI	Single Graphic Character Introducer
U+009A	SCI	Single Character Introducer
U+009B	CSI	Control Sequence Introducer
U+009C	ST	String Terminator
U+009D	OSC	Operating System Command
U+009E	PM	Privacy Message
U+009F	APC	Application Program Command

Originally developed in the late 1970s as part of standards like ECMA-48 and ISO 6429, the C1 codes were intended for applications such as videotex systems (e.g., Prestel) and terminal protocols (e.g., VT100), where they enabled advanced features like area selection (SSA/ESA), tabulation setup (HTS/VTS), and escape sequence initiation via CSI (U+009B) for commands like cursor positioning or color changes.^[22] In 7-bit transmission environments, C1 codes could be represented using the ESC (U+001B) character followed by a specific octet, allowing compatibility with ASCII-based systems. Although many C1 codes are now obsolete in contemporary text processing and web standards due to the adoption of higher-level formatting mechanisms, they remain relevant in legacy systems; for instance, NEL (U+0085) serves as a line-breaking control in EBCDIC encodings, moving the active position to the first character of the next line (with semantics equivalent to carriage return followed by line feed).^[23] As 8-bit extensions to the C0 controls from ASCII and related 7-bit standards, the C1 set enhances control over text layout and device behavior in environments requiring more granular manipulation than basic transmission protocols.
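The 7-bit representation mentioned above follows a fixed arithmetic rule: each C1 code point in 0x80–0x9F corresponds to ESC followed by the byte 0x40 lower, so CSI (U+009B) becomes ESC `[` and NEL (U+0085) becomes ESC `E`. A minimal sketch, with illustrative function names:

```python
ESC = "\x1b"


def c1_to_7bit(ch: str) -> str:
    """Return the two-character ESC sequence equivalent to a C1 control."""
    cp = ord(ch)
    if not 0x80 <= cp <= 0x9F:
        raise ValueError(f"U+{cp:04X} is not a C1 control")
    return ESC + chr(cp - 0x40)


def seven_bit_to_c1(seq: str) -> str:
    """Return the C1 control for an ESC sequence whose final byte is 0x40-0x5F."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= ord(seq[1]) <= 0x5F:
        raise ValueError("not a 7-bit C1 escape sequence")
    return chr(ord(seq[1]) + 0x40)


print(repr(c1_to_7bit("\x9b")))       # '\x1b['  (CSI)
print(repr(c1_to_7bit("\x85")))       # '\x1bE'  (NEL)
print(seven_bit_to_c1("\x1b]") == "\x9d")  # True (OSC)
```

This is why terminal control sequences are usually written as `ESC [ ...` in practice: the two-byte form survives 7-bit channels and does not clash with UTF-8 continuation bytes.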

Unicode-Introduced Separators and Invisible Formatters

Line and Paragraph Separators

Unicode introduces two dedicated characters for explicit line and paragraph separation in plain text: the Line Separator (LS, U+2028) and the Paragraph Separator (PS, U+2029). These characters were defined in the Unicode Standard version 1.0 to provide unambiguous structural breaks independent of platform-specific conventions.^[24] LS belongs to the General Category Zl (Separator, Line), while PS is in category Zp (Separator, Paragraph).^[25] Both have zero width and are not rendered visibly, serving instead as invisible controls that influence text layout and processing.^[1]

In terms of properties, LS creates a mandatory line break (Line Break class BK) without initiating a new paragraph, allowing the subsequent text to continue with the same formatting, such as indentation or alignment, akin to an HTML <br> element.^[26] PS also enforces a mandatory line break but resets paragraph-level formatting, applying interparagraph spacing, margins, and direction changes as needed.^[27] Their bidirectional classes differ: PS has class B (Paragraph Separator), which ends the paragraph for the purposes of the Bidirectional Algorithm, while LS has class WS (Whitespace) and so breaks the line without starting a new bidirectional paragraph.^[28] No additional characters have been added to these categories, so each separator remains the sole member of its category.^[7]

These separators are used in applications requiring semantic text structure, such as XML and HTML documents where they can be inserted via numeric character references like &#x2028; for LS, ensuring consistent rendering across systems without relying on ambiguous newline sequences.^[29] In word processors and international text editors, they support precise line wrapping for multilingual content, particularly in scripts with complex layout needs.^[30] Where plain text needs an explicit semantic break, they are preferred over legacy sequences like CRLF because LS and PS carry explicit meaning preserved across processes, whereas CR (U+000D) and LF (U+000A), both categorized as Cc (Other, Control), have implementation-defined behaviors that vary by platform.^[31]

LS and PS remain stable under Unicode normalization forms such as NFC and NFD, with no decomposition or composition mappings, ensuring they are not altered during canonical equivalence checks.^[32] This stability enables round-trip compatibility with legacy newline representations through mapping guidelines, such as those in Unicode Technical Report #13, while providing superior handling for diverse scripts like CJK, where consistent break points aid in vertical and horizontal layout algorithms.^[33]^[12]
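To make the distinction concrete, the sketch below inspects the properties of LS and PS with Python's `unicodedata` module and shows that `str.splitlines()` treats both as line boundaries alongside CRLF. The helper `split_paragraphs` is a hypothetical name for one way to recover the two-level structure; it ignores legacy newlines and is not a full implementation of the Unicode newline guidelines.

```python
import unicodedata

LS, PS = "\u2028", "\u2029"


def split_paragraphs(text: str) -> list[list[str]]:
    """Split on PS into paragraphs, then on LS into lines within each."""
    return [paragraph.split(LS) for paragraph in text.split(PS)]


sample = "first line" + LS + "second line" + PS + "new paragraph"

print(unicodedata.category(LS), unicodedata.category(PS))            # Zl Zp
print(unicodedata.bidirectional(LS), unicodedata.bidirectional(PS))  # WS B
print(split_paragraphs(sample))
# [['first line', 'second line'], ['new paragraph']]

# str.splitlines() recognizes LS and PS as well as CR, LF, and CRLF.
print((sample + "\r\nlast").splitlines())
# ['first line', 'second line', 'new paragraph', 'last']

# Normalization leaves both separators untouched.
print(unicodedata.normalize("NFC", sample) == sample)  # True
```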

Zero-Width Spaces, Joiners, and Invisible Operators

Unicode's zero-width spaces, joiners, and invisible operators are format control characters (General_Category=Cf) that provide fine-grained control over text rendering without visible width, influencing glyph shaping, line breaking, and mathematical notation. These characters, primarily in the General Punctuation block (U+2000–U+206F), enable precise adjustments in scripts requiring complex joining behaviors or implicit operations, such as preventing or forcing ligatures in cursive scripts or inserting invisible multipliers in equations. They are default ignorable (Default_Ignorable_Code_Point=Yes) and typically classified as boundary neutral (Bidi_Class=BN) for bidirectional processing, ensuring they do not disrupt layout unless specified.^[7]

The zero-width non-joiner (ZWNJ, U+200C), introduced in Unicode 1.1, prevents the formation of ligatures or cursive connections between adjacent characters in scripts such as Arabic, Syriac, and Indic scripts like Devanagari. For instance, in Arabic, inserting ZWNJ between ل (lam) and ا (alif) inhibits the default ligature لَا, rendering the two letters separately. This character has a Joining_Type of Non_Joining and a Line_Break property of CM (Combining Mark), so it attaches to the preceding character for line breaking while signaling non-connection to OpenType shaping engines.^[1]^[12]

Complementing ZWNJ, the zero-width joiner (ZWJ, U+200D), also from Unicode 1.1, explicitly requests joining or ligature formation where it might not occur by default, such as in emoji sequences or half-consonant forms in Indic scripts. A prominent use is in family emoji compositions, like 👨‍👩‍👧 (man, ZWJ, woman, ZWJ, girl), where ZWJ binds the elements into a single glyph if supported by the font. ZWJ shares the Cf category and BN bidi class and has its own Line_Break class, ZWJ; shaping engines act on it through OpenType GSUB tables for glyph substitution during rendering.^[1]^[34]

The zero-width space (ZWSP, U+200B), also present since Unicode 1.1, serves as an invisible break opportunity, similar to a soft hyphen but without a hyphen mark, allowing line breaks in languages without explicit word spaces, such as Thai or Japanese. It has a Line_Break property of ZW, which provides an optional line break opportunity. Unlike visible spaces, ZWSP has no width of its own.^[1]^[12]

For non-breaking scenarios, the word joiner (WJ, U+2060), introduced in Unicode 3.2, acts as a zero-width no-break space, preventing a line break between the characters on either side of it, for example within compound words or phrases. It takes over the word-joining function formerly assigned to U+FEFF, whose use for that purpose is discouraged. WJ has a Line_Break property of WJ, explicitly prohibiting breaks, and is particularly useful for keeping joined forms together in plain text without affecting width or spacing.^[1]^[12]

In mathematical notation, invisible operators provide implicit symbols without visible rendering: U+2061 FUNCTION APPLICATION indicates the application of a function to its argument, as in f(x); U+2062 INVISIBLE TIMES marks implied multiplication, as between 2 and x in 2x; U+2063 INVISIBLE SEPARATOR acts as an invisible comma, for example between the indices i and j in a subscript written ij; and U+2064 INVISIBLE PLUS marks implied addition, as in the mixed number 3½. The first three were added in Unicode 3.2 and INVISIBLE PLUS in Unicode 5.1. These characters are all Cf with Line_Break=AL (Alphabetic), and they support semantic markup in tools like LaTeX or MathML, ensuring correct interpretation by assistive technologies.^[35]^[36]

Specific to the Mongolian script, the Mongolian vowel separator (MVS, U+180E), added in Unicode 3.0, is placed between a word-final vowel (a or e) and the preceding letter; it produces a narrow gap and selects the appropriate shapes of the letters on either side in traditional Mongolian typesetting. Originally classified as a space (Zs), it has been Cf since Unicode 6.3. It has Line_Break=GL (Glue, prohibiting breaks on either side) and integrates with OpenType features for accurate glyph placement in vertical writing modes. Related spacing characters like the hair space (U+200A, Zs category) offer a very thin separator for fine typography, though unlike true zero-width operators it has visible width. No new Cf characters for joining or invisible operations have been added as of Unicode 17.0. All these characters leverage OpenType shaping for script-specific behaviors, ensuring consistent rendering across fonts and systems.^[7]^[1]
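Because these characters are invisible, they are easy to overlook when comparing or validating strings. The sketch below shows that a ZWJ emoji sequence is several code points long and offers one hedged way to drop format characters before comparison. The function name `strip_format_characters` and its default of keeping ZWJ and ZWNJ (since removing them can change meaning in emoji and in scripts such as Persian or Devanagari) are assumptions of this example, not a recommendation from the Standard.

```python
import unicodedata

ZWSP, ZWNJ, ZWJ, WJ = "\u200b", "\u200c", "\u200d", "\u2060"

# Man, ZWJ, woman, ZWJ, girl: rendered as one glyph where the font supports it.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"


def strip_format_characters(
    text: str, keep: frozenset = frozenset({ZWNJ, ZWJ})
) -> str:
    """Remove General_Category=Cf characters except those listed in keep."""
    return "".join(
        ch for ch in text if ch in keep or unicodedata.category(ch) != "Cf"
    )


print(len(family))                                            # 5
print(strip_format_characters("pass" + ZWSP + "word"))        # password
print(strip_format_characters(family) == family)              # True
print([unicodedata.category(c) for c in (ZWSP, ZWNJ, ZWJ, WJ)])
# ['Cf', 'Cf', 'Cf', 'Cf']
```

Passing `keep=frozenset()` removes the joiners too, which may be appropriate for identifier comparison but not for display text.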

Directional and Layout Control Characters

Bidirectional Formatting Controls

Bidirectional formatting controls are a set of Unicode characters designed to explicitly manage text directionality in documents containing mixed left-to-right (LTR) and right-to-left (RTL) scripts, such as English interspersed with Arabic or Hebrew. These controls allow authors to embed directional segments or override the default bidirectional behavior without relying solely on the implicit Unicode Bidirectional Algorithm described in UAX #9.^[37] Introduced primarily in early Unicode versions, they are classified under the Format (Cf) general category and function as invisible characters that influence layout during rendering.^[38]

The core embedding controls include the Left-to-Right Embedding (LRE, U+202A), which treats subsequent text as embedded LTR by increasing the embedding level to the next even number greater than the current level; the Right-to-Left Embedding (RLE, U+202B), which embeds RTL text by setting the level to the next odd number; the Left-to-Right Override (LRO, U+202D), forcing strong LTR treatment regardless of character types; and the Right-to-Left Override (RLO, U+202E), enforcing strong RTL.^[39] These were added in Unicode 1.1 and operate in a stack-based manner, with their effects nested and terminated by the Pop Directional Formatting (PDF, U+202C), which restores the previous embedding level.^[38] In practice, they enable precise control in plain text files or HTML via character references (e.g., &#x202A; for LRE), though they can propagate directionality to surrounding text, potentially causing unintended reordering.^[40] For security reasons, overrides like LRO and RLO are discouraged in untrusted content due to risks of visual spoofing.^[41]

Complementing the embeddings are implicit directional marks: the Left-to-Right Mark (LRM, U+200E), an invisible zero-width character behaving as a strong LTR character to resolve ambiguities for neutrals like spaces or punctuation; the Right-to-Left Mark (RLM, U+200F), which behaves as a strong RTL character in the same way; and the Arabic Letter Mark (ALM, U+061C), a specialized RTL mark with Arabic letter (AL) bidirectional class for proper presentation in Arabic contexts, such as isolating numbers from following RTL text.^[42] LRM and RLM were introduced in Unicode 1.1, while ALM was added in Unicode 6.3 to address specific needs in Arabic typography without affecting non-Arabic scripts.^[43]^[44] These marks have localized effects, making them suitable for fine-grained adjustments in mixed-script environments, such as ensuring correct adjacency in email addresses or filenames.^[37]

All these controls share the Cf general category and have specific bidirectional classes (e.g., LRE for that embedding, R for RLM), with no significant property changes since Unicode 4.0.^[7] While effective for bidirectional text processing predating Unicode 6.3, embeddings and overrides are now considered legacy mechanisms, with modern applications favoring directional isolates to limit scope and reduce side effects on adjacent content.^[45] In HTML and CSS, equivalents like the dir attribute or the unicode-bidi: embed property often supplant direct use of these characters for better maintainability.^[46]

Character	Code Point	Name	Bidirectional Class	Function Summary
LRE	U+202A	Left-to-Right Embedding	LRE	Embeds LTR segment
RLE	U+202B	Right-to-Left Embedding	RLE	Embeds RTL segment
LRO	U+202D	Left-to-Right Override	LRO	Overrides to LTR
RLO	U+202E	Right-to-Left Override	RLO	Overrides to RTL
PDF	U+202C	Pop Directional Formatting	PDF	Terminates embedding/override
LRM	U+200E	Left-to-Right Mark	L	Invisible strong LTR character
RLM	U+200F	Right-to-Left Mark	R	Invisible strong RTL character
ALM	U+061C	Arabic Letter Mark	AL	Invisible strong Arabic-letter (RTL) character
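The bidirectional classes in the table can be read directly from the Unicode Character Database, and the same lookup supports a simple scan for embedding and override controls in untrusted text. This is a sketch under stated assumptions: `find_bidi_controls` and `has_override` are hypothetical helpers, the class for ALM requires an interpreter whose data is at Unicode 6.3 or later, and flagging a character is not the same as deciding it is malicious.

```python
import unicodedata

EMBEDDING_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}
DIRECTIONAL_MARKS = {"\u200e", "\u200f", "\u061c"}


def find_bidi_controls(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, bidi class) for each embedding/override control."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
        for i, ch in enumerate(text)
        if ch in EMBEDDING_CONTROLS
    ]


def has_override(text: str) -> bool:
    """True if the text contains LRO or RLO, the controls most used for spoofing."""
    return any(unicodedata.bidirectional(ch) in ("LRO", "RLO") for ch in text)


for ch in sorted(EMBEDDING_CONTROLS | DIRECTIONAL_MARKS):
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))

print(find_bidi_controls("abc\u202edef"))  # [(3, 'U+202E', 'RLO')]
print(has_override("abc\u202edef"))        # True
```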

Isolate and Embedding Controls

Isolate and embedding controls in Unicode provide a mechanism for scoped bidirectional text formatting, introduced to address limitations in earlier directional controls by isolating directional effects to specific spans of text without influencing surrounding content. These controls, added in Unicode 6.3, enable safer handling of mixed left-to-right (LTR) and right-to-left (RTL) scripts, such as embedding English text within Arabic or Hebrew without causing unintended reordering elsewhere.^[47]^[9]

The Left-to-Right Isolate (LRI, U+2066) initiates an LTR isolated span, setting the embedding level to the next even level and treating subsequent text as LTR until terminated, while preventing bidirectional interactions with external text.^[9] The Right-to-Left Isolate (RLI, U+2067) functions similarly but for RTL spans, raising the level to the next odd value.^[9] Both belong to the Format (Cf) general category, have bidirectional classes LRI and RLI respectively, and a line break property of Combining Mark (CM).^[7]^[48]

The First Strong Isolate (FSI, U+2068) starts an isolate whose direction is determined by the first strong directional character within it: RTL if that character is right-to-left (bidirectional class R or AL), and LTR otherwise, including when the span contains no strong character. This is useful for automatically adapting to mixed or unknown content directions.^[9] Like LRI and RLI, FSI has the Cf general category, FSI bidirectional class, and CM line break property.^[7]^[48]

The Pop Directional Isolate (PDI, U+2069) terminates the most recent isolate initiated by LRI, RLI, or FSI, popping it from the directional stack and restoring the previous embedding level, ensuring clean scoping even if isolates nest.^[9] PDI shares the Cf category, PDI bidirectional class, and CM line break property.^[7]^[48]

These controls address flaws in legacy embedding codes like RLE and LRO: an isolate is treated as a single neutral unit by the surrounding text, and its scope ends at the matching PDI or, failing that, at the paragraph boundary, preventing runaway directional effects that could disrupt large portions of text.^[9] They are particularly preferred in applications handling user-generated content, such as email clients and web forms, where isolates can wrap dynamic insertions to maintain readability.^[49] Legacy embeddings and overrides may still be nested inside an isolate, where PDF (U+202C) terminates them as usual; a PDF cannot close the isolate itself, whereas a PDI also ends any inner embeddings or overrides that are still open.^[9]

Control	Code Point	Name	Bidirectional Class	Primary Use
LRI	U+2066	Left-to-Right Isolate	LRI	Isolate LTR spans
RLI	U+2067	Right-to-Left Isolate	RLI	Isolate RTL spans
FSI	U+2068	First Strong Isolate	FSI	Isolate with auto-direction
PDI	U+2069	Pop Directional Isolate	PDI	Terminate any isolate

Isolates are widely supported in modern rendering engines, including browsers like Chrome, Firefox, Safari, and Edge, alongside CSS equivalents such as unicode-bidi: isolate.^[50] No additional isolate controls have been introduced since Unicode 6.3, up to and including Unicode 17.0, reflecting their sufficiency for scoped bidi needs.

Tagging and Annotation Characters

Language Tags

Language tag characters provide a mechanism for embedding language metadata directly into Unicode plain text streams without relying on external markup languages. Introduced in Unicode 3.1, this feature consists of the base character U+E0001 LANGUAGE TAG followed by tag component characters in the range U+E0020 through U+E007E, which represent shifted ASCII values for punctuation, letters, and digits (for example, U+E0020 TAG SPACE corresponds to a space, and U+E0041 TAG LATIN CAPITAL LETTER A corresponds to 'A').^[51] These tags support up to eight subtags conforming to BCP 47 language identifiers, such as "en-Latn-US", enabling applications to switch fonts, scripts, or rendering behaviors based on the embedded language information.^[52]

The structure begins with U+E0001 to initiate the tag, followed by the tag characters that spell out the identifier, with subtags separated by U+E002D TAG HYPHEN-MINUS, and is optionally terminated by U+E007F CANCEL TAG to reset the tagging state.^[53] This allows plain-text files or data streams to carry semantic language cues, facilitating tasks like multilingual document processing or accessibility features without HTML or XML attributes.

However, the use of these characters for language tagging has been deprecated since Unicode 5.1 in 2008 due to their complexity, limited adoption, and the prevalence of more robust alternatives like the HTML lang attribute or richer markup systems.^[54] In Unicode 8.0 (2015), the tag component characters U+E0020–U+E007E were restored from deprecation so that they could serve other purposes, such as the emoji tag sequences later used for subdivision flags, while U+E0001 remains deprecated.^[55] These characters are classified as format (Cf) with bidirectional class "BN" (Boundary Neutral); they have no decomposition mappings, so normalization leaves them unchanged, and they have no effect on text directionality.^[56] No new tag characters have been added since their introduction, as of Unicode 17.0.^[1] For modern applications, alternatives such as BCP 47 tags in metadata or structured formats are preferred over these inline controls.^[57]
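The "shifted ASCII" encoding amounts to adding 0xE0000 to each printable ASCII code point, with hyphens between subtags becoming U+E002D TAG HYPHEN-MINUS. The sketch below illustrates the mapping and a way to strip tag characters from incoming text. The function names are hypothetical, and since language tagging with U+E0001 is deprecated, the encoder is shown only to explain legacy data, not as a suggestion to generate it.

```python
from typing import Optional

TAG_BASE = 0xE0000
LANGUAGE_TAG = "\U000E0001"
CANCEL_TAG = "\U000E007F"


def encode_language_tag(tag: str) -> str:
    """Encode an ASCII identifier such as 'en-US' as a legacy language tag."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("tag must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in tag)


def decode_language_tag(text: str) -> Optional[str]:
    """Read the identifier from text that starts with U+E0001, else None."""
    if not text.startswith(LANGUAGE_TAG):
        return None
    out = []
    for ch in text[1:]:
        if not 0xE0020 <= ord(ch) <= 0xE007E:
            break
        out.append(chr(ord(ch) - TAG_BASE))
    return "".join(out)


def strip_tag_characters(text: str) -> str:
    """Drop U+E0001 and U+E0020..U+E007F, leaving ordinary text."""
    return "".join(
        ch for ch in text
        if ch != LANGUAGE_TAG and not 0xE0020 <= ord(ch) <= 0xE007F
    )


tagged = encode_language_tag("en-US") + "Hello"
print([f"U+{ord(c):05X}" for c in tagged[:6]])
# ['U+E0001', 'U+E0065', 'U+E006E', 'U+E002D', 'U+E0055', 'U+E0053']
print(decode_language_tag(tagged))   # en-US
print(strip_tag_characters(tagged))  # Hello
```

Note that `strip_tag_characters` would also remove the tag characters inside emoji tag sequences, so it suits plain-text sanitizing rather than emoji-aware processing.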

Interlinear Annotation Sequences

Interlinear annotation sequences use three format characters to attach annotations to a run of base text in plain text environments, providing a rich-text-like feature without relying on a markup language. These characters, introduced in Unicode 3.0, delimit annotated content, such as ruby text or footnotes, by marking the start of the annotated text, the boundary between the annotated text and its annotation, and the end of the annotation. They are particularly associated with legacy East Asian typography practices, where interlinear notes provide phonetic or explanatory glosses above or below the primary text.^[1]

The sequence begins with U+FFF9 INTERLINEAR ANNOTATION ANCHOR (Cf category, combining class 0), which marks the start of the annotated (base) text. The annotated text follows, then U+FFFA INTERLINEAR ANNOTATION SEPARATOR (Cf, combining class 0), which divides the annotated text from the annotation text. The sequence concludes with U+FFFB INTERLINEAR ANNOTATION TERMINATOR (Cf, combining class 0), which ends the annotation, after which ordinary text resumes, as in the example: preceding text￹annotated text￺annotation￻following text.

All three characters have the line break property CM (Combining Mark), so under UAX #14 rule LB9 each is treated as attached to the character that precedes it and no break opportunity occurs immediately before it. In bidirectional text, they are classified as ON (Other Neutral) and resolved according to the surrounding embedding direction under UAX #9 rules N1 and N2; for proper rendering, the bidirectional algorithm is applied to the main text after replacing annotations with the annotated content, bracketed by format controls to maintain contiguity.^[1]^[12]

Despite their utility for plain text annotation, these characters are rare in contemporary use: they depend on specific agreements between sender and receiver and may be misinterpreted if filtered or unsupported. They have seen no significant changes or expansions since Unicode 3.0 and are generally discouraged for new implementations; for structured documents, alternatives such as HTML's <ruby> element are preferred. Applications typically manage their formatting via out-of-band information rather than exposing them to end users.^[1]
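The anchor, separator, and terminator structure can be illustrated with a short Python sketch that extracts or removes annotations from a string. The names `extract_annotations` and `strip_annotations` are hypothetical, and the sketch assumes flat (non-nested) annotations; it is not a complete implementation of the behavior described in the Unicode Standard.

```python
import re
from typing import List, Tuple

ANCHOR, SEPARATOR, TERMINATOR = "\uFFF9", "\uFFFA", "\uFFFB"

# ANCHOR, annotated text, SEPARATOR, annotation text(s), TERMINATOR.
# Nested annotations are deliberately not handled in this sketch.
_ANNOTATION = re.compile(
    f"{ANCHOR}([^{ANCHOR}{SEPARATOR}{TERMINATOR}]*)"
    f"{SEPARATOR}([^{ANCHOR}{TERMINATOR}]*){TERMINATOR}"
)


def extract_annotations(text: str) -> List[Tuple[str, List[str]]]:
    """Return (annotated text, [annotations]) pairs found in text."""
    return [
        (m.group(1), m.group(2).split(SEPARATOR))
        for m in _ANNOTATION.finditer(text)
    ]


def strip_annotations(text: str) -> str:
    """Keep the annotated text; drop annotations and the three delimiters."""
    return _ANNOTATION.sub(lambda m: m.group(1), text)


sample = "The " + ANCHOR + "漢字" + SEPARATOR + "かんじ" + TERMINATOR + " example"
print(extract_annotations(sample))  # [('漢字', ['かんじ'])]
print(strip_annotations(sample))    # The 漢字 example
```

Stripping annotations in this way is one plausible fallback for a receiver that has no agreement with the sender about how the annotations should be displayed.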

Variation Selectors

Variation selectors are nonspacing marks in Unicode that specify a particular glyph variant or style for the preceding base character, enabling precise control over rendering without altering the character's identity. They form variation sequences, each consisting of a graphic base character followed immediately by a variation selector, to select among glyph forms where font or style differences might otherwise lead to inconsistent presentation.^[5] They are essential for scripts and symbols requiring multiple compatible appearances, such as mathematical notation, ideographs, and emoji, while ensuring compatibility with normalization processes.^[58]

The initial set comprises Variation Selector-1 (VS1) through Variation Selector-16 (VS16), encoded at U+FE00 through U+FE0F with general category Mn (nonspacing mark) and bidirectional class NSM (nonspacing mark). Introduced in Unicode 3.2, these selectors support standardized variation sequences for glyph choices such as text versus emoji presentation. For instance, the sequence U+2268 (LESS-THAN BUT NOT EQUAL TO) followed by U+FE00 selects the variant of that symbol with a vertical stroke.^[59] Variation selectors have no standalone visual representation and do not affect spacing or line breaking; they modify only the appearance of the associated base character and any subsequent combining marks.^[5]

A supplementary set, Variation Selector-17 (VS17) through Variation Selector-256 (VS256), is encoded in the Variation Selectors Supplement block at U+E0100 through U+E01EF, also with general category Mn and bidirectional class NSM. This block of 240 selectors was introduced in Unicode 4.0 to accommodate additional needs, chiefly ideographic distinctions.^[60] Emoji and text presentation, by contrast, are selected with VS15 and VS16 from the initial set: for example, U+1F3F3 (WAVING WHITE FLAG) followed by U+FE0F requests emoji presentation, while the same base character followed by U+FE0E requests text presentation.

Variation sequences are strictly defined in official Unicode data files to prevent arbitrary use, with the primary list in StandardizedVariants.txt containing over 1,000 entries as of Unicode 17.0, covering categories such as mathematical symbols, CJK ideographs, and script-specific forms like Egyptian hieroglyph rotations.^[61] Additional sequences for emoji (using VS15 for text style and VS16 for emoji style) number 371 in emoji-variation-sequences.txt, while ideographic variations are registered in the Ideographic Variation Database per UTS #37.^[62] VS16 also appears inside emoji zero-width joiner (ZWJ) sequences, though ZWJ functionality is addressed separately.^[58]

Use of variation selectors has continued to expand in recent Unicode versions, with Unicode 15.0 and later adding sequences for additional scripts and emoji variants. For example, standardized sequences available as of Unicode 17.0 (September 2025) include additional positional variants for East Asian punctuation, Myanmar text forms, and 42 rotations for Egyptian hieroglyphs.^[61] Unrecognized sequences default to the base character's standard glyph, maintaining robustness in rendering.^[5]
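The two selector ranges described above map to selector numbers by simple offsets, which the following Python sketch makes explicit. The helper names are hypothetical, and the sketch only finds a selector that follows another character; it does not check whether a sequence is actually listed in StandardizedVariants.txt, emoji-variation-sequences.txt, or the Ideographic Variation Database.

```python
from typing import List, Optional, Tuple


def variation_selector_number(ch: str) -> Optional[int]:
    """Return N for VS-N (1-256), or None if ch is not a variation selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:        # VS1-VS16
        return cp - 0xFE00 + 1
    if 0xE0100 <= cp <= 0xE01EF:      # VS17-VS256
        return cp - 0xE0100 + 17
    return None


def find_variation_sequences(text: str) -> List[Tuple[str, int]]:
    """List (base character, selector number) pairs in order of appearance."""
    found = []
    for prev, cur in zip(text, text[1:]):
        n = variation_selector_number(cur)
        # A selector that follows another selector has no base of its own.
        if n is not None and variation_selector_number(prev) is None:
            found.append((prev, n))
    return found


print(variation_selector_number("\uFE0F"))       # 16
print(variation_selector_number("\U000E01EF"))   # 256
print(find_variation_sequences("\u2268\uFE00 and \U0001F3F3\uFE0F"))
```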

Visual Representations of Controls

Control Picture Symbols

The Control Pictures block (U+2400–U+243F) consists of graphic symbols that provide visible representations of the C0 control codes from the ASCII standard, along with a few additional symbols for space and delete, so that these codes can be depicted in text for educational, documentation, and technical purposes.^[63] These characters, classified as symbols (So category), were introduced in Unicode version 1.1 in 1993 and have remained stable without further additions to the core set since then.^[64] The block includes 42 assigned code points out of 64, and its primary purpose is to render otherwise invisible control functions as stylized icons, such as diagonal lettering or boxed forms, though glyph designs may vary across fonts.^[63]

The core of the block comprises 32 symbols corresponding to the C0 controls from U+0000 (NULL) to U+001F (UNIT SEPARATOR), mapped directly to U+2400 through U+241F. For instance, U+2400 (␀, SYMBOL FOR NULL) represents NUL, U+2401 (␁, SYMBOL FOR START OF HEADING) depicts SOH, U+2409 (␉, SYMBOL FOR HORIZONTAL TABULATION) illustrates HT, and U+241F (␟, SYMBOL FOR UNIT SEPARATOR) shows US.^[65] Additionally, U+2420 (␠, SYMBOL FOR SPACE) and U+2421 (␡, SYMBOL FOR DELETE) provide representations for the space character (U+0020) and the DEL control (U+007F), respectively, extending the utility beyond strict C0 codes.^[63] These symbols are not intended for everyday text rendering but serve as a standardized way to visualize non-printing characters without altering their underlying semantics.^[63]

Proposals to extend the block with equivalent symbols for C1 controls (U+0080–U+009F) were considered during Unicode's early development and again in later submissions, such as a 2011 proposal, but have not been adopted.^[66] The block has thus remained focused on C0 and select legacy controls, ensuring compatibility with ASCII-derived systems.^[64]

In practice, these symbols are employed in software tools for debugging and analysis, such as hex editors, where they replace raw control bytes with readable icons to aid in file inspection and reverse engineering.^[67] They also appear in protocol analyzers to denote control sequences in data streams, enhancing clarity during network or serial communication diagnostics.^[68] Font support for the block varies, with comprehensive coverage in fonts like Segoe UI Symbol, which includes glyphs for all assigned characters to ensure consistent rendering in Windows environments.^[69]
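Because the 32 C0 symbols sit at a fixed offset of 0x2400 from the codes they depict, a tool can substitute them with one addition per character. The following Python sketch shows the idea; `to_control_pictures` and its `show_space` parameter are hypothetical names used only for illustration.

```python
def to_control_pictures(text: str, show_space: bool = False) -> str:
    """Replace C0 controls and DEL with their Control Pictures symbols."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x1F:
            out.append(chr(0x2400 + cp))   # U+0000-U+001F -> U+2400-U+241F
        elif cp == 0x7F:
            out.append("\u2421")           # SYMBOL FOR DELETE
        elif cp == 0x20 and show_space:
            out.append("\u2420")           # SYMBOL FOR SPACE
        else:
            out.append(ch)
    return "".join(out)


print(to_control_pictures("a\tb\r\n"))            # a␉b␍␊
print(to_control_pictures("a b", show_space=True))  # a␠b
```

The substitution is for display only; the underlying data should keep the original control codes.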

Usage in Debugging and Documentation

In software debugging, Unicode control characters are frequently made visible in hexadecimal dumps and logs using escape sequences, such as \u200B for the zero-width space (U+200B), to identify and isolate non-printable elements in text streams.^[70] Network analysis tools like Wireshark assist in examining bidirectional text sequences by decoding protocol payloads that may include directional formatting controls, revealing potential rendering issues in transmitted data.^[71]

For documentation purposes, the Unicode charts employ the Control Pictures block (U+2400–U+243F) to provide graphical symbols representing otherwise invisible control characters, facilitating clearer explanations in technical references.^[63] Similarly, specifications such as Unicode Standard Annex #9 (UAX #9) use abbreviations such as LRE (U+202A) and RLE (U+202B) to denote bidirectional formatting controls, so that their behavior can be described without ambiguity.^[9]

In standards development, protocols like HTTP preserve specific control characters, including the line separator (LS, U+2028) and paragraph separator (PS, U+2029), to maintain structural integrity in transmitted text, as discussed in Unicode Consortium deliberations on email and web interchange.^[72] Normalization testing, such as under Normalization Form C (NFC), confirms that control characters remain unchanged, as they fall outside the decomposition and composition rules applied to combining sequences.^[73]

Educationally, control characters like the zero-width joiner (ZWJ, U+200D) are taught as essential for constructing complex emojis, such as gendered profession emojis (e.g., woman + ZWJ + microscope for woman scientist), illustrating how invisible operators enable semantic text encoding.^[74] The stability policies of the Unicode Standard ensure that encoded characters are never removed or reassigned, and the set of control codes (Cc) has remained fixed at 65 through version 17.0 (2025), promoting backward compatibility in educational materials and implementations.^[75]^[76]

The invisible nature of these characters often leads to challenges in copy-paste operations, where they can inadvertently introduce errors such as unintended formatting or security vulnerabilities in code.^[77] Unicode-aware editors, such as Visual Studio Code, mitigate this by highlighting such characters through settings like "editor.renderControlCharacters" and "editor.unicodeHighlight.invisibleCharacters" or through extensions that render them visibly.^[78]
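The escape-based visualization described above can be sketched in Python using the standard `unicodedata` module. The function name `escape_invisibles` and the choice of general categories to escape (Cc, Cf, Zl, Zp) are assumptions made for this illustration; a real tool might choose a different set.

```python
import unicodedata

# Categories treated as invisible in this sketch: control, format,
# line separator, paragraph separator.
INVISIBLE_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}


def escape_invisibles(text: str) -> str:
    """Replace invisible characters with \\uXXXX or \\UXXXXXXXX escapes."""
    out = []
    for ch in text:
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            cp = ord(ch)
            out.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
        else:
            out.append(ch)
    return "".join(out)


# Woman + ZWJ + microscope (the woman scientist sequence mentioned above).
scientist = "\U0001F469\u200D\U0001F52C"
print(escape_invisibles(scientist))        # 👩\u200D🔬
print(escape_invisibles("a\u200Bb\u202E"))  # a\u200Bb\u202E

# Controls pass through NFC unchanged, as noted above.
assert unicodedata.normalize("NFC", scientist) == scientist
```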

References

[1]
Special Areas and Format Characters - Unicode
This chapter describes several kinds of characters that have special properties as well as areas of the codespace that are set aside for special purposes.
[2]
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#Control_Codes
[3]
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#Table_23-1
[4]
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#Layout_Controls
[5]
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#Variation_Selectors
[6]
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#Deprecated_Format_Characters
[7]
https://www.unicode.org/reports/tr44/
[8]
UAX #44: Unicode Character Database
Aug 27, 2025 · Unicode 17.0.0. Changes in specific files: Appropriate existing data files were updated to add the 4803 new characters encoded in Unicode 17.0.
[9]
[PDF] Guide to the use of character set standards in Europe - Unicode
Jul 23, 1999 · The C0 set of ISO/IEC 6429 has its historical origin in the control characters of the ASCII character set. For this reason, 10 of the ...
[10]
Special Areas and Format Characters - Unicode
The Unicode Standard contains code positions for the 64 control characters and the DEL character found in ISO standards and many vendor character sets. The ...
[11]
UAX #9: Unicode Bidirectional Algorithm
[12]
Core Spec – Unicode 17.0.0
23 Special Areas and Format Characters · Control Codes • Layout Controls • Deprecated Format Characters • Variation Selectors • Private-Use Characters ...
[13]
Character Properties - Unicode
This chapter gives an overview of character properties, their status and attributes, followed by an overview of the UCD and more detailed notes on some ...
[14]
UAX #14: Unicode Line Breaking Algorithm
[15]
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/
[16]
[PDF] Latin-1 Supplement - The Unicode Standard, Version 17.0
The Unicode Standard, Version 17.0, Copyright © 1991-2025 Unicode, Inc. All rights reserved. 9. 00A8. C1 Controls and Latin-1 Supplement. 0080. 009F Ș <control>.
[17]
Chapter 3 – Unicode 16.0.0
Wherever the precise behavior of all Unicode characters needs to be cited, the full three-field version number should be used, as in the first example below.
[18]
C0 Controls and Basic Latin - Unicode
C0 controls. Alias names are those for ISO/IEC 6429:1992. Commonly used alternative aliases are also shown. 0000, ␀, <Control>. = Null. 0001, ␁, <Control>.
[19]
The set of control characters of ISO 646 - skew.org
ISO IR-001 is the set of control codes for ISO 646. The control codes have different purposes. Some are meant to control mechanical printing devices.
[20]
Control characters in ASCII and Unicode - Aivosto
In fact, the C1 area has been entirely reserved for control codes in Unicode. ... To solve the problem, control characters IND and NEL were added to the C1 area.
[21]
[PDF] C0 Controls and Basic Latin - The Unicode Standard, Version 17.0
These charts are provided as the online reference to the character contents of the Unicode Standard, Version 17.0 but do not provide all the information needed ...
[22]
C1 Controls and Latin-1 Supplement - Unicode
C1 controls. Alias names are those for ISO/IEC 6429:1992. 0080, <Control>. 0081, <Control>. 0082, <Control>. = Break Permitted Here.
[23]
https://www.unicode.org/L2/L2022/22013r-c0-c1-stability.pdf
[24]
http://www.unicode.org/versions/Unicode1.0.0/appB.pdf
[25]
[PDF] C0 and C1 stability for Unicode and 10646
Feb 11, 2022 · The character encoding for many old computer systems do not follow the ISO/IEC 2022 architecture, and their control codes do not follow ISO/IEC ...
[26]
[PDF] Appendix B: Implementation Guidelines - Unicode
Nov 19, 2012 · Line and Paragraph Separator. The Unicode standard has two special characters U+2028 LINE SEP-. ARATOR and U+2029 PARAGRAPH SEPARATOR. A new ...
[27]
https://www.unicode.org/reports/tr14/#LB4
[28]
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#G12985
[29]
https://www.w3.org/TR/unicode-xml/
[30]
https://www.unicode.org/reports/tr20/tr20-5.html
[31]
Unicode in XML and other Markup Languages - W3C
Jul 13, 2017 · Short description: The line and paragraph separator provide unambiguous means to denote hard line breaks and paragraph delimiters in plain text.
[32]
https://www.unicode.org/reports/tr15/
[33]
http://www.unicode.org/standard/reports/tr13/tr13-5.html
[34]
https://unicode.org/reports/tr51/
[35]
UTR #13: Unicode Newline Guidelines
There are two mappings of LF and NEL used by EBCDIC systems. The first EBCDIC column shows the MVS Open Edition (including CP1047) mapping of these characters, ...
[36]
UTS #51: Unicode Emoji
The U+200D ZERO WIDTH JOINER (ZWJ) can be used between the elements of a sequence of characters to indicate that a single glyph should be presented if available ...
[37]
Chapter 22 – Unicode 16.0.0
#Invisible Function Application. U+2061 FUNCTION APPLICATION is used for an implied function dependence, as inf(x + y). To indicate that this is the function ...
[38]
https://www.unicode.org/charts/PDF/U2000.pdf
[39]
UAX #9: Unicode Bidirectional Algorithm
[40]
[PDF] General Punctuation - The Unicode Standard, Version 17.0
Format characters. 202A LEFT-TO-RIGHT EMBEDDING. • commonly abbreviated LRE. 202B RIGHT-TO-LEFT EMBEDDING. • commonly abbreviated RLE. 202C POP ...
[41]
https://www.unicode.org/reports/tr39/tr39-22.html
[42]
https://www.unicode.org/reports/tr9/tr9-51.html#Implicit_Directional_Marks
[43]
https://www.unicode.org/charts/PDF/U0600.pdf
[44]
https://www.unicode.org/versions/beta-6.3.0.html
[45]
[PDF] Arabic - The Unicode Standard, Version 17.0
061C. 061D. 061E. 061F. 0620. 0621. 0622. 0623. 0624. 0625. 0626. 0627. 0628. 0629. 062A ... ٷ ARABIC LETTER U WITH HAMZA ABOVE. • preferred spelling is 0674 ٴ ...
[46]
https://www.unicode.org/reports/tr20/tr20-9.html
[47]
https://www.unicode.org/versions/Unicode6.3.0/
[48]
https://www.unicode.org/Public/UNIDATA/extracted/DerivedLineBreak.txt
[49]
https://www.w3.org/International/questions/qa-bidi-unicode-controls.en.html
[50]
None
[51]
How to use Unicode controls for bidi text - W3C
Feb 23, 2023 · This article looks at how content authors can apply direction metadata to bidirectional text when markup is not available.
[52]
CSS property: unicode-bidi: isolate | Can I use... Support ... - CanIUse
"Can I use" provides up-to-date browser support tables for support of front-end web technologies on desktop and mobile web browsers.
[53]
https://www.unicode.org/charts/nameslist/n_E0000.html
[54]
[PDF] Tag characters - The Unicode Standard, Version 17.0
E0001 LANGUAGE TAG. • This character is deprecated, and its use is strongly discouraged. Tag components. E0020 TAG SPACE. E0021 TAG EXCLAMATION MARK.
[55]
Tags - Unicode
Tags ; E0001, ⬚, Language Tag ; Tag components ; E0020, ⬚, Tag Space ; E0021, ⬚, Tag Exclamation Mark.
[56]
Deprecated Character Proposal - Unicode
These characters are deprecated, and should not be used particularly with any protocols that provide alternate means of language tagging.... =================== ...
[57]
UTC 143 Draft Minutes - Unicode
May 8, 2015 · [143-A74] Action Item for Ken Whistler, Editorial Committee: Update UAX #44 to indicate the change in deprecation status, for Unicode 8.0. See ...
[58]
https://www.unicode.org/reports/tr51/
[59]
Strings on the Web: Language and Direction Metadata - W3C
Oct 17, 2024 · ... character U+E0001 LANGUAGE TAG is strongly discouraged. ... Producers insert Unicode tag characters into the data to tag strings with a language.
[60]
https://www.unicode.org/charts/PDF/UE0100.pdf
[61]
[PDF] variation selector-16 - The Unicode Standard, Version 17.0
These charts are provided as the online reference to the character contents of the Unicode Standard, Version 17.0 but do not provide all the information needed ...
[62]
[PDF] Variation Selectors Supplement - The Unicode Standard, Version 17.0
These charts are provided as the online reference to the character contents of the Unicode Standard, Version 17.0 but do not provide all the information needed ...
[63]
StandardizedVariants.txt - Unicode
... variation sequences that are defined in the # Unicode Standard. # # This ... Emoji variation sequences are defined in the file # emoji-variation ...
[64]
UTS #37: Unicode Ideographic Variation Database
This document describes the organization of the Ideographic Variation Database, and the procedure to add sequences to that database.
[65]
[PDF] Control Pictures - The Unicode Standard, Version 17.0
These charts are provided as the online reference to the character contents of the Unicode Standard, Version 17.0 but do not provide all the information needed ...
[66]
Control Pictures - Codepoints
Control Pictures. Block from U+2400 to U+243F. This block was introduced in Unicode version 1.1 (1993). It contains 42 codepoints.
[67]
Unicode Block “Control Pictures” - Compart
Character List Grid List ; U+2400. ␀. Symbol For Null ; U+2401. ␁. Symbol For Start of Heading ; U+2402. ␂. Symbol For Start of Text ; U+2403. ␃. Symbol For End of ...
[68]
C1 Control Pictures Proposal from Sean Leonard on 2011 ... - Unicode
Aug 13, 2011 · However, I looked through the "Archive of Notices of Non-Approval" and was unable to find an explicit rejection of his proposals. In any event, ...
[69]
Feature Request: Add the Control Pictures Unicode block · Issue #219
Dec 17, 2019 · Adding these 39 glyphs would make it possible for utilities such as hexdump to show them in text representations, which is much more helpful ...
[70]
6 Hex Editors for Malware Analysis - SANS Institute
Sep 29, 2010 · Hex editors allow examining and modifying a file at the low-level of bytes and bits, usually representing the file's contents in hexadecimal ...
[71]
Control Pictures characters supported by the Segoe UI Symbol font
Control Pictures characters supported by the Segoe UI Symbol font ; SYMBOL FOR START OF HEADING (U+2401) ; SYMBOL FOR START OF TEXT (U+2402) ; SYMBOL FOR END OF ...
[72]
Escape Unicode Characters - CyberChef
Escape Unicode Characters. Prefix. \u, %u, U+. Encode all chars. Padding. Uppercase hex.
[73]
6.4. Building Display Filter Expressions - Wireshark
Wireshark provides a display filter language that enables you to precisely control which packets are displayed.
[74]
Unicode Mail List Archive: RE: Line Separator and Paragraph ...
Oct 21, 2003 · email protocols. Had we done this, SLB could have been considered "just whitespace", while LS and PS would have been not-ignorable in HTML (and ...
[75]
UAX #15: Unicode Normalization Forms
Jul 30, 2025 · Both NFD and NFC maintain compatibility composites. Neither NFKD nor NFKC maintains compatibility composites.
[76]
‍Zero Width Joiner Emoji | Meaning, Copy And Paste - Emojipedia
Zero Width Joiner (ZWJ) is a Unicode character that joins two or more other characters together in sequence to create a new emoji.
[77]
Unicode® Character Encoding Stability Policies
Jan 9, 2024 · This page lists the policies of the Unicode Consortium regarding character encoding stability. These policies are intended to ensure that text encoded in one ...
[78]
ASCII Smuggler Tool: Crafting Invisible Text and Decoding Hidden ...
Jan 14, 2024 · ASCII Smuggler Tool to help with testing and creation of payloads, and also to check if text might have invisible Unicode Tags you can use ASCII Smuggler.
[79]
Render Special Characters - Visual Studio Marketplace
May 26, 2024 · Render Special Characters. This extension makes it easy to identify special characters by displaying them using the UTFx encoding standard.