
Bidirectional text

Bidirectional text refers to text containing characters from scripts with opposing horizontal writing directions, primarily left-to-right (LTR) scripts such as the Latin script used for English and right-to-left (RTL) scripts such as Arabic and Hebrew. To render such mixed-direction content correctly, computing systems employ the Unicode Bidirectional Algorithm (UBA), a standardized process that reorders characters from their logical storage sequence in memory to the appropriate display order. The UBA, detailed in Unicode Standard Annex #9, processes text in four primary phases: first, separating the input into independent paragraphs using paragraph separators; second, initializing each character's bidirectional type (strong, weak, neutral, or explicit) and the base embedding level from the paragraph's direction; third, resolving embedding levels through explicit directional formatting characters and implicit rules for weak and neutral types; and fourth, reordering level runs (maximal sequences of characters sharing the same embedding level) within each line for final display. Explicit embedding levels are numeric values from 0 to 125, where even levels indicate LTR direction and odd levels indicate RTL; this maximum depth of 125 was raised from 61 in Unicode 6.3. Characters are classified by inherent properties: strong directional ones like Latin letters (LTR) or Hebrew letters (RTL), weak ones like European numbers that adapt to surrounding context, and neutral ones like punctuation that inherit direction from adjacent runs or the base paragraph direction. Historically, the UBA extends earlier implicit bidirectional models in Unicode, with major enhancements in version 6.3 (2013) introducing directional isolates (U+2066 to U+2069), including the first-strong isolate, to limit the scope of directional changes and prevent unintended effects on surrounding text.
These updates address common challenges in bidirectional text handling, such as mirroring paired punctuation (e.g., parentheses flipping in RTL contexts) and resolving ambiguities in mixed numeral systems, where digits may embed LTR even in RTL paragraphs. In modern computing, bidirectional support is integral to internationalization (i18n) standards, implemented in operating systems like Windows via Unicode properties and control characters (e.g., U+200F Right-to-Left Mark for forcing RTL flow), as well as in web browsers through CSS properties like unicode-bidi and direction. For web content, the base direction can be set using HTML's dir attribute (e.g., dir="rtl"), ensuring consistent rendering across diverse linguistic environments, though developers must account for issues like caret positioning during editing or neutral character resolution at direction boundaries. Overall, the UBA enables seamless global text interchange, supporting users of RTL scripts while maintaining logical order for storage and processing efficiency.

Fundamentals

Definition and Overview

Bidirectional text refers to sequences of characters that mix left-to-right (LTR) and right-to-left (RTL) directional scripts, requiring algorithmic reordering to achieve correct visual presentation. This phenomenon arises from the fundamental differences in writing systems: LTR for Latin-based languages like English, and RTL for Semitic languages such as Arabic and Hebrew, which often appear together in multilingual documents or interfaces. A simple example is the string "Hello שלום", where the Hebrew word "שלום" (meaning "peace") is rendered in RTL order—reversing its characters visually—while embedded within the LTR "Hello", resulting in the RTL portion appearing to the right but reading from right to left. In computing, accurate bidirectional text rendering is crucial for maintaining readability in digital applications, web pages, and software interfaces supporting global users, preventing confusion in mixed-language content. The Unicode standard addresses this through its Bidirectional Algorithm, ensuring consistent display across diverse platforms.

Historical Context

Bidirectional writing practices trace their roots to ancient civilizations, where script directions varied to accommodate inscription surfaces or aesthetic needs. One of the earliest forms was boustrophedon, a method alternating line directions from left-to-right and right-to-left, akin to an ox plowing a field. This style appeared in ancient Greek inscriptions as early as the 8th century BCE, with examples like the Dipylon Oinochoe vase from Athens demonstrating careful reversal of characters to maintain legibility across lines. Etruscan inscriptions from the 7th century BCE, influenced by Greek contact, also frequently employed boustrophedon, as seen in early monumental texts on stone and metal, reflecting a transitional phase before standardization to consistent directions. Semitic scripts established a more uniform right-to-left orientation that profoundly shaped later writing systems. The Phoenician alphabet, developed around 1200 BCE in the Levant, was consistently inscribed from right to left, omitting vowels for efficiency in trade and administration. This directionality directly influenced the Hebrew script by the 10th century BCE, where shared consonantal forms and phonetic principles preserved the RTL flow in biblical and epigraphic texts. Similarly, through intermediate Aramaic adaptations around the 9th century BCE, Phoenician RTL conventions evolved into the Arabic script, standardizing right-to-left writing across the Islamic world by the 7th century CE. Other historical systems exhibited bidirectional flexibility tied to visual or contextual cues rather than strict linearity. In ancient Egyptian hieroglyphs, dating from circa 3200 BCE, the reading direction was determined by the orientation of figures—human and animal glyphs faced toward the start of the text, allowing seamless shifts between left-to-right and right-to-left flows within the same composition, as evidenced in tomb inscriptions and stelae. 
The advent of printing in the 15th–16th centuries amplified challenges for bidirectional and mixed scripts, predating digital solutions. Early European presses struggled with RTL languages like Hebrew and Arabic, requiring custom type molds for cursive joins and multiple letter variants—up to four forms per Arabic character—resulting in laborious, error-prone composition by undertrained compositors. Mixed-language texts, such as early Arabic works printed in Italy from 1514, demanded reversed setting and frequent plate realignments, limiting production and fidelity to manuscript traditions until specialized foundries emerged in the 19th century.

Technical Standards

Unicode Bidirectional Support

The Unicode Standard establishes a foundational framework for handling bidirectional text via its Bidirectional Algorithm, initially specified in version 2.0, released in July 1996. This algorithm outlines rules for resolving the visual ordering of text that mixes left-to-right (LTR) and right-to-left (RTL) scripts, enabling consistent rendering across diverse writing systems without requiring script-specific adjustments in applications.

At its core, the algorithm operates paragraph by paragraph, processing each paragraph independently so that directional behavior in one does not affect another. In the absence of a higher-level declaration, it takes the base direction from the first strong directional character in the paragraph, such as a Latin letter (LTR) or a Hebrew or Arabic letter (RTL), and it supports explicit overrides through dedicated formatting controls to force specific directions when needed. In resolving implicit direction, the algorithm relies on each character's bidirectional class, described under Character Classification.

The Bidirectional Algorithm has evolved through successive Unicode versions, with a key enhancement in Unicode 6.3 (released September 2013) introducing bidirectional isolates to manage nested directional runs more precisely, limiting their scope and reducing interference with adjacent text. The Unicode Consortium maintains ongoing refinements to the algorithm, ensuring compatibility with new scripts and internationalization requirements. This support extends to broader standards, where the algorithm informs implementations in HTML and CSS (such as the dir="rtl" attribute for declaring base direction) and is realized in open-source libraries like the International Components for Unicode (ICU), which provides robust bidirectional text processing for software applications.
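As a rough illustration of how a base direction can be derived from the first strong character, the following sketch uses Python's standard unicodedata module. The function name base_level is hypothetical. The code skips isolated text as rules P2 and P3 describe and falls back to LTR; it does not model any higher-level protocol (such as an HTML dir attribute) that could override the result.

```python
import unicodedata

def base_level(paragraph: str) -> int:
    """Return 0 (LTR) or 1 (RTL) for a paragraph, following rules P2-P3.

    Characters between an isolate initiator and its matching PDI are
    skipped; if no strong character is found, the level defaults to 0.
    """
    isolate_depth = 0
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("LRI", "RLI", "FSI"):
            isolate_depth += 1
        elif bidi_class == "PDI":
            if isolate_depth > 0:      # an unmatched PDI is ignored
                isolate_depth -= 1
        elif isolate_depth == 0:
            if bidi_class == "L":
                return 0
            if bidi_class in ("R", "AL"):
                return 1
    return 0

# "Hello" followed by a Hebrew word: the first strong character is Latin.
print(base_level("Hello \u05e9\u05dc\u05d5\u05dd"))
# The same words in the opposite logical order start with a Hebrew letter.
print(base_level("\u05e9\u05dc\u05d5\u05dd Hello"))
```

The classes returned by unicodedata depend on the Unicode version bundled with the Python interpreter, so results for recently encoded characters may differ between installations.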

Character Classification

In bidirectional text processing, characters are classified into categories based on their inherent directional behavior, which determines how they influence the ordering of text in mixed-direction scripts. These classifications are defined by the Unicode Bidirectional Algorithm and form the foundation for resolving text directionality. The primary property governing this is Bidi_Class, a normative Unicode character property that assigns one of 23 possible bidirectional types to each code point, including unassigned and private-use code points.

Characters are grouped into three main categories (strong, weak, and neutral), with additional explicit formatting types for directional control.

Strong characters have a fixed direction that strongly influences surrounding text: L for left-to-right (e.g., Latin letters like A), R for right-to-left (e.g., Hebrew letters like א), and AL for right-to-left Arabic letters (e.g., Arabic ا).

Weak characters adopt direction based on context: EN for European numbers (e.g., digits 0-9), AN for Arabic numbers (e.g., Arabic-Indic digits ٠-٩), ES for European number separators (e.g., + or -), ET for European number terminators (e.g., $ or °), CS for common separators (e.g., , or :), NSM for nonspacing marks (e.g., the combining acute accent, U+0301), and BN for boundary neutrals (e.g., the soft hyphen, U+00AD).

Neutral characters have no inherent direction and resolve based on adjacent strong types: B for paragraph separators (e.g., line feed or U+2029 PARAGRAPH SEPARATOR), S for segment separators (e.g., tab), WS for whitespace (e.g., space), and ON for other neutrals (e.g., most punctuation like !).

Explicit embedding and isolate types provide mechanisms for overriding or isolating directional runs: LRE, RLE, LRO, and RLO for legacy embeddings and overrides, terminated by PDF; and LRI, RLI, and FSI for isolates, terminated by PDI.
These types are assigned via the Unicode Character Database. Unassigned code points receive default values by block: most default to L, while those in ranges reserved for right-to-left scripts default to R or AL. Private-use characters default to L, although implementations may assign them other types. Additional properties refine classification: Bidi_Mirrored indicates characters whose glyphs are mirrored in right-to-left contexts (e.g., an opening parenthesis is drawn with the glyph that normally represents a closing parenthesis), while Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type identify bracket pairs and whether each member is an opening or closing bracket, so that brackets, which are otherwise treated as ON, can be matched and given a consistent direction. The following table summarizes the bidirectional character types, their abbreviations, descriptions, and representative examples (based on Unicode 15.1 data, with approximate character counts for scale):
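| Category | Type | Description | Examples (Unicode code points) | Approx. count |
|---|---|---|---|---|
| Strong | L | Left-to-Right | A (U+0041), α (U+03B1) | 112,000 |
| Strong | R | Right-to-Left | א (U+05D0), ש (U+05E9) | 3,700 |
| Strong | AL | Right-to-Left Arabic | ا (U+0627), ދ (U+078B) | 1,100 |
| Weak | EN | European Number | 1 (U+0031), ۱ (U+06F1) | 10 |
| Weak | ES | European Number Separator | + (U+002B), − (U+2212) | 2 |
| Weak | ET | European Number Terminator | ¢ (U+00A2), ₹ (U+20B9) | 5 |
| Weak | AN | Arabic Number | ٠ (U+0660), ٫ (U+066B) | 30 |
| Weak | CS | Common Number Separator | : (U+003A), ، (U+060C) | 3 |
| Weak | NSM | Nonspacing Mark | combining grave accent (U+0300), combining ring below (U+0325) | 1,900 |
| Weak | BN | Boundary Neutral | soft hyphen (U+00AD), zero width non-joiner (U+200C) | 27 |
| Neutral | B | Paragraph Separator | line feed (U+000A), paragraph separator (U+2029) | 5 |
| Neutral | S | Segment Separator | character tabulation (U+0009), unit separator (U+001F) | 2 |
| Neutral | WS | Whitespace | space (U+0020), thin space (U+2009) | 17 |
| Neutral | ON | Other Neutral | ! (U+0021), © (U+00A9) | 460 |
| Explicit | LRE | Left-to-Right Embedding | U+202A | 1 |
| Explicit | LRO | Left-to-Right Override | U+202D | 1 |
| Explicit | RLE | Right-to-Left Embedding | U+202B | 1 |
| Explicit | RLO | Right-to-Left Override | U+202E | 1 |
| Explicit | PDF | Pop Directional Format | U+202C | 1 |
| Explicit | LRI | Left-to-Right Isolate | U+2066 | 1 |
| Explicit | RLI | Right-to-Left Isolate | U+2067 | 1 |
| Explicit | FSI | First Strong Isolate | U+2068 | 1 |
| Explicit | PDI | Pop Directional Isolate | U+2069 | 1 |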
These classifications are essential prerequisites for the bidirectional algorithm, which uses them to resolve the final display order of text runs.
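The Bidi_Class and Bidi_Mirrored properties can be inspected directly. The sketch below relies on Python's standard unicodedata module; the helper name bidi_profile is hypothetical, and the reported values reflect whichever Unicode version the interpreter bundles.

```python
import unicodedata

def bidi_profile(text):
    """Return (code point, Bidi_Class, Bidi_Mirrored) for each character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.bidirectional(ch),   # e.g. "L", "R", "AL", "EN"
            bool(unicodedata.mirrored(ch)),  # True for characters like "("
        )
        for ch in text
    ]

# Latin A, Hebrew alef, Arabic alef, digit 1, Arabic-Indic zero,
# comma, space, opening parenthesis.
sample = "A\u05d0\u06271\u0660, ("
for code_point, bidi_class, mirrored in bidi_profile(sample):
    print(code_point, bidi_class, "mirrored" if mirrored else "")
```

A lookup of this kind is only the first step of the algorithm: the classes it reports are the initial types that later rules may change according to context.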

Formatting Controls

Unicode provides a set of explicit directional formatting characters to control the rendering of bidirectional text without changing its underlying semantics. These controls allow authors to override or adjust the automatic bidirectional algorithm, ensuring proper visual ordering in mixed-direction content. They are defined in the Unicode Standard and detailed in Unicode Standard Annex #9 (UAX #9). The formatting controls fall into several categories: implicit marks, embeddings, overrides, and isolates.

Implicit marks, such as the Left-to-Right Mark (LRM, U+200E) and Right-to-Left Mark (RLM, U+200F), are zero-width characters that behave like strong characters of the corresponding direction, influencing adjacent neutral or weak characters without any visible effect. The Arabic Letter Mark (ALM, U+061C) serves a similar role specifically for Arabic script contexts. These marks are particularly useful for resolving ambiguities in short sequences, such as ensuring correct punctuation attachment in mixed text.

Embeddings and overrides provide stronger directional control by altering the embedding levels for subsequent text. The Left-to-Right Embedding (LRE, U+202A) and Right-to-Left Embedding (RLE, U+202B) initiate an embedded sequence in the specified direction, and such sequences can nest up to the algorithm's maximum depth of 125 levels, while the Pop Directional Formatting (PDF, U+202C) terminates the most recent embedding or override, restoring the prior level. Overrides, including the Left-to-Right Override (LRO, U+202D) and Right-to-Left Override (RLO, U+202E), force all following characters, regardless of their inherent directionality, to adopt the specified strong direction until terminated by PDF. For example, in an RTL-dominant context like Hebrew text, inserting LRE before an English phrase such as "Hello World" followed by PDF ensures the phrase renders left-to-right: Hebrew LRE Hello World PDF. These older embedding and override mechanisms can propagate effects to surrounding text, potentially causing unintended reordering.
To address limitations of embeddings, Unicode 6.3 introduced directional isolates, which limit directional changes to a scoped segment without influencing adjacent content. The Left-to-Right Isolate (LRI, U+2066) and Right-to-Left Isolate (RLI, U+2067) embed text in the respective directions, while the First Strong Isolate (FSI, U+2068) determines the direction based on the first strong directional character within the isolate. All isolates are terminated by the Pop Directional Isolate (PDI, U+2069), which also closes any nested embeddings or overrides inside the isolate. Isolates are preferred in modern implementations for their cleaner scoping, as they prevent "stacking" issues where mismatched controls disrupt the entire paragraph. For instance, in RTL Arabic text containing an embedded LTR URL, using FSI before the URL and PDI after it allows the URL to render correctly without affecting the surrounding Arabic: Arabic FSI https://example.com PDI. While embeddings and overrides remain supported for backward compatibility, isolates are recommended for new content to minimize compatibility risks and improve robustness in applications like web browsers and text editors. These controls interact with inherent character classifications, such as strong RTL types, to fine-tune rendering outcomes.
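A minimal sketch of the isolate pattern described above, assuming the text is assembled programmatically before display. The helper name isolate and its direction keywords are hypothetical; only the four code points come from the standard.

```python
# Isolate initiators and terminator (Unicode 6.3 and later).
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

def isolate(fragment: str, direction: str = "auto") -> str:
    """Wrap a fragment in a directional isolate.

    direction: "ltr" uses LRI, "rtl" uses RLI, and "auto" uses FSI so the
    first strong character of the fragment decides its direction.
    """
    initiators = {"ltr": LRI, "rtl": RLI, "auto": FSI}
    if direction not in initiators:
        raise ValueError(f"unknown direction: {direction!r}")
    return initiators[direction] + fragment + PDI

# A short Arabic word followed by an isolated LTR URL.
arabic = "\u0627\u0644\u0645\u0648\u0642\u0639"
message = arabic + " " + isolate("https://example.com", "ltr")
print([f"U+{ord(ch):04X}" for ch in message if ch in (LRI, PDI)])
```

In markup-based environments the same effect is usually obtained with structural features such as the HTML dir attribute rather than by inserting control characters into the text.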

Bidirectional Algorithm

The Unicode Bidirectional Algorithm (UBA), defined in Unicode Standard Annex #9, is a standardized process for determining the correct visual ordering of bidirectional text, ensuring that left-to-right (LTR) and right-to-left (RTL) scripts display properly when mixed. It operates on a sequence of characters, each classified by bidirectional type from the Unicode Character Database, such as strong (L for LTR, R or AL for RTL), weak (e.g., numbers like EN), neutral (e.g., punctuation like ON), or explicit formatting controls. The algorithm applies its rules in sequence to resolve embedding levels and reorder the text for rendering, without altering the logical storage order.

The process begins with identifying the base direction of each paragraph (rules P1–P3). The text is first split into paragraphs at paragraph separators (P1). The base embedding level is then set to 0 (LTR) if the first strong directional character is of type L, or to 1 (RTL) if it is R or AL; if no strong character is found, the base level defaults to 0 unless a higher-level protocol specifies otherwise (P2–P3).

Next, explicit embeddings, overrides, and isolates are resolved (rules X1–X10), processing directional formatting characters such as LRE (start LTR embedding), RLE (start RTL embedding), LRO and RLO (overrides), and the isolate initiators LRI, RLI, and FSI, together with their terminators PDF and PDI. Isolates confine a directional change to the enclosed text so that it does not affect the resolution of surrounding characters. These controls are tracked on a directional status stack, and embedding levels are capped at a maximum depth of 125 so that excessive nesting cannot overflow it.

Weak and neutral types are then resolved relative to their neighbors and the embedding direction (rules W1–W7 and N0–N2). Weak characters, such as European numbers (EN) or Arabic numbers (AN), are adjusted according to context: for example, a European number whose nearest preceding strong character is an Arabic letter (AL) is treated as an Arabic number (W2), and one whose nearest preceding strong character is of type L is treated as L (W7).
Neutral characters, including most punctuation and whitespace, take the direction of the surrounding strong text when both sides agree and the embedding direction otherwise (N1–N2), with paired brackets handled first as a unit (N0). Following this, implicit levels are assigned (rules I1–I2): at an even embedding level, R characters are raised by one level and EN and AN characters by two; at an odd embedding level, L, EN, and AN characters are raised by one. As a result, characters of type L, EN, or AN always end on even levels and R characters on odd levels. The resolved text is segmented into level runs, defined as maximal contiguous sequences of characters with the same resolved embedding level.

These runs are then reordered for visual display (rules L1–L4). Separators and trailing whitespace are reset to the paragraph level (L1), and then, working from the highest level found on the line down to the lowest odd level, every contiguous sequence of characters at that level or higher is reversed (L2). Combining marks are repositioned relative to their base characters if the rendering engine requires it (L3). Finally, mirroring is applied (L4): characters with resolved RTL direction and the Bidi_Mirrored property (e.g., < becomes >) are replaced with their mirrored glyphs. A high-level pseudocode summary of the resolution phases is as follows:
Split the text into paragraphs at paragraph separators (P1)
For each paragraph:
    Determine base embedding level from first strong character (P2–P3)
    Process explicit directional formatting to set embedding levels (X1–X10)
    For each isolating run sequence:
        Resolve weak types based on neighbors (W1–W7)
        Resolve paired brackets and other neutral types (N0–N2)
        Assign implicit levels (I1–I2)
    For each rendering line:
        Reset separators and trailing whitespace to the paragraph level (L1)
        Reverse sequences from the highest level down to the lowest odd level (L2)
        Reposition combining marks where the renderer requires it (L3)
        Apply mirroring to mirrored characters resolved as RTL (L4)
The UBA has limitations, as it focuses solely on horizontal bidirectional reordering and does not support vertical text orientations or complex layouts, which are handled by higher-level systems such as CSS Writing Modes or layout engines.
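Two of the steps above, implicit level assignment (I1–I2) and run reversal (L2), are compact enough to sketch in Python. This is an illustration under simplifying assumptions, not a conforming implementation: the function names are hypothetical, the input types are assumed to have already been resolved to L, R, EN, or AN, and uppercase Latin letters stand in for RTL letters, following the convention used in UAX #9 examples.

```python
def implicit_level(embedding_level: int, bidi_type: str) -> int:
    """Rules I1-I2 for a character already resolved to L, R, EN, or AN."""
    if embedding_level % 2 == 0:           # I1: even (LTR) embedding level
        if bidi_type == "R":
            return embedding_level + 1
        if bidi_type in ("EN", "AN"):
            return embedding_level + 2
        return embedding_level
    if bidi_type in ("L", "EN", "AN"):     # I2: odd (RTL) embedding level
        return embedding_level + 1
    return embedding_level

def reorder_line(chars: str, levels: list) -> str:
    """Rule L2: from the highest level down to the lowest odd level,
    reverse every contiguous sequence at that level or higher."""
    order = list(range(len(chars)))
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return chars                       # pure LTR line: nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(levels):
            if levels[i] >= level:
                j = i
                while j < len(levels) and levels[j] >= level:
                    j += 1
                order[i:j] = order[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars[k] for k in order)

# RTL paragraph (embedding level 1): two RTL letters, two digits, two RTL letters.
types = ["R", "R", "EN", "EN", "R", "R"]
levels = [implicit_level(1, t) for t in types]
print(levels)
print(reorder_line("AB12CD", levels))
```

The digits end up one level above the surrounding RTL letters, so they are reversed twice and keep their left-to-right order while the letters around them are displayed right to left.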

Script Applications

Right-to-Left Scripts

Right-to-left (RTL) scripts are writing systems in which text is primarily arranged from right to left, a directionality that necessitates specific bidirectional handling when mixed with left-to-right (LTR) elements. The primary RTL scripts in use today are Hebrew, Arabic, and Syriac; the Arabic script is also used, with adaptations, to write Persian (Farsi), Urdu, Pashto, and Sorani Kurdish, each adjusting the letter inventory to its phonology while keeping the cursive RTL flow. The Hebrew, Arabic, and Syriac scripts originated with Semitic languages and function as abjads, in which consonants form the core of the writing system and vowels are indicated optionally via diacritics. Hebrew employs a square script, characterized by block-like letter forms that do not connect cursively; it primarily represents consonants, with niqqud diacritics for vowels used mainly in educational or religious contexts. Arabic, in contrast, is a cursive script where letters adopt contextual joining forms (initial, medial, final, or isolated) depending on their position within a word, and it frequently incorporates harakat diacritics for short vowels and pronunciation nuances. Persian uses a modified Arabic script with additional letters for sounds like /p/, /ch/, /zh/, and /g/, retaining cursive RTL directionality. Urdu similarly adapts Arabic script with extra characters for Indic sounds, often including more diacritics (zer, zabar) for vowels. Syriac and related variants, such as those used for Neo-Aramaic languages, also feature cursive connections similar to Arabic, with combining diacritics like qushshaya and rukkakha for vocalization, though they include unique elements like the ligature for the letter Waw with a vertical stroke.
In bidirectional contexts, these scripts establish an RTL base direction, embedding LTR segments for elements like European numerals and dates, which maintain their natural left-to-right order within the flow; for instance, a date such as "2025-11-10" appears with the year reading LTR amid surrounding RTL text. Punctuation in these scripts often undergoes mirroring as per the Unicode Bidirectional Algorithm—for example, opening parentheses visually flip to closing forms in RTL runs to preserve logical pairing. The Unicode Bidirectional Algorithm ensures proper reordering and mirroring for these scripts when integrated with LTR content. Modern adaptations for RTL scripts include specialized keyboard layouts and font technologies to facilitate digital input and display. The Hebrew QWERTY layout, a phonetic mapping on standard QWERTY keyboards, assigns Hebrew letters to keys based on English sound approximations (e.g., 'k' for Kaf), enabling bilingual users to switch seamlessly between Hebrew and Latin input. For Arabic, Persian, Urdu, and Syriac, fonts must support OpenType shaping tables to render correct joining forms and ligatures, ensuring cursive connectivity across digital platforms. These scripts are prevalent in the Middle East, North Africa, and South Asia, where they serve over 500 million speakers as of 2025, predominantly Arabic (around 400 million), Persian (around 110 million), and Urdu (around 70 million) users, underscoring their cultural and communicative significance in regions spanning from Morocco to Pakistan.

Mixed-Direction Examples

Bidirectional text frequently arises in multilingual settings where left-to-right (LTR) elements, such as email addresses, are embedded within right-to-left (RTL) scripts like Arabic. For instance, an Arabic sentence describing a contact might include an LTR email address, which the bidirectional algorithm keeps in its left-to-right order while the surrounding Arabic flows from right to left. Similarly, Hebrew documents often incorporate LTR URLs, such as "https://www.example.com", maintaining their sequential order amid RTL text to ensure hyperlinks remain functional and readable. Product labels in bilingual markets, like those combining English brand names with Arabic descriptions, rely on bidi handling to display prices or instructions without visual disruption, as seen in consumer goods sold across the Middle East.

The visual rendering of mixed-direction text involves reordering characters according to the Unicode Bidirectional Algorithm, grouping LTR segments appropriately within RTL contexts. A representative example is the English phrase "Price: $100" inserted into an Arabic paragraph. The Latin letters are strong LTR; the currency sign and digits resolve to LTR because the nearest preceding strong character is a Latin letter; and the colon and space between them follow suit. The whole phrase therefore forms a single LTR run that displays as "Price: $100" in its normal internal order, while the Arabic words around it flow right to left. Neutral characters such as commas or parentheses that sit between two runs of the same direction join those runs, whereas neutrals at a boundary between opposite directions take the embedding direction, which matters in mixed sentences like Arabic text quoting English prices or dates.

Software applications handle these scenarios through built-in bidi support. Microsoft Word introduced comprehensive RTL and bidirectional features in Office 2000, enabling users to toggle paragraph directions, embed LTR runs for emails or URLs, and process mixed-script documents like bilingual reports.
Web browsers such as Firefox and Chrome implement the algorithm natively via CSS properties like unicode-bidi and direction, rendering inline mixed content—such as Hebrew pages with embedded English URLs—correctly across platforms. Culturally, bidirectional text appears in public signage across diverse regions. In Israel, road signs use trilingual layouts with Hebrew (RTL) on top, followed by Arabic (RTL) and English (LTR) on separate lines, avoiding inline mixing to simplify reading for tourists and locals alike. In the UAE, directional signs and product labels pair Arabic (RTL) with Latin (LTR) scripts, often isolating English terms like brand names or prices to maintain clarity in high-traffic areas like Dubai.

Non-Alphabetic Systems

In ancient Egyptian hieroglyphic writing, the direction of reading is determined by the orientation of human and animal figures, which typically face toward the beginning of the text; a rightward-facing preference is dominant, allowing text to be arranged left-to-right or right-to-left accordingly. This flexibility introduces bidirectional elements, as inscriptions on monuments or papyri could reverse direction within a single composition to suit artistic layouts. In the modern Gardiner's sign list, a standardized catalog of over 700 hieroglyphs compiled by Egyptologist Alan H. Gardiner, signs are conventionally oriented to face right, facilitating consistent scholarly transcription while preserving the script's inherent directional variability. CJK (Chinese, Japanese, and Korean) scripts, which employ logographic Han characters, are primarily rendered left-to-right in horizontal lines or top-to-bottom in vertical columns in contemporary usage, though ancient inscriptions and seals often exhibit bidirectional or rotational arrangements. For instance, Chinese seal scripts on chops or imprints frequently arrange characters in anti-clockwise or clockwise sequences to fit circular or square forms, requiring readers to interpret direction based on context rather than linear flow. In the Unicode Standard, CJK characters are classified with the bidirectional class "L" (left-to-right), treating them as strong directional elements that do not inherently support right-to-left overrides, though neutral punctuation may interact in mixed-text scenarios. Boustrophedon writing, meaning "as the ox turns" in Greek, features alternating line directions—left-to-right followed by right-to-left—creating a bidirectional pattern akin to plowing a field; this method appears in various ancient non-alphabetic scripts but lacks native support in Unicode, relying on manual formatting or specialized tools for reproduction. 
The Mayan script of Mesoamerica, used from approximately 300 BCE to 900 CE, often employed boustrophedon in double-column blocks, where glyphs reversed direction at line ends to fill codex pages efficiently. Similarly, the undeciphered Rongorongo script of Easter Island, dating to the 19th century or earlier, follows a reverse boustrophedon style, with lines read right-to-left then flipped 180 degrees for the next, as evidenced in surviving wooden tablets. Moon type, a tactile writing system developed in 1845 by British inventor William Moon for blind readers, adapts simplified Latin-derived symbols embossed on paper and employs a boustrophedon layout, alternating left-to-right and right-to-left across lines to optimize page space and finger navigation. This mirroring approach made it accessible for illiterate or elderly blind individuals familiar with print shapes, contrasting with Braille's fixed left-to-right progression. Historically promoted in 19th-century Britain by the British and Foreign Blind Association, Moon type saw widespread use until the early 20th century, with approximately 300 books and other works produced.

Challenges and Considerations

Rendering Issues

Rendering bidirectional text presents several challenges across different platforms and software implementations, often deviating from the ideal standards outlined in the Unicode Bidirectional Algorithm. One common issue arises in legacy software, where incorrect reordering of characters occurs due to incomplete support for complex scripts. For instance, prior to the introduction of Uniscribe in Windows 2000, earlier versions like Windows 95 and 98 lacked robust bidirectional handling, leading to garbled display of mixed-direction text such as Arabic embedded in English. Nested embeddings exacerbate these problems, as improper nesting of directional controls can cause punctuation and neutral characters to associate with the wrong embedding level, resulting in visually incorrect layouts. The Unicode standard addresses this through explicit directional isolates (e.g., RLI, LRI, PDI) to prevent interference between embedded segments, but many implementations fail to handle deep nesting correctly, leading to reversed or misaligned text blocks. Platform variations further complicate rendering, with notable differences between mobile operating systems. iOS introduced stronger RTL support with Auto Layout in iOS 6 (2012), enabling automatic mirroring of layouts for languages like Arabic and Hebrew, though full UI overhauls came in iOS 9 (2015). In contrast, Android's bidirectional support evolved unevenly; versions prior to 4.2 (2012) offered minimal RTL handling, while later releases like Android 5.0+ integrated better bidi via HarfBuzz, but inconsistencies persist in text shaping for mixed scripts across devices. On the web, inconsistencies arise without explicit CSS controls like unicode-bidi: bidi-override, as browsers may apply the bidirectional algorithm differently, leading to erratic reordering of inline elements. 
For example, Firefox and Chrome have historically diverged in handling SVG RTL text with overrides, requiring developers to use the bdo element or directional formatting codes for consistent isolation. Accessibility challenges include screen readers mishandling bidirectional directions and multilingual content, which can disrupt the logical reading order for users. Developers are recommended to test with tools such as the Unicode text-rendering test suite, which includes bidi conformance tests, to ensure proper isolation and directionality in applications.

Security Implications

Bidirectional text introduces significant security risks, particularly through homograph attacks where right-to-left (RTL) characters are used to visually mimic left-to-right (LTR) domains, deceiving users into interacting with malicious sites. For instance, attackers can embed RTL override characters to reverse the apparent order of characters in a URL, making a phishing domain like a spoofed "apple.com" appear legitimate while directing to a harmful endpoint. This technique, known as BiDi Swap, exploits the Unicode Bidirectional Algorithm's handling of mixed-direction scripts to create deceptive links that bypass casual inspection. BiDi Swap attacks continue to be reported in phishing campaigns as of 2025. Phishing campaigns have leveraged these vulnerabilities, with the Unicode Technical Standard #39 (UTS #39) highlighting bidi spoofing as a persistent threat since at least Unicode 10.0 in 2017, where reordering of characters like "A1<שׂ" to resemble "Αשֺ>1" enables visual confusion in identifiers such as email addresses or domains. In response, browsers implemented mitigations; Google Chrome addressed incorrect handling of RTL domains in its Omnibox via CVE-2018-18348, introducing stricter RTL detection and display rules starting with version 71.0.3578.80 in late 2018 to prevent spoofing. Algorithmic exploits further compound these risks, as overly long bidirectional embeddings or control sequences can trigger buffer overflows in text parsers. A notable example is CVE-2014-8146 in the International Components for Unicode (ICU) library, where the resolveImplicitLevels function in the bidirectional algorithm suffers a heap-based buffer overflow due to improper bounds checking on input levels, potentially leading to remote code execution in affected applications. 
Bidirectional override characters have also been abused in supply-chain attacks, as detailed in the Trojan Source vulnerability (CVE-2021-42574), where hidden RTL/LTR controls placed in comments and strings make source code display in a different order than the compiler reads it, concealing malicious logic from reviewers while the code still compiles and runs. Mitigations emphasize robust input validation and user warnings for mixed-script content; UTS #39 recommends restriction levels that limit dangerous character mixing, along with confusable detection based on normalized skeletons. Unicode 15.0 (2022) enhanced security by reclassifying joiner control characters (e.g., U+200D Zero Width Joiner) as restricted in identifier contexts, reducing opportunities for isolation bypasses in bidirectional processing, while updates to UAX #9 improved algorithm isolation to counter reordering exploits. Browsers and libraries continue to evolve: Chrome's IDN spoof checker enforces script-specific policies and falls back to Punycode display for suspicious domains.
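One common defence against Trojan Source style attacks is to scan source files for bidirectional control characters before review. The following is a minimal sketch of such a scanner; the function name `find_bidi_controls` and the reporting format are assumptions made for this example, and real linters or compilers may apply different rules.

```python
# Explicit bidi controls that can reorder displayed text (embeddings,
# overrides, isolates, and their terminators).
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(source: str) -> list:
    """Return (line, column, control name) for each bidi control in source.

    Hypothetical helper. Lines and columns are 1-based; lines are split on
    "\n" only, so the numbers match what most editors show.
    """
    findings = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            name = BIDI_CONTROLS.get(ch)
            if name is not None:
                findings.append((line_no, col, name))
    return findings


# A comment carrying an override and an isolate, as in the attack pattern.
sample = "a = 1\nif x: # \u202e begin \u2066\n"
report = find_bidi_controls(sample)
```

A pre-commit hook or review tool could reject or highlight any file for which the report is non-empty, leaving legitimate uses to be approved explicitly.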
