Character encoding
Character encoding is the process of assigning numerical values to symbols in a writing system, such as letters, digits, and punctuation, to enable computers to store, transmit, and display textual data in binary form.[1] This mapping, often called a coded character set, transforms human-readable characters into sequences of bytes that hardware and software can process efficiently.[2] Essential for digital communication, character encoding ensures compatibility across systems but has evolved due to the need to support diverse languages and scripts beyond early limitations.[3]

The history of character encoding traces back to early telegraphy and computing needs, with initial systems like the Baudot code in the 1870s using 5-bit representations for limited symbols.[4] In the 1960s, as computers proliferated, standardized encodings emerged to address interoperability; the American Standard Code for Information Interchange (ASCII), published by the American Standards Association (later ANSI) in 1963, became the foundational 7-bit scheme supporting 128 characters primarily for English text, assigning values like 65 for uppercase 'A'.[5] Concurrently, IBM introduced EBCDIC in 1963 for its mainframes, an 8-bit encoding that prioritized punched-card compatibility but differed from ASCII, leading to fragmentation.[4] By the 1980s, the rise of global computing exposed ASCII's limitations in handling non-Latin scripts, prompting extensions like the ISO/IEC 8859 series (starting 1987), which provided 8-bit encodings for specific languages, such as ISO-8859-1 for Western European characters.[5] These national and regional standards, while useful, created a "Tower of Babel" with over 800 variants, complicating multilingual data exchange.[6]

The solution arrived with Unicode, a universal standard initiated in 1987 by engineers at Apple and Xerox and first published in 1991, which assigns unique code points (non-negative integers) to 159,801 characters from 172 scripts as of version 17.0 in 2025.[7] Unicode's architecture separates abstract characters from their serialized forms, allowing flexible encodings like UTF-8 (variable-width, backward-compatible with ASCII), UTF-16, and UTF-32.[2] Today, Unicode dominates modern computing, underpinning the web, operating systems, and international standards like ISO/IEC 10646, with which it is harmonized, and supporting bidirectional text, emoji, and historic scripts.[8] Its adoption has resolved many legacy issues, but challenges persist in legacy systems, conversion errors (mojibake), and ensuring universal accessibility for all languages.[3]
History
Early Developments
The origins of character encoding trace back to pre-digital communication systems designed for efficient transmission over limited channels. In the 1830s, Samuel F.B. Morse and Alfred Vail developed Morse code for use with the electric telegraph, representing letters, numbers, and punctuation as unique sequences of short (dots) and long (dashes) signals, which functioned as an early binary-like encoding scheme to convert textual information into transmittable pulses.[9] This system marked a foundational step in abstracting characters into discrete, machine-interpretable forms, though it was variable-length and optimized for human operators rather than direct machine processing. By the 1870s, Émile Baudot advanced this concept with the Baudot code, a fixed-length 5-bit binary encoding for telegraphy that assigned 32 distinct combinations to letters, figures, and control signals, enabling faster and more automated multiplexing of messages across wires.[10][11]

The late 19th century saw the emergence of punched-card systems for data processing, bridging telegraphy and computing. In 1890, Herman Hollerith introduced punched-card tabulation for the U.S. Census, where holes punched in specific positions on cards encoded demographic data as machine-readable patterns, processed electrically by tabulating machines to tally and sort information at scale.[12][13] This innovation shifted text and numeric data into a physical, binary-inspired format—presence or absence of a punch—facilitating the first large-scale automated handling of character-based records and laying groundwork for stored-program computers.

Early electronic computers in the 1940s and 1950s built on these ideas with custom encodings tailored to hardware constraints. The ENIAC, completed in 1945, primarily handled numeric computations using a decimal architecture where each digit was represented by a 10-position ring counter of flip-flops (effectively a one-hot encoding of the values 0 through 9), with provisions for sign and later extensions to alphanumeric input via plugboards and switches.[14][15] By the early 1950s, machines like the Ferranti Mark 1 adopted 5-bit codes derived from teleprinter standards for input and output, supporting 32 basic symbols including uppercase letters and numerals, augmented by shift mechanisms to access additional characters and control functions without exceeding the bit limit.[16][17] These machine-specific schemes prioritized efficiency for limited memory and I/O, often favoring uppercase-only text and numeric data to fit within word sizes like 20 or 40 bits.
Standardization Efforts
The European Computer Manufacturers Association (ECMA), now known as Ecma International, was established in 1961 to promote standards in information and communication technology, with significant work on character encoding beginning in the early 1960s through technical committees like TC4, which addressed optical character recognition and related sets by 1964.[18] This effort contributed to early international harmonization, including the development of ECMA-6 in 1965, a 7-bit code aligned with emerging global needs. In parallel, the American Standards Association (ASA) published the first edition of the ASCII standard (X3.4-1963) on June 17, 1963, defining a 7-bit code with 128 positions primarily for English alphanumeric characters, control functions, and basic symbols to facilitate data interchange in telecommunications and computing.[19] A major revision followed in 1967 as USAS X3.4-1967, refining assignments such as lowercase letters and punctuation for broader adoption.[20] This national standard influenced international efforts, leading to its adoption as ISO Recommendation R 646 in 1967, which specified a compatible 7-bit coded character set for information processing interchange.[21]

The formation of ISO/IEC JTC 1/SC 2 in 1987 marked a key institutional milestone, building on earlier ISO/TC 97/SC 2 work from the 1960s to standardize coded character sets, including graphic characters, control functions, and string ordering for global compatibility.[22] Under this subcommittee, the ISO/IEC 8859 series emerged in the late 1980s and 1990s, extending to 8-bit encodings for regional scripts; for instance, ISO 8859-1 (Latin-1) was first published in 1987, supporting 191 graphic characters for Western European languages while maintaining ASCII compatibility in the lower 128 positions.[23] Subsequent parts, such as ISO 8859-2 (Latin-2) for Central European languages in 1987 and ISO 8859-9 (Latin-5) for Turkish in 1989, addressed diverse linguistic needs through separate regional repertoires.

A pivotal event occurred in 1986, when the subcommittee (then still ISO/TC 97/SC 2) established Working Group 2 (WG 2) to develop a universal coded character set, resulting in the initial draft of ISO 10646, which aimed to unify disparate encodings into a comprehensive repertoire exceeding 65,000 characters across scripts.[24] This effort laid the groundwork for collaboration with emerging initiatives like Unicode, fostering a single global framework by the early 1990s.
Evolution to Unicode
The Unicode Consortium was established on January 3, 1991, in California, as a nonprofit organization dedicated to creating, maintaining, and promoting a universal character encoding standard to address the fragmentation of existing systems. This initiative arose from collaborative efforts involving engineers from Xerox and Apple, who had developed proprietary encodings like Xerox's Character Code Standard (XCCS) and Apple's Macintosh Roman, alongside the emerging ISO/IEC 10646 project for a 31-bit universal coded character set.[25] The consortium's formation facilitated the merger of these parallel developments, harmonizing Unicode with ISO 10646 by 1993 to ensure synchronized evolution and global interoperability.[26]

Unicode 1.0, released in October 1991, marked the standard's debut with support for approximately 7,100 characters, primarily covering Western European languages through extensions of ASCII, as well as initial inclusions for scripts like Greek and Cyrillic.[27] Subsequent versions rapidly expanded the repertoire to encompass a broader array of writing systems; for instance, the Arabic script was fully integrated in Unicode 1.1 (1993), enabling proper representation of right-to-left text and diacritics essential for languages across the Middle East and North Africa.[28] Further growth in the 2010s incorporated modern digital symbols, with emoji characters first standardized in Unicode 6.0 (2010), drawing from Japanese mobile phone sets to support expressive, cross-platform communication.[29]

The explosive expansion of the Internet during the 1990s, with user bases growing from millions to hundreds of millions globally, underscored the need for a single encoding capable of handling multilingual content without the constraints of regional standards like ISO 8859, which were limited to 256 characters per set and struggled with mixed-script documents. Unicode's adoption was accelerated by its design flexibility, particularly the introduction of UTF-8 in 1992 by Ken Thompson and Rob Pike at Bell Labs, which uses variable-length encoding (1 to 4 bytes per character) while ensuring the first 128 code points match ASCII exactly for seamless backward compatibility.[30] This allowed legacy ASCII-based systems, prevalent in early web infrastructure, to process Unicode text without modification, facilitating the web's transition to global, multilingual applications.

As of November 2025, Unicode 17.0—released on September 9, 2025—encodes 159,801 characters across 172 scripts, reflecting ongoing efforts to include underrepresented languages and cultural symbols.[31] Notable among these expansions is the addition of scripts like Nyiakeng Puachue Hmong in Unicode 12.0 (2019), an alphabet devised in the 1980s for Hmong communities in Southeast Asia and the diaspora, demonstrating Unicode's commitment to linguistic diversity and minority language preservation.[28] Unicode 17.0 further added four new scripts, including Beria Erfe and Sidetic, along with eight new emoji and other symbols.[32]
Core Terminology
Character and Glyph
In character encoding, a character is defined as the smallest component of written language that has semantic value, referring to the abstract meaning and/or shape of a unit of written text, independent of its specific visual representation, font, or script variation.[33] For instance, the character "A" represents a consistent informational unit, regardless of whether it appears in bold, italic, or otherwise stylized forms across different typefaces. This abstraction allows characters to serve as units of data organization, control, or representation in computing, as established in international standards.[34]

A glyph, in contrast, is the specific visual form or graphic symbol used to render a character on a display or in print, varying based on factors such as typeface, size, language context, or stylistic choices.[33] Examples include the serif-style "A" in Times New Roman versus the sans-serif "A" in Arial, or contextual glyph substitutions in scripts like Arabic where the same character adopts different shapes depending on its position in a word (initial, medial, final, or isolated). The term "glyph" originates from typography, where it denotes the carved or engraved shape of a letter, derived from the Ancient Greek gluphḗ meaning "carving," and was adapted for digital encoding in standards like ISO/IEC 10646 to describe these rendered forms.[35]

A fundamental distinction is that one character can correspond to multiple glyphs, enabling flexibility in presentation while preserving the underlying semantic content; conversely, a single glyph may represent multiple characters in certain cases, such as the Latin ligature "fi" (a combined glyph for the two distinct characters "f" and "i") to improve readability and aesthetics in typesetting.[33] This separation ensures that character encoding focuses on abstract information interchange, leaving glyph rendering to higher-level processes like font systems.
Character Repertoire and Set
In character encoding standards, the character repertoire refers to the complete, abstract collection of distinct characters that an encoding scheme is designed to represent, independent of their visual forms or numeric assignments. This repertoire encompasses a finite or potentially open set of abstract characters, such as letters, digits, symbols, and ideographs from various writing systems, ensuring comprehensive coverage for text interchange. For instance, the Unicode Standard's repertoire, as defined in ISO/IEC 10646, includes over 159,000 assigned abstract characters as of version 17.0 released in September 2025.[31][37]

A character set, in contrast, typically denotes a named, practical subset of the full repertoire or the entire repertoire itself when bounded for specific applications, often implying an organized grouping for encoding purposes. Examples include the Basic Latin character set, which covers the 128 characters of the ASCII standard, serving as a foundational subset within Unicode's broader repertoire. Character sets are thus more application-oriented, allowing systems to handle delimited portions of the repertoire efficiently without processing the exhaustive total.[33][37]

The repertoire fundamentally excludes glyphs, which are the visual representations of characters, focusing instead solely on abstract units as established in prior definitions of character and glyph distinctions. A key example of repertoire organization is Han unification in Unicode, where ideographic characters shared across Chinese, Japanese, and Korean scripts are consolidated into a single abstract form to optimize space and promote interoperability, resulting in a unified subset within the overall repertoire. This approach highlights the repertoire's exhaustive, abstract nature versus the bounded, implementable scope of character sets, enabling scalable support for global scripts.[38][37]
Code Points and Code Space
In character encoding, a code point is a unique integer value assigned to represent an abstract character within a coded character set.[39] This numeric identifier serves as an address in the encoding's abstract space, allowing unambiguous reference to characters regardless of how they are stored or displayed. For instance, in the Unicode standard, the code point U+0041 denotes the Latin capital letter "A".[40]

The code space encompasses the entire range of possible code points defined by an encoding standard, forming a contiguous or structured set of nonnegative integers available for character assignment.[39] This range determines the encoding's capacity to represent characters, with the size of the code space calculated as the maximum code point value minus the minimum value plus one. In Unicode, the code space spans from 0 to 10FFFF in hexadecimal (equivalent to 0 to 1,114,111 in decimal), providing 1,114,112 possible code points across 17 planes of 65,536 positions each.[40] By contrast, the American Standard Code for Information Interchange (ASCII) uses a 7-bit code space from 0 to 127, accommodating 128 positions for basic Latin characters and control codes.[41]

Within the code space, not all positions are immediately assigned to characters from a given repertoire; unassigned code points are explicitly reserved for future extensions or additions to ensure long-term stability and expandability of the encoding.[42] These reservations prevent conflicts as new characters, such as those from emerging scripts or symbols, are incorporated over time.
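The relationship between characters, code points, and the size of the code space can be checked directly in a high-level language. The following Python sketch (purely illustrative) uses the built-in ord() and chr() functions, which convert between characters and their Unicode code points.

```python
# Illustrative sketch: characters, code points, and the bounds of the code space.
assert ord("A") == 0x41            # U+0041 LATIN CAPITAL LETTER A
assert chr(0x1F600) == "😀"        # a code point outside the Basic Multilingual Plane

# The Unicode code space runs from U+0000 to U+10FFFF: 17 planes of 65,536 positions.
code_space_size = 0x10FFFF - 0x0000 + 1
assert code_space_size == 17 * 65_536 == 1_114_112

# ASCII's 7-bit code space holds 128 positions (0 through 127).
assert len(range(0, 128)) == 128
```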
Code Units and Encoding Forms
In character encoding, a code unit is the minimal bit combination that can represent a unit of encoded text for processing or interchange.[33] The Unicode Standard defines three primary encoding forms—UTF-8, UTF-16, and UTF-32—each utilizing code units of fixed sizes: 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32.[33] An encoding form specifies the mapping from code points (abstract numerical identifiers for characters) to sequences of one or more code units, enabling the representation of the full Unicode code space within the constraints of the chosen unit size.[43] These forms handle the internal binary packaging of code points without specifying byte serialization or order, which is addressed by encoding schemes.[33] For instance, code points serve as the input to these forms, transforming abstract values into storable or transmittable sequences.[43]

In UTF-8, code points are encoded variably using 1 to 4 bytes depending on their value: code points U+0000 to U+007F require 1 byte (identical to ASCII), U+0080 to U+07FF use 2 bytes, U+0800 to U+FFFF use 3 bytes, and U+10000 to U+10FFFF use 4 bytes.[44] This variable-length approach distributes the bits of the code point across bytes, with leading bytes using bit patterns (e.g., 0xxxxxxx for 1-byte, 110xxxxx for 2-byte starters) to indicate sequence length and continuation bytes marked by 10xxxxxx.[44]

UTF-16 employs fixed 16-bit code units for code points in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), representing each as a single unit.[44] For supplementary code points beyond U+FFFF (up to U+10FFFF), UTF-16 uses surrogate pairs: two consecutive 16-bit code units, where the first (high surrogate) ranges from U+D800 to U+DBFF and the second (low surrogate) from U+DC00 to U+DFFF, together encoding the full value. This mechanism allows UTF-16 to cover the entire Unicode repertoire with at most two code units per character.[44]

UTF-32, in contrast, uses a single 32-bit code unit for every code point, providing a straightforward fixed-width encoding but at the cost of greater storage for most text.[44] Encoding forms like these focus solely on the code unit sequences derived from code points, distinguishing them from encoding schemes that incorporate byte-order marks or endianness for serialized output.[43]
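As a concrete illustration of code units, the Python sketch below (assuming Python 3 and its standard codecs) encodes the same characters in all three Unicode encoding forms and counts the resulting code units.

```python
# Illustrative sketch: one code point, different numbers of code units per encoding form.
ch = "𐍈"  # U+10348, a supplementary-plane character

utf8  = ch.encode("utf-8")      # 8-bit code units
utf16 = ch.encode("utf-16-be")  # 16-bit code units (big-endian, no BOM)
utf32 = ch.encode("utf-32-be")  # 32-bit code units (big-endian, no BOM)

assert len(utf8) == 4           # four 8-bit code units
assert len(utf16) // 2 == 2     # two 16-bit code units (a surrogate pair)
assert len(utf32) // 4 == 1     # one 32-bit code unit

# An ASCII character fits in a single code unit of every form.
assert len("A".encode("utf-8")) == 1
assert len("A".encode("utf-16-be")) // 2 == 1
assert len("A".encode("utf-32-be")) // 4 == 1
```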
Encoding Principles
Coded Character Sets
A coded character set (CCS) is a mapping from a set of abstract characters in a character repertoire to nonnegative integers known as code points, within a defined code space.[45] This mapping must be injective, ensuring each character in the repertoire corresponds to a unique code point, to avoid ambiguity in representation.[45] For a fixed-width CCS, the mapping is often bijective, assigning a character to every position in a 7-bit or 8-bit code space. A classic example is the American Standard Code for Information Interchange (ASCII), standardized internationally as ISO/IEC 646, which defines a 7-bit CCS mapping 128 characters—including 95 printable graphics and 33 controls—to code points from 0 to 127.[46] Similarly, ISO/IEC 8859-1 extends this to an 8-bit CCS for Western European languages, defining a repertoire of 191 graphic characters mapped to code points 32 to 126 and 160 to 255, while positions 127 to 159 remain undefined or reserved for controls.[47]

Coded character sets originated as the foundational mechanism for digital text representation in early computing, providing a simple, fixed-width assignment of numbers to characters before the demands of global scripts necessitated multi-byte approaches. In modern standards like the Universal Coded Character Set (UCS) of ISO/IEC 10646—which aligns with Unicode—the CCS expands to a 21-bit code space organized into 17 planes of 65,536 code points each, enabling over 1 million possible assignments while maintaining the core principle of unique character-to-code-point injections.[48]

One key limitation of traditional CCS designs is their fixed-size structure, which assumes a one-to-one correspondence of one code point per base character, restricting support for scripts requiring diacritics or ligatures without separate combining mechanisms.[45] This approach sufficed for limited repertoires like Latin alphabets but proved inadequate for multilingual needs, prompting evolutions in encoding standards.
Transformation Formats
Transformation formats, also known as Character Encoding Schemes (CES), provide methods to serialize sequences of code units from an encoding form into byte sequences suitable for storage and transmission, ensuring reversibility and compatibility with byte-oriented systems.[49] A CES combines the rules of an encoding form—such as the fixed-width 16-bit code units of UTF-16—with serialization conventions like byte order and variable-length mapping, allowing adaptation of coded character sets to practical binary representations.[49]

A prominent example is UTF-8, a variable-length CES that encodes Unicode code points using 1 to 4 bytes per character, where the first 128 code points (U+0000 to U+007F) are represented as single bytes identical to ASCII, preserving compatibility with legacy 7-bit systems.[50] This design enables efficient handling of multilingual text by allocating fewer bytes to common Latin characters while supporting the full Unicode repertoire through multi-byte sequences for higher code points.

For fixed-width formats like UTF-16 and UTF-32, which use 16-bit or 32-bit code units respectively, byte serialization must account for endianness—the order of byte transmission in multi-byte units. The Byte Order Mark (BOM), represented by the Unicode character U+FEFF, serves as a signature at the start of a data stream to indicate the endianness: in big-endian UTF-16, it appears as the byte sequence FE FF, while in little-endian it is FF FE. UTF-16 further employs surrogate pairs to extend its 16-bit code units to cover the full 21-bit Unicode code space; a pair consists of a high surrogate from the range U+D800 to U+DBFF followed by a low surrogate from U+DC00 to U+DFFF, together encoding code points from U+10000 to U+10FFFF.[49]

These transformation formats offer significant advantages in space efficiency, particularly for ASCII-heavy text, as UTF-8 requires only 1 byte per character for the basic Latin alphabet compared to 2 bytes in UTF-16 or 4 in UTF-32. As a result, UTF-8 has become the dominant encoding for web content, used by 98.8% of websites as of November 2025.[51]
Higher-Level Protocols
Higher-level protocols build upon character encoding schemes to ensure reliable transmission and interpretation of text across networks and applications. In the Multipurpose Internet Mail Extensions (MIME) standard, character sets are declared using parameters in headers, such as Content-Type: text/plain; charset=US-ASCII, allowing email clients and other systems to decode messages correctly.[52] Similarly, the Hypertext Transfer Protocol (HTTP) uses the Content-Type header to specify the media type and charset, for instance Content-Type: text/html; charset=UTF-8, which informs web browsers how to render the content without misinterpretation of bytes as characters.[53] These declarations integrate encoding forms like UTF-8 or UTF-16 into layered communication stacks, preventing issues such as mojibake during data exchange.[54]
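As a rough sketch of how such a declaration is consumed, the following Python snippet extracts the charset parameter from a Content-Type value and uses it to decode a payload; the charset_from_content_type helper and the sample header are illustrative, not part of any standard library API.

```python
# Illustrative sketch: honoring a declared charset when decoding a payload.
def charset_from_content_type(header: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type header, or a default."""
    for param in header.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset":
            return value.strip().strip('"').lower()
    return default

header = "text/html; charset=UTF-8"          # example declaration
payload = "Caf\u00e9".encode("utf-8")        # bytes received on the wire
print(payload.decode(charset_from_content_type(header)))  # Café
```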
Unicode normalization forms address variations in character representation to maintain consistency in protocols. Normalization Form C (NFC) composes characters into precomposed forms where possible, such as combining a base letter and accent into a single code point (e.g., "é" as U+00E9), while Normalization Form D (NFD) decomposes them into base and combining marks (e.g., "e" + combining acute accent).[55] These forms ensure canonical equivalence, meaning NFC and NFD representations are semantically identical but may differ in storage, which is crucial for protocols handling user-generated content to avoid mismatches in searching, sorting, or collation.[55] Applications often normalize to NFC for compatibility with legacy systems, as it minimizes the length of encoded strings compared to decomposed forms.[55]
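A brief Python example (using the standard unicodedata module) makes the NFC/NFD distinction concrete.

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (NFD-style)
composed = unicodedata.normalize("NFC", decomposed)

assert composed == "\u00e9"                                  # single precomposed code point
assert unicodedata.normalize("NFD", composed) == decomposed  # round trip to the decomposed form
assert composed != decomposed                                # different code point sequences...
# ...but canonically equivalent, so comparisons should normalize both sides first:
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
```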
Escaping sequences in higher-level protocols protect special characters during transmission, preventing them from being interpreted as control codes. In HTML, entities like &amp; for the ampersand (&), &lt; for less-than (<), and &gt; for greater-than (>) escape markup-significant symbols, ensuring safe inclusion in documents without altering structure.[56] This mechanism, defined in the HTML specification, allows protocols to transport raw text while preserving its integrity across diverse systems.[57]
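In Python, the standard html module performs this escaping; the short sketch below shows a round trip.

```python
import html

raw = '5 < 6 & "quotes"'
escaped = html.escape(raw)            # '5 &lt; 6 &amp; &quot;quotes&quot;'
assert html.unescape(escaped) == raw  # unescaping restores the original text
print(escaped)
```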
The Extensible Markup Language (XML) exemplifies protocol-level encoding integration by requiring support for UTF-8 and UTF-16, with UTF-32 permitted as an optional encoding, and an optional encoding declaration like <?xml version="1.0" encoding="UTF-8"?> at the document's start.[58] This declaration signals the processor to use the specified transformation format for parsing, ensuring interoperability in data exchange standards such as SOAP or RSS feeds.[58] Without it, XML defaults to UTF-8 or UTF-16 based on the byte order mark, but explicit declaration enhances robustness in multi-encoding environments.[58]
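The following minimal Python example (using the standard xml.etree.ElementTree parser) shows a declaration being honored when a document is parsed from raw bytes.

```python
import xml.etree.ElementTree as ET

# The declaration tells the parser which transformation format the bytes use.
doc = '<?xml version="1.0" encoding="UTF-8"?><note>café</note>'.encode("utf-8")
root = ET.fromstring(doc)
print(root.tag, root.text)  # note café
```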
Unicode Standard
Abstract Character Repertoire
The abstract character repertoire of the Unicode Standard represents a vast, curated collection of characters drawn from the world's writing systems, symbols, and notations, serving as the foundational set of abstract characters available for encoding. This repertoire comprises 159,801 characters across 172 scripts, encompassing both contemporary languages and historical notations to support global text interchange.[31] For instance, it includes modern scripts such as Devanagari for Hindi and Latin for English, alongside historic ones like Linear B, an ancient Mycenaean Greek syllabary, as well as emoji, whose repertoire has continued to grow across recent versions to support digital communication. The repertoire is designed to be open-ended, allowing for ongoing expansion to accommodate evolving linguistic needs without disrupting existing encodings.[49]

A key principle in constructing this repertoire is Han unification, which merges ideographic characters shared across Chinese, Japanese, Korean, and Vietnamese writing systems to minimize redundancy while preserving cultural distinctions through glyph variation. Under Han unification, a single code point is assigned to visually similar ideographs that represent the same abstract character, such as the shared form for "mountain" (山) used in multiple East Asian contexts, reducing the total number of unique code points required.[33] This approach, developed through collaboration among ideograph experts, ensures efficient storage and processing while relying on font rendering and normalization processes to display appropriate variants.[59]

The Unicode repertoire is organized into 17 planes, each containing 65,536 potential code points, to systematically allocate characters by category and rarity. Plane 0, known as the Basic Multilingual Plane (BMP), holds the most commonly used characters, including scripts for major world languages and basic symbols, covering the initial 65,536 code points from U+0000 to U+FFFF. Plane 1, the Supplementary Multilingual Plane (SMP), extends support for less common and historic scripts, such as Egyptian Hieroglyphs and Anatolian Hieroglyphs. As of 2025, Plane 3, the Tertiary Ideographic Plane (TIP), has seen partial allocation for ancient scripts, including provisional spaces for oracle bone and seal scripts to accommodate expanded Han-related historic material.[60]

The core repertoire deliberately excludes private use areas, which are reserved code point ranges (such as Planes 15 and 16) for implementation-specific characters defined by private agreements rather than the standard itself, ensuring the official set remains universally interoperable. Character variants, including compatibility decompositions for legacy encodings, are managed through Unicode normalization forms rather than duplicating code points in the repertoire, promoting consistency across systems.[49]
Encoding Forms and Schemes
The Unicode Standard defines three primary encoding forms—UTF-8, UTF-16, and UTF-32—that transform sequences of Unicode code points into streams of code units suitable for storage, processing, or transmission in binary format. These forms operate on the abstract repertoire of Unicode characters, ensuring a consistent mapping from code points to bytes while optimizing for different use cases such as space efficiency, processing speed, or simplicity. Encoding schemes further specify how these code units are serialized into bytes, particularly addressing byte order variations.[61][62]

UTF-8 is a variable-length encoding form that represents each Unicode code point using one to four 8-bit code units (octets). It achieves backward compatibility with ASCII by encoding the 128 ASCII characters (code points U+0000 to U+007F) as single bytes identical to their ASCII values, with the binary pattern 0xxxxxxx. For code points beyond this range, UTF-8 employs multi-byte sequences distinguished by leading bit patterns that indicate the sequence length and allow self-synchronization for error detection and parsing. The specific bit patterns are as follows:
| Bytes | 1st Byte Binary | 2nd Byte Binary | 3rd Byte Binary | 4th Byte Binary |
|---|---|---|---|---|
| 1 | 0xxxxxxx | | | |
| 2 | 110xxxxx | 10xxxxxx | | |
| 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
x represents bits from the code point value, with continuation bytes always starting with 10. This design minimizes overhead for Latin scripts while supporting the full Unicode range up to U+10FFFF, excluding surrogates and noncharacters. UTF-8's efficiency for English and European text, combined with its ASCII transparency, has led to its widespread adoption as the preferred encoding for web content, HTML, XML, and plain text files.[62][63]
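The bit patterns in the table can be reproduced with a few shifts and masks. The sketch below is a simplified illustration (it omits the validity checks for surrogates and other ill-formed values that a real encoder must perform) and cross-checks its output against Python's built-in UTF-8 codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Simplified UTF-8 encoder following the bit patterns above (no validity checks)."""
    if cp < 0x80:                            # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),         # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):     # A, é, €, 😀
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```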
UTF-16 is a variable-width encoding form that uses 16-bit code units to represent characters, encoding code points in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) with a single code unit and supplementary code points (U+10000 to U+10FFFF) with two code units known as a surrogate pair. A surrogate pair consists of a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF); the pair's numerical value reconstructs the original code point via the formula (high - 0xD800) × 0x400 + (low - 0xDC00) + 0x10000. This allows UTF-16 to cover the entire Unicode space while maintaining compatibility with 16-bit systems. UTF-16 is the native internal encoding for Java strings and is extensively used in Windows operating system APIs and components for text processing.[62][44][64][65]
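The surrogate-pair arithmetic described above can be checked in a few lines of Python (illustrative only), comparing against the built-in UTF-16 codec.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its high and low surrogate values."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = to_surrogate_pair(0x1F600)                      # U+1F600 GRINNING FACE
assert (high, low) == (0xD83D, 0xDE00)
# Reconstruction with the formula quoted in the text:
assert (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000 == 0x1F600
# The codec serializes the same two 16-bit code units (big-endian shown):
assert "😀".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```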
UTF-32 is a fixed-width encoding form that maps each Unicode code point directly to a single 32-bit code unit, equivalent to the code point's numerical value padded to 32 bits. This one-to-one correspondence simplifies random access, indexing, and arithmetic operations on text but results in higher storage and bandwidth requirements, as every character occupies four bytes regardless of its value. UTF-32 is particularly suited for applications where processing efficiency outweighs space concerns, such as in-memory representations during computation.[62][6]
The distinction between encoding forms and schemes arises in byte serialization, primarily for multi-byte code units in UTF-16 and UTF-32, where endianness (the order of byte storage) must be specified. Big-endian (BE) places the most significant byte first, while little-endian (LE) places the least significant byte first, leading to schemes such as UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE. UTF-8, being octet-based, has no endianness issues and uses a single scheme. To resolve ambiguity in endianness, the Byte Order Mark (BOM)—the character U+FEFF—is commonly prefixed to streams: in UTF-16BE it is serialized as the byte sequence FE FF, and in UTF-16LE as FF FE. For UTF-8, the BOM is EF BB BF, serving as an optional encoding signature rather than a strict byte-order indicator, though its use is discouraged in protocols like HTTP to avoid issues with signature interpretation. The BOM enables automatic detection of the encoding scheme but should not be interpreted as a regular character if present.[61][66][67]
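A short Python sketch (using the standard codecs, including the "utf-8-sig" signature variant) shows how the same BOM character serializes under different schemes.

```python
bom = "\ufeff"
assert bom.encode("utf-16-be") == b"\xfe\xff"          # big-endian scheme
assert bom.encode("utf-16-le") == b"\xff\xfe"          # little-endian scheme
assert "x".encode("utf-8-sig")[:3] == b"\xef\xbb\xbf"  # UTF-8 signature, not a byte-order marker

# Python's generic "utf-16" codec writes a BOM and consumes it again on decoding.
data = "text".encode("utf-16")
assert data[:2] in (b"\xfe\xff", b"\xff\xfe")          # platform-dependent byte order
assert data.decode("utf-16") == "text"
```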
Code Point Allocation and Documentation
The allocation of code points in the Unicode Standard is managed by the Unicode Consortium through a rigorous proposal and review process to ensure global interoperability and cultural representation. Proposals for new characters or scripts are submitted to the Consortium, where they undergo evaluation by technical committees, including the Unicode Technical Committee (UTC) and script-specific ad hoc groups, based on criteria such as evidence of usage, distinctiveness from existing characters, and compatibility with encoding principles.[68] Once approved, code points are assigned from reserved areas in the code space, with strict stability policies prohibiting reallocation to maintain backward compatibility—assigned code points remain fixed across versions, and no code point can be repurposed for a different character.[69][70]

Code points are categorized to define their roles and behaviors, facilitating consistent implementation across systems. Key categories include Control (Cc) for legacy control codes like line feeds, Format (Cf) for characters that affect text layout without visible rendering such as zero-width joiners, and Private Use (Co) for vendor-specific or custom assignments that do not conflict with standard characters.[71] These categories, along with others like Letter (L) and Symbol (S), are part of the General Category property, which informs rendering, searching, and processing rules.

Documentation of code points is centralized in the Unicode Character Database (UCD), a collection of machine-readable files that provide comprehensive metadata for each assigned code point. The primary file, UnicodeData.txt, lists details for every character, including its formal name (e.g., "LATIN CAPITAL LETTER A" for U+0041), General Category, decomposition mappings for normalization, and numeric values where applicable.[72] Additional UCD files, such as PropList.txt for binary properties (e.g., Emoji=Yes) and Blocks.txt for grouping into named ranges, ensure developers and researchers can access authoritative properties like bidirectional class or script identification.[72]

Unicode versioning supports extensibility by incrementally adding code points in new blocks while preserving existing assignments, with each major release synchronizing with ISO/IEC 10646. For instance, Unicode 17.0, released on September 9, 2025, introduced blocks for four new scripts, including the Sidetic script (U+10940–U+1095F) from ancient Anatolia, expanding support for underrepresented languages.[31] Planes 15 (U+F0000–U+FFFFF) and 16 (U+100000–U+10FFFF) are fully reserved as Supplementary Private Use Areas-A and -B, respectively, allowing organizations to define up to 131,068 private code points without standardization.[73]

A representative example of code point allocation is emoji characters, which are documented in supplemental planes with specific properties for variation. Emoji base characters, such as the grinning face (U+1F600), reside in blocks like Emoticons and Supplemental Symbols and Pictographs, while the emoji modifiers U+1F3FB–U+1F3FF enable skin tone variations (e.g., medium skin tone) through modifier sequences, promoting inclusivity.[29] These are flagged in the UCD with Emoji=Yes and detailed in emoji-specific data files for consistent rendering across platforms.[72]
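Many UCD properties ship with programming language runtimes; Python's unicodedata module, for example, exposes names, General Category values, decompositions, and numeric values derived from UnicodeData.txt (expected output shown in comments).

```python
import unicodedata as ud

print(ud.name("A"))           # LATIN CAPITAL LETTER A
print(ud.category("A"))       # Lu  (Letter, uppercase)
print(ud.category("\u200d"))  # Cf  (Format) -- ZERO WIDTH JOINER
print(ud.decomposition("é"))  # 0065 0301  (canonical decomposition)
print(ud.numeric("٣"))        # 3.0 -- ARABIC-INDIC DIGIT THREE
print(ud.name("😀"))          # GRINNING FACE (U+1F600)
```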
Legacy and Regional Encodings
ASCII and Western Encodings
The American Standard Code for Information Interchange (ASCII), standardized in 1963 by the American Standards Association (now ANSI), defines a 7-bit character encoding scheme with 128 code points, ranging from 0 to 127.[74] This standard was developed to facilitate data interchange, particularly for telegraphic and early computing applications, where the 7-bit format aligned with teletype equipment limitations.[75] Within ASCII, code points 0–31 and 127 are designated as control characters for functions like carriage return and line feed, while 32–126 represent printable characters, including uppercase and lowercase letters, digits, and basic punctuation.[76] ASCII's 7-bit design allowed compatibility with transmission systems but restricted representation to just 128 characters, primarily suited for English text.[77]

As computing hardware evolved to use 8-bit bytes for storage and processing, extended 8-bit encodings emerged to utilize the full 256 possible code points, enabling support for additional Western European characters.[78] One prominent extension is ISO/IEC 8859-1, known as Latin-1, first published in 1987 by the International Organization for Standardization.[23] This 8-bit standard retains the ASCII subset in positions 0–127 and adds 96 graphic characters in the 160–255 range (leaving 128–159 to C1 control codes), including accented letters (e.g., á, ç, ñ) and symbols for languages like French, German, Spanish, and Italian, totaling 191 graphic characters.[23]

A widely used variant is Windows-1252, developed by Microsoft as a superset of ISO 8859-1 for Western European languages in its operating systems.[79] Introduced in the 1990s and finalized with updates in Windows 98 (1998), it differs from Latin-1 mainly in the 128–159 range by assigning printable characters instead of controls, including the euro symbol (€) at code point 128 to support the European currency introduced in 1999.[80][81]

Despite the rise of Unicode, ASCII and its Western extensions remain prevalent in legacy systems, such as older databases and protocols, due to their simplicity and backward compatibility.[82] Notably, ASCII forms a direct subset of UTF-8, where the first 128 code points are encoded identically as single bytes, ensuring seamless integration in modern applications without data loss. This compatibility has sustained their influence, though their 256-character limit highlights the need for broader encodings in multilingual contexts.[78]
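The practical difference between ISO 8859-1 and Windows-1252 is easy to demonstrate in Python, whose codecs include both ("latin-1" and "cp1252"); the sketch below is illustrative.

```python
data = bytes([0x80])                 # the single byte 0x80
print(data.decode("cp1252"))         # '€'    -- Windows-1252 assigns the euro sign here
print(repr(data.decode("latin-1")))  # '\x80' -- ISO 8859-1 leaves it as a C1 control

# The shared ASCII range decodes identically in Latin-1, Windows-1252, and UTF-8.
assert b"ASCII".decode("latin-1") == b"ASCII".decode("cp1252") == b"ASCII".decode("utf-8")
```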
EBCDIC and Mainframe Systems
EBCDIC, or Extended Binary Coded Decimal Interchange Code, is an 8-bit character encoding standard developed by IBM in 1963, providing 256 possible code points for representing characters primarily in mainframe computing environments.[83] It evolved from the earlier 6-bit BCDIC (Binary Coded Decimal Interchange Code) used for punched card data processing, retaining a zone-digit structure that groups characters into zones rather than a strictly sequential order.[84] This design results in a non-contiguous layout, where, for example, the decimal digits 0 through 9 are assigned to hexadecimal codes F0 through F9 (decimal 240–249), while uppercase letters A through I occupy C1 through C9, J through R are D1 through D9, and S through Z are E2 through E9, creating gaps between these ranges.[85]

EBCDIC remains the default encoding for data in IBM's z/OS operating system, which powers many enterprise mainframes, but it is fundamentally incompatible with ASCII due to differing bit assignments; for instance, the character "A" is encoded as 0xC1 in EBCDIC versus 0x41 in ASCII.[86] To support multilingual needs, IBM introduced the Coded Character Set Identifier (CCSID) system, which specifies variants of EBCDIC code pages; notable examples include CCSID 37 for U.S. English, CCSID 500 for Western European languages, and CCSID 1047, a Latin-1 compatible variant used for open systems integration.[87]

Despite the dominance of Unicode in contemporary computing, EBCDIC persists in legacy mainframe applications, particularly in banking and finance sectors where z/OS systems handle high-volume transaction processing.[88] Interoperability is achieved through conversion gateways and tools that transcode EBCDIC to Unicode, enabling data exchange with modern distributed systems while preserving the reliability of established mainframe infrastructures.[89]
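The incompatibility with ASCII can be illustrated with Python's cp037 codec, which implements the U.S. English EBCDIC code page (CCSID 37); the example is a minimal sketch.

```python
text = "A1"
ebcdic = text.encode("cp037")    # EBCDIC, U.S. English (CCSID 37)
ascii_ = text.encode("ascii")

assert ebcdic == b"\xc1\xf1"     # 'A' -> 0xC1, '1' -> 0xF1 in EBCDIC
assert ascii_ == b"\x41\x31"     # 'A' -> 0x41, '1' -> 0x31 in ASCII
assert ebcdic.decode("cp037") == ascii_.decode("ascii") == text  # same abstract characters
```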
CJK and Multilingual Encodings
Character encodings for Chinese, Japanese, and Korean (CJK) languages emerged in the 1980s and 1990s to handle the vast number of ideographic characters, which far exceeded the 256-code-point limit of single-byte systems like ASCII. These multi-byte encodings, primarily variable-width, were developed regionally to support local scripts on personal computers and Unix systems, incorporating thousands of hanzi (Chinese), kanji (Japanese), and hanja (Korean) characters alongside Latin and other scripts. Unlike fixed-width encodings, they used one or two bytes (or more in later extensions) per character, enabling efficient representation of complex writing systems while maintaining partial compatibility with ASCII.[90]

Shift-JIS, introduced in the 1980s by ASCII Corporation and adopted by Microsoft, encodes the JIS X 0208 standard, which defines 6,879 graphic characters including 6,355 kanji. It employs a variable-width scheme where half-width (single-byte) codes from JIS X 0201 handle Roman letters, half-width katakana, and controls (0x00–0x7F and 0xA1–0xDF), while full-width (double-byte) sequences (0x81–0x9F or 0xE0–0xEF for the first byte, followed by 0x40–0x7E or 0x80–0xFC for the second) represent JIS X 0208 characters, including double-byte katakana and kanji. This design allows seamless mixing with ASCII but leads to ambiguities in byte streams, as lead bytes overlap with some single-byte ranges. The Windows variant, Code Page 932 (CP932 or Windows-31J), extends Shift-JIS by adding 83 NEC Row 14 and IBM extensions for special symbols and rare kanji, registered with IANA as a distinct charset supporting JIS X 0201 and JIS X 0208.[91][38]

In China, GBK (1995) extended the GB 2312-1980 standard, which covered 6,763 simplified Chinese characters, to a total of 21,003 hanzi plus compatibility with GB 13000.1 (aligned to Unicode 1.0), using a double-byte structure similar to Shift-JIS: single bytes for ASCII (0x00–0x7F) and double-byte sequences with lead bytes 0x81–0xFE for Chinese characters. GBK was registered with IANA in 2002 but later superseded by the mandatory GB 18030-2000 standard, which ensures full Unicode 3.0 compliance by incorporating all 20,902 Basic Multilingual Plane CJK characters and adding four-byte sequences (first and third bytes in the range 0x81–0xFE, second and fourth bytes in 0x30–0x39) for rarely used extensions, while preserving backward compatibility with GBK and GB 2312. GB 18030-2022 further expanded coverage to 87,887 Chinese characters across multiple planes, aligning with Unicode 11.0 and supporting ethnic minority scripts like Uighur.[92][93][94]

Big5, developed in 1984 by Taiwan's Institute for Information Industry and widely used in Taiwan and Hong Kong for traditional Chinese, encodes 13,053 characters from the first two planes of CNS 11643-1992, prioritizing common hanzi. It uses a variable-width format with single-byte ASCII (0x00–0x7F) and double-byte hanzi (first byte 0xA1–0xF9, second byte 0x40–0x7E or 0xA1–0xFE), mapping to 11,625 unique codes plus punctuation. While not an official standard, Big5 became the de facto encoding for traditional Chinese on PCs; extensions like Big5-2003 incorporate CNS 11643 planes 3–6 for 48,000+ characters.
In Unix environments, the EUC-TW encoding represents CNS 11643 using Extended Unix Code (EUC), with up to four bytes per character from multiple planes, as defined by Taiwan's Chinese National Standard.[90][24]

Han unification in the Unicode Standard merged visually similar ideographs from CJK sources, reducing the required code points from over 120,000 in disparate legacy sets to approximately 93,000 unified ideographs across extensions (e.g., 20,992 in the core block plus 42,720 in Extension B). This approach, detailed in Unicode 1.0 documentation, minimized redundancy but introduced challenges in rendering, as fonts must select region-specific glyphs for the same code point.

Legacy issues persist in mixed-use scenarios, where misinterpreting bytes from Shift-JIS, GBK, or Big5 as another encoding causes mojibake—garbled text like reversed or substituted characters—particularly when ASCII and multi-byte sequences intermix without proper declaration. For instance, a Shift-JIS double-byte lead byte (0x81–0x9F) read as ISO-8859-1 yields Western symbols instead of kanji. Unicode addresses CJK through its unified model but requires careful transcoding to mitigate such artifacts from pre-Unicode systems.[38][95][96]
Transcoding and Interoperability
Conversion Processes
Transcoding involves converting sequences of bytes from one character encoding to another by mapping the underlying code points through an intermediate abstract character repertoire, most commonly Unicode. This process typically proceeds in three steps: first, the source encoding is decoded into a sequence of Unicode code points representing abstract characters; second, these code points are processed if necessary to handle equivalences or other transformations; and third, the code points are encoded into the target encoding's byte sequence. Using Unicode as the pivot ensures a standardized intermediate form that facilitates interoperability across diverse encodings, as implemented in libraries like ICU, where conversions always route through Unicode (UTF-16 internally).[97]

The fidelity of transcoding depends on the overlap between the source and target encoding repertoires. If all source characters have exact counterparts in the target, the conversion is lossless, preserving the original meaning and appearance. For instance, transcoding from ASCII to UTF-8 is lossless, as the 128 ASCII code points map directly to the first 128 Unicode code points, and UTF-8 encodes them using the same single-byte values. Conversely, conversions can be lossy when source characters fall outside the target's repertoire, leading to substitutions (e.g., replacement characters like �), omissions, or approximations that alter the text. An example is transcoding from ISO-8859-1 (Latin-1), which includes accented Western European characters, to GBK, a Chinese encoding primarily focused on CJK ideographs; while basic Latin letters map well, certain accents may lack precise equivalents in GBK's limited non-CJK extensions, resulting in data loss.[65]

To mitigate issues arising from Unicode's canonical equivalences—where multiple code point sequences represent the same abstract character—normalization is often applied during transcoding. Specifically, converting the intermediate Unicode sequence to Normalization Form C (NFC) before encoding into the target ensures that precomposed characters (e.g., é as a single code point U+00E9) are used where possible, improving compatibility with legacy encodings that may not handle decomposed forms (e.g., e + combining acute accent). This step canonicalizes the representation, reducing round-trip discrepancies in bidirectional conversions. The Unicode Standard recommends NFC for general text processing due to its compatibility with legacy data.[55]

Conversion processes rely on mapping tables that define correspondences between code points in the source, Unicode pivot, and target encodings. These tables are typically bidirectional, allowing round-trip conversions where mappings are reversible (e.g., a source byte maps to a unique Unicode code point, which maps back uniquely). The International Components for Unicode (ICU) library exemplifies this with its comprehensive set of conversion data files, derived from standards like IBM's CDRA tables, which specify both forward and fallback mappings to handle partial overlaps gracefully. Such tables enable efficient, deterministic transcoding while flagging potential losses.[98]
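The decode, normalize, and encode pipeline can be sketched in a few lines of Python; the transcode() helper below is illustrative rather than a library API, and it substitutes '?' for characters missing from the target repertoire.

```python
import unicodedata

def transcode(data: bytes, source: str, target: str) -> bytes:
    text = data.decode(source)                     # 1. decode to Unicode code points
    text = unicodedata.normalize("NFC", text)      # 2. canonicalize to NFC
    return text.encode(target, errors="replace")   # 3. encode; unmappable characters become '?'

# Lossless: every ASCII byte has the same value in UTF-8.
assert transcode(b"plain ASCII", "ascii", "utf-8") == b"plain ASCII"

# Lossy: the target repertoire here cannot represent the accented letter.
print(transcode("café".encode("latin-1"), "latin-1", "ascii"))  # b'caf?'
```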
Tools and Algorithms
Software libraries play a central role in implementing character encoding transcoding, providing robust APIs for converting between various formats in applications. The International Components for Unicode (ICU), originally developed by Taligent and open-sourced by IBM in 1999, is a mature set of C/C++ and Java libraries that supports Unicode and globalization features, including conversion between Unicode and over 220 legacy character sets through its converter API.[99][98] ICU's transcoding capabilities handle complex mappings, such as those involving multi-byte encodings like GB18030, ensuring accurate round-trip conversions where possible.[100]

Python's standard library includes the codecs module, which offers a registry of built-in encoders and decoders for standard encodings such as UTF-8, UTF-16, ASCII, and Latin-1, along with support for additional formats like base64 and compression codecs.[101] This module facilitates stream and file interfaces for transcoding, allowing developers to register custom codecs and perform operations like encode() and decode() on strings or bytes, making it essential for handling diverse text data in Python applications.[101]
Transcoding algorithms typically rely on table-driven lookups for efficiency in simple cases, where precomputed mapping tables translate code points or bytes directly between encodings like ASCII to UTF-8.[102] For more complex scenarios, such as converting UTF-16 to UTF-8, dynamic algorithms are employed to handle surrogate pairs: a high surrogate (U+D800 to U+DBFF) and low surrogate (U+DC00 to U+DFFF) are combined to form a single code point beyond the Basic Multilingual Plane, which is then encoded into 4 bytes in UTF-8 following the Unicode Transformation Format rules. These methods ensure validity checks, such as rejecting unpaired surrogates, to prevent data corruption during conversion.[103]
Encoding detection often precedes transcoding when the source format is unknown, using heuristic approaches like byte frequency analysis to infer the likely encoding. The chardet library, a popular Python tool, implements this by analyzing byte distributions against statistical models for various encodings; for instance, it probes for multi-byte patterns in UTF-8 or Shift-JIS by tracking character transitions and confidence scores based on observed frequencies.[104] This probabilistic method achieves high accuracy for common legacy encodings but may require fallback strategies for ambiguous cases.
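A minimal use of chardet looks like the following sketch (the library is a third-party package, and the reported encoding and confidence depend on the sample).

```python
import chardet  # third-party: pip install chardet

sample = "文字化けを避ける".encode("shift_jis")
guess = chardet.detect(sample)
print(guess)  # e.g. {'encoding': 'SHIFT_JIS', 'confidence': ..., 'language': 'Japanese'}

if guess["encoding"]:
    print(sample.decode(guess["encoding"]))  # decode using the detected encoding
```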
In web development, the WHATWG Encoding Standard, as of 2025, specifies that UTF-8 is the dominant encoding for interchange, mandating its use in new protocols while defining fallback transcoding for legacy labels like windows-1252 in browsers, ensuring seamless handling of mixed content through APIs such as TextEncoder and TextDecoder.[105] Building on foundational conversion processes, these tools enable practical interoperability across diverse systems.[105]
Common Challenges
One of the primary challenges in transcoding between character encodings is the production of mojibake, which occurs when data encoded in one scheme, such as UTF-8, is incorrectly decoded using another, like ISO-8859-1 (Latin-1). This results in garbled or nonsensical text, as the byte sequences are misinterpreted according to a different mapping. For example, the UTF-8 representation of the character "é" uses the bytes C3 A9; when decoded as ISO-8859-1, these become the characters "Ã" followed by "©", rendering as "Ã©" instead of the intended accented letter.[106] Such errors are common in web content, email, and file transfers where encoding metadata is absent or ignored, leading to persistent display issues across systems.[107]
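The example is easy to reproduce in Python, which makes the mechanism explicit.

```python
utf8_bytes = "é".encode("utf-8")     # b'\xc3\xa9'
print(utf8_bytes.decode("latin-1"))  # 'Ã©' -- each byte misread as a separate Latin-1 character
print(utf8_bytes.decode("utf-8"))    # 'é'  -- correct interpretation
```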
Another frequent issue is round-trip loss, where converting text from a modern encoding like Unicode back to a legacy one and then reversing the process fails to preserve the original data. This irreversibility stems from repertoire mismatches: legacy encodings, such as ASCII or the ISO 8859 series, support only limited character sets and cannot represent supplementary Unicode characters, such as emoji (e.g., 😀 at U+1F600). During transcoding to these older schemes, such characters are typically replaced with placeholders, question marks, or omitted entirely, making full recovery impossible upon reconversion.[108] This problem is particularly acute in systems interfacing with mainframes or older databases that rely on encodings like EBCDIC, where certain Unicode characters have no direct equivalents.[109]
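A short Python sketch shows the loss: once a character outside the legacy repertoire is replaced, the original text cannot be recovered.

```python
original = "price: 5€ 😀"
legacy = original.encode("ascii", errors="replace")  # euro sign and emoji each become '?'
restored = legacy.decode("ascii")

print(legacy)                 # b'price: 5? ?'
assert restored != original   # the round trip does not recover the original string
```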
Encoding detection presents additional pitfalls due to overlapping byte patterns across schemes, complicating automated identification of the correct format. For instance, certain byte ranges in UTF-8 (a variable-length encoding using 1-4 bytes) coincide with those in Shift-JIS (a Japanese encoding using 1-2 bytes), allowing the same sequence to be validly interpreted in multiple ways without explicit declarations.[110] This ambiguity can trigger incorrect decoding, especially in multilingual content or when HTTP headers or meta tags are missing or inconsistent. Although UTF-8 dominates web usage at 98.8% of sites as of late 2025, the remaining non-UTF-8 pages—often legacy or regionally specific—combined with misdeclarations, contribute to ongoing detection errors in real-world applications.[111]
Overlong encodings in UTF-8 exacerbate these challenges by introducing invalid sequences that represent code points with unnecessary extra bytes, violating the standard's canonical form. For example, the ASCII character "/" (U+002F) can be illicitly encoded as a two-byte sequence like C0 AF instead of the single byte 2F, potentially evading input filters or normalization checks during transcoding. These non-standard forms are explicitly prohibited in UTF-8 for security reasons, as they enable attacks like bypassing validation logic in web applications or parsers.[112] Proper handling requires strict decoders that reject such sequences, but inconsistent implementations across libraries can propagate errors. Tools such as the Universal Charset Detector (from Mozilla) offer heuristics for mitigation, though they cannot eliminate all ambiguities.[113]
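Python's built-in decoder behaves as a strict decoder in this respect, as the short sketch below illustrates.

```python
overlong = b"\xc0\xaf"   # an ill-formed two-byte ("overlong") encoding of '/' (U+002F)

try:
    overlong.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc)               # strict decoders refuse overlong sequences

assert b"\x2f".decode("utf-8") == "/"     # the only valid UTF-8 form is the single byte 2F
```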
Modern Applications and Challenges
Internationalization and Localization
Internationalization (i18n) involves designing software to support multiple languages and regions without requiring code modifications, where character encodings like UTF-8 play a central role by enabling the representation of diverse scripts within a single application. UTF-8, as a variable-length encoding of the Unicode standard, allows seamless mixing of characters from different writing systems, such as Latin, Cyrillic, and Devanagari, facilitating global software development. For instance, applications can handle multilingual user interfaces by storing all text in UTF-8, which supports 159,801 characters across 172 scripts as of Unicode 17.0 (September 2025), including recent additions like the Sidetic and Tolong Siki scripts for better representation of lesser-known languages.[114][31][115]

In POSIX-compliant systems, locales incorporate encoding tags to specify character sets, such as en_US.UTF-8, which combines English (United States) language conventions with UTF-8 encoding for broad script support. This configuration, set via environment variables like LANG or LC_CTYPE, ensures that applications interpret and display text correctly for the designated region, including proper handling of collating sequences and character classifications. The vast majority of mobile applications rely on Unicode-based encodings like UTF-8 to meet global user demands, driven by platform standards in iOS and Android that mandate Unicode compliance for internationalization.[116][117]
Localization (l10n) builds on i18n by adapting content and interfaces for specific locales, where character encodings ensure accurate rendering of culturally appropriate text. For right-to-left (RTL) languages like Arabic, Unicode's bidirectional algorithm (defined in Unicode Standard Annex #9, or UAX #9) automatically determines text directionality, reordering mixed RTL and left-to-right (LTR) scripts to maintain readability. This algorithm assigns embedding levels to characters—odd levels for RTL (e.g., Hebrew or Arabic) and even for LTR—enabling proper interleaving, such as displaying an English URL within an Arabic sentence from right to left while preserving LTR flow for the URL itself. Implementing UAX #9 in software libraries allows localized applications to support bidirectional text without manual adjustments, enhancing user experience in regions using scripts like Arabic, which requires mirroring layouts and icons for intuitive navigation.[118][119]
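The per-character Bidi_Class property that feeds the UAX #9 algorithm is exposed by Python's unicodedata module, as in this small illustration.

```python
import unicodedata as ud

print(ud.bidirectional("A"))  # 'L'  -- left-to-right letter
print(ud.bidirectional("א"))  # 'R'  -- right-to-left letter (Hebrew)
print(ud.bidirectional("ب"))  # 'AL' -- Arabic letter
print(ud.bidirectional("1"))  # 'EN' -- European number
```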