Character encoding

Character encoding is the process of assigning numerical values to the symbols of a writing system, such as letters, digits, and punctuation marks, so that computers can store, transmit, and display textual data in binary form. This mapping, often called a coded character set, transforms human-readable characters into sequences of bytes that hardware and software can process efficiently. Essential for digital communication, character encoding ensures compatibility across systems but has evolved continually to support languages and scripts beyond its early limitations.

The history of character encoding traces back to early telegraphy and computing needs, with initial systems like the Baudot code in the 1870s using 5-bit representations for a limited set of symbols. In the 1960s, as computers proliferated, standardized encodings emerged to address interoperability; the American Standard Code for Information Interchange (ASCII), developed by the American Standards Association in 1963, became the foundational 7-bit scheme supporting 128 characters primarily for English text, assigning values like 65 for uppercase 'A'. Concurrently, IBM introduced EBCDIC in 1963 for its mainframes, an 8-bit encoding that prioritized punched-card compatibility but differed from ASCII, leading to fragmentation. By the 1980s, the rise of global computing exposed ASCII's limitations in handling non-Latin scripts, prompting extensions like the ISO/IEC 8859 series (starting 1987), which provided 8-bit encodings for specific languages, such as ISO-8859-1 for Western European characters. These national and regional standards, while useful, created a fragmented landscape of over 800 encoding variants, complicating multilingual data exchange.

The solution arrived with Unicode, a universal standard initiated in 1987 by a group of engineers including staff from Apple and Xerox, and first published in 1991, which assigns unique code points (non-negative integers) to 159,801 characters from 172 scripts as of version 17.0 in 2025. Unicode's architecture separates abstract characters from their serialized forms, allowing flexible encodings such as UTF-8 (variable-width, backward-compatible with ASCII), UTF-16, and UTF-32. Today, Unicode dominates modern computing, underpinning the web, operating systems, and international standards like ISO/IEC 10646, with which it is harmonized, to support living languages, emojis, and historic scripts. Its adoption has resolved many legacy issues, but challenges persist in legacy systems, conversion errors (mojibake), and ensuring universal accessibility for all languages.

History

Early Developments

The origins of character encoding trace back to pre-digital communication systems designed for efficient transmission over limited channels. In the 1830s, Samuel F. B. Morse and Alfred Vail developed Morse code for use with the electric telegraph, representing letters, numbers, and punctuation as unique sequences of short (dots) and long (dashes) signals, which functioned as an early character encoding scheme for converting textual information into transmittable pulses. This system marked a foundational step in abstracting characters into discrete, machine-interpretable forms, though it was variable-length and optimized for human operators rather than direct machine processing. By the 1870s, Émile Baudot advanced this concept with the Baudot code, a fixed-length 5-bit encoding for telegraphy that assigned 32 distinct combinations to letters, figures, and control signals, enabling faster and more automated transmission of messages across wires.

The late 19th century saw the emergence of punched-card systems for data processing, bridging telegraphy and computing. In 1890, Herman Hollerith introduced punched cards for the U.S. Census, where holes in specific positions on the cards encoded demographic data as machine-readable patterns, processed electrically by tabulating machines to tally and sort information at scale. This innovation shifted text and numeric data into a physical, binary-inspired format—presence or absence of a punch—facilitating the first large-scale automated handling of character-based records and laying groundwork for stored-program computers.

Early electronic computers in the 1940s and 1950s built on these ideas with custom encodings tailored to hardware constraints. The ENIAC, completed in 1945, primarily handled numeric computations using a decimal architecture in which each digit was represented by a 10-position ring counter of flip-flops, effectively employing ten binary devices to encode values from 0 to 9, with provisions for sign representation and later extensions to alphanumeric input via plugboards and switches. By the early 1950s, machines like the Ferranti Mark 1 adopted 5-bit codes derived from teleprinter standards for input and output, supporting 32 basic symbols including uppercase letters and numerals, augmented by shift mechanisms to access additional characters and control functions without exceeding the bit limit. These machine-specific schemes prioritized efficiency for limited memory and I/O, often favoring uppercase-only text and numeric data to fit within word sizes like 20 or 40 bits.

Standardization Efforts

The European Computer Manufacturers Association (ECMA), now known as Ecma International, was established in 1961 to promote standards in information and communication technology, with significant work on character encoding beginning in the early 1960s through technical committees like TC4, which addressed coded character sets by 1964. This effort contributed to early international harmonization, including the development of ECMA-6 in 1965, a 7-bit character code aligned with emerging global needs. In parallel, the American Standards Association (ASA) published the first edition of the ASCII standard (X3.4-1963) on June 17, 1963, defining a 7-bit code with 128 positions primarily for English alphanumeric characters, control functions, and basic symbols to facilitate data interchange in telecommunications and computing. A major revision followed in 1967 as USAS X3.4-1967, refining assignments such as lowercase letters and punctuation for broader adoption. This national standard influenced international efforts, leading to its adoption as ISO Recommendation R 646 in 1967, which specified a compatible 7-bit coded character set for information processing interchange.

The formation of ISO/IEC JTC 1/SC 2 in 1987 marked a key institutional milestone, building on earlier ISO/TC 97/SC 2 work from the 1960s to standardize coded character sets, including graphic characters, control functions, and string ordering for global compatibility. Under this subcommittee, the ISO/IEC 8859 series emerged in the late 1980s and 1990s, extending to 8-bit encodings for regional scripts; for instance, ISO 8859-1 (Latin-1) was first published in 1987, supporting 191 graphic characters for Western European languages while maintaining ASCII compatibility in the lower 128 positions. Subsequent parts, such as ISO 8859-2 (Latin-2) for Central European languages in 1987 and ISO 8859-9 (Latin-5) for Turkish in 1989, addressed diverse linguistic needs without overlapping repertoires. A pivotal event occurred in the mid-1980s when the subcommittee established Working Group 2 (WG 2) to develop a universal coded character set, resulting in the initial drafts of ISO 10646, which aimed to unify disparate encodings into a comprehensive repertoire exceeding 65,000 characters across scripts. This effort laid the groundwork for collaboration with emerging initiatives like the Unicode project, fostering a single global framework by the early 1990s.

Evolution to Unicode

The Unicode Consortium was established on January 3, 1991, in California, as a non-profit organization dedicated to creating, maintaining, and promoting a universal character encoding standard to address the fragmentation of existing systems. This initiative arose from collaborative efforts involving engineers from Xerox and Apple, who had developed proprietary encodings like Xerox's Character Code Standard (XCCS) and Apple's Macintosh Roman, alongside the emerging ISO/IEC 10646 project for a 31-bit universal character set. The consortium's formation facilitated the merger of these parallel developments, harmonizing with ISO 10646 by 1993 to ensure synchronized evolution and global interoperability.

Unicode 1.0, released in October 1991, marked the standard's debut with support for approximately 7,100 characters, primarily covering Western European languages through extensions of ASCII, as well as initial inclusions for scripts like Greek and Cyrillic. Subsequent versions rapidly expanded the repertoire to encompass a broader array of writing systems; for instance, the Arabic script was fully integrated by Unicode 1.1 (1993), enabling proper representation of right-to-left text and diacritics essential for languages across the Middle East and North Africa. Further growth in the 2010s incorporated modern digital symbols, with emoji characters first standardized in Unicode 6.0 (2010), drawing from Japanese mobile phone sets to support expressive, cross-platform communication.

The explosive expansion of the World Wide Web during the 1990s, with user bases growing from millions to hundreds of millions globally, underscored the need for a single encoding capable of handling multilingual content without the constraints of regional standards like ISO 8859, which were limited to 256 characters per set and struggled with mixed-script documents. Unicode's adoption was accelerated by its design flexibility, particularly the introduction of UTF-8 in 1992 by Ken Thompson and Rob Pike at Bell Labs, which uses variable-length encoding (1 to 4 bytes per character) while ensuring the first 128 code points match ASCII exactly for seamless backward compatibility. This allowed legacy ASCII-based systems, prevalent in early web infrastructure, to process Unicode text without modification, facilitating the web's transition to global, multilingual applications.

As of November 2025, Unicode 17.0—released on September 9, 2025—encodes 159,801 characters across 172 scripts, reflecting ongoing efforts to include underrepresented languages and cultural symbols. Notable among these expansions is the addition of scripts like Nyiakeng Puachue Hmong in Unicode 12.0 (2019), a script devised in the 1980s for Hmong communities in Laos and the United States, demonstrating Unicode's commitment to linguistic diversity and minority language preservation. Unicode 17.0 further added four new scripts, including Beria Erfe and Sidetic, along with eight new emoji and other symbols.

Core Terminology

Character and Glyph

In character encoding, a character is defined as the smallest component of written language that has semantic value, referring to the abstract meaning and/or shape of a unit of written text, independent of any specific visual representation, font, or script variation. For instance, the character "A" represents a consistent informational unit, regardless of whether it appears in roman, italic, bold, or otherwise stylized forms across different typefaces. This allows characters to serve as units of data organization, control, or representation in information interchange, as established in international standards.

A glyph, in contrast, is the specific visual form or graphic symbol used to render a character on a display or in print, varying based on factors such as typeface, size, language context, or stylistic choices. Examples include the serif-style "A" in Times New Roman versus a sans-serif "A" in Helvetica, or contextual glyph substitutions in scripts like Arabic, where the same character adopts different shapes depending on its position in a word (initial, medial, final, or isolated). The term "glyph" originates from Greek, where it denotes the carved or engraved shape of a letter, derived from gluphḗ, meaning "carving," and was adapted for digital typography and standards like ISO/IEC 10646 to describe these rendered forms.

A fundamental distinction is that one character can correspond to multiple glyphs, enabling flexibility in presentation while preserving the underlying semantic content; conversely, a single glyph may represent multiple characters in certain cases, such as the Latin ligature "fi" (a combined glyph for the two distinct characters "f" and "i") used to improve readability and aesthetics in typesetting. This separation ensures that character encoding focuses on abstract information interchange, leaving glyph rendering to higher-level processes like font systems.

Character Repertoire and Set

In character encoding standards, the character repertoire refers to the complete, abstract collection of distinct characters that an encoding scheme is designed to represent, independent of their visual forms or numeric assignments. This repertoire encompasses a finite or potentially open-ended collection of abstract characters, such as letters, digits, symbols, and ideographs from various writing systems, ensuring comprehensive coverage for text interchange. For instance, the Unicode Standard's repertoire, as defined in ISO/IEC 10646, includes over 159,000 assigned abstract characters as of version 17.0, released in September 2025.

A character set, in contrast, typically denotes a named, practical subset of the full repertoire, or the entire repertoire itself when bounded for specific applications, often implying an organized grouping for encoding purposes. Examples include the Basic Latin character set, which covers the 128 characters of the ASCII standard, serving as a foundational block within Unicode's broader repertoire. Character sets are thus more application-oriented, allowing systems to handle delimited portions of the repertoire efficiently without processing the exhaustive total. The repertoire fundamentally excludes glyphs, which are the visual representations of characters, focusing instead solely on abstract units, as established in the character-versus-glyph distinction above.

A key example of repertoire organization is Han unification, where ideographic characters shared across Chinese, Japanese, and Korean scripts are consolidated into a single abstract form to optimize space and promote interoperability, resulting in a unified set of ideographs within the overall repertoire. This approach highlights the repertoire's exhaustive, abstract nature versus the bounded, implementable scope of character sets, enabling scalable support for global scripts.

Code Points and Code Space

In character encoding, a code point is a unique integer value assigned to represent an abstract character within a coded character set. This numeric identifier serves as an address in the encoding's abstract space, allowing unambiguous reference to characters regardless of how they are stored or displayed. For instance, in the Unicode standard, the code point U+0041 denotes the Latin capital letter "A". The code space encompasses the entire range of possible code points defined by an encoding standard, forming a contiguous or structured set of nonnegative integers available for character assignment. This range determines the encoding's capacity to represent characters, with the size of the code space calculated as the maximum code point value minus the minimum value plus one. In Unicode, the code space spans from 0 to 10FFFF in hexadecimal (equivalent to 0 to 1,114,111 in decimal), providing 1,114,112 possible code points across 17 planes of 65,536 positions each. By contrast, the American Standard Code for Information Interchange (ASCII) uses a 7-bit code space from 0 to 127, accommodating 128 positions for basic Latin characters and control codes. Within the code space, not all positions are immediately assigned to characters from a given repertoire; unassigned code points are explicitly reserved for future extensions or additions to ensure long-term stability and expandability of the encoding. These reservations prevent conflicts as new characters, such as those from emerging scripts or symbols, are incorporated over time.
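The relationship between characters, code points, and the code space can be illustrated with a short Python sketch using the built-in ord() and chr() functions; the sample characters are chosen only for illustration.

```python
# Illustrative sketch: inspecting code points in Python, where ord() and chr()
# convert between characters and their integer code points.
for ch in ["A", "é", "€", "😀"]:
    cp = ord(ch)
    print(f"{ch!r} -> U+{cp:04X} (decimal {cp})")

# The Unicode code space spans U+0000..U+10FFFF: 17 planes of 65,536 positions.
code_space_size = 0x10FFFF - 0x0000 + 1
assert code_space_size == 17 * 65536 == 1_114_112

# ASCII, by contrast, occupies only the 7-bit range 0..127.
assert all(ord(c) <= 0x7F for c in "Hello, ASCII!")
```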

Code Units and Encoding Forms

In character encoding, a code unit is the minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard defines three primary encoding forms—UTF-8, UTF-16, and UTF-32—each utilizing code units of fixed sizes: 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32. An encoding form specifies the mapping from code points (abstract numerical identifiers for characters) to sequences of one or more code units, enabling the representation of the full code space within the constraints of the chosen unit size. These forms handle the internal binary packaging of code points without specifying byte serialization or order, which is addressed by encoding schemes. Code points thus serve as the input to these forms, which transform their values into storable or transmittable sequences. In UTF-8, code points are encoded variably using 1 to 4 bytes depending on their value: code points U+0000 to U+007F require 1 byte (identical to ASCII), U+0080 to U+07FF use 2 bytes, U+0800 to U+FFFF use 3 bytes, and U+10000 to U+10FFFF use 4 bytes. This variable-length approach distributes the bits of the code point across bytes, with leading bytes using distinctive bit patterns (e.g., 0xxxxxxx for 1-byte sequences, 110xxxxx for 2-byte starters) to indicate sequence length, and continuation bytes marked by 10xxxxxx. UTF-16 employs 16-bit code units, representing each code point in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) as a single unit. For supplementary code points beyond U+FFFF (up to U+10FFFF), UTF-16 uses surrogate pairs: two consecutive 16-bit code units, where the first (high surrogate) ranges from U+D800 to U+DBFF and the second (low surrogate) from U+DC00 to U+DFFF, together encoding the full value. This mechanism allows UTF-16 to cover the entire repertoire with at most two code units per character. UTF-32, in contrast, uses a single 32-bit code unit for every code point, providing a straightforward fixed-width encoding but at the cost of greater storage for most text. Encoding forms like these focus solely on the code unit sequences derived from code points, distinguishing them from encoding schemes that incorporate byte-order marks or byte ordering for serialized output.
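A short Python sketch (illustrative only) shows how the same code point is represented by different numbers of code units in each encoding form; dividing the byte count by the code-unit size gives the number of code units.

```python
# Sketch: counting code units for one character in each Unicode encoding form.
# Using "utf-16-le" / "utf-32-le" avoids the byte order mark so the raw code
# units are visible; endianness itself belongs to the encoding *scheme*.
ch = "😀"  # U+1F600, outside the BMP

utf8  = ch.encode("utf-8")       # 4 one-byte code units
utf16 = ch.encode("utf-16-le")   # 2 two-byte code units (a surrogate pair)
utf32 = ch.encode("utf-32-le")   # 1 four-byte code unit

print(len(utf8), len(utf16) // 2, len(utf32) // 4)   # -> 4 2 1
print(utf8.hex(" "), utf16.hex(" "), utf32.hex(" "))
```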

Encoding Principles

Coded Character Sets

A coded character set (CCS) is a mapping from a set of abstract characters in a repertoire to nonnegative integers known as code points, within a defined code space. This mapping must be injective, ensuring each character in the repertoire corresponds to a unique code point, to avoid ambiguity in representation. For small fixed-width coded character sets, the mapping is often close to bijective, assigning characters to most or all positions in a 7-bit or 8-bit range. A classic example is the American Standard Code for Information Interchange (ASCII), standardized internationally as ISO/IEC 646, which defines a 7-bit coded character set mapping 128 characters—95 printable graphics and 33 controls—to code points from 0 to 127. Similarly, ISO/IEC 8859-1 extends this to an 8-bit coded character set for Western European languages, defining a repertoire of 191 graphic characters mapped to code points 32 to 126 and 160 to 255, while positions 127 to 159 remain undefined or reserved for controls.

Coded character sets originated as the foundational mechanism for digital text representation in early computing, providing a simple, fixed-width mapping of numbers to characters before the demands of global scripts necessitated multi-byte approaches. In modern standards like the Universal Coded Character Set (UCS) of ISO/IEC 10646—which is aligned with Unicode—the code space expands to a 21-bit range organized into 17 planes of 65,536 code points each, enabling over 1 million possible assignments while maintaining the core principle of unique character-to-code-point mappings. One key limitation of traditional designs is their fixed-size structure, which assumes a one-to-one correspondence of one code point per base character, restricting support for scripts requiring diacritics or ligatures without separate combining mechanisms. This approach sufficed for limited repertoires like Latin alphabets but proved inadequate for multilingual needs, prompting evolutions in encoding standards.
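A minimal Python sketch (purely illustrative) models a coded character set as an injective mapping from abstract characters to code points and checks that property for ASCII.

```python
# Sketch: a coded character set as an injective character -> code point mapping.
# Here the ASCII CCS is rebuilt from Python's own tables; real standards define
# the assignments normatively rather than deriving them.
ascii_ccs = {chr(cp): cp for cp in range(128)}

# Injective: no two characters share a code point.
assert len(set(ascii_ccs.values())) == len(ascii_ccs)

print(ascii_ccs["A"])   # 65
print(ascii_ccs["0"])   # 48
```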

Transformation Formats

Transformation formats, also known as character encoding schemes (CES), provide methods to serialize sequences of code units from an encoding form into byte sequences suitable for storage and transmission, ensuring reversibility and compatibility with byte-oriented systems. A CES combines the rules of an encoding form—such as the fixed-width 16-bit code units of UTF-16—with serialization conventions like byte order and variable-length mapping, allowing adaptation of coded character sets to practical binary representations. A prominent example is UTF-8, a variable-length CES that encodes code points using 1 to 4 bytes per character, where the first 128 code points (U+0000 to U+007F) are represented as single bytes identical to ASCII, preserving compatibility with legacy 7-bit systems. This design enables efficient handling of multilingual text by allocating fewer bytes to common Latin characters while supporting the full repertoire through multi-byte sequences for higher code points.

For fixed-width formats like UTF-16 and UTF-32, which use 16-bit or 32-bit code units respectively, byte serialization must account for endianness—the order in which the bytes of a multi-byte code unit are stored or transmitted. The byte order mark (BOM), represented by the character U+FEFF, serves as a signature at the start of a text stream to indicate the endianness: in big-endian UTF-16 it appears as the byte sequence FE FF, while in little-endian it is FF FE. UTF-16 further employs surrogate pairs to extend its 16-bit code units to cover the full 21-bit code space; a pair consists of a high surrogate from the range U+D800 to U+DBFF followed by a low surrogate from U+DC00 to U+DFFF, together encoding code points from U+10000 to U+10FFFF. These transformation formats offer significant advantages in space efficiency, particularly for ASCII-heavy text, as UTF-8 requires only 1 byte per character for the Basic Latin range compared to 2 bytes in UTF-16 or 4 in UTF-32. As a result, UTF-8 has become the dominant encoding for the web, used by 98.8% of websites as of November 2025.
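The following Python sketch (illustrative) shows how the same text serializes under different schemes, including how Python's generic "utf-16" codec prepends a BOM whose byte order reveals the platform's endianness, while the explicit "utf-16-be" and "utf-16-le" schemes omit it.

```python
text = "Aé"

# Byte counts per scheme: UTF-8 stays compact for ASCII-heavy text.
print(text.encode("utf-8").hex(" "))       # 41 c3 a9
print(text.encode("utf-16-be").hex(" "))   # 00 41 00 e9
print(text.encode("utf-16-le").hex(" "))   # 41 00 e9 00

# The generic "utf-16" codec writes a BOM (FE FF or FF FE) before the data.
with_bom = text.encode("utf-16")
print(with_bom[:2].hex(" "))               # ff fe on little-endian systems

# Decoding with "utf-16" consumes the BOM and restores the original string.
assert with_bom.decode("utf-16") == text
```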

Higher-Level Protocols

Higher-level protocols build upon character encoding schemes to ensure reliable transmission and interpretation of text across networks and applications. In the Multipurpose Internet Mail Extensions (MIME) standard, character sets are declared using parameters in headers, such as Content-Type: text/plain; charset=US-ASCII, allowing mail clients and other systems to decode messages correctly. Similarly, the Hypertext Transfer Protocol (HTTP) uses the Content-Type header to specify the media type and charset, for instance Content-Type: text/html; charset=UTF-8, which informs web browsers how to render the content without misinterpreting the bytes. These declarations integrate encoding forms like UTF-8 or UTF-16 into layered communication stacks, preventing issues such as mojibake during data exchange.

Unicode normalization forms address variations in character representation to maintain consistency in protocols. Normalization Form C (NFC) composes characters into precomposed forms where possible, such as combining a base letter and accent into a single code point (e.g., "é" as U+00E9), while Normalization Form D (NFD) decomposes them into base and combining marks (e.g., "e" followed by a combining acute accent). These forms ensure canonical equivalence, meaning NFC and NFD representations are semantically identical but may differ in storage, which is crucial for protocols handling user-generated content to avoid mismatches in searching, sorting, or comparison. Applications often normalize to NFC for compatibility with legacy systems, as it minimizes the length of encoded strings compared to decomposed forms.

Escaping sequences in higher-level protocols protect special characters during transmission, preventing them from being interpreted as markup or control codes. In HTML, entities like &amp; for ampersand (&), &lt; for less-than (<), and &gt; for greater-than (>) escape markup-significant symbols, ensuring safe inclusion in documents without altering structure. This mechanism, defined in the HTML specification, allows protocols to transport raw text while preserving its integrity across diverse systems.

The Extensible Markup Language (XML) exemplifies protocol-level encoding integration by requiring processors to support UTF-8 and UTF-16, with other encodings permitted optionally, and an optional encoding declaration like <?xml version="1.0" encoding="UTF-8"?> at the document's start. This declaration signals the processor to use the specified transformation format for parsing, ensuring interoperability in data exchange standards such as SOAP or RSS feeds. Without it, XML defaults to UTF-8 or UTF-16 based on the byte order mark, but explicit declaration enhances robustness in multi-encoding environments.
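A brief Python sketch (illustrative) combines two of these protocol-level concerns: canonical normalization with the unicodedata module and HTML escaping with the standard html module.

```python
import html
import unicodedata

# NFC vs NFD: the same user-perceived "é" as one precomposed code point
# versus a base letter plus a combining acute accent.
nfc = unicodedata.normalize("NFC", "e\u0301")
nfd = unicodedata.normalize("NFD", "\u00e9")
print([hex(ord(c)) for c in nfc])   # ['0xe9']
print([hex(ord(c)) for c in nfd])   # ['0x65', '0x301']
assert nfc != nfd                                  # different code point sequences...
assert unicodedata.normalize("NFC", nfd) == nfc    # ...but canonically equivalent

# Escaping markup-significant characters before embedding text in HTML.
print(html.escape('Fish & Chips <"best">'))
# Fish &amp; Chips &lt;&quot;best&quot;&gt;
```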

Unicode Standard

Abstract Character Repertoire

The abstract character repertoire of the Unicode Standard represents a vast, curated collection of characters drawn from the world's writing systems, symbols, and notations, serving as the foundational set of abstract characters available for encoding. This repertoire comprises 159,801 characters across 172 scripts, encompassing both contemporary languages and historical notations to support global text interchange. For instance, it includes modern scripts such as Devanagari for Hindi and Latin for English, alongside historic scripts and recent symbol additions introduced in successive versions. The repertoire is designed to be open-ended, allowing for ongoing expansion to accommodate evolving linguistic needs without disrupting existing encodings.

A key principle in constructing this repertoire is Han unification, which merges ideographic characters shared across Chinese, Japanese, Korean, and Vietnamese writing systems to minimize redundancy while preserving cultural distinctions through variation sequences and font selection. Under Han unification, a single code point is assigned to visually similar ideographs that represent the same abstract character, such as the shared form for "mountain" (山) used in multiple East Asian contexts, reducing the total number of unique code points required. This approach, developed through collaboration among ideograph experts, ensures efficient storage and processing while relying on font rendering and normalization processes to display appropriate variants.

The Unicode repertoire is organized into 17 planes, each containing 65,536 potential code points, to systematically allocate characters by category and rarity. Plane 0, known as the Basic Multilingual Plane (BMP), holds the most commonly used characters, including scripts for major world languages and basic symbols, covering the initial 65,536 code points from U+0000 to U+FFFF. Plane 1, the Supplementary Multilingual Plane (SMP), extends support for less common and historic scripts, such as Linear B and Egyptian hieroglyphs. As of 2025, Plane 3, the Tertiary Ideographic Plane (TIP), has seen partial allocation for ancient scripts, including provisional space for oracle bone and seal scripts, to accommodate expanded Han-related historic material.

The core repertoire deliberately excludes private use areas, which are reserved code point ranges (such as Planes 15 and 16) for implementation-specific characters defined by private agreements rather than the standard itself, ensuring the official set remains universally interoperable. Character variants, including compatibility decompositions for legacy encodings, are managed through Unicode normalization forms rather than duplicating code points in the repertoire, promoting consistency across systems.
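The plane of any code point follows directly from the 17 × 65,536 layout; the short Python sketch below (illustrative) derives the plane index from the code point value.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) containing the given code point."""
    assert 0 <= code_point <= 0x10FFFF
    return code_point >> 16          # each plane holds 0x10000 code points

print(plane_of(0x0041))    # 0  -> Basic Multilingual Plane (BMP)
print(plane_of(0x1F600))   # 1  -> Supplementary Multilingual Plane (SMP)
print(plane_of(0xF0000))   # 15 -> Supplementary Private Use Area-A
```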

Encoding Forms and Schemes

The Unicode Standard defines three primary encoding forms—UTF-8, UTF-16, and UTF-32—that transform sequences of Unicode code points into streams of code units suitable for storage, processing, or transmission in binary format. These forms operate on the abstract repertoire of Unicode characters, ensuring a consistent mapping from code points to bytes while optimizing for different use cases such as space efficiency, processing speed, or simplicity. Encoding schemes further specify how these code units are serialized into bytes, particularly addressing byte order variations. UTF-8 is a variable-length encoding form that represents each code point using one to four 8-bit code units (octets). It achieves backward compatibility with ASCII by encoding the 128 ASCII characters (code points U+0000 to U+007F) as single bytes identical to their ASCII values, with the binary pattern 0xxxxxxx. For code points beyond this range, UTF-8 employs multi-byte sequences distinguished by leading bit patterns that indicate the sequence length and allow self-synchronization for error detection and parsing. The specific bit patterns are as follows:
Bytes   1st byte    2nd byte    3rd byte    4th byte
1       0xxxxxxx
2       110xxxxx    10xxxxxx
3       1110xxxx    10xxxxxx    10xxxxxx
4       11110xxx    10xxxxxx    10xxxxxx    10xxxxxx
Here, x represents bits from the code point value, with continuation bytes always starting with 10. This design minimizes overhead for Latin scripts while supporting the full Unicode range up to U+10FFFF, excluding surrogates and noncharacters. UTF-8's efficiency for English and European text, combined with its ASCII transparency, has led to its widespread adoption as the preferred encoding for web content, HTML, XML, and plain text files.

UTF-16 is a variable-width encoding form that uses 16-bit code units to represent characters, encoding code points in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) with a single code unit and supplementary code points (U+10000 to U+10FFFF) with two code units known as a surrogate pair. A surrogate pair consists of a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF); the pair's numerical value reconstructs the original code point via the formula (high − 0xD800) × 0x400 + (low − 0xDC00) + 0x10000. This allows UTF-16 to cover the entire Unicode space while maintaining compatibility with 16-bit systems. UTF-16 is the native internal encoding for Java strings and is extensively used in Windows operating system APIs and components for text processing.

UTF-32 is a fixed-width encoding form that maps each code point directly to a single 32-bit code unit, equivalent to the code point's numerical value padded to 32 bits. This one-to-one correspondence simplifies indexing and random access into text but results in higher storage and bandwidth requirements, as every character occupies four bytes regardless of its value. UTF-32 is particularly suited for applications where processing efficiency outweighs space concerns, such as in-memory representations during computation.

The distinction between encoding forms and schemes arises in byte serialization, primarily for the multi-byte code units of UTF-16 and UTF-32, where endianness (the order of byte storage) must be specified. Big-endian (BE) places the most significant byte first, while little-endian (LE) places the least significant byte first, leading to schemes such as UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE. UTF-8, being octet-based, has no endianness issues and uses a single scheme. To resolve ambiguity in endianness, the byte order mark (BOM)—the character U+FEFF—is commonly prefixed to streams: in UTF-16BE it appears as the bytes FE FF; in UTF-16LE, as FF FE. For UTF-8, the BOM is the three-byte sequence EF BB BF, serving as an optional encoding signature rather than a byte-order indicator, though its use is discouraged in some protocols to avoid issues with signature interpretation. The BOM enables automatic detection of the encoding scheme but should not be interpreted as a regular character if present.
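The surrogate-pair formula can be checked with a short Python sketch (illustrative); Python exposes the code point directly, so the high and low surrogates are derived by hand and recombined here.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

cp = ord("😀")                            # U+1F600
high, low = to_surrogate_pair(cp)
print(hex(high), hex(low))               # 0xd83d 0xde00
assert from_surrogate_pair(high, low) == cp

# The same pair is visible in the raw UTF-16 (big-endian) serialization.
print("😀".encode("utf-16-be").hex(" ")) # d8 3d de 00
```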

Code Point Allocation and Documentation

The allocation of code points in the Unicode Standard is managed by the Unicode Consortium through a rigorous proposal and review process to ensure global and cultural representation. Proposals for new characters or scripts are submitted to the Consortium, where they undergo evaluation by technical committees, including the Unicode Technical Committee (UTC) and script-specific groups, based on criteria such as evidence of usage, distinctiveness from existing characters, and compatibility with encoding principles. Once approved, code points are assigned from reserved areas in the code space, with strict stability policies prohibiting reallocation—assigned code points remain fixed across versions, and no code point can be repurposed for a different character.

Code points are categorized to define their roles and behaviors, facilitating consistent implementation across systems. Key categories include Control (Cc) for legacy control codes like line feeds, Format (Cf) for characters that affect text layout without visible rendering such as zero-width joiners, and Private Use (Co) for vendor-specific or custom assignments that do not conflict with standard characters. These categories, along with others like Letter (L) and Symbol (S), are part of the General Category property, which informs rendering, searching, and processing rules.

Documentation of code points is centralized in the Unicode Character Database (UCD), a collection of machine-readable files that provide comprehensive properties for each assigned code point. The primary file, UnicodeData.txt, lists details for every character, including its formal name (e.g., "LATIN CAPITAL LETTER A" for U+0041), General Category, decomposition mappings for normalization, and numeric values where applicable. Additional UCD files, such as PropList.txt for binary properties and Blocks.txt for grouping code points into named ranges, ensure developers and researchers can access authoritative properties like bidirectional class or script identification.

Unicode versioning supports extensibility by incrementally adding code points in new blocks while preserving existing assignments, with each major release synchronized with ISO/IEC 10646. For instance, Unicode 17.0, released on September 9, 2025, introduced blocks for four new scripts, including the Sidetic script (U+10940–U+1095F) from ancient Anatolia, expanding support for underrepresented languages. Planes 15 (U+F0000–U+FFFFF) and 16 (U+100000–U+10FFFF) are reserved as Supplementary Private Use Areas-A and -B, respectively, allowing organizations to define up to 131,068 private code points without conflicting with standard assignments.

A representative example of code point allocation is emoji, which are documented largely in the supplementary planes with specific properties for variation. Emoji base characters, such as the grinning face (U+1F600), reside in blocks like Emoticons, while the emoji modifiers U+1F3FB–U+1F3FF enable skin tone variations (e.g., medium skin tone) through modifier sequences to promote inclusivity. These are flagged in the UCD with emoji-related properties and detailed in emoji-specific data files for consistent rendering across platforms.
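Many of these UCD properties are exposed through Python's unicodedata module, as the short illustrative sketch below shows (the module reflects the UCD version bundled with the interpreter, so very recent additions may be absent).

```python
import unicodedata

print(unicodedata.unidata_version)          # UCD version bundled with Python

for ch in ["A", "\u200d", "\U0001F600"]:
    name = unicodedata.name(ch, "<unnamed>")
    category = unicodedata.category(ch)     # General Category, e.g. Lu, Cf, So
    print(f"U+{ord(ch):04X}  {category}  {name}")

# Example output (exact names depend on the UCD version):
# U+0041  Lu  LATIN CAPITAL LETTER A
# U+200D  Cf  ZERO WIDTH JOINER
# U+1F600 So  GRINNING FACE
```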

Legacy and Regional Encodings

ASCII and Western Encodings

The American Standard Code for Information Interchange (ASCII), standardized in 1963 by the American Standards Association (now ANSI), defines a 7-bit character encoding scheme with 128 code points, ranging from 0 to 127. The standard was developed to facilitate data interchange, particularly for telegraphic and early computing applications, where the 7-bit format aligned with teletype equipment limitations. Within ASCII, code points 0–31 and 127 are designated as control characters for functions like carriage return and line feed, while 32–126 represent printable characters, including uppercase and lowercase letters, digits, and basic punctuation.

ASCII's 7-bit design allowed compatibility with transmission systems but restricted representation to just 128 characters, primarily suited to English text. As computing hardware evolved to use 8-bit bytes for storage and processing, extended 8-bit encodings emerged to utilize the full 256 possible code points, enabling support for additional Western European characters. One prominent extension is ISO/IEC 8859-1, known as Latin-1, first published in 1987 by the International Organization for Standardization. This 8-bit standard retains the ASCII subset in positions 0–127 and adds characters in the 160–255 range, including accented letters (e.g., á, ç, ñ) and symbols for languages such as French, Spanish, German, and Portuguese, totaling 191 graphic characters.

A widely used variant is Windows-1252, developed by Microsoft as a superset of ISO 8859-1 for Western European languages in its operating systems. Introduced in the early 1990s and finalized with updates in 1998, it differs from Latin-1 mainly in the 128–159 range by assigning printable characters instead of control codes, including the euro sign (€) at code point 128 to support the European currency introduced in 1999.

Despite the rise of Unicode, ASCII and its Western extensions remain prevalent in legacy systems, such as older databases and protocols, due to their simplicity and compactness. Notably, ASCII forms a direct subset of UTF-8, where the first 128 code points are encoded identically as single bytes, ensuring seamless integration in modern applications without conversion. This compatibility has sustained their influence, though their 256-character limit highlights the need for broader encodings in multilingual contexts.
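A small Python sketch (illustrative) highlights the practical differences: ASCII bytes are valid UTF-8 as-is, while the byte 0x80 decodes to a control position under Latin-1 but to the euro sign under Windows-1252.

```python
# ASCII is a strict subset of UTF-8: same bytes, same meaning.
assert "ASCII text".encode("ascii") == "ASCII text".encode("utf-8")

# The same byte means different things in different 8-bit encodings.
b = bytes([0x80])
print(repr(b.decode("latin-1")))   # '\x80'  (a C1 control position in Latin-1)
print(repr(b.decode("cp1252")))    # '€'     (euro sign in Windows-1252)
```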

EBCDIC and Mainframe Systems

EBCDIC, or Extended Binary Coded Decimal Interchange Code, is an 8-bit character encoding standard developed by IBM in 1963, providing 256 possible code points for representing characters primarily in mainframe computing environments. It evolved from the earlier 6-bit BCDIC (Binary Coded Decimal Interchange Code) used for punched card data processing, retaining a zone-digit structure that groups characters into zones rather than a strictly sequential order. This design results in a non-contiguous layout, where, for example, the decimal digits 0 through 9 are assigned to hexadecimal codes F0 through F9 (decimal 240–249), while uppercase letters A through I occupy C1 through C9, J through R are D1 through D9, and S through Z are E2 through E9, creating gaps between these ranges. EBCDIC remains the default encoding for data in IBM's z/OS operating system, which powers many enterprise mainframes, but it is fundamentally incompatible with ASCII due to differing byte assignments; for instance, the character "A" is encoded as 0xC1 in EBCDIC versus 0x41 in ASCII. To support multilingual needs, IBM introduced the Coded Character Set Identifier (CCSID) system, which specifies variants of EBCDIC code pages; notable examples include CCSID 37 for U.S. English, CCSID 500 for Western European languages, and CCSID 1047, a Latin-1 compatible variant used for open systems integration. Despite the dominance of Unicode in contemporary computing, EBCDIC persists in legacy mainframe applications, particularly in banking and finance sectors where systems handle high-volume transaction processing. Interoperability is achieved through conversion gateways and tools that transcode EBCDIC to Unicode, enabling data exchange with modern distributed systems while preserving the reliability of established mainframe infrastructures.
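Python ships codecs for several EBCDIC code pages, so the ASCII/EBCDIC divergence can be seen directly; the sketch below (illustrative) uses cp037, the U.S. English EBCDIC variant.

```python
# 'A' occupies different byte values in ASCII and EBCDIC (CCSID 37 -> cp037).
print("A".encode("ascii").hex())   # 41
print("A".encode("cp037").hex())   # c1

# Digits 0-9 sit at 0xF0-0xF9 in EBCDIC, reflecting its zone-digit layout.
print("0123456789".encode("cp037").hex(" "))
# f0 f1 f2 f3 f4 f5 f6 f7 f8 f9

# Transcoding to Unicode (and on to UTF-8) goes through an intermediate str.
text = bytes.fromhex("c1c2c3").decode("cp037")   # EBCDIC bytes for "ABC"
print(text, text.encode("utf-8").hex(" "))       # ABC 41 42 43
```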

CJK and Multilingual Encodings

Character encodings for Chinese, Japanese, and Korean (CJK) languages emerged in the 1980s and 1990s to handle the vast number of ideographic characters, which far exceeded the 256-code-point limit of single-byte systems. These multi-byte encodings, primarily variable-width, were developed regionally to support local scripts on personal computers and Unix systems, incorporating thousands of hanzi (Chinese), kanji (Japanese), and hanja (Korean) characters alongside Latin and other scripts. Unlike fixed-width encodings, they used one or two bytes (or more in later extensions) per character, enabling efficient representation of complex writing systems while maintaining partial compatibility with ASCII.

Shift-JIS, introduced in the 1980s by ASCII Corporation and Microsoft, encodes the JIS X 0208 standard, which defines 6,879 graphic characters including 6,355 kanji. It employs a variable-width scheme in which half-width (single-byte) codes from JIS X 0201 handle Roman letters, half-width katakana, and controls (0x00–0x7F and 0xA1–0xDF), while full-width (double-byte) sequences (0x81–0x9F or 0xE0–0xEF for the first byte, followed by 0x40–0x7E or 0x80–0xFC for the second) represent kanji, hiragana, full-width katakana, and symbols. This design allows seamless mixing with ASCII but leads to ambiguities in byte streams, as lead bytes overlap with some single-byte ranges. The Windows variant, Code Page 932 (CP932 or Windows-31J), extends Shift-JIS with NEC and IBM extensions for special symbols and rare kanji, and is registered with IANA as a distinct charset.

In mainland China, GBK (1995) extended the GB 2312-1980 standard, which covered 6,763 hanzi, to 21,003 total hanzi plus compatibility with GB 13000.1 (aligned to Unicode 1.0), using a double-byte structure similar to Shift-JIS: single-byte ASCII (0x00–0x7F) and double-byte sequences (first byte 0x81–0xFE) for hanzi. GBK was registered with IANA in 2002 but was later superseded by the mandatory GB 18030-2000 standard, which ensures full Unicode 3.0 coverage by incorporating all 20,902 Basic Multilingual Plane unified ideographs and adding four-byte sequences (first and third bytes 0x81–0xFE, second and fourth bytes 0x30–0x39) for rarely used extensions, while preserving backward compatibility with GBK and GB 2312. GB 18030-2022 further expanded coverage to 87,887 characters across multiple planes, aligning with Unicode 11.0 and supporting ethnic minority scripts such as Uyghur.

Big5, developed in 1984 by Taiwan's Institute for Information Industry and widely used in Taiwan and Hong Kong for traditional Chinese, encodes 13,053 characters drawn largely from the first two planes of CNS 11643-1992, prioritizing common hanzi. It uses a variable-width format with single-byte ASCII (0x00–0x7F) and double-byte hanzi (first byte 0xA1–0xF9, second byte 0x40–0x7E or 0xA1–0xFE). While not an official national standard, Big5 became the de facto encoding for traditional Chinese on PCs; extensions like Big5-2003 incorporate additional CNS 11643 planes covering tens of thousands more characters. In Unix environments, the EUC-TW encoding represents CNS 11643 using Extended Unix Code (EUC), with up to four bytes per character drawn from multiple planes, as defined by Taiwan's Chinese National Standard.

Han unification in the Unicode Standard merged visually similar ideographs from CJK sources, reducing the required code points from over 120,000 in disparate legacy sets to approximately 93,000 unified ideographs across extensions (e.g., 20,992 in the core block plus 42,720 in Extension B). This approach, detailed in the Unicode Standard's documentation, minimized redundancy but introduced challenges in rendering, as fonts must select region-specific glyphs for the same code point. Legacy issues persist in mixed-use scenarios, where misinterpreting bytes from Shift-JIS, GBK, or Big5 as another encoding causes mojibake—garbled text with substituted or nonsensical characters—particularly when ASCII and multi-byte sequences intermix without a proper encoding declaration. For instance, a Shift-JIS double-byte lead byte (0x81–0x9F) read as ISO-8859-1 yields Western control codes or symbols instead of kana or kanji. Unicode addresses CJK text through its unified model but requires careful conversion to mitigate such artifacts from pre-Unicode systems.
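The mojibake failure mode is easy to reproduce; the Python sketch below (illustrative) encodes Japanese text as Shift-JIS and then deliberately decodes the bytes with the wrong codecs.

```python
text = "文字化け"  # "mojibake" written in kanji and hiragana

sjis = text.encode("shift_jis")
print(sjis.hex(" "))                      # 95 b6 8e 9a 89 bb 82 af

# Decoding the same bytes with the wrong encoding yields garbled output.
print(sjis.decode("latin-1"))             # accented Latin / control characters
print(sjis.decode("cp437", errors="replace"))

# Only the correct codec round-trips the original text.
assert sjis.decode("shift_jis") == text
```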

Transcoding and Interoperability

Conversion Processes

Transcoding involves converting sequences of bytes from one character encoding to another by mapping the underlying code points through an intermediate abstract character repertoire, most commonly Unicode. The process typically proceeds in three steps: first, the source encoding is decoded into a sequence of Unicode code points representing abstract characters; second, these code points are processed if necessary to handle equivalences or other transformations; and third, the code points are encoded into the target encoding's byte sequence. Using Unicode as the pivot ensures a standardized intermediate form that facilitates interoperability across diverse encodings, as implemented in libraries like ICU, where conversions always route through Unicode (UTF-16 internally).

The fidelity of transcoding depends on the overlap between the source and target encoding repertoires. If all source characters have exact counterparts in the target, the conversion is lossless, preserving the original meaning and appearance. For instance, transcoding from ASCII to UTF-8 is lossless, as the 128 ASCII code points map directly to the first 128 Unicode code points, and UTF-8 encodes them using the same single-byte values. Conversely, conversions can be lossy when source characters fall outside the target's repertoire, leading to substitutions (e.g., replacement characters like �), omissions, or approximations that alter the text. An example is transcoding from ISO-8859-1 (Latin-1), which includes accented Western European characters, to GBK, a Chinese encoding focused primarily on CJK ideographs; while basic Latin letters map well, certain accented forms may lack precise equivalents in GBK's limited non-CJK extensions, resulting in data loss.

To mitigate issues arising from Unicode's canonical equivalences—where multiple sequences represent the same abstract character—normalization is often applied during transcoding. Specifically, converting the intermediate Unicode sequence to Normalization Form C (NFC) before encoding into the target ensures that precomposed characters (e.g., é as the single code point U+00E9) are used where possible, improving compatibility with legacy encodings that may not handle decomposed forms (e.g., e followed by a combining acute accent). This step canonicalizes the representation, reducing round-trip discrepancies in bidirectional conversions. The Unicode Standard recommends NFC for general text processing due to its compatibility with legacy data.

Conversion processes rely on mapping tables that define correspondences between code points in the source, Unicode pivot, and target encodings. These tables are typically bidirectional, allowing round-trip conversions where mappings are reversible (e.g., a source byte maps to a unique Unicode code point, which maps back uniquely). The International Components for Unicode (ICU) library exemplifies this with its comprehensive set of conversion data files, derived from standards like IBM's CDRA tables, which specify both forward and fallback mappings to handle partial overlaps gracefully. Such tables enable efficient, deterministic transcoding while flagging potential losses.
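The pivot-through-Unicode pattern, optional NFC normalization, and lossy fallback behavior can all be sketched in a few lines of Python (illustrative; the source and target encodings here are arbitrary choices).

```python
import unicodedata

def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode to Unicode, normalize to NFC, then encode to the target."""
    text = data.decode(src)                       # step 1: source bytes -> code points
    text = unicodedata.normalize("NFC", text)     # step 2: canonicalize
    return text.encode(dst, errors="replace")     # step 3: code points -> target bytes

latin1 = "café".encode("latin-1")

# Lossless: every Latin-1 character exists in UTF-8.
print(transcode(latin1, "latin-1", "utf-8"))      # b'caf\xc3\xa9'

# Lossy: a character missing from the target repertoire becomes a substitute.
print(transcode("café ☕".encode("utf-8"), "utf-8", "ascii"))   # b'caf? ?'
```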

Tools and Algorithms

Software libraries play a central role in implementing character encoding conversion, providing robust APIs for converting between various formats in applications. The International Components for Unicode (ICU), originally developed by Taligent and open-sourced by IBM in 1999, is a mature set of C/C++ and Java libraries that supports Unicode and globalization features, including conversion between Unicode and over 220 legacy character sets through its converter API. ICU's conversion facilities handle complex mappings, such as those involving multi-byte encodings like GB18030, ensuring accurate round-trip conversions where possible. Python's standard library includes the codecs module, which offers a registry of built-in encoders and decoders for standard encodings such as UTF-8, UTF-16, ASCII, and Latin-1, along with support for additional formats like Base64 and compression codecs. This module provides stream and file interfaces for transcoding, allowing developers to register custom codecs and perform operations like encode() and decode() on strings or bytes, making it essential for handling diverse text data in Python applications.

Transcoding algorithms typically rely on table-driven lookups for efficiency in simple cases, where precomputed mapping tables translate code points or bytes directly between encodings such as ASCII and EBCDIC. For more complex scenarios, such as converting UTF-16 to UTF-8, dynamic algorithms are employed to handle surrogate pairs: a high surrogate (U+D800 to U+DBFF) and low surrogate (U+DC00 to U+DFFF) are combined to form a single code point beyond the Basic Multilingual Plane, which is then encoded into 4 bytes in UTF-8 following the Unicode Transformation Format rules. These methods include validity checks, such as rejecting unpaired surrogates, to prevent corruption during conversion.

Encoding detection often precedes transcoding when the source format is unknown, using heuristic approaches like byte frequency analysis to infer the likely encoding. The chardet library, a popular Python tool, implements this by analyzing byte distributions against statistical models for various encodings; for instance, it probes for multi-byte patterns in UTF-8 or Shift-JIS by tracking character transitions and confidence scores based on observed frequencies. This probabilistic method achieves high accuracy for common legacy encodings but may require fallback strategies for ambiguous cases. In web development, the WHATWG Encoding Standard, as of 2025, specifies UTF-8 as the encoding for interchange, mandating its use in new protocols while defining fallback behavior for legacy labels like windows-1252 in browsers, ensuring seamless handling of mixed content through APIs such as TextEncoder and TextDecoder. Building on the conversion processes described above, these tools enable practical interoperability across diverse systems.
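A short sketch (illustrative) shows the codecs module and the third-party chardet package working together: detect first, then decode with the guessed encoding. chardet must be installed separately, and its guess carries a confidence score rather than a guarantee.

```python
import codecs
import chardet  # third-party: pip install chardet

data = "こんにちは、世界".encode("shift_jis")

guess = chardet.detect(data)              # heuristic; returns encoding + confidence
print(guess["encoding"], guess["confidence"])

codec = codecs.lookup(guess["encoding"])  # resolve the label via the codecs registry
print(codec.name)                         # shift_jis
print(data.decode(codec.name))            # こんにちは、世界
```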

Common Challenges

One of the primary challenges in converting between character encodings is the production of mojibake, which occurs when data encoded in one scheme, such as UTF-8, is incorrectly decoded using another, like ISO-8859-1 (Latin-1). This results in garbled or nonsensical text, as the byte sequences are misinterpreted according to a different mapping. For example, the UTF-8 representation of "é" uses the bytes C3 A9; when decoded as ISO-8859-1, these become the characters "Ã" followed by "©", rendering as "Ã©" instead of the intended accented letter. Such errors are common in email, web pages, and file transfers where the encoding declaration is absent or ignored, leading to persistent display issues across systems.

Another frequent issue is round-trip loss, where converting text from a modern encoding like a Unicode form to a legacy one and then reversing the process fails to preserve the original data. This irreversibility stems from repertoire mismatches: legacy encodings, such as ASCII or the ISO 8859 series, support only limited character sets and cannot represent supplementary Unicode characters, such as emoji (e.g., 😀 at U+1F600). During transcoding to these older schemes, such characters are typically replaced with placeholders, question marks, or omitted entirely, making full recovery impossible upon reconversion. This problem is particularly acute in systems interfacing with mainframes or older databases that rely on encodings like EBCDIC, where certain Unicode characters have no direct equivalents.

Encoding detection presents additional pitfalls due to overlapping byte patterns across schemes, complicating automated identification of the correct format. For instance, certain byte ranges in UTF-8 (a variable-length encoding using 1–4 bytes) coincide with those in Shift-JIS (a Japanese encoding using 1–2 bytes), allowing the same sequence to be validly interpreted in multiple ways without explicit declarations. This ambiguity can trigger incorrect decoding, especially in multilingual content or when HTTP headers or meta tags are missing or inconsistent. Although UTF-8 dominates web usage at 98.8% of sites as of late 2025, the remaining non-UTF-8 pages—often legacy or regionally specific—combined with misdeclarations, contribute to ongoing detection errors in real-world applications.

Overlong encodings in UTF-8 exacerbate these challenges by introducing invalid sequences that represent code points with unnecessary extra bytes, violating the standard's shortest-form requirement. For example, the ASCII character "/" (U+002F) can be illicitly encoded as the two-byte sequence C0 AF instead of the single byte 2F, potentially evading input filters or checks during validation. These non-standard forms are explicitly prohibited in UTF-8 for security reasons, as they enable attacks like bypassing validation logic in web applications or parsers. Proper handling requires strict decoders that reject such sequences, but inconsistent implementations across libraries can propagate errors. Tools such as the Universal Charset Detector (originally from Mozilla) offer heuristics for mitigation, though they cannot eliminate all ambiguities.
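These failure modes can be reproduced with a few lines of Python (illustrative): mojibake from a wrong decode, round-trip loss against ASCII, and rejection of an overlong sequence by a conformant UTF-8 decoder.

```python
# Mojibake: UTF-8 bytes for "é" misread as Latin-1.
print("é".encode("utf-8").decode("latin-1"))        # Ã©

# Round-trip loss: emoji cannot survive a detour through ASCII.
original = "hi 😀"
lossy = original.encode("ascii", errors="replace").decode("ascii")
print(lossy)                                        # hi ?

# Overlong encoding: C0 AF is an illegal two-byte form of "/" (U+002F).
try:
    bytes([0xC0, 0xAF]).decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```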

Modern Applications and Challenges

Internationalization and Localization

Internationalization (i18n) involves designing software to support multiple languages and regions without requiring code modifications, and character encodings like UTF-8 play a central role by enabling the representation of diverse scripts within a single application. UTF-8, as a variable-length encoding of the Unicode standard, allows seamless mixing of characters from different writing systems, such as Latin, Cyrillic, and Devanagari, facilitating global software development. For instance, applications can handle multilingual user interfaces by storing all text in UTF-8, which supports 159,801 characters across 172 scripts as of Unicode 17.0 (September 2025), including recent additions like the Sidetic and Tolong Siki scripts for better representation of lesser-known languages.

In POSIX-compliant systems, locales incorporate encoding tags to specify character sets, such as en_US.UTF-8, which combines English (United States) language conventions with UTF-8 encoding for broad script support. This configuration, set via environment variables like LANG or LC_CTYPE, ensures that applications interpret and display text correctly for the designated region, including proper handling of collating sequences and character classifications. The vast majority of mobile applications rely on Unicode-based encodings like UTF-8 to meet global user demands, driven by platform conventions in Android and iOS that assume Unicode text throughout.

Localization (l10n) builds on i18n by adapting content and interfaces for specific locales, where character encodings ensure accurate rendering of culturally appropriate text. For right-to-left languages like Arabic and Hebrew, Unicode's bidirectional algorithm (defined in Unicode Standard Annex #9, or UAX #9) automatically determines text directionality, reordering mixed right-to-left (RTL) and left-to-right (LTR) runs to maintain readability. The algorithm assigns embedding levels to characters—odd levels for RTL runs (e.g., Hebrew or Arabic) and even levels for LTR—enabling proper interleaving, such as displaying an English URL within an Arabic sentence that reads right to left while preserving LTR flow for the URL itself. Implementing UAX #9 in software libraries allows localized applications to support bidirectional text without manual adjustments, enhancing the user experience in regions using scripts like Arabic, which also require mirrored layouts and icons for intuitive navigation.
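A small Python sketch (illustrative) shows locale-plus-encoding selection and a peek at the bidirectional character classes that UAX #9 operates on; the locale name assumes an en_US.UTF-8 locale is installed on the host system.

```python
import locale
import unicodedata

# Select a locale whose name carries both region conventions and the encoding.
locale.setlocale(locale.LC_CTYPE, "en_US.UTF-8")    # assumes this locale exists
print(locale.getlocale(locale.LC_CTYPE))            # ('en_US', 'UTF-8')

# Bidirectional classes drive the UAX #9 reordering of mixed-direction text.
for ch in ["A", "א", "ا", "1"]:
    print(ch, unicodedata.bidirectional(ch))         # L, R, AL, EN
```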

Security and Compatibility Issues

Character encoding systems, particularly those supporting Unicode, introduce security vulnerabilities through visual similarities between characters from different scripts, enabling homograph attacks where malicious actors create deceptive domain names or text that appear legitimate to users. For instance, the Cyrillic lowercase "а" (U+0430) is visually indistinguishable from the Latin lowercase "a" (U+0061) in most fonts, allowing phishing sites like "xn--pple-43d.com" (an apple.com spoof) to mimic trusted brands. These attacks exploit the Internationalized Domain Names (IDN) framework, which permits non-ASCII characters in domains but relies on Punycode to transcode them into ASCII-compatible strings for the Domain Name System (DNS).

Variable-length encodings like UTF-8 exacerbate risks of buffer overflows, classified under CWE-120 (Buffer Copy without Checking Size of Input) and CWE-176 (Improper Handling of Unicode Encoding), where multi-byte sequences can exceed allocated memory if not properly validated, leading to crashes or code execution. Overlong representations, which encode characters using more bytes than necessary, further amplify this by bypassing length checks in legacy parsers. Mitigation involves using secure, encoding-aware libraries and strict decoders—for example, validating string handling in modern C++ or Java's CharsetDecoder—that enforce well-formedness and prevent overflows during decoding.

Compatibility issues arise when integrating modern Unicode encodings with legacy systems designed for fixed-width ASCII, necessitating fallbacks in which UTF-8 bytes in the 0x00–0x7F range are interpreted as ASCII to maintain interoperability without data loss. This ensures seamless handling of English text in older applications, but mismatches during conversion can introduce errors, such as mojibake, if encodings are not explicitly declared. For IDNs, Punycode (defined in RFC 3492, 2003) addresses this by reversibly mapping Unicode domain labels to ASCII, prefixed with "xn--", allowing global domain registration while preserving DNS's ASCII roots. Attacks exploiting zero-width joiners (U+200D) to concatenate characters invisibly and obfuscate malicious scripts or domains remain a concern, with defenses under ongoing development in browsers and security tools.

In recent years, there has been a pronounced shift toward universal adoption of UTF-8 as the dominant character encoding standard, particularly on the web, where it now accounts for over 98% of all websites, relegating legacy encodings such as ISO-8859-1 and Windows-1252 to less than 2% of traffic. This trend reflects broader efforts to ensure seamless interoperability in global digital communication, with UTF-8's variable-length efficiency and backward compatibility with ASCII driving its prevalence in modern applications, from web browsers to mobile operating systems. The ongoing evolution of the Unicode Standard continues to expand support for diverse scripts and symbols, as evidenced by Unicode 17.0, released in September 2025, which introduces 4,803 new characters, including four new scripts—among them Sidetic, Tolong Siki, and Beria Erfe—along with eight new emoji such as a trombone, treasure chest, and hairy creature. These additions underscore Unicode's commitment to encompassing emerging linguistic needs, with the total character repertoire now exceeding 159,800, facilitating better representation of lesser-known languages and cultural artifacts. Advancements in text processing increasingly emphasize grapheme clusters over individual code points to align more closely with user-perceived characters, as defined in Unicode Standard Annex #29.
For instance, complex constructs like family emoji built from zero-width-joiner sequences (e.g., 👨‍👩‍👧‍👦) or accented letters with combining diacritics are treated as single units, preventing fragmentation in editing, rendering, and input methods; this approach is gaining traction in 2025 software ecosystems, including terminal emulators and web browsers that implement proper segmentation for accurate cursor movement and deletion. Integration of machine learning techniques is emerging to enhance character encoding tasks, particularly automatic detection of encoding schemes and script identification for multilingual content. Google's Noto font family, supporting over 150 writing systems across more than 1,000 languages, exemplifies broad script coverage and enables robust rendering in AI-assisted design tools, though machine learning applications remain focused on detection rather than core encoding transformations. Additionally, AI-driven defenses are being developed to counter Unicode-based security exploits, such as homograph attacks, by analyzing anomalous character sequences in real time.
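The homograph risk described above can be made concrete with Python's built-in punycode codec (illustrative); the Cyrillic spoof of "apple" converts to exactly the ASCII label seen in the phishing example.

```python
# Homograph demo: "аpple" with a Cyrillic "а" (U+0430) instead of Latin "a".
spoof = "\u0430pple"
print(spoof == "apple")                      # False, despite looking identical

# Punycode (RFC 3492) maps the Unicode label to an ASCII-compatible form;
# IDNA prefixes it with "xn--" for use in DNS.
label = spoof.encode("punycode").decode("ascii")
print("xn--" + label)                        # xn--pple-43d

# Inspecting the code points exposes the mixed-script label.
print([f"U+{ord(c):04X}" for c in spoof])    # ['U+0430', 'U+0070', ...]
```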

References

  1. [1]
    Character encodings for beginners - W3C
    Apr 16, 2015 · A character encoding provides a key to unlock (ie. crack) the code. It is a set of mappings between the bytes in the computer and the characters ...
  2. [2]
    UTR#17: Character Encoding Model - Unicode
    A coded character set is defined to be a mapping from a set of abstract characters to the set of non-negative integers. This range of integers need not be ...
  3. [3]
    Character and data encoding - Globalization - Microsoft Learn
    Feb 2, 2024 · The process of assigning characters to numerical values is called character encoding. While there are many character encoding standards, one of ...
  4. [4]
    ASCII: The History Behind Encoding - DigiKey
    May 17, 2024 · ASCII (sometimes pronounced aahs-kee by the youngins today) is a character encoding standard designed in the 1960s that assigns a unique number ...
  5. [5]
    The Evolution of Character Encoding - ASCII table
    ASCII is a 7-bit encoding system, first published as a standard in 1963 by the American National Standards Institute (ANSI). It can represent 128 different ...
  7. [7]
    Chapter 1 – Unicode 16.0.0
    The Unicode Standard is the universal character encoding standard for written characters and text. It defines a consistent way of encoding multilingual text.
  8. [8]
    Unicode Standard
    The Unicode Standard is the universal character encoding designed to support the worldwide interchange, processing, and display of the written texts of the ...
  9. [9]
    History and technology of Morse Code
    It was developed by Samuel Morse and Alfred Vail in 1835. Morse code is an early form of digital communication. It uses two states (on and off) composed into ...
  10. [10]
    The Baudot Code - Electrical and Computer Engineering
    The Baudot Code is an early example of a binary character code based on 5-bit values defining 32 different codewords. The invention of this code in 1870 by ...
  11. [11]
    The Roots of Computer Code Lie in Telegraph Code
    Sep 11, 2017 · Baudot Code's biggest advantage over Morse Code, which was first used in the 1840s, and other earlier codes, was its speed. Earlier systems sent ...
  12. [12]
    The IBM punched card
    To load the program or read punched card data, each card would be inserted into a punched card reader to input data into a tabulating machine. As demand for ...
  13. [13]
    Punch Cards for Data Processing
    Punch cards became the preferred method of entering data and programs onto them. They also were used in later minicomputers and some early desktop calculators.
  14. [14]
    Binary Numbers - College of Computing and Software Engineering
    Each decimal digit required ten binary devices arranged so that one was on and the other nine were off. The circuit that was on indicated the digit represented.
  15. [15]
    History of Computers
    ENIAC (Electrical Numerical Integrator and Calculator) used a word of 10 decimal digits instead of binary ones like previous automated calculators/computers.
  16. [16]
    Programming on the Ferranti Mark 1 - The University of Manchester
    As 5-hole tape only provided 32 different values, and around 50 are necessary to provide for letters (upper case only), digits 0 - 9, and a reasonable selection ...
  17. [17]
    (DOC) Manchester Mark 1 - Academia.edu
    Because the Mark 1 had a 40-bit word length, eight 5-bit teleprinter characters were required to encode each word. Thus for example the binary word: 10001 ...
  18. [18]
    History - Ecma International
    This meeting was held on 27 April 1960 in Brussels; it was decided that an association of manufacturers should be formed which would be called European Computer ...
  19. [19]
    First edition of the ASCII standard is published - Event
    The first edition of the ASCII standard was published on June 17, 1963, after work began in 1960 and a proposal in 1961.
  20. [20]
    [PDF] code for information interchange - NIST Technical Series Publications
    This standard is a revision of X3.4-1968, which was developed in parallel with its international counterpart, ISO 646-1973. This current revision retains the ...
  21. [21]
    Guide to the use of Character Sets in Europe - Open Standards
    This led to the publication in 1967 of ISO Recommendation 646 (ISO had Recommendations rather than Standards at that time). ... The exception dates from an early ...
  22. [22]
    ISO/IEC JTC 1/SC 2 - Coded character sets
    Creation date: 1987. Scope. Standardization of graphic character sets and their characteristics, including string ordering, associated control functions, ...
  23. [23]
    ISO 8859-1:1987 Information processing — 8-bit single-byte coded ...
    Publication date. : 1987-02. Stage. : Withdrawal of International Standard [95.99]. Edition. : 1. Number of pages. : 7. Technical Committee : ISO/IEC JTC 1/SC 2.
  24. [24]
    About CNS \ Chinese Code Status - 全字庫 CNS11643 (2024)
    From the time when the ISO/IEC JTC1/SC2/WG2 working group was first established in 1986 to the time when the first part of the ISO 10646 standard "Architecture ...
  25. [25]
    Summary Narrative - Unicode
    Aug 31, 2006 · Unicode began as a project in late 1987 after discussions between engineers from Apple and Xerox: Joe Becker, Lee Collins and Mark Davis.
  26. [26]
    ISO/Unicode Merger: Interview with Ed Hart
    Oct 16, 1995 · In 1991 you had ISO 10646 and Unicode 1.0. Both codes wanted to be ... These talks led to an agreement to merge ISO/IEC 10646 and Unicode.
  27. [27]
    Unicode 1.0
    Jul 15, 2015 · Unicode 1.0 was the first published version, consisting of Volumes 1 and 2, code charts, and descriptions for many characters. It was published ...
  28. [28]
    Supported Scripts - Unicode
    Details about most of these scripts can be looked up at the ScriptSource website. Version (Year), Scripts Added, Totals. 1.1 (1993), 23. Arabic, Gujarati ...
  29. [29]
    UTS #51: Unicode Emoji
    Emoji became available in 1999 on Japanese mobile phones. There was an early proposal in 2000 to encode DoCoMo emoji in the Unicode standard. At that time, it ...
  30. [30]
    Rob Pike's UTF-8 history
    UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey diner one night in September or so 1992. What happened was this. We had used the original ...
  31. [31]
    How many Unicode characters are there - BabelStone
    Sep 12, 2023 · There are 154,998 encoded Unicode characters in version 16.0, but these do not always correspond to user-perceived characters.
  32. [32]
    Glossary of Unicode Terms
    Acronym for character encoding scheme. Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or ...
  33. [33]
    Annex B The Universal Character Set (UCS) - Open Standards
    There is effective cooperation between the Unicode Consortium and ISO/IEC JTC 1/SC 2 which should ensure that this compatibility is maintained in future ...
  34. [34]
    What is glyph? | Definition from TechTarget
    Apr 25, 2023 · A glyph -- from a Greek word meaning carving -- is a graphic symbol that provides the appearance or form for a character.
  35. [35]
    glyph - Wiktionary, the free dictionary
    First attested in 1727. Borrowed from French glyphe, from Ancient Greek γλυφή (gluphḗ, “carving”), from γλύφω (glúphō, “I carve, engrave”).
  36. [36]
    Unicode 17.0.0
    This page summarizes the important changes for the Unicode Standard, Version 17.0.0. This version supersedes all previous versions of the Unicode Standard.
  38. [38]
    [PDF] Unification of the Han Characters - Unicode
    As a result of the agreement to merge the Unicode standard and ISO 10646, the Unicode consor- tium agreed to adopt the unified Han character repertoire that was ...
  39. [39]
    UTR#17: Character Encoding Model - Unicode
    An abstract character is defined to be in a coded character set if the coded character set maps from it to an integer. That integer is said to be the code point ...
  40. [40]
    Chapter 1 – Unicode 17.0.0
    An encoded character is represented by a number from 0 to 10FFFF (hexadecimal), called a code point.
  41. [41]
    The Standard ASCII Character Set and Codes
    The Standard ASCII Character Set and Codes table given below contains 128 characters with corresponding numerical codes in the range 0..127 (decimal).
  42. [42]
    About the Code Charts - Unicode
    Reserved Characters. Character codes that are marked “<reserved>” are unassigned and reserved for future encoding. Reserved codes are indicated by a  glyph. ...
  43. [43]
    UTR#17: Unicode Character Encoding Model
    A coded character set is defined to be a mapping from a set of abstract characters to the set of nonnegative integers. This range of integers need not be ...
  44. [44]
    Chapter 3 – Unicode 16.0.0
    D75 Surrogate pair: A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is ...
  45. [45]
    UTR#17: Unicode Character Encoding Model
    When used without qualification, the terms UTF-8, UTF-16, and UTF-32 are ambiguous between their sense as Unicode encoding forms and as Unicode ...
  46. [46]
    [PDF] ISO/IEC 646:1991 (E) - iTeh Standards
    Dec 15, 1991 · International Standard ISO/IEC 646 was prepared by Joint Technical. Committee ISO/IEC JTC 1, Information technology. This third edition ...
  47. [47]
    [PDF] ISO 8859-1:1987 - iTeh Standards
    Equipment claimed to implement this part of ISO 8859 shall implement all 191 characters. ...
  48. [48]
    [PDF] INTERNATIONAL STANDARD ISO/IEC 10646
    ISO/IEC 10646 is an international standard for information technology, specifically a universal coded character set (UCS).
  50. [50]
    RFC 3629 - UTF-8, a transformation format of ISO 10646
    UTF-8, the object of this memo, has a one-octet encoding unit. It uses all bits of an octet, but has the quality of preserving the full US-ASCII [US-ASCII] ...
  51. [51]
    Usage statistics of UTF-8 for websites - W3Techs
    UTF-8 is used by 98.8% of all the websites whose character encoding we know. Historical trend. This diagram shows the historical trend in the percentage of ...
  52. [52]
    RFC 2045: Multipurpose Internet Mail Extensions (MIME) Part One
    For example, some communities use the term "character encoding" for what MIME calls a "character set", while using the phrase "coded character set" to ...
  53. [53]
    RFC 7231 - Hypertext Transfer Protocol (HTTP/1.1) - IETF Datatracker
    This document defines the semantics of HTTP/1.1 messages, as expressed by request methods, request header fields, response status codes, and response header ...
  54. [54]
    Content-Type header - HTTP - MDN Web Docs - Mozilla
    Jul 4, 2025 · The HTTP Content-Type representation header is used to indicate the original media type of a resource before any content encoding is applied.
  55. [55]
    UAX #15: Unicode Normalization Forms
    Jul 30, 2025 · ... Encoding Form. Incorrect buffer handling can introduce subtle errors in the results. Any buffered implementation should be carefully checked ...
  56. [56]
    Character entity references in HTML 4 - W3C
    The character entity references in this section are for escaping markup-significant characters (these are the same as those in HTML 2.0 and 3.2), for denoting ...
  58. [58]
    Extensible Markup Language (XML) 1.0 (Fifth Edition) - W3C
    Nov 26, 2008 · All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or ...
  59. [59]
    Han Unification History - Unicode
    The Unicode Han character set began with a project to create a Han character cross-reference database at Xerox in 1986. In 1988, a parallel effort began at ...
  60. [60]
    Roadmap to the TIP - Unicode
    The following table shows a map of the actual and proposed allocation on Plane 3, the TIP (Tertiary Ideographic Plane).
  62. [62]
    [PDF] The Unicode Standard, Version 16.0 – Core Specification
    Sep 10, 2024 · This is the Unicode Standard, Version 16.0 Core Specification, with an authoritative HTML version and an archival PDF version.
  64. [64]
    Charset (Java Platform SE 8 ) - Oracle Help Center
    The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of ...
  65. [65]
    The Unicode standard - Globalization | Microsoft Learn
    Feb 2, 2024 · The Unicode Standard is a character encoding supporting all writing systems, simplifying localization and enabling universal data exchange.
  66. [66]
    [PDF] Clarify guidance for use of a BOM as a UTF-8 encoding signature
    Jan 2, 2021 · In these two cases, the character U+FEFF is used as a signature to indicate the byte order and the character set by using the byte ...
  68. [68]
    Submitting Character Proposals - Unicode
    Before preparing a proposal, sponsors should note in particular the distinction between the terms character and glyph as therein defined.
  69. [69]
    Unicode® Character Encoding Stability Policies
    Jan 9, 2024 · This page lists the policies of the Unicode Consortium regarding character encoding stability. These policies are intended to ensure that text encoded in one ...
  70. [70]
    Registered Code Stability Policy - Unicode
    This document lists the policies of the Unicode Consortium regarding registered code stability. Assignment. Once a code is assigned, it will not be reallocated, ...
  71. [71]
    Character Properties - Unicode
    Control, a C0 or C1 control code. Cf, Format, a format control character. Cs, Surrogate, a surrogate code point. Co, Private_Use, a private-use character. Cn ...
  72. [72]
    UAX #44: Unicode Character Database
    Aug 27, 2025 · This annex provides the core documentation for the Unicode Character Database (UCD). It describes the layout and organization of the Unicode Character Database.
  73. [73]
    Unicode 16.0.0
    Sep 10, 2024 · Unicode 16.0 adds 5185 characters, for a total of 154,998 characters. The new additions include seven new scripts: Garay is a modern-use script ...
  74. [74]
    Chapter 2 – Unicode 16.0.0
    In contrast, a character encoding standard provides a single set of fundamental units of encoding, to which it uniquely assigns numerical code points. These ...
  75. [75]
    Milestones:American Standard Code for Information Interchange ...
    May 23, 2025 · The American Standards Association X3.2 subcommittee published the first edition of the ASCII standard in 1963. Its first widespread ...
  76. [76]
    Eight Bit Codes | pclt.sites.yale.edu
    Mar 2, 2010 · The ASCII character set was based on a 7 bit code that AT&T used to transmit teletype messages. ... ) to use every one of the 256 possible byte ...
  77. [77]
    The US ASCII Character Set - Columbia University
    Codes 0 through 31 and 127 (decimal) are unprintable control characters. Code 32 (decimal) is a nonprinting spacing character. Codes 33 through 126 (decimal) ...
  78. [78]
    ASCII | Definition, History, Trivia, & Facts | Britannica
    Sep 12, 2025 · On June 17, 1963, ASCII was approved as the American standard. However, it did not gain wide acceptance, mainly because IBM chose to use ...
  79. [79]
    What is ASCII (American Standard Code for Information Interchange)?
    Jan 24, 2025 · ASCII (American Standard Code for Information Interchange) is the most common character encoding format for text data in computers and on the internet.
  80. [80]
    Code Page Identifiers - Win32 apps - Microsoft Learn
    Jan 7, 2021 · For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page.
  81. [81]
    Microsoft Announces Plans to Support the Euro Currency Symbol
    Apr 29, 1998 · The company also said Windows 98 and Windows NT 5.0 will support the euro from their market availability dates. In addition, Microsoft's ...
  82. [82]
    The euro currency symbol (OpenType 1.8.3) - Typography
    Feb 11, 2022 · The Unicode assignment of the euro symbol is 20AC. The symbol will be added to the following codepages at position 0x80: 1250 Eastern European, 1252 Western, ...
  83. [83]
    UTF-8 (UCS transformation format) - IBM
    UTF-8 (UCS transformation format) · It is a superset of ASCII, in which the ASCII characters are encoded as single-byte characters with the same numeric value.
  84. [84]
    What is Extended Binary Coded Decimal Interchange Code (EBCDIC)
    Aug 23, 2023 · EBCDIC was developed by IBM in 1963 to complement the punched cards used for storage and data processing in the early days of computing.
  85. [85]
    Doug Jones's punched card codes - University of Iowa
    EBCDIC is a direct descendant of the 6-bit BCD codes used by IBM's early computers. One consequence of this is that the EBCDIC codes for the 029 character set ...
  86. [86]
    ASCII and EBCDIC conversion tables - IBM
    The following table is an ASCII-to-EBCDIC conversion table that translates 7-bit ASCII characters to 8-bit EBCDIC characters. Conversion table irregularities
  87. [87]
    The EBCDIC character set - IBM
    EBCDIC is a character set developed before ASCII, used to encode z/OS data. It is an 8-bit set, but has different bit assignments than ASCII.
  88. [88]
    Db2 12 - Internationalization - EBCDIC - IBM
    EBCDIC was developed by IBM in 1963. Certain characters are the same on every EBCDIC code page. Those characters are called invariant characters . Other ...
  89. [89]
    [PDF] Mainframe Banking Application Case Study | GenRocket
    EBCDIC is derived from the original data format used by IBM mainframe peripherals (e.g., disk drives, tape devices and card readers) and uses 8-bit character ...
  90. [90]
    Unicode on the mainframe - IBM
    Mainframes (using EBCDIC for single-byte characters), PCs, and various RISC systems use the same Unicode assignments. Unicode is maintained by the Unicode ...
  91. [91]
    Character Sets - Internet Assigned Numbers Authority
    Jun 6, 2024 · The second region (1000-1999) is for the Unicode and ISO/IEC 10646 coded character sets together with a specification of a (set of) sub- ...
  92. [92]
    shift_jis
    This charset is an extension of csHalfWidthKatakana by adding graphic characters in JIS X 0208. The CCS's are JIS X0201:1997 and JIS X0208:1997. Several vendor ...
  93. [93]
    GBK
    Application of IANA Charset Registration for GBK ... To remedy this shortcoming, the GBK specification has since been "replaced" by the mandatory GB 18030-2000 ...
  94. [94]
    IANA Charset Registration for GB18030
    Since GB18030 is fully ISO 10646 compatible, it readily supports CJK Extension B and other languages. GB18030 is a "mandatory" standard: starting September 1, ...
  95. [95]
    622 New CJK Ideographs to be Available in Unicode Version 15.1
    Aug 2, 2023 · The Unicode Standard will include 622 new CJK characters in Version 15.1, which will be released on September 12, 2023.
  96. [96]
    What issues lead people to use Japanese-specific encodings rather ...
    Jun 8, 2011 · Reason 2: Inefficient character conversions. Converting characters from Unicode to legacy Japanese encodings and back requires tables, i.e. ...
  97. [97]
    Converter | ICU Documentation
    ICU provides comprehensive character set conversion services, mapping tables, and implementations for many encodings. Since ICU uses Unicode (UTF-16) internally ...
  98. [98]
    International Components for Unicode - Character Set Mapping Tables
    ICU uses mapping tables for character set conversion, available in .ucm and .xml formats, and are a subset of IBM's CDRA tables.
  99. [99]
    ICU Documentation
    Background and History of ICU. ICU was originally developed by the Taligent company. The Taligent team later became the Unicode group at the IBM® Globalization ...
  100. [100]
    International Components for Unicode (ICU) - IBM
    ICU supports the most current version of the Unicode Standard, including supplementary Unicode characters that are needed for support of GB 18030, HKSCS, and ...
  101. [101]
    codecs — Codec registry and base classes — Python 3.14.0 ...
    This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry.
  102. [102]
    Transcoding unicode characters with AVX‐512 instructions
    Sep 12, 2023 · ... lookup table. The accelerated UTF-8 to UTF-16 transcoding algorithm processes up to 12 input UTF-8 bytes at a time. Given the input bytes ...
  103. [103]
    RFC 2781 - UTF-16, an encoding of ISO 10646 - IETF Datatracker
    This document describes the UTF-16 encoding of Unicode/ISO-10646, addresses the issues of serializing UTF-16 as an octet stream for transmission over the ...
  104. [104]
    How it works — chardet 5.0.0 documentation - Read the Docs
    UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings.
  105. [105]
    Encoding Standard - whatwg
    Aug 12, 2025 · The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore, for new protocols ...
  106. [106]
    UTF-8 mojibake – a practical guide to understanding decoding errors
    Apr 20, 2022 · I am going to talk about the errors that one is likely to encounter in the wild. How to recognise them and understand what happened.
  107. [107]
    Converting UTF-8 to ISO-8859-1 in Java | Baeldung
    Jun 20, 2024 · Explore two approaches for converting UTF-8 encoded strings to ISO-8859-1.
  108. [108]
    Character Model for the World Wide Web: String Matching - W3C
    Aug 11, 2021 · These presentation forms are intended only for the support of round-trip encoding conversions with the legacy character encodings that include ...
  109. [109]
    Round Trip Safety Configuration of Shift-JIS Characters - IBM
    Although the characters from most native character encoding systems are round trip safe, the Shift-JIS encoding system is an exception. Approximately 400 ...
  110. [110]
    Unicode Basics | ICU Documentation
    ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. The ICU User Guide ...
  111. [111]
    Usage statistics of character encodings for websites - W3Techs
    UTF-8 is used by 98.8% of all the websites whose character encoding we know ... 0.1%. Shift JIS. 0.1%. W3Techs.com, 6 November 2025. Percentages of websites using ...
  112. [112]
    CAPEC-80 - Using UTF-8 Encoding to Bypass Validation Logic
    For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. If you use a parser to decode the ...
  113. [113]
    A composite approach to language/encoding detection - Mozilla
    Nov 26, 2002 · This paper presents three types of auto-detection methods to determine encodings of documents without explicit charset declaration.
  114. [114]
    How UTF-8 links to the i18n process? - Lingoport
    UTF-8 encoding is crucial for internationalization and localization (i18n) of websites and software. Learn how it supports multi-language content and ...
  115. [115]
    Unicode – The World Standard for Text and Emoji
    Jun 14, 2024 · Learn more about Unicode. Adopt Character, Technical Work, Support Unicode, About Us, News and Events, Legal & Licensing, Contact The Unicode Consortium.
  116. [116]
    How to Change or Set System Locales in Linux - Tecmint
    Jul 13, 2023 · To display a list of all available locales use the following command. $ locale -a C C.UTF-8 en_US.utf8 POSIX. How to Set System Locale in Linux.
  117. [117]
    ICU - International Components for Unicode
    ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and ...
  118. [118]
    Internationalization vs. localization (i18n vs l10n): The differences
    Sep 23, 2024 · Arabic is a right-to-left (RTL) language, which means that not only do you need to translate the content, you also need to re-design the ...
  120. [120]
    Homograph attacks: How hackers exploit look-alike domains
    Apr 16, 2025 · For example, in Unicode the Cyrillic “а” (U+0430) looks identical to the Latin “a” (U+0061). They might use characters from different ...
  121. [121]
    An Automated Framework for Detecting IDN Homographs - arXiv
    Sep 17, 2019 · ... Latin character 'a' (U+0061) and Cyrillic character 'a' (U+0430). Visually identical characters such as these are generally known as ...
  122. [122]
    Phishing with Unicode Domains - Bram.us
    Apr 18, 2017 · ... homograph attacks. For example: the Cyrillic а (codepoint U+0430) totally looks like the Latin a (codepoint U+0061). When visting brаm.us ...
  123. [123]
    CWE-176: Improper Handling of Unicode Encoding (4.18) - Mitre
    Mistakenly specifying the wrong units in a size argument can lead to a buffer overflow. The following function takes a username specified as a multibyte string ...
  124. [124]
    The Security Risks of Overlong UTF-8 Encodings - usd HeroLab
    Sep 6, 2024 · While overlong UTF-8 encodings may seem like a niche topic, their security implications can be profound and are worth considering.
  125. [125]
    CWE-120: Buffer Copy without Checking Size of Input ('Classic ...
    Buffer overflows often can be used to execute arbitrary code, which is usually outside the scope of the product's implicit security policy. This can often be ...
  126. [126]
    Character encoding: Types, UTF-8, Unicode, and more explained
    Apr 7, 2025 · The idea is simple: assign a number to each character. For example, the letter "A" gets the number 65, "B" gets 66, and so on. Since computers ...
  127. [127]
    What Is UTF-8 Encoding? - JumpCloud
    A significant feature of UTF-8 is its backward compatibility with ASCII, the former encoding standard for English characters. This means ASCII characters are ...
  128. [128]
    RFC 3492 - Punycode: A Bootstring encoding of Unicode for ...
    Punycode is a simple and efficient transfer encoding syntax designed for use with Internationalized Domain Names in Applications (IDNA).
  129. [129]
    Zero-Width Characters in Cybersecurity: Safe or Risky? - Invisible Text
    May 8, 2025 · Learn how zero-width characters are used in phishing attacks, how to detect them, and protect your site with smart, ethical security ...
  130. [130]
    Advanced Phishing Attack Conceals JavaScript with Invisible ...
    Feb 25, 2025 · Attackers insert zero-width Unicode characters inside JavaScript keywords and function names to disguise malicious code. This makes it appear ...
  131. [131]
    Historical trends in the usage statistics of character encodings for ...
    This report shows the historical trends in the usage of the top character encodings since October 2024.
  132. [132]
    What's New In Unicode 17.0 - Emojipedia Blog
    Sep 9, 2025 · Unicode 17.0 includes a total of 4,802 new characters, of which 7 are the brand-new emoji codepoints discussed above. This brings the total ...
  133. [133]
    UAX #29: Unicode Text Segmentation
    A single Unicode code point is often, but not always the same as a basic unit of a writing system for a language, or what a typical user might think of as a “ ...
  134. [134]
    Grapheme Clusters and Terminal Emulators - Mitchell Hashimoto
    Oct 1, 2023 · This blog post describes why this happens and how terminal emulator and program authors can achieve consistent spacing for all characters.
  136. [136]
    [PDF] Automatic Detection of Character Encoding and Language - CS229
    Therefore, automatically detecting the correct character encoding from the given text can serve many people using various character encodings, including their ...
  137. [137]
    Noto Home - Google Fonts
    Noto is a collection of high-quality fonts in more than 1000 languages and over 150 writing systems.
  138. [138]