Windows code page
Windows code pages are character encoding schemes employed by the Microsoft Windows operating system to represent text using numeric identifiers, primarily supporting legacy applications, older mail servers, and console interfaces that predate widespread Unicode adoption.[1] These code pages extend the 7-bit ASCII set (code points 0x00–0x7F) with additional characters in the 8-bit range (0x80–0xFF) to accommodate international glyphs, punctuation, and symbols specific to various languages and locales.[1] Commonly referred to as "ANSI code pages," they are based on drafts from the American National Standards Institute but differ slightly from official standards like ISO 8859-1, with code page 1252 serving as the default for Western European languages in many English-language Windows installations.[1][2] Historically, code pages emerged in the 1980s and 1990s to address the limitations of single-byte character sets in supporting diverse writing systems, starting with 8-bit single-byte character sets (SBCS) for Latin-based scripts and evolving to double-byte (DBCS) and multi-byte (MBCS) variants for East Asian languages such as Japanese (code page 932, based on Shift JIS) and Simplified Chinese (code page 936, based on GBK).[3] Distinct from ANSI code pages, OEM code pages—like 437 for United States English—were designed for MS-DOS environments, emphasizing line-drawing characters and compatibility with FAT file systems, and they remain relevant for console and legacy hardware interactions.[1][2] In Windows, the active code page is system-wide and influences ANSI API functions (prefixed with "A"), while Unicode-based APIs (prefixed with "W") bypass code pages entirely for broader compatibility.[1] Although Windows has relied on Unicode internally since Windows NT for comprehensive global text handling, code pages persist for backward compatibility, enabling functions like MultiByteToWideChar and WideCharToMultiByte to convert between legacy encodings and Unicode.[3][1] Microsoft recommends transitioning to Unicode encodings, such as UTF-8 (code page 65001), for new applications to avoid data corruption risks associated with varying system-default code pages across locales.[2][1] Over 50 code pages are supported in modern Windows versions, covering languages from Arabic (code page 1256) to Vietnamese (code page 1258), but their use is increasingly limited to specific scenarios like regional file naming or command-line tools.[2][3]
Fundamentals
Definition and Purpose
A Windows code page is a character encoding system that defines a mapping between byte values, typically in an 8-bit range, and specific characters, enabling the representation of text in legacy Windows applications and files. These mappings associate sequences of bytes—most commonly single bytes for 256 possible characters—with Unicode code points, allowing for the encoding and decoding of text data. Developed by Microsoft, code pages overcome the limitations of 7-bit ASCII by incorporating an additional 128 characters in the upper byte range (0x80–0xFF), which vary according to language or regional requirements.[3][1] The primary purpose of Windows code pages is to facilitate internationalization in software and systems prior to the widespread adoption of Unicode, supporting non-ASCII characters for various scripts such as Latin extensions, Cyrillic, Arabic, and others. By providing deterministic translations—often one-to-one for single-byte sets or many-to-one for multi-byte variants—code pages ensure compatibility for legacy applications, older mail and news servers, command-line tools, and document formats that rely on regional character sets. For instance, code page identifiers like CP1252 are used for Western European languages, assigning unique byte values to accented letters and symbols not covered by basic ASCII. This approach addressed the need for localized text handling in global markets without requiring a universal encoding standard at the time.[3][1] Key characteristics of Windows code pages include their identification by numeric codes (e.g., 1252 for ANSI Western European), support for both single-byte character sets (SBCS) and double- or multi-byte character sets (DBCS/MBCS) for denser scripts, and a fixed mapping that remains consistent within a given locale. Commonly referred to as "ANSI code pages" in Windows contexts, they are not identical to formal ANSI or ISO standards but are based on drafts like ISO 8859-1, with Microsoft-specific extensions for broader compatibility. While modern Windows primarily uses Unicode internally for universal character support, code pages persist for backward compatibility in transitional environments.[1][3]
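To make the mapping concrete, the following minimal C sketch (illustrative, not from the original text) decodes a Windows-1252 byte string into UTF-16 with the Win32 MultiByteToWideChar function; byte 0xE9 occupies the é slot in CP1252's extended range.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* "café" encoded in Windows-1252: 0xE9 is é in the 0x80–0xFF range. */
    const char cp1252[] = { 'c', 'a', 'f', '\xE9', 0 };
    wchar_t wide[16];

    /* Decode the byte string using code page 1252 into UTF-16. */
    int n = MultiByteToWideChar(1252, MB_ERR_INVALID_CHARS,
                                cp1252, -1, wide, 16);
    if (n == 0) {
        fprintf(stderr, "conversion failed: %lu\n", GetLastError());
        return 1;
    }
    wprintf(L"%ls\n", wide);  /* prints café on a Unicode-capable console */
    return 0;
}
```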
Types of Windows Code Pages
Windows code pages are categorized into several main types based on their intended use and character encoding mechanisms within the operating system. These include ANSI code pages, OEM code pages, multi-byte character set (MBCS) code pages, and Unicode-based code pages, each serving distinct roles in handling text data across different contexts such as graphical user interfaces, consoles, and internationalized applications.[2] ANSI code pages, also known as active code pages (ACP), are primarily used for text rendering in Windows graphical user interfaces (GUI), file I/O operations, and legacy text files, with the default varying by system locale—for instance, code page 1252 (Windows-1252) for English-language systems.[2][4] These single-byte encodings map the 256 possible byte values to characters, supporting Western European languages in the default case, and are retrieved programmatically via the GetACP API function.[4] OEM code pages, in contrast, are designed for console applications, command-line interfaces, and compatibility with MS-DOS-era systems, often differing from ANSI pages to accommodate hardware-specific character sets like box-drawing symbols.[2][5] For example, code page 437 serves as the OEM default for United States English locales, and it can be queried using the GetOEMCP API.[2][6] These pages ensure proper display in text-mode environments but are locale-dependent and not suitable for cross-system data exchange without verification.[5] Multi-byte code pages (MBCS) extend single-byte capabilities to support languages with extensive character sets, such as East Asian scripts, by employing a variable-length scheme where most characters use a single byte but others require a lead byte followed by a trail byte to encode extended glyphs.[2][7] Examples include code page 932 for Japanese (Shift JIS) and 936 for Simplified Chinese (GBK), which allow Windows applications to process double-byte characters seamlessly in MBCS-aware strings.[2][3] This approach enables denser representation of non-Latin scripts but requires careful byte parsing to distinguish single- from multi-byte sequences.[8] Unicode code pages represent a modern bridge between legacy systems and universal text encoding, incorporating UTF formats directly as code pages for interoperability; notable examples are code page 1200 for UTF-16 little-endian and 65001 for UTF-8, which support all Unicode characters without locale-specific limitations.[2] These are increasingly recommended over traditional pages to avoid data corruption from varying system defaults, and they integrate with APIs like MultiByteToWideChar for conversions.[9][1] All Windows code pages are identified by unique numeric identifiers (e.g., 1252 for ANSI Western European), which applications use to specify encoding in functions like CreateFile or registry queries.[2] Character mappings for these code pages are stored in National Language Support (NLS) files, such as C_1252.NLS, located in the %SystemRoot%\System32 directory, while active code page settings are configurable via registry keys under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage (e.g., ACP for ANSI).[10][11] These mappings are loaded into memory by system DLLs like kernel32.dll during runtime for efficient text processing.[1]
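A brief sketch of querying these identifiers with the documented Win32 functions mentioned above (GetACP, GetOEMCP, and IsValidCodePage):

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    UINT ansi = GetACP();    /* ANSI code page for GUI/file APIs, e.g. 1252 */
    UINT oem  = GetOEMCP();  /* OEM code page for console/DOS use, e.g. 437 */

    printf("ANSI code page: %u\n", ansi);
    printf("OEM  code page: %u\n", oem);

    /* IsValidCodePage reports whether a code page is installed/supported. */
    printf("CP 932 valid: %s\n", IsValidCodePage(932) ? "yes" : "no");
    printf("CP 65001 (UTF-8) valid: %s\n",
           IsValidCodePage(65001) ? "yes" : "no");
    return 0;
}
```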
Historical Development
Origins in MS-DOS and Early Windows
The origins of Windows code pages trace back to the MS-DOS era in the 1980s, where they emerged as essential extensions to handle international characters and graphics on the IBM PC platform. In 1981, with the release of the IBM PC, Code Page 437 (CP437) was introduced as the original OEM code page, extending the 7-bit US-ASCII standard to an 8-bit, 256-character set that included box-drawing graphics, mathematical symbols, and a selection of European accented characters to support text-based user interfaces and early applications.[12] This code page, also known as OEM-US or PC-8, was designed for compatibility with the IBM PC's hardware, particularly its display adapters, and became the foundational character encoding for MS-DOS systems.[1] Key milestones in the 1980s built upon CP437 as the baseline for subsequent code pages. IBM established CP437 as the standard for the US English market, while Microsoft extended this framework for international MS-DOS versions to accommodate diverse linguistic needs. A notable example is CP850, introduced in 1987 with MS-DOS 3.3, which served as a multilingual extension for Western European languages, Latin America, and Canada, incorporating a broader set of Latin-1 characters while retaining compatibility with CP437's structure.[12] These developments allowed MS-DOS to support country-specific variants, loaded dynamically to adapt to regional requirements without altering the core operating system. Technically, these early code pages operated as 8-bit supersets of 7-bit US-ASCII, where the first 128 characters (hex 00-7F) matched ASCII exactly, and the upper 128 (hex 80-FF) provided extensions for localized content such as diacritics and line art. In MS-DOS, country-specific code pages were configured via the CONFIG.SYS file during boot, using commands like COUNTRY=XXX to load appropriate national language support files (e.g., COUNTRY.SYS) and DISPLAY.SYS to set console code pages, enabling seamless switching between encodings like CP437 for the US or equivalents for other regions.[13] This modular approach ensured hardware and software compatibility across global markets.
Early Windows versions from 1.0 (1985) to 3.x (up to 1992) inherited these MS-DOS OEM code pages primarily for console and backward compatibility, maintaining support for CP437 and its variants in command-line interfaces and file systems. For graphical user interfaces (GUI), Windows introduced "ANSI" code pages—distinct from true ANSI standards but based on ECMA-94—to handle text rendering, with the initial set in Windows 1.0 evolving to include additional characters by Windows 3.1 and tying selections to regional settings for localized installations.[14] This dual system of OEM for legacy DOS integration and ANSI for native Windows applications laid the groundwork for the platform's character encoding architecture.[1]
Evolution and Standardization
With the release of Windows 95 in 1995 and Windows NT 4.0 in 1996, Microsoft formalized the Windows-125x series as the primary ANSI code pages for single-byte character encodings in the Windows environment, establishing them as the default for text handling in graphical applications. These code pages, such as CP1252 for Western European languages, were designed to extend the ASCII range (0x00-0x7F) while aligning as closely as possible with the ISO/IEC 8859 standards, for instance mapping CP1252 to ISO/IEC 8859-1 for Latin-1 characters. However, alignments were incomplete due to proprietary Microsoft extensions, including the addition of 27 printable characters in the 0x80-0x9F range of CP1252, which ISO/IEC 8859-1 left undefined for control codes.[15][16] Simultaneously, Windows 95 introduced enhanced support for multi-byte character sets through Double-Byte Character Sets (DBCS), enabling efficient handling of languages requiring more than 256 characters, such as those in East Asia, by using lead and trail bytes for extended glyphs while preserving ASCII compatibility. This formalization was documented in the Windows 95 SDK, where conversion tables for code pages like CP1252 were provided in files such as UNICODE.BIN, supporting up to 18 code pages in international editions for bidirectional mappings between code pages and internal Unicode representations. Standardization efforts involved Microsoft's submission of the Windows-125x mappings to the Internet Assigned Numbers Authority (IANA) for official MIME charset registration, with CP1250 (Central European) registered on May 3, 1996, following collaborative development with ISO/IEC to incorporate ISO 8859-2 mappings while adding vendor-specific characters.[16][17][18] In the mid-1990s, Microsoft expanded the Windows-125x series to address global market needs, introducing CP1251 for Cyrillic scripts in 1995 to support languages like Russian and Bulgarian, building on earlier non-English Windows 3.1 implementations but integrating it fully into the Windows 95/NT ecosystem. This was followed by CP1256 for Arabic in 1996, which incorporated right-to-left text rendering and visual ordering adjustments, registered with IANA on May 3, 1996, to facilitate telecom and document exchange in Middle Eastern regions. These expansions reflected influences from ITU-T recommendations for international telecom encodings, such as those in Recommendation T.61 for teletex services, where Microsoft adapted mappings to ensure compatibility with global data transmission standards despite proprietary deviations. Key documentation of these code pages, including detailed glyph tables and conversion matrices, appeared in Microsoft SDK releases from 1995 onward, serving as authoritative references for developers despite the incomplete harmonization with ISO standards due to extensions for Windows-specific typography.[14][19][20]
Transition to Unicode
The Windows NT family has used Unicode internally since Windows NT 3.1 in 1993, initially with UCS-2 encoding, providing Unicode support for the enterprise line while the consumer Windows 9x series continued relying on code pages. With the release of Windows 2000 in 2000, Microsoft upgraded the operating system's internal text encoding to native UTF-16, including surrogate pairs for full Unicode support beyond the Basic Multilingual Plane across applications and system components.[21] This change marked a significant pivot from reliance on single-byte and multi-byte code pages, which were limited to specific language sets, toward a unified encoding capable of handling global scripts. However, to maintain compatibility with existing software, Windows retained support for legacy code pages through conversion APIs such as MultiByteToWideChar and WideCharToMultiByte, allowing applications to translate between code page-based strings and UTF-16 Unicode.[1] Windows XP, released in 2001, introduced UTF-8 as code page 65001 (CP_UTF8), enabling limited support for this variable-length encoding in APIs and file handling, though initial implementations suffered from bugs, particularly in console output and certain localization scenarios.[2] These issues persisted for years, prompting developers to favor UTF-16 for reliability, but Microsoft addressed many through cumulative updates, with notable improvements in console handling by Windows 10's version 1607 (Anniversary Update) in 2016, enhancing UTF-8 stability for non-Unicode applications.[22] By Windows 10 version 1903 (May 2019 Update), further refinements included the activeCodePage manifest property, allowing apps to declare UTF-8 as their default code page, and a beta system-wide option to set UTF-8 as the active ANSI code page via registry or settings (e.g., enabling "Beta: Use Unicode UTF-8 for worldwide language support" under Administrative language settings).[23] As Windows evolved, code pages' role diminished, with Microsoft explicitly marking them as legacy components by Windows 11 in 2021 to discourage new development reliance on them in favor of Unicode encodings like UTF-8 and UTF-16.[3] This deprecation trend emphasized UTF-8 adoption for legacy non-Unicode apps through configuration tweaks, such as registry keys under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage, while ensuring backward compatibility for older systems and tools.[23] Overall, the transition underscored Unicode's superiority for internationalization, reducing the fragmentation caused by region-specific code pages while preserving interoperability via robust conversion mechanisms.[24]
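For the manifest route described above, Microsoft documents an application manifest fragment along these lines (a minimal sketch; the element and namespace follow the published schema, and the rest of the manifest is elided):

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <!-- Opt this process into UTF-8 as its active (ANSI) code page. -->
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```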
Single-Byte Code Page Families
Windows-125x Series
The Windows-125x series comprises a family of 8-bit single-byte code pages, designated as code pages 1250 through 1258, developed by Microsoft to support Western European and other non-East Asian scripts in Windows operating systems. These code pages serve as the primary ANSI code pages for graphical user interfaces, extending the ISO/IEC 8859 family of standards by incorporating additional glyphs not defined in the ISO specifications, such as curly quotation marks, em dashes, and other typographic symbols in the range 0x80–0x9F. Unlike the ISO 8859 standards, which reserve this C1 control range for non-printing characters, the Windows-125x implementations assign printable characters to these bytes to better accommodate common usage in word processing and display applications.[25][1] Each code page in the series targets specific linguistic regions, mapping the first 128 bytes (0x00–0x7F) identically to the ASCII standard while using the extended range (0x80–0xFF) for language-specific accented letters, symbols, and punctuation. For instance, code page 1252 (Western European, also known as Windows Latin 1) was adopted in the 1980s as the default for English and other Western European languages, based on an early American National Standards Institute (ANSI) draft that preceded the finalization of ISO 8859-1; it includes characters like the en dash (–) at 0x96 and the trademark sign (™) at 0x99, positions in the 0x80–0x9F block that ISO 8859-1 leaves undefined or reserves for control codes.[1][25] Code page 1250 supports Central European languages (e.g., Polish, Czech) by extending ISO 8859-2 with additional diacritics; 1251 handles Cyrillic scripts (e.g., Russian, Bulgarian) based on ISO 8859-5; 1253 covers Greek, extending ISO 8859-7; 1254 addresses Turkish needs, modifying ISO 8859-9; 1255 encodes Hebrew from right-to-left, drawing from ISO 8859-8; 1256 supports Arabic, also based on ISO 8859-6; 1257 serves Baltic languages (e.g., Latvian, Lithuanian), extending ISO 8859-4 and 13; and 1258 accommodates Vietnamese, combining Latin characters with tone marks in a manner similar to but distinct from VISCII.[2][25] The following table summarizes the key code pages in the series, their primary language coverage, and .NET encoding names:
| Code Page | Description | Primary Languages/Region | .NET Name |
|---|---|---|---|
| 1250 | ANSI Central European | Central/Eastern European (Latin script) | windows-1250 |
| 1251 | ANSI Cyrillic | Cyrillic (Russian, Ukrainian, etc.) | windows-1251 |
| 1252 | ANSI Latin 1 (Western) | Western European (English, French, etc.) | windows-1252 |
| 1253 | ANSI Greek | Greek | windows-1253 |
| 1254 | ANSI Turkish | Turkish | windows-1254 |
| 1255 | ANSI Hebrew | Hebrew | windows-1255 |
| 1256 | ANSI Arabic | Arabic | windows-1256 |
| 1257 | ANSI Baltic | Baltic (Latvian, Lithuanian, etc.) | windows-1257 |
| 1258 | ANSI/OEM Vietnamese | Vietnamese | windows-1258 |
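The euro sign illustrates the divergence from ISO 8859-1: encoded into code page 1252 it lands on byte 0x80, a position ISO 8859-1 reserves for a C1 control code. A minimal C sketch using WideCharToMultiByte:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const wchar_t *text = L"\u20AC100";  /* "€100" as UTF-16 */
    char out[16];
    BOOL lossy = FALSE;

    /* Encode into windows-1252; the euro sign lands on byte 0x80,
       a slot that ISO 8859-1 leaves to C1 control codes. */
    int n = WideCharToMultiByte(1252, 0, text, -1, out, sizeof(out),
                                NULL, &lossy);
    if (n > 0)
        printf("first byte: 0x%02X, lossy: %s\n",
               (unsigned char)out[0], lossy ? "yes" : "no");
    return 0;
}
```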
OEM and DOS Code Pages
OEM and DOS code pages refer to a family of 8-bit character encodings designed for use in MS-DOS systems and Windows console applications, where they handle text display and input in legacy environments. Unlike ANSI code pages, which prioritize international characters in the upper byte range, OEM code pages allocate values from 0x80 to 0xFF primarily to graphics symbols, including line-drawing elements, block characters, and punctuation for text-based user interfaces.[1] These encodings originated with the IBM PC in the early 1980s, evolving from the initial MS-DOS support for regional variations.[12] In MS-DOS, OEM code pages were configured and loaded through the COUNTRY= directive in the CONFIG.SYS file, which specified a country code and optional code page identifier to enable appropriate character sets, keyboard layouts, and formatting conventions from a supporting file like COUNTRY.SYS.[26] This mechanism allowed DOS to adapt to different locales without altering the core 7-bit ASCII base (0x00–0x7F), which remained consistent across code pages. The first such code page, CP437, was introduced in 1981 for the United States and included dedicated slots for box-drawing characters to support applications like early text editors and games.[12][2] Subsequent OEM code pages extended this model to other regions, replacing graphics symbols with language-specific characters while retaining many visual elements for compatibility. Key examples include:
| Code Page | Name | Region/Language |
|---|---|---|
| 437 | IBM437 | United States |
| 737 | IBM737 | Greek |
| 775 | IBM775 | Baltic States |
| 850 | IBM850 | Western Europe (Multilingual) |
| 852 | IBM852 | Central Europe |
| 855 | IBM855 | Cyrillic (Russian) |
| 857 | IBM857 | Turkish |
| 860 | IBM860 | Portuguese |
| 861 | IBM861 | Icelandic |
| 863 | IBM863 | Canadian French |
| 865 | IBM865 | Nordic |
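Console programs can inspect and switch the OEM code page programmatically, mirroring what the chcp command does interactively; a short sketch with the documented console APIs:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* The console starts out using the OEM code page (e.g. 437 in the US). */
    printf("console output CP: %u\n", GetConsoleOutputCP());

    /* Equivalent of the chcp command: switch the console to CP 850. */
    if (SetConsoleOutputCP(850))
        printf("console output CP now: %u\n", GetConsoleOutputCP());
    return 0;
}
```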
Other Single-Byte Encodings
Windows supports several single-byte encodings derived from international standards beyond its proprietary Windows-125x and OEM families, including adaptations of the ISO/IEC 8859 series, ITU-T recommendations, and KOI8 variants. These encodings facilitate compatibility with global standards for text handling in legacy applications and data exchange, particularly in regions where specific scripts require precise mapping to 8-bit code points.[2] The ISO/IEC 8859 series provides single-byte encodings for various Latin-based scripts, with Windows assigning dedicated code page identifiers for direct support. For instance, code page 28591 corresponds to ISO/IEC 8859-1 (Latin-1), covering Western European languages with characters such as accented letters and symbols for English, French, German, and Spanish. Similarly, code page 28599 maps to ISO/IEC 8859-9 (Latin-5), tailored for Turkish, incorporating letters like Ğ (U+011E) and the dotless ı (U+0131) to support the Turkish alphabet. These Windows mappings adhere closely to the ISO standards but include platform-specific implementations for accent handling and control codes, differing from extensions in Windows' own code pages.[2] ITU-T code pages in Windows address telecommunications and multimedia needs, drawing from standards like ISO/IEC 6937. Code page 20269 implements ISO/IEC 6937, a non-spacing accent encoding for Latin scripts used in early digital telephony and videotex systems, allowing combined diacritics for efficient transmission of accented characters in bandwidth-limited environments. Additionally, code page 20866 supports KOI8-R, a Cyrillic encoding standardized in the 1990s for Russian text, originating from Soviet-era computing but adapted for post-Cold War interoperability in Unix-like systems and early web content.[2] KOI8 variants extend this support to other Cyrillic scripts, enhancing legacy Unix-Windows data interchange. Code page 21866 corresponds to KOI8-U, an extension of KOI8-R for Ukrainian, incorporating characters like Є (U+0404) and І (U+0406) while maintaining compatibility with the Russian base for shared Cyrillic layouts. These encodings remain relevant for processing older files from Eastern European systems, where full Unicode adoption has been gradual.[2] All these single-byte encodings are integrated into Windows through code page APIs, such as MultiByteToWideChar and WideCharToMultiByte, enabling conversion to and from Unicode (UTF-16) for applications requiring backward compatibility. Support persists in modern Windows versions, including mappings for rare ITU-T standards that have seen limited updates since the early 2000s, ensuring reliable handling of international legacy data without requiring custom implementations.[2][25]
Multi-Byte and Specialized Code Page Families
East Asian Multi-Byte Code Pages
East Asian multi-byte code pages in Windows support CJK (Chinese, Japanese, Korean) languages by combining single-byte character sets (SBCS) for ASCII compatibility with double-byte character sets (DBCS) for ideographic characters. These encodings use variable-width representations, where the first 128 code points (0x00–0x7F) encode ASCII characters in a single byte, while extended characters require two bytes: a lead byte typically in the range 0x81–0xFE followed by a trail byte that together represent a hanzi, kanji, or hangul syllable.[3][28] The lead byte signals the start of a multi-byte sequence, allowing parsers to distinguish between single-byte and double-byte characters during text processing. For example, in code page 932, lead bytes occupy ranges such as 0x81–0x9F, enabling encoding of thousands of Japanese characters beyond the 7-bit ASCII limit. Trail bytes vary by code page but generally fall in non-overlapping ranges to avoid ambiguity with ASCII or single-byte extensions. This structure ensures backward compatibility with 8-bit systems while accommodating the vast character sets needed for East Asian scripts.[28] Prominent East Asian multi-byte code pages in Windows include the following (a lead-byte parsing sketch follows the list):
- CP932: A Microsoft variant of the Shift JIS encoding for Japanese, developed in the 1990s to handle JIS X 0208 characters plus extensions; it supports over 6,000 kanji and kana.[3][2]
- CP936: The encoding for Simplified Chinese, initially based on GB 2312 but extended to GBK in Windows to include additional characters from GB 13000.1 for better coverage of modern usage in mainland China and Singapore.[3][2]
- CP949: An extension of EUC-KR based on the KS C 5601 standard for Korean, incorporating unified Hangul syllables and Hanja characters for compatibility with Windows Korean locales.[3][2]
- CP950: An extension of the Big5 encoding for Traditional Chinese, used in Taiwan and Hong Kong, with added characters for regional variants and compatibility.[3][2]
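The lead-byte mechanics above can be exercised with the documented IsDBCSLeadByteEx function, which reports whether a byte begins a two-byte sequence in a given code page. In this illustrative sketch, the bytes 0x83 0x41 encode katakana ア in Shift JIS (CP932):

```c
#include <windows.h>
#include <stdio.h>

/* Walk a CP932 (Shift JIS) byte string, counting characters rather
   than bytes: a lead byte plus its trail byte form one character. */
int count_cp932_chars(const unsigned char *s)
{
    int count = 0;
    while (*s) {
        if (IsDBCSLeadByteEx(932, *s) && s[1] != 0)
            s += 2;   /* lead byte + trail byte = one double-byte char */
        else
            s += 1;   /* ASCII or single-byte extension */
        count++;
    }
    return count;
}

int main(void)
{
    /* "A" followed by katakana ア (0x83 0x41 in Shift JIS). */
    const unsigned char text[] = { 'A', 0x83, 0x41, 0 };
    printf("%d characters\n", count_cp932_chars(text));  /* prints 2 */
    return 0;
}
```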
EBCDIC Code Pages
EBCDIC, or Extended Binary Coded Decimal Interchange Code, is an 8-bit character encoding developed by IBM in the early 1960s as part of its System/360 mainframe architecture to facilitate data interchange on punched cards and tapes.[30][31] This encoding extends earlier BCD-based codes used in IBM systems, assigning 256 possible values to characters while maintaining compatibility with decimal arithmetic through a structured bit layout. Unlike ASCII, which follows a sequential ordering where digits precede letters, EBCDIC places alphabetic characters before numeric digits in its collating sequence, a design choice rooted in legacy punched-card sorting practices.[32][33] Windows provides support for EBCDIC code pages mainly to enable interoperability with IBM mainframe environments, such as z/OS, where EBCDIC remains the native encoding for legacy applications and data stores. These code pages are classified as non-native encodings in Windows and are handled through system APIs rather than as default system locales. Key variants include those tailored for specific languages and regions, using the EBCDIC framework's flexibility for national adaptations. For instance, the high-order "zone" bits (bits 7-4) categorize characters into groups like punctuation, letters, and digits, while the low-order "numeric" bits (bits 3-0) define specific symbols within those groups, allowing variants to remap accented or national characters without altering the core structure.[1][25][34] The following table summarizes prominent Windows EBCDIC code pages, their associated names, and primary uses, drawn from Microsoft's supported code page specifications:
| Code Page | Name | Description | CCSID (IBM Equivalent) |
|---|---|---|---|
| 037 | IBM EBCDIC US-Canada | Standard for English text in North America | 37 |
| 500 | IBM EBCDIC International | Supports Western European languages | 500 |
| 870 | IBM EBCDIC Multilingual Latin 2 | For Central and Eastern European Latin scripts | 870 |
| 1047 | IBM EBCDIC Open Systems Latin 1 | POSIX-compliant variant for Western Latin | 1047 |
Windows applications convert between these EBCDIC code pages and Unicode through functions such as MultiByteToWideChar and WideCharToMultiByte, which handle EBCDIC-to-Unicode translations. This support is particularly relevant in enterprise scenarios involving Host Integration Server, where SNA protocols enable seamless data flow between Windows clients and mainframes. Microsoft maintains over 40 EBCDIC variants in its code page library, ensuring compatibility without native rendering in the Windows UI.[1][36][35]
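As a minimal sketch of such a translation (assuming the CP037 conversion tables are installed, as on default desktop Windows), the bytes 0xC8 0xC5 0xD3 0xD3 0xD6 spell "HELLO" in EBCDIC US-Canada:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* "HELLO" in EBCDIC code page 037: C8 C5 D3 D3 D6. */
    const char ebcdic[] = { '\xC8', '\xC5', '\xD3', '\xD3', '\xD6' };
    wchar_t wide[8];

    /* Translate EBCDIC bytes to UTF-16 using code page 37. */
    int n = MultiByteToWideChar(37, 0, ebcdic, sizeof(ebcdic), wide, 8);
    if (n > 0) {
        wide[n] = 0;
        wprintf(L"%ls\n", wide);  /* prints HELLO */
    }
    return 0;
}
```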
Macintosh Compatibility Code Pages
Windows includes a series of code pages specifically designed for compatibility with classic Mac OS character encodings, facilitating cross-platform text handling for applications and file transfers. The foundational encoding is MacRoman (CP10000), a single-byte character set introduced by Apple in 1984 to support Western European languages on Macintosh computers. MacRoman uses bytes 0-127 for ASCII compatibility but assigns distinct characters to the 128-255 range, incorporating Apple-specific symbols like the dagger (†), the Apple logo, and various fractions, which differ significantly from equivalent ranges in Windows encodings such as CP1252.[37][2] These compatibility code pages extend beyond Roman scripts to cover other Macintosh language systems, mapping them to Windows representations for bidirectional conversion. Key examples include support for East Asian and right-to-left scripts, ensuring that text from Mac OS applications could be processed in Windows without loss of meaning.
| Code Page ID | IANA Name | Description |
|---|---|---|
| 10000 | macintosh | MAC Roman; Western European (Mac) |
| 10001 | x-mac-japanese | Japanese (Mac) |
| 10002 | x-mac-chinesetrad | Traditional Chinese (Big5; Mac) |
| 10003 | x-mac-korean | Korean (Mac) |
| 10004 | x-mac-arabic | Arabic (Mac) |
| 10005 | x-mac-hebrew | Hebrew (Mac) |
| 10006 | x-mac-greek | Greek (Mac) |
| 10007 | x-mac-cyrillic | Cyrillic (Mac) |
| 10008 | x-mac-chinesesimp | Simplified Chinese (GB2312; Mac) |
Conversion between these Macintosh code pages and Unicode relies on MultiByteToWideChar and WideCharToMultiByte, which map characters for accurate round-trip preservation where possible. These mechanisms were essential for pre-Unicode file exchanges, such as sharing documents between Macintosh and Windows systems in mixed environments like publishing or office workflows.[9][38]
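Lossy mappings can be detected at conversion time: WideCharToMultiByte reports through its lpUsedDefaultChar argument whether any character had no equivalent in the target code page. An illustrative sketch against MacRoman (U+2020 DAGGER exists in MacRoman; U+0100 Ā does not):

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* U+2020 DAGGER is in MacRoman; U+0100 Ā is not. */
    const wchar_t *text = L"\u2020\u0100";
    char out[8];
    BOOL lossy = FALSE;

    /* WC_NO_BEST_FIT_CHARS prevents silent "best fit" substitutions,
       so unmappable characters are flagged via lossy instead. */
    int n = WideCharToMultiByte(10000 /* MacRoman */, WC_NO_BEST_FIT_CHARS,
                                text, -1, out, sizeof(out), NULL, &lossy);
    if (n > 0)
        printf("lossy mapping: %s\n", lossy ? "yes" : "no");  /* yes */
    return 0;
}
```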
Developed during the 1990s amid growing interoperability needs between Microsoft Windows and Apple Macintosh platforms, these code pages addressed the challenges of diverse legacy encodings but have seen limited adoption in contemporary applications, overshadowed by Unicode standards.[14]
Unicode-Related Code Pages
UTF-8 and UTF-16 Integration
Windows has utilized UTF-16 as its primary internal encoding for text processing since the introduction of Windows NT in 1993, initially based on UCS-2 and later extended to full UTF-16 support with surrogate pairs to handle Unicode code points beyond the Basic Multilingual Plane (BMP), exceeding 65,536 characters.[39] However, code page 1200 designates a UCS-2-compatible version of UTF-16 in little-endian byte order, limited to the BMP, which is the default for Windows systems in code page conversion APIs, while code page 1201 specifies big-endian UTF-16 with the same BMP limitation.[2] This encoding serves as the native format for Windows API calls involving wide characters (WCHAR), ensuring efficient handling of Unicode data within the operating system kernel and user-mode applications, though code page conversions via 1200/1201 do not fully support surrogates. In contrast, UTF-8 support in Windows, designated as code page 65001, was introduced with Windows XP in 2001 as a variable-width encoding using 1 to 4 bytes per character to represent the full Unicode repertoire.[2] UTF-8 enables compatibility with web standards and cross-platform text files, but its adoption was initially limited due to incomplete system integration. Significant improvements occurred starting with Windows 10 version 1903 (May 2019 Update), which enhanced UTF-8 handling in the console, file I/O, and API functions, making it viable for broader system-wide use without relying solely on UTF-16 conversions.[23] Integration of these encodings into Windows code page mechanisms occurs primarily through string conversion APIs, such as WideCharToMultiByte, which accepts code page 65001 to convert UTF-16 wide strings to UTF-8 byte sequences, and the reciprocal MultiByteToWideChar for the reverse.[38] Developers can specify these code pages explicitly for portability, bypassing locale-dependent ANSI code pages. Additionally, since Windows 10 version 1903, UTF-8 can be set as the active ANSI code page (ACP) either via a beta system locale configuration in Settings > Time & Language > Language > Administrative language settings (enabling "Beta: Use Unicode UTF-8 for worldwide language support," which updates the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage to set ACP to 65001) or through the stable activeCodePage manifest element set to "UTF-8" for individual applications; this was expanded in Windows 11 to also allow "Legacy" or specific code pages via manifests.[23][40] This enables UTF-8 as the default for non-Unicode (A) API variants like CreateFileA. Early implementations of code page 65001 in Windows XP and Server 2003 exhibited limitations and bugs, including inconsistent handling of invalid sequences and partial support for Unicode operations like normalization, which could lead to data corruption in multi-byte contexts.[41] These issues, such as errors in surrogate pair processing and flag support in conversion functions, were progressively addressed through updates, with key fixes for normalization and error detection by Windows Vista in 2006 and further refinements in subsequent service packs.[38] Despite these advancements, UTF-8 remains non-default in most Windows configurations to maintain backward compatibility with legacy single-byte code pages, though Microsoft recommends transitioning to UTF-8 or UTF-16 for new applications to avoid encoding pitfalls.[23]
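A typical UTF-16-to-UTF-8 conversion follows the documented two-pass pattern with CP_UTF8: a sizing call with a NULL buffer, then the actual conversion. WC_ERR_INVALID_CHARS (Windows Vista and later) turns unpaired surrogates into hard errors instead of silent replacements. A minimal sketch:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const wchar_t *text = L"\u00E9\U0001F600"; /* é plus an emoji (surrogate pair) */

    /* Size pass: ask for the required UTF-8 buffer length. */
    int len = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                  text, -1, NULL, 0, NULL, NULL);
    if (len == 0 || len > 16) return 1;

    char utf8[16];
    WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                        text, -1, utf8, len, NULL, NULL);

    for (int i = 0; utf8[i]; i++)
        printf("%02X ", (unsigned char)utf8[i]);  /* C3 A9 F0 9F 98 80 */
    printf("\n");
    return 0;
}
```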
Other Unicode-Derived Code Pages
In addition to the primary UTF-8 and UTF-16 encodings, Windows supports several other code pages derived from or closely related to Unicode standards, primarily for legacy compatibility and specialized applications. Code page 1200 corresponds to UTF-16 in little-endian byte order, encoding the Basic Multilingual Plane (BMP) of ISO/IEC 10646 and available only to managed applications (e.g., .NET).[2] This format evolved from UCS-2, an earlier fixed-width 16-bit encoding limited to the BMP and incapable of representing characters beyond U+FFFF without surrogates; UCS-2 has been deprecated in favor of full UTF-16 since Unicode 2.0, though Windows maintains backward compatibility for applications assuming the legacy behavior.[42] Code page 65000 implements UTF-7, a variable-width encoding designed to be safe for transport over 7-bit channels like email, where ASCII remains unmodified and non-ASCII Unicode characters are encoded using a modified Base64 scheme with escape sequences.[2] Developed as part of early Unicode efforts, UTF-7 prioritizes compatibility with legacy protocols but is less efficient than UTF-8 and rarely used outside specific contexts like IMAP folders. Code page 1201 provides UTF-16 in big-endian byte order, mirroring the structure of CP1200 but with reversed byte serialization for network or cross-platform interchange where big-endian is conventional, and also available only to managed code environments in Windows, such as .NET applications.[2] Windows also supports UTF-32 encodings via code page 12000 (little-endian) and 12001 (big-endian), fixed-width 32-bit formats that directly map Unicode code points without surrogates, available only to managed applications for scenarios requiring explicit 32-bit Unicode handling.[2] The Standard Compression Scheme for Unicode (SCSU), a draft Unicode Technical Standard for reducing storage needs through dynamic windowing of frequent character ranges, sees limited implementation in Windows products like SQL Server for internal Unicode compression, but it lacks a dedicated code page identifier and is not exposed for general file or text handling.[43][44] Experimental encodings like UTF-EBCDIC, proposed to map Unicode onto EBCDIC-compatible structures for mainframe interoperability, remain unimplemented as Windows code pages and are confined to niche research or vendor-specific extensions post-2010.
Usage in Windows
API and System Implementation
Windows code pages are integrated into the operating system through the National Language Support (NLS) subsystem, which provides APIs in kernel32.dll for managing and converting between code pages and Unicode representations.[45] The NLS framework loads locale-specific data, including code page translation tables stored in binary NLS files (e.g., c_1252.nls for Windows-1252), from the %SystemRoot%\System32 directory, enabling dynamic access to character mappings without embedding them in applications.[2] Core APIs for querying active code pages include GetACP, which retrieves the current ANSI code page identifier used for non-Unicode text in the system locale, and GetOEMCP, which returns the OEM code page identifier typically used for console and DOS compatibility operations.[4][6] These functions allow applications to determine the system's default mappings for ANSI and OEM contexts, respectively, ensuring compatibility with legacy single-byte encodings. For validation, the IsValidCodePage API checks whether a specified code page identifier (e.g., 1252 for Western European) is supported by the system, returning a nonzero value if valid.[46] Character conversion between multi-byte code pages and Unicode is handled primarily by MultiByteToWideChar and its counterpart WideCharToMultiByte, both exported from kernel32.dll. MultiByteToWideChar translates a string from a specified code page to UTF-16, with flags such as MB_PRECOMPOSED (the default) directing the function to produce precomposed Unicode characters where possible, avoiding separate base and combining marks.[9] Conversely, WideCharToMultiByte performs the reverse, mapping UTF-16 to a target code page, and supports similar flags to control decomposition behavior for compatibility with legacy applications. These APIs rely on NLS-loaded tables for accurate mappings and are essential for bridging code page-based data with internal Unicode processing. Error handling in these conversion functions addresses invalid byte sequences, which occur when input bytes do not map to valid characters in the source code page. Without the MB_ERR_INVALID_CHARS flag, MultiByteToWideChar substitutes such sequences with a default or replacement character, allowing partial conversions to proceed rather than failing entirely; setting the flag causes the function to return 0 and set GetLastError to ERROR_NO_UNICODE_TRANSLATION upon encountering invalid input.[9] This substitution mechanism prevents crashes in legacy code but can lead to data loss, underscoring the importance of validating inputs via IsValidCodePage beforehand. In modern Windows versions starting from Windows 10, the system prefers UTF-16 as the internal encoding for strings and UI elements, reducing reliance on code pages for core operations while maintaining API support for backward compatibility.[1] Starting with Windows 10, beta system-wide UTF-8 support is available as an alternative to traditional ANSI code pages, configurable via administrative settings, and this feature is supported in Windows 11.[23]
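The strict and lenient behaviors can be observed directly; in this sketch a truncated UTF-8 sequence is rejected under MB_ERR_INVALID_CHARS but silently replaced under the default settings:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* 0xC3 alone is an incomplete UTF-8 sequence. */
    const char bad[] = { '\xC3', 0 };
    wchar_t wide[4];

    /* Strict mode: reject invalid input instead of substituting. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                bad, -1, wide, 4);
    if (n == 0 && GetLastError() == ERROR_NO_UNICODE_TRANSLATION)
        printf("invalid byte sequence rejected\n");

    /* Lenient mode (no flag): the bad byte becomes a replacement char. */
    n = MultiByteToWideChar(CP_UTF8, 0, bad, -1, wide, 4);
    printf("lenient result: %d wide chars, first = U+%04X\n", n, wide[0]);
    return 0;
}
```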
Regional and Language Settings
In Windows, users configure code pages through the Regional and Language Settings in the Settings app (or legacy Control Panel), specifically under Time & Language > Region > Administrative language settings, where the system locale can be changed to select a non-Unicode code page for legacy applications.[23][47] For example, selecting the Russian (Russia) system locale sets the default ANSI code page to Windows-1251 (CP1251), which determines how non-Unicode programs interpret text characters.[47] This configuration affects the active code page used by the system for ANSI operations, with changes requiring a restart to take effect.[48] Locale IDs (LCIDs) in Windows associate specific code pages with languages and regions, enabling the system to retrieve relevant encoding information for internationalization.[49] Each LCID is a 32-bit value combining a primary language identifier (lower 10 bits), sublanguage (next 6 bits), sort ID (next 4 bits), and reserved bits, with the default ANSI code page tied to the locale—for instance, LCID 1049 corresponds to Russian (Russia) and links to CP1251.[50] Applications can query these associations using the GetLocaleInfo function with the LOCALE_IDEFAULTANSICODEPAGE constant to obtain the ANSI code page for a given LCID, supporting runtime locale-specific text handling.[51][52] Multilingual User Interface (MUI) packs and language feature updates install additional locales and their associated code pages during Windows setup or post-installation via the Settings app under Time & Language > Language > Add a language.[53] These packs extend system support for non-default languages, automatically incorporating the corresponding code pages without altering the primary system locale.[54] In recent versions, such as Windows 11, Microsoft has promoted UTF-8 as a configurable system locale option (labeled as a beta feature in Administrative settings), allowing users to set it as the default active code page (CP65001) for broader Unicode compatibility across legacy and modern applications.[23][55] These settings directly impact compatibility for legacy applications that rely on the system locale's ANSI code page rather than Unicode, such as older versions of Notepad or console tools, where mismatched locales can result in garbled text display.[47][56] Enabling UTF-8 as the system locale enhances cross-language support in such apps by standardizing on a Unicode-based encoding, though it may require application-specific adjustments for optimal rendering.[23]
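Programmatically, the locale-to-code-page association can be read with GetLocaleInfoEx and the LOCALE_IDEFAULTANSICODEPAGE constant; combining it with LOCALE_RETURN_NUMBER returns the identifier as a number rather than a string. A minimal sketch:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD cp = 0;

    /* LOCALE_RETURN_NUMBER makes GetLocaleInfoEx write a DWORD
       instead of a string; ru-RU should yield 1251. */
    if (GetLocaleInfoEx(L"ru-RU",
                        LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                        (LPWSTR)&cp, sizeof(cp) / sizeof(wchar_t)))
        printf("default ANSI code page for ru-RU: %lu\n", cp);
    return 0;
}
```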
Limitations and Challenges
Common Encoding Problems
One prevalent issue in handling Windows code pages arises from mojibake, where text becomes garbled due to the assumption of an incorrect code page during decoding. This occurs because different code pages assign distinct byte values to characters, leading to systematic misinterpretation; for instance, ANSI code pages can vary across systems or be altered, resulting in data corruption when a file encoded in one code page is read using another. A common example involves a file saved in Windows-1252 (CP1252), which maps the euro symbol (€) to byte 0x80; when the same byte is interpreted as CP850 (OEM Multilingual Latin I), it maps to Ç (C with cedilla), so the euro sign silently becomes the wrong letter or, in displays that cannot render it, a replacement character. Such mismatches are particularly frequent in legacy applications or cross-system file transfers without explicit encoding metadata. Round-trip conversion problems further complicate text handling, where converting from a code page to Unicode and back fails to preserve the original data due to non-reversible mappings. In multi-byte code pages like CP932 (Shift-JIS for Japanese), multiple byte sequences may map to a single Unicode character, but the reverse conversion cannot unambiguously reconstruct the original bytes, leading to loss of information or altered content. This issue is exacerbated when system and thread code pages differ, as conversion functions like MultiByteToWideChar and WideCharToMultiByte rely on the active code page context, potentially causing corruption during round-trip operations in multithreaded environments. Locale mismatches amplify these risks, especially when files from regions using incompatible code pages are processed on systems configured for different locales. For example, a document encoded in CP1251 (Windows Cyrillic) containing Russian text, such as "Привет" (byte sequence 0xCF 0xF0 0xE8 0xE2 0xE5 0xF2), will appear corrupted—often as Latin gibberish like Ïðèâåò—when viewed using a Latin-based code page like CP1252 on a Western European-configured Windows system. This corruption stems from the fundamental incompatibility between script-specific code pages, where bytes intended for Cyrillic glyphs overlap with Latin control or printable characters in other pages, rendering the text unusable without proper locale-aware handling. Detecting the correct code page poses significant challenges, particularly for legacy files lacking a Byte Order Mark (BOM), which is absent in traditional Windows code page encodings unlike UTF-8 or UTF-16 variants. Without metadata, applications must rely on heuristics or user intervention, often leading to trial-and-error decoding attempts. In console environments, tools like the chcp command allow switching the active code page (e.g., chcp 1252 to set CP1252), but this only affects output display and does not retroactively detect or correct embedded file encodings, complicating troubleshooting in mixed-locale scenarios.
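The CP1251/CP1252 example can be reproduced in a few lines: the same six bytes decode to Cyrillic under the correct code page and to Latin mojibake under the wrong one. A minimal sketch:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* "Привет" encoded in CP1251 (Windows Cyrillic). */
    const char cp1251[] = { '\xCF', '\xF0', '\xE8', '\xE2', '\xE5', '\xF2', 0 };
    wchar_t wide[8];

    /* Decoding with the wrong code page (1252) yields Latin mojibake. */
    MultiByteToWideChar(1252, 0, cp1251, -1, wide, 8);
    wprintf(L"wrong (1252): %ls\n", wide);   /* Ïðèâåò */

    /* Decoding with the correct code page recovers the text. */
    MultiByteToWideChar(1251, 0, cp1251, -1, wide, 8);
    wprintf(L"right (1251): %ls\n", wide);   /* Привет */
    return 0;
}
```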
Deprecation and Migration Strategies
Microsoft recommends that developers prioritize Unicode encodings, such as UTF-8 or UTF-16, for new Windows applications to avoid the limitations of legacy code pages and ensure broad international support.[2] This guidance emphasizes using UTF-8 for its efficiency in handling variable-width characters and compatibility with web standards, while UTF-16 remains suitable for internal string processing in Windows APIs.[57] Since Windows 10 version 1903, Microsoft has provided a system configuration option to set UTF-8 as the default ANSI code page for legacy non-Unicode applications, accessible via the "Use Unicode UTF-8 for worldwide language support" setting in Region > Administrative settings, which sets the ACP value under the registry key HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage to 65001.[23] This feature, initially introduced as beta, enables smoother transitions by interpreting legacy API calls through UTF-8, though it requires a system reboot and may impact performance in some scenarios.[58]
For bulk migration of existing data from code pages to Unicode, several tools facilitate conversion. PowerShell cmdlets, such as Get-Content -Encoding <codepage> to read files in legacy encodings (e.g., Default for ANSI) followed by Set-Content -Encoding utf8, allow scripted batch processing of text files to UTF-8.[59] Windows native APIs like MultiByteToWideChar and WideCharToMultiByte provide programmatic conversion capabilities for developers integrating migration into applications.[1] Additionally, the iconv utility, installable via Git for Windows or similar environments, supports command-line bulk conversions, such as iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt, for handling multiple files from specific code pages.[60]
Best practices for migration include embedding a Byte Order Mark (BOM) in UTF-8 files to assist legacy Windows applications in detecting the encoding, particularly for text files opened in Notepad or Excel.[61] After conversion, validation using APIs like IsTextUnicode ensures data integrity by checking for valid Unicode sequences and flagging potential corruption from mismatched code pages. Enterprises adopting Windows 10 and later have increasingly implemented these strategies during upgrades, often combining automated scripts with testing phases to handle large-scale data shifts in multilingual environments.[62]
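Writing the BOM amounts to emitting the three bytes 0xEF 0xBB 0xBF ahead of the UTF-8 content; a minimal C sketch (the file name is illustrative):

```c
#include <stdio.h>

int main(void)
{
    /* Prepend the UTF-8 byte order mark so legacy Windows tools
       (Notepad, Excel) detect the encoding instead of assuming ANSI. */
    const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
    FILE *f = fopen("out.txt", "wb");
    if (!f) return 1;

    fwrite(bom, 1, sizeof(bom), f);
    fputs("caf\xC3\xA9\n", f);  /* "café" as UTF-8 bytes */
    fclose(f);
    return 0;
}
```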
As of 2025, Microsoft continues to promote Unicode adoption without announcing full deprecation of code pages, maintaining backward compatibility for legacy software while encouraging UTF-8 as the standard for future development.[23]