Code page 850
Code page 850 is an 8-bit single-byte character set (SBCS) encoding, also known as IBM850, cp850, or OEM Multilingual Latin 1, designed for use in MS-DOS and compatible operating systems to support Western European languages.[1][2] Introduced in 1987 with MS-DOS 3.3, it encompasses 256 characters: the 128 ASCII characters plus accented Latin letters, currency symbols, and mathematical operators chosen for multilingual text in DOS environments.[3][4] Its repertoire covers the printable characters of ISO/IEC 8859-1, though at different byte positions, and it departs from the earlier code page 437 by replacing many box-drawing characters, Greek letters, and icons with additional Western European alphabetic characters to better accommodate accented text.[5] It primarily supports Danish, Dutch, English, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish, as well as Latin American Spanish and Canadian French.[6][3] Historically, code page 850 became the default OEM code page for Western European DOS systems because of its broad compatibility with PC hardware and software from the late 1980s onward, and it remains influential in legacy Windows console applications.[7] A euro variant, code page 858, emerged in 1998 with PC DOS 2000; it changes position 0xD5 from the dotless i (ı) to the euro sign (€).[3] Today, code page 850 persists in certain IBM database collations and legacy system configurations, such as DB2 with SYSTEM_850 territory settings.[8]
Overview
Definition and Purpose
Code page 850, also known as CCSID 850 or IBM 00850, is an 8-bit character encoding standard developed by IBM for DOS operating systems and early PC environments. It extends the 7-bit US-ASCII set, which covers codes 0x00–0x7F, to a full 256-character repertoire by adding symbols and letters with diacritics in the upper range (0x80–0xFF), chosen for the Latin-based scripts of Western European languages.[3][1][4]
The primary purpose of Code page 850 is to support the display, input, and processing of accented characters (such as Ç, Ñ, and ü) in text-based applications and command-line interfaces, addressing the limitations of ASCII for non-English Western European text. It replaced many of the graphics-oriented symbols of predecessor code pages, such as Code page 437, with characters needed for practical multilingual support in everyday computing.[3][4]
A key feature of Code page 850 is that its upper half includes all printable characters from ISO/IEC 8859-1 (Latin-1), albeit in a rearranged order and alongside supplementary symbols such as box-drawing elements, which makes it well suited to multilingual Western European text. It was designated as the default OEM code page in several Western European and English-speaking locales outside the US, such as the United Kingdom and Ireland, whereas the US default remained Code page 437.[3][1][4]
Historical Development
Code page 850 was developed by IBM in the mid-1980s as part of the original equipment manufacturer (OEM) code page family designed for use with MS-DOS and PC-DOS operating systems. It first appeared in the IBM registry in 1986, marking its initial formal documentation within IBM's technical framework. This encoding emerged as an extension of the earlier code page 437, which had been optimized for the U.S. market with a focus on box-drawing graphics and symbols at the expense of support for accented Latin characters needed in international contexts.[3]
The primary motivation for code page 850 was to meet the demands of the expanding European market, where the limitations of code page 437 hindered proper representation of Western European languages. By reassigning characters in the upper range (0x80–0xFF) to prioritize Latin-1 accented letters over some graphics and symbols, IBM aimed to provide broader multilingual support without requiring a complete overhaul of existing PC hardware and software ecosystems. This shift reflected IBM's efforts to standardize character encodings for global adoption in personal computing.[3]
By 1987, code page 850 was fully integrated into IBM PC standards and released alongside PC-DOS 3.3 and MS-DOS 3.3, receiving its official Coded Character Set Identifier (CCSID) assignment as 850. Its design was influenced by the contemporaneous development of the ISO/IEC 8859-1 standard, finalized in 1987: its repertoire incorporates the Latin-1 printable characters while retaining some DOS-specific elements. It was intended to support 11 Western European languages (Danish, Dutch, English, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish), with additional support for variants such as Latin American Spanish, and it became the default for many locales in Western Europe and Latin America.[9][3][10]
Technical Specifications
Character Encoding Details
Code page 850 is an 8-bit character encoding scheme that supports 256 distinct characters, with byte values ranging from 0x00 to 0xFF. The lower half, from 0x00 to 0x7F, is identical to the US-ASCII standard, encompassing basic control characters, printable ASCII symbols, and the delete character at 0x7F.[4] The upper half, from 0x80 to 0xFF, extends this base with 128 additional characters drawn primarily from Latin-1, focusing on accented letters and symbols for Western European languages, at positions arranged for compatibility with DOS text displays.[4]
Unlike code page 437, which allocates many upper-half positions to box-drawing graphics, Greek letters, and mathematical symbols, code page 850 gives priority to additional Latin letters with diacritics in order to better support multilingual text in Latin-based scripts.[4] The upper range contains the printable repertoire of ISO/IEC 8859-1 in a rearranged order, plus a reduced set of box-drawing and shading elements retained in select positions (e.g., 0xB0 to 0xB3 for fill and line characters) to maintain legacy DOS interface functionality.[4] All positions in 0x80 to 0xFF are assigned printable characters in standard implementations, with no control characters or undefined slots in this range.[4]
Representative assignments in the upper half illustrate its emphasis on Western European orthography.
For instance, 0x80 maps to Ç (C with cedilla), 0x81 to ü (u with diaeresis), 0x82 to é (e with acute), and 0x83 to â (a with circumflex); other positions supply further umlauts (e.g., 0x84 = ä) and tildes (e.g., 0xA4 = ñ).[4] Further examples include 0x90 = É (E with acute), 0x98 = ÿ (y with diaeresis), 0xA0 = á (a with acute), 0xA1 = í (i with acute), 0xD0 = ð (eth), and 0xFF = non-breaking space.[4] These assignments enable representation of accented vowels, currency symbols (e.g., 0x9C = £ for pound sterling), and a limited set of typographic marks, but offer no encoding for non-Latin scripts such as the Cyrillic or Greek alphabets.[4]
Mapping to Unicode
Code page 850 provides a direct mapping to Unicode code points for all 256 positions, enabling straightforward conversion of legacy data to modern encodings like UTF-8 or UTF-16. The standard mapping is defined in the Unicode Consortium's character mapping tables, where bytes 0x00 through 0x7F correspond to the ASCII subset of Unicode (U+0000 to U+007F), including control characters such as 0x00 mapping to U+0000 (NULL) and 0x7F to U+007F (DELETE). For the extended range (0x80 to 0xFF), characters primarily map to the Latin-1 Supplement block (U+0080 to U+00FF), supporting Western European accented letters and symbols, while some positions map to box-drawing elements from the Box Drawing block (U+2500 to U+257F) and shading characters from the Block Elements block (U+2580 to U+259F).[11]
This mapping gives a 1:1 correspondence for all positions with no undefined slots, although display contexts may treat control or non-printable characters, such as the delete code at 0x7F (U+007F), as non-rendering. For instance, the byte 0x80 maps to U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA, Ç), 0x82 to U+00E9 (LATIN SMALL LETTER E WITH ACUTE, é), and 0x99 to U+00D6 (LATIN CAPITAL LETTER O WITH DIAERESIS, Ö), all using precomposed code points rather than combining sequences. In strict conversions, mismatches in application-specific implementations could produce the replacement character U+FFFD, but the official table avoids this by assigning a defined code point to every byte. IBM's code page documentation confirms the glyph assignments for printable characters in this scheme, supporting reliable translation to Unicode for data preservation.[11][12]
The following table illustrates representative mappings from the extended range, highlighting accented letters, symbols, and graphics:

| CP850 Byte (Hex) | Unicode Code Point | Description |
|---|---|---|
| 0x80 | U+00C7 | LATIN CAPITAL LETTER C WITH CEDILLA (Ç) |
| 0x84 | U+00E4 | LATIN SMALL LETTER A WITH DIAERESIS (ä) |
| 0x9C | U+00A3 | POUND SIGN (£) |
| 0xA4 | U+00F1 | LATIN SMALL LETTER N WITH TILDE (ñ) |
| 0xB0 | U+2591 | LIGHT SHADE |
| 0xB3 | U+2502 | BOX DRAWINGS LIGHT VERTICAL (│) |
| 0xC7 | U+00C3 | LATIN CAPITAL LETTER A WITH TILDE (Ã) |
| 0xD0 | U+00F0 | LATIN SMALL LETTER ETH (ð) |
| 0xFF | U+00A0 | NO-BREAK SPACE |
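
The mappings in the table above can be checked programmatically. The following is a minimal sketch, assuming that Python's built-in `cp850` codec follows the same mapping table; the helper name `describe_cp850` and the sample byte string are illustrative and do not come from any cited source.

```python
import unicodedata


def describe_cp850(byte_values):
    """Return (byte, code point, Unicode name) rows for the given CP850 bytes."""
    rows = []
    for b in byte_values:
        ch = bytes([b]).decode("cp850")  # one byte always yields one character
        rows.append(
            (f"0x{b:02X}", f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        )
    return rows


# The byte values listed in the table above.
sample = [0x80, 0x84, 0x9C, 0xA4, 0xB0, 0xB3, 0xC7, 0xD0, 0xFF]
for row in describe_cp850(sample):
    print(*row)

# Converting legacy CP850 data to UTF-8 is a decode followed by an encode.
legacy = b"Fran\x87ais, Espa\xa4ol, \x9c5"  # hypothetical legacy bytes
text = legacy.decode("cp850")               # 'Français, Español, £5'
utf8 = text.encode("utf-8")
```

Because every byte has an assigned code point, the decode step in this sketch cannot fail for CP850 input, and encoding the resulting text back to `cp850` reproduces the original bytes.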
Variants and Comparisons
Code Page 858
Code page 858, also known as CCSID 858, is a single-byte character encoding introduced by IBM in 1998 as a euro-enabled variant of code page 850.[13] It is designed for use in DOS and OS/2 environments supporting Western European languages, and its structure mirrors code page 850 except for a single modification to accommodate the euro currency symbol.[14] The encoding was registered to ease the transition to the euro, which was officially introduced in 1999, by reassigning the byte value 0xD5 from the dotless i (U+0131) to the euro sign (U+20AC).[13]
The euro modification was introduced in PC DOS 2000 and OS/2. PC DOS 2000, released the same year, applied the change under the existing identifier 850 rather than 858, in order to preserve compatibility with existing Latin-1 based text.[3] IBM officially termed the encoding "Multilingual Latin-1" or a "modified code page 850," reflecting its role as an extension for multilingual support in PC environments.[15] Its MIME name is IBM00858, and it is also aliased as PC-Multilingual-850+euro, emphasizing its focus on Western European character sets with the added currency symbol.[15]
The character table for code page 858 is identical to that of code page 850 apart from position 0xD5, so legacy text that does not use the dotless i remains readable without alteration.[13] This minimal change allowed for backward compatibility in applications and systems already using code page 850, minimizing disruption during the euro rollout. IBM's documentation, such as the CCSID mappings in the CP00858 specifications, provides the detailed byte-to-Unicode correspondences, confirming the encoding's standardization following the 1998 euro preparations.[14]
Comparison with Code Page 437
Code page 437, introduced in 1981 as the original character set for the IBM PC, allocates the extended range from 0x80 to 0xFF primarily to box-drawing characters, block elements, mathematical symbols, and icons to facilitate text-based user interfaces and graphics in CP/M and early DOS environments.[16] In contrast, Code page 850 reassigns a significant portion of this range (approximately 50 of the 128 positions) to Latin characters with diacritics, prioritizing support for accented letters used in Western European languages to improve text readability and compatibility with multilingual applications.[11][17]
These reassignments highlight key divergences in character priorities. For instance, the byte 0xB5 in Code page 437 maps to a box-drawing character (╡, U+2561), suitable for constructing graphical borders in console applications, whereas in Code page 850 it maps to the capital letter A with acute accent (Á, U+00C1), essential for languages like Spanish and Portuguese.[17][11] Similarly, 0xC6 in Code page 437 is another box-drawing element (╞, U+255E), but in Code page 850 it becomes the lowercase a with tilde (ã, U+00E3), needed for Portuguese.[17][11] These changes replace graphical and symbolic elements with precomposed accented characters, reducing the availability of visual aids but expanding textual expressiveness.
The design philosophy of Code page 437 emphasized optimization for the US-English market and graphical rendering in software such as WordPerfect, where block elements and symbols enabled simple user interface construction on limited hardware.[3] Code page 850, however, shifted focus toward international text handling, aligning more closely with the ISO 8859-1 standard by incorporating additional diacritics for Western European localization while retaining core ASCII compatibility and some essential graphics.[3][8]
Both code pages maintain identical assignments for the standard ASCII range (0x00–0x7F), ensuring basic compatibility, but Code page 850's targeted modifications in the extended range enable more effective support for European languages without necessitating hardware font replacements or full system overhauls.[11][17] This evolution reflects the growing demand for localized computing in the mid-1980s, as DOS expanded beyond North America.[3]

| Code Point | Code Page 437 (Character, Unicode) | Code Page 850 (Character, Unicode) |
|---|---|---|
| 0xB5 | ╡ (U+2561) | Á (U+00C1) |
| 0xC6 | ╞ (U+255E) | ã (U+00E3) |
| 0xD2 | ╥ (U+2565) | Ê (U+00CA) |
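
The differences described in this section, both between code pages 437 and 850 and between code pages 850 and 858, can be enumerated by decoding every byte value under two codecs and keeping the positions that disagree. This is a minimal sketch, assuming Python's built-in `cp437`, `cp850`, and `cp858` codecs match the mappings discussed here; `codec_differences` is a hypothetical helper name.

```python
def codec_differences(codec_a, codec_b):
    """Map each byte that decodes differently to its (char_a, char_b) pair."""
    diffs = {}
    for b in range(256):
        raw = bytes([b])
        char_a = raw.decode(codec_a)
        char_b = raw.decode(codec_b)
        if char_a != char_b:
            diffs[b] = (char_a, char_b)
    return diffs


# Positions where code page 437 and code page 850 disagree.
diff_437_850 = codec_differences("cp437", "cp850")
for b in (0xB5, 0xC6, 0xD2):  # the rows shown in the table above
    old, new = diff_437_850[b]
    print(f"0x{b:02X}: CP437 U+{ord(old):04X} {old}  CP850 U+{ord(new):04X} {new}")

# Positions where code page 850 and its euro variant 858 disagree.
diff_850_858 = codec_differences("cp850", "cp858")
print({f"0x{b:02X}": pair for b, pair in diff_850_858.items()})
```

Comparing whole tables rather than hand-picking bytes makes it easy to confirm that the ASCII range is untouched and that the euro variant changes a single position.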
Usage and Compatibility
Adoption in DOS and Early PCs
Code page 850 was introduced as a standard character encoding in MS-DOS 3.3 and IBM PC-DOS 3.3, both released in 1987, marking a significant expansion of internationalization support in these operating systems. It became the default code page for Western European locales, enabling proper handling of accented characters in file names, console output, and text processing. For instance, in the United Kingdom version of MS-DOS, code page 850 served as the primary encoding to accommodate Latin-1 characters beyond the limitations of the earlier code page 437. This adoption facilitated broader software compatibility across Europe, where previous encodings struggled with diacritics common in languages like French and German.[3][18]
Hardware integration of code page 850 was tightly coupled with the display capabilities of early PCs, particularly through the Enhanced Graphics Adapter (EGA) and Video Graphics Array (VGA) standards. Characters were rendered using raster fonts (8×14 pixels on EGA and 8×16 on VGA) loaded into the adapter from code page information (CPI) files, allowing for clear depiction of extended Latin glyphs in 80-column text modes. The encoding was used by core system components, including the command prompt interface, and by popular business applications tailored for European markets, such as the dBASE database software and the Lotus 1-2-3 spreadsheet program. These integrations allowed users to input and display multinational text on hardware like the IBM PC/AT and compatible systems from the late 1980s.[3][19]
Country-specific versions of MS-DOS and PC-DOS supported code page 850 for a range of Western European languages, including Danish, Dutch, English, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish, through localized distributions that activated the appropriate keyboard layouts and character mappings.
Users could switch to code page 850 at runtime with a command of the form MODE CON CP PREPARE=((850) C:\DOS\EGA.CPI), which prepared the console for the encoding by loading glyphs from the named code page information (CPI) file (the path shown is only an example), followed by CHCP 850 to make it the active code page. By 1990, with the release of Windows 3.0 and its international editions, code page 850 had become the standard OEM code page for font fallback in command-line environments, bridging DOS legacy with early graphical interfaces.[3][6]
Compatibility Issues and Solutions
One primary compatibility challenge with Code page 850 arises from its use of the extended ASCII range (0x80–0xFF) for Western European characters with diacritics. This can lead to mojibake (garbled text) when files are transferred to systems configured for 7-bit ASCII only, where high-bit characters are often stripped, replaced with question marks, or rendered as undefined symbols.[2] For instance, the byte 0x80, representing Ç (U+00C7) in Code page 850, may appear as garbage or a control character on strict ASCII systems lacking 8-bit support.[20] Similar interoperability issues occur when Code page 850 files are viewed on machines defaulting to Code page 437, the original IBM PC encoding, because the two differ in parts of the extended range; for example, the byte 0xB5 encodes Á (U+00C1) in Code page 850 but a box-drawing character (╡, U+2561) in Code page 437, so accented text is displayed as line-drawing fragments.[21]
To address these display and interpretation mismatches, DOS provided the CHCP command, introduced in MS-DOS 3.3 and its PC DOS equivalent, which lets users switch the active console code page during a session; for example, entering CHCP 850 selects Code page 850 without rebooting, ensuring correct rendering of multilingual content on compatible hardware.[22] Hardware-level solutions involved loading appropriate fonts through EGA or VGA drivers using code page information (CPI) files, such as EGA.CPI, which contained glyph bitmaps for the extended characters; this was configured via the DISPLAY.SYS device driver in CONFIG.SYS to preload the correct font set at boot, supporting text modes such as 80×25.[23] For file transfers across systems, utilities like GNU recode enabled batch conversion between code pages, transliterating characters where direct mappings were unavailable to prevent data loss during migrations or network shares.
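
The mismatches and the conversion approach described above can be reproduced in a few lines. This is a minimal sketch, assuming Python's built-in `cp850` and `cp437` codecs; `reinterpret` and `convert_bytes` are hypothetical helper names, and the sample strings are invented for illustration.

```python
def reinterpret(text, written_as, read_as):
    """Simulate mojibake: encode with one code page, decode with another."""
    return text.encode(written_as).decode(read_as)


def convert_bytes(data, source="cp850", target="utf-8"):
    """Transcode raw bytes from one encoding to another via Unicode."""
    return data.decode(source).encode(target)


# CP850 text viewed on a CP437 console: accented letters become box drawing.
garbled = reinterpret("São Paulo", "cp850", "cp437")
print(garbled)

# CP850 bytes handed to a strict 7-bit ASCII reader: high bytes are rejected
# and, with errors="replace", shown as the replacement character U+FFFD.
ascii_view = "Ça va".encode("cp850").decode("ascii", errors="replace")
print(ascii_view)

# Converting once to UTF-8 avoids both problems on modern systems.
utf8_bytes = convert_bytes(b"\x80a va")
print(utf8_bytes.decode("utf-8"))
```

The sketch transcodes in memory only; applying the same decode and encode steps to file contents is a straightforward extension, with the caveat that the source code page must be known in advance because the bytes themselves do not identify it.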
Early network environments, such as Novell NetWare, exhibited inconsistent Code page 850 support, where client-server mismatches could corrupt filenames or shared documents if the server's code page (often defaulting to 437) differed from the client's locale-specific settings, leading to failed authentications or unreadable volumes. This was mitigated starting with DOS 5.0 in 1991 through enhanced locale-specific boot configurations via the COUNTRY command in CONFIG.SYS, which specified a country code (e.g., 049 for Germany) and associated code page (e.g., 850) to automatically load international drivers and set the default encoding at startup, improving cross-system consistency.[24]
Backward compatibility with legacy applications was preserved by allowing fallback to Code page 437 as the universal base, where shared ASCII characters (0x00–0x7F) remained identical, but full support for Code page 850's diacritics required application-level awareness, such as explicit code page queries via DOS interrupts (e.g., INT 21h/AH=66h) to detect and adapt to the active encoding.