Windows codepages (and their history)

Windows supports a number of character sets besides Unicode. The character sets are also known as ANSI codepages, even though they are not based on any ANSI standard. This article presents the current and historical versions of Windows codepages starting from 1985.

This article is intended for computing experts who already know what character sets and codepages are. We compare codepages with one another and examine the different versions that have appeared of the same codepage. We start with the Windows ANSI character set, which was never an ANSI standard, and follow its development from 1985 onwards. Along the way we identify differences between documented and actual behavior and point out codepage-related errors in MSDN.

Windows codepages

The first version of Microsoft Windows, released in 1985, came with a single character set, known as the Windows ANSI character set. It was quite different from codepage 437, the character set of DOS. The most notable difference was the line drawing characters of DOS, which are missing in Windows.
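The difference in line drawing characters is easy to observe in any runtime that ships both tables. The following is a minimal sketch in Python, assuming that its bundled `cp437` and `cp1252` codecs are acceptable stand-ins for DOS codepage 437 and the current Windows ANSI set. The original 1985 set is not available as a codec, and the helper name `box_drawing_bytes` is ours.

```python
import unicodedata

def box_drawing_bytes(codec):
    """Return the byte values that a single-byte codec maps to box drawing characters."""
    found = []
    for value in range(0x80, 0x100):
        try:
            char = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            continue  # position is undefined in this codec
        if unicodedata.name(char, "").startswith("BOX DRAWINGS"):
            found.append(value)
    return found

dos_lines = box_drawing_bytes("cp437")
ansi_lines = box_drawing_bytes("cp1252")
print(len(dos_lines), len(ansi_lines))

# The same byte value means different things in the two sets:
# a horizontal line in 437, a letter in 1252.
print(b"\xc4".decode("cp437"), b"\xc4".decode("cp1252"))
```

Text files with DOS frames and tables therefore show accented letters in place of the lines when read as Windows ANSI.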

Since then, more character sets have appeared. Today Windows comes with a number of codepages. Most of them are different from those of DOS, but they serve a similar purpose. The following codepages have been available since the 1990s:

The Far-Eastern codepages 932, 936, 949, 950 and 1361 are double byte character sets (DBCS) while the rest are single byte character sets (SBCS). The rest of this article focuses on the single byte sets.
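The practical meaning of the SBCS/DBCS split can be sketched by counting encoded bytes. The example below assumes that Python's `cp932`, `cp949` and `cp950` codecs correspond to the Windows codepages of the same number. The sample characters and the helper `encoded_length` are our own choices for illustration.

```python
# (codec, sample text) pairs; the codec names are Python's, assumed to
# match the Windows codepages with the same numbers.
samples = [
    ("cp1252", "é"),   # SBCS: every character is one byte
    ("cp932", "A"),    # DBCS: ASCII characters still take one byte
    ("cp932", "あ"),   # Japanese kana takes two bytes
    ("cp949", "한"),   # Korean hangul syllable
    ("cp950", "中"),   # Traditional Chinese ideograph
]

def encoded_length(codec, text):
    """Number of bytes the text occupies in the given codepage."""
    return len(text.encode(codec))

for codec, text in samples:
    print(codec, text, encoded_length(codec, text))
```

A double byte set is thus really a mixed-width encoding: one byte for ASCII, two bytes for the national characters.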

Windows ANSI (Latin 1)

The Windows ANSI character set first appeared in Windows 1.0 in 1985. Despite its name, Windows ANSI was not actually based on any published ANSI standard. The first version of Windows ANSI was identical to ECMA-94 8-Bit Single-Byte Coded Graphic Character Set, which was also published in 1985. The characters of ECMA-94 found their way into ISO 8859-1 and eventually into Unicode.

Windows ANSI went its own way. Newer versions of Windows soon added more characters. In addition to the original "ANSI" character set, Windows started supporting other character sets too. Windows ANSI became known as Windows Latin I, which covered the letters used in the USA and Western Europe. Other codepages were defined to support other regions and languages.

History of Windows ANSI

Windows 1.0: The same characters as in ISO 8859-1, except that × and ÷ are missing.

Windows 2.0: Added the missing × and ÷ and also the single quotes ‘ ’.

Windows 3.0: No changes.

Windows 3.1: Added 22 new characters to the range 82–9F.

Windows 95, NT4: Originally the same character set as in Windows 3.1. Support for €, Ž and ž was added in 1998.

Windows 98: Added €, Ž and ž.
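The additions listed above can be cross-checked against a present-day table of 1252. The sketch below assumes that Python's `cp1252` codec reflects the current version of the codepage. It follows the published mapping table, so positions that the table leaves undefined raise an error instead of decoding. The helper `defined_positions` is a name of our own.

```python
def defined_positions(codec, start, end):
    """Byte values in [start, end] that the codec maps to a character."""
    defined = []
    for value in range(start, end + 1):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            continue  # reserved (unused) position
        defined.append(value)
    return defined

high = defined_positions("cp1252", 0x80, 0x9F)
unused = [v for v in range(0x80, 0xA0) if v not in high]

# 2 single quotes (Windows 2.0) + 22 characters (Windows 3.1) + 3 (1998 update)
expected = 2 + 22 + 3
print(len(high), expected)
print([f"{v:02X}" for v in unused])

# Positions of the three characters added in 1998.
print({ch: f"{ch.encode('cp1252')[0]:02X}" for ch in "€Žž"})
```

If the counts agree, the characters now defined in the range 80–9F are fully accounted for by the three rounds of additions.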

Windows ANSI codepage charts

In the following codepage charts, a gray cell means a reserved (unused) character position. Green cells indicate added characters. The ASCII control character area (00–1F hex) has been left out. The pinkish gray cell 7F is reserved for the DEL control character. Hover the mouse pointer over a cell to display the respective Unicode value.

An updated version of the Windows ANSI character set appeared in Windows 2.0. It added the missing × and ÷ and also single quotes ‘ ’. Some sources refer to this character set as codepage 1004. IBM codepage 1004 is a superset of this set.

Windows 3.1 added 22 new characters to the Windows ANSI character set. According to the Windows 3.1 SDK, this set was "sometimes referred to as codepage 1007".

The first versions of Windows 95 and NT4 used this same codepage as well. The page was no longer known as 1007, but 1252. A probable reason for the renumbering is that alternative codepages became available, which were numbered 125x. This codepage was part of the series.

In 1998, codepage 1252 was updated to include 3 new characters: €, Ž and ž. This is the current version of codepage 1252. The first operating system to use this character set was Windows 98.

Windows 95 and NT4 originally used the same codepage as Windows 3.1. In 1998, an update became available that added the euro (€) to 1252. Apparently, Ž and ž were added at the same time (no definite source found). Before the update was applied, these characters did not display properly on codepage 1252, even when a font included them.

This is the ISO standard Latin-1 character set (ISO 8859-1). The gray area, positions 7F to 9F, is reserved for control characters. The actual control characters, which are rarely used, are not part of ISO 8859-1.
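The relationship between ISO 8859-1 and the current 1252 can be verified position by position. This is a sketch under the assumption that Python's `latin-1` and `cp1252` codecs represent the two sets. Note that Python's `latin-1` maps bytes 80–9F straight to the control codes U+0080–U+009F, which strictly speaking lie outside ISO 8859-1 itself. The helper `decode_or_none` is hypothetical.

```python
import unicodedata

def decode_or_none(value, codec):
    """Decode one byte value, or return None if the position is undefined."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None

# A0..FF: the two sets should agree on every position.
same_upper = all(
    decode_or_none(v, "latin-1") == decode_or_none(v, "cp1252")
    for v in range(0xA0, 0x100)
)

# 80..9F: latin-1 yields only control codes (Unicode category "Cc").
controls_latin1 = all(
    unicodedata.category(decode_or_none(v, "latin-1")) == "Cc"
    for v in range(0x80, 0xA0)
)

print(same_upper, controls_latin1)
```

All differences between the two sets are thus confined to the range 80–9F, where Windows placed its additional graphic characters.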

National codepages (SBCS)

Originally, Windows ANSI was the only available codepage on Windows. By the mid-1990s, a range of national codepages had appeared. Both single byte character sets (SBCS) and double byte character sets (DBCS) for Far-Eastern languages appeared. The following discussion is about the SBCS sets, which cover the Latin, Greek, Cyrillic, Hebrew and Arabic scripts.

The codepages were unstable at first. Several versions existed as missing characters were added. All codepages were updated in 1998, when the euro symbol (€) was added, along with some additional characters. The last update was to 1256 (Arabic) in Windows 2000. Since then the codepages have been completely stable.
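The result of the 1998 euro update can be inspected with a short loop. The sketch assumes that Python's codecs `cp1250` to `cp1257` reflect the post-1998 tables. The function `euro_position` is our own helper, not part of any library.

```python
codepages = ["cp1250", "cp1251", "cp1252", "cp1253",
             "cp1254", "cp1255", "cp1256", "cp1257"]

def euro_position(codec):
    """Return the byte value of the euro sign in a single-byte codec, or None."""
    try:
        encoded = "\u20ac".encode(codec)
    except UnicodeEncodeError:
        return None  # the codec has no euro sign
    return encoded[0] if len(encoded) == 1 else None

for name in codepages:
    position = euro_position(name)
    print(name, "missing" if position is None else f"{position:02X}")

# For comparison: ISO 8859-1 was never updated with the euro.
print("latin-1", euro_position("latin-1"))
```

The euro does not sit at the same position in every page: the Cyrillic page 1251 places it at 88, because 80 was already taken by a letter.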

Appearance of national codepages in non-English Windows

Non-English language versions of Windows supported some codepages before they appeared in the English versions. According to IANA charset registrations made by Microsoft in May 1996, Windows character sets had by then appeared as follows:

Note that the IANA information predates the release of Windows NT4. For some reason, codepage 874 was not registered with IANA, even though it was in use.

Appearance of national codepages in English Windows

Codepage support in the English versions of Windows developed as follows:

1991: Windows 3.1 did not support codepages. One installation supported one character set only. Windows 3.1 came in several language versions, and the different language versions supported different character sets. These sets are the predecessors of codepages (apparently 1250–1256). The Windows ANSI character set, the predecessor of 1252, was among these sets.

1995: Windows 95 came with codepage support. The English version supported 1252 only. Other language versions supported others.

1998: A euro update was released for Windows NT4. It updated pages 1250, 1251, 1252, 1253, 1254, 1255, 1256 and 1257 by adding one or more missing characters. This update was later added to NT4 Service Packs.

1999: Windows 98 (SE, English version) supported the following codepages, identical to the 1998 euro updated versions: 1250, 1251, 1252, 1253, 1254, 1257. The English version did not support codepages 1255, 1256, 1258 or 874.

National codepage charts (SBCS)

The following codepage charts list the development of the Windows single byte character sets (SBCS). Far-Eastern double byte sets have been left out. The charts are based on the actual operation of the English versions of Windows NT4, 98, 2000, XP and 7. Comparison has been made with documented behavior, primarily with Unicode vendor mapping tables and Nadine Kano's Developing International Software for Windows 95 and Windows NT (Microsoft Press, 1995).

The charts focus on visible (graphic) characters. The ASCII control character area (00–1F hex) has been left out on purpose.

Legend. Blueish cells indicate characters different from those of codepage 1252 (current version). A gray cell means a reserved character position. Pinkish gray cells are invisible control characters. Green cells are characters that were added or modified since the previous version of the same codepage. "Original" codepages are as they appeared in 1991.

The 1998 update is a combination of the 1992 and 1995 documented versions, with the euro symbol added.

Position CA (hex) differs between documentation and implementation. According to Microsoft documentation (MSDN 2012 and cp1255 to Unicode table v2.0, 04/15/98), position CA (hex) is reserved. Windows, however, has U+05BA HEBREW POINT HOLAM HASER FOR VAV at this position. Windows 98 SE (Hebrew version) also had it, even though its fonts did not have a glyph for it. This is a rare character that was added to Unicode only in version 5.0, in 2006. The character made its way into the Windows implementations, but not into their documentation.
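Third-party implementations inherit whichever side of this discrepancy they were built from. As a sketch, the following Python snippet reports what the bundled `cp1255` codec does with position CA and its neighbors. Python generates that codec from the Unicode vendor mapping table, so we assume it shows the documented behavior and not the actual Windows behavior. The result may vary between implementations and is therefore not stated here. The helper `describe` is our own.

```python
import unicodedata

def describe(value, codec):
    """Describe what a single-byte codec maps a byte value to."""
    try:
        char = bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return "undefined"  # reserved position in this codec's table
    return f"U+{ord(char):04X} {unicodedata.name(char, '<unnamed>')}"

for value in (0xC9, 0xCA, 0xCB):
    print(f"{value:02X}", describe(value, "cp1255"))
```

An "undefined" result for CA means the codec follows the documentation. A result of U+05BA means it follows the Windows implementation.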