Text, Encodings, and .NET

Almost every .NET method that deals with strings, text files, XML files, et cetera has an overload that allows you to specify a Text Encoding. For a while I purposefully ignored those, but recently I got bitten once again by an encoding problem, so I decided it was time to get to the bottom of this.

Unicode vs. ANSI

One of the biggest changes from 16-bit to 32-bit Windows was the introduction of Unicode as a way to represent characters and strings. In 16-bit Windows, a character was simply a byte (i.e. it consisted of 8 bits). This was called the ANSI character set. It had its problems, though: the lower half (the lower 7 bits) was well standardized, but accented characters in the upper half (128 and up) were not always supported in all applications in the same way, causing corruption of text when transferred between applications. In short, Unicode was designed to solve all these problems by simply creating a very, very large alphabet: over a million possible code points, each of which fits comfortably in 32 bits. Thus all languages on Earth have a section of the Unicode alphabet, so it contains all kinds of Asian characters, but also Hebrew, Arabic, et cetera.

Unicode support was added to languages (the 'wide' character in C++, for instance) and the Windows system DLLs started to show functions in two varieties: those ending in 'A' (for 'ANSI') and those ending in 'W' (for 'wide'). Visual Basic 5.0 made a complete switch: all strings were represented in Unicode in memory, but when writing to disc they were converted to ANSI. (A first indication of a possible text encoding problem!)

The .NET Framework goes a step further: strings are always represented in Unicode internally, but when writing to file you have the option to specify which text encoding to use.

Unicode Is Not Unicode

Life is never that simple, though. 32 bits per character sounds great, but it requires a whopping four times the storage. That's why the .NET Framework (like most Unicode systems) uses 16 bits per character internally (UTF-16, where the rare characters outside the 16-bit range take two 16-bit units, a so-called surrogate pair). Even that's pretty inefficient when writing to a text file - every file doubles in size when converted to Unicode! Moreover, most characters in a Western language text are actually in the ANSI character set, so almost half of the storage of a 16-bit Unicode file is left unused anyway. This problem was very neatly solved by UTF-8, which represents all characters below 128 as a single byte, and those above as a sequence of 2 to 4 bytes. This allows the full range of Unicode characters to appear in a text file, while only increasing the length of the file marginally.
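To make the size trade-off concrete, here is a small sketch that counts the bytes a few characters take in each encoding. The class and method names (EncodedSizeDemo, SizeIn) and the sample characters are my own picks for illustration; only the System.Text calls are real API.

```csharp
using System;
using System.Text;

static class EncodedSizeDemo
{
    // Number of bytes the text occupies in the given encoding (BOM not included).
    public static int SizeIn(Encoding encoding, string text)
    {
        return encoding.GetByteCount(text);
    }

    static void Main()
    {
        // "A" (U+0041), "é" (U+00E9), "€" (U+20AC) and the musical G clef (U+1D11E),
        // which lies outside the 16-bit range and needs a surrogate pair in UTF-16.
        string[] samples = { "A", "\u00E9", "\u20AC", "\U0001D11E" };

        foreach (string s in samples)
        {
            Console.WriteLine("U+{0:X4}: UTF-8 = {1}, UTF-16 = {2}, UTF-32 = {3} bytes",
                char.ConvertToUtf32(s, 0),
                SizeIn(Encoding.UTF8, s),
                SizeIn(Encoding.Unicode, s),
                SizeIn(Encoding.UTF32, s));
        }
        // Byte counts per character (UTF-8 / UTF-16 / UTF-32):
        //   U+0041:  1 / 2 / 4
        //   U+00E9:  2 / 2 / 4
        //   U+20AC:  3 / 2 / 4
        //   U+1D11E: 4 / 4 / 4
    }
}
```

For plain Western text, dominated by characters below 128, UTF-8 stays close to one byte per character, which is the whole point.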

So now we already have three flavors of Unicode: 32-bit, 16-bit and 8-bit! It gets worse: because of the different way processors store integers in memory ('Little Endian' vs. 'Big Endian') there are two subflavors of the 16-bit and 32-bit types. This brings the number of flavors to 5. There is another one, called UTF-7, but that's rarely used.
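The endian subflavors are easy to see by dumping the bytes of a single character. This is only a sketch; ByteOrderDemo and Hex are names I made up for it.

```csharp
using System;
using System.Text;

static class ByteOrderDemo
{
    // Hex dump of the text's bytes in the given encoding, e.g. "41-00".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        // The letter 'A' is code point U+0041 in every flavor; only the byte layout differs.
        Console.WriteLine(Hex(Encoding.Unicode, "A"));                    // 41-00        (UTF-16 little endian)
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, "A"));           // 00-41        (UTF-16 big endian)
        Console.WriteLine(Hex(Encoding.UTF32, "A"));                      // 41-00-00-00  (UTF-32 little endian)
        Console.WriteLine(Hex(new UTF32Encoding(true, false), "A"));      // 00-00-00-41  (UTF-32 big endian)
        Console.WriteLine(Hex(Encoding.UTF8, "A"));                       // 41           (no byte order to worry about)
    }
}
```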

UTF-8 vs. ANSI

The UTF-8 encoding is actually quite brilliant. If a file contains no characters over 127 (i.e. the 8th bit is never used), UTF-8 is completely compatible with ANSI - the two can be used interchangeably. (Except when a Byte Order Mark is used - read on!) This makes UTF-8 the encoding of choice in virtually all cases where text needs to be stored.
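A quick way to convince yourself of this compatibility is to compare the raw bytes. In this sketch I use ISO-8859-1 (code page 28591) as a stand-in for 'ANSI', which is an assumption on my part: it is available on every .NET runtime, whereas code page 1252 may need an extra encoding provider on newer ones. For the characters used here the two agree. AsciiCompatDemo and SameBytes are illustrative names.

```csharp
using System;
using System.Text;

static class AsciiCompatDemo
{
    // Stand-in for the ANSI code page (assumption: ISO-8859-1 instead of 1252).
    static readonly Encoding Ansi = Encoding.GetEncoding(28591);

    // True when the text has exactly the same bytes in UTF-8 and in the ANSI stand-in.
    public static bool SameBytes(string text)
    {
        string utf8 = BitConverter.ToString(Encoding.UTF8.GetBytes(text));
        string ansi = BitConverter.ToString(Ansi.GetBytes(text));
        return utf8 == ansi;
    }

    static void Main()
    {
        Console.WriteLine(SameBytes("Hello, world"));  // True:  only characters below 128
        Console.WriteLine(SameBytes("caf\u00E9"));     // False: 'é' is E9 in ANSI, C3-A9 in UTF-8
    }
}
```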

Byte Order Marks (BOMs)

So Unicode was designed to standardize text representations. But with all these flavors, confusion is all too easy. I know I'm supposed to read a Unicode text file - but is it 16-bit or UTF-8? And if 16-bit, is it little endian or big endian? Fortunately Byte Order Marks (or Preambles) were introduced to distinguish the various Unicode styles. These consist of a fixed sequence of bytes at the start of the file, marking it as, say, Unicode 32-bit Big Endian. A byte order mark must not be legal text in any other encoding, otherwise some completely ordinary text may be misinterpreted because it happens to start with a special sequence. That's why there is no BOM for ANSI.

The following table lists the byte order marks supported by .NET:

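| Codepage | System.Text... | Name | BOM (hex) | Description |
|----------|----------------|------|-----------|-------------|
| 1252 | Encoding.Default | Western European (Windows) | (None) | 'ANSI' |
| 65001 | Encoding.UTF8 | Unicode (UTF-8) | EF BB BF | Normal UTF-8 |
| 1200 | Encoding.Unicode | Unicode | FF FE | UTF-16 |
| 12000 | Encoding.UTF32 | Unicode (UTF-32) | FF FE 00 00 | UTF-32 |
| 1201 | Encoding.BigEndianUnicode | Unicode (Big-Endian) | FE FF | UTF-16 Big-Endian |
| 12001 | (none; use new UTF32Encoding(true, true)) | Unicode (UTF-32 Big-Endian) | 00 00 FE FF | UTF-32 Big-Endian |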
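You don't have to take the table on faith: every Encoding object can report its own preamble. A minimal sketch (PreambleDemo and Bom are names I chose; GetPreamble is the real method):

```csharp
using System;
using System.Text;

static class PreambleDemo
{
    // The encoding's byte order mark as space-separated hex, or "(none)".
    public static string Bom(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0
            ? "(none)"
            : BitConverter.ToString(preamble).Replace('-', ' ');
    }

    static void Main()
    {
        Console.WriteLine("UTF-8:     " + Bom(Encoding.UTF8));                    // EF BB BF
        Console.WriteLine("UTF-16 LE: " + Bom(Encoding.Unicode));                 // FF FE
        Console.WriteLine("UTF-16 BE: " + Bom(Encoding.BigEndianUnicode));        // FE FF
        Console.WriteLine("UTF-32 LE: " + Bom(Encoding.UTF32));                   // FF FE 00 00
        Console.WriteLine("UTF-32 BE: " + Bom(new UTF32Encoding(true, true)));    // 00 00 FE FF
        Console.WriteLine("ASCII:     " + Bom(Encoding.ASCII));                   // (none)
    }
}
```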

Questions, questions

However, while Byte Order Marks are useful, they're also optional. That leaves two questions:

1. What happens when there is no byte order mark in the file? In other words: if a file does not contain a byte order mark, how should it be interpreted? More specifically, how does .NET treat it?

2. What happens if the byte order mark is not the one we expect? Suppose we tell .NET to read the contents of a file in Unicode 16-bit mode, and the file actually contains a BOM for Unicode 32-bit?

These questions are not hard to answer - time for an experiment!

The test

I wrote a short program to write out a string containing some accented characters to a series of text files, using the ANSI-encoding and the flavors of Unicode that have a BOM. The text was also written once without any encoding. Using the same series of encodings, the file was read back and the text compared.

An inspection of the text files written revealed that the files written without encoding or with the ANSI encoding did not contain a byte order mark; all the others did. For the ANSI file this was to be expected, since ANSI does not have a BOM. But the file written without a specific encoding turned out to use UTF-8, only without the BOM!
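This observation is easy to reproduce by looking at the raw bytes on disk. The sketch below assumes the files are written with File.WriteAllText (the text above doesn't say which API the test program used); DefaultWriteDemo and WrittenBytes are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class DefaultWriteDemo
{
    // Writes the text to a temp file and returns the file's raw bytes as hex.
    // Passing null means "do not specify an encoding".
    public static string WrittenBytes(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            if (encoding == null)
                File.WriteAllText(path, text);
            else
                File.WriteAllText(path, text, encoding);

            return BitConverter.ToString(File.ReadAllBytes(path));
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        // 'é' without an encoding: UTF-8 bytes, no BOM in front.
        Console.WriteLine(WrittenBytes("\u00E9", null));           // C3-A9
        // 'é' with Encoding.UTF8: the same bytes, preceded by the BOM.
        Console.WriteLine(WrittenBytes("\u00E9", Encoding.UTF8));  // EF-BB-BF-C3-A9
    }
}
```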

This is the result of reading back the files. 'OK' means the text was read back unchanged; 'ERROR' indicates it wasn't.

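Rows show the encoding used to write the file, columns the encoding used to read it back.

| Write encoding \ Read encoding | None | Default | Unicode | Unicode (Big-Endian) | Unicode (UTF-32) | Unicode (UTF-32 Big-Endian) | Unicode (UTF-8) |
|---|---|---|---|---|---|---|---|
| None | OK | ERROR | ERROR | ERROR | ERROR | ERROR | OK |
| Default | ERROR | OK | ERROR | ERROR | ERROR | ERROR | ERROR |
| Unicode | OK | OK | OK | OK | OK | OK | OK |
| Unicode (Big-Endian) | OK | OK | OK | OK | OK | OK | OK |
| Unicode (UTF-32) | OK | OK | ERROR | OK | OK | OK | OK |
| Unicode (UTF-32 Big-Endian) | ERROR | ERROR | ERROR | ERROR | ERROR | OK | ERROR |
| Unicode (UTF-8) | OK | OK | OK | OK | OK | OK | OK |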

It should not surprise you that the diagonal top-left to bottom-right says 'OK', since that represents the cases where the same encoding was used to write and then read the file. [Phew!]

The top row of results (the read results of the file written without encoding) is interesting. If no encoding is specified while writing the file, you must read it using the UTF-8 encoding, or without specifying an encoding. This is consistent with our observation that writing without an encoding actually uses UTF-8, but leaves off the BOM.

The second row tells us how ANSI files are treated. They don't contain a BOM, but do sometimes contain characters above 127 which are always 'wrong' when interpreted as UTF-8. And indeed, you must specify the ANSI encoding Encoding.Default when reading them back!

The next row, Unicode, stands for 16-bit Little Endian Unicode with a BOM. No matter which encoding you specify (or even none) when reading, the file is handled correctly. I assume the BOM in the file overrides the specified encoding. This goes for the next row, too: if the encoding is 16-bit Big Endian, the file is read correctly no matter what encoding is used to read it.
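That assumption can be checked with StreamReader, whose constructor takes a flag that switches BOM detection on or off. The following is a sketch under my own naming (BomOverrideDemo, WriteUtf16ThenRead); it writes a UTF-16 file and then deliberately reads it as UTF-8, once with detection and once without.

```csharp
using System;
using System.IO;
using System.Text;

static class BomOverrideDemo
{
    // Writes the text as UTF-16 LE with a BOM, then reads it back with the
    // given (possibly wrong) encoding, with or without BOM detection.
    public static string WriteUtf16ThenRead(string text, Encoding readWith, bool detectBom)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, Encoding.Unicode);
            using (StreamReader reader = new StreamReader(path, readWith, detectBom))
            {
                return reader.ReadToEnd();
            }
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        string text = "caf\u00E9";

        // With detection, the BOM wins over the encoding we asked for.
        Console.WriteLine(WriteUtf16ThenRead(text, Encoding.UTF8, true) == text);   // True

        // Without detection, the UTF-16 bytes are decoded as UTF-8 and come out garbled.
        Console.WriteLine(WriteUtf16ThenRead(text, Encoding.UTF8, false) == text);  // False
    }
}
```

File.ReadAllText behaves like the first call: it looks for a BOM even when you pass an encoding.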

The 32-bit Unicode row is a little more surprising. Here, the BOM seems to be misinterpreted by the normal 16-bit Unicode encoding - all others work as expected. (The BOM for 16-bit Unicode starts with the same two bytes as the 32-bit Unicode BOM - could that be it?)

Even stranger is the Big Endian 32-bit row. A file containing a UTF-32 Big Endian BOM can only be read using the same encoding - all others choke on it!

The last row, however, saves the day. If your file is encoded in UTF-8 with a BOM, you cannot misread it - all other encodings will honor the BOM and read it correctly.

(By the way: I tested under the .NET Framework 2.0 and 3.5 and the results were identical, even the apparent misinterpretations in 32-bit Unicode...)
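If you want to rerun the experiment on your own runtime, a harness along these lines should do. It is a reconstruction, not the original test program: I assume File.WriteAllText and File.ReadAllText, and the sample string and all names (RoundTripMatrix, RoundTrips, Names, Encodings) are mine. Expect the outcome to depend on the runtime; on .NET Core and later, for instance, Encoding.Default is UTF-8 rather than the ANSI code page.

```csharp
using System;
using System.IO;
using System.Text;

static class RoundTripMatrix
{
    const string Sample = "caf\u00E9 na\u00EFve \u00FCber";

    public static readonly string[] Names =
        { "None", "Default", "UTF-16 LE", "UTF-16 BE", "UTF-32 LE", "UTF-32 BE", "UTF-8" };

    // null stands for "no encoding specified".
    public static readonly Encoding[] Encodings =
    {
        null, Encoding.Default, Encoding.Unicode, Encoding.BigEndianUnicode,
        Encoding.UTF32, new UTF32Encoding(true, true), Encoding.UTF8
    };

    // True when the sample survives being written with one encoding and read with another.
    public static bool RoundTrips(Encoding writeWith, Encoding readWith)
    {
        string path = Path.GetTempFileName();
        try
        {
            if (writeWith == null) File.WriteAllText(path, Sample);
            else File.WriteAllText(path, Sample, writeWith);

            string readBack = readWith == null
                ? File.ReadAllText(path)
                : File.ReadAllText(path, readWith);
            return readBack == Sample;
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        // Header row: the read encodings.
        Console.Write("{0,-10}", "write\\read");
        foreach (string name in Names) Console.Write(" {0,-9}", name);
        Console.WriteLine();

        // One row per write encoding.
        for (int w = 0; w < Encodings.Length; w++)
        {
            Console.Write("{0,-10}", Names[w]);
            for (int r = 0; r < Encodings.Length; r++)
                Console.Write(" {0,-9}", RoundTrips(Encodings[w], Encodings[r]) ? "OK" : "ERROR");
            Console.WriteLine();
        }
    }
}
```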

The answers

So question 2 is answered in some detail. Specifying UTF-8 when writing text files will make sure they always get read back in good order, independent of the encoding the reading party uses. This goes for both flavors of 16-bit Unicode, too. Stay away from 32-bit Unicode unless you know what you're doing: those files can only be read using some of the available Unicode encodings.

The answer to question 1 is also available from the results. If your file does not contain a BOM, it's interpreted as either ANSI or UTF-8. The problem is that none of the encodings can tell the difference! If you read an ANSI file using the UTF-8 encoding, the result is corrupted. The reverse holds, too: reading a UTF-8 file without a BOM using the ANSI encoding does not work, either.
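Both kinds of corruption can be shown without touching the file system, by encoding a string one way and decoding the bytes the other way. As a stand-in for ANSI I again assume ISO-8859-1 (code page 28591), which agrees with code page 1252 for the character used here; MisreadDemo and Reinterpret are illustrative names.

```csharp
using System;
using System.Text;

static class MisreadDemo
{
    // Stand-in for the ANSI code page (assumption: ISO-8859-1 instead of 1252).
    public static readonly Encoding Ansi = Encoding.GetEncoding(28591);

    // Encodes the text with one encoding and decodes the bytes with another,
    // which is what happens when a BOM-less file is read with the wrong encoding.
    public static string Reinterpret(string text, Encoding writtenAs, Encoding readAs)
    {
        return readAs.GetString(writtenAs.GetBytes(text));
    }

    static void Main()
    {
        // UTF-8 read as ANSI: the two bytes C3 A9 become two separate characters, "Ã©".
        Console.WriteLine(Reinterpret("\u00E9", Encoding.UTF8, Ansi) == "\u00C3\u00A9");  // True

        // ANSI read as UTF-8: the lone byte E9 is invalid and becomes U+FFFD.
        Console.WriteLine(Reinterpret("\u00E9", Ansi, Encoding.UTF8) == "\uFFFD");        // True

        // Characters below 128 survive either way.
        Console.WriteLine(Reinterpret("abc", Ansi, Encoding.UTF8) == "abc");              // True
    }
}
```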

Conclusion

There are two important conclusions to draw from this experiment. The first is actually a recommendation: use the UTF-8 encoding (Encoding.UTF8) explicitly when writing text files. This will ensure they'll always read back as expected. UTF-16 Unicode (both normal and Big-Endian) has that same quality, but it is wasteful in terms of disc space - for Western language text files, anyway.

The other conclusion is that there really is no way to tell if a file that does not contain a BOM contains ANSI or UTF-8 text. Theoretically, there are ANSI text files that cannot be interpreted as UTF-8, but the vast majority cannot reliably be distinguished. This means that, when reading a file, you have to choose. If you read the file without an encoding, you'll fail on ANSI files, but only if they contain 'high' characters (over 127). If you specify Encoding.Default, ANSI files will be read correctly, but UTF-8 files without a BOM will not - again, only if they contain 'high' characters. In other words: if a file does not contain a BOM, you must specify the correct encoding to read it. That is Encoding.Default for ANSI files or Encoding.UTF8 (or no encoding) for UTF-8. Your choice will depend on the type of file you're most likely to encounter, or you'll have to let the user choose somehow.
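For what it's worth, the 'cannot be interpreted as UTF-8' case can at least be detected, by decoding with a UTF-8 encoding that throws on invalid bytes instead of silently substituting them. This is a heuristic of my own, not part of the experiment, and Utf8Guess and LooksLikeUtf8 are hypothetical names. A 'true' only means the bytes are valid UTF-8, not that the file was meant as UTF-8.

```csharp
using System;
using System.Text;

static class Utf8Guess
{
    // True when the bytes form valid UTF-8. Pure ASCII always passes,
    // and so does any ANSI text that happens to look like UTF-8.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        // Second argument: throw on invalid bytes rather than replace them.
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        // "café" as UTF-8: valid.
        Console.WriteLine(LooksLikeUtf8(new byte[] { 0x63, 0x61, 0x66, 0xC3, 0xA9 }));  // True

        // "café" as ANSI: the lone E9 is not valid UTF-8.
        Console.WriteLine(LooksLikeUtf8(new byte[] { 0x63, 0x61, 0x66, 0xE9 }));        // False

        // "Ã©" as ANSI has the same bytes as "é" in UTF-8: ambiguous, and reported as valid.
        Console.WriteLine(LooksLikeUtf8(new byte[] { 0xC3, 0xA9 }));                    // True
    }
}
```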

An attempt at auto-detecting encodings

Still, while it is impossible to auto-detect the difference between UTF-8 and ANSI encoding - or ASCII or UTF-7 for that matter - we can of course attempt to read a BOM from the start of a text file to get a clue as to its text encoding. Read more about Detecting Text Encoding in .NET.
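As a taste of what such BOM sniffing might look like, here is a minimal sketch. It is not the code from the linked article; BomSniffer, DetectFromBom and StartsWith are names invented for this example, and the caller is assumed to pass in the first few bytes of the file.

```csharp
using System;
using System.Text;

static class BomSniffer
{
    // Returns the encoding announced by a BOM at the start of the data,
    // or null when there is no recognizable BOM.
    public static Encoding DetectFromBom(byte[] head)
    {
        // The 4-byte BOMs go first: the UTF-32 LE BOM (FF FE 00 00) begins with
        // the UTF-16 LE BOM (FF FE). A UTF-16 LE file whose first character is
        // U+0000 would be misclassified here; that ambiguity cannot be resolved.
        if (StartsWith(head, new byte[] { 0x00, 0x00, 0xFE, 0xFF })) return new UTF32Encoding(true, true);
        if (StartsWith(head, new byte[] { 0xFF, 0xFE, 0x00, 0x00 })) return Encoding.UTF32;
        if (StartsWith(head, new byte[] { 0xEF, 0xBB, 0xBF })) return Encoding.UTF8;
        if (StartsWith(head, new byte[] { 0xFE, 0xFF })) return Encoding.BigEndianUnicode;
        if (StartsWith(head, new byte[] { 0xFF, 0xFE })) return Encoding.Unicode;
        return null;
    }

    static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    static void Main()
    {
        byte[] utf8File = { 0xEF, 0xBB, 0xBF, 0x41 };
        byte[] plainFile = { 0x41, 0x42, 0x43 };

        Console.WriteLine(DetectFromBom(utf8File).CodePage);   // 65001
        Console.WriteLine(DetectFromBom(plainFile) == null);   // True: ANSI or BOM-less UTF-8, we can't tell
    }
}
```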