Hello Everyone.
I really do not seem to understand character encoding. I've read many web articles which made me more confused.

Q1: Please point me to a good, clear and sufficient reference to understand encoding as a beginner.

Today, I used a StreamReader to read a .txt file that had been sent to me via email. The file contained Arabic characters, along with numbers and English characters. In the TextBox, numbers and English characters appeared correctly, whereas Arabic characters did not.
I was shocked at first, because it was not the first time I had read Arabic characters from .txt files. I searched the web for a solution and found that encoding was the reason; so I opened the file with Notepad and it said "ANSI". I fixed the problem by specifying "Default" for encoding in the StreamReader's instantiation. Default is "Arabic (Windows)" on my machine.

So, I made the attached VS2008 project to test my understanding of what I read. I also attached a screenshot of the UI.

Q2: What encoding is used for hard-coded strings in VS? Can I force VS to use a specific encoding? (In my attached project, the content written to the .txt file is a hard-coded string consisting of both English and Arabic characters. The project also has a PictureBox with an image of the correct appearance.)

Q3: Is ANSI a character encoding? Why isn't it present in System.Text.Encoding? Is it "Default"?

Q4: Does this mean that all my Arabic applications will not appear correctly in the US, for instance? What about databases? Word documents on my flash drive? What about this forum if I post some Arabic characters?!

Q5: Does the OS have something to do with this? Does Windows have all character encoding standards? Can I have a Windows machine with no Unicode support, for instance?

Q6: What should I do to make my Arabic website/desktop application appear correctly in all countries? Is it my responsibility or the user's?

Ooooh! These are fun questions. Some of them aren't going to have answers you'll like, but such is life. This might take more than one post.

Q1:
I don't know of one. I like the topic; I should write one! I'll add it to my list of projects to do "someday", but here's a summary. I'm going to use an unprofessional tone.

In the beginning, there was ASCII. You could only use 7 bits for a character, and this allowed for 128 characters. This fit the upper and lowercase English alphabet, numbers, many punctuation marks, and left room for a few characters that apparently were used to control hardware with simple ASCII strings. If you wanted Arabic letters, too bad!

Then someone decided to let you use all 8 bits. This allowed for 256 total characters. This is still not enough for each language, so the concept of code pages was created. A code page is simply a mapping of the upper 128 values (128 through 255) to particular characters. Obviously, English didn't do much with its new space (I believe it added some characters common to other European languages that use accented Latin letters) but languages like Russian benefited greatly from the ability to finally use their alphabet. However, this introduced a problem. In the Russian code page, 129 might be a particular letter, but in the English code page it's something different entirely. So if you wanted to send someone text that included Russian characters, you had to make sure to tell them which code page to use; if they used the wrong one the text wouldn't display properly. Another problem: both Arabic and Russian characters can't fit in one code page, so you can't mix both in one text file.
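
To make the code page problem concrete, here is a minimal sketch in Python (picked only for brevity; it is not from the original thread). It decodes one and the same byte under three Windows code pages. The byte value 0xC7 is an arbitrary choice for illustration, and only the code points are printed so the output does not depend on the console's own encoding.

```python
# The same byte means a different character under each code page.
raw = bytes([0xC7])

as_arabic = raw.decode("cp1256")    # Windows Arabic code page
as_cyrillic = raw.decode("cp1251")  # Windows Cyrillic code page
as_western = raw.decode("cp1252")   # Windows Western European code page

for name, ch in [("cp1256", as_arabic),
                 ("cp1251", as_cyrillic),
                 ("cp1252", as_western)]:
    print(name, "U+%04X" % ord(ch))
```

Nothing in the byte itself says which of the three readings was intended, which is why the sender has to tell the receiver the code page.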

Another, bigger problem: the Chinese and Japanese character sets are huge, far too big to fit in even the full 256 values a single byte allows. The solution was "multibyte" character sets. These code pages would be allowed to use more than one byte for their characters; this allowed enough room for all of their characters. However, there are still multiple incompatible code pages and if you sent a string of Japanese text to a machine that was using the English code page it'd just look like a bunch of garbage. 2/3 of the bugs I encounter in one of my features are the result of ANSI code page mismatches.
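
A rough sketch of the multibyte case, again in Python as an illustration rather than anything from the thread. Shift JIS is used here as one example of a Japanese multibyte code page, and cp1252 stands in for "the English code page"; both choices are assumptions made for the example.

```python
text = "\u65e5\u672c"  # two Japanese characters (the word for "Japan")

sjis_bytes = text.encode("shift_jis")
print(len(text), "characters ->", len(sjis_bytes), "bytes")

# Decoding with the wrong (single-byte) code page yields one wrong
# character per byte instead of the original two characters.
wrong = sjis_bytes.decode("cp1252", errors="replace")
right = sjis_bytes.decode("shift_jis")
print(len(wrong), "characters after the wrong decode")
```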

I don't know precisely why ANSI is the label that is used, but when I say ASCII or ANSI I am referring to this scheme where you have to choose a particular code page in order to display text that uses characters outside the first 128 values defined by the ASCII standard. This is a point that I would research in a more formal guide.

Some time in the 90s, some smart people realized that the ANSI code page solution was a hacked-together mess. They set out to decide on a text encoding that would support all languages from the start and have room to include more languages. The result was Unicode. But of course it's not as simple as that.

Unicode itself is not a text encoding; it's merely a list of all characters that a font or text encoding must support in order to be considered compliant with Unicode.

The first major Unicode encoding was UTF-16. This encoding always uses at least 16 bits to represent each character of text. 16 bits isn't enough for all of the world's languages, so for some characters it uses more than 16 bits. The string "Hello" would be 5 bytes in ASCII/ANSI but 10 bytes in UTF-16. Programmers didn't move quickly to UTF-16 for several reasons, but this size increase was likely the 2nd biggest reason. This was the early 90s, and doubling all text strings would have been disastrous for the internet.

The next major Unicode encoding is UTF-8. UTF-8 can use multiple bytes for characters like UTF-16, but it uses only 1 byte for the standard ASCII range 0-127. This made it a very promising encoding for the English-speaking world because it wouldn't change the size of a text document unless it used non-English characters. I like this encoding a lot.
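
The size difference is easy to check. The following is a small Python sketch (not from the thread; the helper name `sizes` is made up for the example) that measures the same strings in both encodings. The `-le` variant of the UTF-16 codec is used so that no BOM is counted.

```python
def sizes(text):
    """Return the encoded length in bytes for UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le: no BOM is added
    }

hello = "Hello"
arabic = "\u0645\u0631\u062d\u0628\u0627"  # five Arabic letters
beyond_16_bits = "\U0001F600"              # one character outside the 16-bit range

for label, s in [("Hello", hello), ("Arabic", arabic), ("beyond 16 bits", beyond_16_bits)]:
    print(label, sizes(s))
```

"Hello" doubles in UTF-16, the Arabic letters take two bytes each in both encodings, and the last character needs four bytes in both, which is the "more than 16 bits" case mentioned above.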

There are a couple of other encodings I will mention but not talk about in detail: UTF-32 and UTF-7. (There is no UTF-64.) I think you can figure out the unit sizes from the names and why they aren't particularly popular. I'll reserve the discussion of why they're useful.

Since Windows 2000 (and I think it was actually NT 4), the default encoding for strings in Windows is UTF-16. However, that doesn't mean everything supports it. There are still hundreds of API calls related to ANSI strings, and people write new applications that rely on them every day out of ignorance, fear of change, unavoidable requirements, or sloth. This makes working with arbitrary text pretty difficult. You can't always tell a file's encoding. Many Unicode encodings can optionally add a special sequence of bytes called a Byte Order Mark (BOM) to the beginning of the file; if one of these is present you can detect it and use the indicated encoding. However, use of a BOM is optional. This is why most text editors either support only one encoding or, if they support many, let you choose the one in use: there's no heuristic for guessing the right encoding with acceptable accuracy.
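
BOM detection can be sketched in a few lines. This is a minimal Python illustration, not how any particular editor or .NET class does it; `sniff_bom` and `decode_with_bom` are hypothetical helper names, and the caller must still supply a fallback encoding for files with no BOM.

```python
import codecs

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom(data, fallback):
    """Decode using the BOM if there is one, else the caller's fallback."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or fallback)
```

For example, `decode_with_bom(raw_bytes, "cp1256")` would honour a BOM when present and otherwise assume the Arabic (Windows) code page. That fallback is exactly the part no program can work out on its own.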

It can be more confusing from an application standpoint. For example, Notepad can save files in either ANSI or some Unicode encodings, but the default is ANSI and the way to configure the encoding used is a non-standard extension to the save file dialog. The situation is similar in many text editors: they still default to ANSI and the way to change the default is not something most users will bother with. You can counter this by ensuring any applications *you* write default to a Unicode encoding and make changing that encoding a first-class option rather than burying it in the "for smart people" options. 99% of people won't change the default, so using one of the Unicode encodings will promote its use. (That is, until they send it to a friend who only has an ANSI text editor and they want to know why your application turns their file into garbage.)

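In .NET, all in-memory strings are UTF-16. Encoding.Unicode is the confusing name for the UTF-16 encoding. Encoding.Default, despite the name, is not UTF-16: in the .NET Framework it returns the encoding for the system's current ANSI code page, which is why it shows up as "Arabic (Windows)" on your machine and will be something else on mine. Almost all text-to-binary or vice versa methods in the .NET Framework allow you to specify an encoding to be used. If you don't specify an encoding, StreamReader and StreamWriter use UTF-8 rather than the ANSI code page, which is why your ANSI file read incorrectly until you passed Encoding.Default.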

Another roadblock along the way: not all fonts support all of the Unicode code points. In particular, the Chinese/Japanese characters are not always available. Most English-speaking users never have a need to see these characters, so fonts save memory/disk space by not including those glyphs. You have to install certain language packs and/or use special versions of fonts to make sure you can display the characters appropriately. So even if you use a Unicode encoding and your recipient uses a text viewer that can handle that encoding, if their font doesn't have the right glyphs the text won't display properly!

I could go on, but I think it'd be better to answer the rest and let you ask followup questions.

Q2:
VS is yet another one of those programs that makes this unclear.

As I stated, .NET uses UTF-16 for in-memory strings. You can't configure this. However, VS might be using a different encoding for your source code file. With a source file open, there should be an "Advanced Save Options..." item under the File menu. This will tell you the encoding you're using. If you're saving the file as an ANSI file, the compiler will convert the ANSI string literal to UTF-16 as it does its work. However, if you send that file to someone on a machine without your code page, the string will look incorrect to them and the conversion to UTF-16 won't work, as the compiler will interpret the bytes differently. I use "UTF-8 (with signature)", which means it's UTF-8 with a BOM so other text editors have a fighting chance of identifying the encoding.

Q3:
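ANSI is the name typically used for the horrible "ASCII + code pages + sometimes multibyte" set of standards. It is not the same thing as Encoding.ASCII, which covers only the original 7-bit range. The closest match in .NET is Encoding.Default, which gives you the system's current ANSI code page. You can also create encoding objects for a specific code page that is different from your system's code page with Encoding.GetEncoding, for example Encoding.GetEncoding(1256) for Arabic (Windows).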

Q4:
This is where things get ugly.

IF you write your data in a Unicode encoding, AND the application that is viewing the text is using the right encoding, the user should be able to view/edit your text just fine. However, if their font doesn't have the glyphs for Arabic characters, they won't display properly. This is still better than the ANSI behavior. Many API calls in Windows perform "substitution" when a string has characters that don't exist in the current code page; this will literally change the unreadable bytes to represent the '?' character. This means with ANSI, it is possible for you to send bytes to someone, have them convert it to a string, then have them convert it back to bytes and send it to you and you will get a *different* string. With Unicode, they may not see the appropriate characters if their font lacks support, but when the string is sent back to you it will be intact.
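
The lossy round trip can be reproduced outside Windows. Here is a minimal Python sketch (an analogy to the API substitution described above, not the Windows calls themselves). The Windows Cyrillic code page is an arbitrary stand-in for "a code page without Arabic letters".

```python
original = "\u0645YZ"  # one Arabic letter followed by two ASCII letters

# A code page without Arabic letters (here Windows Cyrillic) cannot
# represent the first character; errors="replace" substitutes "?".
lossy_bytes = original.encode("cp1251", errors="replace")
round_trip_ansi = lossy_bytes.decode("cp1251")

# A Unicode encoding carries every character, so the round trip is exact.
round_trip_utf8 = original.encode("utf-8").decode("utf-8")

print(lossy_bytes)
print(round_trip_ansi == original, round_trip_utf8 == original)
```

Once the "?" has been written, the original letter is gone; no later choice of code page brings it back.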

On the web, it's different. The web server can send any encoding it wants. The HTML page can send a hint to the browser via meta tags. This page's source tells me it's serving text using the ANSI codepage specified by ISO-8859-1: you can read about it here. It doesn't look like Arabic languages are supported well, so there's not too much you can do. In theory you might be able to use HTML entity references to put the Arabic characters in but the forums software might strip it out, a browser might get confused by it, and you can't guarantee the viewer has fonts that can display it.

Q5:
Windows 2000 (and probably NT4) supported at least UTF-16; I know it used Unicode by default. The 95/98/ME line of OS used ANSI by default and I don't think access to Unicode was available (if at all) until late in 98's lifespan. This is part of why many applications had separate NT and 9x executables; the Unicode API calls weren't available in the 9x line.

So yes, if you are running Windows 95 or 98 it is possible the machine will not have Unicode support at all. However, you can't run the .NET Framework on either so it's moot in the context of VB .NET.

"Does Windows have all character encoding standards?" is a tough question to answer. The technically correct answer is, "No, there are too many standards." The more useful answer is, "Windows supports the Unicode encodings UTF-8, UTF-16, and UTF-32, plus any ANSI code pages the user has installed." That still has pitfalls.

Unicode is not immutable. Some time in the past couple of years, some new letters were added to Unicode to address some shortcomings in several languages. Microsoft had to release an update to Windows to address this; until the update is applied the Unicode implementation on Windows won't support the new characters. This is why I say the technically correct answer is "No"; it is impossible for Windows to have perfect support.

Another way the OS can matter: If you're using the default encoding, that could change. For example, Mono is an implementation of the CLR that can run on Linux. It's possible some Linux installations use UTF-8 or another encoding for the default. Microsoft might change the default encoding in the future. When you ask for Encoding.Default, you're essentially saying, "I don't care what encoding is used, just let me write the text!" Obviously, that carries a small amount of risk.

Q6:
For a website:

Make sure you save your HTML in a Unicode encoding.

If you're using a language like PHP or ASP .NET to generate the pages, make sure it's using Unicode *and* saving the files in Unicode.

Make sure your pages use meta tags to help the browser understand what encoding to use.

(The HTTP Content-Type header can also indicate the encoding through its charset parameter, e.g. "text/html; charset=utf-8"; you should send that too.)

If it's absolutely vital that some text be displayed in Arabic at all times, consider using images for that text.

Even if you do all of those things, not all people may have a font with full support for Arabic characters. My guess is it is safe to assume if they can actually read Arabic and want to visit Arabic pages, they'll have already installed the appropriate fonts.
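
As a sketch of the website checklist, the following Python function builds a page body and declares the encoding in both places: the HTTP Content-Type header and a meta tag. It is an illustration only; `build_response` is a hypothetical helper, not part of any web framework, and it does not HTML-escape its input.

```python
def build_response(page_text, charset="utf-8"):
    """Return (headers, body) for an HTML page, declaring the charset twice:
    once in the HTTP Content-Type header and once in a meta tag."""
    html = (
        "<!DOCTYPE html>\n"
        "<html><head>"
        '<meta charset="%s">'
        "</head><body>%s</body></html>" % (charset, page_text)
    )
    body = html.encode(charset)  # the bytes actually sent must match the label
    headers = {
        "Content-Type": "text/html; charset=%s" % charset,
        "Content-Length": str(len(body)),
    }
    return headers, body

headers, body = build_response("\u0645\u0631\u062d\u0628\u0627")
```

The important design point is that the label and the bytes come from the same `charset` value, so they cannot drift apart.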

For an application:

If you send text as bytes to anything else (this includes writing files), make sure you specify one of the Unicode encodings.

If you receive text as bytes (this includes reading files), make sure you know which encoding you should be using to read the text. If it's not a Unicode encoding you're going to have problems on some people's machines.

If it's absolutely vital that some text always be displayed in Arabic, consider using an image to display that text.

Again, fonts can bite you, but I feel like users who plan to use an application in Arabic would have already handled this.

So a large amount of responsibility falls on you, but some responsibility is the user's. If you take a stream of bytes that represents UTF-8 text, decode it using Encoding.UTF8, and put that string in a TextBox control, it is either the user's fault or the control's fault if the text does not display properly.
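
The application rules boil down to "name the encoding on both the write and the read". A minimal Python sketch of that follows (the thread's project is VB .NET; Python is used here only to keep the example short, and the file name is made up).

```python
import os
import tempfile

text = "abc 123 \u0645\u0631\u062d\u0628\u0627"  # English, digits, Arabic

path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write with an explicit Unicode encoding rather than a machine default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding recovers the text exactly.
with open(path, "r", encoding="utf-8") as f:
    good = f.read()

# Reading with a mismatched single-byte code page does not fail; it
# silently produces the wrong characters for the non-ASCII part.
with open(path, "r", encoding="cp1252", errors="replace") as f:
    bad = f.read()

print(good == text, bad == text)
```

Note that the English letters and digits survive even the wrong read, which matches the symptom in the original question: only the Arabic part was garbled.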

Dear Atma, if it's possible to describe your posts, you wrote a complete, extremely detailed article, which is very clear and enjoyable. I will not waste your efforts and time. Thank you very much. I will try to benefit the most from what you wrote.

Quote:

In the beginning, there was ASCII. You could only use 7 bits for a character, and this allowed for 128 characters.

OK then, what I understood is that: because a bit can represent two different values, with 7 bits we can represent 2^7 = 128 characters. This is ASCII, mainly supporting English alphabet, numbers and some marks.
In ASCII, a character is 7 bits in size.

Quote:

Then someone decided to let you use all 8 bits. This allowed for 256 total characters.

With 8 bits, we can represent 2^8 = 256 different characters. Each character is 1 byte in size.

Quote:

... A code page is simply a mapping of the upper 128 bits available to particular characters... languages like Russian benefited greatly from the ability to finally use their alphabet... Arabic and Russian characters can't fit in one code page.

What I understood: an 8-bit code page leaves the first 128 characters exactly as they are in ASCII, and every non-English culture fits its needs into the new 128 characters.

Q7: Are these 8-bit code pages considered ASCII? Or maybe variations of ASCII? I thought ASCII was the specific encoding which used 7 bits/character.

Quote:

... but when I say ASCII or ANSI I am referring to this scheme where you have to choose a particular code page in order to display text that uses characters outside the first 128 values defined by the ASCII standard.

So, ANSI refers to some code pages, and it's not one specific encoding standard. It also uses 8 bits/character. (I opened the Notepad; and saved a file in ANSI; I found that each character was 1 byte in size, even Arabic ones.)

Q8: Because 8-bit code pages follow the ASCII standard for the first 128 characters, does this mean that it's completely safe to read an ASCII-saved file using one of these 8-bit code pages?

After that, multi-byte character sets were there to support so many characters. If a character is 2 bytes in size, this means we can represent 2^16 = 65536 different characters.

Quote:

Unicode itself is not a text encoding; it's merely a list of all characters that a font or text encoding must support in order to be considered compliant with Unicode.

Understood.

Quote:

The first major Unicode encoding was UTF-16. This encoding always uses at least 16 bits to represent each character of text... The string "Hello" would be 5 bytes in ASCII/ANSI but 10 bytes in UTF-16... UTF-8 can use multiple bytes for characters like UTF-16, but it uses only 1 byte for the standard ASCII range 0-127.

I tried all these encodings using the Notepad and found that:

UTF-16 test: 2 bytes/character for English and Arabic characters; 2 bytes if no characters are there (Q9: Are they the BOM?)

UTF-8 test: 1 byte/character for English characters and 2 bytes/character for Arabic characters; 3 bytes if no characters are there (Q9 repeated: Are they the BOM?)

If I use a StreamWriter to write a file using ASCII encoding, and then open the file in Notepad, it shows that the file is encoded in "ANSI".

ANSI test: 0 byte if no characters are there (Q9 repeated: Does this mean ANSI has no BOM?).

To answer your Q9: the BOM (if present) will be at the start of the Unicode stream; it serves to tell whatever Unicode-aware system is reading the stream what format / endian type the data is in.

The BOM is a Unicode thing and ANSI doesn't have a similar mechanism in place so there is no such thing as BOM in an ANSI file.

Quote:

... the default encoding for strings in Windows is UTF-16... This makes working with arbitrary text pretty difficult... Many Unicode encodings can optionally add a special string of characters called a Byte Order Mark... use of a BOM is optional.

Understood.

Quote:

It can be more confusing from an application standpoint...

So the best of all is to use a Unicode encoding in today's Windows applications.

Quote:

In .NET, all in-memory strings are UTF-16 and UTF-16 is the default encoding...

Very valuable info. From MSDN:

Quote:

New Windows applications should use UTF-16 as their internal data representation.

Your notes about fonts are extremely helpful. They clarified the role of fonts for me. So, encoding is one thing, and the font is another thing on top of that.

Regarding VS:

Quote:

With a source file open, there should be an "Advanced Save Options..." item under the File menu. This will tell you the encoding you're using.

I'm using "Unicode (UTF-8 with Signature) Code page 65001".

I will give an example situation to understand your coming quotes. Assumption: font issues are ignored; there are no problems because of fonts.

I design a Windows application to be used worldwide.

In Form.Load event, I show a welcome MessageBox of a hard-coded string.

My application asks the user to input some text in a TextBox control.

There is a Button control; when it is clicked, the application performs the following:

Compares what is in the TextBox with a hard-coded string.

If they are not equal, the TextBox.Text is passed to a StreamWriter in order to be written using some ANSI code page, named myANSICodePage.

Now to quotes:

Quote:

.NET uses UTF-16 for in-memory strings...

Q10: Does this mean: what the user types in the TextBox (TextBox.Text) is a UTF-16 encoded string regardless of the user's country?

Quote:

If you're saving the file as an ANSI file, the compiler will convert the ANSI string literal to UTF-16 as it does its work...

Q11: Does this mean: the welcome message will be converted to UTF-16 before it is displayed to the user?

Q12: Does this compiler's action of conversion happen even if I send the MSIL assembly only? Or it only happens if the user has the source file and compiles it?

Q13: What is meant by Encoding Conversion? How is it done? If the welcome message is "Hello" but in Arabic, source file encoded in some ANSI code page supporting Arabic, the user has that ANSI, does this mean the compiler will always convert the message correctly to UTF-16? If I go back to 8-bit code pages, you said:

Quote:

So if you wanted to send someone text that included Russian characters, you had to make sure to tell them which code page to use; if they used the wrong one the text wouldn't display properly... This means with ANSI, it is possible for you to send bytes to someone, have them convert it to a string, then have them convert it back to bytes and send it to you and you will get a *different* string.

So I understood that: conversion is something, corruption of text is something else. Right?

Q14: If TextBox.Text is UTF-16 and contains some Arabic characters, sending it to the StreamWriter to be written using myANSICodePage should produce rubbish if myANSICodePage is for Russian characters. Correct?

Quote:

The web server can send any encoding it wants. The HTML page can send a hint to the browser via meta tags. This page's source tells me it's serving text using the ANSI codepage specified by ISO-8859-1.

Good News for the web. View Source command also tells us Google is using UTF-8.

Quote:

In theory you might be able to use HTML entity references to put the Arabic characters in but the forums software might strip it out, a browser might get confused by it, and you can't guarantee the viewer has fonts that can display it

Q7:
That's something I'm not entirely clear on. Usually I say "ASCII" or "ANSI" when I mean "not Unicode" and that's usually clear enough. I think the technical answer is ASCII should only refer to the original ASCII standard, and possibly the original English code page; when I think of ASCII I think of these characters.

I'm not even sure if ANSI is the right way to refer to the concept of "ASCII + a code page". I do know that it's how many text editors refer to it, and if enough people use the wrong term it becomes the right one, so it's probably good enough.

Q8:
As far as I know, yes. It's *possible* there are code pages that remap the standard ASCII range, but I've never encountered one. I would assume that the first 128 characters will look the same on everyone's computer no matter what.

Technical note: There is an ancient alternative to ASCII called EBCDIC. Really, really old computers might still use that. You'll be looking at 20+ year old hardware before you encounter this though so it's safe to assume ASCII.
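
The claim about the first 128 values can be checked directly. The sketch below (Python, as an illustration; the list of code pages is just a sample, not exhaustive) decodes every ASCII byte under several encodings and compares the results. It also shows the EBCDIC exception, using cp037 as one EBCDIC code page.

```python
ascii_bytes = bytes(range(128))          # every value ASCII defines
reference = ascii_bytes.decode("ascii")

code_pages = ["cp1252", "cp1251", "cp1256", "utf-8"]
agree = {cp: ascii_bytes.decode(cp) == reference for cp in code_pages}
print(agree)

# EBCDIC is the exception: the letter "A" is not byte 0x41 there.
ebcdic_a = "A".encode("cp037")
print(ebcdic_a.hex(), "A".encode("ascii").hex())
```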

Q9:
PlausiblyDamp explained well. From the sound of it, those bytes are the BOM in each file. The BOM isn't *just* for determining the file's Unicode encoding; it also specifies the endianness of the file. For example, in UTF-16, you could represent "A" as 00 41 or 41 00 in hex; the difference is if you put the "big" end of the number first or last. That's why it's called "Byte Order Mark" instead of "Encoding Identification Mark". It just so happens it's convenient for identifying the encoding as well.

Wikipedia has more information. You can see from the chart on the page that UTF-8 has a 3-byte BOM, UTF-16 has a 2-byte BOM and actually two different BOMs so it can indicate endianness, and other encodings have various BOM sequences. In general, the .NET classes that convert binary streams into text understand that they should remove the BOM from the stream when they convert.
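
The byte counts from the Notepad tests (3 bytes for an empty UTF-8 file, 2 for an empty UTF-16 file) fall straight out of the BOM sizes. A small Python sketch follows, offered as an illustration rather than a description of what Notepad does internally. Little-endian is assumed for the UTF-16 file.

```python
import codecs

big = "A".encode("utf-16-be")     # most significant byte first
little = "A".encode("utf-16-le")  # least significant byte first
print(big.hex(), little.hex())

# An "empty" file saved with a BOM contains only the mark itself.
empty_utf8 = "".encode("utf-8-sig")                         # UTF-8 with signature
empty_utf16 = codecs.BOM_UTF16_LE + "".encode("utf-16-le")  # BOM added by hand
print(len(empty_utf8), len(empty_utf16))

# A decoder that understands the BOM strips it from the resulting text.
stripped = (codecs.BOM_UTF8 + b"hi").decode("utf-8-sig")
print(stripped)
```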

As PD pointed out, ANSI has no BOM. The original ASCII standard had no need for specifying byte order because each character had only one byte. Code pages came later, and there was already too much hardware/software reliant on the original ASCII standard to add any kind of metadata to indicate the appropriate code page. Thus, the only way to pick the right code page for an ANSI file is to know in advance which is correct.

Q10:
As far as I know, yes. .NET controls should support Unicode. I cannot say with 100% certainty that this will be the case with third-party controls or ActiveX controls, but I would consider it a bug if a Microsoft-provided .NET control does not use UTF-16 from start to finish.

Q11:
There is no conversion. ANSI is only used by an application if it uses ANSI Win32 API calls. Since I assume .NET exclusively uses Unicode API calls, the text will always exist as UTF-16.

Rampant Speculation
If you paste some text from an ANSI file, the conversion likely happens when you make the copy. Windows sees the string coming in as ANSI, converts to UTF-16 using the appropriate code page as a guide, then stores it as UTF-16 on the clipboard. If you paste it into a control that expects ANSI text, Windows will use the code page specified by that control's thread to convert the text from UTF-16 to ANSI. I believe each thread can associate a specific code page with itself through its locale. That is loosely related to the CurrentCulture and CurrentUICulture properties in .NET, though they do different jobs: the former controls culture-specific formatting (dates, numbers, and so on); the latter controls which localized resources are loaded (don't quote me on the code page part, use the documentation.)

Q12:
I don't understand the question. Could you describe it in more detail?

Q13:
The compiler is going to use the system's default code page as far as I know; there may be some compiler option to force it to use another. Here are four scenarios. I'm not super-certain about all of this; I'll put a confidence rating after each one.

You save a code file in UTF-16
The compiler isn't going to do squat to the strings as it compiles the code; they're already in the right format. Confidence: 100%

You save a code file in UTF-8
The compiler will convert the string literals in code to UTF-16 when it's creating data in the assembly. You won't lose any characters because both encodings are Unicode encodings. Confidence: 100%

You save a code file in ANSI and your code page supports Arabic characters.
The compiler will use your code page to convert the string literals to UTF-16. You shouldn't lose any data because your code page can handle the characters. Confidence: 99%

You save a code file in ANSI and your code page doesn't support Arabic characters.
The text editor might display the characters, but when you save your code page won't be able to convert them to bytes. I'm almost certain VS gives you a warning about this. Since the bytes in the file aren't the bytes that the Arabic code page would interpret as the characters, whatever strings the compiler embeds in the executable are incorrect. If we assume X is an Arabic character, I would expect the string "XYZ" to be encoded as something like "?YZ". Confidence: 80%

Q14:
It's possible.

Some code pages have overlap with other code pages. If the Russian code page has no support for Arabic characters, I would expect the text file to be incorrect. Again, if "X" were an Arabic character, "XYZ" would be encoded as "?YZ". If the Russian code page happens to have Arabic character support *and* uses the same bytes, the result should be usable. My guess is that's very unlikely. (The most overlap I've seen is in the Asian code pages.)

Quote:

Windows 2000 (and probably NT4) supported at least UTF-16; I know it used Unicode by default. The 95/98/ME line of OS used ANSI by default and I don't think access to Unicode was available... if you are running Windows 95 or 98 it is possible the machine will not have Unicode support at all. However, you can't run the .NET Framework on either...

So, as a .Net 3.5 developer targeting Windows operating systems that can run the .Net framework, I do not need to care about UTF-16 being not supported.

Q15: People have made many Encodings and code pages, but I can find only 7 of them in the System.Text namespace and the Encoding class. Where are the rest?

I copied your guidelines on how to make website/windows applications display properly worldwide, and emailed them to a friend of mine (we usually develop applications together.) He told me that he never did one of them except putting images that display text!

Atma, you helped me a lot in understanding character encoding with such exhaustive posts. I will read more about the subject and let you know if something is too hard for me to understand. THANK YOU.

Ok Atma, this is the last question.
I have an ANSI codepage that supports Arabic, you do not have one.
I write some Arabic characters in Module1.vb and save it using my ANSI code page.
If I send you this Module1.vb and you open it in VS and compile it, you will not get the characters correctly.
Now, what if I compile the Module1.vb file and send you the .exe file only. Will it show correctly, assuming you have proper fonts?

I believe the situation you describe is correct. If you send me source code encoded in a code page I'm not using, the executable I create will be different from the one you create. If you compile an executable, the data should be in UTF-16 and I would expect to see the appropriate text.