The ease of implementation depends on programming language/platform. Sadly there are still some widely used programming languages without native support for Unicode.
– R. Martinho Fernandes, Apr 20 '11 at 15:44

If you want all (truly all) browsers to understand and render your text correctly, it is safest to stick to ASCII or pre-rendered images.
– Johan Kotlinski, Apr 20 '11 at 15:47

@kotlinski: what about my IBM mainframe text browser that only supports EBCDIC? On a serious note: if you reduce the set of browsers to "all sane browsers produced in the last few years" (which might even include IE 5.5 in this specific case), then UTF-8 and UTF-16 are equally valid.
– Joachim Sauer, Apr 20 '11 at 15:51

Joachim: The browser will understand it, but one problem is that many operating systems out there will not have representations for all characters.
– Johan Kotlinski, Apr 20 '11 at 16:23

@kotlinski: that's true, but you'll find that this is much less of a problem if the language you're using is the native language of your users: Users in countries that need special fonts usually do have those fonts (and the OS to support them).
– Joachim Sauer, Apr 20 '11 at 16:32

3 Answers

If you want to support a variety of languages for web content, you should use an encoding that covers the entire Unicode range. The best choice for this purpose is UTF-8. UTF-8 is the preferred encoding for the web; from the HTML5 draft standard:

Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]

Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]

UTF-8 and Windows-1252 are the only encodings required to be supported by browsers, and UTF-8 and UTF-16 are the only encodings required to be supported by XML parsers. UTF-8 is thus the only common encoding that everything is required to support.

The following is more of an expanded response to Liv's answer than an answer on its own; it's a description of why UTF-8 is preferable to UTF-16 even for CJK content.

For characters in the ASCII range, UTF-8 is more compact (1 byte vs. 2) than UTF-16. For characters from the end of the ASCII range up to U+07FF (which covers Latin Extended, Cyrillic, Greek, Arabic, and Hebrew), UTF-8 also uses 2 bytes per character, so it's a wash. For characters outside the Basic Multilingual Plane, both UTF-8 and UTF-16 use 4 bytes per character, so it's a wash there too.

The only range in which UTF-16 is more efficient than UTF-8 is U+0800 through U+FFFF, which includes the Indic scripts and CJK: there UTF-8 needs 3 bytes per character to UTF-16's 2. Even for text that is mostly in that range, UTF-8 often winds up comparable overall, because the markup around that text (HTML, XML, RTF, or what have you) is all in the ASCII range, where UTF-8 is half the size of UTF-16.
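A quick way to see these per-range byte counts is to encode one character from each range and count the bytes (Python sketch; the `-le` codec variant is used so no BOM is prepended):

```python
# Compare UTF-8 vs UTF-16 byte counts for one character from each range.
samples = [
    ("ASCII, U+0041", "A"),                 # UTF-8: 1 byte,  UTF-16: 2 bytes
    ("Cyrillic, U+0430", "\u0430"),         # UTF-8: 2 bytes, UTF-16: 2 bytes
    ("CJK, U+4E2D", "\u4e2d"),              # UTF-8: 3 bytes, UTF-16: 2 bytes
    ("outside BMP, U+1F600", "\U0001F600"), # UTF-8: 4 bytes, UTF-16: 4 bytes
]
for label, ch in samples:
    u8 = len(ch.encode("utf-8"))
    u16 = len(ch.encode("utf-16-le"))
    print(f"{label}: UTF-8 {u8} byte(s), UTF-16 {u16} byte(s)")
```

Only the CJK sample comes out smaller in UTF-16; ASCII is half the size in UTF-8, and the other two ranges tie.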

For example, if I pick a random web page in Japanese, the home page of nhk.or.jp, it is encoded in UTF-8. If I transcode it to UTF-16, it grows to almost twice its original size.
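The same effect can be sketched with a small hypothetical markup-heavy fragment rather than the real page (the fragment below is invented for illustration):

```python
# A hypothetical HTML fragment: lots of ASCII markup around a little Japanese text.
page = '<html><head><meta charset="utf-8"></head><body><p>日本放送協会</p></body></html>'

utf8_size = len(page.encode("utf-8"))
utf16_size = len(page.encode("utf-16-le"))  # -le variant: no BOM included

# Every ASCII markup byte doubles under UTF-16, while the few CJK
# characters only shrink from 3 bytes to 2, so the total grows.
print(f"UTF-8: {utf8_size} bytes, UTF-16: {utf16_size} bytes")
```

Because the markup dominates, the UTF-16 version comes out substantially larger even though the visible content is Japanese.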

UTF-8 is better in almost every way than UTF-16. Both are variable-width encodings, with all the complexity that entails. In UTF-16, however, 4-byte characters are fairly uncommon, so it's easy to make fixed-width assumptions and have everything work until you run into a corner case you didn't catch. An example of this confusion is the encoding CESU-8, which is what you get if you convert UTF-16 text to UTF-8 by encoding each half of a surrogate pair as if it were a separate character (6 bytes per character: 3 bytes for each half of the surrogate pair), instead of decoding the pair to its code point and encoding that into UTF-8. This confusion is common enough that the mistaken encoding has actually been standardized, so that at least broken programs can be made to interoperate.
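The CESU-8 mistake can be reproduced directly: build the UTF-16 surrogate pair by hand and UTF-8-encode each half as if it were a character (Python sketch; the `surrogatepass` error handler is needed to force lone surrogates through the UTF-8 codec):

```python
ch = "\U0001F602"  # a character outside the BMP
correct = ch.encode("utf-8")  # proper UTF-8: 4 bytes

# CESU-8-style mistake: split the code point into its UTF-16 surrogate
# halves, then UTF-8-encode each half separately (3 bytes each).
hi, lo = divmod(ord(ch) - 0x10000, 0x400)
hi, lo = 0xD800 + hi, 0xDC00 + lo
cesu = (chr(hi).encode("utf-8", "surrogatepass")
        + chr(lo).encode("utf-8", "surrogatepass"))

print(len(correct), len(cesu))  # 4 vs 6 bytes, and the byte sequences differ
```

A strict UTF-8 decoder will reject the 6-byte form, which is exactly why the broken variant had to be standardized separately as CESU-8.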

UTF-8 is much smaller than UTF-16 for the vast majority of content, and if you're concerned about size, compressing your text will always do better than merely picking a different encoding.

UTF-8 is compatible with APIs and data structures that represent strings as null-terminated byte sequences, so as long as your APIs and data structures either don't care about encoding or can already handle different encodings (as most C and POSIX string-handling APIs can), UTF-8 works fine without a whole new set of APIs and data structures for wide characters.

UTF-16 makes you deal with endianness. There are actually three related encodings: UTF-16, UTF-16BE, and UTF-16LE. UTF-16 can be either big- or little-endian, and so requires a BOM to say which. UTF-16BE and UTF-16LE are the big- and little-endian versions with no BOM, so you need an out-of-band method (such as a Content-Type HTTP header) to signal which one you're using, and out-of-band headers are notorious for being wrong or missing.
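The endianness and embedded-NUL points can both be seen directly (Python sketch):

```python
import codecs

s = "hi"

# Plain UTF-16 prepends a BOM in the platform's native byte order:
assert s.encode("utf-16").startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# The BE/LE variants emit no BOM, so byte order must be signaled out of band:
print(s.encode("utf-16-be"))  # b'\x00h\x00i'
print(s.encode("utf-16-le"))  # b'h\x00i\x00'

# UTF-8 of ASCII text is just the ASCII bytes, with no embedded NULs,
# so C-style NUL-terminated string APIs can pass it through unchanged.
# UTF-16 text is full of NUL bytes that would truncate a C string:
print(s.encode("utf-8"))      # b'hi'
assert b"\x00" in s.encode("utf-16-le")
assert b"\x00" not in s.encode("utf-8")
```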

UTF-16 is basically an accident. It happened because people at first thought 16 bits would be enough to encode all of Unicode, and so started changing their representations and APIs to use wide (16-bit) characters. When they realized they would need more characters, they came up with a scheme that uses pairs of reserved code units (surrogates) to encode the code points above U+FFFF, so they could keep using the same data structures. This brought all of the disadvantages of a variable-width encoding like UTF-8, without most of the advantages.

+100: VERY WELL SAID! I despise UTF‑16, although UCS‑2 makes me even madder. Dan Kogai says in the manpage for his Encode::Unicode Perl module: “To say the least, surrogate pairs were the biggest mistake of the Unicode Consortium. But according to the late Douglas Adams in The Hitchhiker’s Guide to the Galaxy Trilogy, In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. Their mistake was not of this magnitude so let’s forgive them.”
– tchrist, Apr 21 '11 at 16:52

UTF-8 is like UTF-16 and UTF-32 in that it can represent every character in the Unicode character set. But unlike UTF-16 and UTF-32, it has the advantage of being backward-compatible with ASCII, and it avoids the complications of endianness and the resulting need for byte order marks (BOMs). For these and other reasons, UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.

You need to take more into consideration when dealing with this.
For instance, you can represent Chinese, Japanese, and pretty much everything else in UTF-8 -- but each such "foreign" character is encoded as a multi-byte sequence, so your data may take a lot more storage because of those extra bytes. You could look at UTF-16 as well, which doesn't need longer sequences for the likes of Chinese and Japanese -- however, each character now takes 2 bytes, so if you're dealing mainly with Latin character sets you've just doubled the size of your data storage for no benefit. There's also Shift-JIS, dedicated to Japanese, which represents that character set more compactly than UTF-8, but then you don't have support for characters outside its repertoire.
I would say: if you know up front you will have a lot of foreign characters, consider UTF-16; if you're mainly dealing with accents and Latin characters, use UTF-8; if you won't be using any Latin characters, consider Shift-JIS and the like.
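The size trade-off this answer describes can be checked quickly (Python sketch; the exact byte counts depend on the sample text chosen, and the comments below dispute the conclusions drawn from them):

```python
jp = "こんにちは世界"   # 7 Japanese characters (5 hiragana + 2 kanji)
en = "hello world"      # 11 ASCII characters

for label, text in [("Japanese", jp), ("Latin", en)]:
    sizes = {enc: len(text.encode(enc))
             for enc in ("utf-8", "utf-16-le", "shift_jis")}
    print(label, sizes)
# Japanese: UTF-8 uses 3 bytes per character (21 total);
#           UTF-16 and Shift-JIS use 2 (14 total).
# Latin: UTF-8 and Shift-JIS use 1 byte per character (11 total);
#        UTF-16 uses 2 (22 total).
```

Note that ASCII passes through Shift-JIS at 1 byte per character, so the gap only appears once markup and mixed content enter the picture.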

How much does storage matter these days? If your text content grows by 50% what would be the greatest effect? Chances are the Chinese and Japanese text will still be on par with the English in size.
– Mark Ransom, Apr 20 '11 at 15:55

@Martinho Fernandes, when bandwidth matters use compression. I've never tested it, but I'm guessing UTF-8 and UTF-16 compress to nearly the same size.
– Mark Ransom, Apr 20 '11 at 16:12

This answer is incorrect and misleading. UTF-8 takes 1 byte for characters in the ASCII range, 2 bytes for characters from U+0080 through U+07FF, 3 bytes for characters from U+0800 through U+FFFF, and 4 bytes for characters from U+010000 through U+1FFFFF. UTF-16 takes 2 bytes for characters from U+0000 through U+FFFF and 4 bytes for characters U+010000 through U+1FFFFF. So, for the ASCII range UTF-8 is smaller, and for anything in the 2 byte range of Unicode (like Hebrew, Arabic, Greek, Russian), and for everything outside the BMP, UTF-8 takes the same amount of storage as UTF-16.
– Brian Campbell, Apr 21 '11 at 15:36

Oh, and finally, for web content, being smaller for the ASCII range is generally more important than being smaller on CJK characters, even for CJK content. Most web content is stored with HTML or XML markup, all of which is in the ASCII range. Since UTF-16 is double the size of UTF-8 for characters in the ASCII range, while UTF-8 is only half again the size of UTF-16 for CJK characters, and since so much of the content actually is markup, transcoding web pages from UTF-8 to UTF-16 will almost always increase their size, even if they are all written in Japanese or Chinese.
– Brian Campbell, Apr 21 '11 at 15:53