This converter is asymmetric. In the ToUnicode direction, it is generous and behaves like windows-949. It also accepts 8-byte sequences for the 8,822 Hangul syllables not encoded as precomposed forms in KS X 1001. In the FromUnicode direction, it is strict and generates 8-byte sequences for those 8,822 Hangul syllables instead of the 2-byte sequences used in windows-949.
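The difference in output length can be illustrated with Python's built-in codecs. This is only a sketch: it assumes CPython's euc_kr codec emits the 8-byte form for such syllables and that cp949 is a reasonable stand-in for windows-949. It is not the converter described above, and the sample code point and the helper name are chosen purely for illustration.

```python
# U+B620 is assumed here to be one of the Hangul syllables that
# KS X 1001 does not encode as a precomposed form.
SYLLABLE = "\uB620"


def encoded_lengths(ch: str) -> dict:
    """Return the encoded byte length of ch under each codec (hypothetical helper)."""
    return {
        "euc_kr": len(ch.encode("euc_kr")),  # strict KS X 1001 style output
        "cp949": len(ch.encode("cp949")),    # windows-949 style output
    }


if __name__ == "__main__":
    print(encoded_lengths(SYLLABLE))
```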

8-bit encodings (excluding UTFs, CJK encodings and T.61) were tested using <http://coq.no/X/charset5/tests8bit.html> in Firefox 3.5.1 on OS X. The fail/pass results should not be taken too seriously yet, especially for the more obscure encodings.

Bugs: Filed <https://bugzilla.mozilla.org/show_bug.cgi?id=512060> for the labels marked 'not recognised' in the table above since the lack of support for these is clearly accidental rather than deliberate (though it seems to suggest that these particular labels are not particularly widely used). In most other cases, research and deliberation will be needed to distinguish between bugs and features.

Internet Explorer

Matching

Strips leading and trailing whitespace and then does ASCII(?) case-insensitive matching. (Matches HTML5.)
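The matching described above could be sketched roughly as follows. The set of whitespace characters is an assumption (the HTML5 space characters), and the function names are hypothetical; nothing here is taken from any browser's source.

```python
# Assumed whitespace set: tab, LF, FF, CR and space.
HTML_WHITESPACE = "\t\n\x0c\r "

# Fold only A-Z to a-z; non-ASCII characters are left untouched.
ASCII_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "abcdefghijklmnopqrstuvwxyz",
)


def normalize_label(label: str) -> str:
    """Strip leading/trailing whitespace, then ASCII-lowercase the label."""
    return label.strip(HTML_WHITESPACE).translate(ASCII_LOWER)


def labels_match(a: str, b: str) -> bool:
    """Compare two encoding labels ASCII case-insensitively."""
    return normalize_label(a) == normalize_label(b)
```

Folding only A-Z matters for characters such as U+212A KELVIN SIGN, which a full Unicode lowercase operation would turn into "k".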

Safari

Matching

Encodings

Data

Safari uses the system version of ICU on Mac (4.0 for Snow Leopard, 3.6 for Leopard and 3.2 for Tiger) and in addition supports TEC on Mac for encodings that are not in ICU. (Unclear how much of TEC is enabled.)

On Windows, Safari ships with ICU 4.0.

According to webkit/WebCore/platform/text/TextCodecICU.cpp, WebKit now uses ICU <http://site.icu-project.org/> with additional aliases (defined in that same file), additional encodings (webkit/WebCore/platform/text/mac/mac-encodings.txt) possibly implemented using TECM at least on the Mac, a list of official IANA labels (webkit/WebCore/platform/text/mac/character-sets.txt) and probably a few more which I have not noticed.

ICU 4.2’s icu/source/data/mappings/convrtrs.txt or <http://demo.icu-project.org/icu-bin/convexp> lists encodings and labels not supported in Safari 4.0 on Leopard, and webkit/WebCore/platform/text/TextCodecICU.cpp mentions that Tiger included ICU 3.2.

Chrome

Similar to Safari, with some customizations in the ICU alias tables. Chrome 3.0 has ICU 3.8 plus customizations for EUC-JP (to match IE/Firefox). For EUC-KR and GBK, we use different mapping tables than those used by Safari (which just uses ICU's default tables for them). ISO-8859-16 is also added.

Chrome trunk uses ICU 4.2.

Thoughts

Anne

If it can be agreed upon that all non-UTF-8 and non-UTF-16 encodings are legacy encodings, I personally would not mind advocating that we drop support for US-ASCII and ISO-8859-1 completely in favor of Windows-1252 (and do the same for similar situations). That is, the US-ASCII and ISO-8859-1 labels would simply map to Windows-1252. This should simplify code a little bit as well.
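A minimal sketch of what that label mapping might look like, using Python's codecs as a stand-in for a browser's decoders. The table contents and the function name are illustrative assumptions, not a proposed list.

```python
# Hypothetical table: legacy labels that would all resolve to windows-1252.
LEGACY_LABEL_MAP = {
    "us-ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "windows-1252": "windows-1252",
}


def decode_with_label(data: bytes, label: str) -> str:
    """Decode data using the encoding that the label maps to."""
    key = label.strip().lower()
    # Labels outside the table are passed to the codec machinery unchanged.
    encoding = LEGACY_LABEL_MAP.get(key, key)
    return data.decode(encoding)


if __name__ == "__main__":
    # 0x93 and 0x94 are curly quotes in windows-1252, but C1 controls in ISO-8859-1.
    print(decode_with_label(b"\x93hi\x94", "ISO-8859-1"))
```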

I also think that we should ban UTF-7, UTF-32 and all EBCDIC encodings. This is already mostly done by HTML5.

I wonder if we can standardize (document to start with) the encoding detection algorithm. The list of encodings is fixed. The list of legacy pages is also fairly fixed. The detection algorithms in browsers should be fairly stable. Certainly looks possible.

E-mails

WHATWG got these e-mails that we should make sure to cover as part of this:

Spec notes

This is what the spec used to say about encodings:

<p>In addition, user agents must support the aliases given in the
following table for every character encoding they support, so that
labels from the first column are treated as equivalent to the labels
given in the corresponding cell from the second column on the same
row.</p>
<table>
<caption>Additional character encoding aliases</caption>
<thead>
<tr> <th> Alias <th> Corresponding encoding <th> References
<tbody>
<tr> <td> x-sjis <td> windows-31J <td>
<a href="#refsSHIFTJIS">[SHIFTJIS]</a>
<a href="#refsWIN31J">[WIN31J]</a>
<tr> <td> windows-932 <td> windows-31J <td>
<a href="#refsWIN31J">[WIN31J]</a>
<tr> <td> x-x-big5 <td> Big5 <td>
<a href="#refsBIG5">[BIG5]</a>
</tbody>
</table>
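One way to read the requirement above is that an alias only applies when the corresponding encoding is itself supported. A rough sketch under that assumption follows; the function name and the lowercase-set representation of supported encodings are hypothetical.

```python
# Aliases from the table above, keyed in lowercase.
ADDITIONAL_ALIASES = {
    "x-sjis": "windows-31j",
    "windows-932": "windows-31j",
    "x-x-big5": "big5",
}


def resolve_label(label: str, supported: set):
    """Return the supported encoding name for label, or None if unsupported.

    supported is assumed to hold lowercase encoding names. str.lower() is a
    simplification; strict ASCII case folding would be needed in practice.
    """
    key = label.lower()
    target = ADDITIONAL_ALIASES.get(key)
    if target is not None and target in supported:
        return target
    return key if key in supported else None
```

For example, `resolve_label("X-SJIS", {"windows-31j"})` resolves to `"windows-31j"`, while the same label resolves to `None` when windows-31J is not in the supported set.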

ICU in Chrome and Safari

Giving a link to or actually including the info on what Safari and Chrome support would be nice. It seems like this would be at least a subset, but it sounds like more may have been added from the text above.