Run echo.cgi?name=Philip and it works fine. Since we use
$cgi->escapeHTML there's no danger of cross-site scripting (XSS)
attacks: try echo.cgi?name=<script>alert('oops')</script> and
the script will be harmlessly displayed as plain text.
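
As a rough illustration of the escaping step, here is a minimal Python sketch. It is not the article's Perl script: `greet` is a hypothetical helper, and Python's `html.escape` stands in for `$cgi->escapeHTML`.

```python
import html


def greet(name: str) -> str:
    """Build the greeting with HTML markup characters escaped."""
    # html.escape replaces &, <, > and (by default) both quote characters.
    return f"Hello, {html.escape(name)}!"


print(greet("Philip"))
print(greet("<script>alert('oops')</script>"))
```

The second call prints the tag as inert text, because every "<" has become "&lt;" before it reaches the page.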

That's okay for ASCII, but we live in a Unicode world. Try
echo.cgi?name=金, and you'll probably get "Hello, é‡‘!" – the
web browser encodes the character "金" as UTF-8 when sending to the server, but
defaults to decoding the response as Windows-1252 (a superset of ISO-8859-1).
The script needs to output Content-Type: text/html; charset=UTF-8
to tell the web browser exactly what the character encoding is.
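
The mismatch is easy to reproduce outside a browser. The following Python sketch (an illustration only, with variable names made up for the example) encodes "金" as UTF-8 and then decodes the same bytes as Windows-1252, which is what the browser does when no charset is declared.

```python
text = "金"

# What the server sends when it encodes its output as UTF-8.
utf8_bytes = text.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # e9 87 91

# What a browser shows if it assumes Windows-1252 instead.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)  # é‡‘

# The header that removes the guesswork.
content_type = "Content-Type: text/html; charset=UTF-8"
print(content_type)
```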

But UTF-8 isn't the only encoding in the world, so let's let the user choose
whatever output encoding they prefer, by passing an enc parameter.

Now you can run echo.cgi?name=金;enc=EUC-KR and the output will
be encoded into the byte sequence 0xD1 0xD1 (the EUC-KR
representation of "金"). The web browser decodes it as EUC-KR, everything
is fine, and you've encoded the character with one byte fewer than in UTF-8.
Perfect!
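
A script with this behaviour might look roughly like the Python sketch below. It is a guess at the shape of the logic, not the original Perl: `render` is a hypothetical function, and using `errors="replace"` for unencodable characters is an assumption.

```python
import html


def render(name: str, enc: str = "UTF-8"):
    """Return (header, body_bytes) for the greeting in the requested encoding."""
    body = f"Hello, {html.escape(name)}!"
    header = f"Content-Type: text/html; charset={enc}"
    # The user-supplied encoding is used as-is; this is the risky part.
    return header, body.encode(enc, errors="replace")


header, body = render("金", "EUC-KR")
print(header)
print(len("金".encode("euc_kr")), "bytes in EUC-KR")
print(len("金".encode("utf-8")), "bytes in UTF-8")
```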

The problem

If you now run
echo.cgi?name=%14%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E%2F%25%C3%81%C3%8A%C3%88%C2%88%1B%3F%3F%C3%B8%C3%8B%1B%C2%89%14%07%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E;enc=CP1047
in a browser like Firefox 3.5, it will execute a script from the URL. We've
evidently introduced an XSS vulnerability, even though our CGI script is
correctly decoding and escaping and encoding all its text.

The name bytes 0x14 0xC3 0x8B … are decoded as UTF-8
into the characters U+0014 U+00CB …. Any dangerous characters are
escaped by escapeHTML, e.g. "<" (U+003C) becomes "&lt;", but we don't have
any of those characters here so nothing gets escaped. Now the text is encoded as
CP1047 – an EBCDIC encoding, very different to ASCII. U+0014 U+00CB
… encodes into the bytes 0x3C 0x73 ….

Those bytes are sent to the browser. But Firefox doesn't support the CP1047
encoding, so it falls back on its default of Windows-1252. 0x3C
0x73 then decodes into the characters "<s" – the
start of an unescaped script tag.
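
The whole chain can be replayed in a few lines of Python. Two assumptions to flag: as far as I know Python's standard library has no CP1047 codec, so the sketch uses CP037, a closely related EBCDIC code page that gives the same bytes for the characters in this payload; and `html.escape` again stands in for escapeHTML.

```python
import html
from urllib.parse import unquote

# The name parameter from the URL above, split into three pieces for readability.
QUERY_VALUE = (
    "%14%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E"
    "%2F%25%C3%81%C3%8A%C3%88%C2%88%1B%3F%3F%C3%B8%C3%8B%1B%C2%89"
    "%14%07%C3%8B%C3%84%C3%8A%C3%91%C3%B8%C3%88%C2%9E"
)

name = unquote(QUERY_VALUE)            # percent-decode, then read the bytes as UTF-8
escaped = html.escape(name)            # no <, >, &, or quotes, so nothing changes
wire_bytes = escaped.encode("cp037")   # EBCDIC stand-in for CP1047 (assumption)
seen_by_browser = wire_bytes.decode("cp1252")  # browser's fallback decoding

print(escaped == name)
print(seen_by_browser)
```

The escaping step sees only control characters and accented letters, yet the bytes that leave the server spell out a complete script element for any client that reads them as Windows-1252.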

(Internet Explorer does support CP1047, so it will decode 0x3C
0x73 into the harmless characters U+0014 U+00CB instead.
Web browsers are not required to support (or to ignore) these encodings, and
are not doing anything wrong here – the only bug is on the server, using
encodings that are dangerous when not supported by clients.)

The problem again

Run
echo.cgi?name=숍訊昱穿刷奄剔㏆穽侘㈊섞昌侄從쒜;enc=ISO-2022-KR in
Google Chrome 2.0, and the same story applies. ISO-2022-KR encodes the
first character "숍" into the bytes 0x3C 0x73 (preceded by
0x0E to shift into multi-byte Korean mode). Chrome doesn't support
ISO-2022-KR, so it will decode as Windows-1252 and execute the script.
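
The same check works for the first character of this payload. The Python sketch below uses the standard `iso2022_kr` codec; the exact framing bytes around the character (designator and shift sequences) may differ from what the original Perl script emitted, so the sketch only looks for the two bytes of interest.

```python
import html

name = "숍"                                # first character of the attack string
escaped = html.escape(name)                # unchanged: it is not a markup character
wire_bytes = escaped.encode("iso2022_kr")  # 7-bit encoding with shift sequences
seen_by_browser = wire_bytes.decode("cp1252")

print(wire_bytes)
print(b"<s" in wire_bytes)
```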

The solution

Just use UTF-8, always. It saves a whole lot of bother. Use gzip compression
if you're concerned about bandwidth usage of UTF-8 for non-English languages.

If you really want to support multiple encodings, restrict it to a short
whitelist of acceptable encodings (perhaps UTF-8, Windows-1252, ISO-8859-1,
Shift_JIS, GB2312, Big5, EUC-JP, EUC-KR, …), and absolutely avoid any encodings
where markup characters (<, ", ', …)
can be encoded as different bytes than in their ASCII encoding, or where those
bytes can occur in the encoding of any other character. This means avoid UTF-7,
all EBCDIC encodings (CP1047, CP037, …), ISO-2022-* (ISO-2022-KR, ISO-2022-JP,
ISO-2022-CN, …), JOHAB, SCSU, BOCU-1, and possibly others.
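
One way to enforce such a whitelist is sketched below in Python. The names `SAFE_LABELS`, `ALLOWED`, and `choose_encoding` are hypothetical, the list simply copies the examples given above, and falling back to UTF-8 for anything else is an assumption about the desired policy.

```python
import codecs

# Labels to advertise in the charset parameter, taken from the list above.
SAFE_LABELS = [
    "UTF-8", "Windows-1252", "ISO-8859-1", "Shift_JIS",
    "GB2312", "Big5", "EUC-JP", "EUC-KR",
]

# Map each codec's canonical Python name to its label, so aliases such as
# "euc_kr", "EUC-KR" and "euckr" all resolve to the same entry.
ALLOWED = {codecs.lookup(label).name: label for label in SAFE_LABELS}


def choose_encoding(requested: str) -> str:
    """Return a whitelisted charset label, or UTF-8 if the request is not allowed."""
    try:
        canonical = codecs.lookup(requested).name
    except LookupError:
        return "UTF-8"  # unknown to the server: never echo it back
    return ALLOWED.get(canonical, "UTF-8")


print(choose_encoding("euc-kr"))
print(choose_encoding("ISO-2022-KR"))
print(choose_encoding("cp037"))
```

Resolving aliases before the comparison matters: a plain string match against the user's input would either reject harmless spellings or, worse, miss an alias of a dangerous encoding if the check were written as a blacklist.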

In practice

Yahoo Search was vulnerable to the ISO-2022-KR attack (only affecting Chrome
users). It was reported to them on 2009-06-29 and fixed the next day.