3 Answers
3

This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding

“Smart quotes” (bytes 147 and 148 in cp1252) are perfectly valid Unicode characters, U+201C and U+201D. Your application should be capable of handling them seamlessly; if not, you're doing something wrong and most likely all non-ASCII characters will fail.

Regardless of whether the characters came from someone typing them or someone pasting them in from Word, the browser should be submitting UTF-8-encoded characters to your application, which should be storing the same UTF-8 bytes to the database.

If the browser is not submitting in UTF-8, chances are you're failing to set the charset of the HTML page containing the form. This can be done using the:

Content-Type: text/html;charset=utf-8

HTTP header and/or the:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

element in <head>.

Can I just set the accept-charset attribute on the form and have the browser do it for me?

No, accept-charset is basically useless thanks to IE, which misinterprets it to mean “try using this charset if the one on the page can't encode the characters we want”, instead of “always use this charset”. This means if you use accept-charset you can end up with a mixture of encodings submitted at once, with no way to figure out which is which. Nice!

how come my database is accepting these characters, which are reserved/control characters in UTF-8?

In MySQL UTF-8 is just a collation, used for comparison and ordering. It's still storing the data as bytes and doesn't really care if they're not valid UTF-8 sequences.

It's a good idea to decode and check incoming UTF-8 sequences in your app anyway, because “short sequences”, invalid in modern Unicode, can hide a ‘<’ character that will still be recognised by older browsers (at least IE6 pre-SP2, Opera 7).

ETA:

So, I entered a string containing byte 146

No, you entered a Unicode character U+201B. The browser deals with Unicode characters, not bytes, right up until the point it has to submit the serialised form to the server. It's then that it decides how to turn the characters into bytes, and if the page is being handled as UTF-8, it will always choose UTF-8.

(If it's not UTF-8, browsers tend to cheat in a non-standards-compliant way: for all characters that can't fit in the encoding, it'll encode them to HTML character references like ‘&#8217;’. This is wrong because you now can't tell the difference between a browser-escaped ‘&’ and a real, user-typed ‘&’, and it's insidiously wrong because if you then echo the reference as unescaped HTML it looks like you're getting it right, which in fact you've just made a big old security hole.)

It went into the database as 146

Really, a ‘\x92’ byte, not ‘\xC2\x92’, ‘\xE2\x80\x99’ or ‘&#146;’?

it came out when I produced the (UTF-8-encoded) XML, as 146. No complaints from the browser

Then it did not come out as a single 146-byte. A browser will complain when given a bare ‘\x92’ in an XML file. (Not an HTML file, in which invalid UTF-8 sequences come out as a missing-character glyph.)

I suspect it is coming out as a ‘&#146;’ character reference, which is well-formed (though the character U+0092 is part of the C1 control set, so won't render as anything useful). If this is what's happening, your form page is not being picked up as UTF-8 after all, and you're suffering the browser-auto-escaping-submission problem described above.

OK "bytes 147 and 148 are perfectly valid Unicode characters, U+201C and U+201D" is the part I don't get. The browser can't know the encoding of the pasted test. How does it know they're CP1252 smart quotes? Surely they're just bytes! Or is the use of 147-148 so unusual it's safe to assume that?
–
AmbroseChapelApr 15 '09 at 20:20

The browser doesn't deal with bytes, it deals with Unicode characters. You paste Unicode characters U+201C (et al) into a text area; the browser neither knows nor cares how they were stored in the application you copied them out of (which was probably Unicode anyway).
–
bobinceApr 15 '09 at 20:28

It might be submitting bytes 147/148, but that would be because it thinks the encoding of the page containing the form was cp1252 and therefore the encoding your form wants to get is cp1252. If you don't tell it otherwise, it can default to cp1252 (or other system codepages in other countries).
–
bobinceApr 15 '09 at 20:29

I'm still confused. If I cut a string containing byte 147 from Word, then paste it into a browser input field, what are you saying the browser does? Automatically convert it to U+201C because it "knows" it's a smart quote and it has an internal 1252-UTF8 mapping?
–
AmbroseChapelApr 16 '09 at 1:29

The encoding of the page is UTF-8, by the way. I may be able to track down the actual user and get them to tell me how exactly they entered the text, as it's an internal app.
–
AmbroseChapelApr 16 '09 at 1:29

You might try the Perl Encode module. It supports conversion between a number of character sets, including UTF-8 of couse. I just checked my install of Perl and it also supported "cp1252", which is just another name for Windows-1252 according to Wikipedia. You can check your own install with the following one liner: