Thinking about Science, Engineering, and Society

December 02, 2010

Yet more about UTF-8 -- the Evils of Platform Defaults.

In my long experience, one of the most common causes of character-set failure is blindly accepting the platform default encoding. Often, people don't even realize they are doing this.

It's an old story. I've seen it time and again, at company after company. And despite the thought that "Java uses Unicode", I've seen it at least as much in Java as in C++. (Though there's a fair amount of observational bias there!)

The developer tests, and it all seems to work. He may throw in a few odd characters he knows how to type, maybe a few words with accents.

The QA lab runs a few tests, perhaps a bit more thorough. Maybe they even throw in some Japanese or Chinese -- and maybe it even works.

Off it goes to customers. Say, maybe a customer in Canada, who speaks French, and runs an older version of Windows, and sets his locale to something different.

And boom. It fails. If you're lucky, it crashes. If you're unlucky, it loses critical data.

What happened?

Simple: new String(myByteArray).

People assume either that it will use UTF-8, or that it will use some platform default that will be the right thing.

Wrongo, charface!

It will use SOME platform default. It won't be under your control. You don't know when it will change. You have given your application's behavior over to the evil gnomes that lurk in the platform.

Because you simply do not know which encoding you will get, there is basically no way to use this API that is remotely useful.
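To make the problem concrete, here is a minimal sketch of what happens to the same two bytes under different charsets. The class and method names are mine, invented for illustration, and it assumes Java 7 or later for StandardCharsets. The one-argument new String(bytes) call is left in only to show where the platform gets its say.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class DefaultCharsetDemo {
    // The letter e-acute (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
    static final byte[] E_ACUTE_UTF8 = { (byte) 0xC3, (byte) 0xA9 };

    // Decodes with an explicitly named charset, so the result
    // is the same on every platform and in every locale.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Whatever the gnomes picked for this machine.
        System.out.println("platform default: " + Charset.defaultCharset());

        // Depends on the platform default: one char on some machines, two on others.
        System.out.println("default decode length: " + new String(E_ACUTE_UTF8).length());

        // Explicit UTF-8: always one char, U+00E9.
        System.out.println("UTF-8 length: "
                + decode(E_ACUTE_UTF8, StandardCharsets.UTF_8).length());

        // Explicit ISO-8859-1: always two chars, U+00C3 followed by U+00A9.
        System.out.println("ISO-8859-1 length: "
                + decode(E_ACUTE_UTF8, StandardCharsets.ISO_8859_1).length());
    }
}
```

The two explicit calls give the same answer everywhere. The middle one gives whatever the customer's machine decides.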

If you have any control over the charset being used, always choose Unicode, and in most cases, you should encode it as UTF-8.
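The same rule applies going the other way: String.getBytes() with no argument also falls back to the platform default. A small sketch of an explicit UTF-8 round trip might look like this (the helper names are hypothetical, and it again assumes Java 7 or later for StandardCharsets):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class; the names are illustrative only.
final class Utf8RoundTrip {
    // Encodes text as UTF-8, regardless of the platform default.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes bytes that are known to be UTF-8.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "cafe" with an acute accent on the e
        byte[] encoded = toUtf8(original);

        // Prints the signed byte values of the UTF-8 encoding.
        System.out.println(Arrays.toString(encoded));

        // The round trip is lossless because both directions name the same charset.
        System.out.println(original.equals(fromUtf8(encoded)));
    }
}
```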

And NEVER perform any conversion in Java that does not specify the character set and encoding. Don't open a text file with new FileReader(...); it has no second argument for specifying the encoding. Use new FileInputStream(...) and pass the result to a new InputStreamReader(...) -- and tell it the encoding.
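Here is one way the FileInputStream plus InputStreamReader pattern could look in practice. Treat it as a sketch under a few assumptions: the file is UTF-8, Java 7 or later is available (for StandardCharsets and try-with-resources), and the class and method names are my own. The writing half is not described above; it is included only so the example is symmetric.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class ExplicitCharsetFileIo {
    // Writes text to a file as UTF-8, never as the platform default.
    static void writeUtf8(String path, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(path), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Reads a whole file, decoding its bytes as UTF-8.
    static String readUtf8(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(path), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

The point of the design is that the charset appears in the code, where you can see it and test it, rather than in somebody's control panel.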