[UPDATE 16 Aug 2011] Armin Ronacher has written a nice module called unicode-nazi that provides the Unicode warnings I discuss at the end of this article.

Though I can't use Python 3 for any of my projects, it does have a few nice things. One particular behaviour where it improves on Python 2 is forbidding implicit conversions between byte strings and Unicode strings. For example:

This is only a 3, at best 3.5 on the Russell scale of API misusability. Even worse, this conversion is done for any function implemented in C that expects unicode or bytes and receives the other type. For example, unicode() converts byte strings to unicode strings, and if not given an explicit encoding argument, uses the default encoding.

It turns out that since the utf8 decoder expects bytes as input, but we gave it unicode... Python helpfully tries to convert the unicode characters into bytes, using the default encoding! This is why the first decode call succeeds - all the characters in it can be converted to ASCII bytes.

Is there any hope?

Since Unicode was added to Python in 2.1, there's been a way to get Python 3's behavior. The site module in the Python standard library sets Python's default encoding on startup. If edited to use "undefined" instead of "ascii", the above examples all fail instead of converting:

Not to mention loads of existing third-party apps and libraries. So what can we do? Fortunately, Python allows registering new codecs. I've written a slight variation on the ASCII codec, ascii_with_complaints, which preserves the default Python behaviour, but also produces warnings.