I'm looking for a simple way of converting a user-supplied string to UTF-8. It doesn't have to be very smart; it should handle all ASCII byte strings and all Unicode strings (2.x unicode, 3.x str).

Since unicode is gone in 3.x and str changed meaning, I thought it might be a good idea to check for the presence of a decode method and call that without arguments to let Python figure out what to do based on the locale, instead of doing isinstance checks. It turns out that's not a good idea at all.
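For illustration, here is a minimal sketch of the duck-typed approach I mean (a paraphrase, not my exact code):

```python
def to_utf8(value):
    # Duck-typed check: if the object has a .decode method, assume
    # it is a byte string, decode it, then encode the result as UTF-8.
    if hasattr(value, "decode"):
        value = value.decode()  # relies on a default encoding
    return value.encode("utf-8")

# This happens to work in Python 3, where only bytes has .decode:
to_utf8(b"hello")   # b'hello'
to_utf8("caf\xe9")  # b'caf\xc3\xa9'
# But in Python 2, unicode objects *also* had .decode, which first
# encoded the string with the ASCII codec -- so any non-ASCII
# unicode input raised UnicodeEncodeError.
```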

3 Answers

It's not useful to speak of "decoding" a unicode string. You want to encode it to bytes. unicode.decode is solely there for historical reasons; its semantics are meaningless. Therefore, it has been removed in Python 3.
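To make the direction of each operation concrete, a short Python 3 sketch:

```python
s = "caf\xe9"            # a character string (3.x str, 2.x unicode)
b = s.encode("utf-8")    # characters -> bytes: encoding
assert b == b"caf\xc3\xa9"
assert b.decode("utf-8") == s  # bytes -> characters: decoding
```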

However, the encode/decode semantics have historically been extended to include string-to-string and bytes-to-bytes codecs such as rot13 or bzip2. In Python 3.1 these pseudo-encodings were removed; they were reintroduced in Python 3.2, but are reachable only through the codecs.encode and codecs.decode helper functions, not through the str/bytes methods.
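A sketch of how those pseudo-encodings look in Python 3.2+ (using the rot_13 and zlib codecs as examples):

```python
import codecs

# str-to-str codec: goes through codecs.encode, since str.encode
# only accepts codecs that produce bytes in Python 3.
assert codecs.encode("hello", "rot_13") == "uryyb"

# bytes-to-bytes codec, likewise via the codecs helper functions:
compressed = codecs.encode(b"data", "zlib_codec")
assert codecs.decode(compressed, "zlib_codec") == b"data"
```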

In general, you should design your interfaces so that they accept either character strings or byte strings, but not both. An interface that accepts both (for reasons other than backwards compatibility) is a code smell: it is hard to test, prone to bugs (what if someone passes UTF-16 bytes?), and has questionable semantics in the first place.

If you must have an interface that accepts both character and byte strings, you can check for the presence of the decode method in Python 3. If you want your code to work in 2.x as well, you'll have to use isinstance.
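If you go the isinstance route, a minimal 2/3-compatible sketch might look like this (ensure_utf8 is a hypothetical name, and it assumes any byte-string input is already UTF-8 or ASCII):

```python
import sys

def ensure_utf8(value):
    # Accept both character and byte strings; always return UTF-8 bytes.
    # The else branch is only evaluated on Python 2, where the
    # character-string type is called unicode.
    text_type = str if sys.version_info[0] >= 3 else unicode
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value  # assume the bytes are already UTF-8 (or ASCII)

ensure_utf8("caf\xe9")       # b'caf\xc3\xa9'
ensure_utf8(b"caf\xc3\xa9")  # b'caf\xc3\xa9'
```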

Just what I thought. But then, how do I tackle the problem of going from any basestring to UTF-8, without isinstance?
– larsmans Jul 21 '12 at 13:16

Updated the answer. That's a problem that shouldn't occur in the first place: you should know what you're being passed. I'm afraid you'll have to use isinstance if you want Python 2 and 3 compatibility.
– phihag Jul 21 '12 at 13:22