Change History (14)

And BTW, any string will look like a unicode string on the Django traceback pages.

I.e. I had a string returned from _() that was in fact a UTF-8 encoded byte-string, but was displayed in the "Locals Vars" section as u"bla bla". What was actually in the local vars, though, was a byte string.

_() is still mapped to gettext instead of ugettext after unicode merge →
gettext/ngettext should no longer be available after unicode merge

Okay, it was pointed out to me that the implicit _() function is supposed to go away according to #2920.

So in that case I'd say the point of this ticket is slightly broader: the non-unicode versions of the gettext functions make no sense in a unicode framework/application. So I'd suggest simply aliasing ugettext to gettext and ungettext to ngettext. Still providing the raw byte-string versions of those functions is prone to break the app in possibly unexpected ways: it just takes a translation using non-ASCII letters, and all the sudden you have a UnicodeDecodingError.

I'm attaching a patch that does this and also simplifies the related code a bit. Of course, it still fixes the implicit _() builtin until that is actually removed.

gettext/ngettext should no longer be available after unicode merge →
gettext/ngettext do not accept UTF-8 and Unicode input strings.

Triage Stage:

Unreviewed →
Accepted

This can't go in as it is.

The problem is return types: ugettext() returns a Unicode string, gettext() returns a UTF-8 encoded bytestring. This makes a difference in some places (e.g. __str__ and __unicode__ methods). I would also, in general, be reluctant to change the return types to differ from the similarly named Python standard library functions to avoid any confusion for users, but it turns out that "neatness" urge is supported by the functionality requirements.

Fixing #2920 removes the ability for people to accidentally shoot themselves in the foot on output. They will need to explicitly import one or other translation function. If we also ensure that gettext() and ugettext() can both accept UTF-8 bytestrings and Unicode strings, we avoid accidents on input as well, which I gather might have been the main motivation. That fixes all the main usability problems, as I understand it.

Are there other usability issues you were trying to fix?

(Changing the title to reflect the more refined problem we are fixing. Removing has_patch checkbox, since all the current patches are not appropriate.)

The problem is return types: ugettext() returns a Unicode string, gettext() returns a UTF-8 encoded bytestring. This makes a difference in some places (e.g. __str__ and __unicode__ methods).

Oh, I thought that after the unicode merge, throwing byte-strings around the framework was undesirable. If that is not the case, you can close this ticket.

However, I'd like to emphasize that I think allowing/encouraging the use of byte-strings is a bad idea and asking for trouble. Even if you make all Django code accept both UTF-8 byte-strings and unicode interchangeably, there are still tons of ways in normal Python usage where this will break left and right. I think I know about this stuff, and the problem that caused me file this ticket and write the patch really threw me off for a couple of hours (the fact that bytestrings are rendered as unicode representations on the debug screen did not help ;-).

In particular for translations, where most gettext() calls start out returning just ASCII or unicode (basically the untranslated string), but when things actually get translated, all the sudden you have UTF-8 encoded bytestrings, which inevitably breaks things. Keeping the byte-string versions of the gettext functions available in a unicode application is not a feature, it's a design bug IMHO.

I would also, in general, be reluctant to change the return types to differ from the similarly named Python standard library functions to avoid any confusion for users, but it turns out that "neatness" urge is supported by the functionality requirements.

Note that the naming of the gettext functions in the standard library is due to it having to support both unicode and non-unicode applications. The actual gettext C library doesn't actually have a uxxx function for every other function. Personally, I see no problem with a framework defaulting to the unicode variants, and thus encouraging (or even forcing) the user to do the right thing. (I suspect we both know how many developers are ignorant about unicode and encodings.)

Fixing #2920 removes the ability for people to accidentally shoot themselves in the foot on output. They will need to explicitly import one or other translation function. If we also ensure that gettext() and ugettext() can both accept UTF-8 bytestrings and Unicode strings, we avoid accidents on input as well, which I gather might have been the main motivation. That fixes all the main usability problems, as I understand it.

I actually don't care much about the input of the gettext functions, as I use ASCII for that for better translation tooling interop. It definitely isn't the point I intended to make with this ticket.

Allowing bytestrings is a useful feature and is there by design. It means that code which uses only ASCII requires essentially no porting, for example.

I'll probably fix the input case at some point, but since it's not the point of this ticket, I'll close it. Thanks for putting the necessary thinking into the changes, but we do want to keep bytestring support; it's not pragmatic to insist everybody always uses Unicode.