Why PERL_UNICODE makes me SAD

When I first got a bug report that Capture::Tiny was breaking under PERL_UNICODE=SAD, I thought it would be an easy thing to fix. I was so wrong... I had no idea what a rabbit hole I was in for.

What the heck is PERL_UNICODE?

Unless you're American, you've probably heard of Unicode. Even if you're American, hopefully by now you've realized that a lot of the world uses languages that require more than the ASCII character set. And if you use Perl, you might be aware that Perl has remarkably good Unicode support. (See the Unicode Support Shootout slides.)

The PERL_UNICODE environment variable provides a default for the -C command line argument to the Perl interpreter, which can set UTF-8 translation layers on various filehandles (and command line arguments).

Specifically, PERL_UNICODE=SAD means that Perl should add the :utf8 layer to the Standard I/O handles (STDIN, STDOUT and STDERR), treat the Argument list (@ARGV) as UTF-8, and make :utf8 the Default layer for any other handles opened as well.
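To see which of these switches are active in a running interpreter, you can inspect the ${^UNICODE} variable, which reflects the -C / PERL_UNICODE setting as a bitmask. The following is a minimal sketch: the describe_unicode_flags helper is a made-up name for illustration, and the bit values are the ones listed for -C in perlrun.

```perl
use strict;
use warnings;

# Letter, bit value and meaning for each -C / PERL_UNICODE flag (per perlrun).
# S is shorthand for I+O+E, and D is shorthand for i+o.
my @FLAGS = (
    [ I => 1,  'STDIN is assumed to be UTF-8' ],
    [ O => 2,  'STDOUT will be UTF-8' ],
    [ E => 4,  'STDERR will be UTF-8' ],
    [ i => 8,  'UTF-8 is the default layer for input handles' ],
    [ o => 16, 'UTF-8 is the default layer for output handles' ],
    [ A => 32, '@ARGV elements are expected to be UTF-8' ],
);

# Hypothetical helper: turn a ${^UNICODE} bitmask into readable lines.
sub describe_unicode_flags {
    my ($bits) = @_;
    return map  { "$_->[0]: $_->[2]" }
           grep { $bits & $_->[1] } @FLAGS;
}

# Report the flags in effect for this interpreter.
print "$_\n" for describe_unicode_flags(${^UNICODE});
```

Run without -C or PERL_UNICODE, this should print nothing. Under PERL_UNICODE=SAD the mask works out to 7 + 32 + 24 = 63, so all six lines should appear.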

Is PERL_UNICODE a good idea?

Maybe. On the one hand, if you work in a world that is exclusively ASCII or Unicode I/O, then you can make a lot of input and output "just work".

That strength is also the weakness. PERL_UNICODE has a global effect!

Can you be sure that every module you use is ready to have :utf8 on any handles they open? Are you sure that any modules that reopen standard handles set them back correctly later? Turning on :utf8 globally is a huge bet, with odds that get worse the larger your dependency chain is.

[I can tell you from experience that almost no code on CPAN properly understands how to record the layers on a handle and reapply them to another. Capture::Tiny does, except when it's actually impossible, since tied handles can't report layers correctly.]
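For readers who haven't seen what "recording and reapplying layers" looks like, here is a rough sketch using PerlIO::get_layers and binmode. It is not Capture::Tiny's actual code: copy_layers is a hypothetical helper, it assumes a Unix-like platform where plain file handles start out with the unix and perlio layers, and it ignores the tied-handle problem entirely.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Layers an ordinary file handle already has; there is no need to reapply them.
my %BASE_LAYER = map { $_ => 1 } qw(unix perlio stdio);

# Hypothetical helper: copy the non-base layers of $from onto $to.
sub copy_layers {
    my ($from, $to) = @_;
    my @extra = grep { !$BASE_LAYER{$_} } PerlIO::get_layers($from);
    binmode($to, ':' . join(':', @extra)) if @extra;
    return @extra;
}

my ($src) = tempfile(UNLINK => 1);
my ($dst) = tempfile(UNLINK => 1);

binmode($src, ':utf8');      # pretend something upstream set this
copy_layers($src, $dst);     # carry it over to the other handle

print join(' ', PerlIO::get_layers($dst)), "\n";
```

The point of the sketch is only that the layers have to be read from one handle and explicitly pushed onto the other; nothing does that for you when a handle is duplicated or reopened.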

Capture::Tiny and PERL_UNICODE walk into a bar...

The bug report I got for Capture::Tiny regarded a failure in one particular test file, when PERL_UNICODE=SAD was set globally in the environment. As I dug into the bug report, it became clear that the bug was being triggered only under these conditions:

Perl prior to v5.12

PERL_UNICODE=D

STDIN closed

Capture::Tiny trying to tee() output

The good news was that newer Perls were unaffected. The bad news was that I couldn't figure out why it was happening.

Not only was it breaking under those conditions, it was weird.

Down the rabbit hole

One of the strange things happening was that a "no output" capture test was capturing the contents of the utf8.pm file in the Perl core. WTF? Something about PERL_UNICODE was loading utf8.pm, and because STDIN was closed, the newly opened file wound up on file descriptor 0, confusing Capture::Tiny. Sticking require utf8; early in the test code "fixed" that problem.
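The descriptor reuse itself is easy to reproduce outside Capture::Tiny. This small sketch assumes a POSIX-like system, where a newly opened file gets the lowest free descriptor; the null device stands in for utf8.pm here.

```perl
use strict;
use warnings;
use File::Spec;

close STDIN;    # frees file descriptor 0

# The next file opened takes the lowest free descriptor, which is now 0.
open my $fh, '<', File::Spec->devnull() or die "open failed: $!";

printf "new handle is on fd %d\n", fileno $fh;
```

Any code that later assumes descriptor 0 is standard input will be reading from whatever file happened to be opened next.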

Even after that fix, it looked like the test was leaking a filehandle. Something else was grabbing file descriptor 0 in the middle of a tee() and not letting go.

Given that leak, it wasn't just a matter of taking into account the global presence of :utf8 layers -- something more fundamental was going wrong.

Knowing when to punt

Reading Perl release notes and grepping through Perl core commit logs wasn't giving me any insight into what changed. Git bisection of the core turned into a huge headache. I quickly got to the point where I decided I was spending more time on this than the problem was worth.

Since the issue was a real corner case and only on very old Perls, I decided to document it as a known issue, bypass the failing tests under the triggering condition, and ship a new release to CPAN. Oh, well.

Lessons to learn from this

Be careful with global effects! It might seem like an easy fix, but you put your entire codebase at risk. It's much smarter to fix your code locally where you do I/O. Even the open pragma is a better choice than PERL_UNICODE, since you can limit the scope of change to the parts of your code that are actually doing I/O.
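As an illustration of keeping the fix local, the sketch below writes a file with an explicit encoding layer on the one handle that needs it, then reads it back inside a sub where the open pragma supplies the default input layer. The slurp_utf8 name and the sample text are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "caf\x{e9} \x{263a}\n";    # non-ASCII sample text

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Option 1: put the layer on exactly the handle that needs it.
open my $out, '>:encoding(UTF-8)', $path or die "write $path: $!";
print {$out} $text;
close $out or die "close $path: $!";

# Option 2: the open pragma is lexically scoped, so only opens inside
# this sub's block pick up the default input layer.
sub slurp_utf8 {
    my ($file) = @_;
    use open IN => ':encoding(UTF-8)';
    open my $in, '<', $file or die "read $file: $!";
    local $/;
    return scalar <$in>;
}

my $round_trip = slurp_utf8($path);
print $round_trip eq $text ? "round trip ok\n" : "mismatch\n";
```

Either way, handles opened elsewhere in the program (or inside your dependencies) are left alone, which is exactly what PERL_UNICODE can't promise.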

The real insight I got from this is how important it is to test under production conditions. If you do use PERL_UNICODE=SAD in production, it's a very good idea to do your development and testing with that set as well. It will help you find modules that aren't happy with it.
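One cheap way to check that a test run really sees the production setting is to start a child perl with PERL_UNICODE in its environment and look at ${^UNICODE}. This is only a sketch: run_under_perl_unicode is a hypothetical helper, and it relies on list-form pipe opens, which assumes a Unix-like system or a reasonably recent Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: run a one-liner in a child perl with PERL_UNICODE
# set to $setting, and return whatever it prints.
sub run_under_perl_unicode {
    my ($setting, $code) = @_;
    local $ENV{PERL_UNICODE} = $setting;
    open my $pipe, '-|', $^X, '-e', $code or die "cannot run $^X: $!";
    local $/;
    my $output = <$pipe>;
    close $pipe;
    return $output;
}

# The child reports the bitmask it started with.
print run_under_perl_unicode('SAD', 'print ${^UNICODE}'), "\n";
```

The same idea scales up: set the variable in the environment of whatever runs your test suite, so your dependencies get exercised under the conditions they will meet in production.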

Finally, this is a great example of why upgrading Perl is a good idea. Hundreds (thousands?) of bugs have been fixed since 5.8 or 5.10. The longer you wait to upgrade, the longer you'll have to suffer them.

Summary

PERL_UNICODE has a global effect, applying :utf8 layers to handles automatically

Global effects can have unexpected side effects

Avoid global effects if you can

If you must use global effects, test your dependencies under the same conditions

5 Comments

I hit the same problems more than 1 year ago and blogged about whether PERL_UNICODE was such a good thing at http://www.martin-evans.me.uk/node/119. As I said in that blog entry, I think loads of people read Tom Christiansen's post on stackoverflow and thought to get Unicode working in Perl you better set SAD.

What makes me sad is that "use utf8" is not "use utf8::all". Everyone seems to expect that "use utf8" will change default file-handle behavior, and they get confused when it doesn't. ("Oh, when you said 'use utf8', you thought that meant you wanted to use utf8...?").

It could be we should start smoke testing CPAN with PERL_UNICODE on (and off).

Then should we also have use utf16 and use utf32 to designate that all I/O is in UTF-16 or UTF-32, respectively? How about use windows1252? I don't think we'll ever see a time when all data is encoded in a single character encoding, nor have I heard proposals to always use UTF-8 everywhere, and it's not Perl's place to encourage a preferred encoding for your external files, protocols, services, terminal, etc. (Whereas it is Perl's place to make such decisions for your source code or internal representation of strings.) If you want all filehandles to perform implicit UTF-8 encoding and decoding, then specify that with the open pragma, which can also be used to separately set input and output filehandles to different encodings.

I vote against PERL_UNICODE :) It seems to be yet another way of 'f[beeeep] things up big time' for the people who will definitely cargo-cult it out of laziness.

On any large projects, it's almost guaranteed some module will not be compatible with it and not 'do the right thing'.

Dealing with Unicode and encoding sensibly is not difficult and should be part of the basic programmer's knowledge. There's no reason (except amateurism) why the Perl programmers should be the only programmers in the world to be exempted from understanding the difference between a String and an array of Bytes.