Monday, October 03, 2011

I guess this is not a surprise to anyone who has thought about how this is supposed to work, but for the sake of being systematic, here is the code:
use strict;
use warnings;
use autodie;
use HTML::Entities;
use Encode;

6 comments:

Character-to-byte encodings keep cropping up in many places, far more than just the "bytes of a file stream" that people tend to think of. As far as I'm aware there's no control in Perl to set the encoding of characters to pass to, or expect from, filename-related syscalls (open, stat, readdir, etc.), so you have to do the Encode dance yourself here.
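A minimal sketch of that dance, assuming the filesystem stores UTF-8 byte names (the encoding and the example filename are my own choices, not anything from the post):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A filename as a Perl character string (U+00E9, e with acute accent)
my $name_chars = "caf\x{e9}.txt";

# Encode to the bytes the syscall actually receives (assuming UTF-8 here)
my $name_bytes = encode('UTF-8', $name_chars);
open my $fh, '>', $name_bytes or die "open: $!";
close $fh;

# readdir hands back bytes too, so decode on the way in
opendir my $dh, '.' or die "opendir: $!";
my @names = map { decode('UTF-8', $_) } readdir $dh;
closedir $dh;
```

Without the explicit encode(), Perl passes whatever bytes the string's internal representation happens to contain, which may or may not be what the filesystem expects.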

The heart of the problem is that some systems have decided that filenames in system API calls are bytes, not characters, and that it is the task of the underlying filesystem to interpret them as characters (or not). This is the case for Linux.

A cross-OS language like Perl should be platform-independent wherever possible, certainly for something as common as file system access. So what's the best way to deal with that? Compile-time options that define the default encoding for filenames, STDIN/STDOUT, etc., or detection at startup?
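For the standard streams, at least, Perl already has knobs like this: the open pragma (and, equivalently, the -C command-line switch or the PERL_UNICODE environment variable) can set a program-wide default encoding, though nothing comparable exists for filenames. A sketch:

```perl
use strict;
use warnings;

# Declare UTF-8 as the default encoding for STDIN/STDOUT/STDERR
# and for handles opened in this lexical scope
use open qw(:std :encoding(UTF-8));

# Emitted as UTF-8 bytes, with no "wide character" warning
print "caf\x{e9}\n";
```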

Well, it is cross-platform. Nobody can tell you the encoding of a filename; even if your default is set to UTF-8, the filename can be in some other encoding. So treating everything as binary is the only correct way.

What do you expect? Try every possible encoding and open whichever file happens to match? Besides, the same string can have multiple different encodings, and all of them can exist in the same directory, because filesystems don't know anything about encodings.
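That situation is easy to manufacture. This sketch (the filename is my own choice) turns one character string into two distinct directory entries, which is only possible because the kernel compares filenames byte by byte:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $name = "caf\x{e9}";                   # one character string

# Two different byte spellings of the same name
my $as_utf8   = encode('UTF-8',  $name);  # "caf\xC3\xA9"
my $as_latin1 = encode('Latin1', $name);  # "caf\xE9"

# The filesystem sees only bytes, so both files coexist in one directory
open my $fh1, '>', "$as_utf8.txt"   or die "open: $!";
open my $fh2, '>', "$as_latin1.txt" or die "open: $!";
close $_ for $fh1, $fh2;
```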

@Sid Burn - what would I expect? I would expect the documentation for the 'open' function to define this. I think bytes is the only practical solution for now - but it should be documented. The examples you mention are a complete straw man, but the compile-time options mentioned by abraxxa above could work.