Chapter 15. Unicode

If you do not yet know what Unicode is, you will soon--even if you skip
reading this chapter--because working with Unicode is becoming
a necessity. (Some people think of it as a necessary evil, but it's
really more of a necessary good. In either case, it's a necessary pain.)

Historically, people made up character sets to reflect what they needed
to do in the context of their own culture. Since people of all
cultures are naturally lazy, they've tended to include only the symbols
they needed, excluding the ones they didn't need. That worked fine as
long as we were only communicating with other people of our own
culture, but now that we're starting to use the Internet for
cross-cultural communication, we're running into problems with the
exclusive approach. It's hard enough to figure out how to type
accented characters on an American keyboard. How in the world
(literally) can one write a multilingual web page?

Unicode is the answer, or at least part of the answer (see also XML).
Unicode is an inclusive rather than an exclusive character set. While
people can and do haggle over the various details of Unicode (and
there are plenty of details to haggle over), the overall intent is to
make everyone sufficiently happy[1] with Unicode so that they'll
willingly use Unicode as the international medium of exchange for
textual data. Nobody is forcing you to use Unicode, just as nobody is
forcing you to read this chapter (we hope). People will always be
allowed to use their old exclusive character sets within their own
culture. But in that case (as we say), portability
suffers.

[1] Or in some cases, insufficiently unhappy.

The Law of Conservation of Suffering says that if we reduce the
suffering in one place, suffering must increase elsewhere. In the
case of Unicode, we must suffer the migration from byte semantics to
character semantics. Since, through an accident of history, Perl was
invented by an American, Perl has historically confused the notions of
bytes and characters. In migrating to Unicode, Perl must somehow
unconfuse them.

Paradoxically, by getting Perl itself to unconfuse bytes and
characters, we can allow the Perl programmer to confuse them, relying
on Perl to keep them straight, just as we allow programmers to confuse
numbers and strings and rely on Perl to convert back and forth as
necessary. To the extent possible, Perl's approach to Unicode is the
same as its approach to everything else: Just Do The Right Thing.
Ideally, we'd like to achieve these four Goals:

Goal #1:

Old byte-oriented programs should not spontaneously break on the
old byte-oriented data they used to work on.

Goal #2:

Old byte-oriented programs should magically start working on
the new character-oriented data when appropriate.

Goal #3:

Programs should run just as fast in the new character-oriented mode as
in the old byte-oriented mode.

Goal #4:

Perl should remain one language, rather than forking into a
byte-oriented Perl and a character-oriented Perl.

Taken together, these Goals are practically impossible to reach. But
we've come remarkably close. Or rather, we're still in the process of
coming remarkably close, since this is a work in progress. As Unicode
continues to evolve, so will Perl. But our overarching plan is to
provide a safe migration path that gets us where we want to go with
minimal casualties along the way. How we do that is the subject of
the next section.

15.1. Building Character

In releases of Perl prior to 5.6, all strings were viewed as sequences
of bytes.[2] In
versions 5.6 and later, however, a string may contain characters wider
than a byte. We now view strings not as sequences of bytes, but as
sequences of numbers in the range 0 .. 2**32-1 (or
in the case of 64-bit computers, 0 .. 2**64-1).
These numbers represent abstract characters, and the larger the
number, the "wider" the character, in some sense; but unlike many
languages, Perl is not tied to any particular width of character
representation. Perl uses a variable-length encoding (based on
UTF-8), so these abstract character numbers may, or may not, be packed
one number per byte. Obviously, character number
18,446,744,073,709,551,615 (that is,
"\x{ffff_ffff_ffff_ffff}") is never
going to fit into a byte (in fact, it takes 13 bytes), but if all the
characters in your string are in the range 0..127
decimal, then they are certainly packed one per byte, since UTF-8 is
the same as ASCII in the lowest seven bits.

[2] You may prefer to call them "octets"; that's
okay, but we think the two words are pretty much synonymous these
days, so we'll stick with the blue-collar word.
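
To make the difference between character count and storage width concrete, here is a minimal sketch. It assumes a Perl of 5.8 or later, where the built-in utf8::encode function is available; the variable names are ours and purely illustrative.

```perl
use strict;
use warnings;

my $ascii = "Perl";        # every character is in the range 0..127
my $wide  = "\x{263A}";    # one character, number 0x263A (above 255)

# length() counts abstract characters, whatever their storage width.
my $ascii_chars = length $ascii;
my $wide_chars  = length $wide;

# utf8::encode() rewrites a copy in place as UTF-8 bytes, so length()
# on the copy reports how many bytes that encoding takes.
my $ascii_bytes = $ascii;  utf8::encode($ascii_bytes);
my $wide_bytes  = $wide;   utf8::encode($wide_bytes);

printf "%-5s chars=%d bytes=%d\n", 'ascii', $ascii_chars, length $ascii_bytes;
printf "%-5s chars=%d bytes=%d\n", 'wide',  $wide_chars,  length $wide_bytes;
```

The ASCII string comes out at four characters and four bytes, while the single wide character needs three bytes in UTF-8.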

Perl uses UTF-8 only when it thinks it is beneficial, so if all the
characters in your string are in the range 0..255,
there's a good chance the characters are all packed in bytes--but in
the absence of other knowledge, you can't be sure because internally
Perl converts between fixed 8-bit characters and variable-length UTF-8
characters as necessary. The point is, you shouldn't have to worry
about it most of the time, because the character semantics are
preserved at an abstract level regardless of representation.

In any event, if your string contains any character numbers larger
than 255 decimal, the string is certainly stored in
UTF-8. More accurately, it is stored in Perl's extended version of
UTF-8, which we call utf8, in honor of a pragma
by that name, but mostly because it's easier to type. (And because
"real" UTF-8 is only allowed to contain character numbers blessed by
the Unicode Consortium. Perl's utf8 is allowed to contain any
character numbers you need to get your job done. Perl doesn't give a
rip whether your character numbers are officially correct or just
correct.)
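
If you'd like to watch the abstraction hold up, the following sketch may help. It assumes Perl 5.8 or later, where utf8::upgrade and utf8::is_utf8 are built in; peeking at the internal flag this way is meant only for illustration, not as something ordinary programs should depend on.

```perl
use strict;
use warnings;

my $latin    = "caf\x{E9}";     # all character numbers <= 255
my $upgraded = $latin;
utf8::upgrade($upgraded);       # force the utf8 representation on a copy

my $wide = "\x{100}";           # a character number above 255

# The two copies differ only in storage, so they compare equal
# and have the same length in characters.
my $same_string = ($latin eq $upgraded) ? 1 : 0;
my $same_length = (length($latin) == length($upgraded)) ? 1 : 0;

print "same string: $same_string, same length: $same_length\n";
print "upgraded copy is utf8: ", (utf8::is_utf8($upgraded) ? "yes" : "no"), "\n";
print "wide string is utf8:   ", (utf8::is_utf8($wide) ? "yes" : "no"), "\n";
```

Both copies of the Latin-1 string behave identically at the character level, even though one of them has been pushed into the utf8 representation.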

We said you shouldn't worry about it most of the time, but people like to
worry anyway. Suppose you use a v-string to represent an IPv4
address:
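
A sketch of three such assignments follows, using the variable names that the next paragraph refers to. The specific addresses are illustrative assumptions, and the comments describe how each string is likely to be stored:

```perl
use strict;
use warnings;

my $locaddr = v127.0.0.1;        # all characters <= 127: packed one per byte
my $oreilly = v204.148.40.9;     # all <= 255: may be stored as bytes or as utf8
my $badaddr = v2004.148.40.9;    # first character > 255: certainly utf8

# Each string holds four abstract character numbers either way.
printf "%vd has %d characters\n", $_, length $_
    for $locaddr, $oreilly, $badaddr;
```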

Everyone can figure out that $badaddr will not work as an IP address.
So it's easy to think that if O'Reilly's network address gets forced into
a UTF-8 representation, it will no longer work. But the characters in
the string are abstract numbers, not bytes. Anything that uses an IPv4
address, such as the gethostbyaddr function, should automatically
coerce the abstract character numbers back into a byte representation
(and fail on $badaddr).
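
The coercion back to bytes can be imitated by hand. This sketch assumes Perl 5.8 or later and uses the built-in utf8::upgrade and utf8::downgrade functions to stand in for what an interface might do internally; it is not a description of how gethostbyaddr itself is implemented, and the addresses are illustrative.

```perl
use strict;
use warnings;

my $oreilly = v204.148.40.9;
my $badaddr = v2004.148.40.9;

# Force the good address into the utf8 representation.
my $forced = $oreilly;
utf8::upgrade($forced);

# The abstract character numbers have not changed.
my $dotted = sprintf "%vd", $forced;

# Coerce copies back to one byte per character. With a true second
# argument, utf8::downgrade returns false instead of dying on failure.
my $good_copy = $forced;
my $bad_copy  = $badaddr;
my $good_ok = utf8::downgrade($good_copy, 1) ? 1 : 0;
my $bad_ok  = utf8::downgrade($bad_copy,  1) ? 1 : 0;

print "$dotted: fits in bytes? $good_ok\n";
printf "%vd: fits in bytes? %d\n", $badaddr, $bad_ok;
```

The upgraded address still reads 204.148.40.9 and downgrades cleanly, whereas the address containing 2004 cannot be squeezed back into bytes.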

The interfaces between Perl and the real world have to deal with the
details of the representation. To the extent possible, existing
interfaces try to do the right thing without your having to tell them
what to do. But you do occasionally have to give instructions to some
interfaces (such as the open function), and if you
write your own interface to the real world, it will need to be either
smart enough to figure things out for itself or at least smart enough
to follow instructions when you want it to behave differently than it
would by default.[3]

[3] On some systems, there may be ways
of switching all your interfaces at once. If the -C
command-line switch is used (or the global
${^WIDE_SYSTEM_CALLS} variable is set to
1), all system calls will use the corresponding
wide character APIs. (This is currently only implemented on Microsoft
Windows.) The current plan of the Linux community is that all
interfaces will switch to UTF-8 mode if
$ENV{LC_CTYPE} is set to
"UTF-8". Other communities may take other
approaches. Our mileage may vary.
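
As one illustration of giving instructions to an interface such as open, here is a sketch that assumes a later Perl (5.8 or newer) with I/O layers, a facility this chapter does not describe. The layer names and the use of File::Temp are assumptions of the example, not something the text above prescribes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file that is removed when the program exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

my $text = "\x{263A} smile";    # 7 characters, the first one above 255

# Tell open() that this handle should carry characters encoded as UTF-8.
open(my $out, '>:encoding(UTF-8)', $path) or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Read the file back as raw bytes...
open(my $raw, '<:raw', $path) or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# ...and again as characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "open for read: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "bytes on disk: %d, characters read back: %d\n",
    length $bytes, length $chars;
```

The same file yields nine bytes through the raw handle and seven characters through the handle that was told to expect UTF-8.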

Since Perl worries about maintaining transparent character semantics
within the language itself, the only place you need to worry about byte
versus character semantics is in your interfaces. By default, all your
old Perl interfaces to the outside world are byte-oriented,
so they produce and consume byte-oriented data. That is to say, on the
abstract level, all your strings are sequences of numbers in the range
0..255, so if nothing in the program forces them into utf8
representations, your old program continues to work on byte-oriented
data just as it did before. So put a check mark by Goal #1 above.

If you want your old program to work on new character-oriented data,
you must mark your character-oriented interfaces such that Perl knows
to expect character-oriented data from those interfaces. Once you've done
this, Perl should automatically do any conversions necessary to
preserve the character abstraction. The only difference is that you've
introduced some strings into your program that are marked as
potentially containing characters higher than 255, so if you perform
an operation between a byte string and a utf8 string, Perl will
internally coerce the byte string into a utf8 string before performing
the operation. Typically, utf8 strings are coerced back to byte
strings only when you send them to a byte interface, at which point, if
the string contains characters larger than 255, you have a problem
that can be handled in various ways depending on the interface in
question. So you can put a check mark by Goal #2.
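
The coercion described above can be sketched as follows, assuming Perl 5.8 or later for the built-in utf8::is_utf8 and utf8::downgrade functions. Concatenation stands in for "an operation", and the variable names are illustrative.

```perl
use strict;
use warnings;

my $byte_str = "\xE9";        # one character, number 233
my $wide_str = "\x{263A}";    # one character, number 9786 (above 255)

# Mixing the two yields a single string of abstract characters.
my $mixed   = $byte_str . $wide_str;
my @numbers = map { ord } split //, $mixed;

# The result must be held as utf8, since it contains a wide character.
my $is_utf8 = utf8::is_utf8($mixed) ? 1 : 0;

# A byte interface would need one byte per character, which is
# impossible here; with a true second argument, utf8::downgrade
# reports that by returning false rather than dying.
my $copy = $mixed;
my $fits = utf8::downgrade($copy, 1) ? 1 : 0;

print "length: ", length($mixed), "\n";
print "numbers: @numbers\n";
print "stored as utf8: $is_utf8, fits in bytes: $fits\n";
```

The character that started life in a byte string keeps its number (233) after the mix; only the storage changed.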

Sometimes you want to mix code that understands character semantics
with code that has to run with byte semantics, such as I/O code that
reads or writes fixed-size blocks. In this case, you may put a
use bytes declaration around the byte-oriented code
to force it to use byte semantics even on strings marked as utf8
strings. You are then responsible for any necessary conversions. But
it's a way of enforcing a stricter local reading of Goal #1, at the
expense of a looser global reading of Goal #2.
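
A minimal sketch of such a declaration follows. The bytes pragma is lexically scoped, so a bare block is enough to confine it; the string and variable names are illustrative.

```perl
use strict;
use warnings;

my $string = "\x{263A}\x{263A}";    # two characters, each above 255

# Character semantics: length counts characters.
my $chars = length $string;

my $bytes;
{
    use bytes;                      # byte semantics inside this block only
    $bytes = length $string;        # counts bytes of the utf8 storage
}

print "characters: $chars, bytes: $bytes\n";
```

Outside the block the string is two characters long; inside it, the same string measures six bytes.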

Goal #3 has largely been achieved, partly by doing lazy conversions
between byte and utf8 representations and partly by being sneaky in
how we implement potentially slow features of Unicode, such as
character property lookups in huge tables.

Goal #4 has been achieved by sacrificing a small amount of interface
compatibility in pursuit of the other Goals. By one way of looking at
it, we didn't fork into two different Perls; but by another way of
looking at it, version 5.6 of Perl is a forked
version of Perl with regard to earlier versions, and we don't expect
people to switch from earlier versions until they're sure the new
version will do what they want. But that's always the case with new
versions, so we'll allow ourselves to put a check mark by Goal #4 as
well.