It's a wide world out there, and the one hundred and twenty-eight
characters in ASCII just don't cut it anymore. In a global marketplace
- or whenever we want to talk to those Paris Perl Mongueurs - we need
to use a bigger range of characters: the funny Es with acute accents
on them, weird Greek characters, and things that just look like
squiggles. We need them all.

Whenever we want to print these characters to a terminal or save them
to a file, we need to encode them in a character encoding so they can
be represented as bytes. Whenever we read these characters back in, we
need to decode the byte sequences.

Since Perl 5.6, perl has been able to store Unicode characters in
strings. Consider the Unicode character Ω (omega), which has
Unicode code point 937 (i.e. it's the 938th Unicode code point, since
we start counting from 0 not 1) and would typically be used like this
in a mathematical formula:

a Ω b

This could simply be entered as a Perl string by using the chr
function to convert from the code point to a character:

my $string = "a " . chr(937) . " b";

Or by using the \x escape sequence inside a string with the
hexadecimal form of 937:

my $string = "a \x{03A9} b";

Or by using the \N escape sequence inside a string with the Unicode
name for the character:

# load the character names into our script
use charnames ":full";

my $string = "a \N{GREEK CAPITAL LETTER OMEGA} b";

Or by using a Unicode-aware text editor and the use utf8 pragma, and
getting your editor to save the script in the UTF-8 encoding, meaning
you can just type the character with your keyboard:

# declare that everything after this pragma
# is stored on disk as utf8 bytes
use utf8;

my $string = "a Ω b";
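
As a quick sanity check, here is a minimal sketch that compares the
first three spellings with each other. The variable names are made up
for this illustration, and the use utf8 form is left out only so the
snippet doesn't depend on how the file is saved.

```perl
use strict;
use warnings;
use charnames ":full";

# Three ways of spelling the same five character string
my $from_chr  = "a " . chr(937) . " b";
my $from_hex  = "a \x{03A9} b";
my $from_name = "a \N{GREEK CAPITAL LETTER OMEGA} b";

# The strings hold identical characters, so both comparisons are true
my $all_equal = ($from_chr eq $from_hex && $from_hex eq $from_name) ? 1 : 0;

# length counts characters here, not bytes
my $length = length $from_chr;

print "all equal: $all_equal, length: $length\n";
```

Whichever way you type it, perl ends up holding the same string.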

Rendering the string

All of these approaches work - Perl now has a five character string in
memory that contains the correct character. For example, if we write
something to print out the code point of each character, one per line,
we get the right thing: 97, 32, 937, 32 and 98.
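
One way to write that check is sketched below. It assumes the string
was built with chr as shown earlier; @code_points is just a name
picked for this example.

```perl
use strict;
use warnings;

my $string = "a " . chr(937) . " b";

# split with an empty pattern breaks the string into characters;
# ord then returns the code point of each one
my @code_points = map { ord } split //, $string;

# one code point per line: 97, 32, 937, 32, 98
print "$_\n" for @code_points;
```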

The trouble comes when you want to print out the characters
themselves. The question is of course "how do you send the character
out to the terminal?" Printing the a out is trivial; just sending
the byte 97 to the terminal will cause it to render a letter a on
the screen. However, there isn't a single byte that always represents
omega. It depends on the encoding that your terminal is using at the
time. You need to know the correct byte (or bytes) to send to the
terminal, in the encoding scheme that it's using, to get it to display
the letter you want.

For example, if you set your terminal to use "iso-8859-7" then sending
byte 217 to it will cause it to print an omega (whereas if you have it
set to latin-1 as normal it'll just print a Ù.) If you
have a utf8 terminal then you'll need to send it the multi-byte
sequence of 206 and 169. The byte sequence you use is purely
arbitrary - it's whatever is defined by the encoding you're
using.

So how do you work out what to send?

Using Encode to do the Character Translation

The Encode module that ships with perl 5.8 can be used to encode
strings that perl holds in memory into byte representations (and,
in fact, to go the other way and decode byte representations into perl
strings.) For example, converting our string into "iso-8859-7":

use Encode;
my $bytes = Encode::encode("iso-8859-7", $string);

The scalar $bytes now contains the bytes that represent $string in
the encoding we passed. Printing out one byte per line like we did
for the characters above gives us:

97
32
217
32
98
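
The listing above can be reproduced with something like the following
sketch, which reuses the same split and ord approach on the encoded
scalar. @byte_values is a name chosen for this example.

```perl
use strict;
use warnings;
use Encode;

my $string = "a " . chr(937) . " b";
my $bytes  = Encode::encode("iso-8859-7", $string);

# $bytes holds one byte per element after the split, so ord returns
# byte values here rather than Unicode code points
my @byte_values = map { ord } split //, $bytes;

# one byte per line: 97, 32, 217, 32, 98
print "$_\n" for @byte_values;
```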

Most of the numbers are the same - because for those characters the
byte that represents them in iso-8859-7 is the same as the Unicode
code point. Only 937 has changed to 217. Printing this scalar to our
iso-8859-7 terminal causes it to display the right characters:

a Ω b
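
The same module covers the other cases mentioned so far. The sketch
below, with variable names invented for the example, encodes the
string for a utf8 terminal instead and then decodes the iso-8859-7
bytes back into a perl string.

```perl
use strict;
use warnings;
use Encode;

my $string = "a " . chr(937) . " b";

# Encoded as UTF-8, omega takes two bytes, so we get six bytes in total
my $utf8_bytes  = Encode::encode("UTF-8", $string);
my @utf8_values = map { ord } split //, $utf8_bytes;

# Going the other way: iso-8859-7 bytes back into a character string
my $greek_bytes = Encode::encode("iso-8859-7", $string);
my $decoded     = Encode::decode("iso-8859-7", $greek_bytes);

print join(" ", @utf8_values), "\n";   # 97 32 206 169 32 98
print "round trip ok\n" if $decoded eq $string;
```

Decoding needs to be told which encoding the bytes are in; the bytes
themselves don't carry that information.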

Automatic Translation

Whenever you print something out with Perl it has to work out which
bytes to send for each character. By default (in latin-1 locales at
least) it does no translation on the characters it's printing out,
mapping each character's code point directly to the byte it prints
(this also means that when you print scalars that contain binary data
or already encoded byte sequences then thankfully no extra encoding
happens.) A code point above 255 can't be mapped to a single byte
this way; perl falls back to writing that string out as UTF-8 and,
with warnings enabled, complains about a "Wide character".
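
To see the default behaviour without involving a real terminal, the
sketch below prints to an in-memory filehandle and inspects what
arrived. It assumes no default layers have been switched on (for
example through the open pragma or the PERL_UNICODE environment
variable); $output and @byte_values are names made up for the example.

```perl
use strict;
use warnings;

# An in-memory filehandle lets us look at exactly what print writes
my $output = '';
open my $fh, '>', \$output or die "Cannot open in-memory handle: $!";

# chr(233) is e-acute in latin-1; with no encoding layer on the
# handle it is written out as the single byte 233
print {$fh} "caf" . chr(233);
close $fh;

my @byte_values = map { ord } split //, $output;
print join(" ", @byte_values), "\n";   # 99 97 102 233
```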

It's possible to tell perl to automatically translate the string into
the correct format when you print it out. For example, to write a file
as iso-8859-7 you can use a PerlIO layer to do the translation for you: