NAME

perlunicook - cookbookish examples of handling Unicode in Perl

DESCRIPTION

This manpage contains short recipes demonstrating how to handle common Unicode
operations in Perl, plus one complete program at the end. Any undeclared
variables in individual recipes are assumed to have a previous appropriate
value in them.

EXAMPLES

℞ 0: Standard preamble

Unless otherwise noted, all examples below require this standard preamble
to work correctly, with the #! line adjusted to work on your system:
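
A minimal sketch of such a preamble, assuming Perl 5.12 or later and only core pragmas, might look like the following. Treat the exact set of pragmas as one reasonable choice rather than the only one; the variable $preamble_check is purely illustrative.

```perl
#!/usr/bin/env perl

use utf8;                        # the source code itself is UTF-8
use v5.12;                       # enables say and the unicode_strings feature
use strict;                      # declare variables, quote strings
use warnings;                    # enable the usual warnings
use warnings  qw(FATAL utf8);    # treat encoding glitches as fatal errors
use open      qw(:std :encoding(UTF-8));  # UTF-8 on STDIN/STDOUT/STDERR and new handles
use charnames qw(:full :short);  # \N{...} names; loaded automatically from v5.16

# Quick sanity check: strings are measured in characters, not bytes.
my $preamble_check = "caf\N{LATIN SMALL LETTER E WITH ACUTE}";
say length $preamble_check;
```

The later sketches repeat only the pragmas they actually need, so that each one can be run on its own.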

℞ 3: Declare source in utf8 for identifiers and literals

Without the all-critical use utf8 declaration, putting UTF‑8 in your
literals and identifiers won’t work right. If you used the standard
preamble just given above, this already happened. If you did, you can
do things like this:
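
As an illustration, here is a short sketch with UTF-8 in both string literals and identifiers. The variable names and values are examples chosen for this sketch, and the file is assumed to be saved as UTF-8.

```perl
use utf8;      # tells perl the source text below is UTF-8
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));

my $measure = "Ångström";                   # UTF-8 inside a string literal
my @μsoft   = qw( cp852 cp1251 cp1252 );    # Greek letter in an identifier
my @鯉      = qw( koi8-f koi8-u koi8-r );   # CJK ideograph as an identifier

print "$measure has ", length($measure), " characters\n";
print "μsoft holds ", scalar(@μsoft), " code pages\n";
print "鯉 holds ", scalar(@鯉), " encodings\n";
```

With use utf8 in effect, length counts characters in the literal rather than the bytes used to store it in the file.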

℞ 11: Names of CJK codepoints

Sinograms like “東京” come back with character names of
CJK UNIFIED IDEOGRAPH-6771 and CJK UNIFIED IDEOGRAPH-4EAC,
because their “names” vary. The CPAN Unicode::Unihan module
has a large database for decoding these (and a whole lot more), provided you
know how to understand its output.
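
To see those generated names directly, here is a small sketch that uses only the core charnames module and its viacode function. The variable names are illustrative, and Unicode::Unihan is deliberately left out because it has to be installed from CPAN.

```perl
use utf8;
use strict;
use warnings;
use charnames ();                       # core module; provides viacode()
use open qw(:std :encoding(UTF-8));

my $str = "東京";

# Look up the name of each codepoint in the string.
my @names = map { charnames::viacode(ord $_) } split //, $str;

printf "U+%04X  %s\n", ord(substr($str, $_, 1)), $names[$_] for 0 .. $#names;
```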

℞ 27: Unicode normalization

Typically render into NFD on input and NFC on output. Using NFKC or NFKD
functions improves recall on searches, assuming you've already done the
same to the text to be searched. Note that this is about much more than just
pre-combined compatibility glyphs; it also reorders marks according to their
canonical combining classes and weeds out singletons.
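
The effects described above can be sketched with the core Unicode::Normalize module. The sample strings are assumptions picked to show one case each: composition, compatibility folding, mark reordering, and singleton replacement.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC);   # core module

# Canonical forms: composed vs. decomposed e-grave.
my $composed   = "cr\x{E8}me";        # 5 codepoints
my $decomposed = NFD($composed);      # "cre\x{300}me", 6 codepoints
my $recomposed = NFC($decomposed);    # back to 5 codepoints

# Compatibility forms fold things like the "fi" ligature (U+FB01).
my $folded = NFKC("\x{FB01}n");       # plain "fin"

# Marks are reordered by canonical combining class:
# U+0323 (class 220) sorts before U+0301 (class 230).
my $reordered = NFD("a\x{301}\x{323}");   # "a\x{323}\x{301}"

# Singletons are replaced: ANGSTROM SIGN (U+212B) becomes U+00C5.
my $angstrom = NFC("\x{212B}");

printf "%d %d %d\n", length($composed), length($decomposed), length($recomposed);
```

Note that NFKC and NFKD lose information (the ligature cannot be recovered from "fin"), which is why they suit search keys better than stored text.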

℞ 31: Extract by grapheme instead of by codepoint (substr)
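
One way to do this with core Perl alone is the \X regex escape, which matches a single extended grapheme cluster. The sketch below assumes a decomposed sample string and contrasts it with a plain substr; the variable names are illustrative.

```perl
use strict;
use warnings;

# Decomposed text: each accented letter is two codepoints but one grapheme.
my $str = "cre\x{300}me bru\x{302}le\x{301}e";

# substr counts codepoints, so five of them cover only four graphemes here.
my $five_codepoints = substr($str, 0, 5);       # "cre\x{300}m"

# \X{5} matches five whole graphemes, marks included.
my ($five_graphemes) = $str =~ /\A(\X{5})/;     # "cre\x{300}me"
```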

℞ 32: Reverse string by grapheme

Reversing by codepoint messes up diacritics, mistakenly converting
crème brûlée into éel̂urb em̀erc instead of into eélûrb emèrc;
so reverse by grapheme instead. Both these approaches work
right no matter what normalization the string is in:
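
A sketch of one such approach, using only the core \X regex escape, is shown next to the naive codepoint reversal for comparison. The sample string is written in decomposed form so that the difference is visible; the variable names are illustrative.

```perl
use strict;
use warnings;

# Decomposed "creme brulee": each accent is a separate combining mark.
my $str = "cre\x{300}me bru\x{302}le\x{301}e";

# Codepoint reversal puts each combining mark in front of its base letter,
# so the mark attaches to whatever character now precedes it.
my $by_codepoint = scalar reverse $str;

# Grapheme reversal: split into \X clusters, reverse the list, rejoin.
my $by_grapheme = join "", reverse $str =~ /\X/g;
```

Because \X keeps a base character and its marks together, the same line also gives the expected result for a precomposed (NFC) string.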

℞ 44: PROGRAM: Demo of Unicode collation and printing

Here’s a full program showing how to make use of locale-sensitive
sorting, Unicode casing, and managing print widths when some of the
characters take up zero or two columns, not just one column each time.
When run, the following program produces this nicely aligned output:
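
As a much smaller sketch of the same ideas, the following uses only the core Unicode::Collate module with its default (not locale-tailored) ordering. The word list and the pad helper are hypothetical, and the padding counts graphemes only, so it assumes the input has no double-width characters.

```perl
use utf8;
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));
use Unicode::Collate;                 # core module

# Hypothetical sample data for this sketch.
my @words = ("zebra", "apple", "Äpfel", "cre\x{300}me");

# Alphabetic ordering rather than codepoint ordering.
my @sorted = Unicode::Collate->new->sort(@words);

# Pad to a width counted in graphemes, so combining marks take no column.
# Assumption: no double-width (East Asian) characters in the input.
sub pad {
    my ($text, $width) = @_;
    my $graphemes = () = $text =~ /\X/g;
    my $missing   = $width - $graphemes;
    return $text . (" " x ($missing > 0 ? $missing : 0));
}

printf "%s|%s\n", pad(ucfirst($_), 8), length($_) for @sorted;
```

A plain sort would place "Äpfel" after "zebra" because it compares codepoints; the collator compares letters first and accents and case only as tie-breakers.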

The Unicode::Tussle CPAN module includes many programs to help with working
with Unicode, including these programs to fully or partly replace standard
utilities: tcgrep instead of egrep, uniquote instead of cat -v or hexdump,
uniwc instead of wc, unilook instead of look, unifmt instead of fmt, and
ucsort instead of sort. For exploring Unicode character names and character
properties, see its uniprops, unichars, and uninames programs. It also
supplies these programs, all of which are general filters that do Unicode-y
things: unititle and unicaps; uniwide and uninarrow; unisupers and unisubs;
nfd, nfc, nfkd, and nfkc; and uc, lc, and tc.

Finally, see the published Unicode Standard (page numbers are from version
6.0.0), including these specific annexes and technical reports: