"Linux Gazette...making Linux just a little more fun!"

Learning Perl, part 2

"I realized at that point that there was a huge ecological niche
between the C language and Unix shells. C was good for manipulating complex
things - you can call it 'manipulexity.' And the shells were good
at whipping up things - what I call 'whipupitude.' But there was this big
blank area where neither C nor shell were good, and that's where I aimed
Perl." -- Larry Wall, author of Perl

Overview

In the first part, we talked about some basics and general issues in
Perl - writing a script, hash-bangs, style - as well as a number of specifics,
such as scalars, arrays, hashes, operators, and quoting methods. This month,
we'll take a look at the intrinsic Perl tools that make it so easy to use
from the command line, as well as their equivalents in scripts. We'll also
go a little deeper into quoting methods, and get a bit of a start on regexes
(regular expressions, or REs) - one of the most powerful tools in Perl,
and one that deserves an entire book all its own. [1]

Quote Mechanisms

Most of you will be familiar with the standard quoting mechanisms in
Unix: the single and the double quote, which I'd already mentioned in my
previous article, have much the same functionality in Perl as they do in
the shell. Sometimes, though, escaping all the in-line metacharacters can
be a bit painful. Imagine trying to print a string like this:

``/// Don't say "shan't," "can't," or "won't." ///''

Good grief! What can we do with a mess like that?

Well, we could put in a whole bunch of escapes ("\"), but that would
be a pain - as well as a case of the LTS ("Leaning Toothpick Syndrome"):

print '\`\`\/\/\/ Don\'t...

<shudder> Obviously not a good answer. For times like these, Perl
provides alternate quoting mechanisms:

q//     Single quotes
qq//    Double quotes
qx//    Back quotes (for shell execution)

Note also that the delimiter does not have to be '/', but can be any
character. Now our job becomes a bit easier:

print q-``/// Don't say "shan't," "can't," or "won't." ///''-;

Simple, eh? By the way, this is something you would use only inside
a script; the shell interpretation mechanism would make a horrendous mess
of this if you tried it from the command line, especially things like back
quotes and slashes.
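As a rough sketch of how this looks inside a script: 'q' with a custom delimiter acts like single quotes, while 'qq' acts like double quotes and still interpolates. The variable names and the second sample string are made up for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# 'q' with '-' as the delimiter: acts like single quotes, so the
# back quotes, slashes, and double quotes inside need no escapes.
my $plain = q-``/// Don't say "shan't," "can't," or "won't." ///''-;

# 'qq' with braces as delimiters: acts like double quotes, so
# variables and "\n" are still interpolated.
my $who   = "Perl";    # hypothetical variable, for illustration only
my $fancy = qq{$who says: "Don't escape, delimit!"\n};

print "$plain\n";
print $fancy;
```

The only thing to watch for is that the delimiter you pick does not occur inside the string itself.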

Perl Invocation

"Hear my plea, O Perl of Great Wisdom!" Oh, never mind; I think that
was standard in Perl3, and is now deprecated... :)

The most commonly-used switch in invoking Perl, if you're running it
from the command line, is '-e'; this one tells Perl to execute whatever
comes immediately after it. In fact, '-e' must be the last switch in a
bundled group such as '-we', because whatever follows it is taken to
be the script!

perl -we 'print "The Gods send thread for the Web begun.\n"'

"-w" is the "warn" switch that I mentioned the last time. It tells you
about all the non-fatal errors in your code, including variables that you
set but didn't use (invaluable for finding mistyped variable names) as
well as many, many other things. You should always - yes, always
- use "-w", whether on the command line or in a script.

"-n" is the "non-printing loop" switch, which causes Perl to iterate
over the input, one line at a time - somewhat like "awk". If you want to
print a given line, you'll need to specify a condition for it:

perl -wne 'print if /holiday/' schedule.txt

Perl will loop through "schedule.txt" and print any line that contains
the word "holiday", so you can get depressed about how little time off
you actually have.
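In a script, the loop that "-n" supplies has to be written out by hand. Here is a minimal sketch of that; the sample text stands in for "schedule.txt" and is invented for this example, and it is read through an in-memory filehandle so that the script is self-contained.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical stand-in for the contents of "schedule.txt".
my $schedule = "Monday: work\nTuesday: holiday\n"
             . "Wednesday: work\nThursday: holiday party\n";

# Open the string as if it were a file (an in-memory filehandle).
open my $fh, '<', \$schedule or die "Can't open in-memory file: $!";

my @found;

# Roughly what "-n" wraps around your code: read a line into $_,
# run the code, and repeat until the input runs out.
while (<$fh>) {
    push @found, $_ if /holiday/;
}
close $fh;

print @found;
```

With a real file named on the command line, the loop would read from "<>" instead of "$fh".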

"-p" is the invocation for a "printing loop", which acts just like "-n"
except that it prints every line that it loops over. This is very useful
for "sed"-like operations, like modifying a file and writing it back out
(we'll discuss 's///', the substitution operator, in just a bit):

perl -wpe 's/holiday/Party time!/' schedule.txt

This will perform the substitution on the first occurrence of the word
'holiday' in any given line (see "perldoc perlre" for discussion of modifiers
used with 's///', such as 'g'lobal.)
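To see what the 'g' modifier changes, here is a small sketch that applies the same substitution with and without it. The sample line is invented for the purpose.

```perl
#!/usr/bin/perl -w
use strict;

my $line = "holiday today, holiday tomorrow\n";    # hypothetical sample line

# Copy the line, then substitute in the copy.
(my $first = $line) =~ s/holiday/Party time!/;     # first occurrence only
(my $all   = $line) =~ s/holiday/Party time!/g;    # every occurrence

print $first, $all;
```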

The "-i" switch works well in combination with either of the above,
depending on the desired action; it allows you to perform an "in-place"
edit, i.e. make the changes in the specified file (optionally performing
a backup beforehand) rather than printing them out to the screen. Note
that we can't just tack an "i" onto the "wpe" string: it takes an optional
argument - the extension to be appended to the backup copy - and the text
that follows it is what specifies that extension.

perl -i~ -wpe 's/holiday/Party time!/' schedule.txt

The above line will produce a "schedule.txt" with the modified text
in it, and a "schedule.txt~" that is the original file. "-i" without any
extension overwrites the original file; this is far more convenient than
producing a modified file and renaming it back to the original, but be
sure that your code is correct, or you'll wipe out your original
data!
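The same in-place edit can be done from inside a script by setting the special variable '$^I', which plays the role of the "-i" switch. What follows is only a sketch: it creates its own throwaway "schedule.txt" in a temporary directory (an assumption made so that nothing real gets overwritten), edits it, and reads back both the result and the backup.

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempdir);

# Work in a throwaway directory so that no real file is at risk.
my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/schedule.txt";    # hypothetical sample file

open my $out, '>', $file or die "Can't write $file: $!";
print $out "Monday: work\nTuesday: holiday\n";
close $out or die "Can't close $file: $!";

{
    # $^I holds the backup extension, just like the argument to "-i";
    # "<>" then reads the files named in @ARGV, and "print" writes
    # back into the file being edited.
    local ( $^I, @ARGV ) = ( '~', $file );
    while (<>) {
        s/holiday/Party time!/;
        print;
    }
}

# Read a whole file into one string.
sub slurp {
    my $name = shift;
    open my $in, '<', $name or die "Can't read $name: $!";
    local $/;
    return scalar <$in>;
}

my $new = slurp($file);
my $old = slurp("$file~");
print "Modified:\n$new", "Backup:\n$old";
```

Keeping the 'local' inside its own block means that in-place editing is switched off again as soon as the block ends.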

RegExes, or "Has The Cat Been Walking On My Keyboard Again?"

One of the most powerful tools available in Perl, the regular expression
is the way to match almost any imaginable character arrangement. Here (necessarily)
I'll cover only the very basics; if you find that you need more information,
dig into the "perlre" manpage that comes with Perl. That should keep you
busy for a while. :)

REs are used for pattern matching, most commonly with the "m//" (matching)
and "s///" (substitution) operators. Note that the delimiters in these,
just like in the quoting mechanisms, are not restricted to '/'; in fact,
the leading 'm' in the matching operator is required only if a non-default
delimiter is used. Otherwise, just the "//" is sufficient.
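A different delimiter pays off most when the pattern itself contains slashes. Here is a short sketch, with a made-up path as the sample string:

```perl
#!/usr/bin/perl -w
use strict;

my $path = "/usr/local/bin/perl";    # hypothetical sample string

# Default delimiters: every '/' in the pattern has to be escaped.
my $toothpicks = $path =~ /^\/usr\/local\// ? "match" : "no match";

# With the leading 'm' and braces as delimiters, no escapes are needed.
my $clean = $path =~ m{^/usr/local/} ? "match" : "no match";

# The same goes for s///: here '#' is the delimiter.
(my $moved = $path) =~ s#^/usr/local#/opt#;

print "$toothpicks, $clean, $moved\n";
```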

Here are some of the metacharacters used with REs. Note that there are
many more; these are just enough to get us started:

.        Matches any character except the newline
^        Match the beginning of the line
$        Match the end of the line
|        Alternation (match "left|right|up|down|sideways")
*        Match 0 or more times
+        Match 1 or more times
?        Match 0 or 1 times
{n}      Match exactly n times
{n,}     Match at least n times
{n,m}    Match at least n but not more than m times

Let's say that we have a file of names, one first name and last name
per line, and we want to replace the first name with 'Captain'. Obviously, we
would go through the file with a printing loop and do a substitution if it
matched our criteria:

s/^.+ /Captain /;

The caret ('^') matches at the beginning of the line, the ".+" says
"any character, repeated 1 or more times", and the space matches a space.
Once we find what we're looking for, we're going to replace it with 'Captain'
followed by a space - since the string that we're replacing contains one,
we'll need to put it back.

Let's say that we also knew that somewhere in the file, there are a
couple of names that contain apostrophes (Francois L'Ollonais),
and we wanted to skip any line whose first name contained one - or
any other 'non-letter' character. (The regex below examines only the
first name, so an apostrophe in the last name slips through.) Let's
expand the regex a bit:

s/^[A-Z][a-z]* /Captain /;

We've used the "character class" specifiers, "[]", to first match one
character between 'A' and 'Z' - note that only one character is
matched by this mechanism, a very important distinction! - followed by
a one-character match of 'a' through 'z' and an asterisk, which, again,
says "zero or more of the preceding character".
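The "one character per class" rule, and the way quantifiers stretch it, can be checked with a few throwaway tests. This is just a sketch; the test words are arbitrary, and the '$' anchors are added here so that each pattern has to account for the whole word.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical test words; 1 means "matched", 0 means "did not match".
my %matched = (
    one_capital => ( "K"   =~ /^[A-Z]$/         ? 1 : 0 ),  # one character: fine
    two_letters => ( "Ku"  =~ /^[A-Z]$/         ? 1 : 0 ),  # a class never matches two
    star_zero   => ( "K"   =~ /^[A-Z][a-z]*$/   ? 1 : 0 ),  # zero lowercase letters is OK
    star_many   => ( "Kuo" =~ /^[A-Z][a-z]*$/   ? 1 : 0 ),  # so are several
    plus_zero   => ( "K"   =~ /^[A-Z][a-z]+$/   ? 1 : 0 ),  # '+' needs at least one
    counted     => ( "Kuo" =~ /^[A-Z][a-z]{2}$/ ? 1 : 0 ),  # exactly two lowercase
);

print "$_: $matched{$_}\n" for sort keys %matched;
```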

Oops, wait! How about "KuoHsing"? The match would fail on the
'H', since upper-case characters were not included in the specified range.
OK, we'll modify the regex:

s/^\w* /Captain /;

The '\w' is a "word character" - once again, it matches only one character
- that includes 'A-Z', 'a-z', '0-9', and '_'. It is preferable to
[A-Za-z0-9_] because, when "use locale" is in effect, it uses the system's
locale settings to determine what characters should or should not be part
of words - and this is important in languages other than English. As well,
'\w' is easier to type than '[A-Za-z0-9_]'.
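Here is a quick side-by-side of the last two substitutions, as a sketch; the last name in the sample line is invented, since only the first name matters here.

```perl
#!/usr/bin/perl -w
use strict;

my $name = "KuoHsing Lee";    # hypothetical sample line

# Copy the name, then try each substitution on its own copy.
(my $by_class = $name) =~ s/^[A-Z][a-z]* /Captain /;   # fails on the 'H'
(my $by_word  = $name) =~ s/^\w* /Captain /;           # matches all of "KuoHsing"

# '\w' also accepts digits and the underscore.
my $digits_too = "agent_007" =~ /^\w+$/ ? 1 : 0;

print "$by_class\n$by_word\n$digits_too\n";
```

A failed substitution simply leaves the string alone, which is why the first copy comes back unchanged.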

Let's try something a bit different: What if we still wanted to
match all the first names, but now, rather than replacing them, we wanted
to swap them around with the last names, separate the two with a comma,
and precede the last name with the word 'Captain'? With regexes at our
command, it's not a problem:

s/^(\w*) (\w*)$/Captain $2, $1/;

Note the parentheses and the "$1" and "$2" variables: the
parentheses "capture" the enclosed part of the regex, which we can
then refer to via the variables (the first captured piece is $1, the second
is $2, and so on.) So, here is the above regex in English:

Starting from the beginning of the line, (begin capture into $1)
match any "word character" repeated zero or more times (end capture) and
followed by a space, (begin capture into $2) followed by any "word character"
repeated zero or more times (end capture) until the end of the line. Return
the word 'Captain' followed by a space, which is followed by the value
of $2, a comma, a space, and the value of $1.
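To watch the captures at work, here is a sketch that runs the substitution over two sample lines. Both lines are assumptions chosen for the example, not the contents of any real file.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical sample lines.
my @crew = ( "Henry Morgan", "Francois L'Ollonais" );

my @renamed;
for my $line (@crew) {
    # $1 is the first name, $2 the last name.
    (my $copy = $line) =~ s/^(\w*) (\w*)$/Captain $2, $1/;
    push @renamed, $copy;
}

print "$_\n" for @renamed;
```

The second line comes through untouched: the apostrophe is not a "word character", so the second group cannot reach the end of the line and the whole match fails.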

I'd say that regexes are a very compact way to say all of the
above. At times like these, it becomes pretty obvious that Larry Wall is
a professional linguist. :)

These are just simple examples of what goes into building a regex. I
must admit to cheating a bit: name-parsing is probably one of the biggest
challenges out there, and I could have spun these examples out as long as
I wanted. Considering that the possibilities include "John deJongh", "Jan
M. van de Geijn", "Kathleen O'Hara-Mears", "Siu Tim Au Yeung", "Nang-Soa-Anee
Bongoj Niratpattanasai", and "Mjölby J. de Wærn" (remember to
use those locale-aware matches, right?), the field is pretty broad and
very odd in spots. (Miss Niratpattanasai, after looking at something like
"John Smith", would probably agree. :)

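Here's an important factor to be aware of in the regex mechanism: by
default, it does "greedy matching". In other words, given a phrase like

Acciones son amores, no besos ni apachurrones

and a regex like

/A.*es/

the match would be

Acciones son amores, no besos ni apachurrones
|___________________________________________|

Hmmm. Everything from the first 'A' (followed by zero or more of any
character) to the last 'es'. How can we match just the first instance,
then? To counteract the greed, Perl provides a "generosity" modifier, a
'?' placed right after quantifiers such as '*', '+', and '?':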

/A.*?es/

Acciones son amores, no besos ni apachurrones
|______|

There. Much better. For future reference, remember: if you're breaking
up a string by matching its pieces with a series of regexes, and the last
"chunks" are coming up empty, you've probably got a "greed" problem.

The Default Buffer/Variable

Some of you, especially those who have done some programming in the
past, have probably been curious about some of the code constructs above,
like

print if /holiday/;

"Print what if what? Where is the variable that we're
checking for the match? Shouldn't it be something like 'if $x == /holiday/',
the way it is in the shell?"

I'm glad you asked that question. :)

Perl uses an interesting concept, found in a few other languages, of
the default buffer - also referred to as the default variable
and the default pattern space. Not surprisingly, it's used in the
looping constructs - when we use the "-n/-p" syntax in the Perl invocation,
it is the variable used to hold the current line - as well as in substitution
and matching, and a number of other places. The '$_' variable is the default
for all of the above; when a variable is not specified in a place where
you'd expect one, '$_' is usually the "culprit." In fact, '$_' is rather
difficult to explain - it turns up in so many places that coming up with
an algorithm is seemingly impossible - but it is wonderfully easy and intuitive
to use, once you get the idea.

Consider the following:

perl -wne 'if ( $_ =~ /Henry/ ) { print $_; }' pirates

If a line in the "pirates" file, above, matches "Henry", it will be
printed. Fine; but now, let's play some amateur "Perl Golf" - that's a
contest among Perl hackers to see how many (key)strokes can be taken off
a piece of code and still leave it functional.

Since we already know that Perl reads each line into '$_', we'll just
get rid of all the explicit declarations of it:

perl -wne 'if ( /Henry/ ) { print; }' pirates

Perl "knows" that we're matching against the default variable, and it
"knows" that the "print" statement applies to the same thing. Now, we apply
a little Perl idiom:

perl -wne 'print if /Henry/' pirates

Isn't that nice? Perl actually allows you to write out your code with
the condition following the action; kinda the way you'd say things in English.
Oh, and we've snipped off the semicolon on the end because we don't need
it: it's a statement separator, and there's no statement following
"/Henry/".

<grin> For those of you playing along at home, try

perl -ne'/Henry/&&print' pirates

It shouldn't be that hard to figure out; the '&&' operator
in Perl works the same way as it does in the shell. Perl Golf is fun to
play, but be careful: it's easy to write code that will work but will
require lots of head-scratching to understand. Don't Do That. I may have
to maintain your code tomorrow... just like you may have to maintain mine.
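The same progression works inside a script, where "for" (rather than "-n") is what loads each item into '$_'. This is a sketch only: the list below is a made-up stand-in for the lines of the "pirates" file, and the matches are collected into arrays so that the three versions can be compared.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical stand-in for the lines of the "pirates" file.
my @pirates = ( "Henry Morgan\n", "Edward Teach\n", "Henry Every\n" );

my ( @long, @short, @golfed );

for (@pirates) {                     # "for" puts each element into $_
    if ( $_ =~ /Henry/ ) { push @long, $_; }
}

for (@pirates) {
    push @short, $_ if /Henry/;      # the match defaults to $_
}

for (@pirates) {
    /Henry/ && push @golfed, $_;     # '&&' runs the right side only on a match
}

print @short;
```

All three loops collect the same two lines; the middle one is probably the easiest to read a month later.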

In the first example, note the "binding operator", '=~', which checks
for a match in the supplied variable. This is what you would use if you
were matching against a variable other than "$_". There is also a "negative
match" operator, '!~', which returns true if the match fails (the inverse
of '=~'.)
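With a named variable, the two operators look like this. It's a minimal sketch, and the sample value is invented.

```perl
#!/usr/bin/perl -w
use strict;

my $captain = "Henry Morgan";    # hypothetical sample value

my $is_henry = $captain =~ /Henry/ ? "yes" : "no";   # '=~': true on a match
my $not_kidd = $captain !~ /Kidd/  ? "yes" : "no";   # '!~': true when there is no match

print "Henry? $is_henry\n", "Not Kidd? $not_kidd\n";
```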

Note also that the available modifiers for simple statements, like that
above, include not only the "if", but also "unless", "while", "until",
and "for". All of these, and more, are coming up in Part 3...

[1]. And in fact, has one - "Mastering Regular Expressions"
by Jeffrey E. F. Friedl is considered to be a reference on the subject. It
includes some wonderful examples, and literally teaches the reader to "think
in regex".