But Wait! There's More!

07/30/1998

Remember the predecessor to today's infomercials? I remember the Ginsu Knife. First they'd cut a tomato, and ask what you'd pay for the knife. Next they'd start cutting wood, bricks, glass, metal and finally tell you the price. After that, they'd keep saying "but wait, there's more" and throw in all kinds of extra things for the same price.

Sometimes teaching a Perl programming class seems like one of those commercials -- especially when talking about Regular Expressions (regexp, for short). No matter how much time you spend on them, it seems like there is always more. This month's column, as an extension to the regexp series we have been doing, will address some of Perl's regexp extensions. I will assume you are familiar with Perl, and know how to use regexps in Perl.

OVER EXTENDED

It is easy to overextend a financial budget, but with computers we often overextend a character set. Such is life with regexps. There are so many ways we might want to match text, yet we have to use text characters to invoke those meanings. This means that we have to either have many meanings for the same character depending on where it is used, or use multiple characters for a single meaning.

Perl's designers mostly chose the latter route for their extensions. The common syntax for most extensions is:

(?<extension-character> regexp-atom)

In English, this means "a parenthesis whose first character is a question mark invokes the extensions, the next character determines which extension, and the rest of the parentheses might include part of the expression." Then again, maybe that is just as confusing, so let's do a couple of examples.

First, here is a chart of the extension characters that are valid right after the question mark inside the parentheses:

# comment

: Non-backreferencing

= Zero width positive lookahead assertion

! Zero width negative lookahead assertion

<c> One of the match modifiers (ismx)

The simplest example would be to use the simplest extension. The difference between /test/ and /test/i is that the second regexp uses the i match modifier. That means ignore case in the match, so it would also match TeSt.

We can turn the ignore case option on within a regexp using the extensions:

/(?i)test/;

This would be the same as using /test/i, so there is not much point in doing it this way. The real use is when you are going to store the pattern you are looking for in a variable, and some data is case sensitive, some is not. Here is an example:

$item="(?i)$name"; /$item/;

$item="$name"; /$item/;

The first line looks for the contents of the "name" variable case-insensitively; the second line requires a case-correct match.
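As a rough sketch of the two lines above in action (the variable names and the sample strings are made up for illustration), the embedded modifier travels with the pattern string, so the same match statement behaves differently depending on what was stored:

```perl
use strict;
use warnings;

# Hypothetical data: $name holds the text we want to find.
my $name = "test";

my $loose  = "(?i)$name";   # embedded modifier: ignore case
my $strict = $name;         # no modifier: case must match

my @candidates = ("test", "TeSt", "toast");

my @hits_loose  = grep { /$loose/ }  @candidates;
my @hits_strict = grep { /$strict/ } @candidates;

print "loose:  @hits_loose\n";    # loose:  test TeSt
print "strict: @hits_strict\n";   # strict: test
```

If $name could contain regexp metacharacters, you would probably want to quote it first (for example with quotemeta) before building the pattern.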

You can also embed comments in a regexp in much the same manner. Continuing this example:

$item="(?i)$name"; /$item(?# Item is really $name)/;

$item="(?i)$name"; /$item/;

These two lines are identical in operation since the string inside the (?# ) extension will not be part of the regexp. This is rather handy when a regexp is really long. You can document each piece of the regexp separately.

LONG AND WINDING REGEXPS

Long regexps can start to look like line noise due to all the funny characters squished together. Because of this, another useful feature for documenting or making a long regexp readable is the "extended" match modifier feature. This is invoked with either of these:

/test/x;

/(?x)exp/;

Like the i modifier, which method you use depends on whether you are storing the regexp in a variable or not. The x modifier means make white space meaningless, so you can spread out, or group sections of the expression for better readability. These two expressions are identical:

/test/;

/t e s t/x;

In fact, the second one could be spread over multiple lines. Since all white space is ignored, you need to be careful if you actually want white space in the expression. If we wanted to match "E code," the expression /E code/x would not work since the white space would be ignored. We need to encase the space character in a character class like this:

/E[ ]code/x;
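Here is a small sketch of that pitfall, using made-up variable names. It also shows one alternative the text above does not mention: under the x modifier a backslash-escaped space is taken literally as well.

```perl
use strict;
use warnings;

my $text = "E code";

# Under /x the bare space vanishes, so this is really /Ecode/.
my $bare    = ($text    =~ /E code/x)   ? 1 : 0;   # 0
my $squish  = ("Ecode"  =~ /E code/x)   ? 1 : 0;   # 1

# Two ways to keep a literal space under /x.
my $class   = ($text    =~ /E[ ]code/x) ? 1 : 0;   # 1
my $escaped = ($text    =~ /E\ code/x)  ? 1 : 0;   # 1

print "bare=$bare squish=$squish class=$class escaped=$escaped\n";
```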

Note that in an extended regexp, you can add comments just like in the rest of a Perl program. A # character means comment till end of line. Combining the extended and comment extensions we can write expressions that look like this:

/(?x)(?# put the x up front so you will know)

# but then, this is a comment also

\d+ # need one or more digits

( # start a group

\. # literal period (decimal point)

\d* # may be some more digits

)?/; # end group, and make it optional
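To try the commented expression out, one option is to store it in a compiled pattern and test a few strings. This is only a sketch: the qr operator, the ^ and $ anchors, and the sample strings are my additions for the demonstration, not part of the expression above.

```perl
use strict;
use warnings;

# The same pieces as above, kept in a variable with the x modifier.
my $number = qr{
    \d+      # need one or more digits
    (        # start a group
      \.     # literal period
      \d*    # may be some more digits
    )?       # end group, and make it optional
}x;

my @samples = ("123", "123.45", "12.", "abc", ".5");

# Anchor the pattern so the whole string has to be a number.
my @verdicts = map { /^$number$/ ? "yes" : "no" } @samples;

print "$samples[$_]: $verdicts[$_]\n" for 0 .. $#samples;
# 123: yes, 123.45: yes, 12.: yes, abc: no, .5: no
```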

DON'T GO BACK!

The last example above, in non-extended form, would be /\d+(\.\d*)?/. Basically, it means match 123 or 123.45. In Perl there is a way to use regexps to extract a match from a string. For example:

$_='a 13 b 24';

@x=/(\d+)/;

print "@x";

The code above would print "13." This is due to two things. First, we put the entire regexp in parentheses. Second, we caught the return value in an array variable, which puts the match operator in list (array) context, where it returns whatever the parentheses matched. Note the results if we tried to use the regexp from above with the extra parentheses added:

$_='12.56 78';

@x=/(\d+(\.\d*)?)/;

print "@x";

The code above would print "12.56 .56." Note that there were two values returned since there were two sets of parentheses. The second set, (\.\d*)?, was needed for grouping, but we did not want to have it returned (back referenced). We can fix this with the non-backreference extension character :. For example:

$_='12.56 78';

@x=/(\d+(?:\.\d*)?)/;

print "@x";

The code above would print "12.56" as desired.
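Putting the two versions side by side makes the difference easy to see. A minimal sketch, with variable names of my own choosing:

```perl
use strict;
use warnings;

my $str = '12.56 78';

# Both sets of parentheses capture, so two values come back.
my @with    = $str =~ /(\d+(\.\d*)?)/;

# The inner group only groups, so one value comes back.
my @without = $str =~ /(\d+(?:\.\d*)?)/;

print scalar(@with),    " values: @with\n";      # 2 values: 12.56 .56
print scalar(@without), " values: @without\n";   # 1 values: 12.56
```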

As a side note, there is a global option, which would return all matches:

$_="a 23 b19 c 3";

@x=/\d+/g; # or /(\d+)/g

print "@x";

The code above would print "23 19 3." This would allow us to extract most of the floating point numbers from a string with this construct:

@x=/(\d+(?:\.\d*)?)/g;

I say "most of" since this is not a very good expression for the purpose, but it illustrates the concept without creating so much line noise.
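A quick sketch of both the good and the not-so-good behavior. The sample strings are invented; the second one is there to show the kind of input this simple expression gets wrong (a leading decimal point and an exponent).

```perl
use strict;
use warnings;

my $text = "pi 3.14 e 2.718 n 42";
my @nums = $text =~ /(\d+(?:\.\d*)?)/g;
print "@nums\n";    # 3.14 2.718 42

# Where "most of" bites: ".5" loses its point, "1e10" splits in two.
my $tricky = ".5 and 1e10";
my @odd = $tricky =~ /(\d+(?:\.\d*)?)/g;
print "@odd\n";     # 5 1 10
```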

SNEAKING A PEEK

Perl also allows you to "peek outside" of the expression to see if you have a match, but without actually matching what you are peeking at. Let's suppose we wanted to figure out which items in some file were turned on. For example:

NIS=1 WIN95=0 XT=1

In the string above, we want to know that NIS and XT are turned on. In English, we are looking for an item name, followed by "=1" not "=0." If we used:

@on=/[A-Z]+=1/g;

The array @on would have elements of "NIS=1 XT=1." This sort of works, but is not exactly what we wanted. (But wait, there's more!!!).

In this case we can use a positive lookahead assertion to say "only match a set of upper-case characters if it is followed by '=1'." For example:

@on=/[A-Z]+(?==1)/g;

The array @on would now have elements of "NIS XT."
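Here are both attempts run against the sample line, as a small sketch (the variable names are mine). Note the doubled equals sign in the lookahead: the first one belongs to the (?= ) syntax, the second is the literal "=" we are peeking at.

```perl
use strict;
use warnings;

my $line = "NIS=1 WIN95=0 XT=1";

# Without lookahead, the "=1" is part of what gets returned.
my @plain = $line =~ /[A-Z]+=1/g;

# With lookahead, "=1" must follow but is not part of the match.
my @on    = $line =~ /[A-Z]+(?==1)/g;

print "@plain\n";   # NIS=1 XT=1
print "@on\n";      # NIS XT
```

One caveat with this sketch: [A-Z]+ does not match digits, so an item named WIN95 would not be reported even if it were set to 1. A class such as [A-Z0-9]+ would be one way around that.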

Suppose we wanted to print the first name of people whose last name is not "simpson" from a file with one name per line like this:

bart simpson

doug billion

Ted Simpson

We could try a command like this:

perl -n -e 'print "$&\n" if /^(\w+)\b(?!\s+simpson\b)/i;'

which would result in:

doug

Pretty exciting, huh? Using the (?! ) extension, a negative lookahead assertion, means that the enclosed expression must NOT come after what we are looking for. For example:

/test(?!$)/

would only match "test" when it was not at the end of the string ($ means end of string when it is at the end of the regexp).
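Negative lookahead is easy to get subtly wrong, because the regexp engine will happily back up and match less text if that lets the assertion pass. The sketch below (names and test strings are illustrative) shows why the name search is safer with a ^ anchor and a \b word boundary, and then tries the "not at the end" example.

```perl
use strict;
use warnings;

my @lines = ("bart simpson", "doug billion", "Ted Simpson");

# Anchored at the start, and \b forces \w+ to take the whole first name.
my @first = map { /^(\w+)\b(?!\s+simpson\b)/i ? $1 : () } @lines;
print "@first\n";    # doug

# Without the anchor and boundary, \w+ backs off to "bar", which is
# followed by "t" rather than " simpson", so the assertion passes.
my ($naive) = "bart simpson" =~ /(\w+)(?!\s+simpson\b)/i;
print "$naive\n";    # bar

# "test" only when something follows it.
my $inside = ("testing" =~ /test(?!$)/) ? 1 : 0;   # 1
my $at_end = ("a test"  =~ /test(?!$)/) ? 1 : 0;   # 0
print "inside=$inside at_end=$at_end\n";
```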

So, now that you have seen all this stuff that Perl can do with regexps, how much would you pay for it? Our price is only $0 plus $0 shipping, in three easy payments. (Then again, you can download it free from www.perl.org or www.perl.com.)

Fred was last seen preening in front of a mirror for a job as an infomercial spokesmodel.