Regular expressions in Perl 5.10

Perl 5.10 was released late last year, and with it come a number of
significant improvements to the language. We'll be running a series of
Perl tips covering some of the changes, and how you can use them to
make your life easier.

say

Perl 5.10 finally has print with a newline! It's called say, and
can be enabled with:

use feature 'say';

at the top of any program or module that needs it. You can then
simply write:

say "Hello World!"; # No \n needed!

rather than:

print "Hello World!\n";

While we'll be discussing new functions and constructs in a later
Perl-tip, the say function is so handy we wanted to mention it
before anything else.
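
As a small sketch of how say treats its arguments, the snippet below
writes to an in-memory filehandle so the exact output can be
inspected. The variable names and sample strings are our own, chosen
only for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Write to an in-memory filehandle so we can see exactly what say emits.
my $output = '';
open my $fh, '>', \$output or die "Cannot open in-memory file: $!";

say {$fh} "Hello World!";      # newline appended automatically
say {$fh} "a", "b", "c";       # one newline after the whole list
say {$fh} $_ for qw(x y);      # one line per item

close $fh;
print $output;
```

Note that say adds a single newline after its whole argument list,
not after each argument.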

Debugging regular expressions

One of the largest improvements to Perl 5.10 has been in the area of
regular expressions (regexes). To get started, it's now possible to
debug your regexes with:

use re 'debug';
$some_string =~ /some_regexp/;

use re 'debug' also existed in Perl 5.8. However, its behaviour there
was global, resulting in debugging information for all your regexes.
In 5.10 the pragma has lexical scope, meaning it lasts only until the
end of the current block, file, or eval.
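
A minimal sketch of the lexical scoping, assuming Perl 5.10 or later:
only the match inside the bare block should produce debugging output
(on STDERR). The sample string and variable names are made up for this
example.

```perl
use strict;
use warnings;

my $string = "order 42";
my $digits;

{
    # Debugging is enabled only inside this block.
    use re 'debug';
    ($digits) = $string =~ /(\d+)/;
}

# Outside the block the pragma is no longer in effect, so this
# match should run without any debugging output.
my $has_order = $string =~ /order/ ? 1 : 0;

print "Digits: $digits\n";
```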

Named capture variables

We've always been able to capture information from regular expressions
using parentheses, and to recall it using the match variables
$1, $2, $3.... However, sometimes it can be rather challenging to
tell which match variable you want.

This can be doubly challenging when we interpolate smaller regexps
into bigger ones. For example, what match variable will the last
sequence of digits be placed into in the following expression?

/ (\d+) $customer_name_regexp (\d+) /x;

Keep in mind that $customer_name_regexp may or may not contain
parentheses itself.
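
To make the problem concrete, here is a sketch in which
$customer_name_regexp is assumed to contain two capture groups of its
own. The sub-pattern, the sample line, and the extra \s+ separators
are our own illustration and not part of the expression above.

```perl
use strict;
use warnings;

# Hypothetical sub-pattern with two capture groups of its own.
my $customer_name_regexp = qr/ ([A-Za-z]+) \s+ ([A-Za-z]+) /x;

my $line = "1234 Jane Citizen 500";

my ($first_digits, $last_digits);
if ($line =~ / (\d+) \s+ $customer_name_regexp \s+ (\d+) /x) {
    $first_digits = $1;
    # The interpolated groups took $2 and $3, so the final
    # sequence of digits lands in $4 rather than $2.
    $last_digits  = $4;
    print "First: $first_digits, last: $last_digits\n";
}
```

If someone later adds or removes a pair of parentheses in
$customer_name_regexp, the number of the last capture changes with it.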

In Perl 5.10 we can now have named captures. This means we can write:

/ (?<account>\d+) $customer_name_regexp (?<credit>\d+) /x;

Using (?<name>...) syntax allows us to capture a match and then
later refer to it by name. We can also refer to it by its regular
match number, so our account match above can still be referred to
as $1.

In order to retrieve named match information, we can use the special
hash %+, keyed by capture name. The account capture above, for
example, is available as $+{account}.
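
A short sketch of reading named captures back out of %+. As before,
the contents of $customer_name_regexp, the sample line, and the \s+
separators are assumptions made for the sake of a runnable example.

```perl
use strict;
use warnings;

# Hypothetical sub-pattern; its own parentheses no longer matter to us.
my $customer_name_regexp = qr/ ([A-Za-z]+) \s+ ([A-Za-z]+) /x;

my $line = "1234 Jane Citizen 500";

my ($account, $credit, $by_number);
if ($line =~ / (?<account>\d+) \s+ $customer_name_regexp \s+ (?<credit>\d+) /x) {
    $account   = $+{account};   # looked up by name
    $credit    = $+{credit};    # unaffected by groups in the sub-pattern
    $by_number = $1;            # the account capture is still $1
    print "Account $account has credit $credit\n";
}
```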

Alternatives to $`, $& and $'

The special regex variables $`, $& and $' hold everything before,
inside, and after a match respectively. However, they come at a great
cost: mentioning one of these special variables anywhere in your
program turns them on for all your regular expressions, even those
that don't need them. As such, the use of these variables is strongly
discouraged in all but the simplest of programs.

However they can be very useful. There are some algorithms that
really appreciate knowing everything that was before or after a given
match.

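In Perl 5.10 there's a new regexp modifier, /p, that gives us all
the convenience of the old $`, $& and $' variables, but without
the global performance penalty. After a successful match that uses
/p, the text before, inside, and after the match is available in
three new variables.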

The ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} variables are
only set when the /p switch is used.
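
As a minimal sketch, assuming Perl 5.10 or later, the three variables
can be read after a successful match that uses /p. The sample text and
variable names below are invented for illustration.

```perl
use strict;
use warnings;

my $text = "The quick brown fox";

my ($before, $matched, $after);
if ($text =~ /quick/p) {
    $before  = ${^PREMATCH};    # text before the match
    $matched = ${^MATCH};       # the matched text itself
    $after   = ${^POSTMATCH};   # text after the match
    print "[$before][$matched][$after]\n";
}
```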

More information

This tip only reveals some of the improvements made to the regexp
engine in Perl 5.10. A lot of advanced features have been added, and
a lot of new optimisations and improvements have been made under the
hood.