This chapter is from the book

Regular Expression Basics

Description

Understanding how to use regular expressions is fundamental to any Perl
programmer. The essential purpose of a regular expression is to match a pattern,
and Perl provides two operators for doing just that: m// (match) and
s/// (substitute). (The ins and outs of those operators are covered in
their own entries.)

When Perl encounters a regular expression, it's handed to a regular
expression engine and compiled into a special kind of state machine (a
Nondeterministic Finite Automaton). This state machine is used against your data
to determine whether the regular expression matches your data. For example, to
use the match operator to test whether the word fudge exists in a scalar
value:

$r=q{"Oh fudge!" Only that's not what I said.};
if ($r =~ m/fudge/) {
# ...
}

The regular expression engine takes /fudge/, compiles a state
machine to use against $r, and executes the state machine. If it was
successful, the pattern matched.

This was a simple example, and could have been accomplished more quickly with the
index function. The regular expression engine comes in handy because
the pattern can contain metacharacters. Regular expression metacharacters are
used to specify things that might or might not be in the data, things that might
look different (uppercase? lowercase?) in the data, or portions of the pattern
that you just don't care about.

The simplest metacharacter is the . (dot). Within a regular
expression, the dot stands for a "don't care" position. Any
character will be matched by a dot, so m/m..n/ matches moon, main, and m..n alike.

The exception is that a dot won't normally match a newline character.
For that to happen, the match must have the /s modifier tacked on to
the end. See the modifiers entry for details.
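A minimal sketch of the dot in action. The candidate strings and variable names below are illustrative assumptions, not taken from the text:

```perl
use strict;
use warnings;

# Four candidate strings; the last one has two newlines between m and n.
my @candidates = ("moon", "m..n", "mn", "m\n\nn");

# Without /s, each dot matches any one character except a newline.
my @plain  = map { /^m..n$/  ? 1 : 0 } @candidates;

# With /s, a dot may also match a newline.
my @with_s = map { /^m..n$/s ? 1 : 0 } @candidates;

print "@plain\n";    # 1 1 0 0
print "@with_s\n";   # 1 1 0 1
```

Note that "mn" never matches: each dot must consume exactly one character.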

Metacharacters stand in for other characters (see "Character
Shorthand") or stand in for entire classes of characters (character
classes). They also specify quantity (quantifiers), choices (alternators), or
positions (anchors).

In general, something that is normally a metacharacter can be made
"unspecial" by prefixing it with a backslash, which is sometimes
called "escaping" the character. So to match a literal m..n
(with real dots), escape each dot:

m/m\.\.n/; # Matches only m..n

The full list of metacharacters is \, |, ^,
$, *, +, ?, ., (,
), [, and {.

Everything else in Perl's regular expressions matches itself. A normal
character (nonmetacharacter) can sometimes be turned into a metacharacter by
adding a backslash. For example, "d" is just a letter "d".
However, preceded by a backslash, as in

/\d/

it matches a digit. More of this is covered in the "Character
Shorthand" section. The entire set of metacharacters, as well as some
contrived metacharacters, is covered elsewhere in this book.

As you browse the remainder of this section, keep in mind that there are just
a few rules associated with regular expression matching. These are summarized as
follows:

The goal is for the match to succeed as a whole. Everything else takes a
backseat to that goal.

The entire pattern must be used to match the given data.

The match that begins the earliest (the leftmost) will be taken
first.

Unless otherwise directed (with ?), quantifiers will always match as
much as possible, and still have the expression match.

To sum up: the largest possible first match is normally taken.
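The rules above can be seen at work in a short sketch. The sample string and variable names are assumptions chosen for illustration:

```perl
use strict;
use warnings;

my $data = "aaa bbbbbb";

# Leftmost wins: b+ is listed first and "bbbbbb" is longer,
# but "aaa" starts earlier in the string.
my ($leftmost) = $data =~ /(b+|a+)/;

# Quantifiers are greedy by default: a+ takes all three a's.
my ($greedy) = $data =~ /(a+)/;

# Overall success comes first: a+ gives one "a" back so that
# the trailing literal "a" can still match.
my ($gives_back) = $data =~ /(a+)a/;

# A trailing ? asks the quantifier to take as little as possible.
my ($minimal) = $data =~ /(a+?)/;

print "$leftmost $greedy $gives_back $minimal\n";   # aaa aaa aa a
```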

For more information on how regular expression engines work, see the book
Mastering Regular Expressions by Jeffrey Friedl.

See Also

m//, s///, character classes, alternation, quantifiers,
character shorthand, line anchors, word anchors, grouping, backreferences and
qr in this book

Basic Metacharacters and Operators

Match Operator

m//

Usage

m/pattern/modifiers

Description

The m// operator is Perl's pattern match operator. The pattern
is first interpolated as though it were a double-quoted string: scalar
variables are expanded, backslash escapes are translated, and so on. Afterward,
the pattern is compiled for the regular expression engine.

Next, the pattern is matched against the $_ variable
unless another target has been bound with the =~ operator.

In a scalar context, the match operator returns true if it succeeds and false
if it fails. With the /g modifier, in scalar context the match will
proceed along the target string, returning true each time, until the target
string is exhausted.

The modifiers (other than /g and /c) are described in the
Match Modifiers entry.

In a list context, the match operator returns a list consisting of all the
matched portions of the pattern that were captured with parentheses (as well as
setting $1, $2, and so on as a side effect of the match). If
there are no parentheses in the match, the list (1) is returned. If the
match fails, the empty list is returned.

In a list context with the /g modifier, the list of substrings
matched by capturing parentheses is returned for every match in the target
string. If no parentheses are in the pattern, it returns the entire contents
of each match.
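The different return values can be compared side by side. This is a sketch only; the sample string and the variable names are made up for the illustration:

```perl
use strict;
use warnings;

my $line = "x=1, y=22, z=333";

# Scalar context with /g: each test resumes where the previous match ended.
my @seen;
while ($line =~ /(\w)=(\d+)/g) {
    push @seen, "$1:$2";
}

# List context, no /g: the captures of the first match only.
my ($name, $value) = $line =~ /(\w)=(\d+)/;

# List context with /g and captures: the captures of every match.
my %pairs = $line =~ /(\w)=(\d+)/g;

# List context with /g and no captures: the whole text of every match.
my @numbers = $line =~ /\d+/g;

# List context, no captures, no /g: the list (1) on success.
my @ok = $line =~ /y=/;

print join(",", @seen), "\n";   # x:1,y:22,z:333
print "$name $value\n";         # x 1
print "@numbers\n";             # 1 22 333
print "@ok\n";                  # 1
```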

After a failed match with the /g modifier, the search position is
normally reset to the beginning of the string. If the /c modifier also
is specified, this won't happen, and the next /g search will
continue where the old one failed. This is useful if you're matching
against a target string that might be appended to during successive checks of
the match.
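The effect of /c on the search position can be observed with the built-in pos function. A small sketch, assuming a made-up target string:

```perl
use strict;
use warnings;

my $buffer = "abc";

my $hit        = $buffer =~ /b/g;    # succeeds; search position is now 2
my $after_hit  = pos($buffer);

my $miss       = $buffer =~ /x/g;    # fails; search position is reset
my $after_miss = pos($buffer);       # undef after the reset

$hit           = $buffer =~ /b/g;    # succeeds again, starting from the beginning
$miss          = $buffer =~ /x/gc;   # fails, but /c keeps the position
my $after_gc   = pos($buffer);

print "after hit: $after_hit\n";                                           # after hit: 2
print "after /g miss: ", (defined $after_miss ? $after_miss : "undef"), "\n";  # after /g miss: undef
print "after /gc miss: $after_gc\n";                                       # after /gc miss: 2
```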

The delimiters within the match operator can be changed by specifying another
character after the initial m. Any character except whitespace can be
used, and using ' as the delimiter has the side effect of not
allowing string interpolation to be performed before the regular expression is
compiled. Balanced characters (such as (), [], {},
and <>) can be used to contain the expression.
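A brief sketch of alternative delimiters, including the non-interpolating single quote. The path and variable names are illustrative assumptions:

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";
my $dir  = "local";

my $slashes = $path =~ m/\/usr\/local\// ? 1 : 0;   # default delimiter: every / must be escaped
my $braces  = $path =~ m{/usr/local/}    ? 1 : 0;   # balanced delimiters: no escaping needed
my $hash    = $path =~ m#^/usr#          ? 1 : 0;   # another non-whitespace delimiter

# With ! as the delimiter, $dir is interpolated and the pattern becomes /local/.
my $interpolated = $path =~ m!/$dir/! ? 1 : 0;

# With ' as the delimiter nothing is interpolated; the engine sees a literal
# "/", the end-of-string anchor $, and then "dir/", which cannot match here.
my $literal = $path =~ m'/$dir/' ? 1 : 0;

print "$slashes $braces $hash $interpolated $literal\n";   # 1 1 1 1 0
```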

If the pattern is omitted completely, the pattern from the last successful
regular expression match is used. For example, if the last successful match
used the expression <(?:Abigail|Addi)>, a following grep with an empty
pattern re-uses that expression as its pattern.
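A small sketch of the empty-pattern behavior, assuming made-up sample strings. If the empty pattern were taken literally it would match every string, so the failed match on the second string shows that the earlier expression is being re-used:

```perl
use strict;
use warnings;

# A successful match establishes the "last successful pattern".
my $first = "<Abigail>" =~ m/<(?:Abigail|Addi)>/ ? 1 : 0;

# The empty pattern now stands for <(?:Abigail|Addi)>.
my $addi = "<Addi>" =~ m// ? 1 : 0;
my $bob  = "<Bob>"  =~ m// ? 1 : 0;

print "$first $addi $bob\n";   # 1 1 0
```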

Example Listing 3.1

# The example from the "backreferences" section
# re-worked to use the list-context-with-/g return
# value of the match operator.
open(CONFIG, "config") || die "Can't open config: $!";
{
    local $/;    # Slurp the whole file in one read
    %conf=<CONFIG>=~m/([^=]+)=(.*)\n/g;
}

See Also

Substitution operator, ??, and match modifiers in this book

Substitution Operator

s///

Usage

s/pattern/replacement/modifiers

Description

The s/// operator is Perl's substitution operator. The
pattern is first interpolated as though it were a double-quoted
string: scalar variables are expanded, backslash escapes are translated, and
so on. Afterward, the pattern is compiled for the regular expression engine.

The pattern is then used to match against a target string; by default, the
$_ variable is used unless another value is bound using the =~
operator.

If the pattern is successfully matched against the target string, the matched
portion is substituted using the replacement.

The substitution operator returns the number of substitutions made. If no
substitutions were made, the substitution operator returns false (the empty
string). The return value is the same in both scalar and list contexts.
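The return value can be checked directly. A minimal sketch with an invented sample string:

```perl
use strict;
use warnings;

my $text = "one fish two fish";

my $none = ($text =~ s/cat/dog/);      # no match: returns the empty string (false)
my $once = ($text =~ s/fish/bird/);    # one replacement, so 1
my $many = ($text =~ s/[aeiou]//g);    # with /g, the count of all replacements

print "none is false\n" unless $none;
print "$once $many\n";                 # 1 5
print "$text\n";                       # n brd tw fsh
```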

The /g modifier causes the substitution operator to repeat the match
as often as possible. Unlike the match operator, /g has no other side
effects (such as walking along the match in scalar context); it simply repeats
the substitution as often as possible for nonoverlapping regions of the target
string.

During the substitution, captured patterns from the pattern portion of the
operator are available during the replacement part of the operator as
$1, $2, and so on. If the /g modifier is used, the
captured patterns are refreshed for each replacement.

The /e modifier causes Perl to evaluate the replacement portion of
the substitution for each replacement about to happen as though it were being
run with eval {}. The replacement expression is syntax checked at
compile time and variable substitutions occur at runtime, the same as eval
{}.
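To see what /e changes, the same substitution can be run with and without it. This is a sketch with an invented sample string:

```perl
use strict;
use warnings;

my $prices = "apple 3, pear 10";

# With /e the replacement is evaluated as Perl code for each match.
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;

# Without /e the replacement is just an interpolated string.
(my $plain = $prices) =~ s/(\d+)/$1 * 2/g;

print "$doubled\n";   # apple 6, pear 20
print "$plain\n";     # apple 3 * 2, pear 10 * 2
```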

The delimiters within the substitution operator can be changed by specifying
another character after the initial s. Any character except whitespace
can be used, and using ' as the delimiter has the side effect of
not allowing string interpolation to be performed before the regular expression
is compiled. Balanced characters (such as (), [], {},
and <>) can be used to contain the pattern and replacement.
Additionally, one set of balanced characters can encase the pattern and a
different set the replacement, as in s{pattern}[replacement].
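A short sketch of mixed delimiters. The date string and variable names are assumptions for illustration only:

```perl
use strict;
use warnings;

my $iso = "2001-02-03";

# Braces around the pattern, square brackets around the replacement.
(my $us = $iso) =~ s{(\d+)-(\d+)-(\d+)}[$2/$3/$1];

# Parentheses around the pattern, angle brackets around the replacement.
(my $masked = $iso) =~ s(\d+)<N>g;

print "$us\n";       # 02/03/2001
print "$masked\n";   # N-N-N
```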

The match modifiers (other than /e and /g) are covered in
the entry on match modifiers.

Example Listing 3.2

# This function takes its argument and renders it in
# Pig-Latin following the traditional rules for Pig Latin
# (Note that there's a substitution within a substitution.)
{
    my $notvowel=qr/[^aeiou_]/i; # _ is because of \w
    sub igpay_atinlay {
        local $_=shift;
        # Match the word
        s[(\w+)]
        {
            local $_=$1;
            # Now re-arrange the leading consonants
            # or if none, append "yay"
            s/^($notvowel+)(.*)/$2$1ay/
                or
            s/$/yay/;
            $_; # Return the result
        }ge;
        return $_;
    }
}
print igpay_atinlay("Hello world"); # "elloHay orldway"

See Also

match operator, match modifiers, capturing, and backreferences in this
book

Character Shorthand

Description

Regular expressions, similar to double-quoted strings, also allow you to
specify hard-to-type characters as digraphs (backslash sequences), by name, or
by ASCII/Unicode number.

They differ from double-quoted context in that, within a regular expression,
you're trying to match the given character, not trying to emit it. A
single digraph might match more than one kind of character.

The simplest character shorthand is for the common unprintables. These are as
follows:

Character   Matches

\t          A tab (TAB and HT)
\n          A newline (LF, NL). On systems with multicharacter line
            termination characters, it matches both characters.
\r          A carriage return (CR)
\a          An alarm character (BEL)
\e          An escape character (ESC)

Regular expressions also can represent any ASCII character using the octal or
hexadecimal code for that character. The formats for the codes are
\digits for octal and \xdigits for hexadecimal.
So to represent a SYN (ASCII 22) character, you can say

/\x16/; # Match SYN (hex)
/\026/; # Match SYN (oct)

However, beware that using \digits can cause ambiguity with
backreferences (captured pieces of a regexp). The sequence \2
can mean either ASCII 2 (STX), or it can mean the item that was
captured from the second set of parentheses.

Ambiguous references are resolved in this manner: If the pattern contains at
least as many sets of capturing parentheses as the number digits,
\digits refers to the capture; otherwise, the value is the corresponding
ASCII value (in octal). Within a character class, \digits will never stand
for a backreference. Single-digit references such as \digit always
stand for a backreference, except for \0, which means ASCII 0 (NUL).

To avoid this mess, specify octal ASCII codes using three digits (with a
leading zero if necessary). Backreferences will never have a leading zero, and
there probably won't be more than 100 backreferences in a regular
expression.
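The numeric escapes and the backreference rule can be checked with a few one-line matches. This is a sketch; the sample strings are invented for the illustration:

```perl
use strict;
use warnings;

my $syn = chr(22);                        # SYN: decimal 22, hex 16, octal 026

my $by_hex   = $syn =~ /\x16/ ? 1 : 0;    # hexadecimal form
my $by_octal = $syn =~ /\026/ ? 1 : 0;    # three-digit octal form

# A single digit is a backreference: \2 repeats whatever group 2 captured.
my ($doubled) = "balloon" =~ /((\w)\2)/;

# \000 is unambiguous: a leading zero can never be a backreference.
my $nul = "a\0b" =~ /a\000b/ ? 1 : 0;

# With fewer than ten groups in the pattern, \10 is read as octal 010 (backspace).
my $backspace = chr(8) =~ /^\10$/ ? 1 : 0;

print "$by_hex $by_octal $doubled $nul $backspace\n";   # 1 1 ll 1 1
```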

Wide (multibyte) characters can be specified in hex by surrounding the hex
code with {} to contain the entire sequence of digits. The
utf8 pragma also must be in effect.

use utf8;
/\x{262f}/; # Unicode YIN YANG

When the character is a named character, you can specify the name with a
\N{name} sequence if the charnames module has been
included.
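A minimal sketch of matching the same character by number and by name. It assumes the :full import tag of the charnames module and the Unicode name YIN YANG for U+262F:

```perl
use strict;
use warnings;
use charnames ':full';    # enables \N{...} with full Unicode names

my $symbol = "\x{262f}";  # the character under test

my $by_number = $symbol =~ /\x{262f}/    ? 1 : 0;
my $by_name   = $symbol =~ /\N{YIN YANG}/ ? 1 : 0;

print "$by_number $by_name\n";   # 1 1
```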

Character Classes

Description

Character classes in Perl are used to match a single character with a
particular property. For example, if you want to match a single alphabetic
uppercase character, it would be nice to have a convenient way to describe
this property. In Perl, surround the characters that describe the property with
a set of square brackets:

m/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/

This expression will match a single, alphabetic, uppercase character (at
least for English speakers). This is a character class, and stands in for a
single character.

Ranges can be used to simplify the character class:

m/[A-Z]/

Ranges that seem natural (0-9, A-Z, A-M,
a-z, n-z) will work. If you're familiar with ASCII
collating sequence, other less natural ranges (such as [!-/]) can be
constructed. Ranges can be combined simply by putting them next to each other
within the class:

m/[A-Za-z]/; # Upper and lowercase alphabetics

Some characters have special meaning within a character class and deserve
attention:

The dash (-) character must either be preceded by a backslash, or
should appear first within a character class. (Otherwise it might appear to be
the beginning of a range.)

A closing bracket (]) within a character class should be preceded
by a backslash, or it might be mistaken for the end of the class.

The ^ character, when it appears first in the class, will negate the
character class. That is, every possible character that doesn't have the
property described by the character class will match. So:

m/[^A-Z]/; # Match anything BUT an uppercase, alphabetic character

Remember that negating a character class might include some things you
didn't expect. In the preceding example, control characters, whitespace,
Unicode characters, 8-bit characters, and everything else imaginable would be
matched, just not A-Z.
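A short sketch of ranges, a leading dash, and negation. The sample string is an assumption for illustration:

```perl
use strict;
use warnings;

my $s = "Perl 5, Camel-3!";

my $upper   = join '', $s =~ /[A-Z]/g;       # uppercase letters only
my $letters = join '', $s =~ /[A-Za-z]/g;    # two ranges side by side
my $others  = join '', $s =~ /[^A-Za-z]/g;   # negated: everything that is not a letter
my $dashed  = join '', $s =~ /[-0-9]/g;      # a leading dash is a literal dash

print "$upper\n";       # PC
print "$letters\n";     # PerlCamel
print "[$others]\n";    # [ 5, -3!]
print "$dashed\n";      # 5-3
```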

In general, any other metacharacter (including the special character classes
later in this section) can be included within a character class. The exceptions
are the characters .+()*|$^, which all have their mundane meanings when they
appear within a character class (apart from a leading ^, which negates the
class), and backreferences (\1, \2), which don't work within character
classes. The \b sequence means "backspace" in a character class, and not a
word boundary.
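A sketch of how these characters behave inside brackets, using invented sample strings:

```perl
use strict;
use warnings;

# Inside a class, ., + and * are ordinary characters.
my @operators = "a.b+c*d" =~ /[.+*]/g;

# "x\by" is x, a backspace character, then y.
my $text = "x\by";

my $in_class  = $text =~ /x[\b]y/ ? 1 : 0;   # [\b] is a literal backspace
my $out_class = $text =~ /x\by/   ? 1 : 0;   # bare \b is a zero-width word boundary

print "@operators\n";             # . + *
print "$in_class $out_class\n";   # 1 0
```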

The hexadecimal, octal, Unicode, and control sequences for characters also
work just fine within character classes (for example, [\x00-\x1f] matches the
ASCII control characters below 32).

With use locale in effect under a Spanish locale, matching on word characters
works as a Spanish speaker would expect, finding the words feliz and
cumpleaños. The locale can be negated by specifying a
bytes pragma within the lexical block, causing the character classes to
go back to their original meanings.

Perl also defines character classes to match sets of Unicode characters.
These are called Unicode properties, and are represented by
\p{property}. The list of properties is extensive because
Unicode's property list is long and perl adds a few custom
properties to that list as well. Because the Unicode support in Perl is
(currently) in flux, your best bet to find out what is currently implemented is
to consult the perlunicode manual page for the version of perl
that you're interested in.

The last kind of character class shortcut (other than user-defined ones
covered in the section on character classes) is defined by POSIX.
Within another character class, the POSIX classes can be used to match
even more specific kinds of characters. They all have the following form:

[:class:]

where class is the character class you're trying to match. To negate the
class, write it as follows: [:^class:].

Class    Meaning

ascii    7-bit ASCII characters (with an ord value <128)
alpha    Matches a letter
lower    Matches a lowercase alpha
upper    Matches an uppercase alpha
digit    Matches a decimal digit
alnum    Matches both alpha and digit characters
space    Matches a whitespace character (just like \s)
punct    Matches a punctuation character
print    Matches alnum, punct, or space
graph    Matches alnum or punct
word     Matches alnum or underscore
xdigit   Matches hex digits: digit, a-f, and A-F
cntrl    The ASCII characters with an ord value <32, plus DEL (127)
         (control characters)

To use the POSIX character classes, they must be
within another character class:

for(split(//,$line)) {
if (/[[:print:]]/) { print; }
}

Using a POSIX class on its own:

if (/[:print:]/) { } # WRONG!

won't have the intended effect. The previous bit of code would match
:, p, r, i, n, and t.

If the locale pragma is in effect, the POSIX classes will
work like the corresponding C library functions such as isalpha,
isalnum, isascii, and so on.
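A brief sketch combining POSIX classes with other class members and with negation. The sample strings are assumptions for illustration:

```perl
use strict;
use warnings;

# A line containing a tab and a BEL character, neither of which is printable.
my $line = "Tab\there, bell\a!";
my $printable = join '', grep { /[[:print:]]/ } split //, $line;

# POSIX classes can sit beside ordinary class members.
my $good_id = "var_1" =~ /^[[:alpha:]_][[:alnum:]_]*$/ ? 1 : 0;
my $bad_id  = "1var"  =~ /^[[:alpha:]_][[:alnum:]_]*$/ ? 1 : 0;

# The ^ inside the colons negates just that POSIX class.
my $no_digits = join '', "a1b22c" =~ /[[:^digit:]]/g;

print "$printable\n";         # Tabhere, bell!
print "$good_id $bad_id\n";   # 1 0
print "$no_digits\n";         # abc
```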

Quantifiers

Usage

{min,max}
{min,}
{min}
*
+
?

Description

Quantifiers are used to specify how many of a preceding item to match. That
item can be a single character (/a*/), a group (/(foo)?/), or
it can be anything that stands in for a single character such as a character
class (/\w+/).

The first quantifier is ?, which means to match the preceding item
zero or one times (in other words, the preceding item is optional).

Any portion of a match quantified by ? will always be successful.
Sometimes an item will be found, and sometimes not, but the match will always
work.

The quantifier * is similar to ? in that the quantified
item is optional, except * specifies that the preceding item can match
zero or more times. Specifically, the quantified item should be matched as many
times as possible and still have the regular expression match succeed. So,

/fo*bar/;

matches 'fobar', 'foobar',
'foooobar', and also 'fbar'. The
* quantifier will always match positively, but whether a matching item
will be found is another question. Because of this, beware of expressions such
as the following:

/[A-Z]*\w*/

You might hope it will match a series of uppercase characters and then a set
of word characters, and it will. But it also will match numbers, empty strings,
and binary data. Because everything in this expression is optional, the
expression will always match.
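The always-matches pitfall is easy to demonstrate. In this sketch the sample inputs are invented, and the anchored pattern with + is just one possible way to tighten the match, not the only one:

```perl
use strict;
use warnings;

my @inputs = ("", "123", "\x00\xff", "ABCdef");

# Every part is optional, so the pattern succeeds on anything, even "".
my @loose = grep { /[A-Z]*\w*/ } @inputs;

# Requiring at least one uppercase letter and anchoring both ends is stricter.
my @strict = grep { /^[A-Z]+\w*$/ } @inputs;

print scalar(@loose), "\n";    # 4
print scalar(@strict), "\n";   # 1
```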

With * you can absorb unwanted material to make your match less
specific:
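As an illustration, here is a hypothetical pattern for a simplified HTML-like tag with optional attributes. The pattern, the sample tags, and the variable names are assumptions made for this sketch, not a general HTML parser:

```perl
use strict;
use warnings;

# A tag name, then zero or more attributes of the form name="value".
# [^"]* lets the quotes be empty or filled; (?: ... )* makes the whole
# attribute optional and repeatable; \s+ forces a space before each one.
my $tag = qr/<\w+(?:\s+\w+="[^"]*")*\s*>/;

my @samples = (
    '<br>',
    '<img src="" alt="x">',
    '<body onload="alert()">',
    '<bodyonload="alert()">',    # no space after the tag name
);

my @results = map { /^$tag$/ ? 1 : 0 } @samples;

print "@results\n";   # 1 1 1 0
```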

In the preceding example, * was used to make [^"]
match empty quote marks, or quote marks with something inside; it was used to
make the attribute match (foo="bar") optional, and repeat it
as often as necessary.

The + quantifier requires the match not only to succeed at least
once, but also as many times as possible and still have the regular expression
match be successful. So, it's similar to *, except that at least
one match is guaranteed. In the preceding example, the space following the
\w+ was specified as \s+; otherwise items such as
<bodyonload="alert()"> would match.

/fo+bar/;

This matches 'fobar', 'foobar', and of
course 'fooooobar'. But unlike *, it will not match
'fbar'.
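The difference between the two quantifiers shows up on the shortest input. A small sketch with an assumed word list:

```perl
use strict;
use warnings;

my @words = qw(fbar fobar foobar fooooobar);

my @star = grep { /fo*bar/ } @words;   # zero or more o's
my @plus = grep { /fo+bar/ } @words;   # one or more o's

print "@star\n";   # fbar fobar foobar fooooobar
print "@plus\n";   # fobar foobar fooooobar
```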

Perl also allows you to match an item a minimum, fixed, or maximum number of
times with the {} quantifiers.

Quantifier   Meaning

{min,max}    Matches at least min times, but at most max times.
{min,}       Matches at least min times, and as many more as possible
             while still letting the match succeed.
{count}      Matches exactly count times.

Keep in mind that with the {min,} and
{min,max} searches, the match will absorb as many characters as it can
while still letting the rest of the pattern succeed. Thus, when a pattern such
as /\w(\w{2,})\w\w/ is matched against the word Python, the $1 variable
winds up with only three characters because the
first \w matched P, the last \w's needed "on"
to be successful, and that left "yth" for the quantified
\w.
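A runnable sketch of the three brace forms against the word Python. The exact patterns are assumptions chosen to reproduce the behavior described, not necessarily the original listing:

```perl
use strict;
use warnings;

my $word = "Python";

# {2,}: greedy, but it must leave two word characters for the trailing \w\w.
my ($open_ended) = $word =~ /\w(\w{2,})\w\w/;

# {1,2}: takes the maximum of two when nothing forces it to give any back.
my ($bounded) = $word =~ /\w(\w{1,2})/;

# {3}: exactly three characters.
my ($exact) = $word =~ /(\w{3})/;

print "$open_ended $bounded $exact\n";   # yth yt Pyt
```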

Perl's quantifiers are normally maximal matching, meaning that
they match as many characters as possible but still allow the regular expression
as a whole to match. This is also called greedy matching.

The ? quantifier has another meaning in Perl: when affixed to a
*, +, or {} quantifier, it causes the quantifier to
match as few characters as necessary for the match to be successful. This is
called minimal matching (or lazy matching).

It might surprise you to see that a pattern such as m/".*"/g, run against a
string containing several quoted phrases, grabs everything from the first
quote mark to the last, not each quote individually. That's because
".*" matches as much as possible between the quote marks,
including other quote marks. Changing the expression to:

m/".*?"/g

solves this problem by asking * to match as little as possible for
the match to succeed.

Keep in mind that ? is just a convenient shorthand and might not
represent the best possible solution to the problem. The pattern
/"[^"]*"/ would have been a more efficient choice
because the regular expression engine would have had less backtracking to do.
But there is programmer efficiency to consider.
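The three approaches can be compared on one string. A sketch with an invented sentence:

```perl
use strict;
use warnings;

my $text = q{She said "yes" and then "no".};

my @greedy  = $text =~ /".*"/g;      # one match, first quote mark to last
my @lazy    = $text =~ /".*?"/g;     # minimal: each quoted phrase separately
my @negated = $text =~ /"[^"]*"/g;   # same result without relying on ?

print scalar(@greedy), ": @greedy\n";     # 1: "yes" and then "no"
print scalar(@lazy), ": @lazy\n";         # 2: "yes" "no"
print scalar(@negated), ": @negated\n";   # 2: "yes" "no"
```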

See Also

m operator in this book

Modification Characters

Usage

\Q \E \L \l \U \u

Description

The modification characters used in string literals (in an interpolated
context) are available in regular expressions as well. See the entry on
modification characters for a list.

Understand that these "metacharacters" aren't really
metacharacters at all. They do their work because regular expression match
operators allow interpolation to happen when the pattern is first
examined. Just as \L and \U are effective only in double-quoted strings,
they're effective in regular expressions only when the pattern is first
examined by perl.
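A small sketch showing that the case modifiers act during interpolation, before the engine ever sees the pattern. The variable and sample strings are assumptions:

```perl
use strict;
use warnings;

my $name = "perl";

my $as_is   = "PERL rocks" =~ /$name/   ? 1 : 0;   # pattern is perl
my $upper   = "PERL rocks" =~ /\U$name/ ? 1 : 0;   # pattern becomes PERL
my $ucfirst = "Perl rocks" =~ /\u$name/ ? 1 : 0;   # pattern becomes Perl

print "$as_is $upper $ucfirst\n";   # 0 1 1
```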

Most useful among these in regular expressions is the \Q modifier.
The \Q modifier is used to quote any metacharacters that follow. When
accepting something that will be used in a pattern match from an untrusted
source, it is vitally important that you not put the pattern into the regular
expression directly. Take this small sample: