Modifiers, Boundaries, and Regular Expressions

In this third part to a four-part series on parsing and regular expressions in Perl, you will learn about cloistered pattern modifiers, boundary assertions, troubleshooting regular expressions, and more. This article is excerpted from chapter one of the book Pro Perl Parsing, written by Christopher M. Frenz (Apress; ISBN: 1590595041).

Cloistered Pattern Modifiers

In the previous section, you saw how to apply pattern modifiers to an entire regular expression. It is also possible to apply these modifiers to just a portion of a given regular expression; however, the syntax is somewhat different. The first step is to define the subpattern to which you want the modifier to apply. You accomplish this by placing the subpattern within a set of parentheses. Immediately after the open parenthesis, but before the subpattern, you add ?modifiers: (a question mark, the modifier letters, and a colon). For example, if you want to match either ABC or AbC, rather than using alternation, you write the following:

/A(?i:B)C/

To create a regular expression that allows . to match \n but only in part of the expression, you can code something like the following, which allows any character except \n to be matched until an A is encountered:

/.*?A(?s:.*?)BC/

It then allows any character to match, including \n, until a BC is encountered.
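The following is a minimal sketch of both cloistered patterns in action. The test strings are assumptions chosen for illustration; only the two patterns come from the text above.

```perl
use strict;
use warnings;

# (?i:B) makes only the B case-insensitive; A and C stay case-sensitive.
my @case_results = map { /A(?i:B)C/ ? 1 : 0 } ('ABC', 'AbC', 'abc');
print "ABC AbC abc -> @case_results\n";   # 1 1 0

# The leading .*? cannot cross a newline, but the cloistered (?s:.*?) can.
# The match therefore has to start after the first "\n", yet it may span
# the second one.
my ($matched) = "x\nyA\nzBC" =~ /(.*?A(?s:.*?)BC)/;
(my $shown = $matched) =~ s/\n/\\n/g;      # make the newline visible
print "matched: $shown\n";                 # matched: yA\nzBC
```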

Note Cloistered pattern modifiers are available only in Perl versions 5.6.0 and later.

Assertions

Assertions are somewhat different from the topics I covered in the preceding sections on regular expressions, because unlike the other topics, assertions do not deal with characters in a string. Because of this, they are more properly referred to as zero-width assertions.

For example, if you want to match only the beginning of a string, you can employ the \A assertion. Similarly, you can also use the ^ assertion, known as the beginning-of-line assertion, which will match characters at the beginning of a string. When used in conjunction with the /m modifier, it will also be able to match characters after any new lines embedded within a string. Thus, if you had the regular expressions /\A123/ and /^123/m, both would be able to match the string 123456, but only /^123/m would be able to match the string abd\n123.
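As a quick check of that comparison, here is a small sketch that runs both assertions against the two strings from the paragraph above. The variable names are illustrative, and the last row (^ without /m) is added only for contrast.

```perl
use strict;
use warnings;

my $single = "123456";
my $multi  = "abd\n123";    # 123 begins the second line

my @results = (
    [ '\A123 on 123456',        $single =~ /\A123/ ? 1 : 0 ],
    [ '^123 with /m on 123456', $single =~ /^123/m ? 1 : 0 ],
    [ '\A123 on abd\n123',      $multi  =~ /\A123/ ? 1 : 0 ],
    [ '^123 with /m on abd\n123', $multi =~ /^123/m ? 1 : 0 ],
    [ '^123 without /m on abd\n123', $multi =~ /^123/ ? 1 : 0 ],
);

# Prints: match, match, no match, match, no match (in that order)
print "$_->[0]: ", ($_->[1] ? "match" : "no match"), "\n" for @results;
```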

The \z, \Z, and $ Assertions

Just as there are assertions for dealing with the beginnings of lines and strings, so too are there assertions for dealing with the character sequences that end strings. The first of these assertions is the \z assertion, which will match the ending contents of a string, including any new lines. \Z works in a similar fashion; however, this assertion will not include a terminal new line character in its match, if one is present at the end of a string. The final assertion is $, which has functionality similar to \Z, except that the /m modifier can enable this assertion to match anywhere in a string that is directly prior to a new line character. For example, /321\Z/, /321\z/, and /321$/ would all be able to match the string 654321.
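The differences among the three end assertions only show up once a new line is involved. The sketch below assumes two sample strings of my own, one with a trailing new line and one with an embedded new line.

```perl
use strict;
use warnings;

my $trailing = "654321\n";    # ends with a new line
my $embedded = "321\n654";    # new line in the middle

my %end = (
    z_trailing     => ($trailing =~ /321\z/ ? 1 : 0),  # 0: "\n" sits between 321 and the end
    Z_trailing     => ($trailing =~ /321\Z/ ? 1 : 0),  # 1: \Z ignores one final new line
    dollar_trailing=> ($trailing =~ /321$/  ? 1 : 0),  # 1: $ behaves like \Z here
    Z_embedded     => ($embedded =~ /321\Z/ ? 1 : 0),  # 0: 321 is not at the end
    dollar_embedded=> ($embedded =~ /321$/  ? 1 : 0),  # 0: no /m, so same as \Z
    dollar_m_embed => ($embedded =~ /321$/m ? 1 : 0),  # 1: /m lets $ match before any "\n"
);

print "$_ => $end{$_}\n" for sort keys %end;
```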

Boundary Assertions

While assertions dealing with the beginning and end of a string/line are certainly useful, assertions that allow you to deal with positions internal to a string/line are just as important. Several types of assertions can accomplish this, and the first type you will examine is the so-called boundary assertion. The \b boundary assertion allows you to perform matches at any word boundary. A word boundary can exist in two possible forms, since a word has both a beginning and an end. In more technical terms, the beginning of a word is defined as \W\w, or any nonalphanumeric character followed by any alphanumeric character. The end of a word has the reverse definition. That is, it is defined by \w\W, or a word character followed by a nonword character. When using these assertions, you should keep in mind several considerations, however. The first is that the underscore character is a part of the \w subpattern, even though it is not an alphanumeric character. Furthermore, you need to be careful using this assertion if you are dealing with contractions, abbreviations, or other wordlike structures, such as Web and e-mail addresses, that have embedded nonalphanumeric characters. According to the \w\W or \W\w pattern, any of the following would contain valid boundaries:
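To see how \b splits wordlike structures, consider this rough sketch. The sample string is an assumption chosen to include an underscore, a contraction, a hyphenated word, and an e-mail address.

```perl
use strict;
use warnings;

# q{} avoids having to escape the apostrophe and the @ sign.
my $sample = q{user_name can't e-mail bob@example.com};

# Each \b...\b pair isolates one run of \w characters.
my @words = $sample =~ /\b(\w+)\b/g;

# The underscore keeps user_name whole, while ', - and @ each create
# boundaries inside what a reader would call a single "word".
print join(' | ', @words), "\n";
# user_name | can | t | e | mail | bob | example | com
```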

Before I discuss the remaining assertion, I will first discuss the pos function, since this function and the \G assertion are often used to similar effect. You can use the pos function to either return or specify the position in a string where the next matching operation will start (that is, the offset just past the end of the current match). To better understand this, consider the code in Listing 1-4.

Notice how the first e is missing from the output. This is because Listing 1-4 specified the search to begin at position 3, which is after the occurrence of the first e . Hence, when you print the listing of the returned matches, you can see that the e in the first position was not seen by the regular expression engine.
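A minimal sketch of the same idea follows. The sample string is an assumption and may differ from the one used in Listing 1-4; what matters is that its first e lies before position 3.

```perl
use strict;
use warnings;

# Hypothetical sample string: e occurs at offsets 1, 8, and 12.
my $string = "regular expressions";

# Tell the engine to start the next /g match at offset 3.
pos($string) = 3;

my @next_positions;
while ($string =~ /e/g) {
    # After each successful /g match, pos() is the offset just past it.
    push @next_positions, pos($string);
}

# The e at offset 1 is never seen, so only two matches are reported.
print "@next_positions\n";   # 9 13
```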

The remaining assertion, the \G assertion, is a little more dynamic than the previous assertions in that it does not specify a fixed type of point where matching attempts are allowed to occur. Rather, the \G assertion, when used in conjunction with the /g modifier, will allow you to anchor a match at the position where your previous match ended. Let’s examine how this works by looking at a file containing a list of names followed by phone numbers. Listing 1-5 shows a short script that will search through the list of names until it finds a match. The script will then print the located name and the corresponding phone number.

Note As mentioned earlier, parentheses are metacharacters and must be escaped in order to allow the regular expression to match them.

This script begins with you creating the $string variable and adding the list of names. Next, you define the $name variable as the name Mary. The next line of code is not always necessary but can be if prior matching and other types of string manipulation were previously performed on the string. You can use the pos function to set the starting point of the search to the starting point of the string. Finally, you can use a loop structure to search for the name Mary within your $string variable. Once Mary is located, you apply the \G assertion in the conditional statement, which will recognize and print any phone number that is present immediately after Mary. If you execute this script, you should receive the following output:

Mary (734)234-9873
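The steps just described can be sketched roughly as follows. This is not a reproduction of Listing 1-5: the other names and numbers in $string are hypothetical, and the /gc modifiers on the inner match are my own choice so that a failed \G match leaves the search position untouched.

```perl
use strict;
use warnings;

# Hypothetical list; only Mary's entry is taken from the output above.
my $string = "Bob (555)123-4567 Mary (734)234-9873 Tom (555)987-6543";
my $name   = "Mary";

pos($string) = 0;    # start from the beginning, whatever happened earlier

my @hits;
while ($string =~ /\b\Q$name\E\b/g) {
    # \G anchors this match at the spot where the name match ended, so
    # only a number that directly follows the name is accepted.
    if ($string =~ /\G\s?(\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/gc) {
        push @hits, "$name $1";
    }
}

print "$_\n" for @hits;   # Mary (734)234-9873
```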

Capturing Substrings

After looking at the previous example, you might be wondering how you were able to capture the recognized phone number in order to print it. Looking at the output and the print statement itself should give you the idea that it had something to do with the variable $1, and indeed it did. Earlier in the chapter, I noted that parentheses could serve two purposes within Perl regular expressions. The first is to define subpatterns, and the second is to capture the substring that matches the given subpattern. These captured substrings are stored in the variables $1, $2, $3, and so on. The contents of the first set of parentheses go into $1, the second into $2, the third into $3, and so on. Thus, in the previous example, by placing the phone number regular expression into parentheses, you are able to capture the phone number and print it by calling the $1 variable.

When using nested parentheses, it is important to remember that the parentheses are numbered from left to right according to where each open parenthesis occurs. As a result, the substring enclosed by the first open parenthesis encountered and its corresponding close parenthesis will be assigned to $1, even if it is not the first fully complete substring to be evaluated. For example, if you instead wrote the phone number regular expression as follows, the first set of parentheses would capture the entire phone number as before:

=~/(\s?(\(?\d{3}\)?)[-\s.]?(\d{3}[-.]\d{4}))/

The second set would capture the area code in $2 , and the third set would put the remainder of the phone number into $3 .
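Here is a short sketch of that numbering, applied to the phone number from the earlier output. The variable names are illustrative, and the second pattern is an added variation showing a non-capturing (?:...) group.

```perl
use strict;
use warnings;

my $phone = "(734)234-9873";

# Groups are numbered by the position of their opening parenthesis.
my ($whole, $area, $rest) =
    $phone =~ /(\s?(\(?\d{3}\)?)[-\s.]?(\d{3}[-.]\d{4}))/;

print "whole: $whole\n";   # whole: (734)234-9873
print "area:  $area\n";    # area:  (734)
print "rest:  $rest\n";    # rest:  234-9873

# With (?:...) the area code is grouped but not captured, so the only
# numbered group is the remainder of the number.
my @captured = $phone =~ /(?:\(?\d{3}\)?)[-\s.]?(\d{3}[-.]\d{4})/;
print "captured: @captured\n";   # captured: 234-9873
```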

Note If you do not want to capture any values with a set of parentheses but only specify a subpattern, you can place ?: right after ( but before the subpattern (for example, (?:abc) ).

Parentheses are not the only way to capture portions of a string after a regular expression matching operation. In addition to specifying the contents of parentheses in variables such as $1, the regular expression engine also assigns a value to the variables $`, $&, and $'. $& is a variable that is assigned the portion of the string that the regular expression was actually able to match. $` is assigned all the contents to the left of the match, and $' is assigned all the contents to the right of the match (see Table 1-6).

Caution When dealing with situations that involve large amounts of pattern matching, it may not be advisable to use $&, $`, and $', since once they are used anywhere in a program they will be generated for every match until the Perl program terminates, which can noticeably increase the program’s execution time.

Table 1-6. Substring Capturing Variables

Variable          Use
$1, $2, $3, …     Stores captured substrings contained in parentheses
$&                Stores the substring that matched the regex
$`                Stores the substring to the left of the matching regex
$'                Stores the substring to the right of the matching regex

Let’s take some time now to explore both types of capturing in greater depth by considering the medical informatics example, mentioned earlier, of mining medical literature for chemical interactions. Listing 1-6 shows a short script that will search for predefined interaction terms and then capture the names of the chemicals involved in the interaction.

Listing 1-6. Capturing Substrings

#!/usr/bin/perl

($String=<<'ABOUTA');
ChemicalA is used to treat cancer.
ChemicalA reacts with ChemicalB which is found in cancer cells.
ChemicalC inhibits ChemicalA.
ABOUTA

The script begins by searching through the text until it reaches one of the predefined interaction terms. Rather than using a dictionary-type list with numerous interaction terms, alternation of the two terms found in the text is used for simplicity. When one of the interaction terms is identified, the variable $rxn is set equal to this term, and $left and $right are set equal to the left and right sides of the match, respectively. Conditional statements and parentheses-based string capturing are then used to capture the word before and the word after the interaction term, since these correspond to the chemical names. It is also important to note the use of the \z assertion in order to match the word before the interaction term, since this word is located at the end of the $left string. If you run this script, you see that the output describes the interactions explained in the initial text:

ChemicalA reacts with ChemicalB
ChemicalC inhibits ChemicalA
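The approach described above might be sketched along these lines. This is a reconstruction under stated assumptions rather than the original listing: the loop structure, the \w+ word patterns, and the @interactions array are my own choices, while $rxn, $left, $right, and the use of \z follow the description.

```perl
use strict;
use warnings;

my $text = <<'ABOUTA';
ChemicalA is used to treat cancer.
ChemicalA reacts with ChemicalB which is found in cancer cells.
ChemicalC inhibits ChemicalA.
ABOUTA

my @interactions;
while ($text =~ /reacts with|inhibits/g) {
    # Copy the match variables at once; later matches overwrite them.
    my ($rxn, $left, $right) = ($&, $`, $');

    # The chemical before the term is the last word of $left ...
    my ($before) = $left =~ /(\w+)\s+\z/;
    # ... and the chemical after it is the first word of $right.
    my ($after) = $right =~ /\A\s+(\w+)/;

    push @interactions, "$before $rxn $after"
        if defined $before && defined $after;
}

print "$_\n" for @interactions;
# ChemicalA reacts with ChemicalB
# ChemicalC inhibits ChemicalA
```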

Substitution

Earlier I mentioned that in addition to basic pattern matching, you can use the =~ and !~ operators to perform substitution. The operator for this operation is s///. Substitution is similar to basic pattern matching in that it will initially seek to match a specified pattern. However, once a matching pattern is identified, the substitution will replace the part of the string that matches the pattern with another string. Consider the following:

$String="aabcdef";
$String=~s/abc/123/;
print $String;

If you execute this code, the string a123def will be printed. In other words, the pattern recognized by /abc/ is replaced with 123 .

Troubleshooting Regexes

The previous examples clearly demonstrate that regular expressions are a powerful and flexible programming tool and are thus widely applicable to a wealth of programming tasks. As you can imagine, however, all this power and flexibility can often make constructing complex regular expressions quite difficult, especially when certain positions within the expression are allowed to match multiple characters and/or character combinations. The construction of robust regular expressions is something that takes practice; but while you are gaining that experience, you should keep in mind a few common types of mistakes:

Make sure you choose the right wildcard: For example, if you must have one or more of a given character, make sure to use the quantifier + and not * , since * will match a missing character as well.

Watch out for greediness: Remember to control greediness with ? when appropriate.

Make sure to check your case (for example, upper or lowercase): For example, typing \W when you mean \w will result in the ability to match different things.

Watch out for metacharacters ( \ , ( , ) , | , [ , { , ^ , $ , * , + , . , and ? ): If a metacharacter is part of your pattern, make sure you turn off its special meaning by prefixing it with \ .

Check your | conditions carefully: Make sure all the possible paths are appropriate.
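Several of these mistakes are easy to reproduce in a few lines. The sketch below uses small test strings of my own to contrast each pitfall with its corrected form.

```perl
use strict;
use warnings;

# Quantifier choice: * succeeds even when no digit is present, + does not.
my $star_ok = ("abc" =~ /\A\d*/) ? 1 : 0;    # 1
my $plus_ok = ("abc" =~ /\A\d+/) ? 1 : 0;    # 0

# Greediness: .* runs to the last quote, .*? stops at the first one.
my $quoted   = 'say "one" and "two"';
my ($greedy) = $quoted =~ /"(.*)"/;          # one" and "two
my ($lazy)   = $quoted =~ /"(.*?)"/;         # one

# Metacharacters: a bare . matches any character, \. only a literal dot.
my $bare_dot    = ("3x14" =~ /\A3.14\z/)  ? 1 : 0;   # 1
my $escaped_dot = ("3x14" =~ /\A3\.14\z/) ? 1 : 0;   # 0

# Alternation: ^cat|dog$ means (^cat) or (dog$), not ^(cat or dog)$.
my $loose_alt = ("catalog" =~ /^cat|dog$/)     ? 1 : 0;   # 1
my $tight_alt = ("catalog" =~ /^(?:cat|dog)$/) ? 1 : 0;   # 0

print "star=$star_ok plus=$plus_ok\n";
print "greedy=[$greedy] lazy=[$lazy]\n";
print "bare_dot=$bare_dot escaped_dot=$escaped_dot\n";
print "loose_alt=$loose_alt tight_alt=$tight_alt\n";
```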

Even with these guidelines, debugging a complex regular expression can still be a challenge, and one of the best, although time-consuming, ways to do this can be to actually draw a visual representation of how the regular expression should work, similar to that found in the state machine figures presented earlier in the chapter (Figure 1-2 through Figure 1-8). If drawing this type of schematic seems too arduous a task, you may want to consider using the GraphViz::Regex module.

GraphViz::Regex

GraphViz is a graphing program developed by AT&T for the purpose of creating visual representations of structured information such as computer code (http://www.research.att.com/sw/tools/graphviz/). Leon Brocard wrote the GraphViz Perl module, which serves as a Perl-based interface to the GraphViz program. GraphViz::Regex can be useful when coding complex regular expressions, since this module is able to create visual representations of regular expressions via GraphViz. The syntax for using this module is quite straightforward and is demonstrated in the following code snippet:

When you first employ the GraphViz::Regex module, you place a call to the new constructor, which requires a string of the regular expression that you seek a graphical representation of. The new method is then able to create a GraphViz object that corresponds to this representation and assigns the object to $graph . Lastly, you are able to print the graphical representation you created. This example displays a JPEG file, but numerous other file types are supported, including GIF, PostScript, PNG, and bitmap.

Caution The author of the module reports that there are incompatibilities between this module and Perl versions 5.005_03 and 5.7.1.

Tip Another great tool for debugging regular expressions comes as a component of ActiveState’s programming IDE Komodo. Komodo contains the Rx Toolkit, which allows you to enter a regular expression and a string into each of its fields and which tells you if they do or do not match as you type. This can be a rapid way to determine how well a given expression will match a given string.

Using Regexp::Common

As you can imagine, certain patterns are fairly commonplace and will likely be repeatedly utilized. This is the basis behind Regexp::Common, which is a Perl module originally authored by Damian Conway and maintained by Abigail that provides a means of accessing a variety of regular expression patterns. Since writing regular expressions can often be tricky, you may want to check this module and see if a pattern suited to your needs is available. Table 1-7 lists all the expression pattern categories available in version 2.113 of this module.

Table 1-7. Regexp::Common Patterns

Pattern Types     Use
Balanced          Matches strings with parenthesized delimiters
Comment           Identifies code comments in 43 languages
Delimited         Matches delimited text
Lingua            Identifies palindromes
List              Works with lists of data
Net               Matches IPv4 and MAC Internet addresses
Number            Works with integers and reals
Profanity         Identifies obscene terms
URI               Identifies diversity of URI types
Whitespace        Matches leading and trailing whitespace
Zip               Matches ZIP codes

Although Table 1-7 provides a general idea of the different types of patterns, it is a good idea to look at the module description available at CPAN ( http://www.cpan.org/ ). The module operates by generating hash values that correspond to different patterns, and these patterns are stored in the hash %RE. When using this module, you can access its predefined subpatterns by referencing the scalar value of a particular hash element. So, if you want to search for Perl comments in a file, you can employ the hash value stored in $RE{comment}{Perl}; or, if you want to search for real numbers, you can use $RE{num}{real}. This two-layer hash-of-hashes structure is fine for specifying most pattern types, but deeper layers are available in many cases. These deeper hash layers represent flags that modify the basic pattern in some form. For example, with numbers, in addition to just specifying real or integer, you can also set delimiters so that 1,234 is interpreted as a valid number pattern rather than just 1234. I will briefly cover some types of patterns, but complete coverage of every possible option could easily fill a small book on its own. I recommend you refer to the descriptions of the pattern types offered by each component module.

Regexp::Common::Balanced

This namespace generates regular expressions that are able to match sequences located between balanced parentheses or brackets. The basic syntax needed to access these regular expressions is as follows:

$RE{balanced}{-parens=>'()[]{}'}

The first part of this hash value refers to the basic regular expression structure needed to match text between balanced delimiters. The second part is a flag that specifies the types of parentheses you want the regular expression to recognize. In this case, it is set to work with () , [] , and {} . One application of such a regular expression is in the preparation of publications that contain citations, such as “(Smith et al., 1999).” An author may want to search a document for in-text citations in order to ensure they did not miss adding any to their list of references. You can easily accomplish this by passing the filename of the document to the segment of code shown in Listing 1-7.

Listing 1-7. Pulling Out the Contents of () from a Document

#!/usr/bin/perl -w
use Regexp::Common;

while(<>){
    /$RE{balanced}{-parens=>'()'}{-keep}/
        and print "$1\n";
}

Note A more detailed description of the module’s usage will follow in the sections “Standard Usage” and “Subroutine-Based Usage,” since each of the expression types can be accessed through code in the same manner.