Recipe 6.18 Expressing AND, OR, and NOT in a Single Pattern

6.18.1 Problem

You have an existing program that
accepts a pattern as an argument or as input. It doesn't allow you to
add extra logic, like case-insensitive options, ANDs, or NOTs. So you
need to write a single pattern that matches either of two different
patterns (the "or" case) or both of two patterns (the "and" case), or
that reverses the sense of the match ("not").

This situation arises often in configuration files, web forms, or
command-line arguments. Imagine there's a program that does
this:

chomp($pattern = <CONFIG_FH>);
if ( $data =~ /$pattern/ ) { ..... }

As the maintainer of CONFIG_FH, you need to convey
Booleans through to the program using one configuration parameter.

6.18.2 Solution

True if either /ALPHA/ or
/BETA/ matches, like /ALPHA/ || /BETA/:

/ALPHA|BETA/
/(?:ALPHA)|(?:BETA)/ # works no matter what is in both

True if both /ALPHA/ and /BETA/
match, but may overlap, meaning "BETALPHA" should
be okay, like /ALPHA/ && /BETA/:

/^(?=.*ALPHA)(?=.*BETA)/s

True if both /ALPHA/ and /BETA/
match, but may not overlap, meaning that
"BETALPHA" should fail:

/ALPHA.*BETA|BETA.*ALPHA/s

True if pattern /PAT/ does not match, like
$var !~ /PAT/:

/^(?:(?!PAT).)*$/s

True if pattern BAD does not match, but pattern
GOOD does:

/^(?=(?:(?!BAD).)*$).*GOOD/s

(You can't actually count on being able to place the
/s modifier there after the trailing slash, but
we'll show how to include it in the pattern itself at the end of the
Discussion.)

6.18.3 Discussion

When in a normal program you want to know whether something
doesn't match, use one of:
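As a rough illustration, the ordinary spellings look something like the sketch below. The variable $string and the literal patterns are placeholders chosen for this example, not names from any real program.

```perl
use strict;
use warnings;

my $string = "some sample text";    # placeholder data

# Three equivalent ways to ask "does it NOT match?"
my $ugly      = !($string =~ /pattern/);          # works, but ugly
my $preferred =   $string !~ /pattern/;           # the usual spelling
my $verbose   = ($string =~ /pattern/) ? 0 : 1;   # explicit, if wordy

# Ordinary Boolean connectives for "both" and "either"
my $both   = ($string =~ /sample/ && $string =~ /text/);
my $either = ($string =~ /sample/ || $string =~ /absent/);

print "no match\n" if $preferred;
print "both\n"     if $both;
print "either\n"   if $either;
```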

Instead of trying to do it all within a single pattern, it's often
more efficient and clearer to use Perl's normal Boolean connectives
to combine regular expressions. However, imagine a trivially short
minigrep program that reads its single pattern
as an argument, as shown in Example 6-9.

Example 6-9. minigrep
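A minimal sketch of such a program might look like the following. It is written as a subroutine so it can be tried on in-memory data; the subroutine name minigrep, its calling convention, and the sample text are assumptions made for this illustration.

```perl
use strict;
use warnings;

# minigrep: return the lines from a filehandle that match one pattern.
sub minigrep {
    my ($pattern, $fh) = @_;
    my $re = qr/$pattern/;          # compile the single pattern once
    my @hits;
    while (my $line = <$fh>) {
        push @hits, $line if $line =~ $re;
    }
    return @hits;
}

# Try it on an in-memory "file" of three lines.
my $text = "alpha\nbeta\ngamma\n";
open(my $fh, '<', \$text) or die "can't open in-memory file: $!";
my @hits = minigrep('^[ab]', $fh);
print @hits;                        # the alpha and beta lines
```

In a real script, the pattern would come from shift(@ARGV) and the lines from <>. Either way, the program has room for exactly one pattern.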

To tell minigrep that some pattern must not
match, or that it has to match both subpatterns in any order, you're
at an impasse. The program isn't built to accept multiple patterns.
How can you do it using one pattern? This need comes up in programs
reading patterns from configuration files.

The OR case is pretty easy, since the |
metacharacter provides for alternation. The AND and NOT cases,
however, are more complex.

For AND, you have to distinguish between overlapping and
non-overlapping needs. If, for example, you want to see whether a
string matches both "bell" and
"lab" and allow overlapping, the word
"labelled" should be matched. But if you don't
want to count overlaps, it shouldn't be matched. The overlapping case
uses a lookahead assertion:

"labelled" =~ /^(?=.*bell)(?=.*lab)/s

Remember: in a normal program, you don't have to go through these
contortions. Simply say:

$string =~ /bell/ && $string =~ /lab/

To unravel this, we'll spell it out using /x and
comments. Here's the long version:
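What follows is a sketch of that long version, written with one lookahead per subpattern so that neither word has to sit at the start of the string. The variable names $both and $murray_hill are used only for illustration.

```perl
use strict;
use warnings;

my $both = qr{
    ^               # start of string
    (?=             # zero-width lookahead
        .*          # any amount of intervening stuff
        bell        # the desired bell string
    )               # rewind, since we were only looking
    (?=             # and do the same thing again
        .*          # any amount of intervening stuff
        lab         # and the lab part
    )
}sx;                # /s means . can match newline

my $murray_hill = "labelled";
if ($murray_hill =~ $both) {
    print "Looks like Bell Labs might be in Murray Hill!\n";
}
```

Because both lookaheads start from the same position and consume nothing, the order of the two words in the string doesn't matter, and they are free to overlap.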

We didn't use
.*? to end early, because minimal matching is more
expensive than maximal matching. It's more efficient to use
.* over .*?, given random input
where the occurrence of matches at the front or the end of the string
is completely unpredictable. Of course, sometimes choosing between
.* and .*? may depend on
correctness rather than efficiency, but not here.

To handle the non-overlapping case, you need two parts separated by
an OR. The first branch is THIS followed by THAT; the second is the
other way around:
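A sketch of both spellings, short and long, using the same "bell" and "lab" test words (the names $short and $long are hypothetical):

```perl
use strict;
use warnings;

# Short form: one branch per ordering.
my $short = qr/bell.*lab|lab.*bell/s;

# The same thing spelled out with /x and comments.
my $long = qr{
      bell .* lab     # a bell, and some time later a lab
    |                 # or else
      lab  .* bell    # a lab, and some time later a bell
}sx;

for my $text ("labelled", "bell lab", "lab bell") {
    printf "%-10s short:%s long:%s\n", $text,
        ($text =~ $short ? "yes" : "no"),
        ($text =~ $long  ? "yes" : "no");
}
```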

Neither of those patterns matches the test data of
"labelled", since there "bell"
and "lab" do overlap.

These patterns aren't necessarily efficient.
$murray_hill =~ /bell/ && $murray_hill =~ /lab/ scans the string at most twice, but the
pattern-matching engine's only option is to try to find a
"lab" for each occurrence of
"bell" with
(?=^.*?bell)(?=^.*?lab), leading to quadratic
worst-case running times.

If you followed those examples, the NOT case should be a breeze. The
general form looks like this:
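Here is a sketch of that form, spelled out with /x. The unwanted word "waldo" and the variable $no_waldo are illustrative choices, not part of any required interface.

```perl
use strict;
use warnings;

my $no_waldo = qr{
    ^                   # start of string
    (?:                 # non-capturing group
        (?! waldo )     # fail here if waldo begins at this position
        .               # otherwise accept any one character
    ) *                 # repeat that group zero or more times
    $                   # through the end of the string
}sx;                    # /s means . can match newline

for my $map ("nothing to see here", "there goes waldo again") {
    print "There's no waldo in: $map\n" if $map =~ $no_waldo;
}
```

The pattern walks the string one character at a time and checks at every step that the forbidden text does not begin there. It succeeds only if it can reach the end without ever seeing that text.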

How would you combine AND, OR, and NOT? It's not a pretty picture,
and in a regular program, you'd almost never do this. But you have
little choice when you're reading from a config file or pulling in
arguments from the command line, because you specify only one
pattern. You just have to combine what we've learned so far.
Carefully.

Let's say you wanted to run the Unix w program
and find out whether user tchrist were logged on
anywhere but a terminal whose name began with
ttyp; that is, tchrist must
match, but ttyp must not.
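One way to sketch this is a negative lookahead anchored at the start of the line, followed by the text that must appear. The sample lines below are made up for illustration and only loosely resemble real w output.

```perl
use strict;
use warnings;

my $wanted = qr{
    ^               # anchored to the start of the line
    (?!             # zero-width negative lookahead
        .*          # any amount of anything
        ttyp        # the string that must not appear
    )               # end of negation; we are still at the start
    .*              # any amount of anything
    tchrist         # the string that must appear
}x;

# Made-up lines, for illustration only.
my @lines = (
    "tchrist  tty1   console",
    "tchrist  ttyp3  remote",
    "gnat     tty2   console",
);
my @found = grep { $_ =~ $wanted } @lines;
print "$_\n" for @found;
```

Collapsed onto one line, the same pattern could be handed to a single-pattern program as '^(?!.*ttyp).*tchrist'.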

Of course, this example is contrived: any sane person would call the
standard grep program twice, once with a
-v option to select only
non-matches.

% w | grep tchrist | grep -v ttyp

The point is that Boolean conjunctions and negations
can be coded up in one single pattern. You
should comment this kind of thing, though, having pity on those who
come after you, before they do.

One last thing: how would you embed that /s in a
pattern passed to a program from the command line? The same way as
you would a /i modifier: by using
(?i) in the pattern. The /s and
/m modifiers can be painlessly included in a
pattern as well, using (?s) or
(?m). These can even cluster, as in
(?smi). That would make these two reasonably
interchangeable:

% grep -i 'pattern' files
% minigrep '(?i)pattern' files
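To see the effect from the receiving program's side, here is a small sketch in which the patterns arrive as plain strings, as they would from a command line or a config file. The variable names and the sample data are assumptions made for this example.

```perl
use strict;
use warnings;

# Patterns as they might arrive from outside the program.
my $plain  = 'perl';
my $nocase = '(?i)perl';                # same as /perl/i
my $no_s   = '^(?:(?!PAT).)*$';         # NOT pattern, no /s
my $dotall = '(?s)^(?:(?!PAT).)*$';     # NOT pattern with /s embedded

my $data = "Learning PERL\nthe hard way\n";

print "plain matches\n"  if $data =~ /$plain/;    # not printed: case differs
print "nocase matches\n" if $data =~ /$nocase/;   # printed
print "no_s matches\n"   if $data =~ /$no_s/;     # not printed: . stops at newline
print "dotall matches\n" if $data =~ /$dotall/;   # printed
```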

When you turn on a modifier that way, it remains on for the entire
pattern. An alternative notation restricts the scope of the modifier.
Use a clustering parenthesis set, (?:...), and
place the modifiers between the question mark and the colon. Printing
out a qr// quoted regex demonstrates how to do
this:

% perl -le 'print qr/pattern/i'
(?i-xsm:pattern)

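Modifiers placed before a minus are enabled for just that pattern;
those placed after the minus are disabled for that
pattern. (Perl 5.14 and later print the same regex as
(?^i:pattern), where the caret means "start from the default
modifiers.")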
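A short sketch of scoped modifiers in both directions, using made-up strings and variable names:

```perl
use strict;
use warnings;

# Turn /i on for one cluster only.
my $scoped_on = qr/(?i:hello) world/;

# Turn /i off for one cluster inside an otherwise case-insensitive pattern.
my $scoped_off = qr/(?-i:Perl) cookbook/i;

print "1\n" if "HELLO world"   =~ $scoped_on;    # printed
print "2\n" if "hello WORLD"   =~ $scoped_on;    # not printed
print "3\n" if "Perl COOKBOOK" =~ $scoped_off;   # printed
print "4\n" if "perl cookbook" =~ $scoped_off;   # not printed
```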

6.18.4 See Also

Lookahead assertions are shown in the "Regular Expressions" section
of perlre(1), and in the "Lookaround
Assertions" section of Chapter 5 of Programming
Perl; your system's grep(1) and
w(1) manpages; we talk about configuration files
in Recipe 8.16.