Quantify

Of course, we now have the problem that an ISBN might reasonably be written as ISBN: 9 or ISBN:9, perhaps with more than one space after the colon.

We clearly need a way to specify the number of repeats that are allowed in a matching string.

To do this we place a “quantifier” immediately after the item that is to be repeated.

The most commonly used quantifiers are:

* zero or more

+ one or more

? zero or one

{n} exactly n times

{n,} n or more times

{n,m} at least n at most m times
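The counted forms are perhaps the easiest to check by experiment. The following is a minimal sketch, assuming an invented test string of six digits; the variable names are hypothetical and only there to label the results:

```javascript
// Invented sample input: six digits in a row.
const digits = "123456";

// {3} takes exactly three digits.
const exactly3 = digits.match(/\d{3}/)[0];   // "123"

// {2,} takes two or more, here all six.
const atLeast2 = digits.match(/\d{2,}/)[0];  // "123456"

// {2,4} takes at least two and at most four.
const from2to4 = digits.match(/\d{2,4}/)[0]; // "1234"

// {7,} needs at least seven digits, so there is no match at all.
const sevenOrMore = /\d{7,}/.test(digits);   // false

console.log(exactly3, atLeast2, from2to4, sevenOrMore);
```

Notice that each quantifier takes as many digits as its upper limit allows.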

In many ways this is the point at which regular expression use starts to become interesting and, inevitably, more complicated. However, the basic idea is fairly simple: how many repeats are allowed for a match?

Perhaps the key concept is that * means something is optional, but + means it must occur. In both cases, whatever it is can occur multiple times. Contrast this with ?, which means optional but at most once.

For example:

/ISBN:\s*\d/

matches “ISBN:” followed by any number of white-space characters, including none at all, followed by a digit. Similarly:

/ISBN:?\s*\d/

matches “ISBN” followed by an optional colon (not multiple colons), then any number of white-space characters, including none, followed by a digit.
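The two ISBN expressions can be compared side by side. This is only a sketch: the sample strings are invented for illustration, and the names withColon, optionalColon and report are hypothetical.

```javascript
// The two expressions from the text.
const withColon = /ISBN:\s*\d/;      // colon required
const optionalColon = /ISBN:?\s*\d/; // colon may be omitted

// Invented sample strings covering the interesting cases.
const samples = ["ISBN:9", "ISBN: 9", "ISBN:   9", "ISBN 9", "ISBN::9"];

// Record which expression accepts which string.
const report = samples.map(function (s) {
  return {
    text: s,
    withColon: withColon.test(s),
    optionalColon: optionalColon.test(s)
  };
});

console.log(report);
```

Both expressions accept the first three strings. Only the second accepts "ISBN 9", and neither accepts "ISBN::9" because ? allows at most one colon.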

Greedy!

Quantifiers are easy but there is a subtlety that often goes unnoticed.

Quantifiers, by default, are “greedy”.

That is, they match as many entities as they can, even when you might think that the regular expression provides a better match a little further on. The easiest way to follow this is with a simple example.

Suppose you need a regular expression to parse some HTML tags:

<div>hello</div>world</div>

If you want to match just a pair of opening and closing tags you might well try the following regular expression:

var ex2 = /<div>.*<\/div>/;

which seems to say “the string starts with <div>, then any number, including zero, of other characters, followed by </div>”. If you try this out on the example given above you will find that it matches the entire string.

That is, the </div> in the regular expression is matched to the final </div> in the string, even though there is an earlier occurrence of the same substring.

This is because the quantifiers are greedy by default and attempt to find the longest possible match.

In this case the .* matches everything including the first </div>.

So why doesn’t the .* also swallow the final </div>?

The reason is that, if it did, the entire regular expression would fail to match anything because there would be no closing </div> left for it to find.

What happens is that a greedy quantifier first matches as much as it can. If the rest of the regular expression then fails, the engine backtracks, giving up one character at a time until the remainder of the expression can match.

Notice that all of the standard quantifiers are greedy and will match more than you might expect based on what follows in the regular expression.
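The greedy behavior is easy to confirm. A minimal sketch, using a test string with two closing tags; the names text, greedy and greedyMatch are hypothetical:

```javascript
// Test string with two closing tags.
const text = "<div>hello</div>world</div>";

// Greedy version: .* takes as much as it can.
const greedy = /<div>.*<\/div>/;

// match() returns an array whose first element is the matched substring.
const greedyMatch = text.match(greedy)[0];

console.log(greedyMatch);
// The match runs to the final </div>, that is, the whole string.
console.log(greedyMatch === text);
```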

If you don’t want greedy quantifiers, the solution is to use “lazy” quantifiers, which are formed by following a standard quantifier with a question mark.

To see this in action, change the previous regular expression to read:

var ex2 = /<div>.*?<\/div>/;

With this change in place the result of matching to

"<div>hello</div>world</div>"

is just the first pair of <div> tags, that is <div>hello</div>.

Notice that all of the quantifiers, including ?, have a lazy version and yes you can write ?? to mean a lazy “zero or one” occurrence.

The distinction between greedy and lazy quantifiers is perhaps the biggest reason for a reasonably well-tested regular expression to go wrong when used against a wider range of example strings.

Always remember that a standard greedy quantifier will match as many times as possible while still allowing the regular expression to match, and its lazy version will match as few times as possible while still allowing the regular expression to match.
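To see the difference side by side, here is a small sketch that runs greedy and lazy versions against the same strings. The variable names and the "aaa" test string are assumptions made for illustration:

```javascript
const text = "<div>hello</div>world</div>";

// Lazy: .*? stops at the first </div> that lets the expression match.
const lazyMatch = text.match(/<div>.*?<\/div>/)[0];   // "<div>hello</div>"

// Greedy: .* runs on to the last </div>.
const greedyMatch = text.match(/<div>.*<\/div>/)[0];  // the whole string

// The same contrast with + and ? on an invented string of three a's.
const greedyPlus = "aaa".match(/a+/)[0];      // "aaa"
const lazyPlus = "aaa".match(/a+?/)[0];       // "a"
const greedyOptional = "aaa".match(/a?/)[0];  // "a"
const lazyOptional = "aaa".match(/a??/)[0];   // "" (zero occurrences is enough)

console.log(lazyMatch, greedyMatch);
console.log(greedyPlus, lazyPlus, greedyOptional, JSON.stringify(lazyOptional));
```

Notice that the lazy ?? matches the empty string here, because nothing after it in the expression forces it to consume a character.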

Anchors

As well as characters, character sets and quantifiers, you can also use location matches or anchors.

For example, the ^ (caret) matches only at the start of the string, so /^ISBN:/ will only match if the string starts with ISBN: and doesn’t match if the same substring occurs anywhere else.

The most useful anchors are:

^ start of string

$ end of string

\b word boundary – i.e. between a \w and a \W, or at the start or end of the string next to a \w

\B anywhere but a word boundary

So for example:

/^\d+$/

specifies a string consisting of nothing but digits. Recall that the + symbol means match 1 or more times.

Compare this to

/^\d*$/

which would also accept an empty string.
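The anchors can be tried out in the same way. This sketch uses invented test strings, and the variable names are hypothetical:

```javascript
const allDigits = /^\d+$/;      // one or more digits and nothing else
const digitsOrEmpty = /^\d*$/;  // zero or more digits and nothing else

const digitsOnly = allDigits.test("12345");     // true
const hasLetter = allDigits.test("123a");       // false, the "a" blocks $
const emptyPlus = allDigits.test("");           // false, + needs one digit
const emptyStar = digitsOrEmpty.test("");       // true, * accepts none

// \b demands a word boundary on each side, \B forbids one.
const wholeWord = /\bcat\b/.test("a cat sat");     // true
const insideWord = /\bcat\b/.test("concatenate");  // false
const embedded = /\Bcat\B/.test("concatenate");    // true

console.log(digitsOnly, hasLetter, emptyPlus, emptyStar);
console.log(wholeWord, insideWord, embedded);
```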

One subtle point only emerges when you consider strings with line breaks.

In this case, by default, ^ and $ match only at the very start and the very end of the whole string.

If you want them to match at the beginning and end of each line in a multiline string, you have to specify the m flag, as in /^\d+$/m.
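The effect of the multiline setting can be seen with a two-line string. A minimal sketch; the string and the variable names are invented for illustration:

```javascript
// Two lines separated by a line break.
const lines = "first\nsecond";

// Without m, ^ matches only at the very start of the whole string.
const withoutM = /^second$/.test(lines);   // false

// With m, ^ and $ also match at the start and end of each line.
const withM = /^second$/m.test(lines);     // true

// Adding g as well collects every line that consists only of word characters.
const eachLine = lines.match(/^\w+$/gm);   // ["first", "second"]

console.log(withoutM, withM, eachLine);
```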