Tuesday, 15 April 2008

A zero-width or zero-length match is a regular expression match that does not match any characters. It matches only a position in the string. E.g. the regex \b matches between the 1 and , in 1,2.
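To make the definition concrete, here is a minimal sketch that lists where \b matches in 1,2. The helper name matchPositions is made up for this illustration, and it relies on String.prototype.matchAll(), which is much newer than this post and is assumed to be available in your engine.

```javascript
// List every match of a regex as { index, length } pairs.
// matchAll() requires the g flag and steps past zero-length matches on its own.
function matchPositions(regex, subject) {
  return Array.from(subject.matchAll(regex), (m) => ({
    index: m.index,
    length: m[0].length,
  }));
}

console.log(matchPositions(/\b/g, "1,2"));
// Every entry has length 0; index 1 is the position between the 1 and the comma.
```

Each reported match has a position but no characters, which is all that "zero-length" means.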

Zero-length matches are often an unintended result of mistakenly making everything optional in a regular expression. Such a regular expression will in fact find a zero-length match at every position in the string. My floating point example has long shown this.
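A small sketch of the "everything optional" mistake follows. Both patterns and the helper findAll are illustrative choices for this example, not necessarily the exact regexes from the floating point example mentioned above.

```javascript
// Every part of this pattern is optional, so it can match the empty string.
const allOptional = /[-+]?[0-9]*\.?[0-9]*/g;
// Requiring at least one digit removes the zero-length matches.
const needsDigit = /[-+]?[0-9]*\.?[0-9]+/g;

// Return all matches as an array (empty array when nothing matches).
function findAll(regex, subject) {
  return subject.match(regex) || [];
}

console.log(findAll(allOptional, "a1.5b")); // [ '', '1.5', '', '' ]
console.log(findAll(needsDigit, "a1.5b"));  // [ '1.5' ]
```

The empty strings in the first result are the zero-length matches found at the positions where no number starts.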

Apparently, JavaScript developers have it particularly tough. Different browsers handle zero-length matches differently. Steven Levithan argues that IE has a bug because it increments lastIndex. Steven’s observation is correct. When iterating over /\b/g.exec(), regex.lastIndex = match.index + 1 in Internet Explorer, while in other browsers they’re equal. So who’s got it wrong?
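If you want to see which behavior a given engine has, a probe along these lines should do. The function name probeLastIndex and the returned field names are made up for this sketch.

```javascript
// Run one zero-length match and compare lastIndex with the match position.
function probeLastIndex() {
  const regex = /\b/g;
  const match = regex.exec("1,2"); // first word boundary is at index 0
  return {
    index: match.index,
    lastIndex: regex.lastIndex,
    // 0: lastIndex equals the end of the match (15.10.6.2 as written)
    // 1: lastIndex was incremented past the zero-length match (the IE behavior)
    delta: regex.lastIndex - match.index,
  };
}

console.log(probeLastIndex());
```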

The ECMA-262 standard defines lastIndex in 15.10.7.5:

The value of the lastIndex property is an integer that specifies the string position at which to start the next match.

It’s easy enough to understand this in the context where the developer sets lastIndex prior to calling exec() to make the match attempt start at a certain position. But how should the exec() method set lastIndex after a successful match?
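The first half of that, setting lastIndex before calling exec(), can be sketched like this. The helper execFrom is a hypothetical name used only for this example.

```javascript
// Start a match attempt at a chosen position by setting lastIndex first.
// lastIndex is only honored when the regex has the g (or y) flag.
function execFrom(regex, subject, start) {
  if (!regex.global) {
    throw new Error("execFrom expects a regex with the g flag");
  }
  regex.lastIndex = start;
  const m = regex.exec(subject);
  return m === null ? null : { index: m.index, text: m[0] };
}

console.log(execFrom(/[0-9]/g, "1,2", 0)); // finds the "1" at index 0
console.log(execFrom(/[0-9]/g, "1,2", 1)); // skips the "1" and finds the "2" at index 2
```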

For String.match() the standard says in 15.5.4.10:

If there is a match with an empty string (in other words, if the value of regexp.lastIndex is left unchanged), increment regexp.lastIndex by 1.

For String.replace() the standard says in 15.5.4.11:

Do the search in the same manner as in String.match(), including the update of searchValue.lastIndex.
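As a quick illustration of those two rules, the sketch below runs a regex that can only match the empty string in "abc" through match() and replace(). The helper names countMatches and markMatches are invented for this example.

```javascript
// Count the matches String.match() finds with a global regex.
function countMatches(regex, subject) {
  const found = subject.match(regex);
  return found === null ? 0 : found.length;
}

// Replace every match with a dash so the match positions become visible.
function markMatches(regex, subject) {
  return subject.replace(regex, "-");
}

// x* finds only zero-length matches in "abc": one at each of the 4 positions.
console.log(countMatches(/x*/g, "abc")); // 4
console.log(markMatches(/x*/g, "abc"));  // "-a-b-c-"
```

Both calls terminate because each zero-length match moves the next attempt one position forward.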

But for RegExp.exec() the standard says in 15.10.6.2:

Let e be r’s endIndex value [i.e. the end of the match]. If the global property is true, set lastIndex to e.

The standard contradicts itself. 15.10.6.2 is inconsistent with the three other definitions, in that it omits the +1 in case of a zero-width match.

My opinion, though, is that Internet Explorer got it right, and that browsers that implement 15.10.6.2 as written while ignoring the definition in 15.10.7.5 got it wrong. The omission of the lastIndex++ for regex.exec() looks to me like an oversight by the standards writers rather than something they did intentionally. The reason is that every regex engine that I know of works the way Internet Explorer does. Advancing past a zero-width match is the only way to avoid the kind of infinite loop that Firefox gets into.

If a zero-width match is found, the next match attempt begins one character further ahead in the string. After \b matches between the 1 and , in 1,2, the next match attempt will begin at the position between the , and the 2 (and match there), rather than staying stuck forever.
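In JavaScript, an exec() loop that behaves this way in every browser can be sketched as follows. The function name allMatches is an assumption of this example; the guard only bumps lastIndex when the engine has not already done so.

```javascript
// Collect all matches with exec(), stepping past zero-length matches by hand.
function allMatches(regex, subject) {
  if (!regex.global) {
    throw new Error("allMatches expects a regex with the g flag");
  }
  const results = [];
  regex.lastIndex = 0;
  let m;
  while ((m = regex.exec(subject)) !== null) {
    results.push({ index: m.index, text: m[0] });
    // If the engine left lastIndex at the match position after an empty
    // match, move it one character ahead so the loop cannot get stuck.
    if (m[0].length === 0 && regex.lastIndex === m.index) {
      regex.lastIndex++;
    }
  }
  return results;
}

console.log(allMatches(/\b/g, "1,2").map((r) => r.index)); // [ 0, 1, 2, 3 ]
```

Without the guard, an engine that sets lastIndex to the end of an empty match would return the same match at index 0 forever.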

I do understand where the confusion comes from. The property is called lastIndex, but the standard defines it as something that should be called nextAttempt. lastIndex is not the end of the previous match. The ECMA-262 standard does not provide a property for that. To get that you have to add up match.index and match[0].length yourself.
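That addition is short enough to wrap in a helper. The name matchSpan and the start/end field names are made up for this sketch.

```javascript
// Compute where a match starts and ends without relying on lastIndex.
function matchSpan(regex, subject) {
  const m = regex.exec(subject);
  if (m === null) return null;
  return { start: m.index, end: m.index + m[0].length };
}

console.log(matchSpan(/[0-9]+/, "ab123cd")); // { start: 2, end: 5 }
console.log(matchSpan(/\b/, "1,2"));         // { start: 0, end: 0 }
```

For a zero-length match, start and end are the same position.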

8 Comments

Jan, thanks for your thoughtful and detailed follow-up. It’s gotten me to think more about the issue, and while I would agree that ECMAScript’s use of lastIndex is poorly designed, it doesn’t technically contradict itself. See my response to your comment on my blog.

Why use interpositions? Are there any benefits from zero-length Regex matches, beyond delimitation of genuine matches? Basically, why hasn’t this nonsense been deprecated and avoided by now – what fundamental reason have I missed?

The same confusions arise time and time again, and it costs us all a fortune.
Microsoft dodged this issue in C# substring indexing by making the second arg the actual length of the required substring, essentially shielding the coder from absurd internal implementation. I wish this was more widespread.

Common uses of zero-length matches include using ^ (start of line) and $ (end of line) by themselves in a search-and-replace to insert something at the start or the end of each line.

In JavaScript too you get the position and length of the match. The lastIndex property is supposed to be an implementation detail that unfortunately surfaces because the implementations don’t agree. If there were many ECMA-334 (C#) implementations like there are many ECMA-262 (JavaScript) implementations, there would be similar compatibility problems.
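The ^ and $ use mentioned in this reply can be sketched in JavaScript like so. The helper names prefixLines and suffixLines are hypothetical, and the m flag is what makes ^ and $ match at every line rather than only at the ends of the string.

```javascript
// Insert text at the start of every line by replacing the zero-length ^ match.
function prefixLines(text, prefix) {
  return text.replace(/^/gm, prefix);
}

// Insert text at the end of every line by replacing the zero-length $ match.
function suffixLines(text, suffix) {
  return text.replace(/$/gm, suffix);
}

console.log(prefixLines("one\ntwo", "> ")); // "> one\n> two"
console.log(suffixLines("one\ntwo", ";"));  // "one;\ntwo;"
```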

Ok. I think I’ve got my head around this now. Thanks.
Someone else summarised it with “interpositioning saves having to work out whether you need to incr or decr your index/length value, which can be a chore.”
I think my problem with all this is the unintuitive terminology, and my growing conviction that, in modern codebases, there’s no benefit anymore. I hope we can alias all this confusion away with 1-based indexing across the board…

[…] This comes into play, e.g., when using a regex to iterate over all matches in a string. However, the fact that lastIndex is actually set to the end position of the last match rather than the position where the next search should start (unlike equivalents in practically all programming languages) causes a problem after zero-length matches, which are easily possible with regexes like /\w*/g or /^/mg. Hence, you’re forced to manually increment lastIndex in such cases. I’ve posted about this issue in more detail before (see: An IE lastIndex Bug with Zero-Length Regex Matches), as has Jan Goyvaerts (Watch Out for Zero-Length Matches). […]

What is the point of zero-length matches with wildcards, though? Maybe they have a role in search and replace, since a zero-length match can lead to insertion of text.

Perhaps there should be two modes of regex. One would be for checking that the text is of a certain form and capturing certain info, e.g. seeing that there’s an integer followed by a hyphen, etc., and capturing the indexes of these. The other mode would be for search and replace.

[…] (mdn), non word is obviously everything else! The trick for word boundaries is that they are zero width assertions, which means they don’t count as a character. That’s why /\w\b\w/ will never match, […]