Infinite backtracking problem

The backtracking nature of regular expressions in JavaScript (and in most other languages that use the same type of regexp processing) may lead to extremely long, almost endless, search times.

For instance, try the example below:

alert( '123456789012345678901234567890z'.match(/^(\d+)*$/) )

Depending on your browser, it may give an error (“Regexp is too complex” in Firefox), hang (Chrome, IE), or execute normally if the regexp engine is smart (Opera).
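To see the slowdown without freezing the page, the same pattern can be timed on shorter inputs. The following is a minimal sketch, not a benchmark: the helper name timeMatch and the chosen lengths are assumptions made for illustration, and the timings depend entirely on the engine.

```javascript
// Time the same pattern on inputs of growing length.
// Lengths are kept small on purpose; much larger values may hang.
const pattern = /^(\d+)*$/;

function timeMatch(digits) {
  // Build a string of `digits` digits followed by "z".
  const input = '1234567890'.repeat(3).slice(0, digits) + 'z';
  const start = Date.now();
  const matched = pattern.test(input);
  return { digits, matched, ms: Date.now() - start };
}

const timings = [10, 14, 18, 20].map(timeMatch);
for (const t of timings) console.log(t);
```

In a backtracking engine the reported time tends to grow sharply as digits are added, while the result stays the same: no match.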

What’s so bad about the regexp? Actually, nothing. The problem is how the engine matches it. The regular expression is inherently simple, so we can see what’s going on.

To keep things short, let’s take a shorter string: 1234567890z:

First, the engine starts to match \d+. That + is greedy, so it matches as far as it can.

Then $ doesn’t match at that position.

So the greedy + has to release a character. The regexp backtracks (green arrow).

After backtracking, there’s 9 on the way.

It is matched as the second instance of \d+ (remember, we have (\d+)*):

As it doesn’t help, the regexp engine has to backtrack once more.

The initial \d+ match now shortens to 8 digits. The rest is matched as the next \d+:

Still, no match for $. Now the last \d+ backtracks and shortens.

The rest of the string is matched as the third instance of \d+:

That didn’t help either. The second and third \d+ have backtracked to their minimum, so the first instance shortens.

Now there are 7 characters for the first \d+. The regexp engine matches the rest as another \d+:

There’s no match, so the second \d+ steps back…

… Ultimately, it is easy to see that the engine will go through all possible ways of splitting the number into \d+ groups. That’s a lot: the count grows exponentially with the number of digits.
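To get a feeling for “a lot”, we can count the ways to cut a run of n digits into consecutive non-empty groups. This is a small sketch; the function name countSplits is made up for illustration, and it only counts complete splits, not the exact number of steps a real engine performs.

```javascript
// Count the ways to split n digits into consecutive non-empty groups,
// e.g. "123" -> 123 | 12,3 | 1,23 | 1,2,3  (4 ways).
function countSplits(n) {
  if (n === 0) return 1; // nothing left: one complete split found
  let total = 0;
  // Choose the size of the first group, then split the remainder.
  for (let first = 1; first <= n; first++) {
    total += countSplits(n - first);
  }
  return total;
}

console.log(countSplits(3));  // 4
console.log(countSplits(10)); // 512
console.log(countSplits(20)); // 524288
```

The result is 2^(n-1), so each extra digit doubles the count. For the 30 digits of the original example that gives 2^29, more than 500 million splits.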

A swift-minded person may say: “Backtracking? Let’s be lazy and there will be no backtracking!”.

Let’s replace \d+ with \d+? and see:

alert( '123456789012345678901234567890z'.match(/^(\d+?)*$/) )

It has no effect. A lazy quantifier makes the engine try the same combinations, only in reverse order. Think about how it goes.
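The order of the search can be imitated with a toy model. The code below is not a real regexp engine, only a simplified sketch of backtracking over (\d+)*$; the name countAttempts and its lazy flag are assumptions introduced for this example.

```javascript
// Simplified model of backtracking over /^(\d+)*$/.
// It only counts how many positions are attempted.
function countAttempts(str, lazy = false) {
  let attempts = 0;

  // Try to match "(\d+)*$" starting at position pos.
  function tryFrom(pos) {
    attempts++;
    if (pos === str.length) return true; // "$" matches at the end

    // How many digits are available for the next \d+ group?
    let max = 0;
    while (pos + max < str.length && /\d/.test(str[pos + max])) max++;

    // Greedy tries the longest group first, lazy the shortest first.
    const lengths = [];
    for (let len = 1; len <= max; len++) lengths.push(len);
    if (!lazy) lengths.reverse();

    for (const len of lengths) {
      if (tryFrom(pos + len)) return true;
    }
    return false; // no split from this position reaches the end
  }

  const matched = tryFrom(0);
  return { matched, attempts };
}

console.log(countAttempts('1234567890z'));       // { matched: false, attempts: 1024 }
console.log(countAttempts('1234567890z', true)); // { matched: false, attempts: 1024 }
```

When the string cannot match, every combination has to be tried before giving up, so in this model the greedy and the lazy order do the same amount of work.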

Backtracking may lead to a very large number of tested cases, as in the example above.

The symptoms are regexps that are either very slow or fail outright. Such a regexp may pop up unexpectedly; profiling helps to detect it.

To tell the truth, not all regexp engines behave like that. Optimizations and heuristics are applicable. For example, Opera works well here. But its code is closed, so there are probably other cases where its improvements don’t help.

The task is to check whether the text consists of numbers delimited by double colons '::'. There may be no number between double colons. Single colons are not allowed.
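One possible approach, given as a sketch rather than the definitive answer: make every repetition start with the literal '::', so the engine has only one way to split the text. The function name isValid is hypothetical, and treating an empty string as valid is an assumption the task does not settle.

```javascript
// Each repeated group must begin with "::", so there is no ambiguity
// about where one group ends and the next begins.
function isValid(text) {
  return /^\d*(?:::\d*)*$/.test(text);
}

console.log(isValid('1::22::333')); // true
console.log(isValid('1::::2'));     // true  (no number between double colons)
console.log(isValid('1:2'));        // false (single colon)
console.log(isValid('1:::2'));      // false (an odd colon is left over)
console.log(isValid('123456789012345678901234567890z')); // false, and it returns quickly
```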