I came across this article discussing the speed of regexp processing. The article concludes that Perl's regexp engine is slow compared to an approach called the "Thompson NFA" and that there is scope for enhancing Perl's regexp processing speed. While the article uses Perl as an example, the speed issue reported is not limited to Perl; it applies to other languages like Python and Ruby too.

I was surprised because, until now, I had always assumed that Perl set the bar for regexp and text processing. I never felt that Perl was slow for my regexp needs.

Please comment and discuss this. Have you felt that Perl regexp was slow and needed to be improved?

I have various feelings about this paper. I think if it weren't a veiled attack on the existing corpus of regex engine work it would be absolutely fantastic. If you read it as an explanation of using NFAs and DFAs to implement Kleene algebras then it's great. If you read it as an argument for why you shouldn't use backtracking NFA engines, or from the point of view of software engineering, then in my opinion it fails to impress.

A general purpose regex engine like the one Perl requires has to be able to do a lot, and has to balance considerations ranging from the memory footprint of a compiled object, construction time, flexibility, rich feature-sets, and the ability to accommodate huge character sets, to, most importantly of course, matching performance. And it turns out that while DFA engines have a very good worst-case match time, they don't actually have many other redeeming features. Construction can be extremely slow, the memory footprint vast, all kinds of trickery is involved in doing Unicode or capturing properly, and they aren't suitable for patterns with backreferences.

It turns out that for a general purpose engine, most of the patterns that people want to match have characteristics that mean the search space requiring a "real" regex engine can be efficiently pruned. For instance, /most/ patterns people want to match in real life have a fixed substring in them, and the beginning and end of the match can be efficiently determined from the location of that substring in the string being matched. This means that specialized string matching algorithms that are more efficient than DFAs (like Fast Boyer-Moore) can be used to find this substring, and only when it is found is the real engine used to verify that the match is valid. In many cases this can result in much faster match times than a DFA can provide, even when the verifier is an NFA with backtracking.
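As a rough illustration of this prune-then-verify strategy (not Perl's actual implementation), here is a Python sketch in which `str.find` stands in for the FBM scan and `re` plays the part of the backtracking verifier; the pattern and helper names are made up for the example:

```python
import re

# Sketch of "find the fixed substring first, verify with the real engine
# second". In Perl the cheap scan is Fast Boyer-Moore; here Python's
# str.find plays that role, and the compiled regex is the verifier.
def prefiltered_search(compiled, fixed, text):
    if fixed not in text:
        return None              # no fixed substring => no possible match,
                                 # the regex engine never runs at all
    return compiled.search(text) # substring present: let the engine verify

pat = re.compile(r'\d+-foo-\d+') # 'foo' is the required fixed substring

m = prefiltered_search(pat, 'foo', 'id 123-foo-456 ok')
print(m.group(0))                            # '123-foo-456'
print(prefiltered_search(pat, 'foo', 'nothing here'))  # None, engine skipped
```

A real engine derives the fixed substring (and tighter match-window bounds) from the compiled pattern automatically; the point here is only that the cheap scan rejects most non-matching input before any backtracking can happen.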

On a similar line a general purpose engine rarely needs to deal with "degenerate patterns" of the type illustrated in the article and many patterns that are degenerate in the current engine need not be. Auto-possessivization is an open optimisation that would go a long way to reducing the number of patterns that would result in degenerate matching, which is itself blocked from being too bad by the superlinear cache. In a similar vein an NFA can be made more efficient by judicious use of explicit atomic subpatterns and possessive quantifiers, both of which are a form of compiler hinting.

So, on balance, it turns out that an NFA can perform quite competitively with a DFA, while at the same time being much more flexible and powerful. Sure, there will be classes of patterns that an NFA will not handle well, but on the other hand there are patterns that a DFA cannot do at all, which is less useful than doing them slowly. At the end of the day I feel the balance of factors is on the backtracking NFA's side. I think there is a reason why there aren't many industrial-quality DFA engines out there: they simply aren't suited for the job.

I think the truth is that for the messy real-life problem of "pattern matching" no one approach on its own will suffice. It seems to me that a hybrid approach is called for, one that blends the benefits of the DFA with the benefits of the NFA. In fact this is the direction things have been going in the Perl engine: an NFA framework with DFA-like components. In Perl 5.10 we introduced a TRIE regop which allows for efficient matching of alternations starting with literal text, and with a bit more work the framework used there could be extended to handle DFA-type construction for degenerate patterns like the one in that article.

Anyway, in Perl 5.10 you'll be able to use whatever regex plug-in you like, so perhaps we will see Russ Cox contribute an implementation to prove his case.

I've been thinking of the "atomic subpatterns" and possessive quantifiers as pruning operations. The entire search space may be exponential but entire branches can be eliminated if you just code your pruning operations into your expressions.

Exactly, they are hints to tell the matching engine not to bother backtracking. Hypothetically the optimiser should be able to determine them all automatically and you shouldn't need them, but in implementation terms it's a lot easier to let people do it themselves.

Auto-possessivization is something that sure does need doing. If you consider the general case XqY, where X and Y are literals and q is a quantifier, you should be able to rewrite Xq as the possessive Xq+ whenever X cannot overlap Y. I've not got round to it yet, though.
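For readers who want to play with this: Python's `re` (which backtracks much like Perl's engine) can emulate an atomic/possessive group with the `(?=(X))\1` trick — the lookahead matches X greedily, the backreference then consumes exactly what was captured, and nothing can backtrack into X afterwards. This is a sketch of the idea, not how either engine implements it internally:

```python
import re

# a+b is safe to make possessive (a++b) because 'a' cannot overlap 'b'.
# Python has no possessive quantifiers here, so we emulate a++ as (?=(a+))\1.
greedy = re.compile(r'a+b')
possessive = re.compile(r'(?=(a+))\1b')   # emulated a++b

# With no overlap between X and Y, the two forms accept exactly the same strings.
for s in ['aaab', 'b', 'aaa', 'aabab']:
    assert bool(greedy.search(s)) == bool(possessive.search(s))

# The no-overlap condition matters: in a+ab, 'a' CAN overlap the following
# literal, so possessivizing it changes the meaning and loses matches.
assert re.search(r'a+ab', 'aaab')                # matches: a+ gives back one 'a'
assert not re.search(r'(?=(a+))\1ab', 'aaab')    # possessive form cannot
```

This is exactly why auto-possessivization has to prove that X cannot overlap Y before applying the rewrite.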

Hmmm ... I read the paper and didn't see anything that resembled an attack. To me it felt more like:

"Hey gang, there was a fork in regex technology XX years ago and the fork we're on now has some pathological cases. This other fork doesn't have those pathological cases, but has these other shortcomings. Perhaps we can take the creamy caramel center of the NFA and wrap it in the chocolaty goodness of the DFA and get something much tastier?"

I totally agree with your reasoning of why we don't need to go that route. On the other hand, since reading the paper, my gears have been a-grinding with attempts at finding a way to merge backtracking into DFA without consuming insane amounts of storage. It's quite an entertaining mental exercise, even though I've gotten nowhere yet. (I don't actually expect to get anywhere, but it's still fun to try...)

Well, let's see: the title, and the lack of any kind of analysis of how a backtracking NFA can mitigate its weaknesses.

For instance, the pattern used for the performance analysis could be optimised to /a{n,2n}/, and if written by a human it probably would be. The optimised construct would not exhibit the poor performance that he graphed.
In the section marked "Real world regular expressions" he mentions that an NFA or DFA would do the opposite (unrolling the expression), but he doesn't discuss the impact of such a translation.
Unrolling such a construct would result in an unmanageably large construction. A DFA that matches a{1,30000} is going to need at minimum 30k states and transitions. That's big. The same thing would be represented in a backtracking NFA like Perl's in what, four regops? Let's be generous and say that each regop is 10 words (it's more like 1 or 2 per regop), so we would need 40 words to represent that pattern in a backtracking NFA. A DFA or a Thompson NFA would need 30k states; even assuming each transition is only a word, that is still a /huge/ difference in memory footprint, and in construction time. Memory for 30k nodes would have to be allocated, and so on. My bet is that before either a DFA or a Thompson NFA had finished construction, the Perl backtracking NFA would have already finished the match.

Another thing: the benchmarks showed only one side of things, the performance of an accepting match. How fast will Thompson construction be when the match fails? Perl's regex engine will reject the match in fewer than N character comparisons because it will use FBM matching to locate the fixed a{n} segment. It's only when it matches that it behaves poorly, and that can be mitigated by using a better construct, which hypothetically could be optimised automatically at the engine level.

So it's as if the paper takes the tone "this is much better than that" and then glosses over all the hard stuff, doesn't mention the negative tradeoffs of the "better" solution, gives no credit to the existing engines and the solutions they use, and makes no admission that there are optimisation techniques for working around the potential degenerate cases that might be encountered.

First off, algorithms can always be improved. There is generally a compromise between pure processing speed and many other considerations, such as RAM usage and code maintainability. Perl generally sacrifices RAM for speed, and considers speed and maintainability to be about equal (knocks on the codebase aside).

Second, you must understand that Perl set the bar for regexes when Perl 5 was released over 10 years ago. There's a reason why the primary C library for regexes is called "pcre", for "Perl-Compatible Regular Expressions". Since then, a lot of theoretical work has been done. Not all of that work has been put into the current engine, for reasons such as:

The change doesn't provide enough benefit for the risk of making the change

The change breaks a feature that has to work

The change hasn't been proven to do what it's claimed to do

Remember - every script written for Perl waaaay back to 1.0.0 is still executable in 5.8.8; backwards compatibility is a major concern for p5p. Also, the people who have both an understanding of the Perl engine and the necessary time to work on it are few and far between. And, honestly, many of them are hard at work fixing bugs, adding other features, and working on Perl 6.

And, lastly - fast is as fast does. Perl is "fast enough" for me and my clients, and that includes the regex engine. I'm not a baremetal speed freak.

My criteria for good software:

Does it work?

Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

No, I didn't miss that. New code is always being written to implement old algorithms. Just because an algorithm exists doesn't mean that there is an efficient implementation of it. Theoretical work has advanced in the implementation of these algorithms and in how the various algorithms can be used to solve the problems Perl's regexes need to solve.


I'm sure that demerphq will comment further, but the article glosses over important implementation details such as Unicode, lookaround, and backreferences, all of which are left as an exercise for the astute reader. This is all well for an academic paper, but it doesn't have much bearing on the things Perl does unless it's accompanied by working code.

The author chooses to represent character classes as alternations, which is impractical with Unicode and impractical even memory-wise with a class like [^a], which uses at least 32 bytes of memory in ASCII space and must cover all of the Unicode space otherwise.

Zero-width assertions are unimplemented, and the author describes them as "hard but possible in general"; for backreferences the author even suggests using two different engines, because the machine cannot do backreferences.

I'm sure somebody with any insight into how regular expression engines work nowadays (and Perl's especially) can point out where the technique might be applicable, but I don't see how the method can replace the Perl regex engine as a whole.

I ran a quick program to sort the regular expressions used in core perl into things that could be turned into DFAs and the ones that can't (anything with backreferences). There are 1075 transformable regexps and three non-transformable regexps. Potentially we could benefit by using a DFA engine when the regexp doesn't require backreferences. This has the drawback that the behaviour is going to change. There are usually many potential results from matching a regexp against a string, but because of the expected order of execution we get the one we desire. With a DFA, the results will come out in a different order. This is something you'd potentially want to indicate is acceptable with a pragma.
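A crude version of such a classifier can be sketched in a few lines of Python; the scan below is deliberately naive (it would misread an escaped backslash like `\\1`), and the opcode-level analysis described above is far more precise:

```python
import re

# Rough classifier: a pattern containing a backreference (\1 .. \9) cannot
# be handled by a pure DFA; everything else is in principle transformable.
# NOTE: this is a toy -- it scans the pattern text rather than parsing it,
# so a literal escaped backslash followed by a digit would fool it.
BACKREF = re.compile(r'\\[1-9]')

def dfa_transformable(pattern):
    return BACKREF.search(pattern) is None

assert dfa_transformable(r'^\d+aaa$')       # plain regular expression
assert not dfa_transformable(r'(.)\1')      # backreference => needs an NFA
```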

You really want finer-grained control than a pragma can provide easily. Perl 6 regexes currently distinguish || (sequential semantics) from | (DFA semantics, at least up to the "longest token" point). This is meant to be reminiscent of the distinction between short-circuit and junctional logic in the rest of the language.

Here's the program I wrote to characterize regular expressions. I dumped the output of -Mre=debug after compiling everything in perl and ran this program on that output. It turns out that it decides only three regexps *require* an NFA, 750 require at least a DFA, and 325 could be done with just a plain Boyer-Moore search.

I think you could rework this to only include the [A-Z]+ part of the opcode; if there is a number following it, it's not important. In other words, just look for OPEN, not OPEN1, etc.

Also I think that you'll need to add MINMOD to the list of non-DFA'able constructs, and maybe put OPEN in there as well, since capturing with a DFA is distinctly non-trivial (actually I'm not even sure you can do it with a pure DFA; maybe with NFA-based DFA simulation). And why is NOTHING in both lists?

I have to admit to getting lost trying to read that article. However, ignorance is no reason to fail to comment ;-) In all honesty, I pretty much stopped reading after the first pair of graphs, once it appeared that they were picking some rather pathological cases. They were picking on Perl's well-known worst-case scenarios, and comparing them to another regex engine that had chosen different optimisations. I'm sure that someone here could pick another pathological expression that would highlight Perl's vast superiority in optimisation on that string - which, most likely, would completely avoid the regular expression engine. Neither one would be a fair comparison.

Where grep and awk really fall down is that to do anything remotely interesting as far as complex programming is concerned, you need to encapsulate them in a shell script. At that point, you have the overhead of forking and execing hundreds or thousands of subprocesses (including awk and/or grep), which probably vastly overshadows any possible perl slowdown on regular (non-pathological) data.

The Perl porters have long known about pathological expressions that will continue to run after the heat-death of the universe. This doesn't mean those patterns should be banished, but with a little care, and a little thought, Perl probably offers the informed programmer techniques for supplying hints to the regular expression engine to avoid such exponential blowups.

I also find it interesting that Perl allowed me to write a very small amount of code in order to explore the problem. And the solution here is to not use greedy matching.

Change 'aaa' =~ /a?a?a?aaa/ to /a??a??a??aaa/ and you have to push n out to about 100000 before you begin to notice any appreciable slowdown in the matching. Take the following for a spin:

Try running perl rematcher 20 and then perl rematcher 20000 ??. I get better than three orders of magnitude improvement. Not bad for five minutes' thought. Seeing how the Thompson graph climbs slowly but steadily from the beginning, I would not be surprised if Perl continued to outperform it all the way out until things begin to swap.
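The same effect is easy to reproduce in Python, whose `re` engine backtracks like Perl 5's; this sketch simply times the two forms of the pattern against the same string:

```python
import re
import time

# The greedy form a?a?...a?aa...a is exponential in n on a string of n a's:
# the engine must try a huge number of ways to distribute the a's before it
# finds the one where every optional matches empty. The lazy form
# a??a??...a??aa...a tries the empty option first and succeeds immediately.
def timed_fullmatch(pattern, text):
    start = time.perf_counter()
    m = re.fullmatch(pattern, text)
    return m, time.perf_counter() - start

n = 20
text = 'a' * n
m_greedy, t_greedy = timed_fullmatch('a?' * n + 'a' * n, text)
m_lazy, t_lazy = timed_fullmatch('a??' * n + 'a' * n, text)

assert m_greedy and m_lazy    # both forms match the same strings...
assert t_lazy < t_greedy      # ...but the lazy form is dramatically faster
```

The two patterns describe exactly the same language; only the order in which the backtracking engine explores alternatives differs, which is the whole point of grinder's fix.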

update: after finishing the article, I retract the statement that the author is an idiot. He does know his stuff, however, I still don't agree with the conclusion. He says that DFAs are better because you don't run into grief with patterns like /(.*)(.*)(.*)(.*)(.*)/ since it doesn't go exponential. This is just silly. If anything, it encourages sloppy programming.

Backreferences allow one to solve a much larger range of problems that DFAs cannot. That's why NFAs are in such widespread use. And if it takes a little education and experience to wield NFAs with assurance, then so be it.

Of note is that Perl in particular already has safeguards in place to prevent exponential catastrophes. At least for some large class of pathological patterns, it won’t match them very quickly, but neither will it explode in nearly the same way as more naïve engines. The downside is that the required bookkeeping imposes a performance hit on all matching, which makes Perl’s engine slower than that of languages with more naïve engines, like Java. tilly wrote a great, long article on this subject some time ago.

Thank you a thousand thousand times, grinder! Your note about using non-greedy regexes (seems obvious in hindsight) reduced some of my large text file parse script runtimes by roughly 70%. I was searching very generally for speeding up large regex operations and, as much as my eyes were starting to glaze over, you managed to clearly communicate a rather simple yet important concept without being obtuse or pedantic. If only more searches resulted in such education. :)
/salute

The speed of the Perl regex engine isn't all that important unless it is too slow for a particular problem you are trying to solve. If the problem is so time-sensitive that the Perl regex engine is your biggest bottleneck (you did benchmark, right?) then it might be time to rethink your solution - perhaps moving away from regexes, or perhaps using a language built for speed.

Where Perl really shines is the ease with which you can use regular expressions, and the useful (if monstrous) extensions such as (?{ ... }). Using regular expressions in PHP or C with PCRE makes you really appreciate Perl. Older Java versions were a bit annoying as well, but they have improved greatly in recent years.

Use whatever happens to be the right tool for the job. It won't always be perl, but you can rejoice when it is.

I wrote the article. Whether you think it applies to Perl depends on some usually-unstated beliefs about regular expressions and how they should be used.

Regular expressions started out as a notation for describing
the strings that they match: a(bb)*a says "first an a, then some
number of bb, then an a". They didn't say anything about how
you had to take an input string and check whether it matched,
and in fact there were multiple algorithms used for doing so.
As a computational measure, regular expressions describe
sets of strings that can be matched in linear time with constant space,
using an algorithm that isn't obvious from looking at the regular
expression notation.

Perl's regular expressions have come to be viewed as a notation
for an algorithm that matches strings. Operators like backreferences
started the trend long before Perl, but new operators like recursion
and, to a lesser extent, lookahead and lookbehind assertions,
certainly continued the trend quite a ways beyond where it had been.
These features are most easily defined in terms of the underlying
matching, not in terms of the strings they describe.
And of course, as everyone loves to say, Perl's regular expressions
are not regular expressions anymore. (While this is true, a large fraction of the ones actually written are true regular expressions and could be treated as such.)

The big difference between these two is whether you think regular
expressions should be treated as descriptive (here's the effect I want)
or prescriptive (here's how I want you to do it).
People who accept the existence of so-called pathological regular expressions
implicitly treat regular expressions as prescriptive:
"we gave you a lot of rope, so of course you can manage
to hang yourself."
I believe that until one crosses into the not-too-often-visited
land of exotic Perl regular expression features like
generalized assertions, recursion, and backreferences,
it is best to treat regular expressions as descriptive.
Then the regular expression engine can implement the expression
efficiently no matter what you've typed.
Continuing the strained analogy from before,
"you can have all the rope you need, but the regex engine is going to
do the knot-tying for you once you tell it what you want to do,
so there is no way you can hang yourself."

In the prescriptive regex mindset, programmers have to micro-optimize
their regular expressions, knowing how they get implemented,
in order to avoid so-called pathological behavior by avoiding
potentially-expensive operations except when they are sure
(and hopefully correct!) that "they won't be expensive in this particular case."
In the descriptive regex mindset, there are no pathological regular expressions.
If there is some required substring, then fine, do that as a further
optimization. But the programmer can be guaranteed that running a
regular expression will not be expensive, unless it uses advanced
features like assertions, recursion, or backreferences. Right now *
and ? are potentially-expensive features, and they needn't be.

If the Perl programmer is the one writing the regular expressions,
then maybe there isn't much benefit to not having to learn how to
"optimize" regular expressions.
After all, one has already had to
learn Perl; learning how to
write regular expressions that Perl implements efficiently is not
much more work.
However, if the Perl programmer isn't the one writing the regular
expressions, then there is a lot of benefit to knowing that a regular
expression search isn't going to run for years and years.
As an example, let's consider a CGI script that is going to accept
regular expressions as search keys.
The programmer needs to execute those searches,
and it would be very useful if Perl could be
relied upon to run them all efficiently, or at least to flag the
ones that might not run efficiently because they have things like recursion.
Perhaps there could be a regex flag that says "run this in linear time,
and if you can't, don't run it at all".
Then it would be plenty safe to take a regexp from the user and
use it to run a search.
But with the current Perl regexes, no way.
Lest you think the CGI example is contrived,
consider Jeffrey Friedl's
Japanese-English dictionary,
which accepts regular expressions as search terms.
The regexp lookup is really slow already (maybe the host machines
are overloaded) but I bet it would be very very slow if you
gave it a regular expression like ((((.+)+)+)+)+(aaa|bbb).
(Cutting off the search isn't quite correct here, since there are some
entries that do match aaa, at least if you are
searching for English.)

It's great that Perl 5.10 is going to have a hot-swappable regular
expression engine, because then maybe someone could build one
that handles the true-regular-expression notation in guaranteed
linear time and then Perl programmers could
ask for it if they wanted a guarantee of predictable behavior.
Even better, that would pave a way to having Perl check for
the non-regular operators, and if they weren't there,
choose the linear-time implementation automatically.

In short, the article is not about making Perl's regex matching
100x faster across all programs. It is about making it more
predictable and consistent across all programs.
And just to be clear,
that is no more incompatible with whatever optimizations you might add
(like exact string search) than backtracking is.

Thanks. You'll get some pushback from the curmudgeons, but by and large I think we're a healthy enough community that we're actually trying to learn from our mistakes, at least every now and then. Certainly that applies to regex syntax; I think it's somewhat ironic that, just as the rest of the world is adopting Perl 5 regex syntax, Perl 6 is abandoning it. :-)

Anyway, I'd be very interested in any suggestions you might have regarding the hybrid approach we're taking in Synopsis 5, especially the implicit longest-token semantics.

It certainly sounds like distinguishing | and || is a good idea. I do not claim to fully understand everything in Synopsis 5, but the general idea seems like being on the right track.

One interesting gotcha (I can't tell whether it is handled correctly or not). Consider the Perl 5 regular expression (.*)\1 run on (("a" x $n) . "b") x 2. It should match. However, figuring that out requires trying many possible matches for the (.*) part: the first try would be the whole string, then the whole string minus one char, etc., until you got down to just half the string.

It is not clear to me that the longest-token semantics would get this right: it sounds like they would grab the whole string during (.*), fail to match in the equivalent of \1, and then give up.

All this is just to say that backreferences are even harder than they first appear, not that the longest-token idea is flawed.
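The behaviour described above is easy to observe in Python, which uses the same backtracking semantics for backreferences:

```python
import re

# (.*)\1 anchored to the whole string: the greedy (.*) first grabs the
# entire string, and the engine must shrink it one character at a time
# until the capture is exactly the first half, at which point \1 matches.
n = 5
s = ('a' * n + 'b') * 2          # 'aaaaab' repeated twice
m = re.fullmatch(r'(.*)\1', s)
assert m is not None
assert m.group(1) == 'a' * n + 'b'   # capture settles on the first half
```

A longest-token strategy that committed to the first (longest) capture without retrying shorter ones would miss this match, which is the gotcha being raised.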

Going by my experience with the trie optimisation, the only difference between leftmost-longest and longest-token will be that you don't need to sort the possible accepting pathways through an alternation when cycling through them to find the one that allows the following pattern to succeed.
Assuming you use a stack to track all the possible accepting pathways, then in the case of longest-token they will already be sorted, so you just keep popping them off the stack until one matches. The trie code, however, has to process them in the order they occur in the alternation, which in the perl 5.9.5 implementation involves a linear probe through the list to find the best match each time. (The assumption is that the list will most often be very small.)

The big difference between these two is whether you think regular expressions should be treated as descriptive (here's the effect I want) or prescriptive (here's how I want you to do it).

Well, both are at play in the Perl engine. If it were fully prescriptive then the engine wouldn't be able to do things like fixed-string optimisations.

Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way.

Doesn't this criticism apply equally to Thompson's algorithm or to DFA construction? I would have thought the only difference would be that in a DFA you'd see performance issues with compilation and not execution.

And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

Sure. But the question is: will construction time + FBM + verification be faster for a BNFA (backtracking NFA) than for a DFA? And will the DFA consume radically more memory than the BNFA? My position is that most likely the DFA will win only on degenerate patterns. The rest of the time my feeling is that the BNFA will win hands down, mostly because of how cheap construction is.

It's great that Perl 5.10 is going to have a hot-swappable regular expression engine, because then maybe someone could build one that handles the true-regular-expression notation in guaranteed linear time and then Perl programmers could ask for it if they wanted a guarantee of predictable behavior.

Yes, this is exactly why I was keen on making the regex engine pluggable. We have seen one proof of concept for it, but it's not well tested. It would be nice to see someone do a plug-in that uses a different algorithm.

Even better, that would pave a way to having Perl check for the non-regular operators, and if they weren't there, choose the linear-time implementation automatically.

I agree, it would be nice to automatically swap out to a better algorithm under some circumstances, assuming you did so only when the matching semantics were equivalent to those provided by leftmost-longest.

Thanks for writing the article. Regardless of the debate of backtracking NFA versus more DFA like structures I think it was well written and quite informative.

Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way.

Doesn't this criticism apply equally to Thompson's algorithm or to DFA construction? I would have thought the only difference would be that in a DFA you'd see performance issues with compilation and not execution.

It depends on the implementation. If you use Thompson's algorithm as implemented in nfa.c from the article, then you end up with a run-time that is O(m*n) for length-m regexp and length-n text.

It is true that if you pre-compile the DFA, then you end up with O(2^m * n) time, but you can build the DFA on the fly and you're back to O(m*n), just with a smaller constant than the nfa.c version.

And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

Sure. But the question is: will construction time + FBM + verification be faster for a BNFA (backtracking NFA) than for a DFA? And will the DFA consume radically more memory than the BNFA? My position is that most likely the DFA will win only on degenerate patterns. The rest of the time my feeling is that the BNFA will win hands down, mostly because of how cheap construction is.

This is a myth. Like most myths it is based in a truth that is no longer true. When people precomputed the entire DFA before starting the search, like the original egrep did, construction was in fact pretty expensive. But no one does that anymore. They build out the DFA on the fly, as described in the article.
If you build out the DFA on the fly, then the "compilation" or "construction" passes are more or less the same as for backtracking: basically you just have to syntax check the regexp and build some internal representation of a parse tree. If you look at the big graph near the end of the article, you'll see that the nfa.c in the article has cheaper construction cost than all of the fancier implementations, by about a factor of 4. And I wasn't really trying.

Construction costs are what you engineer them to be -- neither approach has any advantage here.
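For the curious, the lazy construction can be sketched in a few lines of Python over a hand-written NFA transition table. The real thing also handles epsilon transitions and bounds the cache size; the table below is just an illustrative NFA for (a|b)*abb:

```python
# "Build the DFA on the fly": start from an explicit NFA transition table
# and compute DFA states (sets of NFA states) lazily, caching each
# (state-set, char) -> state-set transition the first time it is needed.
# Nothing is precomputed, so up-front construction cost is just the table.
nfa = {                     # state -> {char: set of next states}
    0: {'a': {0, 1}, 'b': {0}},   # loop on a/b; 'a' may start the suffix
    1: {'b': {2}},
    2: {'b': {3}},
}
accepting = {3}
cache = {}                  # the DFA transition table, discovered lazily

def step(states, ch):
    key = (states, ch)
    if key not in cache:    # first time only: compute and remember
        cache[key] = frozenset(
            t for s in states for t in nfa.get(s, {}).get(ch, ()))
    return cache[key]

def matches(text):
    states = frozenset({0})
    for ch in text:
        states = step(states, ch)
    return bool(states & accepting)

assert matches('aabb') and matches('babb') and not matches('abab')
print('DFA transitions actually built:', len(cache))
```

Only the handful of state-sets the input actually reaches ever get materialized, which is why the exponential worst-case DFA size rarely bites in practice.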

In order to satisfy that "descriptive" you would have to build DFA if for no other reason then to make all |-s equal and to deduce non-greedy multipliers. Just to remind you that once you cross the Rubicon to "descriptive" non-greedy intent has to be deduced.
So, at the very least, you would incur perf hit all the time for the sake of a fringe case. Doesn't really matter if you would hide it behind the dynamic building. So if the DFA has the exponential explosion of states you would eventually reach the point of explosion. Sorry I can't find the one that Ullman used to throw at unsuspecting souls :-), but I'm sure you'll find enough of these on the net ...
You might get away with a stuff like "on the fly" DFA for something that is one-time run, but in such case the whole debate is artificial and academic.
For a real world run, when say I have a regex compiled all the way to the binary, say in .NET, and am paying the price of never being able to "unload" it for the lifetime of the binary, and the binary happens to be a server required to hit 99.9999 and is running that regexp on all incoming text ... I think you get the picture :-) and what's the purpose of non-greedy multipliers and all these assertions you dislike so much and why even a thought that someone might be building an DFA on the fly is not a realistic option. Even if it could handle back-references, or more importantly a recursive parse with implicit stack - which it can't.
See, we actually need the interpretative, ahem interpretation :-) of the regex and the one that is stable across languages and platforms - thereby the PCRE and the only real "price" we "pay" is that we use non-greedy multipliers and we are done. Most horrible thing, isn't it :-)
Now we could debate that we might prefer to have non-greedy as a default but that's a historical note :-) Also a matter of how precisely one wants to define the input he'll take in. Some people may insist on using negative classes instead, but they don't write regexps to match a in a :-).
As for the back-refs and recursive parse, that's going to be more in demand, not less. For example .NET implementation of Perl regex was designed to be able to parse XML without a perf hit and while I couldn't say if Perl would take a perf hit for something like that, it definitely can parse XML with regex much easier then .NET.
So, you can go on cursing everyone, starting with Thompson himself :-), or you can come to terms with the fact that regex engines in widespread use are practical tools, not academic toys. Claiming that one engine is "slow" because it has a problem with something as contrived as a?{3}a{3} on "aaa", while it runs a{0,30}?a{30} without a hiccup, looks like the ultimate exercise in futility. If for some reason you really, honestly need to match something like that, well, you can always write a regex to transform "a?{n}a{n}" into "a{0,n}?a{n}" :-)) Might need back-refs to handle more convoluted cases, though :-)))))
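The poster's rewrite is easy to illustrate in any backtracking engine. A small sketch in Python (used here only because Python's re is a backtracking engine of the same family; the counts are kept tiny so both forms finish instantly):

```python
import re

# The contrived pattern a?{3}a{3} (three optional a's followed by three
# required a's) and its rewrite a{0,3}?a{3} accept exactly the same
# strings; only the backtracking behaviour differs as the counts grow.
contrived = re.fullmatch(r"(?:a?){3}a{3}", "aaa")   # the "slow" form
rewritten = re.fullmatch(r"a{0,3}?a{3}", "aaa")     # the non-greedy rewrite
print(contrived is not None, rewritten is not None)  # True True
```

With counts of 3 both are instantaneous; the divergence only appears as the repetition count climbs, which is exactly the poster's point about how contrived the worst case is.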
In the meantime I will happily cherish Perl/PCRE/.NET etc. regexes in all their universal presence, and the benefits they bring, for all people with real needs for fast string parsing.

As for the back-refs and recursive parse, that's going to be more in demand, not less. For example .NET implementation of Perl regex was designed to be able to parse XML without a perf hit and while I couldn't say if Perl would take a perf hit for something like that, it definitely can parse XML with regex much easier then .NET.

I'm not sure what performance hit you mean. Although ambrus had some neat ideas for something like the super-linear cache with regard to pattern-level recursion that I have yet to follow up on.

Also, juerd has published a validation regex for XML that is derived from the EBNF that the W3C has posted. The story that you can't parse XML or HTML with regexes is going to go up in a puff of smoke when Perl 5.10 comes out.

The title of the article raises warning flags: 'Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)'.

The title makes a far broader claim than the body of the article. All the article shows is that a Thompson NFA doesn't choke on the specific pathologies that Perl does. At a cursory glance, there is nothing to demonstrate the broader claim that regex matching is unconditionally "slow" in Perl, etc.

You won't find anything useful about NFAs and DFAs in Friedl's book. What's in Friedl's book about that topic is misleading at best. Friedl freely admits that he doesn't care for the underlying theory.

I think it's a bit unfortunate how you phrased that response ("nothing useful about NFAs and DFAs"), since it doesn't make clear that you're talking about the theoretical potential of the techniques (which, indeed, my book does not cover). As such, it sounds like an attack rather than the simple citing of a fact.

You had it right in your (excellent) article:

Friedl's book teaches programmers how best to use today's
regular expression implementations, but not how best to
implement them.

I think a better response to Holli would be one noting that there's more potential offered by the underlying theory than is evident in the common implementations covered in the book.

Whether the underlying theory is my cup of tea is not particularly relevant to my book, because the reality of today's common implementations is that they are what they are. Perhaps you wouldn't have so much angst about my book if it were titled Mastering (what goes for) Regular Expressions? :-)

In your article, I'm not sure what you mean when you say that I don't "respect" theory (?), but it's true that a lot of the theory behind what led to today's implementations is beyond my desire to stay awake. Here's an example of the type of thing that makes my eyes glaze over when I try to understand it:

"An equivalence relation R over Σ* is right invariant if and only if whenever xRy, then xzRyz for all z ∈ Σ*. An equivalence relation is of finite index if and only if there are only finitely many equivalence classes under R."

(from Robert Constable's paper "The Role of Finite Automata in the Development of Modern Computing Theory" in The Kleene Symposium, North-Holland Publishing, 1980).

You know what they say: "Those who can, do, and those who can't, teach". My book teaches how to use what's there now. It would please me to no end if I could throw out all the chapters on optimization and such, and perhaps your article is a first step toward that goal. Some won't appreciate being told that the emperor has fewer clothes than has been thought, but don't let that deter you.

I do think that along the way you'll find some unexpected problems related to the semantics of capturing parentheses, for example, but heck, if you're half as good an engineer as you are a writer, I'll hold out hope, because the writing and presentation in your article is really top notch.

I hear that the engine has gotten relatively easy to replace in perl 5.10. It'd be wrong to use a DFA with code that was written for perl's default NFA, but you could surely plug it in for code that knew it was going to use the DFA. Have you given any thought to trying this out for real?

What's in Friedl's book about that topic is misleading at best. Friedl freely admits that he doesn't care for the underlying theory.

I followed that same link when I first read your article and still have no idea what you are referring to. Would you please quote the statement where he expresses his feelings about the underlying theory?

I scanned this article, and noted quickly that it is way beyond my scope of knowledge, or interest for that matter. I am very attached to Perl for my own reasons, but not so much so that I don't like to explore other ways of doing things, or attempting to understand how all that CS black magic works, and why there might be better ways to implement the underlying workings.

I always get that little mental twinge when something someone says or writes implies that my tool of choice is less than it should be (even though that is not really what is being said). This is relatively natural, and of course I go read about it and other people's thoughts on the matter to see just how blindly religious I have become about my preferences.

From the perspective of a self-taught perl user, the answer to the question "...felt that Perl regexp was slow...needed to be improved" is no. As I understand it, the "P" in perl is for "Practical". Practical for me means something that gets the job done. Perl does extraction and reporting very nicely, and is therefore practical. If Perl's only job were to do regexes, then it is probably not as fast as it should be. But that is not Perl's only job.

Back in the day when I worked for Cray Research, our job was to build the fastest super-computers in the world. The machines we built were pretty darn good at what they did, modeling complex problems. But they were not the fastest at everything, or even the best for all large scale problems. They were (and in some cases still are) the most practical tools to use for certain kinds of problems, in certain environments.

Perl is like that for me. Are other tools better at some of the things that Perl can do? Yes. Is Perl the best at regexes? I don't know, maybe not. What I do know is that Perl is the practical tool to use for so many of the jobs that I do that whatever 5%, 10% or even 30% increase I could get out of using something else is just not worth the risk or the learning curve. If someone wants to tweak the underpinnings of the tool to make it faster at something, fine. Just don't turn the world upside down doing it. I am too damn old to make the adjustment :-)

So, from my rather simple perspective, I will take the practical advantages of Perl, and not let myself worry too much about that little twinge I felt at the question about my favorite tool ..... :-) Sometimes I have to just laugh at myself, and give everyone else the opportunity to join me!! lol

...the majority is always wrong, and always the last to know about it...

The perl regular expression engine could be seen as an example of "worse is better" (I mean it as a compliment ;-). It might not be the prettiest from a theoretical point of view, and it might have pathological cases, but it exists and gets the job done. Sure, it would be great if it had no pathological cases. I have to say that I have never encountered one in real life, but I always feel a bit uneasy letting users of a website run regular expressions via a CGI script, due to the risk that they will input a pathological expression just for fun (you can set a timeout, but it's a bit of a pain).
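For what it's worth, one shape that timeout workaround can take is to evaluate the untrusted pattern in a disposable child process and kill it from outside if it runs long. This is only a sketch (in Python, whose re engine backtracks the same way; safe_search and its one-second default are invented names, not any existing API):

```python
import subprocess
import sys

# Run a user-supplied pattern in a throwaway interpreter so that runaway
# backtracking can be killed from outside; an in-process alarm cannot
# reliably interrupt a match that is busy inside the engine's C loop.
def safe_search(pattern, text, seconds=1.0):
    prog = ("import re, sys; "
            "sys.exit(0 if re.search(sys.argv[1], sys.argv[2]) else 1)")
    try:
        result = subprocess.run([sys.executable, "-c", prog, pattern, text],
                                timeout=seconds)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return None  # pathological pattern: the match was killed

print(safe_search(r"needle", "haystack with a needle"))  # True
print(safe_search(r"(a|aa)*$", "a" * 500 + "b", 1.0))    # None (times out)
```

The process-per-match cost is exactly the "bit of a pain" the poster mentions, but it does bound the damage a pattern like (a|aa)*$ can do.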

Sure, the perl engine is not that perfect (in fact, in many respects it's a heap of steaming excrement that happens to work only because some guy named Zaphod and the Infinite Improbability Drive happened to be passing by at the time it was compiled), but it's good enough, has proved itself to be a powerful tool, and has done more than all the other regex implementations put together to advance the programming community's knowledge of and exposure to regular expressions.

In reply to everyone saying that the pathological cases are only a theoretical problem and are easily avoided: I ran into a pathological case in real life, and found Russ's article via a Google search for "perl slow regex".

I have a bunch of PDFs and some metadata I want to verify. Basically I want to check that each file actually contains the title listed in the metadata table, and things like that, to make sure the right files have the right names. Punctuation often varies, so I replaced the separators between title words with non-greedy wildcards.

So I'm matching a regex that looks like
Recogntion.+?of.+?Credit.+?History.+?for.+?New.+?Immigrants
against maybe 100 kB of text. That regex takes more than half an hour to fail to match, on a 2.4 GHz C2D (AMD64 Ubuntu, perl v5.8.8). (Matches are fast, though.)

I'm just using /g here because I wanted to know where the title matched. Matching against the whole file's contents is overkill for the title, but I'm planning to check for other metadata, which could appear at the end of the file.

My first attempt didn't use non-greedy wildcards, and took 30s for one easy match near the beginning of the string, and much longer for others. Finding Russ's article clued me in to why it was so slow in the common matching case.

I eventually gave up on perl's regex engine for this task, and replaced the transformation to wildcards with an explicit search for each word in order.
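That word-by-word replacement might look something like the following sketch (Python rather than the poster's Perl, and title_words_in_order is an invented name; each word is located with a plain substring search, so there is no backtracking at all):

```python
def title_words_in_order(text, title):
    """Return True if every word of `title` occurs in `text`, in order.
    A chain of plain substring searches replaces the .+? wildcards."""
    pos = 0
    for word in title.split():
        pos = text.find(word, pos)
        if pos == -1:
            return False          # a word is missing: fail fast
        pos += len(word)          # resume the search after this word
    return True

page = "Recognition -- of ... Credit / History for New Immigrants"
print(title_words_in_order(page, "Recognition of Credit History"))  # True
print(title_words_in_order(page, "Broadband in Ontario"))           # False
```

Each find is linear, so the non-matching case is as fast as the matching one. (A substring search can match inside a longer word, e.g. "of" inside "offer"; the poster's real check might want word boundaries.)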

The version using non-greedy wildcards is still running after 784 minutes (started last night). It's still stuck searching for Broadband.+?in.+?Ontario.+?from.+?an.+?Urban.+?and.+?Regional.+?Perspective.+?digital.+?divides.+?content.+?inflation.+?citizen.+?engagement.+?and.+?criteria.+?for.+?success
in a 265 kB string that doesn't match. (The PDF has the title in an image, so it's not in the pdftotext output.)

That 784-minute perl process is running with LANG=C, too. (I'd rather use a locale where \W didn't match accented vowels, since some of the documents are in French. They're a digital-library collection of documents, often EN and FR versions of the same thing from Canadian govt. agencies. Anyway, en_CA.utf-8 is not such a locale, so I might as well use LANG=C.)

The point of this post is that this was a real-world problem for me, and it would have saved me several hours of reading about this if perl used a non-backtracking matcher for patterns that don't need backtracking. I still haven't thought of a regex that's fast (in perl) in both the matching and non-matching cases. If an m//g loop hadn't worked out, I could have piped the text through tr -d '\n' | egrep, and that would have been fast enough.

As a user, I do not want to store in my brain the knowledge of which regexes are "safe". If this knowledge is easy to describe, why not code it, and at least for this class of regexes use the Thompson algorithm? Of course, as an optimization that applies only to some use cases, this would have to be prioritized, tested and integrated. But in a perfect Perl, I would want to see it included.
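The "which regexes are safe" knowledge is indeed describable: a pattern with no backreferences and no lookarounds is, in principle, a candidate for a Thompson-style matcher. A deliberately rough sketch of such a check (dfa_safe is an invented name, and a real implementation would parse the pattern rather than scan its text):

```python
import re

# Flag the constructs that rule out a finite-automaton matcher.
# This scan is conservative and illustrative only: it looks at the raw
# pattern text, so it may reject a safe pattern (e.g. a backslash-digit
# inside a character class), but it should never accept an unsafe one.
_UNSAFE = re.compile(r"""
      \\[1-9]      # numeric backreference \1 .. \9
    | \(\?P=       # named backreference (?P=name)
    | \(\?[=!<]    # lookahead (?=, (?! and lookbehind (?<=, (?<!
""", re.VERBOSE)

def dfa_safe(pattern):
    return _UNSAFE.search(pattern) is None

print(dfa_safe(r"a{0,30}?a{30}"))  # True: plain quantifiers are fine
print(dfa_safe(r"(\w+) \1"))       # False: backreference needs backtracking
```

A hybrid engine could route dfa_safe patterns to a linear-time matcher and fall back to backtracking for the rest, which is roughly what the poster is asking for.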

The article is biased towards FA methods and mistreats backtracking by deliberately focusing on backtracking's worst case while omitting the FA worst cases. That is quite bad, since it sits at the top of Google results to this day, misleading people.

"This argument misses the point: given a choice between an implementation with a predictable, consistent, fast running time on all inputs or one that usually runs quickly but can take years of CPU time (or more) on some inputs, the decision should be easy." - this is simply wrong. No regular expression matching algorithm known (at least to me) has a fast running time on ALL inputs.

Let's see how an NFA treats regular expressions like a{2,N} (using PCRE notation), where N is a large number, i.e. a finite quantifier with a large gap between the lower and upper bounds. A backtracker matches aaa in no time with this, while with a big N the automaton approach won't even start matching, taking ages to build a vast automaton. Like a{2,2000}, etc. The graphs would be quite reversed.
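The backtracking half of that comparison is easy to check. A quick sketch (in Python, whose re engine is a backtracker of the same family; this only demonstrates the backtracking side, since no automaton construction is available here):

```python
import re

# A bounded quantifier with a huge upper limit costs a backtracking
# engine almost nothing: it is a counter, not ~2000 expanded states.
pat = re.compile(r"\Aa{2,2000}\Z")     # compiles instantly
print(pat.search("aaa") is not None)   # True
print(pat.search("a") is not None)     # False: below the lower bound
```

Whether an automaton-based engine actually "takes ages" here depends on the implementation (some expand counted repetition, some keep counters too), so the commenter's claim should be read as implementation-specific.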

I'm not a particular fan of backtracking, but that comparison isn't fair at all. Each regular expression matching algorithm has its strong and weak points. And the only practical situation I know of where these worst cases come to life is DoS attacks on services that allow users to supply a regex.