Short notes and essays about stuff that interests me (mostly technical stuff).

Tuesday, March 30, 2010

More details on RE2

Russ Cox has posted the third of his detailed articles on the RE2 regular expression engine behind Code Search. The RE2 project is open source, and this third paper goes into considerable detail about the implementation of the engine.

Like most people who have thought about how to use Regular Expressions effectively, I am of two minds about the regex technology. It can be extremely powerful, but it can also be overused.

Still, as an abstract programming exercise, regular expressions are a fascinating sub-field, and I've been interested in them ever since learning about automata theory in school thirty years ago.

As I read through Cox's paper, I found these observations to be the most compelling:

Simplify the problem to simplify the implementation: Cox's engine puts a lot of effort into simplifying complex regular expressions:

One thing that surprised me is the variety of ways that real users write the same regular expression. For example, it is common to see singleton character classes used instead of escaping -- [.] instead of \. -- or alternations instead of character classes -- a|b|c|d instead of [a-d]. The parser takes special care to use the most efficient form for these, so that [.] is still a single literal character and a|b|c|d is still a character class. It applies these simplifications during parsing, rather than in a second pass, to avoid a larger-than-necessary intermediate memory footprint.

Recognize when one sub-problem can be solved using techniques you've already developed for other sub-problems. Regular Expression processing is full of situations where the larger problems can be decomposed into smaller problems, and Cox identifies dozens of these. I particularly enjoyed this one:

Run the DFA backward to find the start. When studying regular expressions in a theory of computation class, a standard exercise is to prove that if you take a regular expression and reverse all the concatenations (e.g., [Gg]oo+gle becomes elgo+o[Gg]) then you end up with a regular expression that matches the reversal of any string that the original matched. In such classes, not many of the exercises involving regular expressions and automata have practical value, but this one does! DFAs only report where a match ends, but if we run the DFA backward over the text, what the DFA sees as the end of the match will actually be the beginning. Because we're reversing the input, we have to reverse the regular expression too, by reversing all the concatenations during compilation.

When testing is a challenge, be smart about how you test. Cox tested his engine by piggy-backing on the testing that has already occurred on several other engines:

How do we know that the RE2 code is correct? Testing a regular expression implementation is a complex undertaking, especially when the implementation has as many different code paths as RE2. Other libraries, like Boost, PCRE, and Perl, have built up large, manually maintained test suites over time....Given a list of small regular expressions and operators, the RegexpGenerator class generates all possible expressions using those operators up to a given size. Then the StringGenerator generates all possible strings over a given alphabet up to a given size. Then, for ever regular expression and every input string, the RE2 tests check that the output of the four different regular expression engines agree with each other.

Learn from those that have gone before you, and study alternate implementations. Cox goes out of his way to credit the work that has occurred over the years in this field. For example:

Philip Hazel's PCRE is an astonishing piece of code. Trying to match most of PCRE's regular expression features has pushed RE2 much farther than it would otherwise have gone, and PCRE has served as an excellent implementation against which to test.

I strongly recommend this series of papers. Cox's articles are clear and well-written, have good examples and great explanations, and are filled with all the references and citations you might want if you should to pursue this topic further.