Friday, May 21, 2010

An Efficient and Elegant Regular Expression Matcher in Python

Two weeks ago, I was at the Workshop Programmiersprachen und Rechenkonzepte,
a yearly meeting of German programming language researchers. At the workshop,
Frank Huch and Sebastian Fischer gave a really excellent talk about an
elegant regular expression matcher written in Haskell. One design goal of the
matcher was to run in time linear in the length of the input string (i.e.
without backtracking) and linear in the size of the regular expression. The
memory use should also only be linear in the regular expression.

In this blog post I want to describe this implementation and show the code of
it, because it is quite simple. In a later post I will show what optimizations
PyPy can perform on this matcher and also do some benchmarks.

A note on terminology: In the rest of the post "regular expression" is meant
in the Computer Science sense, not in the POSIX sense. Most importantly, that
means that back-references are not allowed.

Another note: This algorithm could not be used to implement PyPy's re
module! So it won't help to speed up this currently rather slow implementation.

Implementing Regular Expression Matchers

There are two typical approaches to implementing regular expressions. A naive one is
to use a back-tracking implementation, which can lead to exponential matching
times given a sufficiently evil regular expression.

The other, more complex one, is to transform the regular expression into a
non-deterministic finite automaton (NFA) and then transform the NFA into a
deterministic finite automaton (DFA). A DFA can be used to match a string
efficiently; the problem with this approach is that turning an NFA into a DFA
can lead to exponentially large automata.

Given this problem of potential memory explosion, a more sophisticated approach
to matching is to not construct the DFA fully, but instead use the NFA for
matching. This requires some care, because it is necessary to keep track of
which set of states the automaton is in (it is not just one state, because the
automaton is non-deterministic).
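
The bookkeeping of state sets can be sketched as follows. This is only an illustration: the function nfa_match, the transition table and the hand-built automaton for (a|b)*abb are made up for this sketch, and the NFA is assumed to have no epsilon transitions.

```python
def nfa_match(transitions, start, accepting, s):
    """Simulate an epsilon-free NFA by tracking the set of active states."""
    current = {start}
    for c in s:
        # follow every possible transition from every active state
        current = {
            target
            for state in current
            for target in transitions.get((state, c), ())
        }
        if not current:
            # no state is active any more, so the match cannot succeed
            return False
    return bool(current & accepting)


# hypothetical hand-built NFA for (a|b)*abb, state 3 is accepting
TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

print(nfa_match(TRANSITIONS, 0, {3}, "babb"))
print(nfa_match(TRANSITIONS, 0, {3}, "abba"))
```

The set of active states can never be larger than the number of states of the NFA, which is what keeps the memory use bounded.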

The algorithm described here is essentially equivalent to this approach. However,
it does not need an intermediate NFA and represents a state of a corresponding
DFA as a marked regular expression (represented as a tree of nodes). For many
details about an alternative approach to implementing regular expressions
efficiently, see Russ Cox's excellent article collection.

The Algorithm

In the algorithm the regular expression is represented as a tree of nodes. The
leaves of the tree match exactly one character (or, in the case of the epsilon
node, the empty string). The inner nodes of the tree combine other nodes in
various ways, like alternative, sequence or repetition. Every node in the tree
can potentially have a mark. A node is marked if its sub-expression matches the
string seen so far.

The basic approach of the algorithm is that for every character of the input
string the regular expression tree is walked and a number of the nodes in the
regular expression are marked. At the end of the string, if the top-level node
is marked, the string matches, otherwise it does not. At the beginning of the
string, one mark gets shifted into the regular expression from the top, and then
the marks that are in the regex already are shifted around for every additional
character.

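Let's start looking at some code, and an example to make this clearer. All
regular expression nodes share a base class Regex, which records whether the
node can match the empty string (empty) and offers the methods shift and reset.
The match function, which checks whether a string matches a regex, uses them
like this: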

    def match(re, s):
        if not s:
            return re.empty
        # shift a mark in from the left
        result = re.shift(s[0], True)
        for c in s[1:]:
            # shift the internal marks around
            result = re.shift(c, False)
        re.reset()
        return result

The most important subclass of Regex is Char, which matches one
concrete character:
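
A minimal sketch of what the base class and Char could look like, consistent with the way match uses them. The attribute name marked and the hook method _shift are assumptions made for this sketch; only empty, shift and reset are taken from the match function.

```python
class Regex(object):
    def __init__(self, empty):
        # empty: can this expression match the empty string?
        self.empty = empty
        # the mark that is shifted through the expression
        self.marked = False

    def reset(self):
        """Remove the mark from this node."""
        self.marked = False

    def shift(self, c, mark):
        """Shift marks one step for character c and record the result."""
        # _shift is an assumed hook implemented by the concrete subclasses
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)  # a single character never matches ""
        self.c = c

    def _shift(self, c, mark):
        # marked only if a mark arrives and the character is the expected one
        return mark and c == self.c


def match(re, s):
    # same driver as shown above
    if not s:
        return re.empty
    result = re.shift(s[0], True)
    for c in s[1:]:
        result = re.shift(c, False)
    re.reset()
    return result


print(match(Char("a"), "a"))
print(match(Char("a"), "aa"))
```

Note that a Char only becomes marked when a mark is shifted into it from outside. A mark it already holds is not kept for the next character.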

Alternative

Now the more interesting cases remain. First we define an abstract base class
Binary for the case of composite regular expressions with two children, and
then the first subclass Alternative, which matches if either of two regular
expressions matches the string (in the usual regular expression syntax, a|b).

An Alternative can match the empty string if either of its children can.
Similarly, shifting a mark into an Alternative shifts it into both its
children. If either of the children is marked afterwards, the Alternative
is marked too.

As an example, consider the regular expression a|b|c, which would be
represented by the objects Alternative(Alternative(Char('a'),Char('b')),Char('c')).
Matching the string "a" would lead to the following marks in
the regular expression objects (green nodes are marked, white ones are
unmarked):

At the start of the process, no node is marked. Then the first char is matched,
which adds a mark to the Char('a') node, and the mark will propagate up the
two Alternative nodes.
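
The propagation of marks in this example can be reproduced with the following sketch. It is an assumption about how Binary and Alternative might be written, based on the description above; the names marked and _shift are made up for the sketch.

```python
class Regex(object):
    def __init__(self, empty):
        self.empty = empty      # can match the empty string?
        self.marked = False     # current mark of this node

    def reset(self):
        self.marked = False

    def shift(self, c, mark):
        # _shift is an assumed hook provided by the subclasses
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Binary(Regex):
    def __init__(self, left, right, empty):
        Regex.__init__(self, empty)
        self.left = left
        self.right = right

    def reset(self):
        # clear the marks of both children and of this node
        self.left.reset()
        self.right.reset()
        Regex.reset(self)


class Alternative(Binary):
    def __init__(self, left, right):
        # empty if either child can match the empty string
        Binary.__init__(self, left, right, left.empty or right.empty)

    def _shift(self, c, mark):
        # both children must be shifted, so no short-circuiting here
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(c, mark)
        return marked_left or marked_right


# a|b|c after reading the single character "a"
a, b, c = Char("a"), Char("b"), Char("c")
inner = Alternative(a, b)
outer = Alternative(inner, c)
outer.shift("a", True)
marks = {"a": a.marked, "b": b.marked, "c": c.marked,
         "a|b": inner.marked, "a|b|c": outer.marked}
print(marks)
```

Both children are always shifted, even when the left one is already marked, because the marks inside the right child have to be updated for the next character too.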

Repetition

The two remaining classes are slightly trickier. Repetition is used to match
a regular expression any number of times (in the usual regular expression
syntax, a*):

A Repetition can always match the empty string. A mark that is shifted into
the Repetition is passed on to its child. If the Repetition itself is already
marked, that mark is shifted into the child as well, because the Repetition
could match another time.
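
A sketch of this behaviour for the simple expression a*. It assumes a base class with a marked attribute and a _shift hook; those names are not taken from the text.

```python
class Regex(object):
    def __init__(self, empty):
        self.empty = empty
        self.marked = False

    def reset(self):
        self.marked = False

    def shift(self, c, mark):
        # the right-hand side is evaluated first, so _shift still sees
        # the mark from the previous character in self.marked
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Repetition(Regex):
    def __init__(self, re):
        Regex.__init__(self, True)  # zero repetitions match ""
        self.re = re

    def _shift(self, c, mark):
        # an existing mark on the repetition re-enters the child
        return self.re.shift(c, mark or self.marked)

    def reset(self):
        self.re.reset()
        Regex.reset(self)


def match(re, s):
    if not s:
        return re.empty
    result = re.shift(s[0], True)
    for c in s[1:]:
        result = re.shift(c, False)
    re.reset()
    return result


a_star = Repetition(Char("a"))
print(match(a_star, "aaa"))
print(match(a_star, "aab"))
```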

As an example, consider the regular expression (a|b|c)* matching the string
abcbac:

For every character, one of the alternatives matches, thus the repetition matches
as well.

Sequence

The only missing class is that for sequences of expressions, Sequence (in the
usual regular expression syntax, ab):

A Sequence can match the empty string only if both its children can. The mark
handling is a bit delicate. If a mark is shifted in, it will be shifted to the
left child regular expression. If that left child is already marked before the
shift, that mark is shifted to the right child. If a mark is shifted in and the
left child can match the empty string, the right child gets the mark shifted in
as well.

The whole sequence matches (i.e. is marked), if the left child is marked after
the shift and if the right child can match the empty string, or if the right
child is marked.

Consider the regular expression abc matching the string abcd. For the
first three characters, the marks wander from left to right; when the d is
reached, the matching fails.
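
The wandering marks can be observed step by step with the following sketch of Sequence. It follows the rules described above; the names marked, _shift and old_marked_left are assumptions made for the sketch.

```python
class Regex(object):
    def __init__(self, empty):
        self.empty = empty
        self.marked = False

    def reset(self):
        self.marked = False

    def shift(self, c, mark):
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Sequence(Regex):
    def __init__(self, left, right):
        # empty only if both children can match the empty string
        Regex.__init__(self, left.empty and right.empty)
        self.left = left
        self.right = right

    def reset(self):
        self.left.reset()
        self.right.reset()
        Regex.reset(self)

    def _shift(self, c, mark):
        # remember the left mark before it is overwritten by the shift
        old_marked_left = self.left.marked
        marked_left = self.left.shift(c, mark)
        # the right child receives the old left mark, or the incoming
        # mark if the left child can be skipped entirely
        marked_right = self.right.shift(
            c, old_marked_left or (mark and self.left.empty))
        return (marked_left and self.right.empty) or marked_right


# abc, shifted over "abcd" one character at a time
abc = Sequence(Sequence(Char("a"), Char("b")), Char("c"))
steps = [abc.shift(ch, i == 0) for i, ch in enumerate("abcd")]
abc.reset()
print(steps)
```

The list holds the mark of the top-level node after each character. It is only set after the c, and it is lost again when the d is read.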

More Complex Example

As a more complex example, consider the expression ((abc)*|(abcd))(d|e)
matching the string abcabcabcd.

Note how the two branches of the first alternative match the first abc in
parallel, until it becomes clear that only the left alternative (abc)* can
work.
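
To try this example, all node classes can be put together as in the sketch below. It is one possible arrangement under the assumptions used throughout (a marked attribute and a _shift hook); the helper chars, which builds a Sequence from a string, is hypothetical.

```python
from functools import reduce


class Regex(object):
    def __init__(self, empty):
        self.empty = empty
        self.marked = False

    def reset(self):
        self.marked = False

    def shift(self, c, mark):
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Binary(Regex):
    def __init__(self, left, right, empty):
        Regex.__init__(self, empty)
        self.left = left
        self.right = right

    def reset(self):
        self.left.reset()
        self.right.reset()
        Regex.reset(self)


class Alternative(Binary):
    def __init__(self, left, right):
        Binary.__init__(self, left, right, left.empty or right.empty)

    def _shift(self, c, mark):
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(c, mark)
        return marked_left or marked_right


class Repetition(Regex):
    def __init__(self, re):
        Regex.__init__(self, True)
        self.re = re

    def reset(self):
        self.re.reset()
        Regex.reset(self)

    def _shift(self, c, mark):
        return self.re.shift(c, mark or self.marked)


class Sequence(Binary):
    def __init__(self, left, right):
        Binary.__init__(self, left, right, left.empty and right.empty)

    def _shift(self, c, mark):
        old_marked_left = self.left.marked
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(
            c, old_marked_left or (mark and self.left.empty))
        return (marked_left and self.right.empty) or marked_right


def match(re, s):
    if not s:
        return re.empty
    result = re.shift(s[0], True)
    for c in s[1:]:
        result = re.shift(c, False)
    re.reset()
    return result


def chars(text):
    """Hypothetical helper: left-nested Sequence of Char nodes."""
    return reduce(Sequence, [Char(ch) for ch in text])


# ((abc)*|(abcd))(d|e)
regex = Sequence(
    Alternative(Repetition(chars("abc")), chars("abcd")),
    Alternative(Char("d"), Char("e")))
print(match(regex, "abcabcabcd"))
```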

Complexity

The match function above loops over the entire string without going back and
forth. Each iteration walks the whole tree once. Thus the complexity
of the algorithm is O(m*n) where m is the size of the regular expression
and n is the length of the string.

Summary & Outlook

So, what have we achieved now? The code shown here can match regular expressions
with the desired complexity. It is also not much code. By itself, the Python
code shown above is not terribly efficient. In the next post I will show how the
JIT generator can be used to make the simple matcher shown above really fast.

Limited backreferences can be integrated within this pattern matching scheme. General backreferences are only possible with backtracking but unless you want to solve NP complete problems using POSIX style regexps they might not be necessary.

I do not use regular expressions very heavily and am very new to pypy in general (1.2 works pretty good for me on my pure python code). From this article I don't see a full explanation why this basic algorithm couldn't be used for pypy. Is it primarily due to concerns about backward compatibility or something more interesting? I am looking forward to the article to come about applying the JIT.

Though these regexes can't be used as a drop-in replacement for the re module, if they were strikingly faster it might be worth having them as an alternative. The backtracking features are so seldom required that a faster, non-backtracking algorithm might prove popular with people who worry about matching speed.

It would be fun to read an article where you take the real Python regexes and apply PyPy's JIT code generation to them, i.e. when you call re.compile(...), you'd get native code out of it, specialized for the regex being compiled. After all, haven't you used the JIT on "toy" languages before? Regexes are a "toy" language, albeit a useful one..

Instead of a "special" interpreter for REs in RPython, and a JIT for it, what about "compiling" REs to Python bytecode, and letting the existing PyPy JIT trace and compile them if they end up being used often enough? This is probably slower in the case of lots of throwaway REs that are used once, but when a few REs are used repeatedly it ought to work.