Tuesday, June 8, 2010

A JIT for Regular Expression Matching

This is part 2 of a series, see Part 1 for an introduction. In this post
I want to describe how the JIT generator of the PyPy project can be used to turn
the elegant but not particularly fast regular expression matcher from the first
part into a rather fast implementation. In addition, I will show some speed
measurements against various regular expression implementations.

Again, note the disclaimer: This technology could not easily be used
to implement Python's re-module.

Example Expression and First Numbers

The regular expression I will use as an example in the rest of this paper is
the expression (a|b)*a(a|b){20}a(a|b)*. It matches all strings that have two
a with exactly 20 characters between them. This regular expression has
the property that the corresponding DFA needs 2**(n+1) different states. As
an input string, we use a random string (of varying lengths) that does not
match the regular expression. I will give all results as number of chars matched
per second. While this is not a particularly typical regular expression, it
should still be possible to get some ballpark numbers for the speeds of various
implementations – as we will see, the differences between implementations are
huge anyway.

All the benchmarks were performed on my laptop, which has an Intel Core2 Duo
P8400 processor with 2.26 GHz and 3072 KB of cache on a machine with 3GB RAM
running Ubuntu Linux 10.04.

To get a feeling for the orders of magnitude involved, the CPython re module
(which is implemented in C and quite optimized) can match 2'500'000 chars/s.
Google's new re2 implementation still matches 550'000 chars/s. Google's
implementation is slower, but their algorithm gives complexity and space
guarantees similar to our implementation in the last blog post.

On the other end of the performance scale is the pure-Python code from the last
blog post running on CPython. It can match only 12'200 chars/s and is thus 200
times slower than the re module.

Translating the Matcher

The code described in the last blog post is not only normal Python code, but
also perfectly valid RPython code. Nothing particularly dynamic is going on in
the code, thus it can be translated with PyPy's translation toolchain to C code.
The resulting binary is considerably faster and can match 720'000 chars/s, 60
times faster than the untranslated version.

Another approach is to write equivalent versions of the algorithms in lower
level languages. This has been done for C++ by Sebastian Fischer and for Java by
Baltasar Trancón y Widemann. The algorithm is object-oriented enough to be
mapped very closely to the respective languages. The C++ version is
a little bit faster than the RPython version translated to C, at 750'000 chars/s. That's
not very surprising, given their similarity. The Java version is more than twice
as fast, with 1'920'000 chars/s. Apparently the Java JIT compiler is a lot
better at optimizing the method calls in the algorithm or does some other
optimizations. One reason for this could be that the Java JIT can assume that
the classes it sees are all there are (and it will invalidate the generated
machine code if more classes are loaded), whereas the C++ compiler needs to
generate code that works even in the presence of more regular expression
classes.

Generating a JIT

To get even more performance out of the RPython code, it is possible to generate
a JIT for it with the help of the PyPy translation toolchain. To do this, the
matching code needs to be extended somewhat by some hints that tell PyPy's JIT
generator how this is to be done. The JIT generator can automatically produce a
JIT compiler from an RPython interpreter of the source language. In our case,
we view the regular expression matcher as an interpreter for regular
expressions. Then the match function corresponds to the
dispatch loop of a traditional interpreter.

Our regular expression matcher is a very peculiar interpreter. The matcher
works by running exactly one loop (the one in match) as many times as the
input string is long, irrespective of the "program", i.e. the particular
regular expressions. In addition, within the loop there are no conditions (e.g.
if statements) at all, it is just linear code. This makes it almost perfectly
suited
to the JIT generator, which produces a tracing JIT. A tracing JIT compiles the
hot loops of a program (i.e. regular expression) and has to do extra work if
there are conditions in the loop. In our case, there is exactly one loop per
regular expression, without any condition.

JIT Hints

The hints that are needed for the match function of the last blog post can
be seen here (the function is slightly rewritten, e.g. the JIT does only
properly support a while loop as the main dispatch loop):

The jitdriver is an instance describing the data of the interpreter we are
dealing with. The arguments to the constructor need to list all local variables
of the dispatch loop. The local variables are classified into two classes, red
ones and green ones. The green ones hold the objects that make up the program
that the interpreter currently runs and which position in the program is
currently being executed. In a typical bytecode interpreter, the bytecode object
and the program counter would be green. In our case, the regular expression is
the program, so it is green. The rest of the variables are red.

The green variables are treated specially by the JIT generator. At runtime, for
a given value of the green variables, one piece of machine code will be
generated. This piece of machine code can therefore assume that the value of
the green variable is constant.

There are two additional hints, which are method calls on the
jitdriver instance. The jit_merge_point method marks the beginning of
the main interpreter loop. The can_enter_jit function marks the point where
a loop in the user program can be closed, which in our case is trivial, it's
just at the end of the interpreter loop (for technical reasons it is put at the beginning, because nothing must happen between the can_enter_jit and jit_merge_point invocations).

Those are the hints that the JIT generator needs to function at all. We added
some additional hints, that give the JIT generator more information to work
with. Those hints are immutability information, which means that certain
instance fields can not be changed after the object has been constructed. Apart
from the marked field, none of the fields of any of the Regex subclasses
can change. For example for the Char class this is expressed in the
following way:

These hints allow the generated JIT to constant-fold reads out of the immutable
fields in some situations.

Adaptions to the Original Code

In the introduction above I wrote that the code within the loop in match
uses no conditions. It is indeed true that none of the _shift methods
have an if statement or similar. However, there are some hidden conditions
due to the fact that the and and or boolean operators are used, which
are short-circuiting. Therefore the JIT-version of the code needs to be adapted
to use the non-short-circuiting operators & and |.

JIT Example

To get an impression of how the generated machine code looks like, consider the
regular expression (a|b)*. As regular expression objects this would be
Repetition(Alternative(Char('a'),Char('b'))). The machine code in its intermediate,
machine-independent form looks as follows (I have slightly cleaned it up and
added comments for clarity):

The various ConstPtr(ptr_*) denote constant addresses of parts of the regular
expression tree:

ptr_repetition is the Repetition

ptr_chara is Char('a')

ptr_charb is Char('b')

ptr_alternative is the Alternative

Essentially the machine code reads the next char out of the string, the current
mark out of the Repetition and then performs some boolean operations on
those, writing back the new marks. Note in particular how the generated
machine code does not need to do any method calls to shift and _shift and
that most field reads out of the regular expression classes have been optimized
away, because the fields are immutable. Therefore the machine code does not
need to deconstruct the tree of regular expression objects at all, it just
knows where in memory the various parts of it are, and encodes that directly
into the code.

Performance Results With JIT

With the regular expression matcher translated to C and with a generated JIT,
the regular expression performance increases significantly. Our running example
can match 16'500'000 chars/s, which is more than six times faster than the
re module. This is not an entirely fair comparison, because the re
module can give more information than just "matches" or "doesn't match", but
it's still interesting to see. A more relevant comparison is that between the
program with and without a JIT: Generating a JIT speeds the matcher up by more
than 20 times.

Conclusion

So, what have we actually won? We translated the relatively simple and very slow
regular expression matching algorithm from the last post to C and were thus able
to speed it up significantly. The real win is gained by also generating a JIT
for the matcher, which can be regarded as a simple interpreter. The resulting
matcher is rather fast.

The lesson from these posts is not that you can or should write a practical
and general regular expression module in this way – indeed, enhancing the
algorithm to support more features of the re module would be a lot of work
and it is also unclear what the speed results for more realistic regular
expressions would be. However, it makes for a great case study of the JIT
generator. It was relatively straightforward to generate a JIT for the regex
matcher, and the speed results were great (Admittedly I know rather a lot about
PyPy's JIT though). This approach is generalizable to many programs that are
sufficiently "interpreter-like" (whatever that exactly means).

All the results that appeared at various points in this blog post can be seen
here:

Implementation

chars/s

speedup over pure Python

Pure Python code

12'200

1

Python re module

2'500'000

205

Google's re2 implementation

550'000

45

RPython implementation translated to C

720'000

59

C++ implementation

750'000

61

Java implementation

1'920'000

157

RPython implementation with JIT

16'500'000

1352

Sources

All the source code can be found in my Subversion user directory on Codespeak.

Edit:

Armin is right (see first comment). I fixed the problem.

This is part 2 of a series, see Part 1 for an introduction. In this post
I want to describe how the JIT generator of the PyPy project can be used to turn
the elegant but not particularly fast regular expression matcher from the first
part into a rather fast implementation. In addition, I will show some speed
measurements against various regular expression implementations.

Again, note the disclaimer: This technology could not easily be used
to implement Python's re-module.

Example Expression and First Numbers

The regular expression I will use as an example in the rest of this paper is
the expression (a|b)*a(a|b){20}a(a|b)*. It matches all strings that have two
a with exactly 20 characters between them. This regular expression has
the property that the corresponding DFA needs 2**(n+1) different states. As
an input string, we use a random string (of varying lengths) that does not
match the regular expression. I will give all results as number of chars matched
per second. While this is not a particularly typical regular expression, it
should still be possible to get some ballpark numbers for the speeds of various
implementations – as we will see, the differences between implementations are
huge anyway.

All the benchmarks were performed on my laptop, which has an Intel Core2 Duo
P8400 processor with 2.26 GHz and 3072 KB of cache on a machine with 3GB RAM
running Ubuntu Linux 10.04.

To get a feeling for the orders of magnitude involved, the CPython re module
(which is implemented in C and quite optimized) can match 2'500'000 chars/s.
Google's new re2 implementation still matches 550'000 chars/s. Google's
implementation is slower, but their algorithm gives complexity and space
guarantees similar to our implementation in the last blog post.

On the other end of the performance scale is the pure-Python code from the last
blog post running on CPython. It can match only 12'200 chars/s and is thus 200
times slower than the re module.

Translating the Matcher

The code described in the last blog post is not only normal Python code, but
also perfectly valid RPython code. Nothing particularly dynamic is going on in
the code, thus it can be translated with PyPy's translation toolchain to C code.
The resulting binary is considerably faster and can match 720'000 chars/s, 60
times faster than the untranslated version.

Another approach is to write equivalent versions of the algorithms in lower
level languages. This has been done for C++ by Sebastian Fischer and for Java by
Baltasar Trancón y Widemann. The algorithm is object-oriented enough to be
mapped very closely to the respective languages. The C++ version is
a little bit faster than the RPython version translated to C, at 750'000 chars/s. That's
not very surprising, given their similarity. The Java version is more than twice
as fast, with 1'920'000 chars/s. Apparently the Java JIT compiler is a lot
better at optimizing the method calls in the algorithm or does some other
optimizations. One reason for this could be that the Java JIT can assume that
the classes it sees are all there are (and it will invalidate the generated
machine code if more classes are loaded), whereas the C++ compiler needs to
generate code that works even in the presence of more regular expression
classes.

Generating a JIT

To get even more performance out of the RPython code, it is possible to generate
a JIT for it with the help of the PyPy translation toolchain. To do this, the
matching code needs to be extended somewhat by some hints that tell PyPy's JIT
generator how this is to be done. The JIT generator can automatically produce a
JIT compiler from an RPython interpreter of the source language. In our case,
we view the regular expression matcher as an interpreter for regular
expressions. Then the match function corresponds to the
dispatch loop of a traditional interpreter.

Our regular expression matcher is a very peculiar interpreter. The matcher
works by running exactly one loop (the one in match) as many times as the
input string is long, irrespective of the "program", i.e. the particular
regular expressions. In addition, within the loop there are no conditions (e.g.
if statements) at all, it is just linear code. This makes it almost perfectly
suited
to the JIT generator, which produces a tracing JIT. A tracing JIT compiles the
hot loops of a program (i.e. regular expression) and has to do extra work if
there are conditions in the loop. In our case, there is exactly one loop per
regular expression, without any condition.

JIT Hints

The hints that are needed for the match function of the last blog post can
be seen here (the function is slightly rewritten, e.g. the JIT does only
properly support a while loop as the main dispatch loop):

The jitdriver is an instance describing the data of the interpreter we are
dealing with. The arguments to the constructor need to list all local variables
of the dispatch loop. The local variables are classified into two classes, red
ones and green ones. The green ones hold the objects that make up the program
that the interpreter currently runs and which position in the program is
currently being executed. In a typical bytecode interpreter, the bytecode object
and the program counter would be green. In our case, the regular expression is
the program, so it is green. The rest of the variables are red.

The green variables are treated specially by the JIT generator. At runtime, for
a given value of the green variables, one piece of machine code will be
generated. This piece of machine code can therefore assume that the value of
the green variable is constant.

There are two additional hints, which are method calls on the
jitdriver instance. The jit_merge_point method marks the beginning of
the main interpreter loop. The can_enter_jit function marks the point where
a loop in the user program can be closed, which in our case is trivial, it's
just at the end of the interpreter loop (for technical reasons it is put at the beginning, because nothing must happen between the can_enter_jit and jit_merge_point invocations).

Those are the hints that the JIT generator needs to function at all. We added
some additional hints, that give the JIT generator more information to work
with. Those hints are immutability information, which means that certain
instance fields can not be changed after the object has been constructed. Apart
from the marked field, none of the fields of any of the Regex subclasses
can change. For example for the Char class this is expressed in the
following way:

These hints allow the generated JIT to constant-fold reads out of the immutable
fields in some situations.

Adaptions to the Original Code

In the introduction above I wrote that the code within the loop in match
uses no conditions. It is indeed true that none of the _shift methods
have an if statement or similar. However, there are some hidden conditions
due to the fact that the and and or boolean operators are used, which
are short-circuiting. Therefore the JIT-version of the code needs to be adapted
to use the non-short-circuiting operators & and |.

JIT Example

To get an impression of how the generated machine code looks like, consider the
regular expression (a|b)*. As regular expression objects this would be
Repetition(Alternative(Char('a'),Char('b'))). The machine code in its intermediate,
machine-independent form looks as follows (I have slightly cleaned it up and
added comments for clarity):

The various ConstPtr(ptr_*) denote constant addresses of parts of the regular
expression tree:

ptr_repetition is the Repetition

ptr_chara is Char('a')

ptr_charb is Char('b')

ptr_alternative is the Alternative

Essentially the machine code reads the next char out of the string, the current
mark out of the Repetition and then performs some boolean operations on
those, writing back the new marks. Note in particular how the generated
machine code does not need to do any method calls to shift and _shift and
that most field reads out of the regular expression classes have been optimized
away, because the fields are immutable. Therefore the machine code does not
need to deconstruct the tree of regular expression objects at all, it just
knows where in memory the various parts of it are, and encodes that directly
into the code.

Performance Results With JIT

With the regular expression matcher translated to C and with a generated JIT,
the regular expression performance increases significantly. Our running example
can match 16'500'000 chars/s, which is more than six times faster than the
re module. This is not an entirely fair comparison, because the re
module can give more information than just "matches" or "doesn't match", but
it's still interesting to see. A more relevant comparison is that between the
program with and without a JIT: Generating a JIT speeds the matcher up by more
than 20 times.

Conclusion

So, what have we actually won? We translated the relatively simple and very slow
regular expression matching algorithm from the last post to C and were thus able
to speed it up significantly. The real win is gained by also generating a JIT
for the matcher, which can be regarded as a simple interpreter. The resulting
matcher is rather fast.

The lesson from these posts is not that you can or should write a practical
and general regular expression module in this way – indeed, enhancing the
algorithm to support more features of the re module would be a lot of work
and it is also unclear what the speed results for more realistic regular
expressions would be. However, it makes for a great case study of the JIT
generator. It was relatively straightforward to generate a JIT for the regex
matcher, and the speed results were great (Admittedly I know rather a lot about
PyPy's JIT though). This approach is generalizable to many programs that are
sufficiently "interpreter-like" (whatever that exactly means).

All the results that appeared at various points in this blog post can be seen
here:

Implementation

chars/s

speedup over pure Python

Pure Python code

12'200

1

Python re module

2'500'000

205

Google's re2 implementation

550'000

45

RPython implementation translated to C

720'000

59

C++ implementation

750'000

61

Java implementation

1'920'000

157

RPython implementation with JIT

16'500'000

1352

Sources

All the source code can be found in my Subversion user directory on Codespeak.

12 comments:

Warning: the first example is wrong: there should be no code executed between can_enter_jit() and jit_merge_point(). In this case, there is the exit condition of the loop. It needs to be rewritten as a "while True:" loop with a "break" just before can_enter_jit().

Without those changes, the modifications for JIT really are prety small --mostly just some annotations in the main loop and at toplevel for eachclass. With those changes, though, you need to potentially check the entirecodebase of your interpreter.

@Nelson: If you don't change the "and" and "or" you get a lot of assembler code generated, and it's not particularly fast.

Note that this "and" and "or" business is quite specific to this particular example. Usually you can work more incrementally by generating a JIT, then looking at the produced assembler and then doing some small changes in the interpreter to improve parts of it. Each such change is usually localized to one part of the interpreter improves the performance of some language feature.

This example is not really large enough to show this way of working, though :-). Maybe at some point I should write a walk-through for some interpreter.

boost::regex is not mentioned. It's got both recursive and non-recursive implementations. And it is the base of the standard C++ TR1 regex. Would be interesting to stack it up against other results because it is