Java Regex Wrangling

Search

I needed a quick and dirty tokenizer for a big chunk of XML-ish text to
feed into some Java code so I was going to fire up Perl, then I remembered
that modern
Java comes with its own regular-expression library.
Hey, it’s good! I put it together in quick-n-dirty hacker style, and it ran
over a 100M file, finding fifteen million tokens, in about three minutes of
CPU time on my 1.25GHz PowerBook.
Quite respectable, but, I thought with a snicker, I bet Perl can beat that.
(Perl’s regex engine is
generally regarded as the state of the art.) So I whacked together a Perl
version and, just to make sure I was getting the right answers, I had both
the Java and Perl versions print out all the tokens they found. They both
burned something over ten minutes, and Perl was maybe 10% faster; might have
been the I/O or other static. I was impressed to find Java within 10% of the
best.
So then I ran it again without the output, just counting the tokens, and
yowie zowie, Perl was at 8 minutes 47 seconds, Java back at 3 minutes 4
seconds.
So I re-ran on a nearby Debian box, on the theory that the OS X versions of
Java and Perl might not be representative of their kind. There are all sorts
of variations around I/O and so on, but my finding is that for this problem,
the Java 1.4.2 regex processing is somewhere around twice as fast as Perl
5.8.1.
Frankly, I’m astounded.
Read on for acknowledgements, some gory details, and a tasteful
selection of Google ads for regular expression software.
[Update: There is a good reason things are the way they are, and Perl’s
trade-off
may well be better.]

Acknowledgements ·
Thanks to Phillip Pearson, Sriram Srinivasan, Chris Lightfoot, Robert Sayre,
and Chris Nokleberg for pointing out my thinko on the first cut of the perl
regexp.
Thanks to Anne van Kesteren and Russell Beattie for CSS help.