PHP vs. Python vs. Perl -- Regular Expression Showdown

The Goal

I was in a discussion yesterday with one of my co-workers about the speed of SpamAssassin. We were talking about how slow it is and about good ways to speed it up. He mentioned some optimizations in other languages, which got me wondering exactly what the speed differences would be in a test of PHP, Python and Perl. This writeup details the results of my tests of 5 different scripts on 2 different machines running 2 different distros of Linux.

The Hardware

I've run these tests on my “work” workstation, which is a Sun Ultra 20 running Gentoo Linux. From here on out, I will refer to this machine as the “Sun box”. The details are as follows.

CPU: single, single core AMD Opteron 2.6GHz

Memory: 2GB

OS: Gentoo Linux

The second machine was my personal laptop, a Gateway MX6931 running Ubuntu Linux. From here on out, I will refer to this machine as the “Laptop”. The details are as follows.

CPU: single, dual core Intel Core2 1.66GHz

Memory: 2GB

OS: Ubuntu 7.10

The Interpreters

Here is the output of the version command from each of our 3 interpreters on each of the machines. The first set is from the Sun box.

The Scripts

I wrote a total of 5 scripts for this experiment. There is one for each interpreter, plus one additional script each for Python and Perl. There are 2 Python scripts because Python's regular expression library offers the option to compile regular expressions prior to use. Therefore, I have one Python script which uses pre-compiled regular expressions and one that does not. There are 2 Perl scripts because the first one I wrote uses the same programmatic mechanism as the other languages: looping through an array of regular expression strings and using a string variable (m/$r/) in the regex matching. At the behest of one of my co-workers, I wrote a second version where all the regexes were hard coded in the match expression (m/regex code.*$/). The difference, as you will see, is quite dramatic.
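
To make the Python distinction concrete, here is a minimal sketch of the two call styles. It is not taken from the test scripts; the pattern and the log line are invented for illustration.

```python
import re

# Hypothetical pattern and log line, invented for illustration only.
PATTERN = r"status=sent"
LINE = "Jan  1 00:00:01 mail postfix/smtp[123]: ABC123: to=<user@example.com>, status=sent"

# Non-compiled style: pass the pattern string on every call.
# The re module has to look the pattern up in its internal cache each time.
uncompiled_hit = re.search(PATTERN, LINE) is not None

# Pre-compiled style: build the pattern object once, then reuse it.
compiled_pattern = re.compile(PATTERN)
compiled_hit = compiled_pattern.search(LINE) is not None

print(uncompiled_hit, compiled_hit)
```

Both styles give the same answer; the only difference is how much work is repeated per call, which is what the two Python scripts are meant to measure.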

Though these are different languages, I kept the execution of the scripts almost completely the same between them, with the exception of one of the Perl scripts. The basis of the scripts is that they all use a set of 5 different regular expressions in an array to try and match against lines in an email logfile. More on the logfile later. If there is a match, a simple integer counter is incremented and the script moves on to the next line. Very basic, but also very real world. Parsing log files is definitely one of the major uses for interpreted languages, which is what this is about. The only thing I'm not doing is aggregating any kind of data from what I'm parsing as I want this to be purely about the speed of the regular expression matching. Now, the source of the 5 scripts.
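
As a rough illustration of the loop described above (this is a sketch, not one of the 5 original scripts), the counting logic might look like the following in Python. The five patterns and the sample lines are hypothetical stand-ins, and stopping at the first matching pattern is my reading of "moves on to the next line".

```python
import re

# Hypothetical patterns: stand-ins for the five regexes used in the tests.
PATTERN_STRINGS = [
    r"status=sent",
    r"status=bounced",
    r"reject: ",
    r"connect from ",
    r"warning: ",
]
COMPILED = [re.compile(p) for p in PATTERN_STRINGS]


def count_matching_lines(lines):
    """Count lines that match at least one of the patterns."""
    count = 0
    for line in lines:
        for pattern in COMPILED:
            if pattern.search(line):
                count += 1
                break  # assumed: go to the next line after the first match
    return count


# Invented sample lines; a real run would iterate over the open logfile.
sample = [
    "postfix/smtp[1]: A1: to=<a@example.com>, status=sent (250 ok)",
    "postfix/smtpd[2]: connect from unknown[192.0.2.1]",
    "postfix/qmgr[3]: A1: removed",
]
print(count_matching_lines(sample))
```

No data is collected from the matching lines, in keeping with the goal of timing only the regular expression matching.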

The Testbed

The testbed was very simple. The testing script is just a simple shell script that ran the Unix time command on each of our 5 test scripts 5 consecutive times each. The “maillog” file that was used was simply a compilation of a number of days of maillogs from my personal mail server. Nothing was altered in this file and the same file was used in all tests, as you can see from the scripts above. The size of that file is 72886261 bytes. For reasons that should be obvious, I'm not going to post my personal maillogs here.
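
The actual harness was a shell script around the time command. For readers who would rather stay in one language, a comparable harness can be sketched in Python. The function name and the placeholder command below are assumptions of this sketch, and the user and sys figures rely on os.times() reporting child-process CPU time, which it does on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv `runs` times; return a list of (real, user, sys) seconds."""
    results = []
    for _ in range(runs):
        before = os.times()
        start = time.perf_counter()
        subprocess.run(argv, check=True)
        real = time.perf_counter() - start
        after = os.times()
        # children_* fields accumulate CPU time of waited-for child processes.
        user = after.children_user - before.children_user
        system = after.children_system - before.children_system
        results.append((real, user, system))
    return results


# Placeholder command; a real run would name one of the test scripts instead.
timings = time_command([sys.executable, "-c", "pass"])
for i, (real, user, system) in enumerate(timings, start=1):
    print(f"Test {i}: real={real:.2f}s user={user:.2f}s sys={system:.2f}s")
```

Running each script several times in a row, as the shell script did, makes it easier to spot outliers caused by disk caching or other activity on the machine.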

The Results

The tables below show the Unix time output of these tests on each machine. The fastest time for each language is in bold.

The Sun Box

        PHP                     Python                  Python                  Perl (interpolated      Perl (hard coded
                                (non-compiled)          (compiled)              string loop)            regexes)
        Real    User    Sys     Real    User    Sys     Real    User    Sys     Real    User    Sys     Real    User    Sys
Test 1  9.45s   9.10s   0.07s   13.42s  12.66s  0.11s   7.97s   7.20s   0.11s   31.89s  29.43s  0.17s   1.59s   1.53s   0.05s
Test 2  9.90s   9.06s   0.06s   13.28s  12.45s  0.15s   7.86s   7.25s   0.01s   31.78s  29.97s  0.18s   1.67s   1.52s   0.07s
Test 3  9.58s   9.07s   0.06s   13.56s  12.59s  0.10s   7.45s   7.09s   0.13s   31.29s  29.61s  0.14s   2.32s   1.46s   0.04s
Test 4  9.52s   9.08s   0.10s   13.58s  12.63s  0.09s   7.40s   7.18s   0.04s   33.19s  30.27s  0.15s   1.76s   1.47s   0.04s
Test 5  9.94s   8.87s   0.12s   13.00s  12.44s  0.12s   7.43s   7.19s   0.08s   33.22s  30.22s  0.16s   1.82s   1.42s   0.10s

The Laptop

        PHP                     Python                  Python                  Perl (interpolated      Perl (hard coded
                                (non-compiled)          (compiled)              string loop)            regexes)
        Real    User    Sys     Real    User    Sys     Real    User    Sys     Real    User    Sys     Real    User    Sys
Test 1  14.25s  14.05s  0.10s   12.10s  11.98s  0.06s   6.12s   6.03s   0.07s   42.42s  42.11s  0.11s   1.63s   1.54s   0.08s
Test 2  14.00s  13.62s  0.08s   12.27s  11.91s  0.04s   6.17s   6.08s   0.06s   43.02s  42.72s  0.09s   1.71s   1.64s   0.05s
Test 3  14.24s  14.01s  0.14s   12.43s  12.29s  0.06s   6.14s   5.98s   0.08s   43.15s  42.67s  0.15s   1.71s   1.65s   0.06s
Test 4  13.94s  13.62s  0.08s   12.21s  11.93s  0.11s   6.30s   6.22s   0.05s   43.25s  43.00s  0.07s   1.63s   1.58s   0.05s
Test 5  14.30s  14.06s  0.14s   12.24s  12.08s  0.09s   6.32s   6.19s   0.08s   31.89s  42.60s  0.20s   1.61s   1.55s   0.05s

Just For Fun...

…one of my co-workers whipped up this C code, which uses libpcre, just to see how it would perform versus the interpreted languages. I'm not including it in the main results because this is a test of the speed of 3 interpreted languages, but I thought I would drop the results in here just for fun.

The Results

The Sun Box

The Laptop

        Real    User    Sys
Test 1  13.14s  12.92s  0.06s
Test 2  13.08s  12.88s  0.06s
Test 3  13.09s  12.94s  0.02s
Test 4  13.21s  13.00s  0.07s
Test 5  13.07s  12.88s  0.04s

Conclusion

Well, it appears that the non-compiled Python regexes are about on par with PHP. My Sun box running Gentoo was probably a bit faster with PHP because I'm running a more stripped down version of the php binary compiled specifically for my machine, rather than the generic i386 binary on the Ubuntu laptop.

The Python numbers are fairly consistent in terms of the compiled versions being about twice as fast as the non-compiled versions.
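
As a quick check of the "about twice as fast" figure, the short calculation below averages the Real times for the two Python scripts on each machine. The numbers are copied from the results tables; the variable names are just for this sketch.

```python
from statistics import mean

# Real (wall-clock) seconds copied from the results tables.
sun_noncompiled = [13.42, 13.28, 13.56, 13.58, 13.00]
sun_compiled = [7.97, 7.86, 7.45, 7.40, 7.43]
laptop_noncompiled = [12.10, 12.27, 12.43, 12.21, 12.24]
laptop_compiled = [6.12, 6.17, 6.14, 6.30, 6.32]

# Ratio of mean non-compiled time to mean compiled time on each machine.
sun_ratio = mean(sun_noncompiled) / mean(sun_compiled)
laptop_ratio = mean(laptop_noncompiled) / mean(laptop_compiled)

print(f"Sun box: {sun_ratio:.2f}x, Laptop: {laptop_ratio:.2f}x")
```

This works out to roughly 1.75x on the Sun box and 1.97x on the Laptop.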

I think that the most amazing thing here is the difference between the 2 Perl tests. If you use a scalar string variable as the regular expression, it's dog slow. However, if you hard code that string in the expression, it's lightning fast. I was not expecting this kind of discrepancy at all, but I'm glad that I tested both approaches.

Though I didn't include it in the official results, I thought it was kind of interesting that the compiled C program performed about the same as the Python program with pre-compiled regular expressions.

I think the conclusion that I have to draw from this experiment is that Perl is your best choice, as is often the case, for a simple static regular expression based parser. On the other hand, if you want a more dynamic approach to the regular expressions that you are using (like loading them in from a file, the command line, etc.), Python with pre-compiled regexes is definitely your best answer, but PHP is also a good candidate. Based on the interpolated string loop results, Perl is not the language to use in that particular case.
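
For the dynamic case, a minimal Python sketch might load the pattern strings at run time and compile each one once before the matching loop starts. The helper name, the one-pattern-per-line format and the treatment of lines starting with "#" as comments are all assumptions made for this example.

```python
import re


def load_patterns(lines):
    """Compile one regex per non-empty line, skipping '#' comment lines."""
    patterns = []
    for raw in lines:
        text = raw.strip()
        if not text or text.startswith("#"):
            continue  # assumed convention: blank and comment lines are ignored
        patterns.append(re.compile(text))
    return patterns


# Hypothetical input; in practice this could be an open file or sys.argv[1:].
pattern_source = ["# mail patterns", "status=sent", "", r"connect from \S+"]
patterns = load_patterns(pattern_source)
print(len(patterns))
```

Because each pattern is compiled once up front, the per-line cost in the main loop should be the same as for patterns written directly in the script.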

Please, feel free to post to the discussion here in answer to this writeup.