PHP strpos() vs. preg_match()

How expensive is the PHP strpos() implementation? When should
I start using regular expressions for text matching? Is PHP faster
than an implementation in C? The PHP manual states that strpos()
should always be preferred over regular expressions. The following
describes some simple experiments and results that shed some light
on these questions.

Test Setup

The following measurements were taken using clean installations
of the 'php5-cgi' and 'libapache2-mod-php5' packages in Debian Lenny
(both package version 5.2.6-3, on top of an apache2 2.2.9-7 installation
that was unchanged apart from the enabled PHP5 module). For
some additional tests the Zend Platform was installed to see how
it affects the basic string searching methods.

Please do not worry about the absolute measured times; only their
relative factors are relevant.

Performance of strpos()

One question one might have is whether the native PHP5
strpos() and
stripos() are
more expensive than a strstr() in C, and whether a PHP
extension might therefore be cheaper. Given the wide use
of PHP, the expectation is that there is almost no overhead.

To determine some meaningful values, let's search for a short string
of 10 bytes in a larger one of 1 kB which doesn't contain the
search string.
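A minimal sketch of how such a measurement could be taken. The haystack content, the needle, and the use of microtime() are assumptions for illustration, not the original benchmark code:

```php
<?php
// Build a 1 kB haystack that cannot contain the needle
// (the haystack only consists of the letters a-h).
$haystack = str_repeat('abcdefgh', 128);   // 1024 bytes
$needle   = 'qrstuvwxyz';                  // 10 bytes, never found

$start = microtime(true);
for ($i = 0; $i < 10000; $i++) {
    strpos($haystack, $needle);
}
$elapsed = (microtime(true) - $start) * 1000;
printf("10000 strpos() calls took %.2f ms\n", $elapsed);
```

The same loop with stripos(), strstr() or stristr() substituted yields the other rows of the table below.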

And here are the average results for 10000 1 kB string searches on the test setup:

    Environment      Function     Avg. time
    C                strstr()       0.08 ms
    php5-cgi         strpos()      27.7  ms
    mod_php5         strpos()      30.6  ms
    mod_php5 + ZP    strpos()      37.2  ms
    php5-cgi         stripos()    163.6  ms
    mod_php5         stripos()    172.8  ms
    mod_php5 + ZP    stripos()    177.3  ms
    php5-cgi         strstr()      25.3  ms
    mod_php5         strstr()      27.5  ms
    mod_php5 + ZP    strstr()      42.7  ms
    php5-cgi         stristr()    156.7  ms
    mod_php5         stristr()    164.3  ms
    mod_php5 + ZP    stristr()    174.0  ms

It seems like C is a lot faster, and standalone PHP does somewhat better than
Apache with the overhead of the Zend Platform (ZP). The difference between
the search variants is as expected, with the case-insensitive searches
being more expensive (roughly 4 to 6 times). The strstr()/stristr() functions,
which the PHP manual discourages because they are memory-intensive, are
about as fast as the strpos()/stripos() functions.

Cost of Regular Expressions

Now let's compare the strpos() function with preg_match(). All tests
will be performed using 'php5-cgi'. The following table lists results
for several test runs with different numbers of repetitions.

    n              1        2        3        10       100      1000     10000
    strpos()       0.01 ms  0.02 ms  0.04 ms  0.2  ms  0.9  ms  2.6 ms   25.6 ms
    preg_match()   0.2  ms  0.2  ms  0.3  ms  0.47 ms  0.95 ms  7.4 ms   72.2 ms
    Ratio          1/20     1/10     1/7      1/2      1/1      1/3      1/3

From the numbers one can see there is a significant overhead for
the first preg_match() call. This is due to the regular expression
compilation, which is only necessary on the first run. Thus the
execution time ratio is 1/20 at first but approaches 1/3 with larger
numbers of executions.

Nonetheless preg_match() will always be more expensive than strpos().

When To Still Prefer preg_match()

How can preg_match() be useful with static matches when it will never be faster
than strpos()? Well, imagine a use case where you have to match a set of static
strings. Let's take the following example:
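The original code sample appears to be missing here; the following is a plausible reconstruction of five hard-coded strpos() checks against one input string. The last three check sentences are taken from the discussion below; the first two and the function name are hypothetical:

```php
<?php
// Naive blacklist: one strpos() call per check sentence.
function is_blacklisted($input) {
    if (strpos($input, "Buy cheap pills now") !== false) return true;   // hypothetical
    if (strpos($input, "Visit my great site") !== false) return true;   // hypothetical
    if (strpos($input, "It's raining") !== false) return true;
    if (strpos($input, "These rainy days make you sad.") !== false) return true;
    if (strpos($input, "Travelling by train is fun.") !== false) return true;
    return false;
}
```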

The cost of the above is determined by calling strpos() 5 times using the same input string with different comparison strings of a similar length. Running the code in the test setup 10000 times results in an average runtime of ~58 ms.

Now imagine the list of strpos() executions is a bit longer, maybe 100 comparisons serving as a comment-spam filtering blacklist. Each new check sentence would add another ~10 ms (per 10000 executions). The question is how to avoid linearly increasing runtime costs when adding new checks. One possible answer is clustering matches. For example: by checking whether a check string contains the word "rain", one can avoid executing the last three checks for "It's raining", "These rainy days make you sad." and "Travelling by train is fun.". Only if the word "rain" matches do we have to run the specific checks.
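A sketch of such a clustered check, reusing the three "rain" sentences from the text (the function name is hypothetical). Note that "rain" also matches "train", which is exactly why it can guard all three subchecks:

```php
<?php
// Clustered version: one cheap pre-check guards three specific checks.
function is_blacklisted_clustered($input) {
    if (strpos($input, "rain") !== false) {
        // Only run the specific checks when the cluster key matches.
        if (strpos($input, "It's raining") !== false) return true;
        if (strpos($input, "These rainy days make you sad.") !== false) return true;
        if (strpos($input, "Travelling by train is fun.") !== false) return true;
    }
    return false;
}
```

For a non-matching input this costs a single strpos() call instead of three.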

The extra check naturally adds another strpos() execution. And if it tests for a common word that appears in most of the check strings, like "is", it is not very useful. So the idea is to use a rather long match string ("rain" is pretty bad in that sense) that is very uncommon. Also, the number of "subchecks" should be larger than 2; otherwise there is no gain when a subcheck is triggered (there would still be 2 strpos() calls).

So when you have good clustering with rare match strings and the clustered subcheck groups are quite large, the performance gain can be significant. Imagine 100 test strings clustered into 3 main groups, each of those clustered into 4 subgroups containing only 5 to 10 checks each. In such a case you would have a matching effort of 3 strpos() runs for non-matching strings, and a maximum matching effort of 17 strpos() runs for matching strings (3+4 for the cluster group matching, 10 at most for the subcheck matching).

Now the problem remains how to organise matches like this, and whether it is possible in all cases. With luck one might have similar sentences with truly equal substrings, but often the match would need to be case-insensitive, and sometimes whitespace or a different word order might prevent grouping similar checks. In such cases preg_match() might help, using a somewhat more complex expression to bridge the gaps between word order and whitespace. As long as preg_match() is used only on the upper check levels, the cost impact should stay moderate.
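For example, a single case-insensitive pattern with flexible whitespace can replace several strpos() variants of the same check sentence. The pattern below is illustrative, not from the original article:

```php
<?php
// One pattern covers differing case and whitespace that would
// otherwise require several separate strpos() checks.
$pattern = '/these\s+rainy\s+days/i';

var_dump(preg_match($pattern, "THESE   Rainy days make you sad."));  // int(1)
var_dump(preg_match($pattern, "What sunny days!"));                  // int(0)
```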

Finally the question remains whether this is a good idea at all. Are there cheaper solutions? Feel free to comment!