Monday, 28 June 2010

Aho-Corasick (multiple pattern match) in PHP

When I started programming Mergifier, my "RSS feeds dispenser", I knew I was going to need some kind of terms filtering. It's useless to take into account the most common words, for they don't bring any information.

My initial solution was to split the text into an array of substrings and use array_diff to remove those in my blacklist array. I don't know the asymptotic cost of array_diff algorythm, but using it with plenty of items (featuring long titles, summaries and descriptions) soon appeared a bit too expensive.

Recently I studied a lot of pattern matching algorythms, including the well known "Aho-Corasick" which checks occourrences of multiple patterns "at once" inside a text of length n in O(M+n), being M the total number of characters composing the patterns. Useful for virus signature detection and bioinformatics, I did not find its implementation anywhere, so I did my own!

As you could see inside the code, this algorythm is cool because one can precalculate delta/phi transition tables and cache them, brutally speeding up computations on many different texts made of many terms.

If you find this code useful, or if you have any kind of suggestion to improve it, please let me know. :)