Aho-Corasick string matching in Haskell

The Aho-Corasick string matching algorithm constructs an automaton for matching a dictionary of patterns. Running the automaton over an input string takes time linear in the length of the input plus the number of matches (so at worst quadratic in the input, when many patterns end at the same positions). It’s been around since 1975, but it isn’t implemented in the Haskell stringsearch library and I couldn’t even find a general trie data structure through Google. So I implemented the Aho-Corasick algorithm myself: take a look at the full Aho-Corasick module.

There was an interesting paper on deriving the algorithm by applying fully lazy evaluation and memoization to a more naive algorithm. Unfortunately, applying fully lazy evaluation and memoization to a function in Haskell is non-trivial (despite it being theoretically possible for the compiler to do so!).
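For comparison, here is a minimal sketch of what a more naive dictionary matcher could look like. It is not taken from the paper or from my module; the name naiveMatches is made up for illustration. It simply tries every pattern at every offset of the input.

```haskell
import Data.List (isPrefixOf, tails)

-- | Hypothetical baseline, not the paper's starting point: report every
-- (start offset, pattern) pair at which a pattern occurs in the input.
-- Every pattern is re-tested at every offset, so no work is shared
-- between patterns or between neighbouring offsets.
naiveMatches :: Eq a => [[a]] -> [a] -> [(Int, [a])]
naiveMatches patterns input =
  [ (offset, p)
  | (offset, suffix) <- zip [0 ..] (tails input)  -- each suffix with its offset
  , p <- patterns                                 -- each dictionary entry
  , p `isPrefixOf` suffix                         -- does it start here?
  ]
```

For example, naiveMatches ["a", "aa", "aaa"] "aaaa" reports 9 matches on a 4-character input, which shows how the number of matches alone can grow faster than the input length.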

It’s always interesting trying to find the functional equivalent to an imperative algorithm. I ended up using some cute Haskell tricks.

Instead of using a list to implement the branches of a rose tree, I used partial application of edge. This certainly looks elegant, but in fact it is the weak point: withPrefix is a linear search, whereas the imperative approach gets an O(1) lookup (with small alphabets) or an O(log m) lookup over m branches. Furthermore, the lazy evaluation of edge means that the trie is constantly being reconstructed as the automaton traverses it.
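One way to get the O(log m) branch lookup in a functional setting would be to store each node’s branches in a Data.Map from the containers package. The following is only a sketch of that alternative, not the representation my module uses: the names Trie, insertPattern, trieFromList, stepEdge and memberTrie are all made up here, and it covers the trie only, without the failure links of the full automaton.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (foldl')

-- | A trie node: whether a pattern ends here, plus the branches keyed by
-- the next symbol. The trie is built once as a data structure instead of
-- being recomputed through a function.
data Trie a = Trie
  { isTerminal :: Bool
  , children   :: Map.Map a (Trie a)
  }

emptyTrie :: Trie a
emptyTrie = Trie False Map.empty

-- | Insert one pattern, sharing any prefix already present.
insertPattern :: Ord a => [a] -> Trie a -> Trie a
insertPattern []       (Trie _ cs) = Trie True cs
insertPattern (x : xs) (Trie t cs) =
  Trie t (Map.insert x (insertPattern xs child) cs)
  where
    child = Map.findWithDefault emptyTrie x cs

-- | Build a trie from a dictionary of patterns.
trieFromList :: Ord a => [[a]] -> Trie a
trieFromList = foldl' (flip insertPattern) emptyTrie

-- | Follow one edge: O(log m) in the number of branches m at this node.
stepEdge :: Ord a => a -> Trie a -> Maybe (Trie a)
stepEdge x = Map.lookup x . children

-- | Is this exact word one of the inserted patterns?
memberTrie :: Ord a => [a] -> Trie a -> Bool
memberTrie []       t = isTerminal t
memberTrie (x : xs) t = maybe False (memberTrie xs) (stepEdge x t)
```

With trieFromList ["he", "she", "hers"], memberTrie "hers" succeeds while memberTrie "her" does not, since "her" is only a prefix. The Ord constraint is what buys the logarithmic lookup, and it also means the sketch is not tied to Char.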

Obviously it’s not written to be generic in any sophisticated way, but it should work fine on lists whose elements are of types other than Char.

The following pathological case didn’t run too badly (25 seconds for m=50, n=100000 on scud, compiled with ghc -O2). Profiling it revealed 20 million entries into edge, which easily dominates the timing. Oddly enough this just seems to be a large constant; other samples suggest the running time is linear in the product of m and n.