Chris F Clark <cfc@shell01.TheWorld.com> wrote:>...thus the classic Aho-Corasik (AC) and Boyer-Moore (BM) algorithms.>In addition, I've known for a "long time" that the trick of working>backwards ala BM could be applied to a AC style matcher..>Still, it seems to me to be an obvious thing to do, so I would be>surprised if someone else hasn't already solved this problem...

It's the Commentz-Walter algorithm, published (although rather obscurely)
in 1979. CW is to AC as BM is to Knuth-Morris-Pratt, roughly speaking, a
backwards-first optimization. Or alternatively :-), CW is to BM as AC is
to KMP, a multi-string generalization.

Commentz-Walter, B., "A string matching algorithm fast on the average",
in Maurer ed., Proc 6th Internat. Coll. on Automata, Languages, and
Programming, Springer-Verlag 1979, pp118-132. The inaccessibility of this
paper (and the internal IBM Germany tech report it was based on) are a
large part of why CW is little known.

I'm tempted to say that a more accessible description is findable in
Crochemore&Rytter, "Text Algorithms", Oxford 1994, but I won't :-),
because that book is a nightmare to read -- in particular, it is riddled
with really nasty typos that make it very difficult to follow the more
subtle algorithms unless you already know what's going on.

Bruce Watson's 1995 Eindhoven PhD thesis, "Taxonomies and Toolkits of
Regular Language Algorithms", is a much more readable and far more
definitive reference -- including extensions of BM to full regular
expressions -- but as far as I know, it suffers from the same problem of
relative inaccessibility.

There is reportedly a good description of CW in Aho's chapter in
van Leeuwen ed., "Handbook of Theoretical Computer Science", vol. A,
but I have not checked this.
--
spsystems.net is temporarily off the air; | Henry Spencer
mail to henry at zoo.utoronto.ca instead. | henry@spsystems.net