Saturday, January 24, 2009

Fast incremental regular expression matching with monoids

The Problem

Consider this problem: Fix a regular expression R. Suppose you have a string of length N. There's not much you can do about it, you'll likely have to scan all N characters to test so see if the string matches R. But once you've performed the test, how fast can you test the string again if a small edit is made to it? It seems that in general you'd have to rescan the entire string, or at least rescan from where the edit was made. But it turns out that you can do regular expression matching incrementally so that for many changes you might make to the string, you only require O(log N) time to recompute whether the string matches. This is true even if characters at opposite ends of the string interact to make a succcessful match. What's more, it's remarkably straightforward to implement if we make use of fingertrees and monoids.

A Finite State Machine

So let's start with an example regular expression: .*(.*007.*).*. We're looking for "007" enclosed between parentheses, but the parentheses could be millions of characters apart.

A standard technique for finding regular expressions is to compile them to a finite state automaton. It takes quite a bit of code to do that, but it is completely standard. So rather than do that, here's a finite state machine I constructed by hand for this regular expression:I've used the convention that an unlabelled edge corresponds to any input that isn't matched by another labelled edge. We can express the transitions as a function fsm.

> fsm 0 '(' = 1> fsm 0 _ = 0

> fsm 1 '0' = 2> fsm 1 _ = 1

> fsm 2 '0' = 3> fsm 2 _ = 1

> fsm 3 '7' = 4> fsm 3 '0' = 3> fsm 3 _ = 1

> fsm 4 ')' = 5> fsm 4 _ = 4

> fsm 5 _ = 5

The initial state is 0 and a match corresponds to state 5.

We can test a string in the standard way using

> matches s = Prelude.foldl fsm 0 s==5

Try matches "(00 7)" and matches "He(007xxxxxxxxxxxx)llo".

We can thing of the inputs as being functions acting on the automaton. Each input character is a function that maps the automaton from one state to another. We could use the Haskell composition function, (.), to compose these functions. But (.) doesn't really do anything, f . g is just a closure that says "when the time comes, apply g and then f.

On the other hand we could tabulate our transition functions as follows:

Given two tabulated functions we can easily form the table of the composition function. In fact, our tabulated functions form a monoid with mappend for composition. I used unboxed arrays for performance:

Note that we've cheated a bit. An object of type Table could be any array of Ints indexed by Int. But if we promise to only build arrays indexed by elements of [0..5] and containing elements of the same range then our claim to monoidhood is valid.

Given any string, we can compute whether it matches our regular expression by looking up the corresponding Table in our letters array, composing them, and then checking if the tabulated function maps the initial state 0 to the final state 5:

This is slower and more complex than our original implementation of matches. But what we've now done is 'tease out' a monoid structure from the problem. If we store a string as a sequence of characters represented by a fingertree, we can store in each subtree the element of Table corresponding to the substring it represents. Every time the tree is rebalanced we need to recompute the corresponding Tables. But that's fine, it typically involves only O(log N) operations, and we don't need to write any code, the fingertree will do it for us automatically. Once we've done this, we end up with a representation of strings with the property that we always know what the corresponding Table is. We can freely split and join such trees knowing that the Table will always be up to date.

The only slight complication is that I want to be able to randomly access the nth character of the tree. apfelmus explains that in his post. The change I need to make is that I'm going to use both the Size monoid and the Table monoid, so I need the product monoid.

Now you can run this interactively. Input values like (100,'f') to insert an 'f' at position 100. It can take a good few seconds to compute the initial tree, but after that the matching process is instantaneous. (Actually, the second match might take a few seconds, that's because despite the foldl' the tree hasn't been fully built.)

A suitable sample input is:

(100,'(')(900000,')')(20105,'0')(20106,'0')(20107,'7')

Discussion

Note there is quite an overhead for this example. I'm storing an entire Table for each character. But you can easily store chunks of string (like in a rope). This means that some chunks will be rescanned when a string is edited - but rescanning a 1K chunk, say, is a lot less expensive than scanning a gigabyte file in its entirety. Working in blocks will probably speed up the initial scan too, a much smaller tree needs to be built.

When Hinze and Patterson originally wrote the fingertree paper they were motivated by parallel prefix sum methods. Just about any parallel prefix algorithm can be converted to an incremental algorithm using fingertrees. This article is based on the idea of doing this with the parallel lexing scheme described by Hillis and Steele in their classic Connection Machine paper.

So why would you want to match against a fixed regular expression like this? Well this method extends to a full blown incremental lexer. This will lex quickly even if placing a character in a string changes the type of lexemes billions of characters away. See the Hillis and Steele paper for details.

Note there's nothing especially Haskelly about this code except that Haskell made it easy to prototype. You can do this in C++, say, using mutable red-black trees.

10 comments:

In other words, deterministic finite state automata and thus regular expressions are actually monoid morphisms from the free monoid over the alphabet to a finite monoid, namely the monoid of endomorphisms on the states.

I finally remember this connection from a past automata & languages lesson. For instance, the converse is true as well, every homomorphism to a finite monoid gives rise to a finite state automaton. (Construction: take the monoid itself as state space, every element corresponds to an endomorphism via left multiplication). I think this can and is used to minimize finite state automata.

In other words, deterministic finite state automata and thus regular expressions are actually monoid morphisms from the free monoid over the alphabet to a finite monoid, namely the monoid of endomorphisms on the states.

I finally remember this connection from a past automata & languages lesson. For instance, the converse is true as well, every homomorphism to a finite monoid gives rise to a finite state automaton. (Construction: take the monoid itself as state space, every element corresponds to an endomorphism via left multiplication). I think this can and is used to minimize finite state automata.

There's an extra twist you can squeeze out of this formalism. You can label the transitions of the automaton with output values that live in a monoid. Then you can do things like counting and minimisation while matching - eg. to count the number of lexems, or find the longest one, or find the longest string between a certain pair and so on.

Dan, I just published an article that expands on the ideas from this blogpost and builds a Java library for incremental matching of multiple regular expressions: http://jkff.info/articles/ire (it was actually published in the Russian functional programming journal http://fprog.ru/ but I recently got around to finish the translation)

1. Can you do some benchmarks against the Rope implementation? - http://jkff.info/articles/ire

2. You mention Red/Black trees for C++ at the end there. Wouldn't that adversely affect the complexities? - Why wouldn't you use Finger Trees in C++? (there is a Masters theses floating around on the interwebs with an implementation of the 2-3 tree variant)