Thursday, March 24, 2011

Lucene's FuzzyQuery is 100 times faster in 4.0

There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only for its incredible gains but also for the amazing behind-the-scenes story of how it all came to be.

FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.

Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.
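To make the brute-force idea concrete, here is a minimal Python sketch (not Lucene's actual code): a textbook dynamic-programming edit distance, applied to every term in a small, made-up term list. The function names and the sample terms are assumptions for illustration only.

```python
def edit_distance(a, b):
    """Textbook Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or free match
        prev = curr
    return prev[-1]


def brute_force_fuzzy(terms, base, max_edits):
    """Visit every single term and keep those within max_edits of base."""
    return [t for t in terms if edit_distance(base, t) <= max_edits]


# Hypothetical stand-in for the unique terms in an index.
terms = ["lucene", "mcandles", "mccandless", "michael", "mkandless"]
matches = brute_force_fuzzy(terms, "mcandless", 1)
print(matches)
```

The cost is one full distance computation per unique term, no matter how unlikely the term is to match, which is exactly what makes this approach so slow on a large index.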

The journey begins

The long journey began when Robert Muir had the idea of pre-building a Levenshtein Automaton, a deterministic automaton (DFA) that accepts only the terms within edit distance N. Doing this, up front, and then intersecting that automaton with the terms in the index, should give a massive speedup, he reasoned.
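As a rough illustration of what such an automaton accepts, here is a small Python sketch that simulates the classic nondeterministic Levenshtein automaton. This is the simple textbook construction, not the table-driven deterministic one from the paper discussed below, and all names here are mine. Each state (i, e) means "matched the first i characters of the base term using e edits".

```python
def with_deletions(states, pattern, max_edits):
    """Also allow skipping base-term characters (deletions) without input."""
    out = set()
    for i, e in states:
        out.add((i, e))
        while i < len(pattern) and e < max_edits:
            i, e = i + 1, e + 1
            out.add((i, e))
    return out


def advance(states, ch, pattern, max_edits):
    """Consume one input character and return the new set of live states."""
    nxt = set()
    for i, e in states:
        if i < len(pattern) and pattern[i] == ch:
            nxt.add((i + 1, e))          # exact match, no edit spent
        if e < max_edits:
            nxt.add((i, e + 1))          # ch was inserted
            if i < len(pattern):
                nxt.add((i + 1, e + 1))  # ch replaces pattern[i]
    return with_deletions(nxt, pattern, max_edits)


def accepts(pattern, max_edits, candidate):
    """True if candidate is within max_edits edits of pattern."""
    states = with_deletions({(0, 0)}, pattern, max_edits)
    for ch in candidate:
        states = advance(states, ch, pattern, max_edits)
        if not states:  # no live state left: this prefix can never match
            return False
    return any(i == len(pattern) for i, _ in states)


print(accepts("mcandless", 1, "mccandless"))
print(accepts("mcandless", 1, "michael"))
```

Note the early exit when the state set becomes empty: that "dead prefix" signal is what lets an automaton rule out whole ranges of terms rather than checking them one by one.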

At first he built a simple prototype, explicitly unioning the separate DFAs that allow for up to N insertions, deletions and substitutions. But, unfortunately, just building that DFA (let alone then intersecting it with the terms in the index), was too slow.

Fortunately, after some Googling, he discovered a paper, by Klaus Schulz and Stoyan Mihov (now famous among the Lucene/Solr committers!) detailing an efficient algorithm for building the Levenshtein Automaton from a given base term and max edit distance. All he had to do was code it up! It's just software after all. Somehow, he roped Mark Miller, another Lucene/Solr committer, into helping him do this.

Unfortunately, the paper was nearly unintelligible! It's 67 pages, filled with all sorts of equations, Greek symbols, definitions, propositions, lemmas, proofs. It uses scary concepts like Subsumption Triangles, along with beautiful yet still unintelligible diagrams. Really the paper may as well have been written in Latin.

Much coffee and beer was consumed, sometimes simultaneously. Many hours were spent on IRC, staying up all night, with Mark and Robert carrying on long conversations, which none of the rest of us could understand, trying desperately to decode the paper and turn it into Java code. Weeks went by like this and they actually had made some good initial progress, managing to loosely crack the paper to the point where they had a test implementation of the N=1 case, and it seemed to work. But generalizing that to the N=2 case was... daunting.

The breakthrough

Then, finally, a breakthrough! Robert found, after even more Googling, an existence proof, in an unexpected place: an open-source package, Moman, under the generous MIT license. The author, Jean-Philippe Barrette-LaPierre, had somehow, incredibly, magically, quietly, implemented the algorithm from this paper. And this was apparently a random side project for him, unrelated to his day job. So now we knew it was possible (and we all have deep admiration for Jean-Philippe!).

Not really understanding either the Python code or the paper, we desperately tried to write our own Python code to tap into the various functions embedded in Moman's code, to auto-generate Java code containing the necessary tables for each max edit distance case (N=1, N=2, etc.). We had to guess what each Python function did, by its name, trying to roughly match this up to the spooky terminology in the paper.

The result was createLevAutomata.py: it auto-generates crazy looking Java code (see Lev2ParametricDescription.java, and scroll to the cryptic packed tables at the bottom), which in turn is used by further Java code to create the Levenshtein automaton per-query. We only generate the N=1 and N=2 cases (the N>=3 cases aren't really practical, at least not yet).

The last bug...

Realize, now, what a crazy position we were in. We wrote our own scary Python code, tapping into various functions in the Moman package, to auto-generate unreadable Java code with big tables of numbers, which is then used to generate Levenshtein automata from the base term and N. We went through many iterations with this crazy chain of Python and Java code that we barely understood, slowly iterating to get the bugs out.

After fixing many problems, we still had one persistent bug which we just couldn't understand, let alone fix. We struggled for several days, assuming the bug was in our crazy Python/Java chain. Finally, we considered the possibility that the bug was in Moman, and indeed Robert managed to reduce the problem to a tiny Python-only case showing where Moman failed to match the right terms. Robert sent this example to Jean-Philippe, who quickly confirmed the bug and posted a patch the next day. We applied his patch and suddenly everything was working perfectly!

Fortunately, while this fast FuzzyQuery was unbelievably hairy to implement, testing it well is relatively easy since we can validate it against the brute-force enumeration from 3.0. We have several tests verifying the different layers executed by the full FuzzyQuery. The tests are exhaustive in that they test all structurally different cases possible in the Levenshtein construction, using binary terms (only the characters 0 and 1).

Beyond just solving this nearly impossible task of efficiently compiling a term to a Levenshtein Automaton, we had many other parts to fill in. For example, Robert separately created a general AutomatonQuery, re-using infrastructure from the open-source Brics automaton package, to enable fast intersection of an automaton against all terms and documents in the index. This query is now used to handle WildcardQuery, RegexpQuery, and FuzzyQuery. It's also useful for custom cases; for example, it's used by Solr to reverse wildcard queries. These slides from Robert describe AutomatonQuery, and some fun possible use cases, in more detail.

Separately, we had an impedance mismatch: these automatons speak full Unicode (UTF32) characters, yet Lucene's terms are stored in UTF8 bytes, so we had to create a UTF32 -> UTF8 automaton converter, which by itself was also very hairy! That converter translates any UTF32 automaton into an equivalent UTF8 automaton, which can be directly intersected against the terms in the index.
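To give a feel for why that conversion is fiddly, here is a small Python sketch (my own illustration, not the converter's code). It shows that one code-point transition becomes a chain of one to four byte transitions, and that a single range of code points must be split wherever the UTF8 length changes. The helper names are hypothetical, and the sketch ignores the surrogate range, which a real converter has to handle.

```python
# Largest code point encodable in 1, 2, 3 and 4 UTF-8 bytes respectively.
UTF8_LIMITS = (0x7F, 0x7FF, 0xFFFF, 0x10FFFF)


def utf8_path(code_point):
    """Bytes a byte-level automaton must consume to match one code point."""
    return list(chr(code_point).encode("utf-8"))


def split_by_utf8_length(lo, hi):
    """Split the code point range [lo, hi] into pieces of equal byte length."""
    pieces = []
    for n_bytes, limit in enumerate(UTF8_LIMITS, start=1):
        if lo > hi:
            break
        if lo <= limit:
            end = min(hi, limit)
            pieces.append((lo, end, n_bytes))
            lo = end + 1
    return pieces


for cp in (0x61, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {[hex(b) for b in utf8_path(cp)]}")
print(split_by_utf8_length(0x61, 0x7FF))
```

Each piece then still needs its own sub-automaton over byte ranges, which is where most of the hairiness lives.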

So, today, when you run a FuzzyQuery in 4.0, it efficiently seeks and scans only those regions of the term space which may have matches, guided by the Levenshtein automaton. This, coupled with ongoing performance improvements to seeking and scanning terms, as well as other major improvements like performing MultiTermQuery rewrites per-segment, has given us the astounding overall gains in FuzzyQuery.

Thanks to these enormous performance improvements, Robert has created an entirely new automaton spell checker that uses this same algorithm to find candidate terms for respelling. This is just like FuzzyQuery, except it doesn't visit the matching documents. This is a big improvement over the existing spellchecker as it does not require a separate spellchecker index be maintained.

This whole exciting experience is a great example of why open-source development works so well. Here we have diverse committers from Lucene/Solr, bringing their various unusual strengths (automatons, Unicode, Python, etc.) to bear on an insanely hard challenge, leveraging other potent open-source packages including Moman and Brics, iterating with the authors of these packages to resolve bugs. No single person involved in this really understands all of the parts; it's truly a team effort.

And now you know what's going on under the hood when you see incredible speedups with FuzzyQuery in 4.0!

[For the not-faint-of-heart, you can browse LUCENE-1606 to see parts of this story unfolding through Jira]

Fantastic post Mike! It's really nice to hear about the behind-the-scenes story of how something in the research literature makes its way into Lucene and all the hard work you guys do to make it happen. I'm really looking forward to 4.0!

The Levenshtein automaton is actually built once up front, and then "intersected" with each segment's terms, segment by segment. We have a single PQ that's used to merge the terms from each segment, then at the end we take the top terms from this PQ and create the real query (this is in TopTermsScoringBooleanQueryRewrite, in MultiTermQuery).
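A minimal Python sketch of that merge step, assuming each segment simply yields (term, score) pairs for its matching terms. The function name, the scores and the sample segments are made up for illustration, and the real rewrite does considerably more than this.

```python
import heapq


def top_terms(segments, k):
    """Keep the k best-scoring distinct terms seen across all segments."""
    heap = []     # min-heap of (score, term); weakest kept term is heap[0]
    seen = set()  # the same term can show up in more than one segment
    for segment in segments:
        for term, score in segment:
            if term in seen:
                continue
            seen.add(term)
            if len(heap) < k:
                heapq.heappush(heap, (score, term))
            elif (score, term) > heap[0]:
                heapq.heapreplace(heap, (score, term))
    return [term for _, term in sorted(heap, reverse=True)]


# Hypothetical per-segment matches for mcandless~1 with made-up scores.
segment_a = [("mcandless", 1.0), ("mcandles", 0.5)]
segment_b = [("mccandless", 0.6), ("mcandless", 1.0), ("mkandless", 0.4)]
best = top_terms([segment_a, segment_b], k=2)
print(best)
```

Because the queue is bounded at k entries, the final query stays small even when the automaton matches a very large number of terms.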

Moman on Jython would work, but, Lucene is all Java today, so we wanted to keep that.

N=3 is possible, but it produces biggish tables (like a few hundred KB increase in Lucene's JAR, from the packed tables we store in the generated Java code). Further, since the space of terms accepted by N=3 is so large, it's going to result in much more seeking/scanning to intersect with the terms dictionary, so it could be slowish. It'd be fairly easy to test this...

Is there some reason you didn't contact the authors of the paper with your questions? Seems like they would have been happy to help, or at least put you in touch with a graduate student who could translate the paper for you.

I over-dramatized things somewhat... we do have enough of an understanding to believe it's correct. Furthermore, the tests we created are strenuous, and are exhaustive in that we test the different possible characteristic vectors, so if the paper is correct, the algorithm we implemented should be as well (famous last words I know...).

That's a great question. Licensing and open-source are a very complex (and obviously important) topic. I am not a lawyer.... but, here's how the licensing/copyright worked in this case:

First, Moman's license is the MIT/X11 license, which Apache projects are free to incorporate in source and binary form (see http://www.apache.org/legal/3party.html) as this license does not remove any rights from ASL2.

Second, the Moman package is being used only as a library by the code generator we (Lucene developers) wrote; none of Moman's sources are incorporated into Lucene (only the generated Java code as a result of running the generator, and our Python code generator). In theory, Moman's license has no bearing (much like the license of a compiler doesn't apply to the binaries it produces).

But, finally, to be sure, we contacted Jean-Philippe to confirm he was the sole author of this package, and he was OK with our usage of it (he was), and we've also added Moman's copyright and license in Lucene's LICENSE.txt file.

Pre-building the Levenshtein automaton changes the problem from "test every term" to a graph intersection problem, ie, we intersect the graph (Levenshtein automaton) with the terms in the terms dictionary by using seek/next APIs already available in Lucene.

In fact, at some point we should add an intersect() method directly into Lucene's APIs, because some terms dict representations can potentially do this more efficiently than seek/next.

As I cannot fully understand your explanation above, I have the following questions, which might be stupid.
1. Suppose we have n terms in the terms dictionary: do we have one graph or n graphs?
2. Why can using the seek/next APIs skip some terms? I would think that we have to calculate the intersection for all terms.

You have 1 graph created, from the fuzzy query's term plus edit distance N. This graph is the Levenshtein automaton.

You then intersect that graph with the terms in the index, by iteratively seeking to the next possible match. This seek is able to skip terms because chunks of the terms space cannot possibly match.

For example, if your query is foobar~1 (ie, edit distance 1), and you are at terms starting with foba, you know the only possible match with that prefix is fobar, so you seek there, possibly skipping fobaa, fobap, fobaq, etc.
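Here is a simplified Python sketch of that skipping, under loud assumptions: the terms dictionary is just a sorted Python list, "seek" is a binary search, and the automaton is the simple nondeterministic simulation rather than Lucene's precompiled tables. When the automaton dies partway through a term, every term sharing that dead prefix is skipped with one seek. Lucene's real enum is smarter still, since it computes the next string the automaton could accept, and all names here are mine.

```python
from bisect import bisect_left


def with_deletions(states, pattern, n):
    """Allow skipping base-term characters (deletions) without input."""
    out = set()
    for i, e in states:
        out.add((i, e))
        while i < len(pattern) and e < n:
            i, e = i + 1, e + 1
            out.add((i, e))
    return out


def advance(states, ch, pattern, n):
    """One input character: match, insertion, or substitution."""
    nxt = set()
    for i, e in states:
        if i < len(pattern) and pattern[i] == ch:
            nxt.add((i + 1, e))
        if e < n:
            nxt.add((i, e + 1))
            if i < len(pattern):
                nxt.add((i + 1, e + 1))
    return with_deletions(nxt, pattern, n)


def intersect(sorted_terms, pattern, n):
    """Return (matching terms, number of terms actually visited)."""
    matches, visited, idx = [], 0, 0
    while idx < len(sorted_terms):
        term = sorted_terms[idx]
        visited += 1
        states = with_deletions({(0, 0)}, pattern, n)
        dead_at = None
        for pos, ch in enumerate(term):
            states = advance(states, ch, pattern, n)
            if not states:
                dead_at = pos
                break
        if dead_at is None:
            if any(i == len(pattern) for i, _ in states):
                matches.append(term)
            idx += 1
        else:
            # Nothing starting with term[:dead_at + 1] can match, so seek to
            # the first term sorting after all strings with that prefix.
            # (Assumes the last character is not the maximum code point.)
            dead = term[:dead_at + 1]
            target = dead[:-1] + chr(ord(dead[-1]) + 1)
            idx = bisect_left(sorted_terms, target, idx + 1)
    return matches, visited


# Hypothetical terms dictionary, already in sorted order.
terms = ["fobaa", "fobaab", "fobaac", "fobap", "fobaq",
         "fobar", "foobar", "zebra", "zebras", "zest"]
matches, visited = intersect(terms, "foobar", 1)
print(matches, visited, "of", len(terms))
```

In this toy list the savings are modest, but in a real index a dead prefix such as ze can cover an enormous run of terms, all skipped with a single seek.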

I'm really glad to see someone picked up the paper from Klaus Schulz and Stoyan Mihov. When I first read it in 2007 or 2008 I didn't understand much more of it than its potential. Thanks for your efforts coding it within Lucene; I would never be capable of doing so. I'm really waiting to see multiple token fuzzy matches in action!

I just stumbled upon this very interesting blog post and have a question: doesn't the complexity of the generation of the Levenshtein automaton in terms of time and storage depend on the used alphabet?

I mean, generation of the automaton for the Latin alphabet with its 26 letters may be easy and efficient, but what about Unicode? Having an alphabet with hundreds of thousands of letters must complicate the situation dramatically, no?

Lucene generates the Levenshtein automaton with full Unicode alphabet and then we convert that to the equivalent automaton with UTF8 labels (since terms are stored UTF8 byte[] in the index, by default). I think this conversion is logically a composition of the FSA with an FST (but it's not implemented "generically").

This means the edit distance is defined with Unicode characters. So an edit distance of 2 means you can drop/insert/replace/transpose any 2 full Unicode chars. This is a difference vs previous versions of Lucene, which measure edit distance in UTF16 code units.
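A tiny Python illustration of why the unit matters (the helper name and sample strings are mine): a character outside the Basic Multilingual Plane is one code point but two UTF16 code units, so replacing it is a single edit when counting code points and looks like more than one when counting code units.

```python
def utf16_units(s):
    """Number of UTF-16 code units needed to store s."""
    return len(s.encode("utf-16-le")) // 2


# U+1D11E (musical symbol G clef) sits outside the Basic Multilingual Plane.
with_clef = "a\U0001D11Eb"
with_x = "axb"

print(len(with_clef), utf16_units(with_clef))  # code points vs code units
print(len(with_x), utf16_units(with_x))
```

Going from with_clef to with_x is one substitution in code points, whereas a code-unit view has to turn two units into one.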

I don't think the cost of building the Levenshtein automaton increases much with increased alphabet size, as long as your automaton representation can efficiently handle arcs that match a range of characters (ours, poached from http://www.brics.dk/automaton, does).
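A small Python sketch of what range arcs buy you, using a made-up representation of (min, max, destination) tuples rather than the Brics or Lucene classes: a state that distinguishes one character from "anything else" needs at most three arcs, however large the alphabet is.

```python
MAX_CODE_POINT = 0x10FFFF


def arcs_for(expected_char, match_dest, other_dest):
    """Arcs leaving a state that tells one character apart from all others."""
    cp = ord(expected_char)
    arcs = []
    if cp > 0:
        arcs.append((0, cp - 1, other_dest))
    arcs.append((cp, cp, match_dest))
    if cp < MAX_CODE_POINT:
        arcs.append((cp + 1, MAX_CODE_POINT, other_dest))
    return arcs


def follow(arcs, ch):
    """Destination state for ch, or None if no arc covers it."""
    cp = ord(ch)
    for lo, hi, dest in arcs:
        if lo <= cp <= hi:
            return dest
    return None


# Hypothetical state numbers: 1 = "matched c", 2 = "spent an edit".
arcs = arcs_for("c", match_dest=1, other_dest=2)
print(len(arcs), follow(arcs, "c"), follow(arcs, "k"), follow(arcs, "\u65e5"))
```

With per-character arcs the same state would need over a million transitions, which is why the representation, not the alphabet size, is what matters.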

That's fine: how the Unicode Consortium assigns Unicode characters won't affect FuzzyQuery's implementations. The tables we use to generate the Levenshtein automata are agnostic to the binding of each character: they simply accept any int up to Character.MAX_CODE_POINT.

I wonder how fast is 100 times faster? An example of a database and query, with search times, would be useful. I know it can be done in Java pretty fast, like this one: http://www.softcorporation.com/products/people/index.html

It's quite fast now ... you can see the nightly benchmarks (http://people.apache.org/~mikemccand/lucenebench/ ) ... ~ 30-40 QPS on the full Wikipedia index.

We also now have a DirectSpellChecker that runs a FuzzyQuery minus visiting the matching documents. This is nice because it avoids requiring the "sidecar" spellcheck index that's necessary with the old spellchecker.

I am confused about FuzzyQuery. Solr 4 supports fuzzy searches using edit distance, which takes a parameter N that can be 0, 1 or 2 (max). So why are values like 0.4, 0.6, ... up to 1 still supported, while 1.5, 2.2, ... are not? How does that make sense? Is it just for backward compatibility, or is there something I am missing?

I believe a value > 1 is supposed to be an integer edit distance, while a value <= 1 is allowed to be the legacy similarity measure (which under the hood is changed to an edit distance based on the length of the query term). But maybe ask this question on java-user@lucene.apache.org to be sure!
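As a hedged sketch of how such a conversion could look, here it is in Python. This is loosely modeled on my recollection of the 4.x behavior, so the exact branches (in particular how a value of exactly 1 or exactly 0 is treated) are assumptions to verify against the FuzzyQuery source, and the function name is made up.

```python
MAX_SUPPORTED_DISTANCE = 2  # only the N=1 and N=2 tables are generated


def similarity_to_edits(value, term_length):
    """Map the number after ~ to a max edit distance (assumed behavior)."""
    if value >= 1.0:
        # Treated as an edit distance; any fraction is dropped and the
        # result is capped at the largest supported distance.
        return min(int(value), MAX_SUPPORTED_DISTANCE)
    if value == 0.0:
        # Assumption: 0 means "exact match", not "unlimited edits".
        return 0
    # Legacy similarity: allow more edits for longer terms.
    return min(int((1.0 - value) * term_length), MAX_SUPPORTED_DISTANCE)


for value, length in ((2, 9), (0.5, 4), (0.75, 4), (0.75, 3)):
    print(value, length, "->", similarity_to_edits(value, length))
```

Under these assumptions the same legacy similarity yields different edit distances for short and long terms, which is part of what made the old float so confusing.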

Thanks for the reply. I have asked the question on java-user@lucene.apache.org.

As per your reply, what I understand is that values between 0 and 1 are still allowed so that existing fuzzy queries against Lucene need not change, and values like 1 or 2 are supported as an explicit parameter for fuzzy search (the number is the edit distance between the source and target strings; a value between 0 and 1 does the same thing, but via an internal calculation).

Is there a way to combine fuzzy search with phonetic search? I want to search on fields like first_name, last_name and so on, and get records that may contain spelling mistakes as well. (The spell suggester does not suit me, as I want Solr documents in the output, not a list of words/strings.)

Fuzzy search alone does not suit me, as I can provide at most ~2 (edit distance) as the fuzziness factor in the query, and phonetic search alone will not work either, as there are some words whose DoubleMetaphone encoding changes completely with a change to a single character.

I also came to know that with fuzzy search all the query-time analysis is bypassed, so I am unable to find a way to have both together.

One way that I just found is to have two fields, one analyzed with phonetic filters and one as text. Then I could query them as (firstname_text:hello~2 AND firstname_phonetic:hello)

If there is no way to have both together, is the approach I have in mind correct?

This is certainly possible with Lucene: just analyze the field with a phonetic analyzer, do the same at query time, and create a FuzzyQuery from that. But it sounds like you need to know how to do this with Solr? I'm not sure ... did you already email the solr-user@lucene.apache.org list?

Hey Michael, first of all thanks for this awesome blog! I'm currently facing the exact problem as described here. Now I want to analyze the input at query time but I'm not quite sure how to do so. Do I analyze a single String and return the analyzed version?

Michael, thanks for the blog. I had a clarification regarding the nature of the improvement:

In the old fuzzy search, the system examined every 'plausible' candidate and computed the actual Levenshtein distance, an expensive computation for each candidate, to decide whether it was within N. Whereas the new one uses a new data structure (the automaton/table built specifically for the current query) to check whether every 'plausible' candidate is within a distance of N.

That's correct, except with the approach in 4.x, since we pre-compile the space of all terms within edit distance N of the target term into an automaton up front, visiting the matching terms is a much, much faster process (just an intersection of that automaton with the tree structure of the terms in the terms dictionary).

I am working on OpenNLP with Solr. I have successfully applied the patch LUCENE-2899-x.patch to the latest Solr code on branch_4x. I designed some analyzers based on OpenNLP filters and tokenizers and indexed some documents into those fields. Searching on the OpenNLP fields is inconsistent: I am not able to search properly on these OpenNLP-based fields defined in Solr's schema.xml. Also, how can I use payloads for boosting a document?

I was now thinking about applying this to an autocomplete feature. Currently there are 2 different approaches for autocomplete:
1) Using the term enum and filtering on a byte prefix for each term in the enum (which is a BytesRef)
2) Using the suggester (org.apache.solr.spelling.suggest.Suggester, org.apache.solr.spelling.suggest.fst.FSTLookup)

The second approach should be very similar to the spellcheck FSA approach. So is it faster to use (2) the FST or (1) the byte prefix filter?

About Me

Michael loves building software; he's been building search engines for more than a decade. In 1999 he co-founded iPhrase Technologies, a startup providing a user-centric enterprise search application, written primarily in Python and C. After IBM acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a committer in 2006 and PMC member in 2008. Michael has remained an active committer, helping to push Lucene to new places in recent years. He's co-author of Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things.