Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

Details

Lucene currently uses a brute-force scan over the full term list and calculates the distance for each term. The new BKTree structure improves performance on average 20x when the distance is 1, and 3x when the distance is 3. I tested with an index of several million docs and 250,000 terms.
The new algorithm uses integer distances between objects.

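For readers unfamiliar with the structure: a BK-tree stores objects in a tree whose edges are labelled with integer distances to the parent, and the triangle inequality prunes the search to a narrow band of children. A minimal illustrative sketch (my own, not the patch's actual BKTree class):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree over Strings with Levenshtein distance (an integer metric).
// All names are illustrative; this is not the patch's BKTree implementation.
public class BKTreeSketch {

    private final String root;
    private final Map<Integer, BKTreeSketch> children = new HashMap<>();

    public BKTreeSketch(String term) { this.root = term; }

    public void add(String term) {
        int d = levenshtein(root, term);
        if (d == 0) return;                          // term already present
        BKTreeSketch child = children.get(d);
        if (child == null) children.put(d, new BKTreeSketch(term));
        else child.add(term);
    }

    // Collect all terms within maxDist of the query. The triangle inequality
    // means only children with edge weight in [d - maxDist, d + maxDist]
    // can contain matches, so most of the tree is skipped.
    public void query(String term, int maxDist, List<String> out) {
        int d = levenshtein(root, term);
        if (d <= maxDist) out.add(root);
        for (int i = Math.max(1, d - maxDist); i <= d + maxDist; i++) {
            BKTreeSketch child = children.get(i);
            if (child != null) child.query(term, maxDist, out);
        }
    }

    // Standard two-row dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[b.length()];
    }
}
```

The pruning band explains the speedups quoted above: a small threshold keeps the band narrow, while a large threshold degenerates toward visiting the whole tree.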


Fuad Efendi
added a comment - 23/Jan/10 02:45 - edited
Minor bug fixed (with cache warm-up)...
Don't forget to disable the QueryResultsCache and to enable a large DocumentCache (if you are using SOLR); otherwise you won't see the difference. (SOLR users: there are some other tricks!)
With Lucene 2.9.1: 800ms
With BKTree and the Levenshtein algorithm: 200ms
With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html : 70ms
These are average timings after many hours of tests. We may consider an "integer" distance instead of "float" for the Lucene Query if we accept this algorithm; I tried my best to keep it close to the "float" distance.
The BKTree is cached in FuzzyTermEnumNEW. It needs a warm-up if the index changes; the simplest way is to recalculate it at night (a separate thread will do it).
======
P.S.
FuzzyQuery constructs an instance of FuzzyTermEnum and passes an instance of IndexReader to the constructor. This is the way... If the IndexReader changes (a new instance), we can simply repopulate the BKTree (or, for instance, we can compare the lists of terms and add only the terms missing from the BKTree)...
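For reference, the "Strike a Match" similarity linked above compares adjacent letter pairs. A minimal sketch of the algorithm as described in the article (the class name is mine, and the conversion from this similarity to the integer distance a BK-tree needs is not shown here):

```java
import java.util.ArrayList;
import java.util.List;

// Simon White's "Strike a Match" similarity:
//   2 * |shared letter pairs| / (|pairs(a)| + |pairs(b)|)
// Result is in [0, 1]; 1 means the pair multisets are identical.
public class StrikeAMatch {

    // All adjacent character pairs of s, e.g. "FRANCE" -> FR, RA, AN, NC, CE.
    static List<String> letterPairs(String s) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < s.length() - 1; i++) {
            pairs.add(s.substring(i, i + 2));
        }
        return pairs;
    }

    public static double similarity(String a, String b) {
        List<String> pairs1 = letterPairs(a);
        List<String> pairs2 = letterPairs(b);      // mutated below
        int union = pairs1.size() + pairs2.size();
        int intersection = 0;
        for (String p : pairs1) {
            if (pairs2.remove(p)) intersection++;  // multiset intersection
        }
        return union == 0 ? 0.0 : (2.0 * intersection) / union;
    }
}
```

The article's worked example: "FRANCE" and "FRENCH" share the pairs FR and NC out of 5 + 5, giving 2*2/10 = 0.4.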


Dirk Goldhan
added a comment - 05/Feb/10 16:45
I have tried your patch on Lucene 3.0.0. I had to make a small change to get it to work.
In the current implementation the enumerator is positioned before the first element. This is no problem in Lucene 2.9.1, as it simply does one more iteration in FuzzyQuery (rewrite). In Lucene 3.0.0, FuzzyQuery directly accesses the first element of the enumeration. As this is null, it simply stops further processing of terms (if (t == null) break).
I have made a small change in the FuzzyTermEnumNEW class to assign the first element to the enumerator during creation. I simply appended the following lines within the constructor:
if (this.termIterator.hasNext()) {
    Term firstTerm = new Term(field, termMap.keySet().iterator().next());
    this.currentTerm = firstTerm;
}
Thank you for the patch.


Uwe Schindler
added a comment - 10/Feb/10 14:20 - edited
Hi Fuad,
Thanks for submitting your changed FuzzyQuery. After quickly looking through the classes, I found the following problems:
The cache is incorrectly synchronized: the cache is static, but access is synchronized on the instance.
The cache is not limited; maybe it should be a WeakHashMap. It can easily overflow memory (as it consumes a lot of it).
When you build the tree, you use a class from the spellchecker: org.apache.lucene.search.spell.LuceneDictionary. This adds additional memory consumption, especially if the index has a large term dict. Why not simply iterate over the IndexReader's TermEnum?
The cache cannot work correctly with per-segment search (since 2.9) or reopened IndexReaders, because it is bound only to the field name and not to an index reader. To have a correct cache, do it like FieldCache and use a combined key of the field name and IndexReader.getFieldCacheKey().
Otherwise it looks like a good approach, but the memory consumption is immense for large term dicts. We are currently developing a DFA-based FuzzyQuery, which will be provided when the new flex branch gets out: LUCENE-2089
If you fix the problems in your classes, we can add this patch to contrib.
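A hedged sketch of the keying scheme described in the last problem above: a cache keyed first by a per-reader key (held weakly, so trees for closed readers can be collected) and then by field name, with all access synchronized on a static lock. The class and wiring are illustrative, not the patch's code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative cache keyed per reader and per field. The outer key would be
// the reader's cache key object (cf. IndexReader.getFieldCacheKey() in Lucene
// 2.9/3.0); WeakHashMap drops an entry once the reader is no longer referenced,
// so a reopened reader naturally gets a fresh tree. Values are typed as Object
// here to stand in for the patch's BKTree.
public class BKTreeCache {

    private static final Map<Object, Map<String, Object>> CACHE = new WeakHashMap<>();

    // 'static synchronized' locks on the class, matching the static cache
    // (fixing the instance-lock problem noted above).
    public static synchronized Object get(Object readerKey, String field) {
        Map<String, Object> perField = CACHE.get(readerKey);
        return perField == null ? null : perField.get(field);
    }

    public static synchronized void put(Object readerKey, String field, Object tree) {
        CACHE.computeIfAbsent(readerKey, k -> new HashMap<>()).put(field, tree);
    }
}
```

With this shape, two different readers never share a tree, and the JVM can evict trees for readers that have gone away.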


Fuad Efendi
added a comment - 10/Feb/10 15:28
Hi Uwe,
Thanks for the analysis! I spent only a few days on this basic PoC.
I need to use the IndexReader (index version number, etc.) to rewarm the cache too; if a term disappeared from the index we can still leave it in the BKTree (not a problem; we can't remove it!), and if we have a new term we simply need to call
public void add(E term)
Synchronization should be significantly improved...
Cache warming takes 10-15 seconds in my environment, with about 250k tokens, and I use a TreeSet internally for fast lookup. I also believe that the main performance issue is related to the Levenshtein algorithm (which is significantly improved in trunk; plus synchronization was removed from FuzzySearch: LUCENE-2258)
Regarding memory requirements: BKTree is not heavy... I should use
StringHelper.intern(fld);
It's already in memory... and FuzzyTermEnum uses almost the same amount of memory for processing as BKTree. I'll check FieldCache.
The BKTree approach can be significantly improved.
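The incremental rewarm idea above ("compare the lists of terms and add only the missing ones") could be sketched as follows. The class name is mine, and the Consumer stands in for the patch's BKTree.add(term); since a BK-tree cannot delete, disappeared terms simply stay in the tree, which the author argues is harmless:

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Consumer;

// Sketch of incremental warm-up: remember which terms are already in the
// tree and, after the index changes, feed only unseen terms to BKTree.add.
public class IncrementalWarmup {

    private final Set<String> known = new HashSet<>();

    // currentTerms would come from iterating the reader's term dictionary;
    // addToTree stands for BKTree.add(term). Returns how many terms were added.
    public int rewarm(Iterator<String> currentTerms, Consumer<String> addToTree) {
        int added = 0;
        while (currentTerms.hasNext()) {
            String t = currentTerms.next();
            if (known.add(t)) {        // true only for terms not yet in the tree
                addToTree.accept(t);
                added++;
            }
        }
        return added;
    }
}
```

A full rebuild pays the 10-15 second warm-up on every reopen; this delta approach pays it once and then only for genuinely new terms.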


Uwe Schindler
added a comment - 10/Feb/10 16:35
Hi Fuad,
OK, thanks for the explanation about the cache. But there should still be some eviction algorithm, or at least a WeakHashMap, so the JVM can clean up the cache for unused fields. My problem with IndexReaders missing from the cache logic was that if you reopen the index, the BKTree contains terms that are no longer available, and the FuzzyTermEnum enumerates terms that no longer exist. This is bad practice and should not be done in query rewrite. Enumerating terms with no relation to a real term dict is not really supported by MultiTermQuery, but it works for fuzzy because it does not use the low-level segment-based term enumeration and its linkage to TermDocs.
The new LUCENE-2258 issue needs no warmup, as it uses a different algorithm for the Levenshtein distance and also does not scan the whole term dict. By the way, in flex, RegEx queries and wildcard queries are also sped up. The problem with trunk not having the automaton package used for that is that the AutomatonTermsEnum needs to do lots of seeking in the TermsEnum, which is improved in flex, and we do not want to do the work twice.
Flex has a completely different API on the enum side, so TermEnumerations work differently, and so on.


Fuad Efendi
added a comment - 10/Feb/10 17:55 - edited
Hi Uwe,
I am trying to study LUCENE-2258 right now...
"the BKTree contains terms that are no longer available"
The BKTree contains objects, not terms; in my sample it contains Strings: new BKTree<String>(new Distance()). It is a structure for fast lookup of objects close to a given object within a set, using a predefined distance algorithm.
It won't hurt if a String remains in the BKTree structure after the corresponding Term has disappeared from the index; search results will be the same. Simply put, a search for <DisappearedTerm> OR <AnotherTerm> is the same as a search for <AnotherTerm>.
At the least, we can run a background thread that creates a new BKTree instance without hurting end users.
Yes, Term<->String conversion is another thing to do... I recreate fake terms in the TermEnum...
BKTree allows iterating over only about 5-10% of the whole structure to find the closest matches, but only if the distance threshold is small, e.g. 2. If it is 4, there is almost no improvement. And the classic Levenshtein distance is slow...
Edited: trying to study LUCENE-2089...


Fuad Efendi
added a comment - 17/May/11 20:43
I believe this issue should be closed due to the significant performance improvements related to LUCENE-2089 and LUCENE-2258.
I don't think there is any interest from the community in continuing with this naive approach (BK-Tree and "Strike a Match"), although some people found it useful. Of course, we might add a few more distance implementations as a separate improvement.
Please close it.
Thanks