Description

> Without having looked at the code for a long time, I think the problem is what
> Lucene scoring considers to be best. First the grams are searched, resulting in a number
> of hits. Then the edit distance is calculated on each hit. "Genetics" is apparently the
> third most similar hit according to Lucene, but the best according to Levenshtein.
>
> I.e. Lucene does not use edit distance as similarity. You need to get a bunch of best hits
> in order to find the one with the smallest edit distance.

Karl Wettin
added a comment - 03/Mar/07 10:55
It might be noteworthy that the spell checker gathers numSug * 10 hits from the a priori corpus. I suppose that number (10) was something the original author came up with when testing. In most cases it seems to be good enough. In my refactor I've introduced a method parameter for the factor. This is probably a better-looking solution than telling the user to increase numSug, since a small numSug saves a few clock ticks by not adding extra suggestions to the priority list.
The javadocs should probably state something like that instead.
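
To make the effect of the factor concrete, here is a minimal sketch of the two-stage idea described above: take the top numSug * factor hits in the order the gram search returned them, then re-rank that pool by Levenshtein distance. It does not use the Lucene API. The class RerankSketch, the method suggest, and the explicit factor parameter are hypothetical names for illustration only, and the list of hits stands in for whatever the gram query returns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not the actual SpellChecker code.
public class RerankSketch {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // rankedHits: candidates in the order the gram search scored them (assumed input).
    // Only the first numSug * factor hits are considered for re-ranking.
    static List<String> suggest(String word, List<String> rankedHits,
                                int numSug, int factor) {
        int limit = Math.min(rankedHits.size(), numSug * factor);
        List<String> pool = new ArrayList<>(rankedHits.subList(0, limit));
        // List.sort is stable, so ties keep their original search order.
        pool.sort(Comparator.comparingInt(h -> levenshtein(word, h)));
        return new ArrayList<>(pool.subList(0, Math.min(numSug, pool.size())));
    }
}
```

With the hits ordered as "generics", "kinetics", "genetics" and the misspelling "gentics", a call with numSug = 1 and factor = 1 only ever sees "generics", whereas factor = 3 lets "genetics" (edit distance 1) reach the pool and win. A larger factor costs more distance computations per lookup, which is the trade-off the parameter exposes.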