lucene-java-user mailing list archives

hello -
a fuzzy query related question:
have there been any implementations of "fuzzy" queries other than
edit-distance? and/or modifications of edit-distance that penalize
common alternate spellings less? - e.g. "couldn't" vs. "couldnt" -- here the
apostrophe would get a smaller penalty than a character mismatch.
i'm thinking specifically of the algorithms in the SecondString open
source package:
http://secondstring.sourceforge.net/
how difficult do you think it would be to wrap an alternate algorithm
that provides a:
float score(String1, String2)
function?
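
as a rough sketch of the kind of scorer meant here (plain java, no lucene or SecondString calls; the class name, the set of "cheap" characters and the 0.25 cost are all made-up assumptions, and normalizing by the longer string's length is just one possible choice):

```java
// Hypothetical sketch: edit distance where apostrophes/hyphens are cheap to
// insert or delete. Not taken from Lucene or SecondString.
class WeightedEditDistance {
    // Assumed cost table; tune or replace as needed.
    private static final String CHEAP_CHARS = "'-";
    private static final float CHEAP_COST = 0.25f;
    private static final float FULL_COST = 1.0f;

    // Cost of inserting or deleting a single character.
    private static float indelCost(char c) {
        return CHEAP_CHARS.indexOf(c) >= 0 ? CHEAP_COST : FULL_COST;
    }

    /** Weighted Levenshtein distance computed with two rolling rows. */
    static float distance(String a, String b) {
        float[] prev = new float[b.length() + 1];
        float[] curr = new float[b.length() + 1];
        // First row: build b from the empty string by insertions.
        for (int j = 1; j <= b.length(); j++) {
            prev[j] = prev[j - 1] + indelCost(b.charAt(j - 1));
        }
        for (int i = 1; i <= a.length(); i++) {
            char ca = a.charAt(i - 1);
            curr[0] = prev[0] + indelCost(ca);
            for (int j = 1; j <= b.length(); j++) {
                char cb = b.charAt(j - 1);
                // A mismatch always costs the full penalty in this sketch.
                float sub = prev[j - 1] + (ca == cb ? 0f : FULL_COST);
                float del = prev[j] + indelCost(ca);
                float ins = curr[j - 1] + indelCost(cb);
                curr[j] = Math.min(sub, Math.min(del, ins));
            }
            float[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1 means identical strings. */
    static float score(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0f;
        }
        return Math.max(0f, 1.0f - distance(a, b) / maxLen);
    }
}
```

with these assumed costs, score("couldn't", "couldnt") comes out higher than score("couldn't", "couldnot"), since dropping the apostrophe costs 0.25 while the o-for-apostrophe mismatch costs a full 1.0.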
---marc
mark harwood wrote:
>>One thing I was thinking of doing was checking the
>>character frequency
>
>An alternative idea is index-time fuzzification rather
>than query-time. This is documented in one of the case
>studies in LIA (Lucene in Action) - the principle is you don't
>index/search for whole words but use an NGram Analyzer
>to break them up at index time:
>
>Kylie becomes multiple words:
>[ k]
>[ ky]
>[ kyl]
>[ky]
>[kyl]
>[kyli]
>[yl]
>[yli]
>[ylie]
>[ kylie ]
>
>Obviously you use the same analyzer to process
>queries.
>Lucene will automatically look after relevancy of
>partial matches for you but your indexes are bigger
>and your queries will generate many more Boolean
>clauses.
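
to make the n-gram idea above concrete, here is a minimal sketch in plain java. it is not lucene's analyzer API and it does not try to reproduce the exact gram list quoted above (the leading-space grams that mark the start of a word are left out). the class name, the min/max gram lengths and the simple overlap ratio are assumptions for illustration only; lucene's real relevance scoring is more involved.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: splits a term into character n-grams, roughly what an
// n-gram analyzer would emit as separate index terms.
class NGramSketch {
    /** All substrings of term with length in [minLen, maxLen], lowercased. */
    static List<String> ngrams(String term, int minLen, int maxLen) {
        String t = term.toLowerCase();
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < t.length(); start++) {
            for (int len = minLen; len <= maxLen && start + len <= t.length(); len++) {
                grams.add(t.substring(start, start + len));
            }
        }
        return grams;
    }

    /** Fraction of the query's distinct grams that also occur in the indexed term. */
    static float overlap(String query, String indexed, int minLen, int maxLen) {
        Set<String> q = new HashSet<>(ngrams(query, minLen, maxLen));
        Set<String> d = new HashSet<>(ngrams(indexed, minLen, maxLen));
        if (q.isEmpty()) {
            return 0f;
        }
        int shared = 0;
        for (String g : q) {
            if (d.contains(g)) {
                shared++;
            }
        }
        return (float) shared / q.size();
    }
}
```

a misspelled query such as "kylee" still shares the grams "ky", "kyl" and "yl" with "kylie", so it gets a partial match, at the price of many more terms per word in the index and in each query.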