Class DirectSpellChecker

Candidates are presented directly from the term dictionary, based on
Levenshtein distance. This is an alternative to SpellChecker
if you are using an edit-distance-like metric such as Levenshtein
or JaroWinklerDistance.

A practical benefit of this spellchecker is that it requires no additional
data structures (neither in RAM nor on disk) to do its work.

Field Detail

INTERNAL_LEVENSHTEIN

Note: this is the fastest distance metric. Because Damerau-Levenshtein is already
used to draw candidates from the term dictionary, this metric simply reuses that
scoring.

Constructor Detail

DirectSpellChecker

public DirectSpellChecker()

Creates a DirectSpellChecker with default configuration values.

Method Detail

getMaxEdits

public int getMaxEdits()

Get the maximum Levenshtein edit distance within which candidate terms
are drawn.

setMaxEdits

public void setMaxEdits(int maxEdits)

Sets the maximum Levenshtein edit distance within which candidate terms
are drawn. This value can be 1 or 2. The default is 2.

Note: a large share of spelling errors have an edit distance of 1.
Setting this value to 1 can increase both performance and precision,
at the cost of recall.
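
The trade-off can be seen in a small standalone sketch. It does not use the spellchecker itself: the class MaxEditsSketch and its methods are hypothetical names, and the distance is a textbook edit distance in which an adjacent transposition counts as one edit, assumed here to approximate how candidates are drawn.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the spellchecker API.
final class MaxEditsSketch {

    /** Edit distance where swapping two adjacent characters counts as one edit. */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
                        d[i - 1][j - 1] + cost);                    // match or substitution
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);     // adjacent transposition
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    /** Keeps dictionary terms that differ from the query by 1..maxEdits edits. */
    static List<String> withinMaxEdits(String term, List<String> dictionary, int maxEdits) {
        List<String> out = new ArrayList<>();
        for (String candidate : dictionary) {
            int dist = distance(term, candidate);
            if (dist > 0 && dist <= maxEdits) {
                out.add(candidate);
            }
        }
        return out;
    }
}
```

For the query "lucen" against the terms lucene, lucent, license and lucerne, a limit of 1 keeps only lucene and lucent, while a limit of 2 also admits lucerne: more recall, but more candidates to score.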

getMinPrefix

public int getMinPrefix()

Get the minimal number of initial characters that must match exactly.

setMinPrefix

public void setMinPrefix(int minPrefix)

Sets the minimal number of initial characters (default: 1)
that must match exactly.

This can improve both performance and accuracy of results,
as misspellings rarely occur in the first character.
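
As a rough illustration of the prefix requirement, the following sketch checks whether a candidate shares the first minPrefix characters with the query term. MinPrefixSketch and sharesPrefix are hypothetical names, and treating a term shorter than the prefix as a non-match is an assumption of this sketch.

```java
// Hypothetical illustration only; not part of the spellchecker API.
final class MinPrefixSketch {

    /** True if the first minPrefix characters of both strings are identical. */
    static boolean sharesPrefix(String term, String candidate, int minPrefix) {
        if (term.length() < minPrefix || candidate.length() < minPrefix) {
            return false; // assumption: too-short strings cannot satisfy the prefix
        }
        return term.regionMatches(0, candidate, 0, minPrefix);
    }
}
```

With a prefix of 1, "lucen" can still be corrected to "lucene", but "kucene" cannot, because its first character is the misspelled one. A prefix of 0 would admit both, at the cost of scanning more of the term dictionary.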

getMaxInspections

public int getMaxInspections()

Get the maximum number of top-N inspections per suggestion.

setMaxInspections

public void setMaxInspections(int maxInspections)

Set the maximum number of top-N inspections (default: 5) per suggestion.

Increasing this number can improve the accuracy of results, at the cost
of performance.
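
One way to picture the inspection budget is the sketch below. It assumes the number of candidates examined is the number of requested suggestions multiplied by maxInspections, and that those candidates are then re-ranked by the configured metric. InspectionSketch, suggest and the scorer parameter are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical illustration only; not part of the spellchecker API.
final class InspectionSketch {

    /**
     * @param byEditDistance candidates already ordered best-first by edit distance
     * @param scorer         stand-in for a re-ranking metric; higher scores are better
     */
    static List<String> suggest(List<String> byEditDistance, int numSuggestions,
                                int maxInspections, ToDoubleFunction<String> scorer) {
        // Assumed budget: inspect at most numSuggestions * maxInspections candidates.
        int budget = Math.min(byEditDistance.size(), numSuggestions * maxInspections);
        List<String> inspected = new ArrayList<>(byEditDistance.subList(0, budget));
        // Re-rank the inspected candidates by the metric, best first (stable sort).
        inspected.sort(Comparator.comparingDouble(scorer).reversed());
        int keep = Math.min(numSuggestions, inspected.size());
        return new ArrayList<>(inspected.subList(0, keep));
    }
}
```

A candidate that the metric would rank first is only found if it falls inside the budget, which is why a larger value can improve results while costing more scoring work.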

getAccuracy

public float getAccuracy()

Get the minimal accuracy from the StringDistance for a match.

setAccuracy

public void setAccuracy(float accuracy)

Set the minimal accuracy required (default: 0.5f) from a StringDistance
for a suggestion match.
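
To make the accuracy cutoff concrete, the sketch below turns an edit count into a similarity score between 0 and 1 and compares it with the threshold. The normalization by the longer string's length is an assumption chosen for illustration; an actual StringDistance may normalize differently. AccuracySketch and its methods are hypothetical names.

```java
// Hypothetical illustration only; not part of the spellchecker API.
final class AccuracySketch {

    /** Similarity in [0, 1]: 1 means identical, lower means more edits per character. */
    static float similarity(int edits, int termLength, int candidateLength) {
        int longest = Math.max(termLength, candidateLength);
        if (longest == 0) {
            return 1f; // two empty strings are identical
        }
        return 1f - (float) edits / longest;
    }

    /** A candidate is kept only if its score reaches the required accuracy. */
    static boolean accept(float score, float accuracy) {
        return score >= accuracy;
    }
}
```

Under this normalization, one edit between a 5-character term and a 6-character candidate scores about 0.83 and passes the default of 0.5f, while two edits on a 3-character term score about 0.33 and are rejected.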

getThresholdFrequency

public float getThresholdFrequency()

Get the minimal threshold of documents a term must appear in for a match.

setThresholdFrequency

public void setThresholdFrequency(float thresholdFrequency)

Set the minimal threshold of documents a term must appear in for a match.

This can improve quality by suggesting only high-frequency terms. Note that
very high values might decrease performance slightly, because they force the
spellchecker to draw more candidates from the term dictionary, but a practical
value such as 1 can be very useful for improving quality.

This can be specified as a relative fraction of documents, such as 0.5f
(half of all documents), or as an absolute whole-number document frequency,
such as 4f. Absolute document frequencies may not be fractional.
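
The two ways of specifying the threshold can be sketched as follows, assuming that values below 1 are read as a fraction of the document count and values of 1 or more as an absolute count. FrequencyThresholdSketch and minDocFreq are hypothetical names, and the rounding shown is a simplification rather than the spellchecker's exact arithmetic.

```java
// Hypothetical illustration only; not part of the spellchecker API.
final class FrequencyThresholdSketch {

    /** Resolves a threshold to the minimum number of documents a term must appear in. */
    static int minDocFreq(float threshold, int docCount) {
        if (threshold < 0f) {
            throw new IllegalArgumentException("threshold must not be negative");
        }
        if (threshold >= 1f) {
            // Absolute form: must be a whole number such as 4f.
            if (threshold != (int) threshold) {
                throw new IllegalArgumentException(
                        "absolute document frequencies may not be fractional");
            }
            return (int) threshold;
        }
        // Relative form: a fraction of all documents, such as 0.5f.
        return (int) (threshold * docCount);
    }
}
```

In an index of 1000 documents, 4f resolves to 4 documents and 0.5f to 500, while 1.5f is rejected because it is neither a fraction below 1 nor a whole number.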

getMinQueryLength

public int getMinQueryLength()

Get the minimum length of a query term needed to return suggestions.

setMinQueryLength

public void setMinQueryLength(int minQueryLength)

Set the minimum length of a query term (default: 4) needed to return suggestions.

Very short query terms often yield only poor suggestions, whatever the
distance metric.

getMaxQueryFrequency

public float getMaxQueryFrequency()

Get the maximum threshold of documents a query term can appear in
for suggestions to be provided.

setMaxQueryFrequency

public void setMaxQueryFrequency(float maxQueryFrequency)

Set the maximum threshold (default: 0.01f) of documents a query term can
appear in for suggestions to be provided.

Very high-frequency terms are typically spelled correctly. Additionally,
this can increase performance as it will do no work for the common case
of correctly-spelled input terms.

This can be specified as a relative fraction of documents, such as 0.5f
(half of all documents), or as an absolute whole-number document frequency,
such as 4f. Absolute document frequencies may not be fractional.
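
Taken together with the minimum query length, this setting acts as a gate that decides whether any suggestion work is done at all. The sketch below is one possible reading of that gate: QueryGateSketch and shouldSuggest are hypothetical names, and the comparison against a fractional limit is simplified.

```java
// Hypothetical illustration only; not part of the spellchecker API.
final class QueryGateSketch {

    /** True if the query term is long enough and rare enough to merit suggestions. */
    static boolean shouldSuggest(String term, int termDocFreq, int docCount,
                                 int minQueryLength, float maxQueryFrequency) {
        // Length is counted in code points so supplementary characters count once.
        if (term.codePointCount(0, term.length()) < minQueryLength) {
            return false;
        }
        // Values of 1 or more are absolute counts; smaller values are fractions.
        float limit = maxQueryFrequency >= 1f
                ? maxQueryFrequency
                : maxQueryFrequency * docCount;
        return termDocFreq <= limit;
    }
}
```

With the defaults (length 4, frequency 0.01f) and 1000 documents, "the" is skipped for being too short, a term found in 900 documents is skipped as probably correct, and an unseen term such as "lucen" goes on to candidate generation.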

getLowerCaseTerms

public boolean getLowerCaseTerms()

Returns true if the spellchecker should lowercase terms.

setLowerCaseTerms

public void setLowerCaseTerms(boolean lowerCaseTerms)

True if the spellchecker should lowercase terms (default: true).

This is a convenience method. If your index field has more complicated
analysis (such as StandardTokenizer removing punctuation), it's probably
better to turn this off and instead run your query terms through your
Analyzer first.
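
The effect of the flag amounts to a single normalization step before the term is looked up, sketched below. LowerCaseSketch and prepare are hypothetical names, and the use of Locale.ROOT is an assumption made so the result does not depend on the default locale.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of the spellchecker API.
final class LowerCaseSketch {

    /** Lowercases the query term when the flag is set; otherwise returns it unchanged. */
    static String prepare(String queryTerm, boolean lowerCaseTerms) {
        return lowerCaseTerms ? queryTerm.toLowerCase(Locale.ROOT) : queryTerm;
    }
}
```

If the indexed terms were lowercased at index time, "Lucene" only lines up with the dictionary entry "lucene" after this step. When the field's analysis does more than lowercasing, this step alone will not reproduce the indexed form, which is the case the note above describes.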

getDistance

setDistance

Note: because this spellchecker draws its candidates from the term
dictionary using Damerau-Levenshtein, it works best with an edit-distance-like
string metric. If you use a metric other than the default, consider
raising the limit with setMaxInspections(int) so that more candidates
are drawn for your metric to rank.