Users can also specify multiple "spellcheck.dictionary" parameters. All specified dictionaries are consulted and the results are interleaved (this is handled by the new ConjunctionSolrSpellChecker). Collations are created with combinations from the different spellcheckers, with care taken that multiple overlapping corrections do not occur in the same collation.

A future enhancement (outside the scope of this issue) would be to extend ConjunctionSolrSpellChecker to allow arbitrary dictionary combinations, for instance if a user wanted to query two fields and have two separate dictionaries consulted for each field. With this patch, however, ConjunctionSolrSpellChecker is intended to be used to mix Word-Break suggestions in with Single-Word suggestions.
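
As a rough illustration of the multi-dictionary request described above, here is a minimal Python sketch that repeats the "spellcheck.dictionary" parameter once per dictionary. The dictionary names "default" and "wordbreak" and the helper name are assumptions made for the example; use whatever names your own spellcheck config defines.

```python
from urllib.parse import urlencode


def spellcheck_query(q, dictionaries):
    """Build a query string that names several spellcheck dictionaries."""
    params = [("q", q), ("spellcheck", "true")]
    # Repeat the same parameter name once per dictionary.
    params += [("spellcheck.dictionary", name) for name in dictionaries]
    return urlencode(params)


# "default" and "wordbreak" are hypothetical dictionary names.
query_string = spellcheck_query("spe llcheck", ["default", "wordbreak"])
print(query_string)
```

A list of tuples is used rather than a dict so that the same parameter name can appear more than once.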

Okke Klein
added a comment - 02/Jan/12 13:58 I'm having some trouble combining this patch with your other patch in https://issues.apache.org/jira/browse/SOLR-2585 . Could you make a patch with both features if possible?

Okke,
Thanks for your interest. For now you may need to evaluate the features separately. Possibly you could vote for your favorite one. Should either issue get committed, I will sync the other issue to the updated state of Trunk. Then we can have both at the same time. If there isn't any movement on these two for a long time, maybe I'd consider merging the patches, but that seems like an unnecessary step. It would be nice if one of the first 4.x releases included both of these features...

James Dyer
added a comment - 03/Jan/12 15:26

If I am not mistaken, the functionality from https://issues.apache.org/jira/browse/SOLR-2585 can also be achieved in DirectSolrSpellChecker with the thresholdTokenFrequency parameter. So I patched trunk with this patch and the corresponding Lucene patch and did some experimenting.

The misplaced whitespaces were fixed and proper suggestions were returned. However if both word parts resulted in suggestions, the collation made no sense.

Hypothetical example:
"spe llcheck" would give suggestions "spa" and "spellcheck" and collate this into "spa spellcheck"

In my use case I never got any results back when one of the parts had a typo. So "spe llchek" would not give any suggestions.

For my use case it would also be handy if "spell check" would result in the suggestion "spellcheck". Or is this already possible?

Okke Klein
added a comment - 07/Jan/12 16:09

Okke,
Thanks for looking at this patch. Here are a few comments:

if both word parts resulted in suggestions, the collation made no sense.

This is a problem with collations in general: By default, it simply mashes the top corrections together, often resulting in nonsense. The solution is to set "spellcheck.maxCollationTries" to a non-zero value. Doing so will cause the spellchecker to vet the collation possibilities against the index, resulting in collations that are guaranteed to generate hits.
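
To make the difference concrete, here is a small Python sketch of the idea, not the actual SpellCheckComponent code. The function name, the toy hit counts, and the `count_hits` callback are all assumptions for illustration: with zero tries the top candidate is returned unchecked, otherwise candidates are tested against the "index" until one produces hits.

```python
def pick_collation(candidates, count_hits, max_tries=0):
    """Return a collation, optionally vetting candidates against an index."""
    if not candidates:
        return None
    if max_tries == 0:
        # No vetting: take the top candidate as-is, even if it matches nothing.
        return candidates[0]
    for candidate in candidates[:max_tries]:
        if count_hits(candidate) > 0:
            return candidate
    return None


# Toy stand-in for the index: query string -> number of hits (made-up data).
toy_hits = {"spa spellcheck": 0, "spellcheck": 12}
candidates = ["spa spellcheck", "spellcheck"]

unvetted = pick_collation(candidates, lambda q: toy_hits.get(q, 0))
vetted = pick_collation(candidates, lambda q: toy_hits.get(q, 0), max_tries=5)
```

Here `unvetted` is the nonsense collation, while `vetted` is the first candidate that actually returns documents.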

"spe llcheck" would give suggestions "spa" and "spellcheck" and collate this into "spa spellcheck"

This is surprising to me and might indicate a bug. This patch is designed to carefully ensure that when building collations, the corrections do not overlap one another. For instance, if "q=spe llcheck" gives corrections of "spe>spa" and "spe llcheck>spellcheck", it should not collate these to "q=spa spellcheck" because "spe" overlaps with "spe llcheck". So if you can describe in detail what you're indexing and querying (maybe paste the resulting xml), it would help me figure out what's going on. Better yet, if you can write a failing unit test and post a patch...
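
The overlap rule can be sketched in Python as follows. This is an illustrative model only, assuming each correction carries character offsets into the original query; the `Correction` type and function names are hypothetical and are not taken from the patch.

```python
from itertools import combinations
from typing import NamedTuple


class Correction(NamedTuple):
    start: int        # offset of the first replaced character in the query
    end: int          # offset one past the last replaced character
    replacement: str


def overlaps(a, b):
    """True if the two corrections replace any of the same characters."""
    return a.start < b.end and b.start < a.end


def can_collate(corrections):
    """A set of corrections is usable only if no two of them overlap."""
    return not any(overlaps(a, b) for a, b in combinations(corrections, 2))


def apply_corrections(query, corrections):
    if not can_collate(corrections):
        raise ValueError("overlapping corrections cannot share a collation")
    # Apply right-to-left so earlier offsets stay valid.
    for c in sorted(corrections, key=lambda c: c.start, reverse=True):
        query = query[:c.start] + c.replacement + query[c.end:]
    return query


query = "spe llcheck"
spa = Correction(0, 3, "spa")              # "spe" -> "spa"
joined = Correction(0, 11, "spellcheck")   # "spe llcheck" -> "spellcheck"
```

With these values, `can_collate([spa, joined])` is false, so the two corrections would have to go into separate collations.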

I never got any results back when one of the parts had a typo. So "spe llchek" would not give any suggestions.

This patch does not have the ability to first correct a word fragment and then combine it with another fragment to make a corrected word. Possibly this would be a good next step after what we've got here already gets worked out.

it would also be handy if "spell check" would result in the suggestion "spellcheck". Or is this already possible?

This is the core of what this issue (really LUCENE-3523) is all about, provided that "spellcheck" is in the dictionary&index you're using.

James Dyer
added a comment - 09/Jan/12 15:58

This is a problem with collations in general: By default, it simply mashes the top corrections together, often resulting in nonsense. The solution is to set "spellcheck.maxCollationTries" to a non-zero value. Doing so will cause the spellchecker to vet the collation possibilities against the index, resulting in collations that are guaranteed to generate hits.

If wordbreak gives back a suggestion of a combined word, a suggestion with a word fragment with more hits is still ranked higher in the collation.

So "spa llcheck" is preferred over "spellcheck" if spa has more hits than spellcheck.

it would also be handy if "spell check" would result in the suggestion "spellcheck". Or is this already possible?

This is the core of what this issue (really LUCENE-3523) is all about, provided that "spellcheck" is in the dictionary&index you're using.

Never got this working, as no suggestions were given when both word fragments were spelled correctly and the combined word was in the index. (When I made a typo in the combined word, the word was returned as a suggestion.)

Okke Klein
added a comment - 09/Jan/12 20:27

So "spa llcheck" is preferred over "spellcheck" if spa has more hits than spellcheck.

I honestly didn't try this much with queries having all optional terms. I see what you mean, though: you might prefer that it just leave the misspelled word in there if it's an optional term anyhow. But wouldn't the query, in addition to giving spelling suggestions, also return some results because it would ignore the optional & misspelled query terms? If that's the case, your app can look at the results you got back, compare them to the collation options, and determine what to do from there.

no suggestions were given when both word fragments were spelled correctly

As discussed in SOLR-2585, you can't get suggestions for terms that are in the index, unless you specify "spellcheck.onlyMorePopular=true". Of course "onlyMorePopular" can have its own unintended consequences. Hopefully someday in the not too distant future we'll be in a state where we can have both this issue and SOLR-2585 working together.

James Dyer
added a comment - 09/Jan/12 20:54

I honestly didn't try this much with queries having all optional terms.

Setting mm to 100% gave me the result I expected.

I'm confused:

"This is the core of what this issue (really LUCENE-3523) is all about, provided that "spellcheck" is in the dictionary&index you're using".

and then

As discussed in SOLR-2585, you can't get suggestions for terms that are in the index, unless you specify "spellcheck.onlyMorePopular=true". Of course "onlyMorePopular" can have its own unintended consequences. Hopefully someday in the not too distant future we'll be in a state where we can have both this issue and SOLR-2585 working together.

So should it be possible to get the suggestion "spellcheck" from "spell check", or not?
Note: I do get suggestions for terms that are in the index.

Okke Klein
added a comment - 09/Jan/12 23:27

So should it be possible to get the suggestion "spellcheck" from "spell check", or not? Note: I do get suggestions for terms that are in the index.

When combining words, it requires that at least one of the original terms not be in the index.

So to use your example, WordBreakSpellChecker will combine "spell check" to "spellcheck" provided that:
1. "spellcheck" is in the index.
2. either:
"spell" is NOT in the index,
OR
"check" is NOT in the index,
OR
both "spell" and "check" are NOT in the index.

But if both "spell" and "check" are in the index, then you won't get "spellcheck" as a suggestion. You can override this behavior if:
1. You specify "onlyMorePopular". This works if "spellcheck" has a document frequency that is greater than or equal to the higher of the document frequencies of "spell" and "check".
2. You apply SOLR-2585 (theoretically...not possible yet) and set "spellcheck.alternativeTermCount" greater than zero. This would tell it to generate alternative term suggestions for indexed terms.

If this is not consistent with what you're experiencing then there is a possible bug in the WordBreakSpellChecker. In that case, please provide as many details as possible (or write a failing unit test) and I can look into it further.
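
The combining rule above can be restated as a short Python sketch. It only models the conditions as described in this comment, with a plain dict standing in for the index's document frequencies; the function name and the sample frequencies are made up, and the SOLR-2585 "alternativeTermCount" override is deliberately left out.

```python
def should_suggest_combined(doc_freq, left, right, only_more_popular=False):
    """Decide whether left+right should be offered as a combined suggestion."""
    combined_freq = doc_freq.get(left + right, 0)
    if combined_freq == 0:
        return False  # condition 1: the combined word must be in the index
    left_freq = doc_freq.get(left, 0)
    right_freq = doc_freq.get(right, 0)
    if left_freq == 0 or right_freq == 0:
        return True   # condition 2: at least one original term is missing
    # Both original terms are indexed: only the onlyMorePopular override helps.
    if only_more_popular:
        return combined_freq >= max(left_freq, right_freq)
    return False


# Hypothetical document frequencies.
both_indexed = {"spellcheck": 5, "spell": 3, "check": 4}
one_missing = {"spellcheck": 5, "check": 4}

default_result = should_suggest_combined(both_indexed, "spell", "check")
popular_result = should_suggest_combined(both_indexed, "spell", "check", only_more_popular=True)
missing_result = should_suggest_combined(one_missing, "spell", "check")
```

With these made-up frequencies, only the second and third calls produce the "spellcheck" suggestion.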

James Dyer
added a comment - 10/Jan/12 15:47

Thanks for the explanation. I experimented with onlyMorePopular and it worked a few times. Unfortunately it also showed unwanted behavior as expected. So https://issues.apache.org/jira/browse/SOLR-2585 would be a next step to see if it provides the behavior I'm looking for.

For the English language this feature might not be very important, but for languages like Dutch and German that have a lot of compounded words, a spellchecker that also combines word parts even if one of them has a typo (like Google does) would be extremely useful.

Unfortunately I'm not a programmer, but I'll gladly test anything you throw at me

Okke Klein
added a comment - 10/Jan/12 18:42

Here is a new patch that can better handle collations involving mixed required/prohibited/optional terms and also boolean operators (AND/OR/NOT).

When combining words, we do not want to combine an optional term with a prohibited one, etc. We also do not want to combine words that belong to different boolean clauses or those that were "NOT"ed to one another.

Likewise, when splitting a term into multiples, we want to ensure all the resulting terms are required if the original one was required, etc. Also, if the query contains boolean operators (AND/OR/NOT), this version ANDs the split terms together.

In the case of Boolean operators, SpellingQueryConverter can only make a guess as to the best action. It doesn't know the actual query parser used, the default "q.op" or "mm" setting, etc. All this does is make a reasonable guess as to the best way to re-write the query if corrections involved combining and/or splitting words.
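
A loose Python sketch of these rules follows. It is not the SpellingQueryConverter implementation: the `Term` type, the clause id, and the exact output strings (including the parenthesized AND group) are assumptions chosen to show the idea, and "NOT"ed terms are not modeled separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    text: str
    flag: str = ""    # "+" required, "-" prohibited, "" optional
    clause: int = 0   # hypothetical id of the boolean clause the term sits in


def can_combine(a, b):
    """Only combine terms with the same flag that share a boolean clause."""
    return a.flag == b.flag and a.clause == b.clause


def split_term(term, parts, query_has_boolean_ops):
    """Rewrite one term as several, carrying over its required/prohibited flag."""
    if query_has_boolean_ops:
        # Assumed output format: AND the pieces together as one group.
        return term.flag + "(" + " AND ".join(parts) + ")"
    return " ".join(term.flag + part for part in parts)


required_split = split_term(Term("spellcheck", "+"), ["spell", "check"], False)
boolean_split = split_term(Term("spellcheck"), ["spell", "check"], True)
```

Here `required_split` keeps both pieces required, and `boolean_split` groups the pieces with AND. The tests named below show the real behavior.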

See WordBreakSpellCheckerTest#testCollate and SpellingQueryConverterTest#testRequiredOrProhibitedFlags for examples of how this works.

Unless there are other issues, I plan to commit this in a few days.

James Dyer
added a comment - 01/Jun/12 21:55

James Dyer
added a comment - 04/Jun/12 19:07 Committed...Trunk r1346058, branch_4x r1346069
This commit includes updates to the Solr Example spellcheck config to use some of the newer SpellCheckComponent features, including this one.

This was my mistake. I ran tests, then changed the Solr Example config, forgetting that some tests depend on the Example config. I committed a quick test fix that hopefully will stop the failures for now. But one of the failures might be an actual problem. I am looking into it now.

James Dyer
added a comment - 04/Jun/12 20:25

Re-opening to figure out whether the "testSpellCheckResponse" failure with WordBreakSolrSpellChecker added in is a valid failure. My original fix for this caused DistributedSpellCheckComponentTest to fail, so I'll need to investigate more thoroughly tomorrow. For now the offending tests are disabled. (Sorry for the stormy weather on Jenkins!)

James Dyer
added a comment - 05/Jun/12 02:11

Here is a patch that re-activates the previously-failing tests and fixes all the problems. All tests pass and I checked the solr example also. Here's a summary of the problems:

TestSpellCheckResponse had a test bug in that data wasn't being cleaned from the index between tests. The bug did not manifest until I made the solr example changes.

Some asserts in TestSpellCheckResponse needed modifying to conform to changes in the solr example (test relies on example config).

ConjunctionSolrSpellChecker was not preserving the original token doc freq's from the child spellcheckers. Bug wasn't being properly tested for before, but showed up once TestSpellCheckResponse was fixed.

WordBreakSolrSpellChecker was not generating original token doc freq's. Bug wasn't being properly tested for before, but showed up once TestSpellCheckResponse was fixed.

I will commit shortly.

James Dyer
added a comment - 05/Jun/12 18:41