[suggest] Query Interface for suggest API

Details

Description

This patch factors out common indexing/search API used by the recently introduced NRTSuggester.
The motivation is to provide a query interface for FST-based fields (SuggestField and ContextSuggestField)
to enable suggestion scoring and more powerful automaton queries.

Previously, only prefix ‘queries’ with index-time weights were supported but we can also support:

Suggest API

CompletionQuery

CompletionQuery is used to query SuggestField and ContextSuggestField. A CompletionQuery produces a CompletionWeight,
which allows CompletionQuery implementations to pass in an automaton that will be intersected with a FST and allows boosting and
meta data extraction from the intersected partial paths. A CompletionWeight produces a CompletionScorer. A CompletionScorer
executes a Top N search against the FST with the provided automaton, scoring and filtering all matched paths.

PrefixCompletionQuery

Return documents with values that match the prefix of an analyzed term text
Documents are sorted according to their suggest field weight.

PrefixCompletionQuery(Analyzer analyzer, Term term)

RegexCompletionQuery

Return documents with values that match the prefix of a regular expression
Documents are sorted according to their suggest field weight.

RegexCompletionQuery(Term term)

FuzzyCompletionQuery

Return documents with values that has prefixes within a specified edit distance of an analyzed term text.
Documents are ‘boosted’ by the number of matching prefix letters of the suggestion with respect to the original term text.

ContextQuery

NOTE:ContextQuery should be used with ContextSuggestField to query suggestions boosted and/or filtered by contexts.
Running ContextQuery against a SuggestField will error out.

Scoring

suggestion_weight * context_boost
where suggestion_weight and context_boost are all integers

When used with FuzzyCompletionQuery,suggestion_weight * (context_boost + fuzzy_boost)

Context Suggest Field

To use ContextQuery, use ContextSuggestField instead of SuggestField. Any CompletionQuery can be used with ContextSuggestField, the default behaviour is to return suggestions from all contexts. Context for every completion hit
can be accessed through SuggestScoreDoc#context.

Thanks Areek Zillur, this is a big patch! I'm trying to wrap my brain around it... it will
take some iterations

The filled-out javadocs are nice.

It seems like the overall idea is to make a generic index-time and
search-time API that other suggesters could use, but for now it's just
NRTSuggester using it? Do we expect the non-document based suggesters
to also eventually be able to use this API?

One thing we are struggling with on AnalyzingInfixSuggester is how to
allow arbitrary context filters (LUCENE-6464) ... I wonder if this API
would make that easier.

This patch also adds new capabilities to NRTSuggester, like fuzzy and
regexp suggestions? What other new functions are exposed? What
use-cases do you see for RegexCompletionQuery?

Should FSTPath.toString also include the context?

The search side largely mirrors/subclasses Lucene's normal search
classes (CompletionQuery, CompletionWeight, SuggestIndexSearcher,
etc.) but there are some differences, e.g. we pass an Analyzer to the
completion queries (so they can build the automaton).

If you try to use ContextQuery against a field that you had not
indexed contexts with (using ContextSuggestField) do you see any
error? Maybe this is too hard.

Are you allowed to mix ContextSuggestField and SuggestField even for
the same field name, within one suggester?

Michael McCandless
added a comment - 08/May/15 21:18 Thanks Areek Zillur , this is a big patch! I'm trying to wrap my brain around it... it will
take some iterations
The filled-out javadocs are nice.
It seems like the overall idea is to make a generic index-time and
search-time API that other suggesters could use, but for now it's just
NRTSuggester using it? Do we expect the non-document based suggesters
to also eventually be able to use this API?
One thing we are struggling with on AnalyzingInfixSuggester is how to
allow arbitrary context filters ( LUCENE-6464 ) ... I wonder if this API
would make that easier.
This patch also adds new capabilities to NRTSuggester, like fuzzy and
regexp suggestions? What other new functions are exposed? What
use-cases do you see for RegexCompletionQuery?
Should FSTPath.toString also include the context?
The search side largely mirrors/subclasses Lucene's normal search
classes (CompletionQuery, CompletionWeight, SuggestIndexSearcher,
etc.) but there are some differences, e.g. we pass an Analyzer to the
completion queries (so they can build the automaton).
If you try to use ContextQuery against a field that you had not
indexed contexts with (using ContextSuggestField) do you see any
error? Maybe this is too hard.
Are you allowed to mix ContextSuggestField and SuggestField even for
the same field name, within one suggester?

It seems like the overall idea is to make a generic index-time and
search-time API that other suggesters could use, but for now it's just
NRTSuggester using it? Do we expect the non-document based suggesters
to also eventually be able to use this API?

At the moment, only NRTSuggester is using it. We should use this API for other suggesters,
but maybe in a separate issue . For LUCENE-6464, the existing ContextQuery needs
to add support for BooleanClause.Occur.

This patch also adds new capabilities to NRTSuggester, like fuzzy and
regexp suggestions? What other new functions are exposed? What
use-cases do you see for RegexCompletionQuery?

In terms of new functionality, fuzzy, regex and context queries are added. One thing to
note, for fuzzy and context queries, the suggestion scores are influenced by their common
prefix length (w.r.t. query term) and query-time context boosts respectively,
along with index-time suggestion weights. IMO, the way a suggestion is scored currently
needs some thought. As we only allow integer index-time weights, one possibility would
be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions?

IMO RegexCompletionQuery can be used to query multiple prefixes at one go. This can allow
for simpler query analyzer, but still give the power to query for synonyms, domain-specific typos etc.
In the future, we can also add boosting (like in ContextQuery) where query-time boosts can be
specified for some matched prefixes of the regex pattern.

Should FSTPath.toString also include the context?

It should, will change.

but there are some differences, e.g. we pass an Analyzer to the
completion queries (so they can build the automaton)

open to suggestions on improving this

If you try to use ContextQuery against a field that you had not
indexed contexts with (using ContextSuggestField) do you see any
error? Maybe this is too hard.

There should not be any error. A ContextQuery will never be run on a SuggestField,
CompletionQuery rewrites appropriately given the type of the field (context-enabled or not).
This also makes non-context queries work as expected when run against ContextSuggestField
(as in the query is wrapped as a ContextQuery with no context filtering/boosting).

If a ContextSuggestField is indexed with no context, then a null context is extracted at query
time for the entry. Fields with no context will only be returned, if a wildcard context '*' is
specified (default behaviour of ContextQuery).

Are you allowed to mix ContextSuggestField and SuggestField even for
the same field name, within one suggester?

No you are not. If mixed, CompletionFieldsConsumer will throw IllegalArgumentException
upon indexing.

Areek Zillur
added a comment - 09/May/15 21:40 - edited Thanks Michael McCandless for taking a look!
It seems like the overall idea is to make a generic index-time and
search-time API that other suggesters could use, but for now it's just
NRTSuggester using it? Do we expect the non-document based suggesters
to also eventually be able to use this API?
At the moment, only NRTSuggester is using it. We should use this API for other suggesters,
but maybe in a separate issue . For LUCENE-6464 , the existing ContextQuery needs
to add support for BooleanClause.Occur .
This patch also adds new capabilities to NRTSuggester, like fuzzy and
regexp suggestions? What other new functions are exposed? What
use-cases do you see for RegexCompletionQuery?
In terms of new functionality, fuzzy, regex and context queries are added. One thing to
note, for fuzzy and context queries, the suggestion scores are influenced by their common
prefix length (w.r.t. query term) and query-time context boosts respectively,
along with index-time suggestion weights. IMO, the way a suggestion is scored currently
needs some thought. As we only allow integer index-time weights, one possibility would
be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions?
IMO RegexCompletionQuery can be used to query multiple prefixes at one go. This can allow
for simpler query analyzer, but still give the power to query for synonyms, domain-specific typos etc.
In the future, we can also add boosting (like in ContextQuery) where query-time boosts can be
specified for some matched prefixes of the regex pattern.
Should FSTPath.toString also include the context?
It should, will change.
but there are some differences, e.g. we pass an Analyzer to the
completion queries (so they can build the automaton)
open to suggestions on improving this
If you try to use ContextQuery against a field that you had not
indexed contexts with (using ContextSuggestField) do you see any
error? Maybe this is too hard.
There should not be any error. A ContextQuery will never be run on a SuggestField,
CompletionQuery rewrites appropriately given the type of the field (context-enabled or not).
This also makes non-context queries work as expected when run against ContextSuggestField
(as in the query is wrapped as a ContextQuery with no context filtering/boosting).
If a ContextSuggestField is indexed with no context, then a null context is extracted at query
time for the entry. Fields with no context will only be returned, if a wildcard context '*' is
specified (default behaviour of ContextQuery).
Are you allowed to mix ContextSuggestField and SuggestField even for
the same field name, within one suggester?
No you are not. If mixed, CompletionFieldsConsumer will throw IllegalArgumentException
upon indexing.

We should use this API for other suggesters, but maybe in a separate issue

Yes, definitely separate!

In terms of new functionality, fuzzy, regex and context queries are added.

OK, cool, these are nice additions.

query-time context boosts

Cool: so you boost some contexts more than others, using ContextQuery.

As we only allow integer index-time weights

I thought we accept long (not int) as index-time weight? (But, I
think that really is overkill... maybe they should just floats, like
per-field boosting at index time). But we can worry about this
later...

one possibility would be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions

Sorry I don't understand the idea here?

If you try to use ContextQuery against a field that you had not indexed contexts with (using ContextSuggestField) do you see any error? Maybe this is too hard.

There should not be any error. A ContextQuery will never be run on a SuggestField

It seems like we could detect this mis-use, since CompletionTerms seems to know whether the field was indexed with contexts or not? I.e, if I accidentally try to run a ContextQuery against a field indexed with only SuggestField, it seems like I should get an exception saying I screwed up ... (similar to trying to run a PhraseQuery on a field that did not index positions)? Maybe add a simple test case?

A ContextQuery will never be run on a SuggestField,
CompletionQuery rewrites appropriately given the type of the field (context-enabled or not).

OK maybe at that rewrite is the time to throw the exc?

This also makes non-context queries work as expected when run against ContextSuggestField
(as in the query is wrapped as a ContextQuery with no context filtering/boosting).

OK, for that direction it makes sense allow ... good.

Are you allowed to mix ContextSuggestField and SuggestField even for the same field name, within one suggester?

No you are not. If mixed, CompletionFieldsConsumer will throw IllegalArgumentException upon indexing.

OK, excellent.

Can we rename TopSuggestDocsCollector.num() to maybe .getCountToCollect or something a bit more verbose?

Michael McCandless
added a comment - 14/May/15 09:36
We should use this API for other suggesters, but maybe in a separate issue
Yes, definitely separate!
In terms of new functionality, fuzzy, regex and context queries are added.
OK, cool, these are nice additions.
query-time context boosts
Cool: so you boost some contexts more than others, using ContextQuery.
As we only allow integer index-time weights
I thought we accept long (not int) as index-time weight? (But, I
think that really is overkill... maybe they should just floats, like
per-field boosting at index time). But we can worry about this
later...
one possibility would be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions
Sorry I don't understand the idea here?
If you try to use ContextQuery against a field that you had not indexed contexts with (using ContextSuggestField) do you see any error? Maybe this is too hard.
There should not be any error. A ContextQuery will never be run on a SuggestField
It seems like we could detect this mis-use, since CompletionTerms seems to know whether the field was indexed with contexts or not? I.e, if I accidentally try to run a ContextQuery against a field indexed with only SuggestField, it seems like I should get an exception saying I screwed up ... (similar to trying to run a PhraseQuery on a field that did not index positions)? Maybe add a simple test case?
A ContextQuery will never be run on a SuggestField,
CompletionQuery rewrites appropriately given the type of the field (context-enabled or not).
OK maybe at that rewrite is the time to throw the exc?
This also makes non-context queries work as expected when run against ContextSuggestField
(as in the query is wrapped as a ContextQuery with no context filtering/boosting).
OK, for that direction it makes sense allow ... good.
Are you allowed to mix ContextSuggestField and SuggestField even for the same field name, within one suggester?
No you are not. If mixed, CompletionFieldsConsumer will throw IllegalArgumentException upon indexing.
OK, excellent.
Can we rename TopSuggestDocsCollector.num() to maybe .getCountToCollect or something a bit more verbose?
Net/net this is a nice change, thanks Areek Zillur !

I thought we accept long (not int) as index-time weight? (But, I
think that really is overkill... maybe they should just floats, like
per-field boosting at index time)

IMO, a suggestion weight is just an index-time boost for the
associated entry.

one possibility would be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions

Sorry I don't understand the idea here?

After a query automaton has been intersected with the FST in NRTSuggester,
boosts and/or context is computed/extracted from each of the partial matched paths
by the CompletionWeight before performing a TopN search.
For example, FuzzyCompletionWeight would count the number of prefix characters
a matched input has w.r.t. the analyzed query prefix and set the boost for it to the
number of common prefix length.
Calculating a suggestion score of weight + (maxWeight * boost) makes sure that
entries with a higher boost (longer common prefix w.r.t. query prefix) will always be
scored higher regardless of the index-time weight of suggestion entries. The
segment-level maxWeight is stored in CompletionPostingsFormat (CompletionIndex),
and the maxWeight is computed across all segments at query-time.
Since, the maximum weight for any suggestion entry will be <= Integer.MAX_VALUE, we
can just replace the maxWeight for a suggestField with Integer.MAX_VALUE? One problem
might be the loss of precision when converting the long score to a float?

It seems like we could detect this mis-use, since CompletionTerms seems to know whether the field was indexed with contexts or not? I.e, if I accidentally try to run a ContextQuery against a field indexed with only SuggestField, it seems like I should get an exception saying I screwed up ... (similar to trying to run a PhraseQuery on a field that did not index positions)? Maybe add a simple test case?

I updated the patch to error out when using a ContextQuery against a SuggestField at rewrite with test.

Can we rename TopSuggestDocsCollector.num() to maybe .getCountToCollect or something a bit more verbose?

Areek Zillur
added a comment - 14/May/15 17:13 Thanks Michael McCandless for the feedback
I thought we accept long (not int) as index-time weight? (But, I
think that really is overkill... maybe they should just floats, like
per-field boosting at index time)
IMO, a suggestion weight is just an index-time boost for the
associated entry.
one possibility would be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions
Sorry I don't understand the idea here?
After a query automaton has been intersected with the FST in NRTSuggester ,
boosts and/or context is computed/extracted from each of the partial matched paths
by the CompletionWeight before performing a TopN search.
For example, FuzzyCompletionWeight would count the number of prefix characters
a matched input has w.r.t. the analyzed query prefix and set the boost for it to the
number of common prefix length.
Calculating a suggestion score of weight + (maxWeight * boost) makes sure that
entries with a higher boost (longer common prefix w.r.t. query prefix) will always be
scored higher regardless of the index-time weight of suggestion entries. The
segment-level maxWeight is stored in CompletionPostingsFormat (CompletionIndex),
and the maxWeight is computed across all segments at query-time.
Since, the maximum weight for any suggestion entry will be <= Integer.MAX_VALUE, we
can just replace the maxWeight for a suggestField with Integer.MAX_VALUE? One problem
might be the loss of precision when converting the long score to a float?
It seems like we could detect this mis-use, since CompletionTerms seems to know whether the field was indexed with contexts or not? I.e, if I accidentally try to run a ContextQuery against a field indexed with only SuggestField, it seems like I should get an exception saying I screwed up ... (similar to trying to run a PhraseQuery on a field that did not index positions)? Maybe add a simple test case?
I updated the patch to error out when using a ContextQuery against a SuggestField at rewrite with test.
Can we rename TopSuggestDocsCollector.num() to maybe .getCountToCollect or something a bit more verbose?
Changed TopSuggestDocsCollector.num() to getCountToCollect()

IMO, a suggestion weight is just an index-time boost for the associated entry.

+1

Shouldn't we switch to float (later, separate issue!)?

Calculating a suggestion score of weight + (maxWeight * boost) makes sure that
entries with a higher boost (longer common prefix w.r.t. query prefix) will always be
scored higher regardless of the index-time weight of suggestion entries. The
segment-level maxWeight is stored in CompletionPostingsFormat (CompletionIndex),
and the maxWeight is computed across all segments at query-time.
Since, the maximum weight for any suggestion entry will be <= Integer.MAX_VALUE, we
can just replace the maxWeight for a suggestField with Integer.MAX_VALUE? One problem
might be the loss of precision when converting the long score to a float?

OK I think I understand! It's a scoring model that guarantees that
the search-time score ranking comes first and index time ranking is
used only for tie-breaking the search-time score.

But, I'm not sure that's a good goal? E.g I may not want a very low
index-time boosted suggestion that has more shared prefix to score
higher than a highly index-time boosted suggestion that matched a bit
smaller prefix?

E.g. when I type "pyh" into google, the top suggestion is "python"
(prefix=2) and after that is "pyhs2" (prefix=3).

Anyway, I think scoring will forever be a challenging topic and we
just have to keep it pluggable (CompletionScorer.score).

Now any CompletionQuery can plug in one of the score modes (or custom implementation)
through CompletionQuery.setScoreMode.
Default score mode for existing queries:

PrefixCompletionQuery & RegexCompletionQuery - IGNORE_BOOST

FuzzyCompletionQuery - BOOST_FIRST

ContextQuery - BOOST

ScoreMode is also used in the comparator for the TopNSearcher queue.

It would be good to know your thoughts on how the TopNSearcher queue size is calculated here.
Now that every intersected prefix path might have a different boost and this boost influences the top N,
the topN for the queue has been changed to n * num_prefix_paths from just n to allow for n path
expansions per intersected prefix path. In reality, search execution will early terminate as soon as the
original topN results have been collected.

Areek Zillur
added a comment - 26/May/15 02:38 - edited Thanks Michael McCandless for the feedback!
Anyway, I think scoring will forever be a challenging topic and we
just have to keep it pluggable (CompletionScorer.score).
I have made the CompletionScorer.score pluggable by extracting a ScoreMode interface, which
computes a score for a suggestion using its weight and boost (along with min/max weight stats).
public interface ScoreMode {
float score( float weight, float boost, float minWeight, float maxWeight);
}
Currently there are three implementations:
BOOST (weight * boost),
IGNORE_BOOST (weight)
BOOST_FIRST (weight + (maxWeight * boost))
Now any CompletionQuery can plug in one of the score modes (or custom implementation)
through CompletionQuery.setScoreMode .
Default score mode for existing queries:
PrefixCompletionQuery & RegexCompletionQuery - IGNORE_BOOST
FuzzyCompletionQuery - BOOST_FIRST
ContextQuery - BOOST
ScoreMode is also used in the comparator for the TopNSearcher queue.
It would be good to know your thoughts on how the TopNSearcher queue size is calculated here.
Now that every intersected prefix path might have a different boost and this boost influences the top N,
the topN for the queue has been changed to n * num_prefix_paths from just n to allow for n path
expansions per intersected prefix path. In reality, search execution will early terminate as soon as the
original topN results have been collected.
Updated Patch:
Added ScoreMode for pluggable scoring
Increase test coverage
Added dedicated tests for PrefixCompletionQuery, RegexCompletionQuery,
FuzzyCompletionQuery and ContextQuery
Fixed javadocs, making ant precommit happy

Michael McCandless
added a comment - 28/May/15 08:40 Thanks Areek Zillur , I committed the last patch!
I ran PMD linter and it spotted a few issues:
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionWeight.java:52: Avoid unused private fields such as 'maxWeight'.
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionWeight.java:53: Avoid unused private fields such as 'minWeight'.
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java:97: Avoid unused private fields such as 'endByte'.
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java:173: Avoid unused local variables such as 'search'.
Can you look into them?
Also, the API to add ContextSuggestField is somewhat annnoying with
Java 7, e.g.:
document.add(new ContextSuggestField("context_suggest_field", Collections.<CharSequence>singletonList("type1"), "suggestion1", 4));
Maybe we can improve this to take a varargs?