Details

Description

It seems like PrefixQuery should "simply" be an AutomatonQuery rather than specializing its own TermsEnum ... with maybe some performance improvements to ByteRunAutomaton.run to short-circuit once it's in a "sink state", AutomatonTermsEnum could be just as fast as PrefixTermsEnum.

Activity

Patch, cutting over PrefixQuery to AutomatonQuery and removing
PrefixTermsEnum.

I explored the optimization of having Byte/CharRunAutomaton.run
optimize (short-circuit) when you're in a sink state but it became
quite difficult/invasive fixing all callers of .step to handle this.
With LUCENE-5879 we also need to know the sink state under-the-hood,
but that's separate from fixing .run to make use of it.

So I backed out that opto and tried just doing the PrefixQuery cutover
without optimizing for sink states. I'm running a perf test w/
luceneutil and it looks like the impact is trivial (well within
noise). Net/net I think it's fine to "just cutover" without the
invasive opto?

I also changed PrefixQuery's semantics to apply to full binary space
terms, not just UTF-8 space. While this is technically a change in
behavior, it won't impact users who index only unicode terms. It's
also necessary for LUCENE-5879, because if prefixing is done only in
unicode space (like today), then the resulting binary space automaton
will not have a sink state and auto-prefix can't apply.

If this part is somehow controversial I can revert and try to do it
only with LUCENE-5879 instead... if it's OK, I'll add some tests
showing that PrefixQuery on binary terms works.

Michael McCandless
added a comment - 23/Mar/15 08:02 Patch, cutting over PrefixQuery to AutomatonQuery and removing
PrefixTermsEnum.
I explored the optimization of having Byte/CharRunAutomaton.run
optimize (short-circuit) when you're in a sink state but it became
quite difficult/invasive fixing all callers of .step to handle this.
With LUCENE-5879 we also need to know the sink state under-the-hood,
but that's separate from fixing .run to make use of it.
So I backed out that opto and tried just doing the PrefixQuery cutover
without optimizing for sink states. I'm running a perf test w/
luceneutil and it looks like the impact is trivial (well within
noise). Net/net I think it's fine to "just cutover" without the
invasive opto?
I also changed PrefixQuery's semantics to apply to full binary space
terms, not just UTF-8 space. While this is technically a change in
behavior, it won't impact users who index only unicode terms. It's
also necessary for LUCENE-5879 , because if prefixing is done only in
unicode space (like today), then the resulting binary space automaton
will not have a sink state and auto-prefix can't apply.
If this part is somehow controversial I can revert and try to do it
only with LUCENE-5879 instead... if it's OK, I'll add some tests
showing that PrefixQuery on binary terms works.

Michael McCandless
added a comment - 24/Mar/15 00:25 New patch, adding tests for binary terms to PrefixQuery.
I also added a new Operations.getSingleton, to work for both unicode
and binary automata, and fixed CompiledAutomaton to use that when it's
trying to "simplify" the incoming automaton.