[jira] Created: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Paul Cowan (JIRA <jira <at> apache.org>
2008-09-01 02:27:44 GMT

Proposal: introduce more sensible sorting when a doc has multiple values for a term
-----------------------------------------------------------------------------------
Key: LUCENE-1372
URL: https://issues.apache.org/jira/browse/LUCENE-1372
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.3.2
Reporter: Paul Cowan
Priority: Minor
At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field for which
multiple values exist for one document. For example, imagine a field "fruit" which is added to a document
multiple times, with the values as follows:
doc 1: {"apple"}
doc 2: {"banana"}
doc 3: {"apple", "banana"}
doc 4: {"apple", "zebra"}
If one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and
similarly for the other methods in the various FieldCacheImpl caches) does the following:
    while (termDocs.next()) {
        retArray[termDocs.doc()] = t;
    }
which means that we look over the terms in their natural order and, on each one, overwrite retArray[doc]
with the value for each document with that term. Effectively, this overwriting means that a string sort in
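
The overwriting described above can be reproduced without Lucene. The following is a minimal sketch, not the FieldCacheImpl code itself: a sorted TreeMap stands in for the TermEnum/TermDocs iteration, and the names LastTermWinsDemo, fruitPostings and fillLastWins are hypothetical, introduced only for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Simulation only: a TreeMap stands in for Lucene's TermEnum/TermDocs iteration.
class LastTermWinsDemo {

    // Hypothetical helper: term -> ids of the documents containing it, terms in sorted order.
    static TreeMap<String, int[]> fruitPostings() {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {1, 3, 4});
        postings.put("banana", new int[] {2, 3});
        postings.put("zebra", new int[] {4});
        return postings;
    }

    // Mirrors the quoted loop: every posting overwrites the slot for its document,
    // so the term that sorts last is the one left in the array.
    static String[] fillLastWins(TreeMap<String, int[]> postings, int maxDoc) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                retArray[doc] = e.getKey();
            }
        }
        return retArray;
    }

    public static void main(String[] args) {
        // Slot 0 is unused because the example documents are numbered from 1.
        System.out.println(Arrays.toString(fillLastWins(fruitPostings(), 5)));
        // prints [null, apple, banana, banana, zebra]
    }
}
```

In this simulation doc 3 ends up keyed on "banana" and doc 4 on "zebra", even though both also contain "apple".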

[jira] Updated: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Paul Cowan (JIRA <jira <at> apache.org>
2008-09-01 02:31:44 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Cowan updated LUCENE-1372:
-------------------------------
Attachment: lucene-multisort.patch
Patch which deals with this in the case of Strings, with a test case. This is a POC example; if people are happy
with the approach I'll implement for the other types (float, int, etc) as I think it makes sense there also.

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Mark Miller (JIRA <jira <at> apache.org>
2008-09-04 13:03:44 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628331#action_12628331
]
Mark Miller commented on LUCENE-1372:
-------------------------------------
Hey Paul,
I agree that your patch is more intuitive than the current behavior, but I don't know how intuitive that is -
if the sort worked on multiple tokens, you would expect it to sort lexicographically across each word, and
even with your patch it won't, it will just use the first word rather than the last, right? In other words, I
see it as a half fix.
So while it's low overhead, I wonder if any overhead is worth not getting the full fix? Currently the solution
has been that you should only be sorting on single token fields - in fact, there is a check for this (that just
isn't very good at checking <g>) that will possibly throw an exception if you sort on a field with multiple
tokens - it's just not safe unless that check is taken out (FieldCacheImpl string sorting).
It appears that to do this right, we need to pay a cost in the general case and sorting across multiple tokens
may not be worth that, as you can get around the limitation by using multiple fields etc now. Personally
though, if a patch were to be accepted, I think it would have to fully support the correct sorting and
disable that check I mentioned (again, I doubt people want to pay that perf cost though). Finally, even if
the committers decide this is a good way to go, the check needs to come out at a minimum.

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Paul Smith (JIRA <jira <at> apache.org>
2008-09-04 22:09:44 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628480#action_12628480
]
Paul Smith commented on LUCENE-1372:
------------------------------------
Having a Document sorted last because it has "zebra", even though it also has "apple", seems way incorrect. Yes, it
would be ideal if Lucene _could_ perform the multi-term sort properly, but in the absence of an effective
fix in the short term, having the lexicographically earlier term 'picked' as the primary sort candidate is
likely to generate results that match what users would expect (even if it's not quite perfect).
Right now it looks blatantly silly at the presentation layer when one presents the search results with
their data, and shows that "apple,zebra" appears last in the list.

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Mark Miller (JIRA <jira <at> apache.org>
2008-09-04 22:35:44 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628496#action_12628496
]
Mark Miller commented on LUCENE-1372:
-------------------------------------
Ah, but right now, the documentation will tell you it's not supported, and possibly even that you can't do it
(though you can, because the check that would stop you is basically a guess). So it looks funny when
presenting data because you are violating the rules <g>. You are not just proposing making it work more
intuitively, but by necessity, you are also proposing that Lucene support sorting on a field with multiple
tokens in the first place, because the stance right now is that it does not - hence the weak guard around it in
the code.

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Hoss Man (JIRA <jira <at> apache.org>
2008-09-04 23:01:44 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628509#action_12628509
]
Hoss Man commented on LUCENE-1372:
----------------------------------
bq. Right now it looks blatantly silly at the presentation layer when one presents the search results with
their data, and show that "apple,zebra" appears last in the list.
I'm not following this argument. Will it be less silly when {zebra,apple} sorts before {banana} ?
If we're going to break backwards compatibility for FieldCache users, let's break it completely and make
the code throw a RuntimeException when it sees that retArray[termDocs.doc()] is non-null ... that way we
are quickly alerting the client code that they are doing something very, very wrong.
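
A minimal sketch of the fail-fast variant suggested here, again simulated outside Lucene. It assumes a sorted TreeMap in place of TermEnum/TermDocs, and the names StrictFillDemo and fillStrict are hypothetical; IllegalStateException is used simply as one convenient RuntimeException.

```java
import java.util.Map;
import java.util.TreeMap;

// Simulation only: a TreeMap stands in for Lucene's TermEnum/TermDocs iteration.
class StrictFillDemo {

    // Fails fast as soon as a document turns out to carry a second value for the field.
    static String[] fillStrict(TreeMap<String, int[]> postings, int maxDoc) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                if (retArray[doc] != null) {
                    throw new IllegalStateException("doc " + doc + " already has value \""
                            + retArray[doc] + "\", cannot also sort on \"" + e.getKey() + "\"");
                }
                retArray[doc] = e.getKey();
            }
        }
        return retArray;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {1, 3});
        postings.put("banana", new int[] {2, 3}); // doc 3 has two values
        try {
            fillStrict(postings, 4);
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

With this variant a single-valued field fills as before, while the first document found with two values aborts the cache build instead of silently picking one.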

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Paul Smith (JIRA <jira <at> apache.org>
2008-09-04 23:13:44 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628513#action_12628513
]
Paul Smith commented on LUCENE-1372:
------------------------------------
bq. I'm not following this argument. Will it be less silly when {zebra,apple} sorts before {banana} ?
Well, at the presentation layer I don't think you'd present it like that (we don't). We'd sort the list of
attributes so that it would appear as "apple,zebra".

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Hoss Man (JIRA <jira <at> apache.org>
2008-09-05 05:45:44 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628555#action_12628555
]
Hoss Man commented on LUCENE-1372:
----------------------------------
bq. We'd sort the list of attributes so that it would appear as "apple,zebra".
Again, I'm missing something in your argument ... you'll put code in your application which will change the
order of stored fields when displaying them so it looks better, but you won't put code in your application
to ensure that multiple values aren't indexed in the first place?
The application using Lucene is in the best position to decide "this is the value i want to sort on."
FieldCache shouldn't guess which value to use if the application breaks the rules and indexes more than
one. The fact that FieldCache currently picks the last one is just an artifact of how it was implemented ...
it is "consistent" but "undefined" behavior.
If we are going to change the behavior, we should change it so that this is an error.

[jira] Updated: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Paul Cowan (JIRA <jira <at> apache.org>
2009-03-04 06:10:01 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Cowan updated LUCENE-1372:
-------------------------------
Attachment: LUCENE-1372-MultiValueSorters.patch
I think we're after somewhat different things here Uwe, but still pulling generally in the same direction.
For your case, I personally am in favour of:
1) replacing (as I did in my original patch) the loops in FieldCacheImpl that look like this:
{code}while (termDocs.next()) {
    retArray[termDocs.doc()] = termval;
}{code}
with ones that look like this:
{code}while (termDocs.next()) {
    if (retArray[termDocs.doc()] == null) {
        retArray[termDocs.doc()] = termval;
    }
}{code}
(or == 0 for the int case, == 0.0 for the float case, whatever). This, I think, meets your sorting goal (order
by lexicographically first term using simple binary ordering of the term text). You then just need either:
a) a code path that uses FieldCacheImpl.getStrings() rather than .getStringIndex() (the former
doesn't care about the more-terms-than-documents case), but this is obviously not optimally performant
b) a change to .getStringIndex which doesn't assume that there are fewer terms than documents. Not sure if
this is harder or not.... don't know if there is an easy way to find the number of terms for a field in advance
to size the array?
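
To make the effect of the guarded loop concrete, here is a minimal simulation of it. It assumes a sorted TreeMap in place of TermEnum/TermDocs, and the names FirstTermWinsDemo, fillFirstWins and fillFirstWinsInt are hypothetical. The int variant also illustrates one thing to watch with the "== 0" check: in this simulation a genuine value of 0 cannot be told apart from an unset slot, so a later term still overwrites it.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Simulation only: a TreeMap stands in for Lucene's TermEnum/TermDocs iteration.
class FirstTermWinsDemo {

    // String case: only write a slot that is still null, so the lowest term sticks.
    static String[] fillFirstWins(TreeMap<String, int[]> postings, int maxDoc) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                if (retArray[doc] == null) {
                    retArray[doc] = e.getKey();
                }
            }
        }
        return retArray;
    }

    // int case with 0 as the "unset" marker: a real 0 looks unset and is overwritten.
    static int[] fillFirstWinsInt(TreeMap<String, int[]> postings, int maxDoc) {
        int[] retArray = new int[maxDoc];
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            int value = Integer.parseInt(e.getKey());
            for (int doc : e.getValue()) {
                if (retArray[doc] == 0) {
                    retArray[doc] = value;
                }
            }
        }
        return retArray;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> fruit = new TreeMap<>();
        fruit.put("apple", new int[] {1, 3, 4});
        fruit.put("banana", new int[] {2, 3});
        fruit.put("zebra", new int[] {4});
        System.out.println(Arrays.toString(fillFirstWins(fruit, 5)));
        // prints [null, apple, banana, apple, apple]

        TreeMap<String, int[]> numbers = new TreeMap<>();
        numbers.put("0", new int[] {1}); // doc 1 has the values 0 and 5
        numbers.put("5", new int[] {1});
        System.out.println(Arrays.toString(fillFirstWinsInt(numbers, 2)));
        // prints [0, 5]: doc 1 ends up as 5, not 0
    }
}
```

If that edge case matters, a separate "already set" marker per document (for example a bit set) would avoid relying on 0 as a sentinel; that is a suggestion for the sketch, not something the patch is known to do.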
I think 'multi-value fields order by the first term in binary string order' is a valid behaviour, doesn't

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Uwe Schindler (JIRA <jira <at> apache.org>
2009-03-04 07:41:56 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678623#action_12678623
]
Uwe Schindler commented on LUCENE-1372:
---------------------------------------
For TrieRange the proposed variant to sort by the lowest term in TermEnum is absolutely fine.
Sorting against the first term in the document is simply impossible (maybe working if you use the term
positions during array creation, but this will slow down and it only works with real tokenized fields, not
fields like TrieRange).
TrieRange does not use String/StringIndex sorting, the ordering is done using the raw long/int values.
The arrays are filled and SortFields are instantiated using a custom FieldCache.Parser (see
LUCENE-1478). So if it is ordered by the lowest term (which is always the highest precision one in
TrieRange), the order would be correct.
In the current version, the results would be sorted using the last term in TermEnum, which is the lowest
precision. The order is then simply too imprecise (because the documents indexed with TrieRange have the
lower int/long bits stripped away).
The "simple" proposal is enough for trie range. Maybe we can add an option to switch between first/last term
(and make this option also available to SortFields and other parts where the FieldCache is used).

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Paul Cowan (JIRA <jira <at> apache.org>
2009-03-04 09:13:56 GMT

[
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678650#action_12678650
]
Paul Cowan commented on LUCENE-1372:
------------------------------------
Yes, sorry, I might have been unclear. When I referred to 'first term' I meant 'the first term
lexicographically' -- at least as far as binary order is 'lexicographically' -- i.e. the 'lowest' term.
I like the idea of the pluggable behaviour, even if it's a simple boolean:
{code}
boolean sortByLowestTerm = ...
if (retArray[termDocs.doc()] == null || !sortByLowestTerm) {
    retArray[termDocs.doc()] = termval;
}
{code}
We could replace this with a pluggable 'TermSelectionPolicy' or somesuch (as suggested by Earwin on
java-dev <at> ).... perhaps something like
{code}
interface SortTermCollector {
    void addTermText(String text);
    Comparable toSortValue();
}
{code}
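
As a rough illustration of how such a collector might be used, here is a self-contained sketch with two policies. The interface is the one proposed above, except that the return type is written as Comparable<?> to avoid raw-type warnings; LowestTermCollector, LastTermCollector and SortTermCollectorDemo are hypothetical names, and nothing here is existing Lucene API.

```java
// Proposed interface from the comment above (return type parameterised for this sketch).
interface SortTermCollector {
    void addTermText(String text);
    Comparable<?> toSortValue();
}

// Policy 1: keep the lowest term in String.compareTo order, whatever order terms arrive in.
class LowestTermCollector implements SortTermCollector {
    private String lowest;

    public void addTermText(String text) {
        if (lowest == null || text.compareTo(lowest) < 0) {
            lowest = text;
        }
    }

    public Comparable<?> toSortValue() {
        return lowest;
    }
}

// Policy 2: keep whichever term was offered last, like the unguarded overwrite loop.
class LastTermCollector implements SortTermCollector {
    private String last;

    public void addTermText(String text) {
        last = text;
    }

    public Comparable<?> toSortValue() {
        return last;
    }
}

class SortTermCollectorDemo {
    public static void main(String[] args) {
        SortTermCollector lowest = new LowestTermCollector();
        SortTermCollector last = new LastTermCollector();
        for (String term : new String[] {"apple", "zebra"}) {
            lowest.addTermText(term);
            last.addTermText(term);
        }
        System.out.println(lowest.toSortValue()); // prints apple
        System.out.println(last.toSortValue());   // prints zebra
    }
}
```

One collector instance per document is the simplest reading of the interface; whether that is affordable for large indexes is a separate question that this sketch does not address.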