Details

Description

When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted with the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem: if you use SortField.LONG, you get NumberFormatExceptions. The trie-encoded values may be sorted using SortField.STRING (the encoding is such that they are sortable as Strings), but this is very memory-inefficient.

ExtendedFieldCache makes it possible to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no way to supply this custom LongParser to the SortField.

I propose a change in the sort classes:
Include a reference to the parser instance to be used in SortField (if not given, use the default). My idea is to create a SortField using a new constructor:

SortField(String field, int type, Object parser, boolean reverse)

The parser is typed as Object because the current parsers have no common super-interface. The ideal solution would be to have:

SortField(String field, int type, FieldCache.Parser parser, boolean reverse)

where FieldCache.Parser is a super-interface (just empty, more like a marker interface) of all the other parsers (like LongParser, ...). The sort implementation must then be changed to respect the given parser (if not null), and otherwise use the default FieldCache.getXXXX() without a parser.
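The proposed design can be sketched in isolation. This is a minimal, self-contained sketch: the marker interface, the nested parser interfaces, and the instanceof dispatch are illustrative stand-ins for the proposed FieldCache.Parser hierarchy, not actual Lucene code.

```java
public class ParserSketch {

    /** Empty marker super-interface, as proposed. */
    interface Parser {}

    /** Mirrors the shape of a LongParser: one parse method per type. */
    interface LongParser extends Parser {
        long parseLong(String value);
    }

    /** Mirrors the shape of an IntParser. */
    interface IntParser extends Parser {
        int parseInt(String value);
    }

    /**
     * With a common super-type, a SortField-like class can accept one
     * strongly typed parser argument and derive the sort type from it.
     */
    static String typeFor(Parser parser) {
        if (parser instanceof LongParser) return "LONG";
        if (parser instanceof IntParser) return "INT";
        throw new IllegalArgumentException("unsupported parser: " + parser);
    }

    public static void main(String[] args) {
        LongParser trieParser = Long::parseLong; // a custom trie decoder would go here
        System.out.println(typeFor(trieParser)); // LONG
        System.out.println(trieParser.parseLong("1470"));
    }
}
```

The instanceof chain is exactly the "if-queue" discussed later in this thread; per-type constructors would trade it for twelve overloads.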


Uwe Schindler
added a comment - 30/Jan/09 11:33

Ah, now I understand (by the way, I do not completely understand Solr's ExternalFileField either):

With my workaround, I did not want to work around the problem of a changed external file. I proposed not to implement an own FieldCache for the whole thing, but to just use the new SortField's parser to fill the standard float cache during searching/filtering: use the external (static) file and convert the values from the index to floats using the HashMap in ExternalFileField. This parser is a singleton for each ExternalFileField.

But I think we should wait until Yonik explains the problem to us in more detail.


Michael McCandless
added a comment - 30/Jan/09 11:20

> the parser should be a singleton for all field caches or have hashCode()/equals().

I guess I don't understand your proposed workaround then – I thought you were explicitly proposing not using a singleton, so that you could force re-parsing of all values (even for old segments) when a new external file was being used.

(Also likely my lack of understanding of Solr's ExternalFileField is adding to my confusion here – I haven't yet looked at it.)

> The Cache of FieldCache instances in FieldCacheImpl is a WeakHashMap, so unused FieldCache instances for no longer used readers/parsers are GC'ed.

The WeakHashMap is keyed by the IndexReader, and then there's a normal HashMap (innerCache) keyed on (field, parser). So it's only when the IndexReader (SegmentReader) is no longer referenced elsewhere that the innerCache, in its entirety, can be GC'd. There's no reclaiming of stale entries in the innerCache.
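A toy model of the two-level cache just described: the outer WeakHashMap is keyed by the reader, the inner plain HashMap by a (field, parser) key. The names and value types are simplified stand-ins for FieldCacheImpl, but the retention behavior is the point.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

public class CacheSketch {
    // outer map: weak keys, so a reader's entry disappears once the
    // reader itself is unreachable; inner map: never pruned on its own
    static final Map<Object, Map<String, float[]>> cache = new WeakHashMap<>();

    static float[] get(Object reader, String fieldAndParserKey) {
        return cache
            .computeIfAbsent(reader, r -> new HashMap<>())         // innerCache per reader
            .computeIfAbsent(fieldAndParserKey, k -> new float[1]); // placeholder array
    }

    public static void main(String[] args) {
        Object reader = new Object(); // stands in for a SegmentReader
        get(reader, "price#parserA");
        get(reader, "price#parserB"); // a second parser instance -> a second entry

        // Both entries live until 'reader' itself becomes unreachable;
        // nothing evicts the stale one:
        System.out.println(cache.get(reader).size()); // 2
    }
}
```

This is why a long-lived segment sorted with ever-fresh parser instances accumulates entries: only the outer (reader) key is weak.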

Uwe Schindler
added a comment - 30/Jan/09 11:02

By the way: the Cache of FieldCache instances in FieldCacheImpl is a WeakHashMap, so unused FieldCache instances for no-longer-used readers/parsers are GC'ed.


Uwe Schindler
added a comment - 30/Jan/09 10:51

> Uwe, would that result in a memory leak? Ie, a single long-lived segment would accumulate multiple entries for each new XXXParser instance used during sorting? (Unless there's logic to evict the "stale" entries).

As noted before on Dec 9, 08, the parser should be a singleton for all field caches, or have hashCode()/equals(). You create a FloatParser and supply it to SortField. When then sorting against this field, the new sort implementation creates a FieldCache for each segment. When one segment is reloaded, its FieldCache becomes unused and a new one for the replacement segment is created (the same as with the standard field parser). The FieldCache uses (IndexReader, SortField.type, Parser) as key.

If the FieldCache is used for CachingFilters, there is also no problem: the new search algorithm executes each filter's getDocIdSet() for each single SegmentReader.

So the only problem is that, with the new search implementation, you can no longer rely on the fact that only one FieldCache exists for a MultiReader and that every filter's getDocIdSet() is executed only once. So injecting a custom FieldCache into the cache for the whole MultiReader before search (by getting it) is no longer possible.

I tested the new search impl with trie fields; sorting works perfectly using TrieUtils.getSortField()/TrieUtils.LONG_PARSER, with no leaks (because the parser is a singleton). I only had to modify the test case (Revision: 737079), because with an unoptimized index the statistics for retrieving the number of visited terms no longer worked (the Filter was called more than once per search).

The problem with stale entries if the external file changes is a different problem, which also existed before the new sort impl.

> It seems like LUCENE-831, which would expose / allow custom control over FieldCache's caching impl, would help here too.

Yes!


Michael McCandless
added a comment - 30/Jan/09 10:17

> Write a FloatParser that maps the uniqueKey to a float value using the external file.

Uwe, would that result in a memory leak? Ie, a single long-lived segment would accumulate multiple entries for each new XXXParser instance used during sorting? (Unless there's logic to evict the "stale" entries.)

It seems like LUCENE-831, which would expose / allow custom control over FieldCache's caching impl, would help here too.


Michael McCandless
added a comment - 30/Jan/09 10:11

Yonik, why was the failure so intermittent?

So it sounds like Solr was relying on Lucene loading the entire float[] for all docs in the MultiSegmentReader, when only some segments were new? (And so it was LUCENE-1483 that caused the failure.)

Lucene implicitly assumes that a FieldCache's arrays do not change for a given segment; this is normally safe since the arrays are derived from the postings in the field (which are write-once).

But it sounds like Solr changed that assumption, and the values in the (Solr subclass of) FieldCache's arrays are now derived from something external, which is no longer write-once.

How do you plan to fix it in Solr? It seems like, since you are maintaining a private cache, you could forcefully evict entries from the cache for all SegmentReaders whenever the external file has changed (or a new MultiSegmentReader has been opened)?


Uwe Schindler
added a comment - 30/Jan/09 08:16

After reading your comment several times and looking into ExternalFileField, the problem can maybe be fixed using this issue:

Your problem may come from the fact that you instantiated your own FieldCache implementation, keyed it with the parent MultiReader, and the new sort implementation does not use it.

Write a FloatParser that maps the uniqueKey to a float value using the external file. If you want to use it with the new MultiReaderHitCollector sorting (LUCENE-1483), create the SortField using the parser interface parameter from this issue. When the field caches for searching are rebuilt (even for only one of them in a Multi(Segment)Reader), the new Lucene search API will re-read the changed segments and build several new FieldCaches (standard ones) using the supplied parser. With this issue, there is no longer any need for an extra FieldCache implementation.

Is this your problem? Maybe this is why you wrote the comment here.
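A hedged sketch of such a FloatParser: the class name, the way the map is loaded, and the default for missing keys are all assumptions for illustration — the real Solr ExternalFileField may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class ExternalFileFloatParser {
    // uniqueKey -> float, as loaded from the external (static) file;
    // one parser instance (a singleton) per ExternalFileField
    private final Map<String, Float> keyToValue;

    ExternalFileFloatParser(Map<String, Float> keyToValue) {
        this.keyToValue = keyToValue;
    }

    /** Mirrors the shape of FieldCache.FloatParser#parseFloat(String). */
    public float parseFloat(String uniqueKey) {
        Float v = keyToValue.get(uniqueKey);
        return v != null ? v : 0f; // assumption: missing keys default to 0
    }

    public static void main(String[] args) {
        Map<String, Float> external = new HashMap<>();
        external.put("doc-1", 3.5f);

        ExternalFileFloatParser parser = new ExternalFileFloatParser(external);
        System.out.println(parser.parseFloat("doc-1")); // 3.5
        System.out.println(parser.parseFloat("doc-2")); // 0.0
    }
}
```

The standard float FieldCache would then call parseFloat for each indexed uniqueKey while filling a segment's array, so no custom FieldCache subclass is needed.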


Uwe Schindler
added a comment - 30/Jan/09 07:50

I am not really sure how this patch can change this. Maybe you have more of a problem with the new MultiReaderHitCollector patch (LUCENE-1483) – your comment looks like it? The current patch does not change anything as long as you use the old SortField constructors.

Can you explain a little bit more, maybe with a test?


Yonik Seeley
added a comment - 30/Jan/09 04:58

I tracked down how this patch was causing Solr failures:

ExternalFileField in Solr maps from a uniqueKey to a float value from a separate file. There is a cache that is essentially keyed by (IndexReader, field) that gives back a float[].

Any change in the index used to cause all values to be updated (cache miss, because the MultiReader was a different instance). Now, since it's called segment-at-a-time, only new segments are reloaded from the file, leaving older segments with stale values.

It's certainly in the very gray area... but perhaps Solr won't be the only one affected by this – maybe apps that implement security filters, etc.?


Uwe Schindler
added a comment - 09/Dec/08 22:52 - edited

Just a note: for the FieldCache it is also important that the parser is a singleton or implements hashCode() and equals(). If not, each call to sort with another SortField using a different parser instance (but of the same class) would create a new FieldCache. This is why I said that SortComparators and parsers should generally be made static final members (and so singletons), as I have done in TrieUtils. With that you can be sure that all SortFields hit the same cache entry when looking up using FieldCacheImpl.Entry. Implementing hashCode and equals for parsers is the other variant, but it does not really make sense (as long as parsers and comparators do not have instance-specific state).
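The static-final-singleton convention can be sketched like this (the local LongParser interface is a stand-in for the Lucene one; only the identity semantics matter here):

```java
public class TrieParserSketch {
    interface LongParser {
        long parseLong(String value);
    }

    /**
     * The one shared parser instance (TrieUtils-style). The default
     * identity-based Object.equals()/hashCode() then suffices for the
     * FieldCache key: every SortField built from it hits the same entry.
     */
    public static final LongParser LONG_PARSER = new LongParser() {
        public long parseLong(String value) {
            return Long.parseLong(value);
        }
    };

    public static void main(String[] args) {
        // Every lookup key built from the singleton compares equal:
        System.out.println(LONG_PARSER.equals(LONG_PARSER)); // true

        // Two fresh instances of an equivalent parser class do not,
        // so each would get its own FieldCache entry:
        LongParser a = new LongParser() {
            public long parseLong(String v) { return Long.parseLong(v); }
        };
        LongParser b = new LongParser() {
            public long parseLong(String v) { return Long.parseLong(v); }
        };
        System.out.println(a.equals(b)); // false
    }
}
```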


Uwe Schindler
added a comment - 08/Dec/08 22:54

Hi Mike,
sorry, after looking a second time into the new SortField ctors, I changed two cosmetic things:

The ctor for the parser assigns this.type=type and later calls the init method with the member variable type, so the init method assigns the member to itself again. It is cleaner to just call the initFieldType() method in whichever instanceof clause is hit.

Moved the initFieldType() calls behind all ctors.

But this is only cosmetic.

Michael McCandless
added a comment - 08/Dec/08 20:21

New patch attached:

> Maybe you should add the parser here, too.

OK, done.

> The default Object equals/hashcode is enough for singletons.

OK, I updated the javadoc to add "unless a singleton is always used".

> The additional null check is OK but in my opinion not needed, because field!=null when not one of the special RELEVANCE/DOCORDER sorts. For consistency we may add the check to the other ctors, too.

Yeah, I added that because the javadoc had previously said field could be null for DOC/SCORE sort types. OK, I added that check for all ctors (added an initFieldType private utility method).


Uwe Schindler
added a comment - 08/Dec/08 19:33 - edited

Hi Mike,
the patch looks good; I checked each of your changes with TortoiseMerge. All tests pass, including the Trie ones.

The only comments: you added these javadocs to hashCode() and equals() in the patch of LUCENE-1481. Maybe you should add the parser here, too.

/** Returns a hash code value for this object. If a
 * {@link SortComparatorSource} was provided, it must
 * properly implement hashCode. */

But on the other hand: if the parser and/or comparator are static singletons (as is done by the TrieUtils factories), they do not need to implement equals and hashCode. The default Object equals/hashCode is enough for singletons, and I think most parsers and comparators are singletons. A short note should be enough.

The additional null check is OK, but in my opinion not needed, because field!=null when not one of the special RELEVANCE/DOCORDER sorts. For consistency we may add the check to the other ctors, too.

Michael McCandless
added a comment - 08/Dec/08 18:21

OK, I made a few small changes to the patch: added a CHANGES entry, touched up javadocs, and added a null check for field in the new SortField ctors. I think it's ready to commit!

Uwe, can you look over my changes? Thanks.


Uwe Schindler
added a comment - 07/Dec/08 14:51

New patch that integrates Mike's comments. This version still uses a super-interface. If we want to remove it, we can create 6*2 new constructors (for each parser, with and without reverse) and use Object for the internal member variable holding the parser. In FieldSortedHitQueue, the parser is stored in the CacheEntry's custom member, which is also an Object, so the super-type is only used in SortField.

Twelve new constructors are (in my eyes) very bad and very heavy on the javadoc side. Their only advantages are:

- removing the super-interface
- the if-queue of instanceof tests can be removed because of stronger typing


Michael McCandless
added a comment - 06/Dec/08 13:47

> The problem is, that the comparator for byte fields returned SortField.INT in sortType() instead of SortField.BYTE.

Whoa, nice catch! I think it's fine to include the fix here (it was not manifesting as a bug w/o your changes).

> The patch also provides a test case in "TestSort" that uses a custom parser that maps a simple char value from 'A' to 'J' to a byte, short, int, long, float, double with some crazy algorithm to test sorting.

Awesome, I like the new test!

> Additionally the new SortField constructor checks the sort type and corresponding parser for consistency and throws IllegalArgumentException, if not correct

I like this stronger typing. Could you change the message of the thrown exception to detail the type that was given and what parser was provided (better transparency)?

> SortField could have separate constructors for each parser type without type parameter

That is a good idea, I think. The two args are redundant and therefore a source of confusion/error. It is annoying how we keep having to "multiply by N" in so many places in Lucene that want to switch on the different builtin types (LUCENE-831 has this too).

> Changing this would require some more work in the FieldSortedHitQueue, as copying the SortField would require more work

Can we just SortField.clone() instead of "new SortField(lots-of-tricky-args)" in FieldSortedHitQueue?

> this patch also extends contrib's TrieUtils and test cases to support a static trie-to-long/double parser and a SortField factory to handle trie-encoded fields very simply using the long FieldCache

Great!


Uwe Schindler
added a comment - 06/Dec/08 13:03

I forgot to mention: this patch also extends contrib's TrieUtils and test cases to support a static trie-to-long/double parser and a SortField factory to handle trie-encoded fields very simply using the long FieldCache (for sorting, there needs to be no difference between long and double; it will work for all 3 trie encodings, and long is the simplest to handle and compare) – which is what this patch was mainly invented for. The double parser instance is only supplied for other usages of FieldCache with TrieUtils.


Uwe Schindler
added a comment - 06/Dec/08 12:59

Here is the patch using the super-interface for all field parsers. The patch also provides a test case in "TestSort" that uses a custom parser that maps a simple char value from 'A' to 'J' to a byte, short, int, long, float, double with some crazy algorithm to test sorting.

Also fixed is the bug with sortType() in the comparator, as mentioned before; without this, the test case for bytes would not pass.

Additionally, the new SortField constructor checks the sort type and corresponding parser for consistency and throws IllegalArgumentException if they do not match (e.g. using a LongParser with SortField.BYTE). While developing this patch I had another idea to fix the issue without a new super-interface: SortField could have separate constructors for each parser type, without the type parameter (as the type is forced by the parser instance – we could also remove the type parameter in the current patch, as the type may be set by the instanceof operator). Changing this would require some more work in FieldSortedHitQueue, as copying the SortField would require more work (but it could be fixed better than it is currently: a new SortField instance in the constructor is only needed if SortField.AUTO or CUSTOM was used; in all other cases, the SortField instance can just be used directly).

Just give some comments!


Uwe Schindler
added a comment - 06/Dec/08 12:08

I found a hidden bug in FieldSortedHitQueue that materialized when writing a test case for SortField.BYTE sorting with a custom parser (it took me a long time to find out what's happening). The problem is that the comparator for byte fields returned SortField.INT in sortType() instead of SortField.BYTE. The effect was that my new code crashed when generating the new SortField in the constructor, because of the inconsistent Parser/SortField type. The other problem was that, on later calling the cache again, it would return a different hit, because the type was different. This does not affect the correctness of the previous implementation, but could lead to errors, as with my new impl.

I fixed this bug (it's just one line) in my next patch (I will supply it shortly). Is that OK, or should I open a separate bug report?


Michael McCandless
added a comment - 05/Dec/08 10:55 Patch looks good, thanks Uwe! Back compat looks preserved; while some
APIs (FieldSortedHitQueue.getCachedComparator) were changed, they are
package private.
Back-compat tests ("ant test-tag") pass as well.
For testing, I modified one of my contrib TrieRangeQuery test cases locally to sort using a custom LongParser that decoded the encoded longs in the cache [parseLong(value) returns TrieUtils.trieCodedToLong(value)] .
It looks like this didn't make it into the patch – could you add it?
Actually, adding a core test case would also be good. It could be
something silly, eg that parses ints but negates them, and then assert
that this yields the same result as the default IntParser with
reverse=true (assuming no ties).
If you like my patch, we could also discuss about using a super-interface for all Parsers. The modifications are rather simple (only the SortField constructor would be affected and some casts, and of course: the superinterface in all declarations inside FieldCache, ExtendedFieldCache)
I agree, I would like to at least get some minimal static typing into
the API (Object is not ideal) even if it's simply a "marker" interface.
If you're sure this can be done, such that all changes to
FieldCache/ExtendedFieldCache remain back compatible, then let's do
it? And I think I do now agree: this can be done w/o breaking back
compat. The only affected public methods should be your new SortField
methods, which is fine (no public methods take "Object parser" as far
as I can tell).
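The core test suggested above could be sketched in plain Java as follows. NegatingParserTest and parseNegated are illustrative names, not from the patch; in the real test the negating logic would live in a custom FieldCache.IntParser passed to the new SortField constructor, but the ordering argument itself (negating parser ascending == default parser descending, assuming no ties) can be checked without Lucene:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NegatingParserTest {
    // In Lucene this would implement FieldCache.IntParser;
    // here it is a plain static method for illustration.
    public static int parseNegated(String value) {
        return -Integer.parseInt(value);
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("3", "1", "7", "5");

        // Ascending sort using the negating parser...
        List<String> negatedAsc = new ArrayList<>(values);
        negatedAsc.sort((a, b) -> Integer.compare(parseNegated(a), parseNegated(b)));

        // ...must equal a descending sort using the default parser
        // (i.e. reverse=true), assuming no ties.
        List<String> defaultDesc = new ArrayList<>(values);
        defaultDesc.sort((a, b) -> Integer.compare(Integer.parseInt(b), Integer.parseInt(a)));

        System.out.println(negatedAsc.equals(defaultDesc)); // prints: true
    }
}
```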

Uwe Schindler added a comment - 04/Dec/08 21:25

Attached is a patch that implements the first variant (without a super-interface for all FieldParsers). All current tests pass. A special test case for a custom field parser was not implemented.
For testing, I modified one of my contrib TrieRangeQuery test cases locally to sort using a custom LongParser that decoded the encoded longs in the cache [parseLong(value) returns TrieUtils.trieCodedToLong(value)].
A good test case would be to store some dates in ISO format in a field and then sort them as longs after parsing with SimpleDateFormat. This would be another typical use case (sorting by date, but not using SortField.STRING, to minimize memory usage).
If you like my patch, we could also discuss about using a super-interface for all Parsers. The modifications are rather simple (only the SortField constructor would be affected and some casts, and of course: the superinterface in all declarations inside FieldCache, ExtendedFieldCache)
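Such a date-sorting parser could be sketched like this. IsoDateParser is a hypothetical name; the real class would implement ExtendedFieldCache.LongParser and be handed to the new SortField constructor, but the conversion itself is plain JDK code:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class IsoDateParser {
    // In Lucene this method would implement
    // ExtendedFieldCache.LongParser.parseLong(String).
    public static long parseLong(String value) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        try {
            // Later dates map to larger longs, so sorting by the parsed
            // value sorts chronologically without caching Strings.
            return format.parse(value).getTime();
        } catch (ParseException e) {
            throw new RuntimeException("Not an ISO date: " + value, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLong("2008-12-05") > parseLong("2008-12-04")); // prints: true
    }
}
```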

When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted with the problem that the special trie-encoded values (longs in a special encoding) cannot be sorted by Searcher.search() and SortField: if you use SortField.LONG, you get NumberFormatExceptions. The trie-encoded values may be sorted using SortField.STRING (the encoding is such that they sort correctly as Strings), but this is very memory-inefficient.

ExtendedFieldCache makes it possible to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no way to supply this custom LongParser to the SortField.

I propose a change in the sort classes:
Include a reference to the parser instance to be used in SortField (if not given, use the default). My idea is to create a SortField using a new constructor:
{code}SortField(String field, int type, Object parser, boolean reverse){code}

The parser is declared as "Object" because the current parsers have no common super-interface. The ideal solution would be to have:

{code}SortField(String field, int type, FieldCache.Parser parser, boolean reverse){code}

where FieldCache.Parser is a super-interface (just empty, more like a marker interface) of all the other parsers (like LongParser...). The sort implementation must then be changed to respect the given parser (if not null), and otherwise use the default FieldCache.getXXXX without a parser.
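The marker-interface idea can be illustrated in plain Java. The types below are simplified stand-ins for FieldCache.Parser and its sub-interfaces, not the actual Lucene declarations:

```java
public class MarkerInterfaceSketch {
    // Empty marker super-interface, as proposed.
    public interface Parser {}

    // The existing parser interfaces would extend the marker.
    public interface LongParser extends Parser {
        long parseLong(String value);
    }

    public interface IntParser extends Parser {
        int parseInt(String value);
    }

    // A SortField-like constructor can then accept the typed marker
    // instead of Object, and the sort implementation downcasts by type
    // (if the parser is null, it would fall back to the default cache).
    public static long sortValue(Parser parser, String raw) {
        if (parser instanceof LongParser) {
            return ((LongParser) parser).parseLong(raw);
        }
        throw new IllegalArgumentException("unsupported parser type");
    }

    public static void main(String[] args) {
        // Stand-in for a custom decoder such as a trie decoder.
        LongParser hexParser = value -> Long.parseLong(value, 16);
        System.out.println(sortValue(hexParser, "ff")); // prints: 255
    }
}
```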