Details

Description

The current implementation of locale-based sort in Lucene uses the FieldCache, which keeps all sort terms in memory. Besides the huge memory overhead, searching requires comparison of terms with collator.compare every time, making searches with millions of hits fairly expensive.

The proposed alternative implementation creates a packed list of pre-sorted ordinals for the sort terms and a map from document IDs to entries in the sorted ordinals list. This results in very low memory overhead and faster sorted searches, at the cost of increased startup-time. As the ordinals can be resolved to terms after the sorting has been performed, this approach supports fillFields=true.
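The idea can be sketched with plain Java arrays standing in for the packed structures; every name below is illustrative rather than part of the actual implementation:

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: sorting hits by pre-computed term ordinals instead of
// calling collator.compare for every comparison. All names are illustrative.
public class OrdinalSortSketch {
    public static void main(String[] args) {
        // Pretend the collator-sorted order of the four terms is already known:
        // apple < banana < orange < pear. Each document's term is replaced by
        // its position in that sorted order (built once at startup).
        int[] docToSortedOrdinal = {3, 0, 2, 1}; // doc0=pear, doc1=apple, doc2=orange, doc3=banana

        // At search time, comparing two hits is two array lookups, no collator calls.
        Integer[] hits = {0, 1, 2, 3};
        Arrays.sort(hits, Comparator.comparingInt(d -> docToSortedOrdinal[d]));
        System.out.println(Arrays.toString(hits)); // [1, 3, 2, 0]: apple, banana, orange, pear
    }
}
```

Resolving an ordinal back to its term text afterwards is what makes fillFields=true possible without keeping the terms themselves on the heap.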

Robert Muir
added a comment - 06/Apr/10 11:26 Toke, I still think it would be better to use ICU collation keys here.
for your Danish text, the memory usage will still be smaller than trunk (as ICU collation keys as byte[] are smaller than Java's internal UTF-16 encoding).
then you can do this sort at index-time...
one step towards this is to switch collation in flex to use byte[] rather than encoding in char[] like it does today.

Earwin Burrfoot
added a comment - 06/Apr/10 11:58 I would like to note that new types of sort caches do not require invasive surgery on SegmentReader, of the type seen in previous issue.
Or did I miss some peculiar requirement current APIs do not satisfy?

Toke Eskildsen
added a comment - 06/Apr/10 12:02 The current implementation accepts a Comparator&lt;Object&gt; (which must accept Strings) as well as a Locale (which is converted to Collator.getInstance(locale) under the hood) as arguments. Plugging in the ICU collator directly should be trivial. If/when it gets possible to use byte[] for sorters in general, I'll add support for that.
Indexing ICU collator keys and using them in combination with LUCENE-2369 is an interesting idea, as it would speed up the building process quite a lot while keeping the memory usage down. As long as fillFields=false, the two methods are independent and should work well with each other. Fairly easy to try.
For fillFields=true, it gets a bit trickier and requires a special FieldComparatorSource that keeps two maps from docID: one to the ICU collator key, one to the original term. Still, it should not be that hard to implement and I'll be happy to do it if the fillFields=false case turns out to work well.
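For context, a Comparator-driven sort with a collator looks like this; java.text.Collator is used so the sketch stays self-contained, but ICU's com.ibm.icu.text.Collator implements Comparator the same way, so swapping it in should be equally direct:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorComparatorSketch {
    public static void main(String[] args) {
        // java.text.Collator implements Comparator<Object>, so it can be passed
        // anywhere a Comparator accepting Strings is expected.
        Collator da = Collator.getInstance(new Locale("da"));
        String[] terms = {"åben", "æble", "øje", "zebra"};
        Arrays.sort(terms, da);
        // The Danish alphabet ends ... z, æ, ø, å:
        System.out.println(Arrays.toString(terms)); // [zebra, æble, øje, åben]
    }
}
```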

Toke Eskildsen
added a comment - 06/Apr/10 12:04 Moved from LUCENE-2335 as it really belongs here.
Lotsa devils in the details when you're poking around in the belly of Lucene, but modulo some business with deleted documents, it looks fine for simple (no parallel or multi readers) usage. fillFields=true works just as it should by delaying the actual term resolving until the document IDs are determined. The current code makes it possible to create an exposed Sort quite easily:
ExposedFieldComparatorSource exposedFCS =
    new ExposedFieldComparatorSource(reader, new Locale("da"));
Sort sort = new Sort(new SortField("mySortField", exposedFCS));
For the curious, a modified Lucene-JAR can be downloaded at http://github.com/tokee/lucene/downloads and tested with
java -cp lucene-core-3.1-dev-LUCENE-2335-20100405.jar org.apache.lucene.index.ExposedPOC expose <index> <sortField> <locale> <defaultField>
this will present the user with a simple shell where searches can be performed. Heap-usage and execution times are displayed along with the search result.
I did a little bit of real world experimenting: A 2.5GB index with 400K documents with 320K unique sort terms took 14 seconds to open. After that, a locale-based sorted search that hit 250K documents and returned the first 50 took 30ms (fully warmed by re-searching 5 times). Max heap was specified to 40MB of which 20MB was used after the building of the sort structure was finished.
The same search using the standard locale-oriented sorter took about 1 second to start up. After that, the 250K search took 130ms, fully warmed. Max heap was specified to 100MB.
The default sorter was able to get by with 80MB, but execution-time increased drastically to 2000ms. Probably because of the GC-overhead that the Collator introduces by temporarily allocating two new objects for each comparison.
The bad news is that this is quite a bit of code (400+ extra lines for the SegmentReader alone) with several levels of indirection in the data structures. As an example, getting the actual term for a given docID in the ExposedFieldComparatorSource is done with
final long resolvedDocOrder = docOrder.get(order[slot]);
return resolvedDocOrder == undefinedTerm ? null : reader.getTermText(
    (int) termOrder.get((int) resolvedDocOrder));
which is not easily digested without a very thorough explanation, preferably with a diagram.
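A walkthrough of that double indirection with plain arrays standing in for the packed structures may help; the shapes are illustrative, and only the names quoted from the snippet (order, docOrder, termOrder, getTermText) come from the actual code:

```java
public class IndirectionWalkthrough {
    public static void main(String[] args) {
        // Terms in index order; getTermText(ordinal) is a plain lookup here.
        String[] terms = {"banana", "apple", "pear"};  // term ordinals 0..2
        // Collated order is apple < banana < pear, so sorted position -> term ordinal:
        long[] termOrder = {1, 0, 2};
        // docID -> this document's position in the sorted term order (-1: no term).
        long[] docOrder = {2, 0, -1};                  // doc0=pear, doc1=apple, doc2=none
        // Result slot -> docID, as filled in during hit collection.
        int[] order = {1, 0};

        for (int slot = 0; slot < order.length; slot++) {
            long resolved = docOrder[order[slot]];
            String text = (resolved == -1) ? null : terms[(int) termOrder[(int) resolved]];
            System.out.println("slot " + slot + " -> " + text);
        }
        // slot 0 -> apple
        // slot 1 -> pear
    }
}
```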
The API-changes to the IndexReaders is the addition of two methods:
String getTermText(int ordinal) throws IOException;
is self-explanatory, but
ExposedIterator getExposedTuples(
    String persistenceKey, Comparator<Object> comparator, String field,
    boolean collectDocIDs) throws IOException;
is really hard to implement when writing a new IndexReader. My current approach is to use an interface ExposedReader with the methods and let the updated IndexReaders implement that, thereby making it optional for IndexReaders to be Exposed.
LUCENE-2335 seems to fulfill the promises so far, with the previously discussed trade-offs:
Long startup (~1 min/1M documents, less on re-open)
Fast locale-based sorted search (supports fillFields=true and has near-zero GC overhead)
Very low memory overhead (both permanent and for initializing)
Regarding making a proper patch, I would like to know what I should patch against. I use LUCENE-1990 so I need to do it against a fairly new version. I can see that Flex is about to be merged, so I guess it would make sense to wait for that one.

Toke Eskildsen
added a comment - 06/Apr/10 12:10 Earwin, it would be great if this can be done without modifying IndexReaders at all. I know that the getExposedTuples can be refactored out - it'll be somewhat clunky with checks for the class for a given IndexReader and different handling of different readers though. But, however I tweak it, I still need to be able to access a given term by its ordinal from outside the reader. Last time I looked, this was not possible. Has this changed since then?

As can be seen, the timings are fairly consistent for this small ad-hoc test. The difference between standard and exposed sorting is the time it takes for the collator to perform compares. I'll have to test if that can be improved by using a plain int-array to hold the order of the documents, just as the non-locale-using String sorter does.

Toke Eskildsen
added a comment - 03/May/10 00:34 Earwin: I've spent some time looking through trunk and happily the access-by-ordinal is part of it. This seems to make it possible to do the low-mem sort trick without the invasive surgery to the IndexReaders. However, I find it hard to do totally cleanly: FieldCache.DEFAULT is used throughout Lucene but is hardwired for specific types of Terms. As it would be best to have purging of closed IndexReaders done automatically, some sort of extension of the DEFAULT is needed. Making a wrapper FieldCache that encapsulates the existing DEFAULT and replaces it should work, and since the DEFAULT is public static, this might even be the intended way of doing it? This would change LUCENE-2369 from a large patch for Lucene core to a standard contrib.
Thanks for the heads-up, Earwin.

Earwin Burrfoot
added a comment - 07/May/10 14:01 FieldCache should move to become a plugin of IndexReader in some future. So there's no longer any statics and no need to call purge.
> I know that the getExposedTuples can be refactored out - it'll be somewhat clunky with checks for the class for a given IndexReader and different handling of different readers though
And Mike seems to push the split between primitive SegmentReaders and all kinds of MultiReaders, so they no longer extend same base class. So that task should be easier.

Toke Eskildsen
added a comment - 31/Aug/10 15:49 A status update might be in order. Switching to the current Lucene trunk with flex did require a lot of changes. Luckily it seems that they are all external so this could be a contrib.
The current implementation (not patch yet) seems to scale fairly well: Quick tests were made with a test-index with 5 fields of which one contained random Strings at average length 10 characters. No index optimize. The goal was to perform a Collator-based sorted search with fillFields=true (the terms used for sorting are returned along with the result) to get top-20 out of a lot of hits. Search-time was kept low by a field that was defined with the same term for every other document. The hardware was a Dell M6500 with i7@1.7GHz, PC1333 RAM, Intel X-25G2 SSD. The tests were performed in the background while coding and ZIPping 6M files.
2M document index, search hits 1M documents:
  Initial exposed search: 1:16 minutes
  Subsequent exposed searches: 45 ms
  Total heap usage for Lucene + exposed structure: 21 MB
  Initial default Lucene search: 3.3 s
  Subsequent default Lucene searches: 2.1 s
  Total heap usage for Lucene + field cache: 54 MB
20M document index, search hits 10M documents:
  Initial exposed search: 15:27 minutes
  Subsequent exposed searches: 370 ms
  Total heap usage for Lucene + exposed structure: 209 MB
  Initial default Lucene search: 28 s
  Subsequent default Lucene searches: 20 s
  Total heap usage for Lucene + field cache: 530 MB
200M document index, search hits 100M documents:
  Initial exposed search: 186:31 minutes
  Subsequent exposed searches: 3.5 s
  Total heap usage for Lucene + exposed structure: 2300 MB
  No data for default Lucene search as there was OOM with 6 GB of heap.
Observations:
The memory-requirement for the exposed structures is larger than the strict minimum. This is necessary in order to provide support for fast re-opening of indexes (the order of the terms in unchanged segments is reused). It seems like an obvious option to disable this cache.
The time for startup scales as roughly n * log n with the number of terms. Comparing the 200M case to the 2M case: 186 minutes / (200M * log(200M)) * (2M * log(2M)) ~= 1:25 min (observed was 1:16 min).
No tests with 100M documents yet, but 1½ hours for the build and 1.5GB of RAM would be the expected requirement. Having a startup-time of an hour or more is of course excessive, but if one made the calculated structures persistent (and thereby reduced restart time to near zero), this would work well with a classic "update once every night" scenario. This would provide Collator-sorted search for a 100M document index on a machine with 2GB of RAM.
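The n * log n scaling claim can be checked with a few lines; the log base cancels in the ratio, so natural log is used, and the 186:31 figure comes from the measurements above:

```java
public class ScalingCheck {
    public static void main(String[] args) {
        double observed200M = 186.5; // minutes for the 200M document build (186:31)
        // Scale down by (n log n) from 200M terms to 2M terms; the log base cancels.
        double factor = (2e6 * Math.log(2e6)) / (2e8 * Math.log(2e8));
        double predicted2M = observed200M * factor;
        System.out.printf("predicted 2M build: %.2f minutes%n", predicted2M);
        // roughly 1.4 minutes, i.e. about 1:25 against the observed 1:16
    }
}
```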

Robert Muir
added a comment - 31/Aug/10 16:08 No tests with 100M documents yet, but 1½ hour for build and 1.5GB of RAM would be the expected requirement.
Toke, have you tried doing this 'build' at index time instead? I would recommend applying LUCENE-2551 and indexing with ICU Collation, strength=primary
Now that we can mostly do everything as bytes, I think this slow functionality to do collation/range query at 'runtime' might soon be on its way out of Lucene (see patches on LUCENE-2514 ).
Instead, I think it's better to encourage users to index their content accordingly for the use cases they need.

Toke Eskildsen
added a comment - 31/Aug/10 21:53
Toke, have you tried doing this 'build' at index time instead? I would recommend applying LUCENE-2551 and indexing with ICU Collation, strength=primary
Robert, I'll sum up my understanding of the issue:
ICU collator keys make sorting very fast at the cost of some extra disk space, as one will probably want to store the original Term together with the key. It requires a non-trivial memory overhead, in the ideal case as many bytes as there are characters in the terms. Works extremely well with reopening.
My experiment makes sorting relatively low-memory and extremely fast at the cost of very high pre-calculation time. Works halfway well with reopening as some structures are reused.
The two approaches are not in conflict and combining them would indeed seem to give many benefits. Moving the building of the structures to index-time seems fairly easy: If nothing else, it could just be a post-processing of the index.
ICU is clearly what's on people's minds when it comes to collator-based sorting. I can see that I have to do some Lucene standard vs. ICU vs. pre-calculated vs. ICU+pre-calculated tests to explore what the benefits of the different approaches are.
Now that we can mostly do everything as bytes, I think this slow functionality to do collation/range query at 'runtime' might soon be on its way out of lucene (see patches on LUCENE-2514 ).
No argument from me. I'll keep my work at the runtime level for now though, but that's just to avoid working on two fronts at the same time.
Instead, I think its better to encourage users to index their content accordingly for the use cases they need.
I agree that the sort-fields as well as the sort-locale are well known at index time in most cases.

Robert Muir
added a comment - 31/Aug/10 22:35 ICU collator keys make sorting very fast at the cost of some extra disk space, as one will probably want to store the original Term together with the key. It requires a non-trivial memory overhead, in the ideal case as many bytes as there are characters in the terms. Works extremely well with reopening.
This doesn't make sense, why do you need the original term also?
What 'memory overhead'? Indexing collation keys, even at tertiary strength (the largest size), is in general less than 2 bytes per character. This is actually less than the cost of a term in RAM in Lucene 3.1, so I don't understand this?
The two approaches are not in conflict and combining them would indeed seem to give many benefits
if you are using collation keys, then binary order gives you collated results. So that's what I am hinting at here: is there a more general improvement here you can apply to sorting bytes? If this issue has some ideas that can improve the more general case, I think we should look at factoring those improvements out, and leave the locale stuff as an indexing-time thing.
I agree that the sort-fields as well as the sort-locale are well known at index time in most cases.
In all cases really. I don't see this issue really helping if you don't know the locale at index time; by invoking the collator over all the terms at startup you are essentially reindexing in RAM.
If one doesn't know the necessary locales at index-time, I suggest using a generic UCA collator: ULocale.ROOT as a 'catch-all' field for all other locales.

Toke Eskildsen
added a comment - 01/Sep/10 08:26
This doesn't make sense, why do you need the original term also?
I was thinking aggregation, but you are right. For aggregation one would of course just use the keys and have no need for the original Strings. Then we're left with federated search.
What 'memory overhead'? Indexing collation keys, even at tertiary strength (the largest size), is in general less than 2 bytes per character. This is actually less than the cost of a term in RAM in Lucene 3.1, so I don't understand this?
That is the memory overhead. If you have 20M terms of average length 10 chars, that is 400MB in raw bytes and quite a bit more when you're taking pointers into account. When I'm talking memory overhead, my baseline is a newly opened Lucene index without field caching of terms. Having a beefy machine and a small index is trivial. Low-end hardware, virtualized servers and huge indexes all call for conserving memory. PackedInts helps a lot; avoiding storing the terms (or ICU keys) in RAM helps more.
if you are using collation keys, then binary order gives you collated results. So that's what I am hinting at here: is there a more general improvement here you can apply to sorting bytes? If this issue has some ideas that can improve the more general case, I think we should look at factoring those improvements out, and leave the locale stuff as an indexing-time thing.
My approach is in theory very simple: Provide an order mapping for each segment, merge the order maps for index level. Having segments with the ICU keys already in sorted order would make the segment maps 1:1 (i.e. they do not require a sort) leaving only the index level merging of order maps. This would halve the build time and together with the 10 times speedup (guessing from the ICU website) that ICU gives, doing the calculation for 20M terms in my example above should take less than a minute. As the caches for the segments are then superfluous, the memory requirements are halved and thus providing faster sort at a memory cost of 100MB as compared to ICU sort speed and 400MB of memory.
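The segment/index split described above amounts to a k-way merge of already-sorted per-segment term lists, which is trivially the case if collation keys were indexed. A sketch under those assumptions (names and shapes illustrative, not the actual patch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the index-level merge: each segment already holds its terms in
// sorted order, so building the global order map is a standard k-way merge.
public class SegmentOrderMerge {
    public static void main(String[] args) {
        String[][] segments = {
            {"apple", "orange"},   // segment 0, terms already in sorted order
            {"banana", "pear"}     // segment 1, terms already in sorted order
        };
        // Queue entries are {segment, position within segment}, ordered by term.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            (a, b) -> segments[a[0]][a[1]].compareTo(segments[b[0]][b[1]]));
        for (int s = 0; s < segments.length; s++) pq.add(new int[]{s, 0});

        List<String> globalOrder = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            globalOrder.add(segments[top[0]][top[1]]);
            if (top[1] + 1 < segments[top[0]].length) pq.add(new int[]{top[0], top[1] + 1});
        }
        System.out.println(globalOrder); // [apple, banana, orange, pear]
    }
}
```

In the real structures the merged output would be ordinals rather than term strings, so nothing term-sized needs to stay on the heap.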
I don't see this issue really helping if you don't know the locale at index time; by invoking the collator over all the terms at startup you are essentially reindexing in RAM.
I fail to see why that is a bad thing if we're looking at the rare scenario of having to postpone the sorting decision to search time. What is the alternative? Right now, search-time collator-based sorting with field cache has low startup time, high memory usage and horrible execution time for large results.

Robert Muir
added a comment - 01/Sep/10 12:51 I was thinking aggregation, but you are right. For aggregation one would of course just use the keys and have no need for the original Strings. Then we're left with federated search.
I don't see why federated search needs anything but sort keys?
That is the memory overhead. If you have 20M terms of average length 10 chars, that is 400MB in raw bytes and quite a bit more when you're taking pointers into account.
The "memory" overhead is no different than the "overhead" of regular terms; there is nothing special about the collation key case, which is my point (see below). And in practice, for most people, it's encoded as way less than 2 bytes/char.
I fail to see why that is a bad thing if we're looking at the rare scenario of having to postpone the sorting decision to search time. What is the alternative? Right now, search-time collator-based sorting with field cache has low startup time, high memory usage and horrible execution time for large results.
Because "search-time" collator-sorting is the wrong approach, and should not exist at all.
Indexing with collation keys once we fix LUCENE-2551 has:
same startup time as regular terms
approximately the same memory usage as regular terms [e.g. PRIMARY key for "Robert Muir" is 12 bytes versus 11 bytes]
same execution time (binary compare) as regular terms
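Robert's claim that indexed collation keys sort with a plain binary compare can be sketched with the JDK's java.text.Collator standing in for ICU (ICU's RawCollationKey behaves the same way but produces smaller keys); the Danish locale and test strings below are illustrative assumptions, not taken from the issue:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationKeyDemo {
    // Compare two strings by their serialized collation keys: a plain
    // unsigned byte comparison, no Collator needed at compare time.
    static int compareByKeys(Collator collator, String a, String b) {
        byte[] ka = collator.getCollationKey(a).toByteArray();
        byte[] kb = collator.getCollationKey(b).toByteArray();
        return Arrays.compareUnsigned(ka, kb);
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("da", "DK"));
        // Danish places "Å" after "Z"; the key bytes encode that locale-
        // specific ordering, so a binary compare agrees with collator.compare.
        int byCollator = Integer.signum(collator.compare("Århus", "Zebra"));
        int byKeys = Integer.signum(compareByKeys(collator, "Århus", "Zebra"));
        System.out.println(byCollator == byKeys); // → true
    }
}
```

If keys like these are indexed (the direction LUCENE-2551 points at), the collator is only consulted at index time; search-time sorting degenerates to the byte[] compares Lucene already does for regular terms.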

Toke Eskildsen
added a comment - 01/Sep/10 13:45
I don't see why federated search needs anything but sort keys?
Because the sort terms themselves are needed when a search result from your service is merged, by an aggregator that is not you, with a result from a source that you do not control.
The "memory" overhead is no different than the "overhead" of regular terms; there is nothing special about the collation key case, which is my point (see below). And in practice, for most people, it's encoded as way less than 2 bytes/char.
We're clearly not talking about the same thing here. Maybe I've misunderstood something. Let me try and rephrase myself and break it down so that we can pinpoint where the problem is.
1) Opening a Lucene index and performing a search with relevance ranking requires X bytes of heap.
2) Performing a locale-based sorted search with Lucene 3.0.2 takes X + A bytes, where A is relatively large as all Terms from the sort field are kept in memory as Strings.
3) Performing the same search with Lucene trunk takes X + B bytes, where B is quite a lot smaller than A as BytesRefs are used for the Terms kept in memory.
4) Performing the same search on an ICU key infused index with Lucene trunk + ICU magic takes X + C bytes, where C is about the same size as B.
5) Performing the same search doing pre-sorting (what I'm doing) takes X + D bytes, where D is smaller than C. None of the Terms are kept in memory.
It seems to me that you are assuming that the sort-terms are kept in memory in case 5? Or not kept in memory in any of the cases 3, 4 and 5?
Because "search-time" collator-sorting is the wrong approach, and should not exist at all.
Opinion noted. Circular argumentation skipped.
Indexing with collation keys once we fix LUCENE-2551 has:
same startup time as regular terms
approximately the same memory usage as regular terms [e.g. PRIMARY key for "Robert Muir" is 12 bytes versus 11 bytes]
same execution time (binary compare) as regular terms
Indexing with pre-sorting (or whatever we call what I'm trying to do) has:
Huge startup time (or index commit time penalty if we move it to indexing)
Lower memory usage than sorting with regular terms or ICU keys
Faster execution time (single integer compare) than LUCENE-2551 or regular terms
Have I misunderstood something here? When a sorted search with ICU keys is performed, the keys themselves are still compared to each other for each search, right?
Combining the two approaches and doing the pre-sorting at index time, we have:
Time penalty at index commit (1 min / 10M terms? More? This requires real testing)
Faster startup time than regular term or ICU key sorting (load two PackedInts structures)
Lower memory usage than sorting with regular terms or ICU keys
Faster execution time (single integer compare) than LUCENE-2551
Bad for real-time, good for fast sorting and low memory usage.
If you agree on this breakdown, the next logical step for me is to create a performance (speed & heap) test for the different cases to see whether the memory savings and the alleged faster sorting with integer comparison is enough to warrant the hassle.
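The trade-off lists above can be made concrete with a minimal sketch of the pre-sorted-ordinals idea, using plain int[] in place of the PackedInts structures and hypothetical data; once docToOrd is built (the expensive startup or index-commit step), each hit costs a single integer compare:

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class OrdinalSortSketch {
    // docToOrd[doc] = position of doc's sort term in pre-sorted term order.
    // Returns the k matching docs with the smallest ordinals, i.e. the top
    // of a sorted search, without touching a single term.
    static int[] topDocsByOrdinal(int[] docToOrd, int[] hits, int k) {
        // Max-heap on ordinal: the head is the worst of the current top-k.
        PriorityQueue<Integer> heap = new PriorityQueue<>(
                (d1, d2) -> Integer.compare(docToOrd[d2], docToOrd[d1]));
        for (int doc : hits) {
            heap.offer(doc);
            if (heap.size() > k) heap.poll(); // evict the current worst
        }
        int[] top = new int[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) top[i] = heap.poll();
        return top;
    }

    public static void main(String[] args) {
        int[] docToOrd = {4, 0, 3, 1, 2}; // hypothetical doc -> ordinal map
        int[] hits = {0, 1, 2, 3, 4};     // every doc matches
        System.out.println(Arrays.toString(topDocsByOrdinal(docToOrd, hits, 3)));
        // → [1, 3, 4]: the docs holding ordinals 0, 1 and 2
    }
}
```

With fillFields=true, the winning ordinals are resolved back to terms only for the k documents actually returned.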

Robert Muir
added a comment - 01/Sep/10 14:07 Faster startup time than regular term or ICU key sorting (load two PackedInts structures)
But this is what I keep trying to get at (the whole point of my comments):
ICU keys are just byte[] just like regular terms. They are "regular terms".
Can we forget about the stupid runtime Locale sort? If you have a way to improve memory usage for byte[] terms, let's look just at that. Then this could be more general and more useful.

Toke Eskildsen
added a comment - 01/Sep/10 14:32
ICU keys are just byte[] just like regular terms. They are "regular terms".
Do they or do they not need to be loaded into heap in order to be used for sorted search?
Can we forget about the stupid runtime Locale sort? If you have a way to improve memory usage for byte[] terms, let's look just at that. Then this could be more general and more useful.
Easy now. The whole runtime-vs-index-time issue is something that I don't care much about at this point. Pre-sorting can be done both at index and search time. Let's just say that we do it at index-time and go from there.
Not holding the sort-terms in memory (whether they be Strings, BytesRefs, regular terms or ICU keys) and doing all possible sorting up front (in the case of a hybrid ICU-approach: a merge-sort of the already sorted segments) is what I'm looking at. Could you please re-read my comment with that in mind and see if my breakdown and trade-off lists make sense? It seems to me that you're quite certain that there is something I've missed, but I haven't yet understood what it is. I do know that ICU keys are just regular terms in the technical sense. When I use the designation ICU keys, I do it to make it clear that we're getting locale-specific ordering.
Deep breaths, ok? I'm going to fetch the kids from school, so you don't need to rush your answer.

Robert Muir
added a comment - 01/Sep/10 14:48 Do they or do they not need to be loaded into heap in order to be used for sorted search?
They are just regular terms! You can do a TermQuery on them, sort them as byte[], etc.
It's just that the bytes use 'collation encoding' instead of 'UTF-8 encoding'.
This is why I want to factor out the whole 'locale' thing from the issue: since sorting is agnostic to what's in the byte[], it's unrelated, and it would simplify the issue to just discuss that.
Easy now. The whole runtime-vs-index-time issue is something that I don't care much about at this point. Pre-sorting can be done both at index and search time. Let's just say that we do it at index-time and go from there.
Well, the thing is, it's something I care a lot about. The problems are:
Users who develop localized applications tend to use methods with Locale/Collator parameters if they are available: it's best practice.
In the case of Lucene, it is not best practice, but a silly trap (as you get horrible performance).
However, users are used to the concept of collation keys wrt indexing (e.g. when building a database index)
The APIs here are wrong anyway: they shouldn't take Locale but Collator.
There is no way to set strength or any other options, and there's no way to supply a Collator I made myself (e.g. from RuleBasedCollator)
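Both points, 'collation encoding' being just another byte[] payload and strength being unreachable through a bare Locale, can be illustrated with the JDK's java.text.Collator as a stand-in for ICU (the term and locale are made up for the example):

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class TermEncodingDemo {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        collator.setStrength(Collator.PRIMARY); // an option a Locale-only API cannot express

        String term = "Müller";
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        byte[] key  = collator.getCollationKey(term).toByteArray();

        // Both are opaque byte[] as far as an index is concerned; only the
        // encoding differs, so exact-match lookups (a TermQuery analogue)
        // work on either, provided query and index use the same encoding.
        System.out.println(Arrays.equals(key,
                collator.getCollationKey(term).toByteArray())); // → true: keys are stable
        System.out.println(Arrays.equals(utf8, key)); // → false: different encodings
    }
}
```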

Toke Eskildsen
added a comment - 02/Sep/10 14:49 - edited
They are just regular terms! You can do a TermQuery on them, sort them as byte[], etc.
It's just that the bytes use 'collation encoding' instead of 'UTF-8 encoding'.
Yes, you have stated that repeatedly.
As you are unwilling to answer my questions, I've tweaked my tests to see for myself. It worked very well, as I learned something new. To answer one of the questions, then yes, the terms are loaded into memory when a sort is performed on a field. This is the case for the locale-based sort (yes, I have understood that you do not consider that a viable form of sorting and I only write it for completeness), for STRING-sorting (natural order with ordinal based speedup) and for STRING_VAL-sorting (natural order without the ordinal-speedup). All this against Lucene trunk, no patches.
This is why I want to factor out the whole 'locale' thing from the issue: since sorting is agnostic to what's in the byte[], it's unrelated, and it would simplify the issue to just discuss that.
For my new tests I've switched to natural order sorting (direct byte[] comparison aka Lucene STRING sorted search without specifying a Locale). The tests should be fairly telling for the different scenarios as the ICU keys should be about the same size as the original terms.
The APIs here are wrong anyway: they shouldn't take Locale but Collator.
There is no way to set strength or any other options, and there's no way to supply a Collator I made myself (e.g. from RuleBasedCollator)
I fully agree. The code I've made so far takes a Comparator<BytesRef>, with optimization for wrapped Collators. The title of LUCENE-2369 is not technically correct but was chosen as "Locale" is a fairly well-known concept while "Collator" is more complex. That might have been a mistake.
Onwards to testing with natural order (new Sort(new SortField(myField, SortField.STRING))) for Lucene and the hybrid approach (natural order + pre-sorting) for exposed. No ZIPping in the background this time, so measurements differ from the previous test. Heap sizes were measured after a call to System.gc().
2M document index, search hits 1M documents, top 10 hits extracted:
Initial exposed search: 0:16 minutes
Subsequent exposed searches: 40 ms
Total heap usage for Lucene + exposed structure: 20 MB
Initial default Lucene search: 0.8 s
Subsequent default Lucene searches: 25 ms
Total heap usage for Lucene + field cache: 60 MB
20M document index, search hits 10M documents, top 10 hits extracted:
Initial exposed search: 2:53 minutes
Subsequent exposed searches: 330 ms
Total heap usage for Lucene + exposed structure: 154 MB
Initial default Lucene search: 6 s
Subsequent default Lucene searches: 220 ms
Total heap usage for Lucene + field cache: 600 MB
200M document index, search hits 100M documents, top 10 hits extracted:
Initial exposed search: 31:33 minutes
Subsequent exposed searches: 3200 ms
Total heap usage for Lucene + exposed structure: 1660 MB
No data for default Lucene search as there was OOM with 7 GB of heap.
What did we learn from this?
Natural order search in Lucene with STRING is very fast (as most of the work is ordinal comparison).
Exposed sorting is actually slower than natural order (that's news to me). The culprit is a modified PackedInts-structure. I'll look into that.
The exposed structure build penalty for the hybrid approach (i.e. relying on natural order instead of doing an explicit sort) was indeed markedly lower than exposed with explicit sorting: a factor of 5. I would have expected it to be more, though.
The hybrid approach uses less than a third of the amount of RAM required by Lucene natural order sorting.
So, Robert, does this answer your challenge "if you have a way to improve memory usage for byte[] terms, lets look just at that?"?

Toke Eskildsen
added a comment - 23/Sep/10 14:47 This patch is to keep in sync with Lucene trunk (20100923) and to explore some ideas. Besides the updated code with some bug fixing and some optimization, there's sample code for faceting and index lookup (check out the unit-test TestExposedFacets.testScale). I know that this does not belong in Lucene core, so see it as a demonstration of the potential in providing the doc/term mappings.
Now, revisiting the previous test with the updated code and this time actually remembering not to do an explicit sort in the exposed part (simulating that ICU collator keys are indexed), the numbers are:
2M document index, search hits 1M documents, top 10 hits extracted:
Opening the index and doing a plain relevance-sorted search: 3 MB
Initial exposed search: 3.5 seconds
Subsequent exposed searches: 40-60 ms
Total heap usage for Lucene + exposed structure: 23 MB
Initial default Lucene sorted search: 1.0 seconds
Subsequent default Lucene searches: 30-35 ms
Total heap usage for Lucene + field cache: 61 MB
20M document index, search hits 10M documents, top 10 hits extracted:
Opening the index and doing a plain relevance-sorted search: 27 MB
Initial exposed search: 44 seconds
Subsequent exposed searches: 350-380 ms
Total heap usage for Lucene + exposed structure: 183 MB
Initial default Lucene sorted search: 6.7 seconds
Subsequent default Lucene searches: 220-240 ms
Total heap usage for Lucene + field cache: 614 MB
200M document index, search hits 100M documents, top 10 hits extracted:
Opening the index and doing a plain relevance-sorted search: 210 MB
Initial exposed search: 7:35 minutes
Subsequent exposed searches: 3320-3550 ms
Total heap usage for Lucene + exposed structure: 1744 MB
No data for default Lucene search as there was OOM with 7 GB of heap.
While the time for the first search is still substantial, it is a lot shorter than the previous measurements. Lucene natural order sorting is still nearly twice as fast (I haven't tried switching to int[] instead of PackedInts yet, so that part is not closed). I'll try and find the time to do some more detailed tests with a more realistic number of hits, but I estimate that the speed will be the same, relative to Lucene natural order sort.

Toke Eskildsen
added a comment - 10/Oct/10 23:51 Some bug fixes, some sample code demonstrating how to build a hierarchical faceting system using ordered term ordinals.
This is moving away from the original issue at high speed. I'll try and sum up my observations and ideas on the mailing list Real Soon Now.
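The hierarchical-faceting idea mentioned here reduces to a count-by-ordinal pass over the hits; a hedged sketch with plain arrays standing in for the patch's packed structures and hypothetical terms:

```java
import java.util.Arrays;

public class OrdinalFacetSketch {
    // Count facet values for a hit set using only the doc -> ordinal map;
    // terms are resolved from ordinals afterwards, so no Strings are
    // touched while counting.
    static int[] countByOrdinal(int[] docToOrd, int[] hits, int numTerms) {
        int[] counts = new int[numTerms];
        for (int doc : hits) counts[docToOrd[doc]]++;
        return counts;
    }

    public static void main(String[] args) {
        String[] sortedTerms = {"A/A1", "A/A2", "B/B1"}; // ordinal -> term, pre-sorted
        int[] docToOrd = {0, 2, 0, 1, 2, 2};             // hypothetical data
        int[] counts = countByOrdinal(docToOrd, new int[]{0, 1, 2, 3, 4, 5},
                sortedTerms.length);
        System.out.println(Arrays.toString(counts)); // → [2, 1, 3]
        // Because ordinals follow sorted term order, a hierarchy level such
        // as "A/" occupies a contiguous ordinal range and can be summed by range.
    }
}
```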

Toke Eskildsen
added a comment - 19/Nov/10 22:49 Bugfixes and maintenance. This patch is against Lucene trunk revision 1036986 (the latest at the time of writing). Apply the patch in the lucene sub-folder.

Toke Eskildsen
added a comment - 03/Dec/12 15:39 A patch with code both for Lucene 4 and Solr 4 is now maintained at https://issues.apache.org/jira/browse/SOLR-2412
The code still works with Lucene 4 standalone and provides hierarchical faceting with custom sorting. Further development will be announced under that JIRA issue.