Details

Description

There are various places in Lucene that could take advantage of an
efficient packed unsigned int/long impl. EG the terms dict index in
the standard codec in LUCENE-1458 could subsantially reduce it's RAM
usage. FieldCache.StringIndex could as well. And I think "load into
RAM" codecs like the one in TestExternalCodecs could use this too.

This is vendor-specific, and possibly may cause unexpected problems, but we can try in some specific cases:
"Compressed Oops have been included (but disabled by default) in the performance release JDK6u6p (requires you to fill a survey), so I decided to try it in an internal application and compare it with 64-bit mode and 32-bit mode."

-XX:+UseCompressedOops

There are other vendors around too such as Oracle JRockit which is much faster server-side...

Fuad Efendi
added a comment - 12/Nov/09 17:02 Suttiwat sent me a link:
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/
This is vendor-specific, and possibly may cause unexpected problems, but we can try in some specific cases:
"Compressed Oops have been included (but disabled by default) in the performance release JDK6u6p (requires you to fill a survey), so I decided to try it in an internal application and compare it with 64-bit mode and 32-bit mode."
-XX:+UseCompressedOops
There are other vendors around too such as Oracle JRockit which is much faster server-side...

I have very recently done some experiments with random-access packed positive integer arrays (we could also call them unsigned). The basic approach is to put the bits after each other in an int- or long-array, then extract the value by shifting and masking. Unfortunately I have not been able to determine a single winning strategy for space/speed trade-offs.

I've tried four simple implementations:

1. Pack the bits right after each other.
2. Pack the bits right after each other but allow only 1, 2, 4, 8, 16 or 32 as value-bits.
3. Wrap an int[] or long[] and store the values directly (for comparison with 1 & 2).
4. Use int[] directly without any wrapping.

The obvious benefit from #2 is that the bits for values are always contained in a single block (int or long), which reduces the amount of operations for set and get considerably. Unfortunately this means that as soon as a value greater than 65535 needs to be stored, the internal representation will require 32 bits/value. This means no space-benefit compared to #3 while the speed penalty remains. A hybrid approach might be considered where the implementation is determined by the number of bits needed to store a value.

In the tables below, the measurements for Packed, Aligned, Int, Constant and int[] are the total time in milliseconds to perform actionCount actions (either read or write). Random numbers (using the same seed) were used for index as well as value and the overhead of constructing the random numbers were subtracted from the measurements. "Constant" is a dummy where no values are stored and the same value is always returned on a get. It is used to measure method-call overhead.

For arrays of 10M values, 6-33 million values can be read or written per second. If this is within acceptable limits, I'd be happy to try and make a contribution to Lucene based on the code. However, I probably won't find the time before February or March 2010.

Michael McCandless
added a comment - 01/Jan/10 12:01 These are great results – thanks Toke!
What's the difference between the two "sections" of your results? The bottom section seems to have slower times overall than the top one.
It's curious that "aligned" isn't always a win.
For the packed case, did you generate specialized code for each of the cases? I wonder how much / if that'd really help in practice.
For the standard codec's terms dict, my guess is we'd just go with packed ints.

The first section if for 1M values in the structure, the second is for 10M. As the CPU on the test-machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache-misses.

Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher amount of trips around main memory.

I did not generate any specialized code for the aligned case: No matter the number of bits/value, the amount of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value). The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be.

The standard use case seems to be some sort of update-round, after which no updates are performed. Having a cleanupAndOptimize-call that potentially creates a new and optimized structure, would fit well into this and would avoid the indirection / conditional penalty.

A whole other matter is long vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as backing array for 64 bit machines and int[] for 32 bit, might be the solution.

All this calls for a factory-approach to hide the fairly complex task of choosing the right implementation.

Toke Eskildsen
added a comment - 02/Jan/10 15:03 The first section if for 1M values in the structure, the second is for 10M. As the CPU on the test-machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache-misses.
Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher amount of trips around main memory.
I did not generate any specialized code for the aligned case: No matter the number of bits/value, the amount of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value). The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be.
The standard use case seems to be some sort of update-round, after which no updates are performed. Having a cleanupAndOptimize-call that potentially creates a new and optimized structure, would fit well into this and would avoid the indirection / conditional penalty.
A whole other matter is long vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as backing array for 64 bit machines and int[] for 32 bit, might be the solution.
All this calls for a factory-approach to hide the fairly complex task of choosing the right implementation.

I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), the ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM-speed, but I do not know how to determine that without root-access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz with 1MB of level 2 cache was about half as fast than a 1.8GHz laptop with 2MB of level 2 cache.

There is a whole lot of measurements and it is getting hard to digest. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:

1. The penalty of using long[] instead of int[] on my 32 bit laptop depends on the number of values in the array. For less than a million values, it is severe: The long[]-version if 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned, 25% slower for Packed.
On the other hand, 64 bit machines dos not seem to care that much whether int[] or long[] is used: There was 10% win for arrays below 1M for one machine, 50% for arrays below 100K for another (8% for 1M, 6% for 10M) for another and a small loss of below 1% for all lenghts above 10K for a third.

2. There's a fast drop-off in speed when the array reaches a certain size that is correlated to level 2 cache size. After that, the speed does not decrease much when the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct access approach for writes in a number of cases. For reads, there's no contest: Direct access to int[] is blazingly fast.

3. The access-speed of the different implementations converges when the number of values in the array rises (think 10M+ values): The slow round-trip to main memory dwarfs the logic used for value-extraction.

Observation #3 supports Mike McCandless choice of going for the packed approach and #1 suggests using int[] as the internal structure for now. Using int[] as internal structure makes if unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?

Toke Eskildsen
added a comment - 04/Jan/10 14:21 I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), the ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM-speed, but I do not know how to determine that without root-access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz with 1MB of level 2 cache was about half as fast than a 1.8GHz laptop with 2MB of level 2 cache.
There is a whole lot of measurements and it is getting hard to digest. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:
1. The penalty of using long[] instead of int[] on my 32 bit laptop depends on the number of values in the array. For less than a million values, it is severe: The long[]-version if 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned, 25% slower for Packed.
On the other hand, 64 bit machines dos not seem to care that much whether int[] or long[] is used: There was 10% win for arrays below 1M for one machine, 50% for arrays below 100K for another (8% for 1M, 6% for 10M) for another and a small loss of below 1% for all lenghts above 10K for a third.
2. There's a fast drop-off in speed when the array reaches a certain size that is correlated to level 2 cache size. After that, the speed does not decrease much when the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct access approach for writes in a number of cases. For reads, there's no contest: Direct access to int[] is blazingly fast.
3. The access-speed of the different implementations converges when the number of values in the array rises (think 10M+ values): The slow round-trip to main memory dwarfs the logic used for value-extraction.
Observation #3 supports Mike McCandless choice of going for the packed approach and #1 suggests using int[] as the internal structure for now. Using int[] as internal structure makes if unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?

The first section if for 1M values in the structure, the second is for 10M. As the CPU on the test-machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache-misses.

Got it.

Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher amount of trips around main memory.

OK, though, I think we should somehow exclude these CPU cache effects
from the test. In a real production situation, lots of other things
are competing for that cache, so I think the mostly-cache-miss case is
most realistic here.

Actually a good way to do that is to test it in a wider context, eg
once I can get LUCENE-2186 "integrated", we can test sorting by int
field with these different ways of encoding.

I did not generate any specialized code for the aligned case: No matter the number of bits/value, the amount of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value).

OK I have a start at this under LUCENE-2186. I'll try to clean up &
post here...

The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be.

The standard use case seems to be some sort of update-round, after which no updates are performed. Having a cleanupAndOptimize-call that potentially creates a new and optimized structure, would fit well into this and would avoid the indirection / conditional penalty.

I think the writing API can be a write-once API, where I declare the
maxValue I will write. This fits with the indexing chain, where we
buffer values in RAM and then flush to the segmetn files.

This way we don't need the complexity of having to resize the bits
internally when a new max value is seen.

So, the factory would take number of values we will write, the max
value, and I guess an enum for Packed/Aligned/DedicatedArray, and
return a simple writer that just exposes an "add(long value)" method?

A whole other matter is long vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as backing array for 64 bit machines and int[] for 32 bit, might be the solution.

Hmm... I guess we could still support values requiring more than 32
bits, encoded into int[], but it's more hairy as the value could span
2 or 3 ints.

Probably we should specialize/default the factory selection based on
whether the JVM is 32/64 bit? And, whether the bit size is <= 32.

All this calls for a factory-approach to hide the fairly complex task of choosing the right implementation.

Yes.

I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), the ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM-speed, but I do not know how to determine that without root-access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz with 1MB of level 2 cache was about half as fast than a 1.8GHz laptop with 2MB of level 2 cache.
There is a whole lot of measurements and it is getting hard to digest. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:

1. The penalty of using long[] instead of int[] on my 32 bit laptop depends on the number of values in the array. For less than a million values, it is severe: The long[]-version if 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned, 25% slower for Packed.
On the other hand, 64 bit machines dos not seem to care that much whether int[] or long[] is used: There was 10% win for arrays below 1M for one machine, 50% for arrays below 100K for another (8% for 1M, 6% for 10M) for another and a small loss of below 1% for all lenghts above 10K for a third.

2. There's a fast drop-off in speed when the array reaches a certain size that is correlated to level 2 cache size. After that, the speed does not decrease much when the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct access approach for writes in a number of cases. For reads, there's no contest: Direct access to int[] is blazingly fast.

3. The access-speed of the different implementations converges when the number of values in the array rises (think 10M+ values): The slow round-trip to main memory dwarfs the logic used for value-extraction.

Observation #3 supports Mike McCandless choice of going for the packed approach and #1 suggests using int[] as the internal structure for now. Using int[] as internal structure makes if unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?

I think we should agree on an API that LUCENE-2186 can use to
write/read packed ints, then get our two patches talking. I'll pull
out the barebones packed ints I'm currently using, and post
here... then let's merge/iterate to a common API, so I can cutover to
the patch from here in LUCENE-2186?

Michael McCandless
added a comment - 13/Jan/10 12:00 The first section if for 1M values in the structure, the second is for 10M. As the CPU on the test-machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache-misses.
Got it.
Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher amount of trips around main memory.
OK, though, I think we should somehow exclude these CPU cache effects
from the test. In a real production situation, lots of other things
are competing for that cache, so I think the mostly-cache-miss case is
most realistic here.
Actually a good way to do that is to test it in a wider context, eg
once I can get LUCENE-2186 "integrated", we can test sorting by int
field with these different ways of encoding.
I did not generate any specialized code for the aligned case: No matter the number of bits/value, the amount of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value).
OK I have a start at this under LUCENE-2186 . I'll try to clean up &
post here...
The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be.
The standard use case seems to be some sort of update-round, after which no updates are performed. Having a cleanupAndOptimize-call that potentially creates a new and optimized structure, would fit well into this and would avoid the indirection / conditional penalty.
I think the writing API can be a write-once API, where I declare the
maxValue I will write. This fits with the indexing chain, where we
buffer values in RAM and then flush to the segmetn files.
This way we don't need the complexity of having to resize the bits
internally when a new max value is seen.
So, the factory would take number of values we will write, the max
value, and I guess an enum for Packed/Aligned/DedicatedArray, and
return a simple writer that just exposes an "add(long value)" method?
A whole other matter is long vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as backing array for 64 bit machines and int[] for 32 bit, might be the solution.
Hmm... I guess we could still support values requiring more than 32
bits, encoded into int[], but it's more hairy as the value could span
2 or 3 ints.
Probably we should specialize/default the factory selection based on
whether the JVM is 32/64 bit? And, whether the bit size is <= 32.
All this calls for a factory-approach to hide the fairly complex task of choosing the right implementation.
Yes.
I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), the ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM-speed, but I do not know how to determine that without root-access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz with 1MB of level 2 cache was about half as fast than a 1.8GHz laptop with 2MB of level 2 cache.
There is a whole lot of measurements and it is getting hard to digest. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:
1. The penalty of using long[] instead of int[] on my 32 bit laptop depends on the number of values in the array. For less than a million values, it is severe: The long[]-version if 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned, 25% slower for Packed.
On the other hand, 64 bit machines dos not seem to care that much whether int[] or long[] is used: There was 10% win for arrays below 1M for one machine, 50% for arrays below 100K for another (8% for 1M, 6% for 10M) for another and a small loss of below 1% for all lenghts above 10K for a third.
2. There's a fast drop-off in speed when the array reaches a certain size that is correlated to level 2 cache size. After that, the speed does not decrease much when the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct access approach for writes in a number of cases. For reads, there's no contest: Direct access to int[] is blazingly fast.
3. The access-speed of the different implementations converges when the number of values in the array rises (think 10M+ values): The slow round-trip to main memory dwarfs the logic used for value-extraction.
Observation #3 supports Mike McCandless choice of going for the packed approach and #1 suggests using int[] as the internal structure for now. Using int[] as internal structure makes if unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?
I think we should agree on an API that LUCENE-2186 can use to
write/read packed ints, then get our two patches talking. I'll pull
out the barebones packed ints I'm currently using, and post
here... then let's merge/iterate to a common API, so I can cutover to
the patch from here in LUCENE-2186 ?

Attached patch with my current roughed up approach for packed ints
(from LUCENE-2186).

Let's try to standardize the API, then merge the two approaches, then
I'll cutover with LUCENE-2186.

It includes gen.py, which autogens dedicated decoders for each of the
nbits cases, excluding 8, 16, 32, 64 bits, since these are done with
dedicated array reader impls.

It uses a single writer (I don't think we need specialized writers),
but the writer encodes in the same byte order as
IndexOutput.writeLong, so that the byte order matches the dedicated
array reader impls.

It only encodes into long[] – we should create cases for int[]
(selected by the factory depending on 32 vs 64 bit jre).

We should also explore just reading in a full byte[] and using
Int/Short/Long buffer to decode. This API should also allow for a
future mmap impl as well.

Probably we should name all of these UnsignedPackedInts, since they
require values >= 0. (Hmm, though, the 64 bit case is tricky – I
guess we make an exception for that case).

Michael McCandless
added a comment - 13/Jan/10 12:22 Attached patch with my current roughed up approach for packed ints
(from LUCENE-2186 ).
Let's try to standardize the API, then merge the two approaches, then
I'll cutover with LUCENE-2186 .
It includes gen.py, which autogens dedicated decoders for each of the
nbits cases, excluding 8, 16, 32, 64 bits, since these are done with
dedicated array reader impls.
It uses a single writer (I don't think we need specialized writers),
but the writer encodes in the same byte order as
IndexOutput.writeLong, so that the byte order matches the dedicated
array reader impls.
It only encodes into long[] – we should create cases for int[]
(selected by the factory depending on 32 vs 64 bit jre).
We should also explore just reading in a full byte[] and using
Int/Short/Long buffer to decode. This API should also allow for a
future mmap impl as well.
Probably we should name all of these UnsignedPackedInts, since they
require values >= 0. (Hmm, though, the 64 bit case is tricky – I
guess we make an exception for that case).

Introducing yet another level of indirection and making a byte/short/int/long-prvider detached from the implementation of the packed values it tempting. I'm fairly afraid of the overhead of the extra method-calls, but I'll try it and see what happens.

I've read your (Michael McCandless) code an I can see that the tiny interfaces for Reader and Writer works well for your scenario. However, as the Reader must have (fast) random access, wouldn't it make sense to make it possible to update values? That way the same code can be used to hold ords for sorting and similar structures.

...should the index also be a long? No need to be bound by Java's 31-bit limit on array-length, although I might very well be over-engineering here.

The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

Toke Eskildsen
added a comment - 19/Jan/10 23:46 Introducing yet another level of indirection and making a byte/short/int/long-prvider detached from the implementation of the packed values it tempting. I'm fairly afraid of the overhead of the extra method-calls, but I'll try it and see what happens.
I've read your (Michael McCandless) code an I can see that the tiny interfaces for Reader and Writer works well for your scenario. However, as the Reader must have (fast) random access, wouldn't it make sense to make it possible to update values? That way the same code can be used to hold ords for sorting and similar structures.
Instead of Reader, we could use
abstract class Mutator {
public abstract long get( int index);
public abstract long set( int index, long value);
}
...should the index also be a long? No need to be bound by Java's 31-bit limit on array-length, although I might very well be over-engineering here.
The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

Introducing yet another level of indirection and making a byte/short/int/long-prvider detached from the implementation of the packed values it tempting.

You mean the layer that stores the minValue, so that the full range is
supported? I actually think we should absorb that into packed ints,
so it's only one method call per lookup, and specialize the "positive
only" cases to avoid the extra add per lookup.

With that fix, it's still a method call per lookup, but I don't see
how we can get away from that, unless we allow for exposure of the raw
array for the no-packing cases (which we could consider...).

Remember we use packed ints in places where we can accept some loss of
CPU perf. for improvements in RAM usage (see the comment I just added
to LUCENE-2186).

However, as the Reader must have (fast) random access, wouldn't it make sense to make it possible to update values?

Yeah, we do eventually want CSF to be updateable, but I don't think we
need this for phase 1? Likewise, I think all we need now for Lucene
is a "WriteOnceWriter", not a "RandomAccessWriter". Ie, you open a
writer, you add (sequentially) all values, you close.

...should the index also be a long?

I would stick with int now (we are doing this for Lucene, whose docIDs
are still ints...). Design for today.

The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

I'm confused – a future mmap impl shouldn't put pressure on the file
format used by packed ints today? Ie, a future mmap impl can use a
totally different format than the designed-to-be-slurped-into-RAM
format for packed ints, today?

Also, what do you mean by optimized for building not searching?

Note that on 32 bit machines, if there is actually a gain, we can make
a backing store with ints yet still allow for storage of nbits>32? It
"just" means a value may be split across 2 or 3 values?

Michael McCandless
added a comment - 20/Jan/10 10:16 Introducing yet another level of indirection and making a byte/short/int/long-prvider detached from the implementation of the packed values it tempting.
You mean the layer that stores the minValue, so that the full range is
supported? I actually think we should absorb that into packed ints,
so it's only one method call per lookup, and specialize the "positive
only" cases to avoid the extra add per lookup.
With that fix, it's still a method call per lookup, but I don't see
how we can get away from that, unless we allow for exposure of the raw
array for the no-packing cases (which we could consider...).
Remember we use packed ints in places where we can accept some loss of
CPU perf. for improvements in RAM usage (see the comment I just added
to LUCENE-2186 ).
However, as the Reader must have (fast) random access, wouldn't it make sense to make it possible to update values?
Yeah, we do eventually want CSF to be updateable, but I don't think we
need this for phase 1? Likewise, I think all we need now for Lucene
is a "WriteOnceWriter", not a "RandomAccessWriter". Ie, you open a
writer, you add (sequentially) all values, you close.
...should the index also be a long?
I would stick with int now (we are doing this for Lucene, whose docIDs
are still ints...). Design for today.
The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).
I'm confused – a future mmap impl shouldn't put pressure on the file
format used by packed ints today? Ie, a future mmap impl can use a
totally different format than the designed-to-be-slurped-into-RAM
format for packed ints, today?
Also, what do you mean by optimized for building not searching?
Note that on 32 bit machines, if there is actually a gain, we can make
a backing store with ints yet still allow for storage of nbits>32? It
"just" means a value may be split across 2 or 3 values?

I've made a remark at LUCENE-1410 (a first attempt at a PFOR implementation) about the header structure for encoding this.
One thing that is not covered here is how to deal with input arrays with intermediate length that are shorter than 32 and longer than 3 or 4. Shorter ones can easily be encoded as vByte.
Simple9 might be a solution, but it has only 28 data bits and 9 different encoding cases so it appears to be somewhat small.
There is first attempt at Simple9 at LUCENE-2189.

Since the discussion here is on alignment (int/long) I'm wondering how (and whether) to go from the current byte aligned structures to int aligned. Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them.
Simple9 can be int aligned and I'd like to add bigger variations of that, but peferably only ones that add a multiple of 4 bytes.

So would make sense to add functionality to IndexInput and IndexOutput to allow int aligned access?
Are java's data streams and/or nio buffers smart enough to avoid the byte shifting for ints in such cases?

Paul Elschot
added a comment - 20/Jan/10 13:29 I've made a remark at LUCENE-1410 (a first attempt at a PFOR implementation) about the header structure for encoding this.
One thing that is not covered here is how to deal with input arrays with intermediate length that are shorter than 32 and longer than 3 or 4. Shorter ones can easily be encoded as vByte.
Simple9 might be a solution, but it has only 28 data bits and 9 different encoding cases so it appears to be somewhat small.
There is first attempt at Simple9 at LUCENE-2189 .
Since the discussion here is on alignment (int/long) I'm wondering how (and whether) to go from the current byte aligned structures to int aligned. Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them.
Simple9 can be int aligned and I'd like to add bigger variations of that, but peferably only ones that add a multiple of 4 bytes.
So would make sense to add functionality to IndexInput and IndexOutput to allow int aligned access?
Are java's data streams and/or nio buffers smart enough to avoid the byte shifting for ints in such cases?

Yonik Seeley
added a comment - 20/Jan/10 14:53 Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them.
How's that? Is there some JVM intrinsic?

Paul Elschot
added a comment - 20/Jan/10 16:00
Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them.
How's that? Is there some JVM intrinsic?
In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time.
Is that what you mean by JVM intrinsic?

Yonik Seeley
added a comment - 20/Jan/10 16:25 In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time. Is that what you mean by JVM intrinsic?
Yes. But are you aware of that having been implemented in any JVMs?

Toke:
Introducing yet another level of indirection and making a byte/short/int/long-prvider detached from the implementation of the packed values it tempting.

You mean the layer that stores the minValue, so that the full range is
supported? I actually think we should absorb that into packed ints,
so it's only one method call per lookup, and specialize the "positive
only" cases to avoid the extra add per lookup.

No, I mean the backing array of ints or longs. For memory, the obvious choice is int[] or long[], but designing for flexibility calls for an API, which coincidentally is identical to Reader. So, a 4-bit Aligned could be backed by a DirectInt (which will contain an int[]) or a Persistent (doing mmap or some other persistent-oriented lookup).

...I should show this in code instead. I'll try and find the time.

Yeah, we do eventually want CSF to be updateable, but I don't think we
need this for phase 1?

Not in the specific scenario, no.

Toke:
The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

I'm confused - a future mmap impl shouldn't put pressure on the file
format used by packed ints today? Ie, a future mmap impl can use a
totally different format than the designed-to-be-slurped-into-RAM
format for packed ints, today?

Also, what do you mean by optimized for building not searching?

When we're using Aligned, the choice of using int or long for the backing array dictates how the persistent bitstream will be. If the index builder is a 64 bit machine and 7 bits/value is used, the result will be longs, each containing 9 values.

When the structure is read into memory by the searcher, the backing array will again be long. But if the searcher happens to be a 32 bit machine, the performance will be less than if ints were used for the backing array.

One way to handle this is to do a conversion, when loading into memory. If the searcher is 64 bit, it will always convert into longs. If the searcher is 32 bit, it will convert into int, if the bits/value is <= 32. The conversion is cheap, so this is no problem in itself.

However, if we're planning for the future (or for flexibility, depending on point of view), we would very much like the persistent format to be directly usable, so that we don't need to load the whole structure into memory. This rules out conversion and sets us back to step 1: The index will be optimized for either 32bit or 64 bit searchers.

Oh well, we could always just ignore it and say that Aligned is 64 bit based. As it is more memory-efficient than Aligned on 32 bit machines, maybe the slightly smaller number of backing longs will compensate for the overhead of retrieving longs on a 64 bit machine.

Note that on 32 bit machines, if there is actually a gain, we can make
a backing store with ints yet still allow for storage of nbits>32? It
"just" means a value may be split across 2 or 3 values?

My guess is that the number of values vs. the available level 2 cache will play a big role here: For a relatively small number of values, the added logic will be costly. For a larger number of values, the cache-misses will dwarf that.

Toke Eskildsen
added a comment - 20/Jan/10 16:43
Toke:
Introducing yet another level of indirection and making a byte/short/int/long-prvider detached from the implementation of the packed values it tempting.
You mean the layer that stores the minValue, so that the full range is
supported? I actually think we should absorb that into packed ints,
so it's only one method call per lookup, and specialize the "positive
only" cases to avoid the extra add per lookup.
No, I mean the backing array of ints or longs. For memory, the obvious choice is int[] or long[], but designing for flexibility calls for an API, which coincidentally is identical to Reader. So, a 4-bit Aligned could be backed by a DirectInt (which will contain an int[]) or a Persistent (doing mmap or some other persistent-oriented lookup).
...I should show this in code instead. I'll try and find the time.
Yeah, we do eventually want CSF to be updateable, but I don't think we
need this for phase 1?
Not in the specific scenario, no.
Toke:
The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).
I'm confused - a future mmap impl shouldn't put pressure on the file
format used by packed ints today? Ie, a future mmap impl can use a
totally different format than the designed-to-be-slurped-into-RAM
format for packed ints, today?
Also, what do you mean by optimized for building not searching?
When we're using Aligned, the choice of using int or long for the backing array dictates how the persistent bitstream will be. If the index builder is a 64 bit machine and 7 bits/value is used, the result will be longs, each containing 9 values.
When the structure is read into memory by the searcher, the backing array will again be long. But if the searcher happens to be a 32 bit machine, the performance will be less than if ints were used for the backing array.
One way to handle this is to do a conversion, when loading into memory. If the searcher is 64 bit, it will always convert into longs. If the searcher is 32 bit, it will convert into int, if the bits/value is <= 32. The conversion is cheap, so this is no problem in itself.
However, if we're planning for the future (or for flexibility, depending on point of view), we would very much like the persistent format to be directly usable, so that we don't need to load the whole structure into memory. This rules out conversion and sets us back to step 1: The index will be optimized for either 32bit or 64 bit searchers.
Oh well, we could always just ignore it and say that Aligned is 64 bit based. As it is more memory-efficient than Aligned on 32 bit machines, maybe the slightly smaller number of backing longs will compensate for the overhead of retrieving longs on a 64 bit machine.
Note that on 32 bit machines, if there is actually a gain, we can make
a backing store with ints yet still allow for storage of nbits>32? It
"just" means a value may be split across 2 or 3 values?
My guess is that the number of values vs. the available level 2 cache will play a big role here: For a relatively small number of values, the added logic will be costly. For a larger number of values, the cache-misses will dwarf that.

Paul: In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time. Is that what you mean by JVM intrinsic?

Yonik: Yes. But are you aware of that having been implemented in any JVMs?

I remember reading about an implementation doing that, but I can't find it back now.

If one cannot have such ints directly, there is not much point in unpacking via IndexInput.getInt(), it would be better to unpack directly from the bytes.
Also, in that case, I'd prefer to use single byte increments (above 4 byte increments) for the size of the encoded data for variations on Simple9.

Paul Elschot
added a comment - 20/Jan/10 17:30 Paul: In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time. Is that what you mean by JVM intrinsic?
Yonik: Yes. But are you aware of that having been implemented in any JVMs?
I remember reading about an implementation doing that, but I can't find it back now.
If one cannot have such ints directly, there is not much point in unpacking via IndexInput.getInt(), it would be better to unpack directly from the bytes.
Also, in that case, I'd prefer to use single byte increments (above 4 byte increments) for the size of the encoded data for variations on Simple9.

Looking at bit patterns and persistence, I see 3 different ones: Packed, aligned32 and aligned64. Regardless of whether 32bit or 64bit is used when a packed structure is created, it can be read as both 32bit and 64bit packed. As for the special cases of 8, 16, 32 and 64 bits/value, the bit patterns are identically to both packed and aligned. This leeds me to propose a header designating one of the three structures mentioned.

The current draft from Michael McCandless states both bitsPerValue and maxValue in the persistent format. It seems a redundant to have both, but I might be missing something here? Either way, the bitsPerValue is ambiguous as it does not translate to memory usage the same way for packed, aligned32 or aligned64. Should I choose, I'd go for maxValue.

Toke Eskildsen
added a comment - 22/Jan/10 08:07 Looking at bit patterns and persistence, I see 3 different ones: Packed, aligned32 and aligned64. Regardless of whether 32bit or 64bit is used when a packed structure is created, it can be read as both 32bit and 64bit packed. As for the special cases of 8, 16, 32 and 64 bits/value, the bit patterns are identically to both packed and aligned. This leeds me to propose a header designating one of the three structures mentioned.
The current draft from Michael McCandless states both bitsPerValue and maxValue in the persistent format. It seems a redundant to have both, but I might be missing something here? Either way, the bitsPerValue is ambiguous as it does not translate to memory usage the same way for packed, aligned32 or aligned64. Should I choose, I'd go for maxValue.
What about a header stating
format ( String "packed" , "aligned32" or "aligned64" )
valueCount (vInt)
maxValue (vLong)
?
I have working code for packed32 and packed64 and am currently fitting it into Michael's patch. I hope to finish it this weekend.

VByte 1 bit.
Simple9 4 bits. These cases imply valueCount and maxValue.
For around 4-16 numbers encode the complete header in 5-6 bits, also implying valueCount and maxValue.
For FrameOfRef, encoding 32 or more numbers, a larger header can be used, 4 bytes for example, maxValue is implied from the number of frame bits. valueCount could be 32, 64, 128 (i.e. 2 bits). Also the number of exceptions could be put there.

This header "type" can be chosen depending on the given length of the encoded sequence.

Paul Elschot
added a comment - 22/Jan/10 08:37 How about encoding the header something like this:
VByte 1 bit.
Simple9 4 bits. These cases imply valueCount and maxValue.
For around 4-16 numbers encode the complete header in 5-6 bits, also implying valueCount and maxValue.
For FrameOfRef, encoding 32 or more numbers, a larger header can be used, 4 bytes for example, maxValue is implied from the number of frame bits. valueCount could be 32, 64, 128 (i.e. 2 bits). Also the number of exceptions could be put there.
This header "type" can be chosen depending on the given length of the encoded sequence.

I have working code for packed32 and packed64 and am currently fitting it into Michael's patch. I hope to finish it this weekend.

Nice! Sounds like good progress Toke!

The current draft from Michael McCandless states both bitsPerValue and maxValue in the persistent format

I was only storing maxValue as a convenience for the layer above – we don't need to do that – I think storing format (packed, aligned32, aligned64) and bitsPerValue makes sense.

Regardless of whether 32bit or 64bit is used when a packed structure is created, it can be read as both 32bit and 64bit packed.

Right, but with the challenge (if we use 32bit backing array) of properly handling the nbits>32 case (this is perfectly doable... "it's just software" ).

As for the special cases of 8, 16, 32 and 64 bits/value, the bit patterns are identically to both packed and aligned.

I had chosen to match IndexOutput/Inputs's byte order (big-endian) so that the packed format naturally reads back with IndexInput's readLong/Int/Short (I added a readShort).

I'm assuming for these special cases that dedicated Reader impls, with byte[], short[], int[], long[] backing array, is faster than eg backing with a long[] and shift/masking per lookup.

But eg for the nbits=3 case, aligned 32/64 would ensure that no value spans across two underlying entries in the backing array (wasting some bits of storage in exchange). Whereas the nbits=2 or 4 cases would naturally be aligned anyway...

One question: the Reader api is now this:

long get(int index);

Which is convenient since obviously long can accommodate all of the underlying possible nbits, but... for small nbits values, this logically entails a cast. EG say nbits=8, so it's a direct byte[] backing array. get() must cast up to long, and caller must operate with long... I'm wondering whether that forced casting is going to hurt performance enough to make us want to have dedicated precision (8, 16, 32, 64) Reader interfaces....

Michael McCandless
added a comment - 22/Jan/10 08:39 I have working code for packed32 and packed64 and am currently fitting it into Michael's patch. I hope to finish it this weekend.
Nice! Sounds like good progress Toke!
The current draft from Michael McCandless states both bitsPerValue and maxValue in the persistent format
I was only storing maxValue as a convenience for the layer above – we don't need to do that – I think storing format (packed, aligned32, aligned64) and bitsPerValue makes sense.
Regardless of whether 32bit or 64bit is used when a packed structure is created, it can be read as both 32bit and 64bit packed.
Right, but with the challenge (if we use 32bit backing array) of properly handling the nbits>32 case (this is perfectly doable... "it's just software" ).
As for the special cases of 8, 16, 32 and 64 bits/value, the bit patterns are identically to both packed and aligned.
I had chosen to match IndexOutput/Inputs's byte order (big-endian) so that the packed format naturally reads back with IndexInput's readLong/Int/Short (I added a readShort).
I'm assuming for these special cases that dedicated Reader impls, with byte[], short[], int[], long[] backing array, is faster than eg backing with a long[] and shift/masking per lookup.
But eg for the nbits=3 case, aligned 32/64 would ensure that no value spans across two underlying entries in the backing array (wasting some bits of storage in exchange). Whereas the nbits=2 or 4 cases would naturally be aligned anyway...
One question: the Reader api is now this:
long get( int index);
Which is convenient since obviously long can accommodate all of the underlying possible nbits, but... for small nbits values, this logically entails a cast. EG say nbits=8, so it's a direct byte[] backing array. get() must cast up to long, and caller must operate with long... I'm wondering whether that forced casting is going to hurt performance enough to make us want to have dedicated precision (8, 16, 32, 64) Reader interfaces....

This header "type" can be chosen depending on the given length of the encoded sequence.

I think we should separate the header needed for this issue (single header stored once at the beginning of a potentially massive file), from the per-block header used by int block based formats like SimpleN/PForDelta/etc?

I think Toke's proposed header, if we swap in bitsPerValue in place of maxValue, is good for this issue?

Michael McCandless
added a comment - 22/Jan/10 08:45 This header "type" can be chosen depending on the given length of the encoded sequence.
I think we should separate the header needed for this issue (single header stored once at the beginning of a potentially massive file), from the per-block header used by int block based formats like SimpleN/PForDelta/etc?
I think Toke's proposed header, if we swap in bitsPerValue in place of maxValue, is good for this issue?

I've uploaded a preliminary patch with packed32, packed64, directByte, directShort, directInt and directLong implementations. I've used Michael McCandless patch as foundation, but the new patch is generated to be independent from the old one. It uses maxValue instead of bitsPerValue for the header, there's no test of packed32 and there's a general need for cleanup. The main missing components are aligned32 and aligned64.

I've done quite a bit of refactoring and (cheater that I am) added setters to all implementations of Reader, although not to the interface. Besides the nitty-gritty details of the implementation, I suspect that the code for selecting which implementation to use is a prime candidate for discussion. It is located in PackedInts and tries to select the best implementation based on preference for packed, aligned and direct paired with preference for 32bit and 64bit.

I think that an "auto"-value for priority is worth considering: For 9, 17 and 33 bits/value, packed is often faster than aligned due to only using half the memory and thus having lower risk of level 2 cache misses. For high bits/value, such as 30, 31, 62, 63 and 64 (guesstimating here), choosing direct seems to be the best choice for most situations. Users of PackedInts should not be expected to know this.

I'll start work on aligned32 and aligned64, but I will leave the rest of the patch alone for now, as I suspect that there'll be some changes to the current draft.

Toke Eskildsen
added a comment - 22/Jan/10 12:27 I've uploaded a preliminary patch with packed32, packed64, directByte, directShort, directInt and directLong implementations. I've used Michael McCandless patch as foundation, but the new patch is generated to be independent from the old one. It uses maxValue instead of bitsPerValue for the header, there's no test of packed32 and there's a general need for cleanup. The main missing components are aligned32 and aligned64.
I've done quite a bit of refactoring and (cheater that I am) added setters to all implementations of Reader, although not to the interface. Besides the nitty-gritty details of the implementation, I suspect that the code for selecting which implementation to use is a prime candidate for discussion. It is located in PackedInts and tries to select the best implementation based on preference for packed, aligned and direct paired with preference for 32bit and 64bit.
private static IMPLEMENTATION getImplementation(
long maxValue, PRIORITY priority, BLOCK_PREFERENCE block) {
int bits = bitsRequired(maxValue);
switch (priority) {
case direct: {
bits = getNextFixedSize(bits);
break ;
}
case aligned: {
if (block == BLOCK_PREFERENCE.bit32) {
if (bits == 7 || bits >= 11) {
bits = getNextFixedSize(bits); // Align to byte , short , int or long
}
} else {
if ((bits >= 13 && bits <= 15) || (bits >= 22)) {
bits = getNextFixedSize(bits); // Align to short , int or long
}
}
}
}
switch (bits) { // The safe choices
case 8: return IMPLEMENTATION.directByte;
case 16: return IMPLEMENTATION.directShort;
case 32: return IMPLEMENTATION.directInt;
case 63:
case 64: return IMPLEMENTATION.directLong;
}
if (priority == PRIORITY.aligned || bits == 1 || bits == 2 || bits == 4) {
return block == BLOCK_PREFERENCE.bit32 && bits < 32 ?
IMPLEMENTATION.aligned32 : IMPLEMENTATION.aligned64;
}
return block == BLOCK_PREFERENCE.bit32 && bits < 32 ?
IMPLEMENTATION.packed32 : IMPLEMENTATION.packed64;
return IMPLEMENTATION.packed64;
I think that an "auto"-value for priority is worth considering: For 9, 17 and 33 bits/value, packed is often faster than aligned due to only using half the memory and thus having lower risk of level 2 cache misses. For high bits/value, such as 30, 31, 62, 63 and 64 (guesstimating here), choosing direct seems to be the best choice for most situations. Users of PackedInts should not be expected to know this.
I'll start work on aligned32 and aligned64, but I will leave the rest of the patch alone for now, as I suspect that there'll be some changes to the current draft.

Paul Elschot
added a comment - 22/Jan/10 16:17 The generated code in the patches has quite a few switch statements to decode a single value.
These switch statements could be avoided by using something like this (adapted from the 1410b patch):
/** Decode a value from the compressed array of b bit values by retrieving the corresponding bits.
* Since numFrameBits is always smaller than the number of bits in an int ,
* at most two ints in the buffer will be used.
*/
public int decodeCompressedValueBase( int compressedPos, int numBits) {
int compressedBitPos = numBits * compressedPos;
int intIndex = (compressedBitPos >> 5);
int firstBitPosition = compressedBitPos & 31;
int value = intBuffer.get(intIndex) >>> firstBitPosition;
if ((firstBitPosition + numBits) > 32) { // value does not fit in first int
intIndex++;
value |= (intBuffer.get(intIndex) << (32 - firstBitPosition));
}
final int maxValue = ( int ) ((1L << numBits) - 1);
return value & maxValue;
}
As maxValue is essentially a mask, it could also be looked up in an array.
Could that be faster than these generated switch statements?

which looks a lot like your (Paul Elschot) suggestion. It avoids all conditionals at the cost of more bit-operations and a dummy element at the end of the backing array. I must admit that my performance testing of the different solutions has been fairly ad hoc (measure, tweak, repeat), so an appropriate test would be in order.

Toke Eskildsen
added a comment - 22/Jan/10 21:11 I think Michaels generated code was meant as a temporary solution, until a handcrafted version was available. In packed32 and packed64, the code for decoding a value is
public long get( final int index) {
final long majorBitPos = index * elementBits;
final int elementPos = ( int )(majorBitPos >>> BLOCK_BITS); // / BLOCK_SIZE
final int bitPos = ( int )(majorBitPos & MOD_MASK); // % BLOCK_SIZE);
final int base = bitPos * FAC_BITPOS;
return ((blocks[elementPos] << shifts[base]) >>> shifts[base+1]) |
((blocks[elementPos+1] >>> shifts[base+2]) & readMasks[bitPos]);
}
which looks a lot like your (Paul Elschot) suggestion. It avoids all conditionals at the cost of more bit-operations and a dummy element at the end of the backing array. I must admit that my performance testing of the different solutions has been fairly ad hoc (measure, tweak, repeat), so an appropriate test would be in order.

Nice to see a more mature alternative.
The trade off between (dummy element/unconditioned extra access) and (conditioned extra access) is of later concern.
Both the extra access and the condition take cycles, so it's not clear which one will be faster. It might even depend on the value of elementBits.

Paul Elschot
added a comment - 23/Jan/10 10:44 Nice to see a more mature alternative.
The trade off between (dummy element/unconditioned extra access) and (conditioned extra access) is of later concern.
Both the extra access and the condition take cycles, so it's not clear which one will be faster. It might even depend on the value of elementBits.

As to whether to use int or long in the interface unsigned packed int, the only numbers that will probably need to be long in the foreseeable future are docids. However this change can be delayed by not allowing an index segment to grow beyond 2^32 or 2^31-1docs, and by only implementing the long docids for multiple index segments.
So as long as it is ok to assume that an index segment can have MAXINT docs at most, we could use an int interface here.
Do Nutch and/or Solr already have long docids implemented on multiple index readers/writers or segments?

The other border is the max size of a document field. If that goes beyond MAXINT, the positions and maybe even the frequencies would need to be changed from int to long. But for now I can't think of a real use case with a document field that has more than MAXINT positions. That would be like a book with ten million pages of text. Did anyone ever run into this limitation?

Paul Elschot
added a comment - 23/Jan/10 11:48 - edited As to whether to use int or long in the interface unsigned packed int, the only numbers that will probably need to be long in the foreseeable future are docids. However this change can be delayed by not allowing an index segment to grow beyond 2^32 or 2^31-1docs, and by only implementing the long docids for multiple index segments.
So as long as it is ok to assume that an index segment can have MAXINT docs at most, we could use an int interface here.
Do Nutch and/or Solr already have long docids implemented on multiple index readers/writers or segments?
The other border is the max size of a document field. If that goes beyond MAXINT, the positions and maybe even the frequencies would need to be changed from int to long. But for now I can't think of a real use case with a document field that has more than MAXINT positions. That would be like a book with ten million pages of text. Did anyone ever run into this limitation?

I think Michaels generated code was meant as a temporary solution, until a handcrafted version was available

Actually that was intended to be a fast impl... the switch should be
compiled to a direct lookup (maybe plus a conditional to catch the
"default" case even though it will never happen...ugh). But I like
your impl with no conditional at all. We should test both.

As to whether to use int or long in the interface unsigned packed int, the only numbers that will probably need to be long in the foreseeable future are docids.

Also the file offsets into the terms dict, possibly the offsets in RAM
into the terms dict character data (UTF8 byte[]). Also, when we do
column stride fields, we allow storing values > int. I think we
should stick with long get(index) for now.

Other comments:

Maybe we should move all of this under oal.util.packed?
(packedints? ints?)

Why not store bitsPerValue in the header instead of maxValue? EG
maybe my maxValue is 7000, but because I'm using directShort,
bitsPerValue is 16. Also, the maxValue at write time should not
have to be known – eg the factory API should let me ask for a
direct short writer without declaring the maxValue I will store.

I wonder if we should add an optional Object
getDirectBackingArray(). The packed/aligned impls would return
null, but the direct byte/short/int/long impls would return their
array. This would allow callers to specialize upstream impls to
do the direct array lookup without the cast-to-long (like how
FieldComparator now has impls for byte,short,int,long). I suspect
for column stride fields, when sorting by an integer field, on a
32bit arch, this would be a perf win. But: let's wait until we
have CSFs, and we can test whether there really is a gain here....

I think we shouldn't put a getWriter on every Reader
impl... because it's a one to many mapping? Eg the format written
by PackedWriter can be read by direct byte/short/int/long,
Packed32/64.

For starters I don't think we should make reader impls that can
read nbits > 31 bits with an int[] backing array. I think long[]
backing array is fine.

I don't think we need separate PRIORITY and BLOCK_PREFERENCE?
Can't we have a single enum (STORAGE?) with: packed, aligned32,
aligned64? "Direct" is really just packed with nbits rounded up
to 8,16,32,64.

Aligned32/64 is very wasteful for certain nbits... I like the idea
of "auto" to avoid risk that caller picks a bad combination.

I think for starters we should not make any reader impls that do
remapping at load time.

Michael McCandless
added a comment - 25/Jan/10 21:21 Good progress !
I think Michaels generated code was meant as a temporary solution, until a handcrafted version was available
Actually that was intended to be a fast impl... the switch should be
compiled to a direct lookup (maybe plus a conditional to catch the
"default" case even though it will never happen...ugh). But I like
your impl with no conditional at all. We should test both.
As to whether to use int or long in the interface unsigned packed int, the only numbers that will probably need to be long in the foreseeable future are docids.
Also the file offsets into the terms dict, possibly the offsets in RAM
into the terms dict character data (UTF8 byte[]). Also, when we do
column stride fields, we allow storing values > int. I think we
should stick with long get(index) for now.
Other comments:
Maybe we should move all of this under oal.util.packed?
(packedints? ints?)
I think we should remove getMaxValue() from the Reader interface?
Why create the IMPLEMENTATION enum? Why not simply return an
[anonymous] instance of Writer?
Why not store bitsPerValue in the header instead of maxValue? EG
maybe my maxValue is 7000, but because I'm using directShort,
bitsPerValue is 16. Also, the maxValue at write time should not
have to be known – eg the factory API should let me ask for a
direct short writer without declaring the maxValue I will store.
I wonder if we should add an optional Object
getDirectBackingArray(). The packed/aligned impls would return
null, but the direct byte/short/int/long impls would return their
array. This would allow callers to specialize upstream impls to
do the direct array lookup without the cast-to-long (like how
FieldComparator now has impls for byte,short,int,long). I suspect
for column stride fields, when sorting by an integer field, on a
32bit arch, this would be a perf win. But: let's wait until we
have CSFs, and we can test whether there really is a gain here....
I think we shouldn't put a getWriter on every Reader
impl... because it's a one to many mapping? Eg the format written
by PackedWriter can be read by direct byte/short/int/long,
Packed32/64.
For starters I don't think we should make reader impls that can
read nbits > 31 bits with an int[] backing array. I think long[]
backing array is fine.
I don't think we need separate PRIORITY and BLOCK_PREFERENCE?
Can't we have a single enum (STORAGE?) with: packed, aligned32,
aligned64? "Direct" is really just packed with nbits rounded up
to 8,16,32,64.
Aligned32/64 is very wasteful for certain nbits... I like the idea
of "auto" to avoid risk that caller picks a bad combination.
I think for starters we should not make any reader impls that do
remapping at load time.

As to whether to use int or long in the interface unsigned packed int, the only numbers that will probably need to be long in the foreseeable future are docids.

Also the file offsets into the terms dict, possibly the offsets in RAM
into the terms dict character data (UTF8 byte[]). Also, when we do
column stride fields, we allow storing values > int. I think we
should stick with long get(index) for now.

Indeed I missed the offsets. However, a long get(index) will probably get in the way, because there are far fewer offsets
than normal data, and the 32 bit processors will be around for a long time. So I think we'll need both long (for the offsets) and int (for the rest) in the end.

Paul Elschot
added a comment - 25/Jan/10 21:49
As to whether to use int or long in the interface unsigned packed int, the only numbers that will probably need to be long in the foreseeable future are docids.
Also the file offsets into the terms dict, possibly the offsets in RAM
into the terms dict character data (UTF8 byte[]). Also, when we do
column stride fields, we allow storing values > int. I think we
should stick with long get(index) for now.
Indeed I missed the offsets. However, a long get(index) will probably get in the way, because there are far fewer offsets
than normal data, and the 32 bit processors will be around for a long time. So I think we'll need both long (for the offsets) and int (for the rest) in the end.

The IMPLEMENTATION enum is only used internally and is package private. It was introduced to separate decision-making from specific implementations - e.g. the packed writer is the same for packed32 and packed64, although the reader differs. But it could very well be that it confuses more than it helps.

Why not store bitsPerValue in the header instead of maxValue?

As above - I did not have the time to fix it and wanted to push the patch in order to move the discussion along.

Also, the maxValue at write time should not
have to be known - eg the factory API should let me ask for a
direct short writer without declaring the maxValue I will store.

Since Packed and Aligned needs maxValue (or bitsPerValue), this would require two distinct methods in the factory, each returning a subset of the possible implementations. I find that rather confusing.

I wonder if we should add an optional Object
getDirectBackingArray(). The packed/aligned impls would return
null, but the direct byte/short/int/long impls would return their
array. [...]

are trivial to implement for the known implementations. They open up for things like auto-growing to fit higher values by using a delegating wrapper, re-using the structure for counting purposes and sorting in-place.

I think we shouldn't put a getWriter on every Reader
impl... because it's a one to many mapping? Eg the format written
by PackedWriter can be read by direct byte/short/int/long,
Packed32/64.

Quite right.

For starters I don't think we should make reader impls that can
read nbits > 31 bits with an int[] backing array. I think long[]
backing array is fine.

The current patch limits nbits to 32 for Packed32. I am confident that an int-backed reader with nbits > 32 will be slower than a long-backed reader on a 32 bit machine.

I don't think we need separate PRIORITY and BLOCK_PREFERENCE?
Can't we have a single enum (STORAGE?) with: packed, aligned32,
aligned64? "Direct" is really just packed with nbits rounded up
to 8,16,32,64.

I agree that it does complicate matters somewhat to have them separated. When calling getReader the BLOCK_PREFERENCE should also be removed, as the block preference will always be the same as that architecture. Removing the "direct" option would require the caller to do some of the logic in some cases: If low processing requirements is a priority, direct is preferably and when the bitsPerValue is calculated, the caller would have to do the if (bitsPerValue > 32) bitsPerValue = 64 and so on.

Toke Eskildsen
added a comment - 26/Jan/10 00:17
I think we should remove getMaxValue() from the Reader interface?
Yes. I only left maxValue in the code because I ran out of time.
Why create the IMPLEMENTATION enum? Why not simply return an
[anonymous] instance of Writer?
The IMPLEMENTATION enum is only used internally and is package private. It was introduced to separate decision-making from specific implementations - e.g. the packed writer is the same for packed32 and packed64, although the reader differs. But it could very well be that it confuses more than it helps.
Why not store bitsPerValue in the header instead of maxValue?
As above - I did not have the time to fix it and wanted to push the patch in order to move the discussion along.
Also, the maxValue at write time should not
have to be known - eg the factory API should let me ask for a
direct short writer without declaring the maxValue I will store.
Since Packed and Aligned needs maxValue (or bitsPerValue), this would require two distinct methods in the factory, each returning a subset of the possible implementations. I find that rather confusing.
I wonder if we should add an optional Object
getDirectBackingArray(). The packed/aligned impls would return
null, but the direct byte/short/int/long impls would return their
array. [...]
Speaking of API additions, I find that
public int getBitsPerValue();
public int size();
public void set( long value);
public void clear();
are trivial to implement for the known implementations. They open up for things like auto-growing to fit higher values by using a delegating wrapper, re-using the structure for counting purposes and sorting in-place.
I think we shouldn't put a getWriter on every Reader
impl... because it's a one to many mapping? Eg the format written
by PackedWriter can be read by direct byte/short/int/long,
Packed32/64.
Quite right.
For starters I don't think we should make reader impls that can
read nbits > 31 bits with an int[] backing array. I think long[]
backing array is fine.
The current patch limits nbits to 32 for Packed32. I am confident that an int-backed reader with nbits > 32 will be slower than a long-backed reader on a 32 bit machine.
I don't think we need separate PRIORITY and BLOCK_PREFERENCE?
Can't we have a single enum (STORAGE?) with: packed, aligned32,
aligned64? "Direct" is really just packed with nbits rounded up
to 8,16,32,64.
I agree that it does complicate matters somewhat to have them separated. When calling getReader the BLOCK_PREFERENCE should also be removed, as the block preference will always be the same as that architecture. Removing the "direct" option would require the caller to do some of the logic in some cases: If low processing requirements is a priority, direct is preferably and when the bitsPerValue is calculated, the caller would have to do the if (bitsPerValue > 32) bitsPerValue = 64 and so on.

Also, the maxValue at write time should not have to be known - eg the factory API should let me ask for a direct short writer without declaring the maxValue I will store.

Since Packed and Aligned needs maxValue (or bitsPerValue), this would require two distinct methods in the factory, each returning a subset of the possible implementations. I find that rather confusing.

Maybe the caller just always uses the bitsRequired method to get the
required bit width per value?

Though, when we enable specializing storing of negative values as
well, that'll be a hassle...

are trivial to implement for the known implementations. They open up for things like auto-growing to fit higher values by using a delegating wrapper, re-using the structure for counting purposes and sorting in-place.

I think the first 2 make sense, but I'd rather not pursue the 2nd two
at this time. Ie, I think this API only needs write-once, and then
read-only.

If we open up random writing (set/clear), with auto-growing, etc.,
that does add complexities to the impl. EG the backing store can no
longer be final, we'd have to do some locking (or mark the array
volatile) for thread safety, etc.

As far as I can tell... Lucene today doesn't yet need random write to
the packed ints. The terms dict index and CSF are the two needs I
think we have now. Someday (when CSF supports writing) we will... but
not yet today?

I don't think we need separate PRIORITY and BLOCK_PREFERENCE? Can't we have a single enum (STORAGE?) with: packed, aligned32, aligned64? "Direct" is really just packed with nbits rounded up to 8,16,32,64.

I agree that it does complicate matters somewhat to have them separated. When calling getReader the BLOCK_PREFERENCE should also be removed, as the block preference will always be the same as that architecture. Removing the "direct" option would require the caller to do some of the logic in some cases: If low processing requirements is a priority, direct is preferably and when the bitsPerValue is calculated, the caller would have to do the if (bitsPerValue > 32) bitsPerValue = 64 and so on.

(There's a bug in the patch in PackedInts.getReader, where it switches
the block size based on whether JRE is 64 bit: it's always choosing 64
bit now).

The "direct" option only applies during writing (ie, you round up to
the nearest native type bit width). At read time it's just a packed
8/16/32/64.

Hmm... maybe we could just add an optional 2nd arg to bitsRequired, a
boolean eg "roundUpToNative" or something, which if true does that
rounding for you? (And then go back to caller computes bit width and
passes it in?).

Michael McCandless
added a comment - 04/Feb/10 19:24
Also, the maxValue at write time should not have to be known - eg the factory API should let me ask for a direct short writer without declaring the maxValue I will store.
Since Packed and Aligned needs maxValue (or bitsPerValue), this would require two distinct methods in the factory, each returning a subset of the possible implementations. I find that rather confusing.
Maybe the caller just always uses the bitsRequired method to get the
required bit width per value?
Though, when we enable specializing storing of negative values as
well, that'll be a hassle...
OK let's leave it as you must pass the maxValue for now.
Speaking of API additions, I find that
public int getBitsPerValue();
public int size();
public void set( long value);
public void clear();
are trivial to implement for the known implementations. They open up for things like auto-growing to fit higher values by using a delegating wrapper, re-using the structure for counting purposes and sorting in-place.
I think the first 2 make sense, but I'd rather not pursue the 2nd two
at this time. Ie, I think this API only needs write-once, and then
read-only.
If we open up random writing (set/clear), with auto-growing, etc.,
that does add complexities to the impl. EG the backing store can no
longer be final, we'd have to do some locking (or mark the array
volatile) for thread safety, etc.
As far as I can tell... Lucene today doesn't yet need random write to
the packed ints. The terms dict index and CSF are the two needs I
think we have now. Someday (when CSF supports writing) we will... but
not yet today?
I don't think we need separate PRIORITY and BLOCK_PREFERENCE? Can't we have a single enum (STORAGE?) with: packed, aligned32, aligned64? "Direct" is really just packed with nbits rounded up to 8,16,32,64.
I agree that it does complicate matters somewhat to have them separated. When calling getReader the BLOCK_PREFERENCE should also be removed, as the block preference will always be the same as that architecture. Removing the "direct" option would require the caller to do some of the logic in some cases: If low processing requirements is a priority, direct is preferably and when the bitsPerValue is calculated, the caller would have to do the if (bitsPerValue > 32) bitsPerValue = 64 and so on.
(There's a bug in the patch in PackedInts.getReader, where it switches
the block size based on whether JRE is 64 bit: it's always choosing 64
bit now).
The "direct" option only applies during writing (ie, you round up to
the nearest native type bit width). At read time it's just a packed
8/16/32/64.
Hmm... maybe we could just add an optional 2nd arg to bitsRequired, a
boolean eg "roundUpToNative" or something, which if true does that
rounding for you? (And then go back to caller computes bit width and
passes it in?).

Michael McCandless
added a comment - 10/Feb/10 11:23 Toke, are you still working on this...? If not, I can take a crack? I'd really like to get something online here before we land flex, so the terms dict index isn't so wasteful of RAM.

Changing the code to use bitsPerValue instead of maxValue for constructors and persistent format took a bit longer than anticipated. To get things flowing, I've attached the code as it is now. I've moved the classes to o.a.l.util.packed and performed some clenup too. It still needs aligned32 and aligned64 implementations and more cleanup, which I'll work on for the next hour today and hopefully some hours tomorrow.

One current use-case for mutable packed ints would be for StringOrdValComparator (using an auto-grow wrapper), although the gain might be small as the overhead of the Strings is so large. I understand the problem of making all packed ints mutable, but a compromise might be to have a Mutable interface and a new factory-method that returns the same implementations as Mutable instead of Reader? That way it is possible to use the implementations for things such as sorting instead of having to re-implement them. I've left the interface for Reader clean as you suggested, but kept the implementations of set in the classes for now, as the code has already been made.

Toke Eskildsen
added a comment - 10/Feb/10 14:49 Changing the code to use bitsPerValue instead of maxValue for constructors and persistent format took a bit longer than anticipated. To get things flowing, I've attached the code as it is now. I've moved the classes to o.a.l.util.packed and performed some clenup too. It still needs aligned32 and aligned64 implementations and more cleanup, which I'll work on for the next hour today and hopefully some hours tomorrow.
One current use-case for mutable packed ints would be for StringOrdValComparator (using an auto-grow wrapper), although the gain might be small as the overhead of the Strings is so large. I understand the problem of making all packed ints mutable, but a compromise might be to have a Mutable interface and a new factory-method that returns the same implementations as Mutable instead of Reader? That way it is possible to use the implementations for things such as sorting instead of having to re-implement them. I've left the interface for Reader clean as you suggested, but kept the implementations of set in the classes for now, as the code has already been made.

I've read through the comments on LUCENE-1990 and implemented most of what has been suggested. The attached patch contains implementations for all the variants we've talked about, including aligned. There's a known bug in persistence for aligned64 (and probably also for aligned32) that I haven't stomped yet. There's also a clear need for a more elaborate unit-test with regard to persistence.

Other outstanding issues, as I see them, are whether or not mutable packed arrays should be requestable (as general purpose data structures) and how the factory for creating a writer should work. I have added a getMutable-method to the factory and not touched the return type Reader for the getReader-method. That way read-only users will not be tempted to try and update the received structure. As for the arguments to the factory, Michael McCandless suggested that the preferences should be expressed with (packed | aligned32 | aligned64 | auto). As fas as I can see, this should work. However, I've only just reached this conclusion and haven't had the time to implement it.

A speed-test has been added and the results from my machine can be seen below. In order for it to be really usable, it should be tried on other machines too.

I won't touch the code before sometime next week, but I'll keep an eye on LUCENE-1990 comments until then.

But, now that we have getMutable, can we make the concrete impls
package private? Javadocs for Mutable.set should note that the size
is fixed once you allocate it. We have no way to save a
Mutable... should we add that? If so, we may want to rename Writer ->
WriteOnceWriter. This way consumers can also get a Mutable, do random
writes, then save, if the "write once" model isn't a good fit.

Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16,
Byte to 8). Because... while it is using a direct int[] under the hood,
it's really using all 32 bits for the full positive int range. So
PackedDirect32 can be used even for pos ints that would overflow a
normal java "int". (Though, for long we obviously can't use that 64th
bit for positive ints...).

The @see in the new IndexInput.readShort is wrong (referencing
writeInt).

Can you add @lucene.internal to the javadocs?

Seems like once we stomp the bugs, beef up the tests, and merge
PRIORITY and BLOCK_PREFERENCE (into maybe STORAGE?) for
the public API, we are nearly done? Thanks Toke!

Michael McCandless
added a comment - 12/Feb/10 12:10 Great progress Toke!
I guess we should do Mutable since you're so far along already
But, now that we have getMutable, can we make the concrete impls
package private? Javadocs for Mutable.set should note that the size
is fixed once you allocate it. We have no way to save a
Mutable... should we add that? If so, we may want to rename Writer ->
WriteOnceWriter. This way consumers can also get a Mutable, do random
writes, then save, if the "write once" model isn't a good fit.
Maybe we should just merge Mutable & Reader, then? (LongStore?
LongArray? PackedLongs?)
We should state clearly that these are all unsigned ints storage.
Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16,
Byte to 8). Because... while it is using a direct int[] under the hood,
it's really using all 32 bits for the full positive int range. So
PackedDirect32 can be used even for pos ints that would overflow a
normal java "int". (Though, for long we obviously can't use that 64th
bit for positive ints...).
The @see in the new IndexInput.readShort is wrong (referencing
writeInt).
Can you add @lucene.internal to the javadocs?
Seems like once we stomp the bugs, beef up the tests, and merge
PRIORITY and BLOCK_PREFERENCE (into maybe STORAGE?) for
the public API, we are nearly done? Thanks Toke!

I am sorry, but personal issues sapped my time and energy this week, so Lucene got bumped down my priority-list. I am going to code4lib next week and I'll try and get some hacking done in the plane from Denmark to USA, but that depends on whether or not there is a power socket near my seat. If I don't upload a patch late monday, it will be early march before I'll get it done

But, now that we have getMutable, can we make the concrete impls
package private? Javadocs for Mutable.set should note that the size
is fixed once you allocate it.

Agreed on both.

We have no way to save a Mutable... should we add that?

I dont know enough about persistence in Lucene to make that call. Since the writer is tied to Lucene, it would not work for general purposes, so making a writer for Mutables only seems to make sense if the user uses it to build index-structures?

Toke Eskildsen
added a comment - 20/Feb/10 21:52 I am sorry, but personal issues sapped my time and energy this week, so Lucene got bumped down my priority-list. I am going to code4lib next week and I'll try and get some hacking done in the plane from Denmark to USA, but that depends on whether or not there is a power socket near my seat. If I don't upload a patch late monday, it will be early march before I'll get it done
But, now that we have getMutable, can we make the concrete impls
package private? Javadocs for Mutable.set should note that the size
is fixed once you allocate it.
Agreed on both.
We have no way to save a Mutable... should we add that?
I dont know enough about persistence in Lucene to make that call. Since the writer is tied to Lucene, it would not work for general purposes, so making a writer for Mutables only seems to make sense if the user uses it to build index-structures?
Maybe we should just merge Mutable & Reader, then? (LongStore?
LongArray? PackedLongs?)
I don't understand that one? You made a compelling argument for returning immutables to readers earlier (problems with concurrency and having all back ends support writes).
As for the name... I don't know. None of the sound right, but I have no other suggestion.
We should state clearly that these are all unsigned ints storage.
Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16,
Byte to 8). Because... while it is using a direct int[] under the hood,
it's really using all 32 bits for the full positive int range.
Good point. The rest of your suggestions are also very valid.

I've renamed most of the classes to short form, as the "Packed"-prefix did was not that descriptive and fixed some bugs. Still pending is the mutable writer and a bug in persistence for aligned64. Good news (for Lucene at least) is that an airplane blocking snowdrift means that I have time this week for continued hacking.

But, now that we have getMutable, can we make the concrete impls
package private? Javadocs for Mutable.set should note that the size
is fixed once you allocate it.

The implementations are now package private, but I only put the note about fixed size on the getMutable-method. There's nothing wrong with creating a custom auto growing Mutable.

We should state clearly that these are all unsigned ints storage.

Done.

Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16,
Byte to 8).

Done (Direct8, Direct16, Direct32 and Direct64).

The @see in the new IndexInput.readShort is wrong (referencing
writeInt).

Fixed.

Can you add @lucene.internal to the javadocs?

Should this also be applied to package private classes? Marking those as internal seems redundant.

Seems like once we stomp the bugs, beef up the tests, and merge
PRIORITY and BLOCK_PREFERENCE (into maybe STORAGE?) for
the public API, we are nearly done?

I've removed BLOCK_PREFERENCE from the API. It's still used internally, mainly to do controlled testing. Tests are beefed up (and currently fails for aligned, so clearly beefing worked).

Toke Eskildsen
added a comment - 23/Feb/10 15:33 I've renamed most of the classes to short form, as the "Packed"-prefix did was not that descriptive and fixed some bugs. Still pending is the mutable writer and a bug in persistence for aligned64. Good news (for Lucene at least) is that an airplane blocking snowdrift means that I have time this week for continued hacking.
But, now that we have getMutable, can we make the concrete impls
package private? Javadocs for Mutable.set should note that the size
is fixed once you allocate it.
The implementations are now package private, but I only put the note about fixed size on the getMutable-method. There's nothing wrong with creating a custom auto growing Mutable.
We should state clearly that these are all unsigned ints storage.
Done.
Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16,
Byte to 8).
Done (Direct8, Direct16, Direct32 and Direct64).
The @see in the new IndexInput.readShort is wrong (referencing
writeInt).
Fixed.
Can you add @lucene.internal to the javadocs?
Should this also be applied to package private classes? Marking those as internal seems redundant.
Seems like once we stomp the bugs, beef up the tests, and merge
PRIORITY and BLOCK_PREFERENCE (into maybe STORAGE?) for
the public API, we are nearly done?
I've removed BLOCK_PREFERENCE from the API. It's still used internally, mainly to do controlled testing. Tests are beefed up (and currently fails for aligned, so clearly beefing worked).

I have done some tests (see the TODO-file in the attached patch) and on 64 bit machines, the difference in access-speed for aligned vs. packed is not that great and not always in favor of aligned. Probably because some space is wasted and the RAM-cache is not so well utilized. If this is also the case for 32 bit machines, I vote for removing aligned and only used packed with the special-case optimizations direct8, direct16, direct32 and direct64. This would also mean that there is only one persistent format.

Toke Eskildsen
added a comment - 26/Feb/10 14:03 Now we're getting somewhere. I finally squashed the persistence bug and the tests has been turned up another notch. Everything seems to run as it should. Pending issues, as I see them:
Review of the code
Should we make a MutableWriter?
Should we drop support for aligned?
The last one is interesting. The code for getting a value from aligned uses devision and a single RAM-request:
public long get( final int index) {
final int blockPos = index / valuesPerBlock;
final int bitPos = (index - (blockPos * valuesPerBlock)) * bitsPerValue;
return (blocks[blockPos] >>> shifts[bitPos]) & readMask;
where the code for packed uses shift and two RAM-requests:
final long majorBitPos = index * bitsPerValue;
final int elementPos = ( int )(majorBitPos >>> BLOCK_BITS); // / BLOCK_SIZE
final int bitPos = ( int )(majorBitPos & MOD_MASK); // % BLOCK_SIZE);
final int base = bitPos * FAC_BITPOS;
return ((blocks[elementPos] << shifts[base]) >>> shifts[base+1]) |
((blocks[elementPos+1] >>> shifts[base+2]) & readMasks[bitPos]);
I have done some tests (see the TODO-file in the attached patch) and on 64 bit machines, the difference in access-speed for aligned vs. packed is not that great and not always in favor of aligned. Probably because some space is wasted and the RAM-cache is not so well utilized. If this is also the case for 32 bit machines, I vote for removing aligned and only used packed with the special-case optimizations direct8, direct16, direct32 and direct64. This would also mean that there is only one persistent format.
java -cp lucene-core-3.1-dev.jar org.apache.lucene.util.packed.PackedIntsPerformance
Runs throught the performance tests and delivers a simple report, so it should be very easy to test on different platforms. It only measures access speed.
I consider this patch ready for review and concentrate on other matters until I hear more.

I couldn't help making a tiny tweak to the performance test so that it outputs execution time means for the different implementations. I have attached measurements from 5 different 64 bit machines. Looking at the means, I observe the following:

i7 Q820 and Xeon L5420: Practically no difference between aligned and packed with a small edge to aligned

Core 2 and Xeon 5148: Aligned is consistently about 10% slower than packed

Toke Eskildsen
added a comment - 26/Feb/10 15:49 I couldn't help making a tiny tweak to the performance test so that it outputs execution time means for the different implementations. I have attached measurements from 5 different 64 bit machines. Looking at the means, I observe the following:
i7 Q820 and Xeon L5420: Practically no difference between aligned and packed with a small edge to aligned
Core 2 and Xeon 5148: Aligned is consistently about 10% slower than packed
Xeon MP (old with just 1 MB CPU cache): Aligned ranges from 0-10% slower than packed, depending on bits/value
The direct implementations outperforms packed and aligned for all sane cases (using direct8 to hold only 1 bit/value is clearly a bad idea). No surprise there.
Caveat: The tests were run without any other significantly resource heavy processes disturbing it. This means that there were no fighting for the CPU cache.
Major caveat: Tests are needed on other processors than 64 bit Intel.
I would be great if someone could figure out how to make an aligned getter without using division as that is surely the thing that hampers aligned performance.

Michael McCandless
added a comment - 26/Feb/10 20:04 Great progress! I think this is very close.
Airplane blocking snow drifts!? Where on earth are you anyway?
Can you add @lucene.internal to the javadocs?
Should this also be applied to package private classes? Marking those as internal seems redundant.
Yeah I agree package private APIs don't need the @lucene.internal...
It's very interesting that align is never a win – I think in that
case removing it makes sense? It'll be a nice simplification.
I think we don't need to make a MutableWriter, at least before
committing? Nobody needs it now... (I think?).
Other small things:
Can you use @lucene.internal instead of the NOTE that I had put on
the classes?
We lost "final" in the RamUsageEstimator constants
Did we ever test performance of the specialized (generated)
decoders using switch statements?

In Denmark. The guy responsible for clearing the runway did indeed clear the runway. He just forgot that the plane needs to taxi into the runway in the first place. That made us miss our connecting flight.

It's very interesting that align is never a win – I think in that case removing it makes sense? It'll be a nice simplification.

Well, practically never wins for the machines I tested on and never wins with my implementation.

Can you use @lucene.internal instead of the NOTE that I had put on the classes?

Done... I think. I'm not very good at this part, so if someone else wants to do some cleanup i JavaDoc and such, they are very welcome by me.

We lost "final" in the RamUsageEstimator constants

Strange. Oh well, fixed.

Did we ever test performance of the specialized (generated) decoders using switch statements?

I just did a quick hack in order to measure performance and I was very surprised that the generated switch-based implementations performs so well. It's nearly on par with packed most of the time and exceeds it in some cases. I only tested on 3 machines though. The hack is in the LUCENE-1990-te20100226c.patch and is called when the performance test is executed.

Attachment generated_performance-te20100226.txt contains measurements where the py-generated code is tested together with the other implementations.

Note to self: Switch is not equivalent to a series of if-else, when we're talking performance and when we switch without omissions in the cases.

Toke Eskildsen
added a comment - 26/Feb/10 22:55
Airplane blocking snow drifts!? Where on earth are you anyway?
In Denmark. The guy responsible for clearing the runway did indeed clear the runway. He just forgot that the plane needs to taxi into the runway in the first place. That made us miss our connecting flight.
It's very interesting that align is never a win – I think in that case removing it makes sense? It'll be a nice simplification.
Well, practically never wins for the machines I tested on and never wins with my implementation.
Can you use @lucene.internal instead of the NOTE that I had put on the classes?
Done... I think. I'm not very good at this part, so if someone else wants to do some cleanup i JavaDoc and such, they are very welcome by me.
We lost "final" in the RamUsageEstimator constants
Strange. Oh well, fixed.
Did we ever test performance of the specialized (generated) decoders using switch statements?
I just did a quick hack in order to measure performance and I was very surprised that the generated switch-based implementations performs so well. It's nearly on par with packed most of the time and exceeds it in some cases. I only tested on 3 machines though. The hack is in the LUCENE-1990 -te20100226c.patch and is called when the performance test is executed.
Attachment generated_performance-te20100226.txt contains measurements where the py-generated code is tested together with the other implementations.
Note to self: Switch is not equivalent to a series of if-else, when we're talking performance and when we switch without omissions in the cases.

In Denmark. The guy responsible for clearing the runway did indeed clear the runway. He just forgot that the plane needs to taxi into the runway in the first place. That made us miss our connecting flight.

Good grief!

It's very interesting that align is never a win - I think in that case removing it makes sense? It'll be a nice simplification.

Well, practically never wins for the machines I tested on and never wins with my implementation.

I think we should remove it...

Did we ever test performance of the specialized (generated) decoders using switch statements?

I just did a quick hack in order to measure performance and I was very surprised that the generated switch-based implementations performs so well. It's nearly on par with packed most of the time and exceeds it in some cases. I only tested on 3 machines though. The hack is in the LUCENE-1990-te20100226c.patch and is called when the performance test is executed.

Note to self: Switch is not equivalent to a series of if-else, when we're talking performance and when we switch without omissions in the cases.

Right, if the switch cases are compact, it should compile into a fast jump
table (though it may still do an unecessary bounds check).

I think, once we removed aligned, this is ready to commit? I think we
should land this on flex branch? (It's using CodecUtil, BytesRef –
I'll merge them when I commit). Then I can cutover the terms index to
use packed ints.

Michael McCandless
added a comment - 27/Feb/10 11:13
Airplane blocking snow drifts!? Where on earth are you anyway?
In Denmark. The guy responsible for clearing the runway did indeed clear the runway. He just forgot that the plane needs to taxi into the runway in the first place. That made us miss our connecting flight.
Good grief!
It's very interesting that align is never a win - I think in that case removing it makes sense? It'll be a nice simplification.
Well, practically never wins for the machines I tested on and never wins with my implementation.
I think we should remove it...
Did we ever test performance of the specialized (generated) decoders using switch statements?
I just did a quick hack in order to measure performance and I was very surprised that the generated switch-based implementations performs so well. It's nearly on par with packed most of the time and exceeds it in some cases. I only tested on 3 machines though. The hack is in the LUCENE-1990 -te20100226c.patch and is called when the performance test is executed.
Thanks for testing this! It is interesting.
I ran the perf test on a CentOS 5.4 machine, java
1.6.0_17-b04 64 bit server, Intel core 2 duo E8400 (3 ghz) – attached
perf-mkm-20100227.txt. I also show the switch impl close, though
always a bit behind.
Seems like we should just stick with the non-gen'd packed impl?
Note to self: Switch is not equivalent to a series of if-else, when we're talking performance and when we switch without omissions in the cases.
Right, if the switch cases are compact, it should compile into a fast jump
table (though it may still do an unecessary bounds check).
I think, once we removed aligned, this is ready to commit? I think we
should land this on flex branch? (It's using CodecUtil, BytesRef –
I'll merge them when I commit). Then I can cutover the terms index to
use packed ints.

I've tested on two 32 bit Windows machines: An Intel T2400 (32 bit only) running XP and an Athlon X2 4850e (64 bit capable) running 32 bit XP. The result can be seen in attachment performance-20100301.txt. Something curious happens with high (32+) bits/value for the T2400 as aligned overtakes packed. However, the overall picture is still that aligned only wins for a few special cases, so now I'll be happy to remove it from the patch. As a note, generated is also slower than packed on the AMD processor, although not as much as for Intel.

I have removed all traces of aligned from PackedInts, but kept the classes in the patch, in the case that someone finds a faster way to handle aligned. PackedIntsPerformance still includes both the generated switch-implementation and Aligned32 and Aligned64. It should be possible to apply the patch without Aligned32, Aligned64, AlignedWriter and PackedIntsPerformance.

Toke Eskildsen
added a comment - 01/Mar/10 14:11 I've tested on two 32 bit Windows machines: An Intel T2400 (32 bit only) running XP and an Athlon X2 4850e (64 bit capable) running 32 bit XP. The result can be seen in attachment performance-20100301.txt. Something curious happens with high (32+) bits/value for the T2400 as aligned overtakes packed. However, the overall picture is still that aligned only wins for a few special cases, so now I'll be happy to remove it from the patch. As a note, generated is also slower than packed on the AMD processor, although not as much as for Intel.
I have removed all traces of aligned from PackedInts, but kept the classes in the patch, in the case that someone finds a faster way to handle aligned. PackedIntsPerformance still includes both the generated switch-implementation and Aligned32 and Aligned64. It should be possible to apply the patch without Aligned32, Aligned64, AlignedWriter and PackedIntsPerformance.

Some thoughts on avoiding the generic division by experimenting with reciprocal multiplication: For aligned, the sane number of values/block are [3, 5, 6, 7, 8, 9, 10, 16, 21, 32, 64]. I tried testing index from 0 to Integer.MAX_VALUE with these divisors and reciprocal multiplication. It worked perfectly for all divisors except [5, 7, 9, 10, 21]. Unfortunately it already falls for divisor 21 at index 252645140, which makes it useless as a full replacement. If one were so inclined, it would be possible to select aligned implementation based on valueCount, with fallback to the "slow" version. The gain of using fast division seems quite substantial as it makes aligned 14-40% faster than packed (note: Just tested on a single machine). However, re-introducing aligned with four different implementations (Aligned32, Aligned32Fast, Aligned64, Aligned64Fast) is rather daunting and it would make the selection code really messy.

I can see that there are well-known tricks to get around the rounding errors. Some are described at http://www.cs.uiowa.edu/~jones/bcd/divide.html#fixed . I don't know if these extra tricks would negate the 14-40% speed gain though. Since I would like to get the patch out of the door, I vote for keeping aligned disabled and just note that more bit fiddling might make it attractive at some point.

Toke Eskildsen
added a comment - 01/Mar/10 20:56 Some thoughts on avoiding the generic division by experimenting with reciprocal multiplication: For aligned, the sane number of values/block are [3, 5, 6, 7, 8, 9, 10, 16, 21, 32, 64] . I tried testing index from 0 to Integer.MAX_VALUE with these divisors and reciprocal multiplication. It worked perfectly for all divisors except [5, 7, 9, 10, 21] . Unfortunately it already falls for divisor 21 at index 252645140, which makes it useless as a full replacement. If one were so inclined, it would be possible to select aligned implementation based on valueCount, with fallback to the "slow" version. The gain of using fast division seems quite substantial as it makes aligned 14-40% faster than packed (note: Just tested on a single machine). However, re-introducing aligned with four different implementations (Aligned32, Aligned32Fast, Aligned64, Aligned64Fast) is rather daunting and it would make the selection code really messy.
I can see that there are well-known tricks to get around the rounding errors. Some are described at http://www.cs.uiowa.edu/~jones/bcd/divide.html#fixed . I don't know if these extra tricks would negate the 14-40% speed gain though. Since I would like to get the patch out of the door, I vote for keeping aligned disabled and just note that more bit fiddling might make it attractive at some point.

I think we shouldn't add Aligned*.java to svn? It'll just add
unused bits to the JAR, and, we can always fallback to this issue
to pull them in at a future time?

Can you resolve the remaining nocommits? EG (since we are
unsigned) we can't get the 64 bit case working. I don't think we
should rename to UnsignedXXX, nor, support minValue at this
point, and remove the ComparableBytesRef, and I'll merge BytesRef
into flex's when I commit.

I can take these too – I think it's ready to commit on flex after
this. Thanks!

Michael McCandless
added a comment - 07/Mar/10 11:51 Patch looks great Toke – a few small things:
I think we shouldn't add Aligned*.java to svn? It'll just add
unused bits to the JAR, and, we can always fallback to this issue
to pull them in at a future time?
Can you resolve the remaining nocommits? EG (since we are
unsigned) we can't get the 64 bit case working. I don't think we
should rename to UnsignedXXX, nor, support minValue at this
point, and remove the ComparableBytesRef, and I'll merge BytesRef
into flex's when I commit.
I can take these too – I think it's ready to commit on flex after
this. Thanks!

I think we shouldn't add Aligned*.java to svn? It'll just add
unused bits to the JAR, and, we can always fallback to this issue
to pull them in at a future time?

I agree. At the current state, Aligned is just dead weight.

This also means that the performance tester won't be part of the commit though. I can quickly make a performance tester that does not use aligned, if it is preferable to keep performance testing.

Can you resolve the remaining nocommits? EG (since we are
unsigned) we can't get the 64 bit case working. I don't think we
should rename to UnsignedXXX, nor, support minValue at this
point, and remove the ComparableBytesRef, and I'll merge BytesRef
into flex's when I commit.

I can take these too - I think it's ready to commit on flex after this

Toke Eskildsen
added a comment - 09/Mar/10 12:33 Michael McCandless:
I think we shouldn't add Aligned*.java to svn? It'll just add
unused bits to the JAR, and, we can always fallback to this issue
to pull them in at a future time?
I agree. At the current state, Aligned is just dead weight.
This also means that the performance tester won't be part of the commit though. I can quickly make a performance tester that does not use aligned, if it is preferable to keep performance testing.
Can you resolve the remaining nocommits? EG (since we are
unsigned) we can't get the 64 bit case working. I don't think we
should rename to UnsignedXXX, nor, support minValue at this
point, and remove the ComparableBytesRef, and I'll merge BytesRef
into flex's when I commit.
I can take these too - I think it's ready to commit on flex after this
It will help a lot if you take care of these issues, thanks.

Toke Eskildsen
added a comment - 09/Mar/10 12:34 Uwe Schindler:
We should also add the @lucene.internal javadoc comments everywhere instead of the big NOTE. Why has one class a full-uppercase class name?
Are you looking at patch LUCENE-1990 -te20100301.patch? I don't see any NOTE and no full-uppercase class name?

Michael McCandless
added a comment - 09/Mar/10 18:23 We should also add the @lucene.internal javadoc comments everywhere instead of the big NOTE.
I found one more NOTE in CodecUtil – I'll fix before committing.

Michael McCandless
added a comment - 09/Mar/10 18:26 Why has one class a full-uppercase class name?
Uwe you mean eg STORAGE? (And also BLOCK, IMPLEMENTATION... but they are package private). These are enums – seems OK to make them all caps?
Though I do think we can simplify some of this now that we're removing the aligned case... eg PERSISTENCE is an enum with only one value. I'll take a stab & post patch.

Michael McCandless
added a comment - 09/Mar/10 22:14 OK new patch attached:
Ported to flex, and cutover to CodecUtil. BytesRef required no
changes...
Simplified the API/impl to not use STORAGE, PERSISTENCE,
IMPLEMENTATION, etc. You just specify required bitsPerValue.
Removed Aligned*, and ConsumesRAM interface
Fixed the nocommits.
I think it's ready! I'll wait a day or two...

By the way Toke, we have lately been benchmarking automaton queries, which are pretty intensive on the terms dictionary.
I think we were expecting some acceptable slowdown once we switched to packed ints, but according to my benchmarks this is not the case.

I think its pretty impressive to see no measurable performance impact on this stuff at all, great work.

Robert Muir
added a comment - 27/Mar/10 17:04 By the way Toke, we have lately been benchmarking automaton queries, which are pretty intensive on the terms dictionary.
I think we were expecting some acceptable slowdown once we switched to packed ints, but according to my benchmarks this is not the case.
I think its pretty impressive to see no measurable performance impact on this stuff at all, great work.

I am very happy to hear that, Robert. The benchmarks I made had the glaring flaw that they were ... well, benchmarks. With the CPU-cache being hammered in a real world scenario, your findings indicate that the slow round-trip to main memory dwarfs the extra logic for extracting the values from the packed structure. For a few scenarios, it might even be faster than plain arrays.

Getting back to reality, my own findings indicates that using PackedInts for ord-based sorted search is not at all faster than plain arrays. The access pattern here is very sequential, so the chance that the needed value is already fetched from main memory is high for both plain and packed structures.

Toke Eskildsen
added a comment - 31/Mar/10 11:28 I am very happy to hear that, Robert. The benchmarks I made had the glaring flaw that they were ... well, benchmarks. With the CPU-cache being hammered in a real world scenario, your findings indicate that the slow round-trip to main memory dwarfs the extra logic for extracting the values from the packed structure. For a few scenarios, it might even be faster than plain arrays.
Getting back to reality, my own findings indicates that using PackedInts for ord-based sorted search is not at all faster than plain arrays. The access pattern here is very sequential, so the chance that the needed value is already fetched from main memory is high for both plain and packed structures.

Toke, are sort ords that much slower than straight arrays for sorting?

After flex lands I'd really like to make a variant of FieldCache.StringIndex that uses BytesRef for the values and packed ints for the ords.... should save alot of memory in many cases (English text saves since utf8 is 1 byte per char; enumerated fields (eg country name) should save tons by using only a few bits instead of 32 we now always use) when sorting by string.

Michael McCandless
added a comment - 31/Mar/10 11:39 Toke, are sort ords that much slower than straight arrays for sorting?
After flex lands I'd really like to make a variant of FieldCache.StringIndex that uses BytesRef for the values and packed ints for the ords.... should save alot of memory in many cases (English text saves since utf8 is 1 byte per char; enumerated fields (eg country name) should save tons by using only a few bits instead of 32 we now always use) when sorting by string.

In the original proof of concept for LUCENE-2335, I measured the time for extracting top-20 for ... 10 million? documents and got something like 600ms when using PackedInts, which is fairly slow in my book and recall getting better performance with straight arrays for that. This is all wery fuzzy though and I'd love to be proven wrong. If PackedInts are faster for sorting too, it's getting very hard to see the downside of that representation.

Since LUCENE-2335 relies heavily on arays of sorted indexes into ordinal arrays, I'll make sure to performance test both PackedInts and straight arrays for sorted search.

Toke Eskildsen
added a comment - 31/Mar/10 20:51 In the original proof of concept for LUCENE-2335 , I measured the time for extracting top-20 for ... 10 million? documents and got something like 600ms when using PackedInts, which is fairly slow in my book and recall getting better performance with straight arrays for that. This is all wery fuzzy though and I'd love to be proven wrong. If PackedInts are faster for sorting too, it's getting very hard to see the downside of that representation.
Since LUCENE-2335 relies heavily on arays of sorted indexes into ordinal arrays, I'll make sure to performance test both PackedInts and straight arrays for sorted search.

It seem like my unit-testing of PackedInts.Mutable wasn't good enough. There is a bug in Packed64 (and probably in Packed32 too) when using the set-method. In certain cases the secondary block is changed when it should be left alone. A simple unit-test is

Toke Eskildsen
added a comment - 01/Apr/10 22:24 - edited It seem like my unit-testing of PackedInts.Mutable wasn't good enough. There is a bug in Packed64 (and probably in Packed32 too) when using the set-method. In certain cases the secondary block is changed when it should be left alone. A simple unit-test is
PackedInts.Mutable mutable = PackedInts.getMutable(26, 5);
mutable.set(24, 31);
mutable.set(4, 16);
assertEquals( "The value #24 should remain unchanged" , 31, mutable.get(24));
The PackedWriter uses a different algorithm for generating the bit stream and is unaffected by this bug.
I expect the write-masks for the set-method to be at fault and I am working on a fix. ETA: Within an hour or sometime during the weekend, depending on difficulty.

Without checking thoroughly, I'd expect the two pieces of code to be exactly the same, at the difference between Packed32 and Packed64 is just long vs. int and some constants. The unit-test from above can be used for Packed32 by explicitly creating a Packed32 instead of calling the factory.

I'll be back behind the screen in a few days where I can make a patch, but you are more than welcome to roll the patch if it is more convenient to get it immediately.

Michael McCandless
added a comment - 01/Apr/10 23:06 I turned it into a patch (attached).
But: without the fix, when I run the test, I don't see it failing (on current flex branch). I've fixed a few issues with packed ints on flex, but I don't think they would've fixed this. Confused....

My initial unit-test contained an error, which I corrected after a minute or two (as far as I remember). Maybe you used the first version?

It seems that the bug is indeed in trunk. It corrupts the value of the block after the current block in certain cases: Sequential assignment of values works fine, but out-of-order assignments corrupts the array.

Toke Eskildsen
added a comment - 06/Apr/10 21:08 I did a checkout with
svn co https: //svn.apache.org/repos/asf/lucene/dev/trunk/lucene lucene-flex
and added the following method to TestPackedInts
public void testSecondaryBlockChange() throws IOException {
PackedInts.Mutable mutable = new Packed64(26, 5);
mutable.set(24, 31);
assertEquals( "The value #24 should be correct" , 31, mutable.get(24));
mutable.set(4, 16);
assertEquals( "The value #24 should remain unchanged" , 31, mutable.get(24));
}
after which I ran
ant test-core
which gave me
[junit] Testsuite: org.apache.lucene.util.packed.TestPackedInts
[junit] Testcase: testSecondaryBlockChange(org.apache.lucene.util.packed.TestPackedInts): FAILED
[junit] The value #24 should remain unchanged expected:<31> but was:<28>
[junit] junit.framework.AssertionFailedError: The value #24 should remain unchanged expected:<31> but was:<28>
[junit] at org.apache.lucene.util.packed.TestPackedInts.testSecondaryBlockChange(TestPackedInts.java:106)
[junit] at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:276)
then I added
if (bitPos <= BLOCK_SIZE - elementBits) { // Second block not used
currentMasks[base+1] = ~0; // Keep all bits
currentMasks[base+2] = 0; // Or with 0
}
to the relevant parts of Packed32 and Packed, as described above and ran
ant test-core
again, which gave me
[junit] Testsuite: org.apache.lucene.util.packed.TestPackedInts
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 5.463 sec
My initial unit-test contained an error, which I corrected after a minute or two (as far as I remember). Maybe you used the first version?
It seems that the bug is indeed in trunk. It corrupts the value of the block after the current block in certain cases: Sequential assignment of values works fine, but out-of-order assignments corrupts the array.

Michael McCandless
added a comment - 06/Apr/10 21:46 OK indeed now I can see the failure when I add the test (I guess I did use the first version), and the changes fix it. I'll commit shortly. Thanks Toke!

I have discovered a bug in Packed32 and Packed64: When the number of bits exceed 2^31, the setters and getters fail. This is due to a missing cast in the calculation of the entry-point in the backing int/long-arrays.

Toke Eskildsen
added a comment - 31/Aug/10 09:41 I have discovered a bug in Packed32 and Packed64: When the number of bits exceed 2^31, the setters and getters fail. This is due to a missing cast in the calculation of the entry-point in the backing int/long-arrays.
In both Packed32 and Packed64 the line
final long majorBitPos = index * bitsPerValue;
is used in get as well as set. This should be
final long majorBitPos = ( long )index * bitsPerValue;
in all 4 cases.
A unit test for this is
/*
Check if the structures properly handle the case where
index * bitsPerValue > Integer .MAX_VALUE
*/
public void testIntOverflow() {
int INDEX = ( int ) Math .pow(2, 30);
int BITS = 4;
Packed32 p32 = new Packed32(INDEX, BITS);
p32.set(INDEX-1, 1);
assertEquals( "The value at position " + (INDEX-1)
+ " should be correct for Packed32" , 1, p32.get(INDEX-1));
p32 = null ; // To free 512MB
Packed64 p64 = new Packed64(INDEX, BITS);
p64.set(INDEX-1, 1);
assertEquals( "The value at position " + (INDEX-1)
+ " should be correct for Packed64" , 1, p64.get(INDEX-1));
}
One big problem with the unit-test is that it requires 2^30*4/8 bytes = 512MB of heap. I am guessing that this makes it impossible to run in the standard test-suite.
I am unsure as to how I should push this fix through. Should I create a new JIRA issue? Make a patch against trunk? Or maybe a committer could just try the test above and insert the fix in trunk?

One big problem with the unit-test is that it requires 2^30*4/8 bytes = 512MB of heap. I am guessing that this makes it impossible to run in the standard test-suite.

Seems to be a bit high for a unittest but you can't help it, right?

I am unsure as to how I should push this fix through. Should I create a new JIRA issue? Make a patch against trunk? Or maybe a committer could just try the test above and insert the fix in trunk?

I would suggest to create a new issue and attach a patch with the fix including your unittest. Since the unittest might break hudson etc I would recommend to add an @Ignore on top of it (JUnit 4) until we decided how to include tests like that. Maybe we might introduce a special flag that enables tests like that with a JUnit Assume call but that needs to be further discussed.

Simon Willnauer
added a comment - 31/Aug/10 10:01 One big problem with the unit-test is that it requires 2^30*4/8 bytes = 512MB of heap. I am guessing that this makes it impossible to run in the standard test-suite.
Seems to be a bit high for a unittest but you can't help it, right?
I am unsure as to how I should push this fix through. Should I create a new JIRA issue? Make a patch against trunk? Or maybe a committer could just try the test above and insert the fix in trunk?
I would suggest to create a new issue and attach a patch with the fix including your unittest. Since the unittest might break hudson etc I would recommend to add an @Ignore on top of it (JUnit 4) until we decided how to include tests like that. Maybe we might introduce a special flag that enables tests like that with a JUnit Assume call but that needs to be further discussed.

This still triggers the bug but requires "only" 256 MB. Is this acceptable in the Hudson environment?

The default maxMemory size is wired at 512MB. i think this might be too large already: on my machine with 4 cpus
this means max 2GB for lucene and 4GB for solr tests.

one way to do this would be to make this maxMemory configurable with a -D, set it accordingly in hudson, and
also setup a way to write 'hudson-only' tests. then stuff like this could run in the nightly only.

Robert Muir
added a comment - 01/Sep/10 02:31 This still triggers the bug but requires "only" 256 MB. Is this acceptable in the Hudson environment?
The default maxMemory size is wired at 512MB. i think this might be too large already: on my machine with 4 cpus
this means max 2GB for lucene and 4GB for solr tests.
one way to do this would be to make this maxMemory configurable with a -D, set it accordingly in hudson, and
also setup a way to write 'hudson-only' tests. then stuff like this could run in the nightly only.