On 2013-05-19, at 12:50 PM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
> On 16/05/2013 15:50, David Chase wrote:
>> webrev: http://cr.openjdk.java.net/~drchase/7088419/webrev.01/
>>
>> problem: Some applications spend a surprising amount of time computing CRC32s
>> (Not sure I'm supposed to be precise on an open mailing list). Recent Intel
>> architectures provide instructions that might be useful in addressing this.
>> See https://jbs.oracle.com/bugs/browse/JDK-7088419
>>
>> I turned this into a general attack on performance in Adler32 and CRC32, partly because the bug report was not clear on the size of the problematic inputs. The general approach turned out to be useful, because it's tricky to get the small-case overheads down for the accelerated-instruction version of the code.
> I took a second pass over the webrev. If we're going to have assembly in CRC32.c then it makes me wonder if this should be the place to probe whether CLMUL is supported. Clearly there may be other cases where the VM needs to probe too, but I'm just wondering if the capability really needs to be communicated via a property (at least for now). The other thing that comes to mind is whether we could re-write it in Java in a way that allows C2 to generate code that uses CLMUL. I realize that is probably a much bigger effort, but I can imagine other cases (like crypto) where it could be useful too.
The reason for passing it through as a property is two-fold. First, it's a mess of code to probe for that capability, and I'd rather not duplicate it. Second, for better or worse, right now we have command-line flags to disable the use of these hardware features, for example -XX:-UseCLMUL and -XX:-UseAES. Routing this through the JVM and a property means that it all goes together.
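To illustrate the shape of that plumbing (the property name below is invented for illustration; the real one would be whatever the JVM exports when the flag is on):

```java
public class FastPathGate {
    // Hypothetical property name, for illustration only; the real value
    // would be set by the VM when -XX:+UseCLMUL is in effect.
    static final boolean USE_CLMUL =
        Boolean.parseBoolean(System.getProperty("sun.zip.clmul", "false"));

    static String path() {
        return USE_CLMUL ? "clmul path" : "scalar path";
    }

    public static void main(String[] args) {
        // With the property unset, the scalar fallback is chosen.
        System.out.println(path());
    }
}
```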
I think it would be a Big Deal to get this working through the compiler in any quick way, but it's being discussed for the future. I tried using the Magic Instruction in simpler idioms, and it was slower, not faster. We also need to be certain of alignment, and that's not possible in the current Unsafe interface (the code that you see there is making a best-effort guess).
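To sketch the best-effort alignment guess (addresses and the 16-byte target here are illustrative, not the actual code):

```java
// Sketch only: given a (hypothetical) start address, compute how many
// scalar-prefix bytes to consume before the pointer is aligned for a
// wide CLMUL loop. The real code can only make a best-effort guess,
// since the current Unsafe interface doesn't guarantee alignment.
public class AlignSketch {
    static int prefixToAlign(long addr, int align) {
        long mis = addr & (align - 1); // bytes past the last aligned boundary
        return mis == 0 ? 0 : (int) (align - mis);
    }

    public static void main(String[] args) {
        System.out.println(prefixToAlign(0x1003L, 16)); // 13 bytes to reach 0x1010
        System.out.println(prefixToAlign(0x1000L, 16)); // already aligned: 0
    }
}
```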
Also, the reason the assembly language is there is not because it is technically necessary, but rather because we get our build compilers (at least for now, I think it will change soon) from the software museum. I compiled the C version of the algorithm with clang, and with gcc 4.6, and (almost) with Solaris Studio 12.3. Visual Studio 12, the one we're not using yet, also claims to process those intrinsics, though I have not tested it yet.
> On using Unsafe to get the alignment then you can use Unsafe.getUnsafe() rather than reflection.
Okay, I'll get to that.
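For reference, the usual pattern is Unsafe.getUnsafe() for code on the boot class path, with a reflective fallback elsewhere (sketch only):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeAccess {
    static Unsafe unsafe() {
        try {
            // Succeeds only when the caller is loaded by the bootstrap
            // loader, e.g. inside java.util.zip itself.
            return Unsafe.getUnsafe();
        } catch (SecurityException e) {
            try {
                // Reflective fallback for code outside the boot class path.
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (ReflectiveOperationException x) {
                throw new RuntimeException(x);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(unsafe() != null);
    }
}
```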
> On my comment about whether a parallelUpdate method has been considered, my personal view is that this would make it obvious when looking at the code. That is, have the long-standing update methods be serial by default and introduce new parallelUpdate methods. For existing code you could have the system property to opt in to being parallel when the input length is over the threshold (and that should be okay to backport to jdk7u if needed).
I don't like this approach for several reasons.
First, we're not done finding places where fork-join parallelism can make things go faster. If, each time we find a new one, we must split the API into parallel and serial versions, we're going to get a tiresome proliferation of interfaces. We'll need to document and test each one, and programmers will need to spend time choosing "the right one" for each application. This will be another opportunity to split application source bases into "old" and "new" -- these chances are always there, but why add more?
Second, this doesn't actually address the bug. The request behind the bug is that CRC32 go faster in various places, some of them open source. The reporters were not looking for a different, faster CRC; they were looking for the same CRC to go faster. They don't want to edit their source code or split their source base, and as we all know, Java doesn't have #ifdef.
Third, I've done a fair amount of benchmarking: one set of runs with "unlimited" fork-join parallelism down to relatively small task sizes, the other with fork-join capped at 4 threads (or in one case, 2 threads). Across a range of inputs and block sizes I checked the efficiency, meaning the departure from ideal speedup (2x or 4x). For 4M or larger inputs, across a range of machines, with parallelism capped at 2 (laptop, and single-split fork-joins) or 4, efficiency never dropped below 75%. The machines ranged from a Core i5 laptop, to an old T1000, to various Intel boxes, to a good-sized T4.
Out of 216 runs (9 machines, inputs 16M/8M/4M, task sizes 32K to 8M),
 10 runs had 75% <= eff < 80%
 52 runs had 80% <= eff < 90%
139 runs had 90% <= eff < 110%
 15 runs had superlinear speedup, with "efficiency" of 110% or better (I checked for noisy data; it was not noisy).
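For clarity, "efficiency" here is observed speedup divided by ideal speedup; the numbers below are illustrative, not from the runs above:

```java
public class Efficiency {
    // Efficiency = (serial time / parallel time) / thread count,
    // i.e. observed speedup as a fraction of ideal speedup.
    static double efficiency(double serialMs, double parallelMs, int threads) {
        return (serialMs / parallelMs) / threads;
    }

    public static void main(String[] args) {
        // Illustrative: 3.2x observed speedup on 4 threads = 80% efficiency.
        System.out.println(efficiency(400.0, 125.0, 4)); // 0.8
    }
}
```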
We can pick a minimum-parallel size that will pretty much assure no inefficient surprises; I think it is 4 megabytes, though once committed to fork-join, a good minimum task size looks like 128K. There's also a knob for controlling fork-join parallelism, for people in environments where they notice these momentary surprises and care (a T1000/Niagara does about 9 serial 16M CRC32s per second, so it's not a long-lived blip). If necessary/tasteful, we can add a knob for people who want more parallelism than that.
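A rough sketch of that split strategy, using the thresholds above. Note the merge step needs a zlib-style crc32_combine, which is stubbed out here as a hypothetical helper, so only the serial path below the threshold actually produces a checksum as written:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.zip.CRC32;

public class ParallelCrcSketch {
    static final int MIN_PARALLEL = 4 << 20;   // ~4 MB: below this, stay serial
    static final int MIN_TASK     = 128 << 10; // ~128 KB: smallest split

    static long serial(byte[] buf, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(buf, off, len);
        return crc.getValue();
    }

    static long checksum(byte[] buf) {
        if (buf.length < MIN_PARALLEL) {
            return serial(buf, 0, buf.length); // avoid fork-join overhead on small inputs
        }
        return ForkJoinPool.commonPool().invoke(new Task(buf, 0, buf.length));
    }

    static class Task extends RecursiveTask<Long> {
        final byte[] buf; final int off, len;
        Task(byte[] buf, int off, int len) { this.buf = buf; this.off = off; this.len = len; }

        @Override protected Long compute() {
            if (len <= MIN_TASK) return serial(buf, off, len);
            int half = len >>> 1;
            Task left  = new Task(buf, off, half);
            Task right = new Task(buf, off + half, len - half);
            left.fork();                       // run left half asynchronously
            long r = right.compute();          // compute right half in this thread
            long l = left.join();
            return combine(l, r, len - half);  // merge the partial CRCs
        }
    }

    // Hypothetical placeholder: merging two partial CRC32s requires a
    // zlib-style crc32_combine (GF(2) matrix math), omitted from this sketch.
    static long combine(long crcA, long crcB, int lenB) {
        throw new UnsupportedOperationException("crc32_combine not shown");
    }
}
```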
If it's appropriate to put the benchmarks (PDF) in a public place, I can do that.
Fourth, I think we also need to lead by example. If we treat fork-join parallelism as something so risky and potentially destabilizing that parallelized algorithms deserve their own interface, then what will other people think? I've got plenty of concerns about efficient use of processors, but I also checked what happens if the fork-join pool is throttled, and it works pretty well.
David