The benchmark is: fill a grid of 32*32*32 (32768 samples).
C: 8.2 ms (avg of 64 runs; no JNI overhead, as I benchmark it 100% in C)
Java: 2.3 ms (after warmup, avg of 64 runs)

Java server VM is 3.5x faster?

What rookie mistake did I make? I expected HotSpot to bounds-check all array indices; 'random' array accesses like perm[perm[gx + 1] + 1] should be rather hard for HotSpot to optimize away, or so I thought.
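For context, the access pattern in question looks roughly like this. This is a minimal sketch, not the actual benchmark code: the table contents and the init_perm shuffle are placeholder assumptions; only the perm[perm[gx + 1] + 1] lookup shape comes from the post.

```c
#include <assert.h>

/* Hypothetical 256-entry permutation table, duplicated to 512 entries
   (a common Perlin-noise layout) so perm[x + 1] never needs an
   explicit wrap-around, and so every lookup result stays in 0..255. */
static unsigned char perm[512];

static void init_perm(void) {
    for (int i = 0; i < 256; i++) {
        perm[i] = (unsigned char)((i * 37 + 11) & 255); /* placeholder shuffle */
        perm[i + 256] = perm[i];
    }
}

/* The dependent, data-driven double lookup the bounds-check question
   is about: the inner load's result feeds the outer load's index. */
static int hash2(int gx) {
    return perm[perm[gx + 1] + 1];
}
```

Because the table is unsigned char, the JIT could in principle prove every index lies in 0..256 and hoist the bounds checks, which is one plausible reason the checks cost less than expected.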

I hardly ever use GCC. If you're using Linux, you can grab the Intel compiler for non-commercial use, at least for comparison purposes. Maybe you need a newer version of GCC? Also, why limit the instruction set to the P3? Isn't the P4 a reasonable bottom end?

Would your noise function even get inlined? If I remember correctly, the larger a function is, the less chance it has of being inlined. You should be able to check how much inlining is happening by removing the inline directives and comparing the resulting exe file size: a significant difference means a lot of inlining was happening. I also thought that calling one inline function from another significantly reduces the chance of anything being inlined. Try removing inline from noise() and see if it has any effect.
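One way to take the size heuristic out of the equation is GCC's always_inline attribute. A sketch, using lerp as the example function (the attribute is real GCC syntax; applying it to this particular code base is my assumption):

```c
#include <assert.h>

/* GCC-specific: always_inline overrides the compiler's size heuristics,
   so the call is inlined (or GCC errors out if it can't inline it).
   Compiling with -Winline additionally warns about failed inlines. */
static inline __attribute__((always_inline))
float lerp(float t, float a, float b) {
    return a + t * (b - a);
}
```

Comparing the exe size (or the -S assembly output) with and without the attribute tells you whether the plain inline keyword was being honored.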

HotSpot doesn't generate any parallel computations, but it will use SSE instructions when available. You can grep the source for supports_sse and supports_sse2. There are a number of instructions which can have a big impact, even when computing a single result at a time.

All of the _mm_* intrinsic functions are wrappers for SIMD opcodes (MMX onward), and _mm_mul_ss is an SSE-1 instruction. As for the asm dump, do you mean yours? If so, then well, you told it to use SSE-1!
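For reference, _mm_mul_ss maps to the scalar SSE-1 mulss instruction. A minimal check, assuming an x86 target where <xmmintrin.h> is available (the wrapper function name is mine):

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE-1 intrinsics */

/* Multiply two scalar floats through the SSE-1 path (mulss):
   load each into the low lane of an XMM register, multiply,
   and store the low lane back out. */
static float mul_sse(float x, float y) {
    __m128 a = _mm_set_ss(x);
    __m128 b = _mm_set_ss(y);
    float r;
    _mm_store_ss(&r, _mm_mul_ss(a, b));
    return r;
}
```

Disassembling this is a quick way to confirm whether the compiler actually kept the math in SSE registers or bounced through x87.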

Well, we were trying to find out why the C code was slower than the JVM code (the math is still slower in C; I removed the branching bottleneck). Your and tom's suggestion/insinuation was that the JVM uses SSE, but since the C version uses SSE too, it can't be related to SSE. That was what confused me, so I double-checked.

Anyway, thanks for your tips and suggestions; they helped me get rid of floor() and look at the grad function for alternative optimizations.
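For anyone curious, the usual floor() replacement in noise code is a plain cast plus a correction for negative inputs. A sketch of that common trick (not necessarily the exact change made here):

```c
#include <assert.h>

/* (int)x truncates toward zero, which already equals floor() for
   non-negative x; for negative non-integer x, correct downward by 1. */
static inline int fastfloor(float x) {
    int i = (int)x;
    return (x < (float)i) ? i - 1 : i;
}
```

The win comes from avoiding the libm call and the x87 rounding-mode juggling that floor() can involve on older compilers.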



I missed this reply. My "guess" is that you're on a machine newer than a P3 and the JIT is using newer instructions. I would be very curious to see what the JIT is producing. I've never run across a case where the VM was producing faster code.


I changed "-march" to "pentium4" (I don't know the names of more modern archs) and it didn't help. It could be that GCC is simply generating inefficient SSE: not reusing registers, or converting intermediate SSE results to x87 when it inlines a function.
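For reference, the x87-spill issue can be checked directly: the flags below are real GCC options, while the file name is just a placeholder.

```shell
# -mfpmath=sse keeps scalar float math in SSE registers instead of
# routing intermediates through the x87 stack; -S dumps the generated
# assembly (noise.s) so you can grep for fld/fstp x87 spills.
gcc -O3 -march=pentium4 -mfpmath=sse -ffast-math -S noise.c
```

On 32-bit targets GCC defaults to -mfpmath=387 even when -march enables SSE, so this flag alone can explain SSE intrinsics getting bounced through x87.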

If I replace this code:

inline float lerp(float t, float a, float b) { return a + t * (b - a); }

with what is (to my knowledge) theoretically the same:

#define lerp(t, a, b) (a + t * (b - a))

the result is 10% slower, so GCC is not generating the same assembly code even for this trivial example.
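As an aside, the two aren't quite theoretically the same either: the macro's parameters are unparenthesized in the body, so passing an expression as an argument changes the result. A contrived example to illustrate (the _macro/_fn names are mine):

```c
#include <assert.h>

/* Unparenthesized parameters: (b - a) expands textually,
   so operator precedence applies to the caller's expression. */
#define lerp_macro(t, a, b) (a + t * (b - a))

static inline float lerp_fn(float t, float a, float b) {
    return a + t * (b - a);
}

/* With a = 2.0f - 1.0f and b = 3.0f, the macro's (b - a) expands to
   (3.0f - 2.0f - 1.0f) == 0.0f, so the macro yields 1.0f while the
   function correctly yields 2.0f. */
```

A fully parenthesized body, #define lerp(t, a, b) ((a) + (t) * ((b) - (a))), avoids this, though a macro can still evaluate its arguments more than once.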

