On the previous thread about the article with those microbenchmarks comparing Java with C++, somebody on that 'Comments' board posted some real code (a Mandelbrot generator) which he converted from C to Java. On 1.4.2_04 client, it was about 8% slower. I made a small change without altering the algorithms in any way (I just made everything non-static) and got almost exactly the same performance as the C version. I did another test on 1.5.0 beta 2 client and, behold: the Java version is even faster than the C version.

But when I test on the server VM (both 1.4.2 and 1.5.0), the results are very disappointing. The server performs the test ~30% slower than the client! As a matter of fact, I've seen this kind of bad server-VM performance very often (for example in a program I wrote for a customer where huge text files are converted to even larger XML documents). This particular test runs 100 times (which means 100 times 15 seconds in total), but performance doesn't get any better over time.
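For anyone who wants to reproduce this kind of client-vs-server comparison, here is a minimal warm-up-aware timing harness. This is a hypothetical sketch, not the actual FFFF benchmark; the class and method names are mine. Run it once with `java -client Bench` and once with `java -server Bench` on the same machine.

```java
// Minimal timing harness sketch (hypothetical; NOT the original FFFF benchmark).
// Run with "java -client Bench" and "java -server Bench" to compare the two VMs.
public class Bench {
    // Double-heavy busy work, loosely shaped like a fractal inner loop.
    static double work(int n) {
        double x = 0.5, y = 0.25;
        for (int i = 0; i < n; i++) {
            double nx = x * x - y * y + 0.1;
            y = 2.0 * x * y + 0.2;
            x = nx;
            if (x * x + y * y > 4.0) { x = 0.5; y = 0.25; } // keep values bounded
        }
        return x + y; // return a result so the JIT can't discard the loop
    }

    public static void main(String[] args) {
        double sink = 0;
        for (int run = 0; run < 10; run++) { // repeat so the JIT gets a chance to warm up
            long t0 = System.currentTimeMillis();
            sink += work(50000000);
            long t1 = System.currentTimeMillis();
            System.out.println("run " + run + ": " + (t1 - t0) + " ms");
        }
        System.out.println(sink); // print the sink to defeat dead-code elimination
    }
}
```

Since the server VM compiles methods later but more aggressively, repeating the timed run and ignoring the first few iterations matters; a single cold run unfairly penalizes -server.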

Does anybody have an idea?

I know it's yet another benchmark, but the fact that I keep seeing the server VM perform so badly kind of worries me :-/

So, compared to the original version, on my laptop the benchmark runs about as fast as the FPU assembly version and almost 3.5 times as fast as the pure C version. Quite amazing, really. The assembly versions that use SSE instructions still beat the crap out of the Java version, but even so the results are far better than I expected.

Heavy floating-point usage often trips up many 'cheap' C compilers (VC6, GCC); the Intel and Visual Studio .NET compilers do a much better job, but even they don't usually do this:

Quote

To see it more clearly, I broke into the JIT-generated code with a debugger and saw that the JIT:

1) Arranged a 4x loop unroll.
2) Is using SSE2 code.
3) The machine code is pretty nice for a JIT.

Nice work indeed.

Now, that's why the C code is slower: the C compiler is actually emitting default FPU (x87) code, so the comparison is not fair.

Of course this scores a point in favour of the JIT. The JIT can produce code optimized for the actual CPU running the program, while a static compiler cannot know the target platform in advance (otherwise the statically compiled code will run ONLY on that target platform).

Nice! I wish the people who make the JITs were more open about what the JIT can do - this sort of info would help convince developers (hey, get your free SSE2 optimisations over here!).

Hmm, now it seems the very bad server-VM performance is not Athlon-specific but affects any CPU that doesn't support SSE2, so the problem is far more serious than I thought. It now seems that the server VM currently performs far worse than the client on most systems! Well, in this particular case, that is.

I'm wondering what happens to the performance difference between the client and server VMs (on non-SSE2 CPUs) if the test is converted to float instead of double precision.

Of course, but I'm not trying to alter the test in order to make it quicker. When we use float instead of double, we're no longer comparing against the double-precision version of the original program; we should instead compare against the SSE version of the original (but only when running the test on an SSE-capable CPU). My reason for converting to float is that I want to see what happens to the server's performance relative to the client's, so we can maybe narrow down the cause of the problem. If, when using floats, the server has acceptable performance compared to the client (on an Athlon XP), then I can conclude that there is probably a bug regarding SSE2 optimizations (in the case where those are not possible). If not, the problem lies elsewhere and we can begin to doubt the usability of the server VM in its current state on possibly even the majority of x86 platforms... :-/ Which I do anyway, given my own personal (generally bad) experiences with the server VM on my Athlon. I'll do the test when I get home.
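For reference, the float conversion only needs the types and literals retyped; the algorithm stays identical. A sketch of the usual Mandelbrot escape-time inner loop in both precisions (hypothetical helper names, not the actual FFFF code):

```java
// Sketch of the standard Mandelbrot escape-time loop in double and float.
// Hypothetical code for illustration; NOT taken from the FFFF sources.
public class FloatLoop {
    static int iterateDouble(double cx, double cy, int max) {
        double x = 0.0, y = 0.0;
        int i = 0;
        while (i < max && x * x + y * y <= 4.0) { // escape test: |z|^2 <= 4
            double nx = x * x - y * y + cx;       // Re(z^2 + c)
            y = 2.0 * x * y + cy;                 // Im(z^2 + c)
            x = nx;
            i++;
        }
        return i;
    }

    // Identical algorithm; only the types and literal suffixes change.
    static int iterateFloat(float cx, float cy, int max) {
        float x = 0.0f, y = 0.0f;
        int i = 0;
        while (i < max && x * x + y * y <= 4.0f) {
            float nx = x * x - y * y + cx;
            y = 2.0f * x * y + cy;
            x = nx;
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(iterateDouble(0.0, 0.0, 1000));  // interior point: never escapes
        System.out.println(iterateFloat(2.0f, 2.0f, 1000)); // exterior point: escapes fast
    }
}
```

On an SSE (but not SSE2) CPU, the float version is the one the JIT could vectorize, which is exactly what makes it useful for narrowing down where the server VM goes wrong.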

So 1.5.0 beta 2 seems to be as fast as, or a fraction faster than, 1.4.2_03. The server's float performance seems good. It's interesting to see that on the client, double-precision performance is no lower than float performance.

On an Athlon, yeah. But on a P4, Java3D should perform excellently, since doubles on the server are even faster than floats.

EDIT: correction, not faster than floats, but the difference between server and client is larger when dealing with doubles.

IIRC, there was a (regression) bug associated with proper alignment of doubles. Maybe this never got fixed on the server (assuming, of course, that this is not a prerequisite for SSE2, which does appear to work).

However, if we actually wanted x*x - y*y, then the obvious computation may actually be faster, because the two multiplications can start one clock cycle apart, and when they finish some 20 clock cycles later, all that remains is the subtraction. By contrast, in the alternative expression (x+y)*(x-y), the multiplication can't start until both the addition and subtraction have completed. The total time taken will then be very similar. A more important consideration today may be the accuracy of the result.
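The two algebraically equal forms can be put side by side; a small sketch (the helper names are mine, not from the thread):

```java
// The two algebraically equivalent ways of computing x^2 - y^2 discussed above.
// Hypothetical illustration code, not from the benchmark.
public class DiffOfSquares {
    // Two independent multiplies that can be pipelined; one subtract at the end.
    static double direct(double x, double y)   { return x * x - y * y; }

    // The add and subtract must both complete before the single multiply can start.
    static double factored(double x, double y) { return (x + y) * (x - y); }

    public static void main(String[] args) {
        double x = 1.5, y = 0.25;
        System.out.println(direct(x, y));   // 2.1875
        System.out.println(factored(x, y)); // 2.1875 here, but the two forms are
                                            // not bit-identical for all inputs
    }
}
```

The dependency-chain argument is the whole point: `direct` exposes two independent operations to the pipeline, while `factored` serializes everything in front of the multiply.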

It's not my program. It's a Java port of a fun little program called 'FFFF' (you can find it on SourceForge), ported from the original C source by the original author just to compare Java's performance to the C version. I just changed it a little bit to make it not fully static. I didn't even look at the algorithm (apart from comparing it with the original sources). I figured that if I tried to optimize the algorithm, the benchmark would become invalid.

...but Mandelbrot is |(a+bj)| &lt;= 2, no? In which case, that's (a²-b²) + (2ab)j &lt;= 4? I just remember that a difference of squares was in there somewhere...
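To untangle the two quantities: for z = a + bj, the square z² = (a²-b²) + (2ab)j contains the difference of squares, while the escape test |z| &lt;= 2 uses the sum of squares a² + b² &lt;= 4. A tiny sketch (hypothetical helpers, not from the benchmark):

```java
// Distinguishing the complex square from the escape-test magnitude.
// Hypothetical illustration code, not from the FFFF benchmark.
public class ComplexSquare {
    static double squareRe(double a, double b) { return a * a - b * b; } // Re(z^2): difference of squares
    static double squareIm(double a, double b) { return 2.0 * a * b; }   // Im(z^2)
    static double mag2(double a, double b)     { return a * a + b * b; } // |z|^2, compared against 4

    public static void main(String[] args) {
        // z = 1 + 2j: z^2 = -3 + 4j, while |z|^2 = 5.
        // Note -3 != 5: the difference of squares is part of the iteration,
        // not the escape test.
        System.out.println(squareRe(1.0, 2.0) + " " + squareIm(1.0, 2.0)
                + " " + mag2(1.0, 2.0));
    }
}
```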

Quote

However if we actually wanted x*x - y*y, then the obvious computation may actually be faster because the two multiplications can start one clock cycle apart, and when they are finished (some) 20 clock cycles later all that remains is the subtraction.

Thanks. As I said, IIRC it used to have a significant effect (presumably not-very-good JITting); I see what you mean about pipelining. Might this change have a significant effect on the absence/presence of SSE/3DNow! optimizations?

...but Mandelbrot is |(a+bj)| &lt;= 2, no? In which case, that's (a²-b²) + (2ab)j &lt;= 4? I just remember that a difference of squares was in there somewhere...

You want a benchmark that actually computes the correct value!

Pipelines and superscalar execution certainly make estimating performance difficult. They ought to make use of a JIT very attractive if one wants optimal performance out of the Pentium III, Pentium 4, Athlon XP, Athlon 64, VIA Eden, Transmeta (Crusoe, Efficeon), etc.
