I've created a little microbenchmark to test the relative costs of a few different ways of getting access to a temporary object inside a short method, and I thought the results might be of interest to other people.

My test computes the cross product of two 3D vectors and returns the result in the first vector. This is a small but real-world-useful operation, and it requires some temporary space for the cross product. I coded up multiple versions that used different techniques to get the needed temporary space as follows:

[1] Local var. This method just uses local double variables for its temporary space. This is the only version that does not use an object (a Vector3d) for its temporary space.
[2] New. Allocate a new Vector3d each time the method is called.
[3] ThreadLocal. Get a temporary object using a ThreadLocal.
[4] Field. Get a temporary object stored in a private field.
[5] Field sync. Synchronized method which gets its temporary from a private field as in [4].
[6] TempStack. Get a temporary object from a TempStack, which is essentially an object pool whose objects must be returned in the reverse order from which they were obtained. The TempStack itself is obtained using a ThreadLocal.
[7] TempStack param. Use a TempStack passed in as an explicit extra parameter. Ugly in that it requires an extra parameter, but can be relatively fast.
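For concreteness, here is a minimal sketch of what variants [1] and [2] might look like. The Vector3d class and method names are my illustrative guesses, not the original benchmark code:

```java
// Minimal sketch of variants [1] and [2]; Vector3d and the method
// names are illustrative assumptions, not the original benchmark code.
class Vector3d {
    double x, y, z;

    Vector3d() {}

    Vector3d(double x, double y, double z) {
        this.x = x; this.y = y; this.z = z;
    }

    // [1] Local var: temporaries are plain double locals, no object needed.
    void crossLocal(Vector3d v) {
        double tx = y * v.z - z * v.y;
        double ty = z * v.x - x * v.z;
        double tz = x * v.y - y * v.x;
        x = tx; y = ty; z = tz;
    }

    // [2] New: allocate a fresh Vector3d for the temporary on every call.
    void crossNew(Vector3d v) {
        Vector3d t = new Vector3d(y * v.z - z * v.y,
                                  z * v.x - x * v.z,
                                  x * v.y - y * v.x);
        x = t.x; y = t.y; z = t.z;
    }
}
```

Both compute the cross product in place; the only difference is where the three temporary doubles live.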

Method 4 is not thread-safe, and methods 3, 4, and 5 cannot be used in recursive methods. Method 2 is the cleanest of the object-based methods, but how does its performance compare to the others? Here are some timings from my 1.7GHz Pentium 4 machine:
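To make those constraints concrete, here is a rough sketch of variants [3]–[5], again with assumed names rather than the author's actual code (ThreadLocal.withInitial is a modern convenience; 1.4-era code would have subclassed ThreadLocal instead):

```java
// Sketches of variants [3]-[5]; all names are illustrative assumptions.
class Vector3d {
    double x, y, z;

    // [3] ThreadLocal: one temporary per thread. Thread-safe, but the
    // shared temporary means the method cannot be used recursively.
    private static final ThreadLocal<Vector3d> TMP =
            ThreadLocal.withInitial(Vector3d::new);

    void crossThreadLocal(Vector3d v) {
        copyCross(TMP.get(), v);
    }

    // [4] Field: temporary kept in a private field. Not thread-safe and
    // not recursion-safe: concurrent callers would share the temporary.
    // Created lazily so that constructing a Vector3d does not recurse.
    private Vector3d tmp;

    void crossField(Vector3d v) {
        if (tmp == null) tmp = new Vector3d();
        copyCross(tmp, v);
    }

    // [5] Field sync: same as [4], but synchronized for thread safety.
    synchronized void crossFieldSync(Vector3d v) {
        crossField(v);
    }

    // Compute this x v into t, then write the result back into this.
    private void copyCross(Vector3d t, Vector3d v) {
        t.x = y * v.z - z * v.y;
        t.y = z * v.x - x * v.z;
        t.z = x * v.y - y * v.x;
        x = t.x; y = t.y; z = t.z;
    }
}
```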

As others have noted, under 1.4.2 -server is much faster for floating point code than -client

The difference between (1) and (2) gives the approximate cost of allocating and garbage collecting the temporary Vector3d object. Allocation increases the cost of the cross product method by a factor between 2 (client) and 8 (server), so it is still a very significant cost in this case.

The synchronized method is the most expensive in all cases, so it is still best to avoid synchronization when possible.

We use a technique similar to (7) in performance-critical sections of our own code, and I would happily change the code to something cleaner like (2) if the cost were small. However, the cleaner object-based techniques are still significantly slower.
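A TempStack along the lines of (6)/(7) could be sketched roughly as follows. This is my guess at the idea (a LIFO pool where release must mirror get in reverse order), not the author's actual implementation:

```java
// Rough sketch of a TempStack-style pool (variants [6]/[7]); the class
// and method names are guesses at the idea, not the original code.
class Vector3d {
    double x, y, z;
}

class TempStack {
    private Vector3d[] slots = new Vector3d[16];
    private int top;

    // Hand out the object at the top of the stack, allocating lazily
    // the first time each slot is used.
    Vector3d get() {
        if (top == slots.length) {
            slots = java.util.Arrays.copyOf(slots, top * 2);
        }
        if (slots[top] == null) {
            slots[top] = new Vector3d();
        }
        return slots[top++];
    }

    // Objects must be released in reverse order of get(): release just
    // moves the top-of-stack pointer back so the slot is reused later.
    void release() {
        top--;
    }
}

class CrossOps {
    // [7] TempStack param: the caller supplies the pool explicitly.
    static void cross(Vector3d a, Vector3d b, TempStack ts) {
        Vector3d t = ts.get();
        t.x = a.y * b.z - a.z * b.y;
        t.y = a.z * b.x - a.x * b.z;
        t.z = a.x * b.y - a.y * b.x;
        a.x = t.x; a.y = t.y; a.z = t.z;
        ts.release();
    }
}
```

Variant (6) would be identical except that the TempStack is fetched from a ThreadLocal inside the method rather than passed in.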

Although I expected (1) to be the fastest, under -client it turns out to be actually slower than (4) and (7) for reasons I don't understand.

The field method (4) seems to be relatively slow under -server, for reasons I also don't understand.

Caveat: This is a microbenchmark and performance may be different in real applications. I think garbage collection is a wonderful thing, and I do not advocate abandoning it for object pools except when really necessary for performance reasons (preferably after profiling your code first). Comments and critiques are welcome.

All three client JVMs seem to preserve the oddity that using a field (4) is cheaper than using local variables (1). I still don't understand why this is the case.

The relative cost of a synchronized method (5) seems to be lower on MacOSX and much lower under Redhat as compared to my Windows results.

The relative cost of new and garbage collection seems slightly lower under MacOSX but significantly higher under Redhat. It is still large enough in all cases to be a potential bottleneck in truly performance-critical code.

Under MacOSX, -client and -server are not significantly different which is not surprising since my understanding was that -server is ignored under the current MacOSX JVM.

Yes, there is only a single shared library for the hotspot VM on Mac OS X. I didn't realise it at first, but the server and client shared library files are both there, as aliases to a single 'hotspot' library.

I guess the slight differences are caused by different parameter values to the same VM (e.g. compile thresholds, size of young generation in heap etc.)

I too have no clue why a field is faster than a stack variable - weird. If only there were a way to disassemble the native code produced by HotSpot.

I ran some more tests to see why the cost of synchronization seemed to vary so much, and it seems to depend on whether you are running on a single-processor or dual-processor machine. My previous tests were run on a dual processor, which I didn't mention because it didn't seem relevant since the test is entirely single-threaded (i.e. the other processor just sits idle). However, I've gone back and redone my timings (with fewer other applications open) on single- and dual-processor machines which are otherwise similar, and here are the results:

JVM 1.4.2                 client          |      server
1.7GHz P4             single    dual      |  single    dual
(1) Local var          0.077   0.076      |   0.012   0.012
(2) New                0.070   0.141      |   0.056   0.124
(3) ThreadLocal        0.102   0.100      |   0.043   0.039
(4) Field              0.043   0.042      |   0.045   0.043
(5) Field sync         0.057   0.231      |   0.055   0.178
(6) TempStack          0.121   0.128      |   0.045   0.047
(7) TempStack param    0.053   0.072      |   0.016   0.016

Most results are similar, except that the costs of new (2) and synchronization (5) are much higher on a dual-processor machine.

Adding synchronized to a method is virtually free on a single processor (assuming no contention), but fairly expensive on a dual processor.

Using new (2) on a single-processor machine under the client JVM seems reasonably fast: only the field (4) and field sync (5) methods are faster, and not by that much. However, on a dual processor, or when using -server, there are other techniques that are much faster than using new.

TempStack param (7) also seems to slow down somewhat on a dual processor for reasons I don't understand, but only under -client, not -server.

Would we see the same slowdowns on a single processor machine with HyperThreading enabled? (It was not enabled in any of my tests.) I'll try to test this if I can find a suitable machine.

I wonder if the JVM actually generates different code on a single- vs. dual-processor machine or if there is something else going on here (cache effects? context switching?).

Doesn't the PPC architecture use register windowing and suchlike from its RISC beginnings? It's pretty poorly adapted to stack-based architectures like the JVM, whereas the x86, with its paucity of general-purpose registers, is much better at stack ops. So let's guess that the fields get mirrored into registers on PPC.

Yes, a dual processor makes a big difference for synchronisation. On a single CPU, raising the IRQ level is enough to prevent a context switch and therefore gain exclusive access for a moment. With dual CPUs, fancier mechanisms must be used: on Windows the kernel uses spinlocks in dual-processor mode, operations that are no-ops with the single-processor kernel.

I don't know if Mac OS X has the same distinction for synchronisation operations. I get the feeling that in the world of Macs, dual-processor machines are much more popular than in the world of Windows.

Cas, I'm not sure why the PPC would be any worse at stack ops: the compiler can choose any register to be the stack pointer, and it will work pretty much the same as an Intel stack. I believe the available addressing modes will mimic push/pop without the need for any extra instructions or longer execution times. The proper set of general-purpose registers is, in most cases, a win. The main problem, until recently with IBM's latest PPC chips, has been the lagging clock speeds of the PPC CPUs. But this discussion is for another thread, if it is worth pursuing at all.

> As others have noted, under 1.4.2 -server is much faster for floating point code than -client

For something as fundamental as floating point performance, I would just like to know the reason behind this discrepancy between the client and server options.

As a rough benchmark, I ran some cases with my Java3D particle-tracking algorithm, which involves fully double-precision calculations, newing of dynamic primitive arrays and objects of that type only, no synchronization anywhere, extensive polymorphic method calls (since the cells are different kinds of polyhedra), no accounting for gc times, and no newing of objects within loops.

The times taken for creating 1000 traces over repeated invocations without any pauses are (in secs):

I suspect the reason is that the server is allowed to spend more time compiling bytecode to machine code, and the algorithm required to generate more optimized floating point instructions (e.g. the SSE instructs that are used on Intel by the server VM) requires too much processing time in the compiler. So for a client VM the lag caused by a runtime compilation pause would be considered unacceptable. That's my guess anyway.

BTW... I did a test 10 months ago with MS Visual C++ 6.0 to do a simple conversion from RGB colour space to YUV. The Java version ran faster than the C++ .exe. The reason, I suspect, was the very poor performance of the Microsoft compiler for floating-point-to-integer conversions. Apparently it sets the floating point rounding mode twice every time a conversion to int is required (once to set the mode to C-style rounding, once to set it back to the natural round-to-closest mode). Intel's C++ compiler at the time was considered vastly superior. The GNU C++ compiler on Intel, at least prior to version 3, also produced VERY poor code... so bad that I know of one project that abandoned the idea of a Linux port because they didn't want to release something with such poor performance and possibly ruin the reputation of the company.

--- Field (4) faster than local variables (1) under -client

I've profiled the code using VTune, which has the side benefit of allowing one to view the assembly code produced by the HotSpot compiler. I've posted the resulting assembly code for the local variable and field routines here (sorry for the strange formatting):

http://www.graphics.cornell.edu/~bjw/CPTLocalVarClient.txt
http://www.graphics.cornell.edu/~bjw/CPTFieldClient.txt

I'm not an x86 assembly expert, but perhaps there is an expert out there who can analyze the differences. One thing I noticed is that the local variable code computes the results using fp registers and then copies the results using int registers, while the field code uses fp registers throughout.

--- Field (4) much slower than local variables (1) under -server

This turns out to be an inlining effect. HotSpot inlines method (1) by default but not method (4). If I disable all inlining (using the -XX:MaxInlineSize=1 -XX:FreqInlineSize=1 flags), then the local var method slows down to 0.038 us (or just a hair faster than field). However, I could not find any parameter setting that would convince HotSpot to inline the field method (4) the same way it inlines method (1) by default.

Incidentally, I don't think it's actually any more difficult to generate the SSE/SSE2 instructions instead of x87 for the floating point code (in fact the SSE/SSE2 code is simpler and probably easier to generate). HotSpot -server does not use the SIMD parts of SSE/SSE2, just the scalar instructions, as shown in the assembly code linked below. I think SSE/SSE2 fp code is a feature that is likely to migrate down into the client JVM in the next version, especially if enough people request it.

http://www.graphics.cornell.edu/~bjw/CPTLocalVarServer.txt
