I've been looking around for a while on this, and haven't found much info online, so I figured I'd ask here - game programmers tend to be more psychotic than most about speed!

In compiled languages, one of the most important optimizations you can make is structuring your memory access in a cache-friendly manner, especially with processor speed pulling so far ahead of memory access speed. So you always try to do things like accessing data sequentially instead of randomly to optimize for that.

In theory, the JVM should be optimizing this type of stuff in Java without us needing to worry about it, which is one of the reasons Java could theoretically be faster than C; but in practice, it's not clear how much this is happening.

So here's the question: does anyone know if there is anything to be gained from cache-friendly methods in Java, and how best to exploit this, or does the work that the JVM does render these techniques irrelevant? In C++ some people see up to 10x performance increases by appropriately handling memory access, so it could be very worthwhile to figure this out. In some cases I've been able to exhibit speed gains by interleaving arrays of built-in types in very careful ways, so I know the JVM isn't doing everything that's possible, but does anyone know any best practices that are generally useful? I'd hate to be optimizing for a particular architecture and JVM...

I am very interested in this topic as well. I'm mainly a C++ programmer these days and what you say is very true. I've never understood how these cache coherency issues translate to Java. Is there any kind of documentation out there on this? Would the JVM spec indicate anything in this regard? Basically the only strategy that I'm sure works is to keep everything as small as possible and as contiguous as possible (e.g. a single FloatBuffer of floats used as an n-dimensional array, rather than an actual n-dimensional array).

My impression is that the JVM specs guarantee literally nothing about any sort of contiguous allocation - by my reading, an array of floats could very well be scattered all across the heap. Apparently even the stack is usually not contiguous in Sun's VMs.

I would think, though, that the JVM would take the fact that you're allocating a bunch of floats in an array as a good hint that it's worth putting them near each other in memory, though. And I have, indeed, been able to show small improvements in performance by carefully interleaving array data from several arrays into one. What I'm not sure about is a) whether this is consistent and exploitable in general, and b) what happens with arrays of objects, and whether there is even any point in trying to do something similar - for instance, will the array of references ever be allocated so that it is interleaved with the referents for better locality?
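To make the interleaving idea concrete, here's a minimal sketch (not the poster's actual benchmark code) of the same data stored as two parallel arrays versus one interleaved array. With interleaving, position and velocity for a given index sit next to each other in memory, so one cache line can serve both reads; both layouts produce identical results.

```java
public class Interleaving {
    public static void main(String[] args) {
        int n = 4;

        // Separate arrays: x[i] and vx[i] may live far apart on the heap.
        float[] x = new float[n];
        float[] vx = new float[n];
        for (int i = 0; i < n; i++) { x[i] = i; vx[i] = 0.5f; }
        for (int i = 0; i < n; i++) x[i] += vx[i];

        // Interleaved: data[2*i] = position, data[2*i + 1] = velocity.
        float[] data = new float[2 * n];
        for (int i = 0; i < n; i++) { data[2 * i] = i; data[2 * i + 1] = 0.5f; }
        for (int i = 0; i < n; i++) data[2 * i] += data[2 * i + 1];

        for (int i = 0; i < n; i++) {
            System.out.println(x[i] + " == " + data[2 * i]); // identical results
        }
    }
}
```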

I'm hoping that someone knows how HotSpot actually handles this stuff in practice, since in reality, we're normally programming to that implementation, and some rules of thumb would be great. And maybe someone knows whether this stuff changes every version or is stable enough to try to work out.

BTW, re: FloatBuffers, are those faster than a single dimensional float[]? I couldn't exhibit a simple case where that seemed to be the case. Any chance you've got a snippet that could demonstrate a good use?

This is logical, but in the case of arrays you must access them sequentially and "incrementally" (from offset 0 up to the max offset), and they seem to give better results when working in 64k blocks. Also, use only one-dimensional arrays, as multidimensional arrays work differently than in C (multiple blocks of memory are allocated rather than a single one).

Avoid casting when it is not simple:

- double => int: very slow
- int => double: acceptable
- avoid casting to a lower resolution, and from floating point to fixed point.

For complex computations, do not work directly on the array value; copy it into a local variable, do the work there, and put it back:

int x = array[i];
x += y;
// ... perform some complex ops ...
array[i] = x;

When processing array values, you should have only one read and one write per element per loop iteration; arrays are slow because there is a bounds check on every read/write op.
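A sketch of that advice (the update formula and the name "factor" are made up purely for illustration): the first method touches each element twice per iteration, the second reads once into a local, does all the work there, and writes back once. Both compute the same result.

```java
public class OneReadOneWrite {
    // Two reads and two writes of a[i] per iteration.
    static void updateBad(float[] a, float y) {
        for (int i = 0; i < a.length; i++) {
            a[i] += y;
            a[i] *= 0.5f;
        }
    }

    // One read, work on the local, one write.
    static void updateGood(float[] a, float y) {
        for (int i = 0; i < a.length; i++) {
            float x = a[i];
            x += y;
            x *= 0.5f;  // "complex ops" happen on the local variable
            a[i] = x;
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {1f, 2f, 3f};
        updateBad(a, 1f);
        updateGood(b, 1f);
        System.out.println(java.util.Arrays.equals(a, b)); // same result either way
    }
}
```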

When possible, work on one block of memory at a time: processing one array and then another seems to be faster than working on both at the same time (especially when there are many arrays). This is also true for objects; I guess it is related to the cache.

Note that the above is only from my own experience.

EDIT:

Quote

BTW, re: FloatBuffers, are those faster than a single dimensional float[]? I couldn't exhibit a simple case where that seemed to be the case. Any chance you've got a snippet that could demonstrate a good use?

Nothing can be faster than a simple float[] array in pure Java. FloatBuffers may be faster for exchanging data with external/native functions, or may have some special functions implemented.
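A short sketch of where a FloatBuffer does pay off: a *direct* buffer lives outside the Java heap, so native code (an OpenGL binding, for example) can read it without a copy. For pure-Java math a plain float[] remains the faster choice, as stated above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        FloatBuffer verts = ByteBuffer
                .allocateDirect(3 * 4)           // 3 floats * 4 bytes each
                .order(ByteOrder.nativeOrder())  // match the platform's byte order
                .asFloatBuffer();
        verts.put(new float[] {0f, 1f, 2f});
        verts.flip();                            // prepare the buffer for reading
        // A native binding would hand 'verts' straight to the native API here.
        System.out.println(verts.get(1)); // 1.0
    }
}
```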

I think they simply tried to avoid a new keyword in the language. I remember when they introduced 'enum' and let several projects break, including Apache projects and even NetBeans. The annotation-based solution is probably the only feasible way to get it into a 15-year-old language like Java.

Nah, that's just a lame excuse. After all binary compatibility wouldn't be broken, only source, and how hard is it to do a trivial renaming refactoring on the very rare bits of code with the keyword "struct" in?

Quote

Nah, that's just a lame excuse. After all binary compatibility wouldn't be broken, only source, and how hard is it to do a trivial renaming refactoring on the very rare bits of code with the keyword "struct" in?

It's actually no work at all to refactor it... in case you have the source and are allowed to recompile it. They don't see it as a main feature, which is why they want to avoid causing any trouble with it. (But to be honest, I personally really don't care whether it is called 'struct' or '@Struct'.)

Some initial experimentation shows some interesting things. I ran a simple 1D particle simulation (increment position based on velocity, apply a tiny bit of damping to velocity) in both Java and C++ comparing the behavior depending on whether I had two separate arrays for velocity and position or one interleaved array. I can post the code if anyone cares (and yup, I am very aware of the pitfalls when writing a Java microbenchmark!), but for now I won't clutter things up. Here are the results, in FPS, higher is better:

(20000 particles over 10000 frames of simulation, 40 tests run and averaged)

So by a MASSIVE amount, C++ with separate arrays performs the worst, and we see a 5:1 improvement there by going to interleaved arrays. But in Java, there's essentially no difference, at least at this particle count - trust me, the sub percentage point difference you see there is essentially just noise.

The real surprising thing (to me, at least), is that at least Apple's JVM actually does appear to be making good on the claim that it can achieve better cache locality than AOT compiled languages, as the non cache-friendly Java code performs four times as well as the corresponding C++.

I wondered whether the performance difference between Java and C++ could be accounted for by the bounds check, so I inserted a basic bounds check into the C++ (just breaks out if bounds check fails, which is probably not quite as complex or expensive as Java's):

No difference at all for the regular float version (two arrays), but brings the interleaved version within spitting distance of the Java one.

Somewhat relevant to Cas' suggestion, I also tried using an array of Particle objects in Java (just holders for x and vx, in this case):

Java using Particle class: 17139.75

Using an array of objects is still better than the worst C++, but still a ways behind optimized Java; and we haven't even touched the issue of creating objects on the fly or garbage collecting them, either, which are the real slowdowns that I tend to see in my code.

So there's no real conclusion here, except that short of getting rid of those bounds checks, pure float arrays in Java are handled about as well as can be expected, and the JVM actually is doing quite a bit of work to keep the cache miss rate low.

This is now totally off topic, but I ran into one thing that I should warn you about - the initial version of the particle update code I wrote read something like this:

...and I was surprised to find that my timing information was heavily dependent upon the number of frames that I was simulating. But FPS should not depend on the number of frames you check, as long as it's large, and I was surprised to be hitting a cutoff around 9000 where to push any higher meant an 80% drop in frame rate! (interestingly enough, the interleaved method weathered this drop far better than the others in Java)

The cause: it turns out that numerical underflow will freaking MURDER your program's performance in either C++ or Java - that vx *= 0.99f line was bringing all the velocities down to the level of underflow after about 9000 frames, and from that point on it's a major drag on performance. In this case I wasn't interested in working around that issue, so I just changed the decay factor to 0.999f (putting off the inevitable a little further), but be aware of this in your own code, if you have a large number of variables that are used a lot and could exponentially decay to zero, it might be worth checking them against a threshold.
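The ~9000-frame figure checks out arithmetically: 0.99^9000 is around 5e-40, well below Float.MIN_NORMAL (about 1.18e-38), so by then the velocities are subnormal ("denormal") values, which many CPUs handle in slow microcode paths. A quick sketch of the effect and the threshold workaround mentioned above:

```java
public class UnderflowDemo {
    public static void main(String[] args) {
        float vx = 1.0f;
        int frame = 0;
        // Damp the velocity each frame until it falls into the subnormal range.
        while (vx >= Float.MIN_NORMAL) {
            vx *= 0.99f;
            frame++;
        }
        // vx is now subnormal: still non-zero, but expensive to compute with.
        System.out.println("subnormal after frame " + frame + ", vx = " + vx);

        // One workaround: flush tiny values to zero against a threshold yourself.
        if (Math.abs(vx) < 1e-30f) vx = 0f;
    }
}
```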

Things I've learned:

- The overall conclusion: the JVM does exactly what it promises to, at least with floating point arrays, optimizing them in almost every case to the point that they are similar in performance to well-used C++ arrays.
- I could not find a SINGLE way to affect the timing via a local (non-algorithmic) optimization - storing the loop limit away, changing to a while loop, replacing with a try/catch and a while(true), etc. - every one of these things performed EXACTLY the same.
- The final keyword makes no difference at all to performance anymore, anywhere, at any time. Only use it when it "should" be used, not because you think it will make things faster.
- Scope does not affect speed within a class (passing an array as a parameter vs. storing it as a member doesn't make a difference).
- Visibility of the array does not affect speed.
- Using getters and setters to marshal access to private object members does not degrade performance.
- The for (Particle p : particles) loop performed exactly the same.
- Any optimizations that are happening for fast array access do not seem to happen at garbage collection time - performance is identical before a single GC happens.
- Memory layout management is probably NOT responsible for the fast access - disrupting the access patterns does not seem to make any difference, so the JVM must be doing some intelligent prefetching.

@jezek2: on the Mac, the -server mode in 1.6 doesn't do very much, I think it just changes the way the heap sizing operates, which is irrelevant for these tests. I tried in 1.5, though, and got the following:

So at least on OS X, 1.5 should DEFINITELY be used with the -server flag. I included that "Local FPS" entry in the last one because that shows that at least Apple's 1.5 -server has a speed problem passing arrays as local variables, I'm not sure why - I omitted that from all the other tests because it never differed.

I know that most people aren't using Apple's VM, so I'm going to boot into Windows right now and see how things work there...I'll post some more results in half an hour or so.

And btw, I've never had problems with stuttering or anything like that due to the server flag, I tend to find that compilations pretty much cease after the first several seconds of gameplay, which you can keep an eye on with the -XX:+PrintCompilation flag - have you had trouble with this in the past?

That enhanced-for-loop creates an Iterator, the overhead can be significant in nested loops.

Further, I agree with most of your findings. I recently began to change my coding where performance really mattered: with out-of-order CPUs, you can actually do quite a bit while the CPU is waiting for the cache, or worse, for system RAM.

So with Sun's VM, -server is DEFINITELY worth it. Interestingly enough, the performance characteristics are almost exactly inverted between the client and server VMs. Here there does appear to be a difference between using a single interleaved array and two separate ones.

Also, it appears that Apple's VM is a bit faster at this than Sun's as of 1.6, delivering more even performance overall.

Since I just noticed that this forum scrolls code posted rather than inlining it, I might as well post mine in case anyone has any beef with it:

@DzzD: There's one issue with that benchmark, which brings up a very interesting point (maybe that was the thing you were demonstrating?) - you're running through the multi-array loop in the "wrong" order; if you switch the inner and outer loops in the testMultiArray function, I get the following results:

Of course, this difference itself is a cache effect, and shows that the JVM is not handling multidimensional array optimizations very well, nor is it reordering accesses when possible. Try the following altered code and play with xWidth and yWidth to alter the dimensions of the array and see the difference between accessing it the right way and the wrong way, sometimes it is extremely dramatic:
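Since the altered code itself isn't shown in this excerpt, here's a minimal sketch of the effect being described: a Java float[][] is an array of row arrays, so iterating with the row index outermost walks memory sequentially, while swapping the loops jumps between rows on every access. Both orders compute the same answer; only the access pattern (and the speed) differs.

```java
public class LoopOrder {
    // Row index outermost: sequential access within each row array.
    static float sumRowMajor(float[][] a) {
        float s = 0f;
        for (int y = 0; y < a.length; y++)
            for (int x = 0; x < a[y].length; x++)
                s += a[y][x];
        return s;
    }

    // Column index outermost: strides across rows - cache-hostile.
    static float sumColMajor(float[][] a) {
        float s = 0f;
        for (int x = 0; x < a[0].length; x++)
            for (int y = 0; y < a.length; y++)
                s += a[y][x];
        return s;
    }

    public static void main(String[] args) {
        float[][] a = new float[1024][1024];
        for (int y = 0; y < 1024; y++)
            for (int x = 0; x < 1024; x++)
                a[y][x] = 1f;
        System.out.println(sumRowMajor(a) + " == " + sumColMajor(a));
    }
}
```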

@Riven: that's extremely interesting, I'll have to play around with that type of effect. Do you have any advice as to when you can generally get away with stuff like that?

You can only do calculations that can operate fully on the registers, 'while' waiting for memory I/O.

IIRC, a memory lookup can take up to 40 cycles (depending on memory speed and CPU architecture) when it's not in cache. Even cache is slow compared to registers. This is the sole reason why programming in ASM can give such great performance: manually fiddling with registers in a way that compilers simply can't even get anywhere near. In Java we can only put as much as possible (primitives, not arrays, which involve memory I/O) in local variables and hope for the best.


I understand the | ^ & part related to ASM, but I have trouble understanding where the stall happens. I mean, for the value to be passed to the function it must first be read, and then the function is called, no?

Anyway, it seems very interesting - can you provide a different sample for better understanding, please?

I read your sourcecode of your benchmark, and it's rather poor, unfortunately.

Ouch. But I took another look, and that's actually a fair assessment, though only for one of the reasons you posted - the ordering criticism is valid and I definitely should have known better than to ignore it (which does effectively render the whole test moot, so you're ultimately correct). But for the sake of nitpicking, I have to disagree with the other stuff.

Quote

Have you got any idea how long a division takes? It might be like 25-50% of all time spent in that loop.

(re: for (int i=0; i<floatArray.length/2; ++i) )

This division, expensive as it might be, only happens once, not every time through the loop, since the JVM can figure out that it's constant. It might even be converted to a shift. This is actually the first thing I tested (I trimmed all that code out for simplicity's sake), since these types of optimizations are often the lowest hanging fruit in C. Fortunately (or not, if you're looking for places to optimize!) inefficiencies in loop handling code seem to be completely optimized away by the JVM, as I mentioned before. You can even do all sorts of nasty stuff like calling simple functions like getters to get the upper range for the loop, and for the most part it adds no overhead as long as the limit is actually constant (there's probably some trickery you could pull to make the JVM worry that it won't be, in which case you would be right, but this is not one of those cases).
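For reference, the two forms under discussion, side by side ("floatArray" stands in for whatever data is being walked). Whether the JVM hoists the division itself is exactly the point of contention; hoisting it by hand is always safe when the array length cannot change mid-loop, and both forms iterate the same number of times.

```java
public class LoopLimit {
    public static void main(String[] args) {
        float[] floatArray = new float[10];

        // Division written in the loop condition.
        int count1 = 0;
        for (int i = 0; i < floatArray.length / 2; ++i) count1++;

        // Division hoisted out by hand.
        int half = floatArray.length / 2;
        int count2 = 0;
        for (int i = 0; i < half; ++i) count2++;

        System.out.println(count1 + " " + count2); // 5 5
    }
}
```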

I tried anyways, though - replacing my original code with yours resulted in no significant difference (I removed all but the interleaved test and ran it separate times to hopefully avoid ordering issues, and got roughly the same results on each run):

Again, not on today's VMs, especially if the 2 is written as a compile time constant. This is one of the easiest optimizations for a compiler to make, so I would be shocked if the JVM wasn't doing it. If I'm correct it's even done in most C/C++ compilers these days, though I haven't checked that.

(BTW, I'm only making a claim about desktop Java on a recent JIT - on embedded, all this stuff is vital to getting passable performance, and it's probably increasingly important as you move back from 1.6 even on the desktop)

I threw together a quick benchmark to verify all this. I think I worked around the ordering issue by picking from the two variants (which should be identical apart from your suggested changes, and perform an array copy with a transformation and even/odd flop so as to try and confound any unexpected cleverness from the JVM) randomly, but let me know if you think there's still likely a bias here:

The difference between the results differs from run to run and depending on the size of the array, sometimes putting the optimized version ahead, sometimes the unoptimized one; in any case, I'm not convinced by these results that there's anything going on here.

Feel free to prove me wrong, though, I'm just going off of what I've noticed myself, and you definitely know a lot more about the internals than I do.

Quote

Further, changing the order of the tests in the benchmark will probably have a significant effect on the outcome.

There you're 100% right, and I'm a damn fool since I know that's always an issue. I thought I had checked for this but I just reordered things manually and the results come out different on Windows (haven't checked OS X). My bad.

The earlier conclusion (from the OS X tests I ran, not the Windows ones) is actually amplified by this mistake, so I'll restate it: there seems to be no predictable and exploitable difference between performance of a single interleaved array and two separate arrays, as the performance depends on a lot of other factors that you will likely not have much control over in a live app, and no matter what, as long as you use the server VM, performance beats cache-unfriendly C++ code, so the JVM is doing a lot of lifting even for simple loops.

Which brings me to the next obvious question: is there any logic to how test ordering affects timing results? Since this seems to affect things up to 50% or more, it would be fantastic to have a better idea how to avoid those hits against speed whenever possible, but it initially seems to be totally random - surely there's some method to the madness?

Quote

Again, not on today's VMs, especially if the 2 is written as a compile time constant. This is one of the easiest optimizations for a compiler to make, so I would be shocked if the JVM wasn't doing it. If I'm correct it's even done in most C/C++ compilers these days, though I haven't checked that.

No no no no! (no.) Indeed C compilers have done this forever, but the JVM never has. If you don't see a difference, that means your bottleneck (like memory I/O) is somewhere else.

You might have read my thread about MappedObjects that I created yesterday. I added code to handle the case where the stride is a power of two, replacing a single *16 by a <<4 (or *32 by <<5, for that matter), and saw the performance double. That was because the bottleneck was actually there. As you saw, in another situation, I could squeeze in a 'massive' swap64 method without a speed penalty. So if you're not seeing your optimisation paying off, that doesn't mean your code hasn't become faster, just that the CPU is waiting on something else. Once you remove the other bottleneck, you'll notice that your earlier optimisation did speed up your code.

The VM is really not that smart, although it's trivial to replace a constant multiplication by a bitshift.

I'll do some actual tests with your benchmark (yesterday I didn't have the time). There are so many more tricks to make that code faster (that's why I said it was rather poor). Your main slowdown is in the loop overhead. AFAIK the JVM does some loop unrolling, but doing it yourself (copy the body 2-4 times inside the loop) is much faster.
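A minimal version of the manual unrolling described here, not the actual benchmark code: the loop body is copied four times per iteration, with independent accumulators to avoid a serial dependency, and a scalar loop mops up the 0-3 leftover elements.

```java
public class Unroll {
    // Plain loop: one element per iteration.
    static float sum(float[] a) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    // Manually unrolled 4x, with separate accumulators.
    static float sumUnrolled4(float[] a) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        int limit = a.length - 3;
        for (; i < limit; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        float s = s0 + s1 + s2 + s3;
        for (; i < a.length; i++) s += a[i]; // remaining 0-3 elements
        return s;
    }

    public static void main(String[] args) {
        float[] a = new float[1001];
        java.util.Arrays.fill(a, 2f);
        System.out.println(sum(a) + " == " + sumUnrolled4(a));
    }
}
```

Note that the unrolled version changes the floating-point summation order, which can matter for ill-conditioned data; here the values are uniform so both sums are exact.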

Again, it's hard to do micro-benchmarking, not only because the JVM rewrites most of your code, but also because you might think the bottleneck is somewhere else, or are even unaware of certain types of bottlenecks.

Last, you have that giant method that runs all the benchmarks. The JIT stops optimising methods that are longer than a certain number of instructions. That 'magic amount' is rather low, and your method will certainly not be turned into the most efficient native code. You can take a look at the output of the debug VMs (release candidates with debug command-line options, IIRC) from Sun, which will tell you when the JIT gives up.

I'll post some tests later today.


That all makes sense (er, well, let me rephrase: let's just say that your explanation makes sense - the behavior itself is %^$&@ maddening, that simple local optimizations like that are not somewhere in the compilation process when they would seem so straightforward to implement at literally any point along the path from .java text to native code, and I'm further surprised to see firsthand that manual unrolling of trivial loops actually helps things in any significant manner, but you're right, it absolutely does! End rant.). I suppose I'm still too often confusing the "should be" with the "is" when it comes to the JVM (a habit I must re-disabuse myself of), and unfortunately, as you mentioned, it's tricky to accurately measure these things, so a lot of times there's confirmation bias at work.

I wrote a longer response before with some more questions, but just had a computer crash so I lost it, and I'm about to go to sleep, so I'll hit the main point. Please, don't waste too much effort optimizing that code I posted before, it was just a quick stab, not anything I actually need, especially since I pretty much already have the conclusion I wanted from it (major speed boosts based on simple cache considerations when picking data representation are not quite as automatic in Java as in C++, so more detailed examination is necessary).

To me it would actually be most useful to see a simple example exhibiting, for instance, a well enough optimized loop that I could see for myself an effect like the speed difference between a multiply and a shift - while I definitely trust you on that, I tried briefly and wasn't able to make the difference visible, despite a very no nonsense loop unrolled a bit with no other obvious bottlenecks. But it's becoming clear to me that a lot of my conceptions about what is causing stalls have to shift considerably, especially in Java, and a lot of things that I would have thought would hurt performance may actually help it, so it would be great if you could point me to a pared down example that I could take as a baseline and work from.

BTW, do you know of any good references about this stuff that are somewhat recent? Googling seems to pull up mostly either ancient optimization guides from the dot com era or feel good assurances from "experts" that insist that we don't need to optimize Java code because the next version of the VM will do everything including our taxes for us anyways...

@DzzD: yeah, a lot of that could be played with quite a bit, however what I was intending to test there was whether the sole difference between a multiply and a shift was visible with all else remaining the same, so that was mostly just contrived to use a bunch of shifts for no purpose but to use them; turns out that I sort of misinterpreted what Riven was saying anyways, which renders the example moot.

I always feel vaguely bad every time I pull out subexpressions like that in Java, though, as if it's a) wasted effort, and b) perhaps doing more damage than good - am I again just feeling unreasonable pangs of guilt based on what Some Guy At Sun once said about how I should let the JVM do that type of work for me, or do those types of optimizations actually happen inside HotSpot? I mean, this is compilers 101 stuff, but I suppose without verification I should not assume much.

i++ not optimized? Hm, can anyone confirm that? ++i and i++ should definitely be translated differently from each other, but it would surprise me to learn that they are not each being converted to their best representations in bytecode.

(also, careful - the new code doesn't do the same thing anymore, you'd need to pull the shift from the limit declaration)

@DzzD: yeah, a lot of that could be played with quite a bit, however what I was intending to test there was whether the sole difference between a multiply and a shift was visible with all else remaining the same, so that was mostly just contrived to use a bunch of shifts for no purpose but to use them; turns out that I sort of misinterpreted what Riven was saying anyways, which renders the example moot.

OK, so the problem in your bench is that *2 or <<1 uses so little CPU that the final benchmark result won't be affected, as most of the time will be spent in the array[i] read/write ops.

<<1 is faster than *2 - this is a fact in any language, as it is related to the ASM/machine op codes. E.g. *2 requires loading two registers, performing the mul, and sometimes storing the result back.

But let's say <<1 uses about 2 CPU cycles, *2 maybe about 5, and all the array[i] + for + other stuff maybe 200 cycles.

In one case your full loop will spend 202 cycles and in the other 205, so you won't see any difference in your results.

Quote

I always feel vaguely bad every time I pull out subexpressions like that in Java, though, as if it's a) wasted effort, and b) perhaps doing more damage than good - am I again just feeling unreasonable pangs of guilt based on what Some Guy At Sun once said about how I should let the JVM do that type of work for me, or do those types of optimizations actually happen inside HotSpot? I mean, this is compilers 101 stuff, but I suppose without verification I should not assume much.

Hum... factoring out makes the code both optimised and cleaner (more readable) - this is good practice!! Just imagine that for some reason you have to change i+1 to i+2...

Quote

Some Guy At Sun once said about how I should let the JVM do that type of work

I don't think they were thinking of things like the following, which is absolutely awful code:

y = x*2;
z = x*2;
w = x*2;
t = t + x*2;

(We're now thoroughly off topic, which is fine by me, this is all interesting stuff - just wanted to put a disclaimer that nothing in this post relates to cache friendly programming anymore, this is 100% in the nitpicky micro-optimization category!)

Quote

<<1 is faster than *2 this is a fact in any language, this is related to ASM/Machine op code. ex: *2 requiere loading two register, perform mul and sometime store result back.

Right, and I would never make the claim that the actual multiplication operation was as fast as a shift, I'm well aware of the differences in the operations once we're down to bare metal. But in most programming languages, there actually is no difference in the assembly generated when you type in n*2 as opposed to (n<<1), they are both turned into left shifts; if that's happening, then it's generally better practice to just write what you mean (esp. given how easy it is to screw up the precedence rules with shifts, for instance by accidentally writing (x<<1 + 1) to replace (2*x + 1)). I was simply surprised by the claim that the JVM would let this type of thing slide by, as it's a textbook example of an easy to implement peephole optimization.

After another failed stab at making a performance difference visible, I took a dig through the HotSpot 1.6.3 source to try and see exactly what's going on and why this might have been left out (it's all quite interesting, I'm absolutely going to dig through more of it to see what's happening under the hood), and I see that it definitely does have code that can recognize an integer multiply by a constant power of two and transform it to a left shift (see share->vm->opto->multnode.cpp at line 145 in MulINode::Ideal(PhaseGVN *phase, bool can_reshape)). But I don't know enough about the internals to immediately see whether it's live or not, or whether it's perhaps a low priority optimization that is usually never reached, and don't have the time right now to get this beast compiling and throw in some traces to see whether that code is getting hit. Maybe if I have some extra time this week.

Riven, is it possible that the performance improvements you're seeing as a result of those replacements are happening in functions that for some reason or another are getting skipped over during optimization, perhaps for method length reasons like you mentioned before? I totally expect you to prove me wrong on this again, by the way, but I can't help but ask now that I'm actually looking at the source that's supposed to do this transform!

Quote

I don't think they were thinking of things like the following, which is absolutely awful code:

y = x*2;
z = x*2;
w = x*2;
t = t + x*2;

Yes, that code is bad bad bad from a maintenance and readability perspective. But I'm not sure off the top of my head whether it will be any slower, ultimately, than if you eliminated the extra multiplies, as some compilers will recognize that they can simplify and generate some very optimal assembly from that code, especially in such a simple situation. I don't know if HotSpot would do very well at it, though I'd be curious to know, because most of the time when I omit common subexpression eliminations it's because the algorithm is such that the change would actually reduce the readability or logic of the code, so there is a bit of a real tradeoff. This happens a lot with math expressions, where you are carrying out a tricky geometric procedure or something like that, and for clarity it's best to preserve the geometrical objects' locality in code rather than slicing and dicing to reduce the operation count.

Quote

there actually is no difference in the assembly generated when you type in n*2 as opposed to (n<<1), they are both turned into left shifts;

Arf - as that is not 100% true, it's better to get in the habit of using <<1 or <<2 etc. You will eventually use it automatically, and after several years of thinking in binary you won't see any difference between *2 and <<1 when reading code.

@DzzD: I think we're fundamentally in agreement here about the speed issue, we just differ a little bit as far as style goes.

Oh, I was curious, so I looked into the increment issue you mentioned before, and I don't think you need to worry about it, if I understand your concern correctly - both ++i and i += 1 are translated to the same bytecode (IINC 1 1, ILOAD 1), and i++ turns into (ILOAD 1, IINC 1 1), as expected. Though while we're on the topic, it's worth mentioning that 32767 is the cutoff for the IINC operator: i += 32768 translates to (ILOAD 1, LDC 32768, IADD, ISTORE 1, ILOAD 1), which could potentially be slower. Not that anyone's usually "incrementing" by numbers that big, but...
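A tiny sketch of that difference in expression value (the bytecode itself can be inspected with `javac Inc.java && javap -c Inc`): as a standalone statement both forms increment identically; they only differ in the value the expression yields.

```java
public class Inc {
    public static void main(String[] args) {
        int i = 0;
        int a = i++;   // a gets the OLD value: 0 (i is now 1)
        int b = ++i;   // b gets the NEW value: 2
        System.out.println(a + " " + b); // 0 2
    }
}
```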

java-gaming.org is not responsible for the content posted by its members, including references to external websites and other references that may or may not have a relation with our primarily gaming and game production oriented community. Inquiries and complaints can be sent via email to the info account of the company managing the website of java-gaming.org.