Revenge of the Titans continues apace, with all sorts of extreme niceness being packed in there. Unfortunately this niceness has come at a bit of a cost; we're now rendering around 1,000 sprites per frame (in maybe 100 draw calls) and we're getting performance issues (that is, rendering at less than 60fps). Investigation reveals that the single bit of compiled code taking the most time is the writeSpriteToBuffer call in the DefaultSpriteRenderer.

I think that buffer bounds checks are probably accounting for a not inconsiderable time waste, and probably also some general Java inefficiency at dealing with floating point mults and adds, but without looking at the equivalent assembly produced by C++ and the JVM I can't really speculate. And anyway, there's pretty much bugger all I can do about the contents of that method: I have to do everything in there.

Also, the number of draw calls cannot really be optimised any more than it already is: the sprites are already sorted into the best possible (and only possible) rendering order by virtue of their Y coordinates, layers, sublayers, texture IDs, and rendering state.
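For anyone following along, a sort over several fields like that can be done with one packed key per sprite. This is only a minimal sketch: the class name, the field widths and the priority order (layer, then Y, then sublayer, texture, state) are all assumptions for illustration, not the actual engine's layout.

```java
import java.util.Arrays;

/** Hypothetical packed sort key; widths and field order are assumptions. */
final class SpriteSortKey {
    // Assumed layout, most significant first:
    // 7 bits layer | 16 bits Y | 8 bits sublayer | 16 bits texture ID | 16 bits render state.
    // The layer is limited to 7 bits so the key stays non-negative and
    // a plain signed sort of long[] gives the intended order.
    static long pack(int layer, int y, int sublayer, int textureId, int stateId) {
        return ((long) (layer & 0x7F) << 56)
             | ((long) (y & 0xFFFF) << 40)
             | ((long) (sublayer & 0xFF) << 32)
             | ((long) (textureId & 0xFFFF) << 16)
             | (long) (stateId & 0xFFFF);
    }

    /** Recovers the Y field from a packed key. */
    static int y(long key) {
        return (int) ((key >>> 40) & 0xFFFF);
    }

    /** Sorts the keys in place; sprites sharing all higher fields end up adjacent. */
    static void sort(long[] keys) {
        Arrays.sort(keys);
    }
}
```

With keys like these, sprites that agree on every higher-priority field land next to each other after the sort, which is what lets consecutive sprites share a draw call.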

So, the only other thing I can think of that will optimise things is using some modernfangled OpenGL cleverness like VBOs or somesuch. I used to use NVidia's fence stuff and AGP RAM but no longer as it only worked on half the machines out there.

Will VBOs make things any faster for me? The actual amount of vertex data is pretty piddly - we're only talking 1,000 sprites a frame here, that's 4,000 vertices, or 125kb (yes, kilobytes!) of data being sent to OpenGL, and that's being split up into roughly 100 1kb chunks anyway by virtue of the required rendering order.

I'm otherwise rather dismayed at the atrocious performance I seem to be getting these days.

You can try pseudo-instancing or ARB_draw_instanced on modern GPUs. You could also use texture atlases or a unified fragment shader that does multiple kinds of rendering, to reduce the number of draw calls.

It all depends on the specifics of your rendering code of course. There are lots of tricks you can use if you're willing to exploit shaders, but I'm fairly sure you're going to want a solution that works on older hardware. Unfortunately, I don't think simply switching to VBOs would make a significant difference, for sprite rendering that is. It could provide a slight performance boost, but I find that VBOs are too platform/vendor sensitive. I wouldn't depend on them on pre-shader hardware anyway.

From a brief look at your code, I see you're still using Java's default Math library (cos, sin, tan, atan and atan2 are all very slow in it). You should really be using Riven's FastMath library; it's up to 8x faster than Java's default Math library. From what I've seen it really does have a big impact in speeding up code, especially for action-packed 2D games. Another nice thing about it is that it uses float maths throughout, so there's no need to cast from double or even use double. (Riven's site seems to be down atm but might be up again soon.)

@Spasi - looks like it might be fraught with pitfalls and complexity, I have a feeling also that it won't help too much.

@pjt33 - profiling shows I'm spending, er what was it, 7.5% of the time writing sprites and 35% of the time calling gl commands (25% glDrawArrays or so). That's on my uber-rig. I wondered if by using some sort of magic VBO memory the glDrawArrays commands would either be quicker or return asynchronously or something clever like that, letting me get on with doing something else with the CPU other than waiting for the GPU to finish.

@kapta - 99% of the sprites aren't rotated, so the vast majority of sprites drawn never go near sin or cos, so there's little point in trying to optimise those calls away. About half the sprites in the frame are scaled, which makes for 4 floating point mults in addition to the floating point adds. I suspect that's probably far less time than I worry about. I would like to know if there was some way to disable buffer bounds checking and see if that's slowing things down much but I kinda doubt it - after all the actual drawing is taking 4x as long as the data collection phase.

I barely use random numbers, and I barely use Lists. The GC rarely ever fires.

My MvR engine was mostly DisplayLists and the like. With my new engine, I switched over to VBOs (which I admit, took some restructuring). The speed increase was VERY noticeable. I even switched my font rendering class over (which is basically like your sprite rendering) and noticed a big leap there as well. I have to wonder if the drivers these days are just geared that way.

The other big leap I saw was when I moved from individual textures for sprites to a single, larger texture with all the sprite graphics in it. It didn't help much when I was using DisplayLists, but with VBOs it was a different story.

NOW, my engine isn't just for sprites, so your mileage may vary.

Here, this will help you on your way (it's the least I can do for the help you've given me):

Actually that's rather helpful just having it there to stare at. I've been skirting around VBOs for years because they used to be a bit rubbish and I never really needed the performance before. Let's hope that it gets me that much-needed speed increase! (I've managed to get another 10% boost by using the server VM too)

DOH! And I just remembered... I super-boosted my particle engine performance by moving all of the QUAD sprites for it into a SINGLE VBO!!!! Yes! A SINGLE VBO!!! It drastically reduced my number of GL calls.

My ENTIRE list of viewable particles (which are all QUAD sprites) is called by this single function:

That's the power of VBOs. When I update the particles, I literally just clear the FloatBuffers, recalculate each particle (I have a particle array for ongoing data), throw them back into the FloatBuffers via .put(), and then call:

My sprite engine's a bit more generic than that, and it's already pretty much doing that wherever it can. One of the real killers is Y-sort. I wonder if there's any easy way I can optimise that part. I only need to Y-sort if the sprites actually overlap each other. What I need is an algorithm to band them together, even if roughly.

How large are these sprites actually? I did a quick test with a pretty stupid sprite blitter that even uses Collections.sort() each frame to do the y-sort (http://www.jpct.net/download/hacks/SpriteTest.zip - SPACE to add 1000 sprites, s to toggle size) and tried how many alpha blended and scaled sprites I could blit onto a 1024*768 screen before falling below 60 fps. The results (large sprites/small sprites) ranged from ~12000/~20000 on a Core2 Quad @ 3.2Ghz/Radeon 5870 down to ~100/~2100 on a P4 @ 2.2Ghz/Geforce 2 Go, with an older midrange system (AthlonX2@2200Mhz/ATI 3650 AGP) being somewhere in the middle with ~1000/~12000. What kind of system are you actually running your tests on?

Riven: I just tried your method of using an index for the buffer puts... it cut my performance more than in half. I get much better performance clearing the buffer and then refilling it with the updated information. I'm therefore guessing you were specifically talking about instances where you're only updating a small portion of the data in the buffer?

Riven mentioned the Buffers put() operation: for me single put()s are pretty slow, so I try to put() everything in the Buffer with a single put(). To do that I keep arrays for all the data (vertex, color, etc.) and collect the game's data in there every frame, then put it in the buffer with one call. I don't even create the arrays on every render call, to keep the garbage low. Instead a simple int tracks how much of the array is in use. Works like a charm for me.
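A minimal sketch of that pattern, for reference. The class and method names (VertexStaging, addQuad, flush) are made up for illustration and the quad layout is an assumption; only the java.nio calls are real.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical staging helper: fill a reused float[], then copy it with one bulk put(). */
final class VertexStaging {
    private final float[] scratch;     // allocated once and reused, so no per-frame garbage
    private final FloatBuffer buffer;  // direct buffer that would be handed to OpenGL
    private int used;                  // number of floats currently valid in scratch

    VertexStaging(int maxFloats) {
        scratch = new float[maxFloats];
        buffer = ByteBuffer.allocateDirect(maxFloats * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    /** Starts a new frame by discarding the previous contents. */
    void begin() {
        used = 0;
    }

    /** Appends the four corners of an axis-aligned quad as x,y pairs (assumed layout). */
    void addQuad(float x, float y, float w, float h) {
        float[] s = scratch;
        int i = used;
        s[i++] = x;     s[i++] = y;
        s[i++] = x + w; s[i++] = y;
        s[i++] = x + w; s[i++] = y + h;
        s[i++] = x;     s[i++] = y + h;
        used = i;
    }

    /** Copies everything collected so far into the buffer with a single put(). */
    FloatBuffer flush() {
        buffer.clear();
        buffer.put(scratch, 0, used);
        buffer.flip();
        return buffer;
    }
}
```

Per frame that would be begin(), one addQuad() per sprite, then flush() and hand the returned buffer to the GL call.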

You saw:

Quote from: Riven

Due to the non-deterministic behaviour of HotSpot regarding Buffer performance, it might even be better to dump your data into float[]s and byte[]s and put() them into your FloatBuffer/ByteBuffer

?

FloatBuffer performance is so erratic that it is simply best avoided. Often absolute puts are about 2-4 times as fast as relative puts. Sometimes they are even slower, especially if you have created a heap FloatBuffer somewhere else. It sometimes even has completely different performance characteristics in two identical VM launches. float[] is fast, always, independent of the day of the week.
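In case the terminology isn't clear, here are the three ways of filling a FloatBuffer being compared, as a rough sketch. The class and method names are hypothetical; the put() overloads are the standard java.nio ones. All three leave the buffer in the same state, and which one is fastest on a given VM is something you would have to measure yourself.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical side-by-side of relative, absolute and bulk puts. */
final class PutStyles {
    static FloatBuffer direct(int floats) {
        return ByteBuffer.allocateDirect(floats * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    /** Relative put: each call writes at the current position and advances it. */
    static void relativePuts(FloatBuffer dst, float[] src, int n) {
        dst.clear();
        for (int i = 0; i < n; i++) {
            dst.put(src[i]);
        }
        dst.flip();
    }

    /** Absolute put: the index is passed explicitly and the position never moves. */
    static void absolutePuts(FloatBuffer dst, float[] src, int n) {
        dst.clear();
        for (int i = 0; i < n; i++) {
            dst.put(i, src[i]);
        }
        dst.limit(n); // position is still 0, so only the limit needs setting
    }

    /** Bulk put: the data is staged in a float[] and copied in one call. */
    static void bulkPut(FloatBuffer dst, float[] src, int n) {
        dst.clear();
        dst.put(src, 0, n);
        dst.flip();
    }
}
```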

One of the real killers is Y-sort. I wonder if there's any easy way I can optimise that part. I only need to Y-sort if the sprites actually overlap each other. What I need is an algorithm to band them together, even if roughly.

Sounds like a case for shell sort.
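For reference, a minimal shell sort over sprite indices keyed by Y. The parallel-array layout (an order[] of indices and a y[] of coordinates) and the simple halving gap sequence are assumptions for the sketch. Note that shell sort is not stable, so sprites with equal Y may swap places between frames.

```java
/** Hypothetical Y-sort using shell sort with a halving gap sequence. */
final class YShellSort {
    /** Reorders order[] (indices into y[]) so that y[order[i]] is ascending. */
    static void sort(int[] order, float[] y) {
        int n = order.length;
        for (int gap = n / 2; gap > 0; gap /= 2) {
            // Gapped insertion sort: every gap-th element forms a sorted chain.
            for (int i = gap; i < n; i++) {
                int idx = order[i];
                float key = y[idx];
                int j = i;
                while (j >= gap && y[order[j - gap]] > key) {
                    order[j] = order[j - gap];
                    j -= gap;
                }
                order[j] = idx;
            }
        }
    }
}
```

If order[] is kept from the previous frame it is usually already close to sorted, so the inner while loop does very little work on most passes.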

My experience with VBOs (setting up a buffer per triangle group, populated at load-time) was that they produced no measurable performance difference on my box, even though model rendering was a bottleneck. GeForce 8400GS.

This is very peculiar, I have an 8500 that sees a 3-4x performance boost when using VBOs compared to normal vertex arrays.

I think the main problem is the number of state changes I have to do as a result of too-fine Y sorting. I absolutely have to do Y sort because my sprites have to appear in front of each other when they overlap - but there's the key: they don't have to if they don't overlap. So I am going to adjust the Y sort to a "band sort", where I take into account the Y hotspot and height of the image to be displayed. Unfortunately that's probably going to break my neat radix sort. But it may cut down on state changes by an order of magnitude. One issue I have noticed is texture state change thrashing - drawing a sprite using one texture then switching to another texture to draw another sprite, over and over, one sprite at a time, for the irritating case where the images are placed in different texture atlases. I need some way of optimising this, either by getting statistics from a run and applying them to the sprite packer, or by some dynamic reorganisation of the atlases at runtime.

I've now interleaved the writes and draws per render state, so that I write a few sprites out at a time, set up the render state, call glDrawArrays, and then start immediately on the next set of sprites.
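To make the batching concrete, here is a small sketch of splitting an already-sorted sprite list into runs that share a render state. The names (StateRuns, Run, split) and the idea of one int state ID per sprite are assumptions for illustration, not the actual renderer's structures.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical batcher: consecutive sprites with the same state form one run. */
final class StateRuns {
    /** One batch: count sprites starting at sprite index first, all using state. */
    static final class Run {
        final int state;
        final int first;
        final int count;

        Run(int state, int first, int count) {
            this.state = state;
            this.first = first;
            this.count = count;
        }
    }

    /** Splits the per-sprite state IDs (in final draw order) into runs. */
    static List<Run> split(int[] stateIds) {
        List<Run> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= stateIds.length; i++) {
            // Close the current run at the end of the list or when the state changes.
            if (i == stateIds.length || stateIds[i] != stateIds[start]) {
                runs.add(new Run(stateIds[start], start, i - start));
                start = i;
            }
        }
        return runs;
    }
}
```

Each run would then be one state setup followed by one glDrawArrays call covering first * 4 to (first + count) * 4 vertices, so the number of runs is the number of state changes.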

I'm going to implement VBOs and see if they improve the asynchronicity of glDrawArrays. BTW is glDrawArrays the best way to do this or would glDrawRangeElements be better (given the totally sequential nature of my vertices I suspect not but it's worth asking...?)

Off the top of my head, I know I've been through my own battle over glDrawArrays, glDrawElements and glDrawRangeElements. glDrawRangeElements is definitely the way to go over glDrawElements; although the performance differences aren't significant, it never hurts anything. For me, on an 8500MT card I found that glDrawArrays was slower than using glDrawRangeElements with indices that were sequential. This was done a long time ago and was fairly informal, so my best advice is to try it out yourself.

With Y sort on, my "typical scene" of about 1300 sprites takes 150 state changes to draw.

With Y sort completely turned off, it takes 50 state changes to draw.

With band sorting, where I use an interval tree to determine if sprites are overlapping, I get a pretty similar 60 or so state changes to draw the same scene. This is a massive reduction in state changes and calls to glDrawArrays. However, the interval tree code I found (in OpenJDK) is so slow that my frame rate plummets to 10fps.

I'm now working on getting a much much faster interval tree of my own implementation. If that achieves any speedup I'll let you know...
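For comparison, a rough banding can also be done without any tree: sort the sprites by their top edge and sweep once, starting a new band whenever there is a vertical gap. This is a different, coarser technique than the interval tree described above, and only a sketch: the names are hypothetical, and it assumes each sprite has a top and bottom Y (derived from hotspot and height) with top <= bottom.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sweep that groups sprites into bands of overlapping Y extents. */
final class BandSort {
    /**
     * Returns a band number per sprite. Sprites in different bands never
     * overlap vertically, so Y order only matters within a band.
     */
    static int[] assignBands(float[] top, float[] bottom) {
        int n = top.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> top[i]));

        int[] band = new int[n];
        int current = -1;
        float reach = Float.NEGATIVE_INFINITY; // lowest bottom edge seen in the current band
        for (int idx : order) {
            if (top[idx] > reach) {
                current++;                      // vertical gap: start a new band
                reach = bottom[idx];
            } else {
                reach = Math.max(reach, bottom[idx]);
            }
            band[idx] = current;
        }
        return band;
    }
}
```

The catch is that a chain of sprites which each overlap only their neighbour collapses into a single band, so a crowded scene may end up with very few bands; a per-pair overlap test is finer-grained.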

If you have a lot of sprites that potentially aren't moving at once, you can try drawing them once into an FBO and then draw the FBO instead. Only update the rects in the FBO that have objects which have moved. I'm not sure if this will be useful for you, but it has been really big for my own optimizations because I have a lot of objects that don't move all the time.
