When Riven posted his cool new buffer mapping approach a year ago I realized how important uploading data to the GPU is for performance, and I've been using that approach ever since. Recently Nvidia published some slides on reducing OpenGL driver overhead using the latest OpenGL extensions, and one thing they talk about is how to improve buffer mapping performance. Trying to implement this I ran into some funny interactions with driver multithreading and instancing, but I believe I have finally come out on top! Sadly, the required functionality has only been implemented by Nvidia so far.

What's wrong with unsynchronized mapped buffers?

When I first looked through the slides I was genuinely surprised that they claimed mapping buffers was slow, so I decided to do some heavy profiling in the graphics engine I'm working on. I created a scene with a massive number of shadow-casting point lights. The scene has to be culled and rendered six times per light to generate their shadow maps, which is an extremely CPU-intensive task. The most expensive OpenGL calls were glDrawElementsInstanced() at 27.7% (!) and glMapBufferRange() at 11.1% of the CPU time. Sure, buffer mapping is a real performance hog, but it's not that much compared to frustum culling at around 25-30%. The test was rendered at an extremely low resolution, so the GPU load was only around 70%, indicating a CPU bottleneck.

What got me thinking was Nvidia's claim that mapping a buffer forces the driver's server and client threads to synchronize. What server thread? In the Nvidia control panel there is a setting called "Threaded optimization" which controls driver multithreading. I have been keeping this at "Off", since when left at the default setting "Auto" it can completely ruin performance in some of my programs. Forcing it to "On" caused performance to drop by 40%, just as I remembered, but the profiling results were completely different. glMapBufferRange() and glUnmapBuffer() now accounted for 51.1% and 10.5% respectively. Holy shit! Another surprising entry was glGetError(), which I call just twice per frame yet which took 9.4% of the CPU time. These three OpenGL functions took up 71.0% of my CPU time! glDrawElementsInstanced(), on the other hand, was now down to 0.2% of the CPU time. What the hell is going on here?!

From what I can see, Threaded optimization adds an extra driver thread to which all OpenGL commands are offloaded. This extra thread complicates the OpenGL pipeline even further.

Something you learn very quickly when working with OpenGL is to avoid certain functions, like glReadPixels() (without a PBO, of course), that force the CPU to wait for the GPU to finish its pending work. Similarly, mapping a buffer forces the game thread to wait for the server thread to finish any pending operations, stalling the game thread until the buffer can be mapped. What we're seeing is not glMapBufferRange() becoming more expensive; we're seeing a driver stall! Using unsynchronized VBOs eliminates the synchronization with the GPU (which would otherwise ensure the data is no longer in use), but the internal driver thread synchronization cannot be avoided this way.
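For reference, the unsynchronized rotating pattern mentioned above can be sketched roughly like this. This is a minimal illustration, not Riven's actual code: the class name, ring size, and stand-in buffer IDs are my assumptions, and the real GL calls appear only as comments since they require a live context.

```java
// Minimal sketch of rotating over several buffer objects so that the buffer
// written this frame is never one the GPU might still be reading from.
// With GL_MAP_UNSYNCHRONIZED_BIT the driver skips the GPU sync entirely,
// so the rotation itself is what keeps the data safe. A ring size of 3 is
// an assumption (enough to cover typical driver frame queueing).
class BufferRing {
    private final int[] bufferIds;
    private int frame;

    BufferRing(int size) {
        bufferIds = new int[size];
        for (int i = 0; i < size; i++) {
            // bufferIds[i] = glGenBuffers();
            bufferIds[i] = i + 1; // stand-in IDs for illustration
        }
    }

    // Returns the buffer object to write into this frame.
    int next() {
        int id = bufferIds[frame % bufferIds.length];
        frame++;
        // glBindBuffer(GL_ARRAY_BUFFER, id);
        // glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
        //         GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
        return id;
    }
}
```

Note that this avoids only the GPU-side wait; as described above, the client/server driver thread handshake still happens on every map call.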

The solution

The solution is actually ridiculously simple: don't map buffers, or rather, don't map buffers every frame. The OpenGL extension ARB_buffer_storage introduces two new buffer mapping bits, GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT. Before ARB_buffer_storage it was impossible to use a buffer on the GPU while it was mapped. GL_MAP_PERSISTENT_BIT gets rid of this restriction, allowing us to simply keep the buffer mapped indefinitely. GL_MAP_COHERENT_BIT ensures that the data we write to the mapped pointer is immediately visible to the GPU without having to manually flush anything. Together, when used in an unsynchronized rotating manner just like Riven's approach, they completely eliminate the need to map buffers except once when the game is first started. Since the new method is so similar to Riven's approach, it's easy to create a generic interface which can be implemented with both unsynchronized VBOs and persistent VBOs. In the following code, map() is supposed to expand the VBO in case it is not large enough to satisfy the requested capacity. Buffer rotation is handled elsewhere.
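A minimal sketch of what such a generic interface and its unsynchronized implementation could look like. The names (MappedBuffer, UnsynchronizedVBO) are assumptions, the GL calls are shown only as comments since they need a live context, and a plain direct ByteBuffer stands in for the driver-provided pointer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical generic interface: map() must return a ByteBuffer with at
// least requestedCapacity bytes, expanding the underlying VBO if needed.
interface MappedBuffer {
    ByteBuffer map(int requestedCapacity);
    void unmap();
}

// Sketch of the unsynchronized-VBO implementation. The real version would
// call glMapBufferRange(GL_ARRAY_BUFFER, 0, capacity, GL_MAP_WRITE_BIT |
// GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT) every frame.
class UnsynchronizedVBO implements MappedBuffer {
    private int capacity;

    UnsynchronizedVBO(int initialCapacity) {
        capacity = initialCapacity;
        // glBufferData(GL_ARRAY_BUFFER, capacity, GL_STREAM_DRAW);
    }

    @Override
    public ByteBuffer map(int requestedCapacity) {
        if (requestedCapacity > capacity) {
            // Grow geometrically so repeated small overflows don't
            // trigger a reallocation every frame.
            while (capacity < requestedCapacity) {
                capacity *= 2;
            }
            // glBufferData(GL_ARRAY_BUFFER, capacity, GL_STREAM_DRAW);
        }
        // return glMapBufferRange(GL_ARRAY_BUFFER, 0, capacity, flags);
        return ByteBuffer.allocateDirect(capacity).order(ByteOrder.nativeOrder());
    }

    @Override
    public void unmap() {
        // glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```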

With persistent VBOs, map() no longer needs to actually map the buffer! We map the buffer once when it's created and then simply return the same ByteBuffer instance every time map() is called. If the buffer is too small, though, we still need to reallocate the buffer and remap it.
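Here is a minimal sketch of how such a persistent implementation could look. The class and method names are assumptions; the real version would call glBufferStorage() and a single glMapBufferRange() with GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT (shown as comments, since a live context is required), and a direct ByteBuffer stands in for the driver-mapped pointer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the persistent version: the buffer is mapped once at creation
// and map() just hands back the same ByteBuffer afterwards.
class PersistentVBO {
    private int capacity;
    private ByteBuffer mapped;

    PersistentVBO(int initialCapacity) {
        capacity = initialCapacity;
        mapped = allocateAndMap(capacity);
    }

    public ByteBuffer map(int requestedCapacity) {
        if (requestedCapacity > capacity) {
            // Overflow: with glBufferStorage the storage is immutable, so
            // the buffer object must be deleted, recreated at the larger
            // size, and mapped again.
            while (capacity < requestedCapacity) {
                capacity *= 2;
            }
            mapped = allocateAndMap(capacity);
        }
        // No glMapBufferRange() call on the common path - that is the
        // entire point of persistent mapping.
        return mapped;
    }

    public void unmap() {
        // Intentionally empty: the buffer stays mapped for its lifetime.
    }

    private static ByteBuffer allocateAndMap(int capacity) {
        // Stand-in for:
        // glBufferStorage(GL_ARRAY_BUFFER, capacity,
        //         GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
        // return glMapBufferRange(GL_ARRAY_BUFFER, 0, capacity,
        //         GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
        return ByteBuffer.allocateDirect(capacity).order(ByteOrder.nativeOrder());
    }
}
```

The key property is visible in the sketch: as long as the requested capacity fits, map() is a trivial getter, and the expensive driver synchronization simply never happens.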

For a long time I've been trying to get Nvidia to fix their broken instancing performance. The whole point of instancing is to reduce the CPU overhead of drawing multiple identical objects in different locations, but currently instancing is much slower than simply batching copies of an "instance" together into a single VBO and rendering 64 of them at a time with a normal glDrawElements() call. Instancing is so slow that I actually implemented an Nvidia-specific renderer for parts of my engine to avoid this problem. I'm happy to report that, for one reason or another, instancing performance is through the roof when using persistently mapped buffers, matching or surpassing that of my Nvidia-specific renderer!

Although persistent buffers are a bit faster in raw speed, their other advantages are much more interesting. By allowing efficient use of Threaded optimization (which gets enabled automatically when the setting is left at the default "Auto"), the game thread is freed up to do other things. In fact, around 25% of the frame time with persistent VBOs is spent in Display.update(), stalling because the extra driver thread is still busy, meaning that I could add quite a bit of game logic to the main thread without affecting my game's frame rate. The massive improvement to instancing is also great, and is responsible for most of the performance improvement. It also means that I no longer have to maintain two separate renderers.

It is indeed a massive embuggerance that ARB_buffer_storage is such a new extension, because it's basically unavailable on the majority of machines out there at this time... which means I still have to code for the lowest common denominator, as there's no point getting it "fast enough" on the latest hardware if it runs like poo on everyone else's machines. Even my main development machine doesn't have the extension (Nvidia GTX 280).

To put more emphasis on the gains from eliminating the driver synchronization overhead by not mapping buffers every frame, I decided to time exactly how long my OpenGL calls take to execute, not just the resulting FPS.

Test                           | Total render() time | Time spent by OpenGL thread | Scene FPS
-------------------------------|---------------------|-----------------------------|----------
Unsynchronized single-threaded | 29.5 - 31.0 ms      | 26.5 - 27.0 ms              | 31 FPS
Unsynchronized multi-threaded  | 48 - 55 ms          | 45 - 50 ms                  | 19 FPS
Persistent single-threaded     | 23.8 - 24.3 ms      | 20.0 - 20.6 ms              | 39 FPS
Persistent multi-threaded      | 8.9 - 9.3 ms        | 3.5 - 3.7 ms                | 41 FPS

In essence, persistent buffers cut the total CPU time of my rendering routine to less than a third, a 3.34x improvement. Even more impressive, the time spent by the thread owning the OpenGL context (my engine is multithreaded) was cut to less than a seventh, a 7.3x improvement.

The colored bars show how the engine's worker threads spent their time. The top thread, which does nothing at the beginning, is the thread owning the OpenGL context (a.k.a. the rendering thread). The other 8 threads (running on a hyper-threaded quad-core) are normal worker threads. (Fun fact: I also do stutter-free texture streaming from yet another thread using a separate context.)

The test scene consists of 500 lights, of which 414 are visible (some are outside the bottom of the screen). The engine renders 2484 shadow maps each frame (414 visible lights x 6 cube map faces). These low-resolution shadow maps are packed into 7 shadow map passes to minimize the number of FBO switches required.
