What I personally found interesting is the objective comparison of the new approach, as well as (H)LBVH, to 'state-of-the-art' CPU-side construction with spatial splits (SBVH). Table 3 in the paper shows that HLBVH achieves about 60% of the ray tracing performance of SBVH. The new algorithm clearly does better, approaching SBVH closely. In practice, a Titan can rebuild the BVH for a 282k triangle scene in 9 ms.

Some thoughts on this:
- Although HLBVH builds faster, the impact on ray tracing performance is significant.
- 9 ms means a peak of ~100 fps, assuming ray tracing takes far less time than updating the BVH.
- 282k triangles isn't all that much...

I have been working under the assumption that CPU-side BVH maintenance is always better:
- The CPU can build while the GPU renders, freeing the GPU to spend those 9 ms on rays rather than on building.
- Moreover, when N GPUs are used, the scene is built only once, rather than on each of the N devices.
- The CPU still does a better job in terms of quality. Not in 9 ms obviously, but:
  - Having a top-level BVH and sub-BVHs per scene graph node allows you to adjust build quality based on mesh properties (most relevant: static or not).
  - A fast threaded binned builder matches (or even improves on) the 9 ms figure for dynamic geometry.
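For reference, the binned building I mention boils down to something like the sketch below. This is a hypothetical, simplified single-axis version (names like `bestBinnedSplit` are mine, not from any paper); a real builder evaluates all three axes, parallelizes over triangle ranges, and recurses:

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <vector>

// Minimal AABB; a production builder would keep this SIMD-friendly.
struct AABB {
    float mn[3] = { FLT_MAX, FLT_MAX, FLT_MAX };
    float mx[3] = { -FLT_MAX, -FLT_MAX, -FLT_MAX };
    void grow(const AABB& b) {
        for (int k = 0; k < 3; k++) {
            mn[k] = std::min(mn[k], b.mn[k]);
            mx[k] = std::max(mx[k], b.mx[k]);
        }
    }
    float halfArea() const {
        float e0 = mx[0] - mn[0], e1 = mx[1] - mn[1], e2 = mx[2] - mn[2];
        return e0 * e1 + e1 * e2 + e2 * e0;
    }
};

// Binned SAH split along one axis: drop primitive centroids into K bins,
// then sweep the K-1 candidate planes and return the cheapest one
// (-1 if no valid split exists). Constant SAH terms are dropped.
int bestBinnedSplit(const std::vector<AABB>& prims, int axis) {
    constexpr int K = 8;
    float cmin = FLT_MAX, cmax = -FLT_MAX;
    for (const AABB& p : prims) {
        float c = 0.5f * (p.mn[axis] + p.mx[axis]);
        cmin = std::min(cmin, c);
        cmax = std::max(cmax, c);
    }
    if (cmax <= cmin) return -1;            // all centroids coincide
    AABB binBox[K]; int binCnt[K] = {};
    float scale = K / (cmax - cmin);
    for (const AABB& p : prims) {
        float c = 0.5f * (p.mn[axis] + p.mx[axis]);
        int b = std::min(K - 1, (int)((c - cmin) * scale));
        binCnt[b]++;
        binBox[b].grow(p);
    }
    float bestCost = FLT_MAX; int bestPlane = -1;
    for (int plane = 0; plane < K - 1; plane++) {
        AABB l, r; int ln = 0, rn = 0;
        for (int i = 0; i <= plane; i++)
            if (binCnt[i]) { l.grow(binBox[i]); ln += binCnt[i]; }
        for (int i = plane + 1; i < K; i++)
            if (binCnt[i]) { r.grow(binBox[i]); rn += binCnt[i]; }
        if (ln == 0 || rn == 0) continue;
        float cost = l.halfArea() * ln + r.halfArea() * rn;  // SAH estimate
        if (cost < bestCost) { bestCost = cost; bestPlane = plane; }
    }
    return bestPlane;
}
```

The point is that the per-node work is a couple of linear passes over the primitives, which is why a threaded version gets into the ~10 ms range for a few hundred thousand triangles.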

So: I believe it's better to use the CPU for building and the GPU for tracing. This keeps the CPU from sitting idle, and it avoids doing the same work several times.

Thoughts?

EDIT: I forgot to mention my context. I consider my statement true for a real-time game environment, where 30+ fps is desired and a significant part of the scenery is static. For other contexts, I can imagine GPU building could make sense (although I don't really think so).

jbikker wrote:So: I believe it's better to use the CPU for building and the GPU for tracing. This keeps the CPU from sitting idle, and it avoids doing the same work several times.

Is there really no other computationally intensive work for the CPU? Scene management, collision detection, physics simulation, etc.? Some of these can of course also be done on the GPU. But I'm curious: if the GPU produces a frame in 33 ms (for a 30 fps rate), isn't there already enough work for the CPU to do in those 33 ms?

1. On a quad-core CPU (pretty common if you also have a Titan), it's going to be very hard to keep the CPU occupied at full load. Physics is perhaps the heaviest load in the average game, but things like AI, sound and game logic are all very cheap. And if you're using anything like PhysX, even the physics load runs on the GPU, further unbalancing the load.

2. Even if the CPU has a reasonable load (say, more than 70%), is it really worth it to a) spend half of your GPU compute time on acceleration structure maintenance, halving theoretical peak ray tracing performance, and b) do that ray tracing with either an inferior tree, or one that takes 10ms for less than 300k dynamic triangles?

I am not sure if anyone has ever gathered data on CPU use in modern games; I'll see if I can fire up Borderlands 2 and Crysis 3 with the Task Manager open on a second screen.

EDIT: Quick test with Borderlands 2 on a quad-core CPU. I deliberately did a very intense driving session with tons of fighting, destruction and so on. CPU load is very unstable, but definitely less than 30% on average. I enabled the 'PhysX' option in the main menu, which results in tons of small debris everywhere. I am not sure how this will be with Crysis 3, and Borderlands is obviously not the latest game, so it may be different in other titles. I think RTSes will be relatively heavy; I remember that Supreme Commander actually required a beefy CPU.

jbikker wrote:- 9 ms means a peak of ~100 fps, assuming ray tracing takes far less time than updating the BVH.

Does it really matter whether building a scene currently takes 10 ms or 100 ms? In many years, when CPU or GPU hardware has advanced so much that real-time GI becomes possible, the hardware of that time will be able to build BVHs faster as well, so today's 10 ms or 100 ms build times will be 1 ms or even less, and build times will no longer be relevant. (This only holds if scene complexity does not increase, and it probably will.)

I don't think it's crap; it's completely valid research to see how far we can get with GPU-side BVH construction, so that we at least have the option of not doing it on the CPU. My point is more about how to apply this technology, and more specifically: is GPU-side construction the best option for real-time scenarios such as games, on a typical consumer platform, which is heterogeneous in the sense that both a CPU and a GPU are available and ideally should both be used optimally?

I do believe previous reports on GPU BVH construction have been a bit vague about the quality of the trees produced; the new paper states a 50% efficiency compared to SBVH, which is pretty bad if you need to trace lots of rays. I don't think the original papers ever mentioned such poor efficiency. I just looked up the HLBVH paper, and it states an SAH cost for HLBVH that is between 101% and 114% of a full SAH sweep build. In hindsight, two things obfuscate this result: the first is that only scanned models were used (Armadillo, Turbine Blade, etc.); the second is that the now more or less standard SBVH was not used as a reference.

The new paper does, however, use more realistic scenes, and compares against SBVH (in table 4). The timings for their SBVH implementation are odd though; table 4 mentions 7 seconds for the conference room, which is a very poor result.

Maybe someone should write a paper that compares state-of-the-art CPU building (Wald-style threaded & binned building, with and without spatial splits) against state-of-the-art GPU building, both in terms of traversal performance and construction time.
I believe realistic numbers for CPU-side building are more in the range of 20 ms for 250k triangles with binned building (yielding ~90% of SBVH traversal performance) and 50 ms for 250k triangles with SBVH.

There's also this recent paper: "Efficient BVH Construction via Approximate Agglomerative Clustering", Gu et al., which states that a higher-quality tree can be constructed in 10 ms on 32 cores for 250k triangles, using an agglomerative build assisted by Morton codes. Their algorithm scales very well with the number of cores and delivers very high quality trees (typically better than SBVH).
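As an aside, the Morton codes that the LBVH family (and the AAC paper's Morton-assisted clustering) lean on are cheap to compute. A common 30-bit variant uses magic-number bit expansion; the sketch below follows that well-known approach and is not any particular paper's implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Spread the lower 10 bits of v so each lands in every third bit position.
static uint32_t expandBits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a centroid with coordinates normalized to [0,1].
static uint32_t morton3D(float x, float y, float z) {
    x = std::min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    y = std::min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    z = std::min(std::max(z * 1024.0f, 0.0f), 1023.0f);
    return (expandBits((uint32_t)x) << 2) |
           (expandBits((uint32_t)y) << 1) |
            expandBits((uint32_t)z);
}
```

An LBVH-style builder sorts primitives by these codes and derives the tree topology from the sorted order; the 64-bit variants simply spend more bits per axis for large scenes.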

I was pretty "sceptical" about all previously released GPU building papers, but with these two new schemes (or better: improved-over-previous-work schemes), I guess it's pretty safe to say that parallel tree building is now becoming reality. A (fast) implementation isn't exactly trivial for either scheme on a GPU, though, and there is also still the severe memory overhead introduced by both schemes (although that too is much better than before).


There are both 32- and 64-bit LBVH builders in there, as well as binned-SAH and longest-axis sweep SAH builders (the latter having roughly the same quality and speed as the binned SAH builder while using a lot less memory):
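For completeness, the sweep SAH evaluation mentioned above reduces to something like the following sketch (a hypothetical single-node version with names of my own choosing; the actual builders additionally handle axis selection, leaf termination and memory management):

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <vector>

struct Prim { float mn[3], mx[3]; };

static float halfArea(const float mn[3], const float mx[3]) {
    float e0 = mx[0] - mn[0], e1 = mx[1] - mn[1], e2 = mx[2] - mn[2];
    return e0 * e1 + e1 * e2 + e2 * e0;
}

// Sweep SAH along one axis: sort by centroid, precompute suffix areas,
// then sweep prefix areas and return the index i that minimizes the
// SAH cost of the split [0,i) | [i,n). Assumes n >= 2.
int sweepSAH(std::vector<Prim>& prims, int axis) {
    std::sort(prims.begin(), prims.end(),
              [axis](const Prim& a, const Prim& b) {
                  return a.mn[axis] + a.mx[axis] < b.mn[axis] + b.mx[axis];
              });
    int n = (int)prims.size();
    std::vector<float> rightArea(n);
    float mn[3] = { FLT_MAX, FLT_MAX, FLT_MAX };
    float mx[3] = { -FLT_MAX, -FLT_MAX, -FLT_MAX };
    for (int i = n - 1; i > 0; i--) {          // suffix sweep: area of [i, n)
        for (int k = 0; k < 3; k++) {
            mn[k] = std::min(mn[k], prims[i].mn[k]);
            mx[k] = std::max(mx[k], prims[i].mx[k]);
        }
        rightArea[i] = halfArea(mn, mx);
    }
    float best = FLT_MAX; int bestI = -1;
    float lmn[3] = { FLT_MAX, FLT_MAX, FLT_MAX };
    float lmx[3] = { -FLT_MAX, -FLT_MAX, -FLT_MAX };
    for (int i = 1; i < n; i++) {              // prefix sweep: area of [0, i)
        for (int k = 0; k < 3; k++) {
            lmn[k] = std::min(lmn[k], prims[i - 1].mn[k]);
            lmx[k] = std::max(lmx[k], prims[i - 1].mx[k]);
        }
        float cost = halfArea(lmn, lmx) * i + rightArea[i] * (n - i);
        if (cost < best) { best = cost; bestI = i; }
    }
    return bestI;
}
```

Compared to the binned builder, this considers every split position instead of K-1 candidates, which is where the extra quality comes from; the memory savings mentioned above come from sweeping in place along a single (longest) axis rather than keeping per-bin state for all three.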