Thinking Parallel, Part III: Tree Construction on the GPU

In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree to go with it. In this part, we will close the circle by looking at tree building; specifically, parallel bounding volume hierarchy (BVH) construction. We will also see an example of an algorithmic optimization that would be completely pointless on a single-core processor, but leads to substantial gains in a parallel setting.

There are many use cases for BVHs, and also many ways of constructing them. In our case, construction speed is of the essence. In a physics simulation, objects keep moving from one time step to the next, so we will need a different BVH for each step. Furthermore, we know that we are going to spend only about 0.25 milliseconds in traversing the BVH, so it makes little sense to spend much more on constructing it. One well-known approach for handling dynamic scenes is to essentially recycle the same BVH over and over. The basic idea is to only recalculate the bounding boxes of the nodes according to the new object locations while keeping the hierarchical structure of nodes the same. It is also possible to make small incremental modifications to improve the node structure around objects that have moved the most. However, the main problem plaguing these algorithms is that the tree can deteriorate in unpredictable ways over time, which can result in arbitrarily bad traversal performance in the worst case. To ensure predictable worst-case behavior, we instead choose to build a new tree from scratch every time step. Let’s look at how.

Exploiting the Z-Order Curve

The most promising current parallel BVH construction approach is to use a so-called linear BVH (LBVH). The idea is to simplify the problem by first choosing the order in which the leaf nodes (each corresponding to one object) appear in the tree, and then generating the internal nodes in a way that respects this order. We generally want objects that located close to each other in 3D space to also reside nearby in the hierarchy, so a reasonable choice is to sort them along a space-filling curve. We will use the Z-order curve for simplicity. The Z-order curve is defined in terms of Morton codes. To calculate a Morton code for the given 3D point, we start by looking at the binary fixed-point representation of its coordinates, as shown in the top left part of the figure. First, we take the fractional part of each coordinate and expand it by inserting two “gaps” after each bit. Second, we interleave the bits of all three coordinates together to form a single binary number. If we step through the Morton codes obtained this way in increasing order, we are effectively stepping along the Z-order curve in 3D (a 2D representation is shown on the right-hand side of the figure). In practice, we can determine the order of the leaf nodes by assigning a Morton code for each object and then sorting the objects accordingly. As mentioned in the context of sort and sweep in part I, parallel radix sort is just the right tool for this job. A good way to assign the Morton code for a given object is to use the centroid point of its bounding box, and express it relative to the bounding box of the scene. The expansion and interleaving of bits can then be performed efficiently by exploiting the arcane bit-swizzling properties of integer multiplication, as shown in the following code.

In our example dataset with 12,486 objects, assigning the Morton codes this way takes 0.02 milliseconds on GeForce GTX 690, whereas sorting the objects takes 0.18 ms. So far so good, but we still have a tree to build.

Top-Down Hierarchy Generation

One of the great things about LBVH is that once we have fixed the order of the leaf nodes, we can think of each internal node as just a linear range over them. To illustrate this, suppose that we have N leaf nodes in total. The root node contains all of them, i.e. it covers the range [0,N-1]. The left child of the root must then cover the range [0,γ], for some appropriate choice of γ, and the right child covers the range [γ+1,N-1]. We can continue this all the way down to obtain the following recursive algorithm.

We start with a range that covers all objects (first=0, last=N-1), and determine an appropriate position to split the range in two (split=γ). We then repeat the same thing for the resulting sub-ranges, and generate a hierarchy where each such split corresponds to one internal node. The recursion terminates when we encounter a range that contains only one item, in which case we create a leaf node. The only remaining question is how to choose γ. LBVH determines γ according to the highest bit that differs between the Morton codes within the given range. In other words, we aim to partition the objects so that the highest differing bit will be zero for all objects in childA, and one for all objects in childB. The intuitive reason that this is a good idea is that partitioning objects by the highest differing bit in their Morton codes corresponds to classifying them on either side of an axis-aligned plane in 3D. In practice, the most efficient way to find out where the highest bit changes is to use binary search. The idea is to maintain a current best guess for the position, and try to advance it in exponentially decreasing steps. On each step, we check whether the proposed new position would violate the requirements for childA, and either accept or reject it accordingly. This is illustrated by the following code, which uses the __clz() intrinsic function available in NVIDIA Fermi and Kepler GPUs to count the number of leading zero bits in a 32-bit integer.

How should we go about parallelizing this kind of recursive algorithm? One way is to use the approach presented by Garanzha et al., which processes the levels of nodes sequentially, starting from the root. The idea is to maintain a growing array of nodes in a breadth-first order, so that every level in the hierarchy corresponds to a linear range of nodes. On a given level, we launch one thread for each node that falls into this range. The thread starts by reading first and last from the node array and calling findSplit(). It then appends the resulting child nodes to the same node array using an atomic counter and writes out their corresponding sub-ranges. This process iterates so that each level outputs the nodes contained on the next level, which then get processed in the next round.

Occupancy

The algorithm just described (Garanzha et al.) is surprisingly fast when there are millions of objects. The algorithm spends most of the execution time at the bottom levels of the tree, which contain more than enough work to fully employ the GPU. There is some amount of data divergence on the higher levels, as the threads are accessing distinct parts of the Morton code array. But those levels are also less significant considering the overall execution time, since they do not contain as many nodes to begin with. In our case, however, there are only 12K objects (recall the example seen from the last post). Note that this is actually less than the number of threads we would need to fill our GTX 690, even if we were able to parallelize everything perfectly. GTX 690 is a dual-GPU card, where each of the two GPUs can run up to 16K threads in parallel. Even if we restrict ourselves to only one of the GPUs—the other one can e.g. handle rendering while we do physics—we are still in danger of running low on parallelism.

The top-down algorithm takes 1.04 ms to process our workload, which is more than twice the total time taken by all the other processing steps. To explain this, we need to consider another metric in addition to divergence: occupancy. Occupancy is a measure of how many threads are executing on average at any given time, relative to the maximum number of threads that the processor can theoretically support. When occupancy is low, it translates directly to performance: dropping occupancy in half will reduce performance by 2x. This dependence gets gradually weaker as the number of active threads increases. The reason is that when occupancy is high enough, the overall performance starts to become limited by other factors, such as instruction throughput and memory bandwidth.

To illustrate, consider the case with 12K objects and 16K threads. If we launch one thread per object, our occupancy is at most 75%. A bit low, but not by any means catastrophic. How does the top-down hierarchy generation algorithm compare to this? There is only one node on the first level, so we launch only one thread. This means that the first level will run at 0.006% occupancy! The second level has two nodes, so it runs at 0.013% occupancy. Assuming a balanced hierarchy, the third level runs at 0.025% and the fourth at 0.05%. Only when we get to the 13th level, can we even hope to reach a reasonable occupancy of 25%. But right after that we will already run out of work. These numbers are somewhat discouraging—due to the low occupancy, the first level will cost roughly as much as the 13th level, even though there are 4096 times fewer nodes to process.

Fully Parallel Hierarchy Generation

There is no way to avoid this problem without somehow changing the algorithm in a fundamental way. Even if our GPU supports dynamic parallelism (as NVIDIA Tesla K20 does), we cannot avoid the fact that every node is dependent on the results of its parent. We have to finish processing the root before we know which ranges its children cover, and we cannot even hope to start processing them until we do. In other words, regardless of how we implement top-down hierarchy generation, the first level is doomed to run at 0.006% occupancy. Is there a way to break the dependency between nodes?

In fact there is, and I recently presented the solution at High Performance Graphics 2012 (paper, slides). The idea is to number the internal nodes in a very specific way that allows us to find out which range of objects any given node corresponds to, without having to know anything about the rest of the tree. Utilizing the fact that any binary tree with N leaf nodes always has exactly N-1 internal nodes, we can then generate the entire hierarchy as illustrated by the following pseudocode.

The algorithm simply allocates an array of N-1 internal nodes, and then processes all of them in parallel. Each thread starts by determining which range of objects its node corresponds to, with a bit of magic, and then proceeds to split the range as usual. Finally, it selects children for the node according to their respective sub-ranges. If a sub-range has only one object, the child must be a leaf so we reference the corresponding leaf node directly. Otherwise, we reference another internal node from the array. The way the internal nodes are numbered is already evident from the pseudocode. The root has index 0, and the children of every node are located on either side of its split position. Due to some nice properties of the sorted Morton codes, this numbering scheme never results in any duplicates or gaps. Furthermore, it turns out that we can implement determineRange() in much the same way as findSplit(), using two similar binary searches over the nearby Morton codes. For further details on how and why this works, please see the paper.

How does this algorithm compare to the recursive top-down approach? It clearly performs more work—we now need three binary searches per node, whereas the top-down approach only needs one. But it does all of the work completely in parallel, reaching the full 75% occupancy in our example, and this makes a huge difference. The execution time of parallel hierarchy generation is merely 0.02 ms—a 50x improvement over the top-down algorithm!

You might think that the top-down algorithm should start to win when the number of objects is sufficiently high, since the lack of occupancy is no longer a problem. However, this is not the case in practice—the parallel algorithm consistently performs better on all problem sizes. The explanation for this is, as always, divergence. In the parallel algorithm, nearby threads are always accessing nearby Morton codes, whereas the top-down algorithm spreads out the accesses over a wider area.

Bounding Box Calculation

Now that we have a hierarchy of nodes in place, the only thing left to do is to assign a conservative bounding box for each of them. The approach I adopt in my paper is to do a parallel bottom-up reduction, where each thread starts from a single leaf node and walks toward the root. To find the bounding box of a given node, the thread simply looks up the bounding boxes of its children and calculates their union. To avoid duplicate work, the idea is to use an atomic flag per node to terminate the first thread that enters it, while letting the second one through. This ensures that every node gets processed only once, and not before both of its children are processed. The bounding box calculation has high execution divergence—only half of the threads remain active after processing one node, one quarter after processing two nodes, one eigth after processing three nodes, and so on. However, this is not really a problem in practice because of two reasons. First, bounding box calculation takes only 0.06 ms, which is still reasonably low compared to e.g. sorting the objects. Second, the processing mainly consists of reading and writing bounding boxes, and the amount of computation is minimal. This means that the execution time is almost entirely dictated by the available memory bandwidth, and reducing execution divergence would not really help that much.

SUMMARY

Our algorithm for finding potential collisions among a set of 3D objects consists of the following 5 steps (times are for the 12K object scene used in the previous post).

0.25 ms, one thread per object: Find potential collisions by traversing the BVH.

The complete algorithm takes 0.53 ms, out of which 53% goes to tree construction and 47% to tree traversal.

Discussion

We have presented a number of algorithms of varying complexity in the context of broad-phase collision detection, and identified some of the most important considerations when designing and implementing them on a massively parallel processor. The comparison between independent traversal and simultaneous traversal illustrates the importance of divergence in algorithm design—the best single-core algorithm may easily turn out to be the worst one in a parallel setting. Relying on time complexity as the main indicator of a good algorithm can sometimes be misleading or even harmful—it may actually be beneficial to do more work if it helps in reducing divergence.

The parallel algorithm for BVH hierarchy generation brings up another interesting point. In the traditional sense, the algorithm is completely pointless—on a single-core processor, the dependencies between nodes were not a problem to begin with, and doing more work per node only makes the algorithm run slower. This shows that parallel programming is indeed fundamentally different from traditional single-core programming: it is not so much about porting existing algorithms to run on a parallel processor; it is about re-thinking some of the things that we usually take for granted, and coming up with new algorithms specifically designed with massive parallelism in mind.

About Tero Karras
Tero Karras joined NVIDIA Research in 2009, after having worked for 3 years with NVIDIA's Tegra business. His research interests include real time ray tracing, representations for detailed 3D content, parallel algorithms, and GPU computing.

Hi ertf23, I’m not sure if you have figured it out already… I have
thought for a (relatively) long time about it and think that I came to a
working solution (I am trying to code it in a few minutes.) I’m not
sure but if you or anyone else is interested I am willen to share my
code. Greetings, Neko (and sorry for my bad english, I still am learning that language)

ertf23

I did figured out both – how to allocate octree nodes and connecting the octree nodes in parent-child relationship. It also took me a long time to find the solution. I think my solution seems sub-optimal especially the algorithm for connecting nodes in parent-child relationship. It will be great if you could share the code on github or ideone. I will share mine too. We can then figure out the fastest algorithm

Neko

is pastebin okay too? (for now, we can work on a project at github if I get all my code snippets together ^^” )
And do you prefer CPU code with “virtual” parallelism?
I mean like:
#pragma omp parallel for schedule(dynamic, 32)
for(int idx = 0; idx < max; ++idx)
dosth(idx);
or would it be better if I write the CUDA code already?

I have just fixed out a bug where multiple equal morton codes generate a broken tree.

On my Intel i-7 4800M the sequentiell algorithm takes 4.07 seconds and the parallel (with 8 threads) takes 1.625 seconds to build the tree, both on 50.000.000 Objects (binary search NOT implemented in the parallel one, still looking index by index there ^^" )

I will clean my code and as soon as I get the binary search in there you can see it (but for today I still have a more important work to do, so it may take 1 or 2 more days but I guess it doesn't add much to 10 months)

About the image: I build an example main function that tests the CLZ(0) (because I was unsure if it will give me 32 or 31 ^^" haven't removed it yet, my fault)

it creates a list of 50 Million points from [0,0,0] to [1,1,1] and generates the morton code for each object

then sort it by the morton code

Then I count through the objects and count morton code that are twice, trice, 4 times or 5 times in the list (for output reason, the old version generated false trees as soon as there were more than one group of 4 equal codes…)

generate a tree with the old sequential methode and another tree with the parallel methode

then I walk through every node of both trees, count the times and the objects in the tree (for validation)

destruct the trees and tadaah

//TODO: Clean the code…

ertf23

Pastebin or any other code sharing website will also be fine. Even if your code is broken, I am alright with it. Just need to know the high level algorithm for connecting octree nodes in parent child relationship. You can share c,c++ or cuda c or even pseudo code.

Neko

Okay, I’ll read it in a sec (after commiting to github… still kinda figuring out the linux way… new to linux ^^”)

I always wanted to write a path-tracing engine (already did kinda like that… but it was half a year ago and very slow, without bvh’s and other nice stuff… only simple diffuse surface, spheres, endless planes and so on…)

I hope you like how I write code and I do invite you to work on this engine together (I aim it to be realtime on one side but nice-looking on the other side. Anyway nice and readable code/easy to use code is the main goal) since I think you are a nice person and know how to code(?).

Also sorry for packing it into one file (I tried to hold it simple for this example)

Oh, and before I forget it:
I found a solution for “multiple equal morton codes”, also I already tested with 30mio, 50mio and (ram kill) 100mio objects, also with some equal codes inserted by hand at the front, randomly in the middle and at the end and it generated fully valid trees (was kinda fun to write a self-diagnostic code and let it run over night and while in school, ran 18 hours without a memory leak and without an error, I guess that should do it) ^^

ertf23

Hallo Neko

Thanks for sharing the code. It looks good. Actually what I am looking for is not how to make Binary radix tree but ho to create Octree out of the binary radix tree. If you go to page 4 of the paper, you can see on the right column a section titled “Octrees”.

I am copy pasting the below the steps needed to create octree as per this paper:

(1) calculate Morton codes for the points,

(2) sort the Morton codes,

(3) identify duplicates, i.e. points falling within the same leaf
node, by comparing each pair of adjacent Morton codes,

(4) remove the duplicates using parallel compaction,

(5) construct a binary radix tree

(6) perform parallel prefix sum to allocate the octree nodes, and

(7) find the parent of each node.

Your code implements only steps 1-5 but not 6 and 7. I also implemented steps 1-5 similar to the way you implemented but I also implemented steps 6,7 (sub-optimal). The pseudo-code I shared with you performs steps 6 and 7(sub-optimal). The author gives some description about steps 6 and 7 but the algorithm for step 7 is not clear. So what I am looking for is optimal solution to step 7.

terekib

Hi!

I’m also trying to build an octree, but I can’t figure it out. How does Karras means, that “δchild/3┘−└δparent/3┘ are divisible by 3”? How do you calculate those counts for the number of octree nodes? How do you decide, which radix tree node becomes an octree node?

Thank you!

ertf23

For each radix tree node R
-> EL = edge linking R with its left child-node
-> ER= edge linking R with its right child-node
-> Lp = length of prefix of node R
-> Lcl = length of prefix of left child-node of R
-> Lcr = length of prefix of right child-node of R

Range of sub-prefixes for EL: [Lp+1, Lcl]. Any value in this range which is multiple of 3(for 3d) or 2(for 2d) will contribute 1 octree node of same length. Let Cl be the count of such octree nodes

Range of sub-prefixes for ER: [Lp+1, Lcr]. Any value in this range which is multiple of 3(for 3d) or 2(for 2d) will contribute 1 octree node of same length. Let Cr be the count of such octree nodes.

Total octree nodes corresponding to R = Cl+Cr

terekib

Thank you very much for the answer!
So what you said is basically this:
cntr = 1;
for (int i = 0; i parent.delta / 3)
{
cntr += node.delta / 3 – parent.delta / 3;
}
}
cntr += numberOfLeafNodes;

smtl

hi!
hello Neko!
how to deal with the case of duplicate Morton codes?

smtl

when the BVH tree is build, how to travel the tree with raytracing?

Neko

After building the tree, I will add bounding boxes (as described in the BoundingBox project in my Git :3 )

When tracing I will start at the root node, check it’s bounding boxes and when it hit I will test both children, and whenever I hit a bounding box, I will test it’s children and so on.

If you got more questions you can ask me at nekoyuke@gmail.com, but if you follow my project along I will try to build a fast, easy to understand and to use ray/path-tracer and a few sideprojects (like the TimeLapseCapture program I just wrote ^^” OpenCV c++ code will follow)

In general I try to update it every day :3

smtl

Ok I will try and follow you project.
And I guess if there is an efficient way to travel the bvh tree, becase we build the tree using Mordon tree?

smtl

hi!

hello Neko
how to deal with the case of duplicate Morton codes?

ertf23

Just scan the whole array of objects and remove any objects with duplicate morton codes. Another way is to somehow prevent duplicate morton codes.

The Chaper 4 give a solution , but I did not understand it. Can you help me.

Neko

I did it like this:
mcodeleft = lst[index-1];
mcoderigt = lst[index+1];
when (mcodeleft == mcoderight)
then set index as start point and go through the elements, till there is a diffrent element, also change the “find split” function, so it doesn’t return the middle, when the start and end elemet are equal but return the first element, so that when there is a group of equal codes, the subtrees will start at the second element and will go on like “left node = object, right node = subnode index+1” and so on, till all objects are included

so for example if you have multiple triangles with the same midpoint, they won’t be deletet

@nurabha:disqus Could we get in touch in a more “direct” chat? I want to help you with the octree but somehow I don’t get the “parallel concept” (I do get the concept of ocrtees but I don’t get the parallel concept…)

and I ask again, are you willen to participate in coding a nice to use (I hope realtime) Path/RayTracer API in c++&CUDA?

I fixed all bugs and yes, both examples deal with duplicate morton codes ^^

I achieved it by testing index-1==index+1 and if both are equal I will do a forward search till a diffrent morton code comes, also I changed the “find split” function so that it splits in equal-morton-code-cases always at the beginning so that it will go down like stairs and the nodes A is always the object and B is always the node to the other objects/the last object ^^

Do you understand my idea? If not I can try to visualize it a little :3

Also the BoundingBox project already deals with triangles, bounding boxes and duplicate morton codes (if there are any by randomness or made ba hand)

smtl

hi，Neko
Ok, I will read it after a minutes.
Can I get an email address from you? If I have some questions, I can send you email.

Hurrrrrrrrrrrrrr

Unless I’m misunderstanding–that sounds like it would produce a very deep, very narrow, linearly shaped subtree within each maximal contiguous range of identical Morton codes. This seems non-ideal.

I have just implemented the algorithm given on this page, and I dealt with code collisions by simply appending the bits of the thread ID (i.e., the leaf ID) to the least-significant-bit end of each Morton code. That way two codes will never be identical, and a long series of collisions will produce a rather nicely balanced subtree, and not a weird linear one.

The drawback of this is that calculations involving Morton codes are more complex and thus slower, since each code is effectively 64 bits long rather than 32 bits long. I tried to mitigate this as well as I could; I don’t actually concatenate the two numbers’ bits, for instance, since I expect that a 64-bit number will be slow on my hardware (assuming it’s even supported; I didn’t even check). Maybe I didn’t do it perfectly, and it’s even possible that if it IS done perfectly, then your solution will somehow be faster. But anyway, this is one more way to do it.

I also considered adding an extra step to the algorithm, between sorting of Morton codes and construction of the tree, that would use some kind of scanning operation on the differences between adjacent Morton codes to produce an array of (preferably small) offset values that could then be added, one offset to each Morton code, so as to shift all the codes “to the right” only juuuust enough to turn each series of collisions (4, 8, 8, 8, 8, 10, 35, 40) into a series of counting numbers (4, 8, 9, 10, 11, 12, 35, 40) without disrupting the global sorting at all. …But after some thought I realized that, although doing this efficiently is likely to be feasible, it would (I think?) tend to arbitrarily shift some Morton codes across some crucial boundaries in the octree, thus altering the nice connection between the arrangement of the original bounding boxes in space and the layout of the binary tree. I had little idea of how to tell how harmful that might be, and I didn’t think my concatenation scheme (described earlier) would be expensive or difficult to implement, so I gave up on this line of thinking.

Ultimately of course it’s preferable to minimize code collision in the first place, by choosing an appropriate resolution for the Morton codes, and possibly by arranging the bounding boxes so that they don’t come too close to being congruent, or to having overlapping centers, or whatever. But in a fully dynamic physics engine, degeneracies may occasionally arise (what if somebody added 20000 objects suddenly, all in the same frame and all at the same location, and then resumed simulation, expecting to see all the objects smoothly burst apart??), and so one must probably be prepared to somewhat gracefully handle a large number of Morton code collisions. (Maybe this is a reason why turning a range of collisions into a linear subtree would be bad? Because there might be tens or hundreds of thousands of them?)

Belgin

I know I’m late, but I couldn’t help but comment on this brilliant algorithm. Half of a millisecond for construction AND traversal? Sorry, but I find that hard to believe.

doOo

Hi,
I don’t understand the Morton code part, if I test the Morton3D function with the example vaules (0.1010, 0.0111, 0.1100) it’s returning 1479990 instead of 1010110010.
What am I missing here?

Neko

you are displaying the binary morton code as integer

For examle, if the morton code would be 0010 you see a 3… because your display function (i guess cout) converts the binary number to a human-known 10-base number system ^^”

so your morton code is correct, you are just displaying it the wrong way ^^

There are several ways of displaying a variable in binary

I do it like:
std::cout << std::bitset(val) << std::endl;
bits is the number of bits to display and val is the value ^^

I hope I could help x3

doOo

extremely late reply, but yes it helped,
I got my lbvh working
Thanks!!!

Hey, thank you very much for this amazing article. (how the hell could anyone discovered this perfect method :D)
I am still wondering about one thing but I think I am just missing something. What about a large object that is almost as long as the whole scene. From what I understood it still would be placed (based of its center that is more on the left) into the left subtree (left child of the root). How do we know that it can collide with the right subtree leaves as well?
In octree for example I’d place a reference to it also to the other nodes where it might collide so it would be actually present in both subtrees.