Thread Parallelism (Part 3): A Very Brief Look at Performance

This is the final post of my 3-part series on thread parallelism, specifically existing solutions that provide a generic parallel-for abstraction. In case you missed them, have a look at Part 1 and Part 2.

This post briefly looks at the performance of the parallel_for() wrapper from my last post inside OSPRay. I recognize this will be simplistic, as tasking systems have to deal with a wide variety of application contexts. However, saying nothing about performance would leave the series incomplete.

What I’m benchmarking

OSPRay is a ray tracing library, where ray intersections with 3D geometric and volumetric data are used to render images. Readers who want a quick crash course on ray tracing should have a look at Wikipedia’s article on the topic. I also highly recommend Pete Shirley’s three mini-book series on ray tracing, starting with “Ray Tracing in One Weekend”.

One of the beautiful aspects of rendering images with ray tracing is that the problem is embarrassingly parallel in image space. In other words, the rays we trace for each pixel in the image are completely independent from each other, making it trivial to parallelize that work. This makes it a nice candidate problem to use with parallel_for().
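To make the "trivially parallel" claim concrete, here is a minimal parallel_for() sketch built on std::thread. This is an illustrative stand-in for the wrapper from Part 2, not OSPRay's actual code: worker threads pull indices from a shared atomic counter, and each index (think: one pixel's rays) is fully independent work.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Minimal parallel_for sketch (illustrative, not OSPRay's wrapper).
// Worker threads grab indices from a shared atomic counter until the
// range [0, n) is exhausted; no index depends on any other.
template <typename Fn>
void parallel_for(int n, Fn &&fn)
{
  std::atomic<int> next{0};
  unsigned numThreads =
      std::max(1u, std::thread::hardware_concurrency());
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t) {
    workers.emplace_back([&] {
      for (int i = next++; i < n; i = next++)
        fn(i); // e.g. trace the rays for pixel i
    });
  }
  for (auto &w : workers)
    w.join(); // all indices are done when parallel_for returns
}
```

Because pixels never communicate, no locking is needed beyond the counter that hands out work.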

For this benchmark, I take a particular setup (hardware, scene, camera position, frame size, etc) and only change which tasking system parallel_for() targets. This is done at compile time with inlined functions through the wrapper, which yields the same result as if I wrote code directly against each tasking system’s API.
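The compile-time selection might look roughly like this (the macro and backend names here are made up; the real wrapper from Part 2 differs). Because the wrapper is a small inline function, the compiler emits the same code as calling the chosen tasking API directly:

```cpp
#include <cassert>

// Hypothetical compile-time backend selection. Exactly one backend macro
// is defined at build time; the inline wrapper forwards straight to it,
// so there is no runtime dispatch overhead.
template <typename Fn>
inline void parallel_for(int n, Fn &&fn)
{
#if defined(WRAPPER_USE_TBB)
  tbb::parallel_for(0, n, fn);
#elif defined(WRAPPER_USE_OMP)
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    fn(i);
#else // serial fallback so this sketch compiles anywhere
  for (int i = 0; i < n; ++i)
    fn(i);
#endif
}
```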

The scene I am using is a triangle mesh generated from a ground water simulation (~830k triangles, data from FIU). The tests are done with OSPRay’s default ambient occlusion (AO) renderer and a more fully featured “scientific visualization” (SciVis) renderer, which adds direct lighting to create shadows. Each renderer is built into OSPRay and was used unmodified. Below are the images we get with each: the first from the AO renderer and the second from the SciVis renderer.

First-level parallelization

When ray tracing, it is more efficient to traverse the image along a z-order curve, as this yields rays which are more coherent to trace. Thus we divide the image into 64×64-pixel tiles and assume a z-order traversal of the pixels within each tile for added efficiency; scheduling the tiles themselves is still embarrassingly parallel. Dividing the work into 64×64 tiles this way gives each task more coherent work than scheduling a task per pixel or per scan line. Below is a visualization of what the tiling of the above image looks like, where each square is an individual task.
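For readers unfamiliar with z-order traversal: the z-order (Morton) index of a pixel is formed by interleaving the bits of its x and y coordinates, so pixels that are close in 2D stay close in the 1D traversal order. A small sketch (not OSPRay code), valid for coordinates up to 16 bits:

```cpp
#include <cassert>
#include <cstdint>

// Spread the lower 16 bits of v so there is a zero bit between each
// original bit (the classic bit-twiddling expansion).
static uint32_t part1by1(uint32_t v)
{
  v &= 0x0000ffff;
  v = (v | (v << 8)) & 0x00ff00ff;
  v = (v | (v << 4)) & 0x0f0f0f0f;
  v = (v | (v << 2)) & 0x33333333;
  v = (v | (v << 1)) & 0x55555555;
  return v;
}

// Morton index: x bits occupy the even bit positions, y bits the odd
// ones. Iterating pixels in increasing Morton order keeps neighboring
// rays together, which is the coherence the post refers to.
static uint32_t mortonIndex(uint32_t x, uint32_t y)
{
  return part1by1(x) | (part1by1(y) << 1);
}
```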

As noted at the end of my last post, here is the code which schedules that work using parallel_for():
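As a hedged reconstruction of what that listing does (all names here, such as Tile, renderFrame, and TILE_SIZE, are hypothetical, and a serial parallel_for stands in for the wrapper so the sketch runs anywhere):

```cpp
#include <cassert>
#include <vector>

// Serial stand-in for the parallel_for wrapper (illustration only).
template <typename Fn>
void parallel_for(int n, Fn &&fn)
{
  for (int i = 0; i < n; ++i)
    fn(i);
}

constexpr int TILE_SIZE = 64;

struct Tile
{
  int x, y;
  float color[TILE_SIZE * TILE_SIZE];
};

// One task per 64x64 tile of the frame.
void renderFrame(int width, int height, std::vector<float> &framebuffer)
{
  const int tilesX = width / TILE_SIZE;
  const int tilesY = height / TILE_SIZE;
  parallel_for(tilesX * tilesY, [&](int taskIndex) {
    Tile tile; // 1) created on the task's stack
    tile.x = (taskIndex % tilesX) * TILE_SIZE;
    tile.y = (taskIndex / tilesX) * TILE_SIZE;
    for (float &c : tile.color) // 2) stand-in for rendering z-ordered chunks
      c = 1.0f;
    for (int j = 0; j < TILE_SIZE; ++j) // 3) store the tile in the framebuffer
      for (int i = 0; i < TILE_SIZE; ++i)
        framebuffer[(tile.y + j) * width + (tile.x + i)] =
            tile.color[j * TILE_SIZE + i];
  });
}
```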

The above code schedules a task for each tile, where each tile is 1) created on the stack, 2) rendered by iterating over z-ordered “chunks” in the tile, and 3) stored in the frame buffer. Running on the workstation under my desk at the office (2x Xeon E5-2699 v3 CPUs, 36 cores, 72 threads | gcc 6.3.1 | Arch Linux), each tasking system generated relatively similar results:

The charts show frames-per-second of progressively refining the above image, with max/min/average frame rate over 1000 frames. Just to be clear: the above images are the final, fully refined version of the image.

The results are mostly as expected: with only one level of parallel task scheduling, each tasking system does the job about the same. It is interesting to see that Cilk+ seems to have periodic latency spikes as it had a very low reported minimum frame rate. However, its max and average frame rates seem to be about the same as the others.

In general, we see roughly ~57-62 FPS with the AO renderer and ~40-45 FPS with the SciVis renderer. Not bad, but we can do better.

Adding more parallelism

If we look closer at the problem, we see that a single task renders an entire 64×64 tile, which can create load imbalance problems. For the machine to execute more of the work in parallel, we have to tell it that more of the work can be executed in parallel. A 64×64 tile is fairly large for a single task: rays which miss the model entirely are very cheap to render compared to rays which hit deep inside interior crevices of the model. This means that tiles containing expensive pixels end up taking far longer than tiles with few (or no) expensive pixels.

There are multiple ways we can deal with this issue, one of which is to simply make our tile size smaller…but that would be boring! Instead, we can use the concept of nested parallelism to generate more granular tasks for the tasking system to work with. In a nutshell, nested parallelism just means that our parallel_for() tasks may call parallel_for() again, generating more tasks.

This is really easy to change in our above code: the inner for-loop simply becomes another call to parallel_for().
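A sketch of the nested version (again with hypothetical names and a serial parallel_for standing in for the real wrapper): the outer loop still schedules one task per 64×64 tile, but the former inner for-loop over chunks of the tile is now itself a parallel_for, producing finer-grained tasks.

```cpp
#include <cassert>
#include <vector>

// Serial stand-in for the parallel_for wrapper (illustration only).
template <typename Fn>
void parallel_for(int n, Fn &&fn)
{
  for (int i = 0; i < n; ++i)
    fn(i);
}

constexpr int TILE_SIZE = 64;
constexpr int CHUNK_ROWS = 8; // rows of a tile handled per inner task

void renderFrameNested(int width, int height, std::vector<float> &framebuffer)
{
  const int tilesX = width / TILE_SIZE;
  const int tilesY = height / TILE_SIZE;
  parallel_for(tilesX * tilesY, [&](int taskIndex) {
    const int x0 = (taskIndex % tilesX) * TILE_SIZE;
    const int y0 = (taskIndex / tilesX) * TILE_SIZE;
    // Inner parallel_for: synchronizes implicitly before returning, so
    // this tile's task completes only after all its chunk tasks do.
    parallel_for(TILE_SIZE / CHUNK_ROWS, [&](int chunkIndex) {
      for (int j = chunkIndex * CHUNK_ROWS;
           j < (chunkIndex + 1) * CHUNK_ROWS; ++j)
        for (int i = 0; i < TILE_SIZE; ++i)
          framebuffer[(y0 + j) * width + (x0 + i)] = 1.0f; // stand-in for shading
    });
  });
}
```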

This creates a set of tasks, each of which schedules more tasks. Because parallel_for() synchronizes implicitly before returning, each top-level task completes only once all of its child tasks are completed. However, scheduled tasks can be executed by any worker thread, so worker threads which were working on cheap tiles and finished early can then pick up incomplete work generated by more expensive tiles. This keeps threads busy with useful work, which better utilizes the machine.

The following are updated frame rates when using the nested parallelism version:

With the more granular tasks, performance jumped up by about 30%: ~68-83 FPS for AO and ~48-57 FPS for SciVis. Definitely a nice increase!

It is not perfect, however: the parallel_for() wrapper does not compose with itself well when targeting OpenMP. OpenMP can do nested parallelism, but it requires you to arrange the #pragma directives in a particular way for the runtime to schedule work correctly. As currently formulated, the parallel_for() wrapper gives nested calls no way to know that they were invoked from a parent parallel_for(). This is partly a historical quirk of OpenMP and not a reason to dismiss it as a good tasking solution, but it is definitely something to watch out for if you’re using a parallel_for() wrapper with OpenMP. If you are interested in OpenMP’s ability to do nested parallelism, you can easily find resources discussing the topic on the web.
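For reference, here is a hypothetical example (not OSPRay code) of how nesting is arranged directly in OpenMP. The runtime must be told that nested levels are allowed, e.g. via omp_set_max_active_levels() or the OMP_MAX_ACTIVE_LEVELS environment variable, and each level needs its own explicit parallel region; a generic wrapper cannot see, from inside a nested call, that it is running under a parent region. Compiled without OpenMP support, the pragmas are simply ignored and the loops run serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical nested-parallelism example written against OpenMP.
// The inner region only runs in parallel if the runtime permits
// nested active levels; otherwise it executes serially inside the
// outer team's threads.
void fillTiles(std::vector<int> &out, int tiles, int pixelsPerTile)
{
  out.assign(tiles * pixelsPerTile, 0);
  #pragma omp parallel for
  for (int t = 0; t < tiles; ++t) {
    // Inner region: a second, explicitly nested parallel loop.
    #pragma omp parallel for
    for (int i = 0; i < pixelsPerTile; ++i)
      out[t * pixelsPerTile + i] = t;
  }
}
```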

Final thoughts

This post probably left many of you wanting more, which is partly the point. One of my goals here was to show how a parallelism abstraction makes it easy to test out tasking libraries. Obviously what works well for one problem is not guaranteed to work well for another problem…so go try this in your code and see what happens!

For me, it was good to go back and re-test our “home grown” tasking system again. To my surprise it looks like cleanups done in OSPRay resulted in our tasking system working better than anticipated, as it performed about the same as TBB. We have plenty of reasons to keep TBB as our “go to” tasking system (i.e. it performs better in other areas), but the results I generated for this post motivate me to package that up as a lightweight, header-only tasking system to have around.

This series is more about broadening your scope of engineering considerations when writing code, hopefully opening up possibilities to simplify your code when you don’t need a custom solution. Don’t get me wrong, having a wrapped parallel_for() on some existing tasking libraries won’t be the best solution for everyone, but I am certain it can get many of us to a place with good performance and little-to-no maintenance overhead.

3 thoughts on “Thread Parallelism (Part 3): A Very Brief Look at Performance”

Nice post, thanks for sharing your experience. Is lowering the tile size only “boring”, or is nested parallelism actually faster? I ask because in my experience simpler code often means better performance.

I also made some experiments with scheduling a while ago. I have two tiled scheduler implementations, one that uses TBB, and one task-queue based scheduler that I implemented with std::thread. The latter one uses atomic variables to synchronize the queue accesses, and holds the threads persistent (probably TBB does so too under the hood, I never checked this). I always found that the performance of both implementations was roughly the same – until testing on knights landing. There I found that my custom implementation yielded ~5% higher frame rates with benchmarks similar to the ones that you use (e.g. ambient occlusion). I’m not sure what TBB does internally, but I’d guess that the load balancing is more advanced than my simple task-queue load balancing, and I’d also guess that too complicated load balancing is overkill for a task as simple as tiled rendering.

We have other reasons to keep the tile size fairly large, as that is our unit of work with our distributed MPI backend in OSPRay. Thus the current nested version allows us to use a universal tile size between local and distributed rendering, but that’s beyond the scope of the post. 😉

I just ran a couple of quick tests on my machine here at home (2x Ivy Bridge Xeon workstation, TBB, gcc-6.3.1) to verify this before making any assertions: reducing the tile size with a single level of parallelism is also fast, but I was unable to make it faster than the nested formulation. I believe this to be true as TBB simply needs enough granularity of tasks to load balance effectively. Thus if you use a single level parallel_for() with a small tile (I tried 8×8) or use nested parallel_for() on a larger tile, they end up being effectively the same. However, these were “quick and dirty” tests, meaning I don’t know if my error rate in timings is <5% (re: your KNL results) and I simply re-ran the single view I used for this post. A "real" assessment would entail much more benchmarking, but the gain is likely not interesting considering the trade-offs mentioned above.

It is important to note, however, that OpenMP's inability to deal with nesting effectively makes a reduced tile size a good optimization when using it over other tasking systems.

Your results with TBB vs. your tasking system on KNL are certainly interesting. Did you use a fairly recent version of TBB when you tried it? The reason I ask is that TBB may have KNL optimizations in its newer releases, which may be missing from the TBB found in CentOS 6/7 or older Ubuntu package repositories (I'm not certain of this, just an idea). Regardless, it seems that you have a nice task scheduler. 🙂

Also when I took the data for this post, I didn't realize that our hand written tasking system was leaking task memory all over the place…once that got fixed we lost ~25% performance across the board…whoops! Not sure if I'll do much more to it beyond my own amusement, but it could probably be improved.

Ah ok, now I understand: when network computing is involved you prefer larger tiles to reduce latency, and with single-node sort-first you may prefer smaller tiles, so a nested approach makes complete sense anyway.

Your suggestion is presumably right: I used the default TBB version that comes with CentOS 6, and a quick inspection of the disassembly leads me to think it was not even compiled with KNL-specific optimizations (only occurrences of xmm registers, no zmm), let alone with algorithmic optimizations specific to the KNL platform. So there goes my happiness of being able to beat TBB 😉

I mean, for typical workloads like path tracing with complicated lighting, calculating radiance quickly becomes the dominant factor so that scheduling is often just not the bottleneck. I think that people don’t put too much thought into task scheduling because of this. But if you’re targeting interactive systems like VR on a display wall, you usually resort to simpler algorithms (e.g. direct lighting + AO, or only Whitted rendering), and then the few extra frames you can achieve with a good task scheduler actually matter.