raPT - Rapid Path Tracing
Path tracing, ray tracing, 3D rendering, computer graphics and high performance computing.
http://rapt.technology/

Accelerated Single Ray Tracing for Wide Vector Units

<p>The annual High Performance Graphics conference (HPG’17) is kicking off today in LA with a very promising line-up of papers.
A side effect of the continued VR hype is an increased interest in ray tracing methods motivated by adaptive rendering algorithms.
In addition, production rendering has seen widespread adoption of ray tracing.
So these are good times for ray tracing research and HPG offers two papers focused on accelerated ray traversal methods, one targeted at the GPU by Nvidia (<a href="http://research.nvidia.com/publication/2017-07_Efficient-Incoherent-Ray">view</a>), and one targeted at the CPU by myself.
You can read my contribution <a href="/data/wive-author.pdf">here</a>, and after that keep on reading this blog post which describes some further optimizations of the algorithm (named WiVe) introduced in the paper.</p>
<!--more-->
<p>After some further experimenting with WiVe I’ve discovered a few tweaks that make the algorithm even more efficient.
An issue with the original implementation is the dependency chain created by storing nodes to the stack and reading the top node back.
On the <a href="https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing">KNL</a> architecture there is no store forwarding between the vector and integer units, so the data has to go through the L1 cache.
Ideally the next node to be traversed could be transferred directly from vector to integer registers in order to move stack operations out of the critical path of execution.
As it happens there is a way to achieve this, even without increasing the instruction count.</p>
<figure>
<img src="/images/wiveenhanced2.png" alt="Schematics of the enhanced Wive single ray traversal algorithm for multi-branch BVHs." style="margin-left:auto;margin-right:auto;display:block;" />
<figcaption>Figure 1 <i>Sketch of the updated WiVe algorithm. For the original sketch see paper.</i></figcaption>
</figure>
<p>Take a look at Figure 1.
The idea is to always have the next node in the low element of the vector register after compression (D), because the low element can be easily transferred to an integer register using <i>movd</i> (E); the instructions are described <a href="https://software.intel.com/sites/default/files/managed/a4/60/325383-sdm-vol-2abcd.pdf">here</a>.
This means that the order of nodes in the vector register has to be inverted.
In addition, if none of the child nodes is intersected during a traversal step the low vector element must be occupied by the current stack top (T) for the algorithm to work.</p>
<p>The <i>movd</i> instruction loads the stack top into the low element of a vector register and sets all other elements to zero (C).
Using the masked variant of <i>vpcompressq</i> with the stack top as the destination register overwrites the stack top node only if one or more child nodes are intersected.
Thus, in all cases we find the next valid node, in the correct order, in the low element of the stack vector, and the traversal can continue without further delay.
The remaining stack push is removed from the immediate dependency chain.
Since the nodes are now in inverted order we use an inverted stack as well that grows towards smaller memory addresses.
To avoid overwriting of previous stack data the store (F) requires a new mask that reflects the new register positions of the active nodes after compression.
This mask is easily obtained with a <i>vptestmq</i> instruction which sets a mask bit for all non-zero elements.</p>
<p>Now the algorithm is almost complete; the only remaining issue is termination, since we no longer check the stack size during traversal.
Instead, a sentinel element is added to the bottom of the stack (S) which is zero everywhere except for a set leaf node flag.
If the traversal reaches the bottom of the stack it finds the fake leaf node and jumps to leaf intersection where a simple test detects the sentinel and terminates the algorithm.</p>
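The stack handling described above can be illustrated with a scalar sketch (illustrative names and types, not the paper's actual vectorized code): bit 0 of a node reference serves as the leaf flag, the sentinel is zero except for that flag, and the nearest intersected child replaces the stack top while the remaining hits are pushed far-to-near.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar sketch of the sentinel-terminated WiVe stack. A node reference
// packs a leaf flag into bit 0; the sentinel (S) is zero everywhere
// except for the set leaf flag.
constexpr uint64_t kLeafFlag = 1;
constexpr uint64_t kSentinel = kLeafFlag;

inline bool isLeaf(uint64_t node)     { return (node & kLeafFlag) != 0; }
inline bool isSentinel(uint64_t node) { return node == kSentinel; }

// One traversal step: 'hit' holds the intersected children in near-to-far
// order. With at least one child hit, the nearest one continues traversal
// and the rest are pushed far-to-near; otherwise the stack top is popped.
// This mirrors the masked compress that overwrites the stack-top register
// only when one or more children are intersected.
inline uint64_t nextNode(const std::vector<uint64_t>& hit,
                         std::vector<uint64_t>& stack) {
    if (hit.empty()) {
        uint64_t top = stack.back();
        stack.pop_back();
        return top;
    }
    for (std::size_t i = hit.size(); i-- > 1; )
        stack.push_back(hit[i]);
    return hit[0];
}
```

With the sentinel seeded at the bottom of the stack, the traversal loop needs no stack-size check: once the sentinel surfaces, the leaf-intersection path detects it and terminates.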
<table style="text-align:left;margin-left:auto;margin-right:auto;border-spacing: 10px;">
<caption style="text-align:left;">Table 1 <i>Performance in million rays per second (MRays/s). For experimental setup see paper.</i></caption>
<tr>
<th></th>
<th>Mazda</th>
<th>San Miguel </th>
<th>Art Deco</th>
<th>Powerplant</th>
<th>Villa</th>
</tr>
<tr>
<td><b>Updated</b></td>
<td>131.9</td>
<td>77.3</td>
<td>171.4</td>
<td>88.9</td>
<td>92.6</td>
</tr>
<tr>
<td><b>Original</b></td>
<td>126.7</td>
<td>73.1</td>
<td>165.0</td>
<td>85.4</td>
<td>87.4</td>
</tr>
<tr>
<td><b>Improvement [+%]</b></td>
<td>4</td>
<td>6</td>
<td>4</td>
<td>4</td>
<td>6</td>
</tr>
</table>
<p>So how does the updated implementation perform? Was it worth the effort?
Table 1 lists the results; the experimental setup is the same as described in the paper (AVX-512, Intel Xeon Phi 7250).
Performance improved by 4-6% over the original implementation, making the currently fastest CPU-based BVH traversal for single rays even faster!</p>
Fri, 28 Jul 2017
http://rapt.technology/posts/accelerated-single-ray-tracing-for-wide-vector-units/

Part II: Parallel BVH construction

<p>The best known BVH construction algorithm in terms of ray tracing performance is the <a href="http://www.nvidia.ca/docs/IO/77714/sbvh.pdf">BVH with spatial splits</a> (SBVH).
In contrast to a standard BVH, where sibling nodes overlap in space if the corresponding primitives do, the SBVH allows splitting overlapping primitives, resulting in spatially disjoint sibling nodes.
Primitive splitting is a costly operation and considerably complicates multi-threaded BVH construction due to recursively growing memory buffers.</p>
<p>The raPT renderer contains a highly efficient multi-threaded and vectorized SBVH construction framework, which is described in the paper <a href="/data/pssbvh.pdf">Parallel Spatial Splits in Bounding Volume Hierarchies</a> published in this year’s <a href="http://www.vis.uni-stuttgart.de/egpgv/egpgv2016/">Eurographics Symposium on Parallel Graphics and Visualization</a> (EGPGV16).
<!--more--></p>
<p>Abstract</p>
<blockquote>
Bounding volume hierarchies (BVH) are essential for efficient ray tracing. In time-constrained situations such as real-time or large model visualization, fast construction of BVHs usually compromises hierarchy quality, resulting in reduced rendering speed. We propose a parallel framework for the state-of-the-art BVH construction algorithm with spatial splits (SBVH) that provides highest quality hierarchies within a time frame competitive with low quality builders optimized for construction speed. We leverage both data and task parallelism to employ threading and single instruction, multiple data (SIMD) capabilities of modern CPUs. Our key contribution is a lightweight memory management and load balancing scheme that maximizes parallel efficiency.
</blockquote>
<p>The paper comes with <a href="/data/supplemental.zip">supplementary code</a> fragments that demonstrate how data parallel AVX instructions can be used to accelerate various kernel operations of the SBVH algorithm, in particular triangle splitting.</p>
<p>The results section of the paper compares the raPT parallel SBVH framework to the state-of-the-art implementation in the <a href="http://embree.github.io/">Intel Embree</a> ray tracing library on a dual socket <a href="http://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz">Intel Xeon E5-2680v3</a>.</p>
<p>Here, I provide extended results obtained from a more common <a href="http://ark.intel.com/products/88196/Intel-Core-i7-6700-Processor-8M-Cache-up-to-4_00-GHz">Intel Core i7-6700</a> processor. Both Embree (v2.9.0) and raPT have received further updates, so the data for the Xeon differs slightly from the values presented in the paper.
In addition, to show the impact of spatial splits on BVH construction times, the table below includes timings with spatial splits disabled for both raPT and Embree.</p>
<p><a href="/images/extresultsbvh.png"><img src="/images/extresultsbvh.png" alt="Extended results for Parallel Spatial Splits in Bounding Volume Hierarchies." style="margin-left:auto;margin-right:auto;display:block;" /></a></p>
<p>As you can see, reducing the number of threads from 48 (Xeon) to 8 (Core i7) narrows the gap between Embree and raPT somewhat, which demonstrates the performance gained from raPT’s highly efficient parallel framework when many threads are active.</p>
<p>Without spatial splits raPT and Embree are closer, with raPT still clearly in the lead. This puts into perspective the speed-up obtained from vectorized split operations and the corresponding implementation of recursively growing memory buffers.</p>
Mon, 06 Jun 2016
http://rapt.technology/posts/part-ii-parallel-bvh-construction/

Part I: The raPT acceleration structure

<p>This is the first part of a series about the inner workings of the raPT renderer, uncovering techniques and implementation details.
The topic for today is the acceleration structure implemented in raPT to speed up the calculation of ray-geometry intersections.</p>
<!--more-->
<p>An acceleration structure reduces the algorithmic complexity of finding the closest triangle intersection for a given ray from linear to logarithmic in the number of triangles, decoupling the render performance from the scene size to some extent.
Among the common acceleration structures for ray tracing are <a href="http://dcgi.felk.cvut.cz/home/havran/DISSVH/dissvh.pdf">kd-tree</a>, <a href="http://www.isislab.it/papers/cosenzaegita08grid.pdf">grid</a>, and <a href="https://graphics.stanford.edu/~boulos/papers/togbvh.pdf">bounding volume hierarchy</a> (BVH).
The design of the acceleration structure is influenced by several factors, such as emphasis on dynamic or static geometry, flexibility, memory consumption, and by the capabilities of the target platform and the underlying microarchitecture.</p>
<p>The choice for raPT is a BVH with a <a href="http://cg.ivd.kit.edu/publications/pubhanika/2008_qbvh.pdf">branching factor of 4</a> (BVH4), meaning that a parent node references (up to) 4 child nodes.
Increasing the branching factor increases the efficiency of data parallel processing and memory coherence, and reduces the depth of the hierarchy, which in turn reduces ray divergence and memory consumption.
It also reduces culling efficiency, leading to more ray-bounding box intersection tests and higher memory bandwidth, so there is a sweet spot which depends mainly on the traversal algorithm and the targeted microarchitecture. Since the raPT traversal algorithm <a href="http://jcgt.org/published/0004/04/05/">ORST</a> processes 2 rays simultaneously, a branching factor of 4 is sufficient to saturate an <a href="https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions">AVX</a> register.</p>
<p>A differentiating feature of my BVH4 structure is a fast masking mechanism, which allows compression of nodes and visibility editing of scene geometry.
Each node contains two 4-bit masks, a static mask for compression and a dynamic mask for editing.
Each mask bit maps to one of the child nodes, where a set bit enables the child node for traversal and an unset bit disables it.
The dynamic mask is a subset of the static mask, so that the dynamic mask can only have a set bit where the static mask has a set bit as well.
Initially both masks are equal, and during traversal only the dynamic mask is relevant.</p>
<p>Before continuing with the mask mechanism, let’s first take a look at the node data structure(s):</p>
<figure class="highlight"><pre><code class="language-c--" data-lang="c++"><span class="k">struct</span> <span class="n">bvhNode4</span>
<span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">child</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">unused</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">mask</span><span class="p">,</span> <span class="n">perm</span><span class="p">,</span> <span class="n">flags</span><span class="p">;</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">bvhNode4_box</span>
<span class="p">{</span>
<span class="n">vec4f</span> <span class="n">x</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">y</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">z</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">bvhNode4_cluster</span>
<span class="p">{</span>
<span class="n">bvhNode4_box</span> <span class="n">boxes</span><span class="p">;</span>
<span class="n">bvhNode4</span> <span class="n">nodes</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
<span class="p">};</span></code></pre></figure>
<p>Compared to the original layout described in my ray traversal <a href="http://jcgt.org/published/0004/04/05/">paper</a> there have been some changes / improvements.
The number of <code>bvhNode4</code> elements after a <code>bvhNode4_box</code> is no longer variable, but always 4.
This leads to a <code>bvhNode4_cluster</code> which is always 128 bytes large and naturally aligned in memory.
The <code>child</code> field contains the index of either the child cluster or the primitive cluster, depending on whether the node is an inner node or a leaf.
This information is encoded in the <code>flags</code> field.
<code>perm</code> is the index into the traversal order look-up table and <code>mask</code> contains the static mask in the upper bits and the dynamic mask in the lower bits.</p>
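The layout and the mask packing can be checked directly in code; the sketch below reproduces the structs from above (with a placeholder <code>vec4f</code> of four packed floats) and adds two small helpers, named here for illustration, that split the <code>mask</code> byte into its static and dynamic halves.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder for raPT's vec4f: four packed floats (16 bytes).
struct vec4f { float v[4]; };

struct bvhNode4 {
    unsigned int child;
    unsigned char unused;
    unsigned char mask, perm, flags;
};
struct bvhNode4_box { vec4f x[2], y[2], z[2]; };
struct bvhNode4_cluster {
    bvhNode4_box boxes;
    bvhNode4 nodes[4];
};

// 96 bytes of bounding planes plus 4 * 8-byte nodes = 128 bytes,
// naturally aligned in memory.
static_assert(sizeof(bvhNode4) == 8, "node must be 8 bytes");
static_assert(sizeof(bvhNode4_cluster) == 128, "cluster must be 128 bytes");

// Illustrative helpers: static mask in the upper 4 bits, dynamic mask
// in the lower 4 bits of the packed 'mask' byte.
inline unsigned staticMask(const bvhNode4& n)  { return n.mask >> 4; }
inline unsigned dynamicMask(const bvhNode4& n) { return n.mask & 0xF; }
```

The subset property from above (the dynamic mask can only have a bit set where the static mask does) is then the invariant <code>dynamicMask(n) &amp; ~staticMask(n) == 0</code>.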
<p>Also, the field for the primitive count has vanished from <code>bvhNode4</code> compared to the original layout.
In the case of a leaf node, <code>child</code> now references exactly one primitive cluster containing 4 triangles, which means that BVH construction needs to partition the primitives until only 4 or fewer remain in a single node, regardless of SAH cost.
This may seem like a drawback, but in practice the number of SAH-efficient leaves with primitive counts larger than 4 is negligible to begin with, and the simpler intersection logic results in slightly improved traversal performance.</p>
<h3 id="compression">Compression</h3>
<p>During construction of a BVH4, nodes are often created with 2 or 3 child nodes instead of 4.
In the case of 2 child nodes, both are leaves, and in the case of 3 child nodes, 1 is a leaf and 2 are inner nodes.
The distribution of the child count in a common scene is around 40%, 20%, and 40% for 2, 3, and 4 children respectively.</p>
<p>Thus child clusters are not always fully occupied, resulting in wasted memory.
Static masks allow compressing 2 unrelated pairs of child nodes into a single cluster, reducing the size of the hierarchy by around 20%.
For the case of 3 child nodes, compression is not directly possible because there are no nodes with a single leaf.
The strategy of ‘pulling up’ nodes can be applied here to fill the single empty slot with one of the two grandchildren.
However, this can degrade traversal speed slightly, so it is a trade-off between performance and (rather insignificant) memory savings.</p>
<p>The same mechanism is used to tightly pack primitive clusters for leaves with only 1, 2 or 3 triangles.
Here leaves with 1 triangle can be paired with nodes containing 3 triangles.
However, in practice 3 triangle leaves always outnumber single triangle leaves significantly, which prohibits perfectly dense packing.</p>
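A minimal sketch of the pairing idea, with hypothetical names (the real builder does this during partitioning): two unrelated 2-child nodes share one 4-wide cluster, the first occupying slots 0-1 and the second slots 2-3, each keeping a static mask that selects only its own slots so traversal never visits the other pair.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of static-mask compression for 2-child nodes.
struct PackedRef {
    std::size_t cluster;  // index of the (possibly shared) cluster
    unsigned staticMask;  // which of the 4 slots belong to this node
};

// Assigns cluster slots to a list of 2-child nodes; every odd node shares
// the previous node's cluster. An odd leftover gets a half-empty cluster.
inline std::vector<PackedRef> packPairs(std::size_t numTwoChildNodes) {
    std::vector<PackedRef> refs;
    for (std::size_t i = 0; i < numTwoChildNodes; ++i) {
        bool second = (i & 1) != 0;
        refs.push_back({i / 2, second ? 0b1100u : 0b0011u});
    }
    return refs;
}
```

The same slot-assignment logic carries over to primitive clusters, where 1-triangle and 3-triangle leaves are paired.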
<h3 id="editing">Editing</h3>
<figure style="text-align:center;font-style:italic;">
<a href="/images/boeing_edit1.png"><img src="/images/boeing_edit1_small.png" alt="Altering visibility based on material, object, mesh, or primitive granularity interactively." style="margin-left:auto;margin-right:auto;display:block;" /></a>
<figcaption><span style="font-weight: bold; font-style: normal;">Figure 1</span> Altering visibility based on material, object, mesh, or primitive granularity interactively.</figcaption>
</figure>
<p>When inspecting a model, you usually want to change the visibility of certain parts or delete unneeded geometry entirely.
For this use case, dynamic masks are the ideal technique, see Figure 1.
When a material, object, mesh, or primitive is selected, the corresponding triangles are disabled or enabled, and a post-order traversal of the BVH propagates the visibility changes up the hierarchy.
This way, subtrees containing only invisible geometry are implicitly skipped, accelerating ray traversal.
Note that this feature does not add any additional instructions to the traversal routines.
The original visibility of triangles / nodes can be reconstructed from the static masks.</p>
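The propagation step can be sketched as follows (illustrative structures, not raPT's code): after the edit has set the dynamic bits of the affected leaf slots, a post-order pass recomputes each inner slot's dynamic bit from whether the child subtree still has anything enabled, while the static masks stay untouched so the original visibility can be restored.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative node: 4 slots with a child index each (-1 = leaf or empty).
struct Node {
    unsigned staticMask;
    unsigned dynamicMask;
    int children[4];
};

// Post-order visibility propagation: an inner slot stays enabled only if
// its child subtree still has at least one enabled slot. Leaf slots keep
// the dynamic bits set directly by the editing operation.
inline void propagate(std::vector<Node>& nodes, int idx) {
    Node& n = nodes[idx];
    for (int i = 0; i < 4; ++i) {
        int c = n.children[i];
        if (c < 0) continue;                       // leaf slot: set by the edit
        propagate(nodes, c);
        unsigned bit = n.staticMask & (1u << i);
        if (nodes[c].dynamicMask != 0) n.dynamicMask |= bit;   // subtree visible
        else                           n.dynamicMask &= ~bit;  // skip subtree
    }
    n.dynamicMask &= n.staticMask;                 // dynamic subset of static
}
```

Because the traversal code only ever reads the dynamic mask, this update adds no instructions to the hot path.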
<p>Depending on the granularity of the visibility changes (e.g. material) it may be necessary to perform a post-order traversal of the entire hierarchy to find all corresponding primitives.
Thus the update is implemented with task-based parallelism utilizing the same parallelization framework powering the BVH builder.
For common scenes a full update takes a few milliseconds at most, and for a huge data set like the Boeing the timing is around 700ms on a dual socket workstation.</p>
<figure style="text-align:center;font-style:italic;">
<a href="/images/boeing_edit2.png"><img src="/images/boeing_edit2_small.png" alt="Using dynamic masks for delete. Removed geometry (red) can be restored at any time." style="margin-left:auto;margin-right:auto;display:block;" /></a>
<figcaption><span style="font-weight: bold; font-style: normal;">Figure 2</span> Using dynamic masks for delete. Removed geometry (red) can be restored at any time.</figcaption>
</figure>
<p>The dynamic masks are also useful to implement a ‘delete with undo’ mechanism.
With a special hidden ‘delete’ material defined, assigning this material to a geometry selection removes all the corresponding primitives from the scene.
The deletions can be undone by toggling the visibility of the ‘delete’ material or by changing the material for a selection of primitives.
This is illustrated in Figure 2 where all the removed triangles appear in red.</p>
<h3 id="construction">Construction</h3>
<p>For the efficient construction of high-quality split BVHs on many-core CPUs I have devised a fast parallelization framework, which is the topic of an upcoming paper already accepted for publication.
I will highlight the work in a future blog post with some additional goodies, time permitting.</p>
Mon, 18 Apr 2016
http://rapt.technology/posts/part-i-the-rapt-acceleration-structure/

Introducing the raPT renderer

<table class="table-img">
<tr>
<td class="td-img"><a href="/images/audi_r8_front_left.png"><img src="/images/audi_r8_front_left_small.png" alt="Front view of a path-traced Audi R8" /></a></td>
<td class="td-img"><a href="/images/audi_r8_front_right.png"><img src="/images/audi_r8_front_right_small.png" alt="Alternate front view of a path-traced Audi R8" /></a></td>
<td class="td-img"><a href="/images/audi_r8_back.png"><img src="/images/audi_r8_back_small.png" alt="Rear view of a path-traced Audi R8" /></a></td>
</tr>
<tr>
<td class="td-img"><a href="/images/boeing2_bottom.png"><img src="/images/boeing2_bottom_small.png" alt="Bottom view of a path-traced Boeing 777" /></a></td>
<td class="td-img"><a href="/images/boeing2_front.png"><img src="/images/boeing2_front_small.png" alt="Front view of a path-traced Boeing 777" /></a></td>
<td class="td-img"><a href="/images/boeing2_inside.png"><img src="/images/boeing2_inside_small.png" alt="Inside view of a path-traced Boeing 777" /></a></td>
</tr>
</table>
<p>Over the past few weeks the testbed surrounding my fast ray tracing kernels has evolved into a fully capable rendering system,
featuring <abbr title="High Dynamic Range">HDR</abbr> environment lighting, <abbr title="a.k.a. area lights">mesh lights</abbr>, <a href="http://www.cs.dartmouth.edu/~cs77/slides/18_PathTracing.pdf">next event estimation</a>, <a href="https://en.wikipedia.org/wiki/Low-discrepancy_sequence">low discrepancy</a> samples, <a href="http://www.cs.dartmouth.edu/~cs77/slides/18_PathTracing.pdf">Russian roulette</a>, and a simple material model supporting glossy and specular reflections and refractions.</p>
<!--more-->
<p>The images above of an Audi R8 (1.6M triangles) and a Boeing 777 (300M triangles) demonstrate partly the quality of the raPT renderer, and partly my artistic inability to choose proper material settings.
Please note that the missing parts in the Boeing images are not the result of a rendering error but of an incomplete data set.</p>
<p>If you compare the images to other renderings of the same <abbr title="computer-aided design">CAD</abbr> model <a href="https://www.google.de/search?tbm=isch&amp;q=boeing+777+ray+tracing">floating around</a>, you will also notice that I applied several modifications to the geometry and material groups.
Modifying the Boeing has been a totally pain-free experience as the raPT renderer supports interactive editing (with undo!), even for very large scenes.</p>
<p>For the coming weeks I plan to dive into several technical aspects in separate blog posts.
Currently I am working on seamless batch integration throughout the entire rendering pipeline, and once this is finished I plan to provide a detailed performance analysis as well.</p>
Thu, 31 Mar 2016
http://rapt.technology/posts/introducing-the-rapt-renderer/

Processing HDR maps for importance sampling

<p>Currently I’m implementing an <abbr title="High Dynamic Range">HDR</abbr> pipeline for <a href="https://en.wikipedia.org/wiki/Reflection_mapping">environment map</a> lighting in raPT, and I have found HDR image processing to be quite demanding computationally.
In particular the generation of full resolution <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function" title="Cumulative Distribution Function">CDF</a> tables for importance sampling can take a long time with a straightforward single-threaded implementation.
For example, building the CDF table for an 8k×4k HDR image (floats) alone already requires about 600ms on my <a href="http://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz">Intel Xeon E5 v3</a>. Since I want to use a slider to adjust exposure and rotation of the environment maps in my scenes, this is a real annoyance instead of real time.</p>
<!--more-->
<p>Obviously, implementing multi-threading support will already give a considerable boost on my dual socket 24 core machine, especially since the problem is embarrassingly parallel along one of its two dimensions.
However, as usual, curiosity took over and I wanted to see how far I could push performance, so I optimized the memory access patterns and added <a href="https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions">AVX</a> vectorization in addition to dynamically load-balanced multi-threading.
The results are decent and, while not providing any spectacular new insights, highlight some important performance characteristics of modern CPU architectures.</p>
<p>First, let me state the problem more clearly. A u-v parametrized environment map with width w and height h in a 2:1 aspect ratio is mapped to a sphere by aligning u with the azimuthal angle φ and v with the polar angle θ.
For importance sampling each image pixel is assigned a <a href="https://en.wikipedia.org/wiki/Probability_density_function" title="Probability Density Function">PDF</a> value derived from its scalar intensity scaled by sin(θ), and the resulting two-dimensional discrete PDF is separated into one-dimensional <a href="https://en.wikipedia.org/wiki/Marginal_distribution">marginal</a> and <a href="https://en.wikipedia.org/wiki/Conditional_probability_distribution">conditional</a> densities.
The density values are accumulated along the image columns into monotone step functions corresponding to the conditional CDFs, and from the integrals of the conditional CDFs a single marginal CDF is calculated.
Sampling is accomplished by drawing a pair of random numbers, which are used to index the marginal CDF and the resulting conditional CDF respectively by finding the function’s step interval that includes the value of the random number.</p>
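A scalar reference sketch of this construction (illustrative, not the optimized kernel described below) makes the data flow concrete: each pixel's scalar intensity is weighted by sin(θ) of its pixel-center polar angle, each column is accumulated into a conditional CDF, and the column integrals accumulate into the marginal CDF, both normalized to end at 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for the CDF setup. 'intensity' holds w*h scalar pixel
// intensities in row-major order; theta runs with v.
struct EnvCdf {
    std::size_t w, h;
    std::vector<double> conditional; // column u occupies [u*h, u*h + h)
    std::vector<double> marginal;    // one entry per column
};

inline EnvCdf buildCdf(const std::vector<double>& intensity,
                       std::size_t w, std::size_t h) {
    const double pi = 3.14159265358979323846;
    EnvCdf cdf{w, h, std::vector<double>(w * h), std::vector<double>(w)};
    double total = 0.0;
    for (std::size_t u = 0; u < w; ++u) {
        double sum = 0.0;
        for (std::size_t v = 0; v < h; ++v) {
            double theta = pi * (v + 0.5) / h;             // pixel-center angle
            sum += intensity[v * w + u] * std::sin(theta); // solid-angle weight
            cdf.conditional[u * h + v] = sum;              // running column sum
        }
        for (std::size_t v = 0; v < h; ++v)                // normalize the column
            cdf.conditional[u * h + v] /= (sum > 0.0 ? sum : 1.0);
        total += sum;
        cdf.marginal[u] = total;                           // column integral
    }
    for (std::size_t u = 0; u < w; ++u)
        cdf.marginal[u] /= (total > 0.0 ? total : 1.0);
    return cdf;
}
```

Sampling then binary-searches the marginal CDF with the first random number to select a column, and that column's conditional CDF with the second to select a row.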
<p>The performance critical part is the calculation of the conditional CDF for each of the pixel columns. Accumulation is performed along the v axis while the image is stored in u direction in memory. This guarantees 1-2 cache misses for every accessed pixel composed of 3 floats.
The textbook approach is to use some kind of blocking technique, which is implicit in the AVX implementation and load balancing described below.</p>
<p>The <span style="color:red">r</span><span style="color:green">g</span><span style="color:blue">b</span> <abbr title="Array of Structures">AOS</abbr> format of the pixel data is a less-than-ideal match for 8-wide <abbr title="Single Instruction Multiple Data">SIMD</abbr> instructions, unfortunately. The best approach here is to load 8 pixels at once (96 bytes) using 3 registers, and to perform a shuffle dance to get a single vector for each component.
This is accomplished by 3 <code>vperm2f128</code>, 6 <code>vblendps</code> and 3 <code>vpermilps</code> instructions, which was the shortest sequence I could come up with, but I always welcome clever suggestions. Once everything is in place the PDF values for the 8 pixels are calculated in parallel and … another inconvenience is encountered.</p>
<figure style="text-align:center;font-style:italic;">
<img src="/images/shuffle.svg" alt="Unpacking 3 registers of AOS pixel data into SOA format." width="450px" style="margin-left:auto;margin-right:auto;" />
<figcaption>Unpacking 3 registers of AOS pixel data into SOA format. Urgh!</figcaption>
</figure>
<p>Each of the calculated PDF values belongs to a different conditional CDF, so accumulation requires gathering the previous results from separate memory locations and scattering the new results back to separate memory locations again.
Since AVX gather/scatter support is either notoriously slow or non-existent, a small accumulation buffer is used to store the results of 8 iterations along the v axis in <abbr title="Structure of Arrays">SOA</abbr> format.
Once the buffer is full, the data is read back into the AVX registers and a full 8x8 transpose swaps rows and columns so that 8 results can be stored for each of the 8 conditional CDFs being processed.
The last entry of the accumulation buffer is used to seed the calculations of the next 8 iterations. This way all the gather/scatter operations are avoided at the cost of several shuffle instructions.</p>
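A scalar stand-in for this 8-wide tile (illustrative names, loops in place of vector instructions) shows the access pattern: 8 columns are accumulated over 8 rows into a small SOA buffer, and the transposed write-out then lands 8 consecutive CDF values per column, so no per-element gather/scatter is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar sketch of the blocked accumulation. 'pdf' is the w*h PDF image in
// row-major order; 'carry' holds the running sum of each of the 8 columns
// and seeds the next tile; 'out' stores h CDF values per column,
// column-major.
inline void accumulateTile(const double* pdf, std::size_t w, std::size_t h,
                           double carry[8], std::size_t u0, std::size_t v0,
                           std::vector<double>& out) {
    double buf[8][8]; // buf[row][lane]: SOA accumulation buffer
    for (std::size_t r = 0; r < 8; ++r)
        for (std::size_t l = 0; l < 8; ++l) {
            carry[l] += pdf[(v0 + r) * w + (u0 + l)];
            buf[r][l] = carry[l];
        }
    // "Transpose" write-out: 8 consecutive values for each column's CDF.
    for (std::size_t l = 0; l < 8; ++l)
        for (std::size_t r = 0; r < 8; ++r)
            out[(u0 + l) * h + (v0 + r)] = buf[r][l];
}
```

In the AVX version the buffer lives in registers and the transpose is done with shuffle instructions, but the data movement is the same.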
<p>The final optimization is to process 16 instead of 8 conditional CDFs in parallel, essentially duplicating the code described above.
The motivation is twofold: first, to fully utilize the cache lines being fetched (exactly 3 for 16 pixels), and second, to provide two independent instruction streams to increase <abbr title="Instructions per Cycle">IPC</abbr>.</p>
<p>Load balancing for multi-threading is then implemented with a simple atomic counter that represents the pixel columns of the HDR image, which is incremented by 16 every time a thread requires more work until the counter value is equal to the image width.</p>
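The counter scheme can be sketched in a few lines (hypothetical names; the per-column work is a stand-in for the CDF kernel):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of the atomic work counter: each thread grabs 16 columns at a
// time until the image width is exhausted. 'touched' counts how often
// each column was processed, standing in for the CDF kernel.
inline void processColumns(std::size_t width, std::size_t numThreads,
                           std::vector<std::atomic<int>>& touched) {
    std::atomic<std::size_t> counter{0};
    auto worker = [&] {
        for (;;) {
            std::size_t u0 = counter.fetch_add(16);  // claim the next 16 columns
            if (u0 >= width) return;                 // no work left
            std::size_t u1 = (u0 + 16 < width) ? u0 + 16 : width;
            for (std::size_t u = u0; u < u1; ++u)
                touched[u].fetch_add(1);
        }
    };
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < numThreads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

One `fetch_add` per 16 columns keeps contention on the counter negligible compared to the per-column work.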
<table style="text-align:left;margin-left:auto;margin-right:auto;border-spacing: 15px;">
<tr>
<th></th>
<th>original</th>
<th> AVX </th>
<th>multi-threaded</th>
</tr>
<tr>
<td><b>8k&times;4k</b></td>
<td>619ms</td>
<td>166ms</td>
<td>14ms (44x)</td>
</tr>
<tr>
<td><b>4k&times;2k</b></td>
<td>116ms</td>
<td>41ms</td>
<td>3ms (39x)</td>
</tr>
<tr>
<td><b>1k&times;512</b></td>
<td>2.42ms</td>
<td>0.74ms</td>
<td>0.07ms (35x)</td>
</tr>
</table>
<p>Results are listed in the table above for 3 different image sizes and 3 implementations (original, AVX, multi-threaded).
Let’s compare the original and AVX implementations first. The speed-up is around 3-4x which is reasonable considering the ratio of shuffle to actual compute instructions.
Next, let’s focus on multi-threading scalability. Activating all 24 cores/48 threads leads to a speed-up of 11-13x over the single-threaded AVX version, which is rather low.
The reason is memory bandwidth saturation, since the ratio of data movement to compute is quite high. <a href="http://www.intel.com/content/www/us/en/benchmarks/server/xeon-e5-2600-v3/xeon-e5-2600-v3-stream.html">Intel reports</a> 118GB/s for a synthetic benchmark on a dual socket system almost identical to mine, and since my memory buffers all live on the same <a href="https://en.wikipedia.org/wiki/Non-uniform_memory_access" title="Non-Uniform Memory Architecture">NUMA</a> node I can potentially reach half of this.
The medium-sized benchmark achieves an effective bandwidth of 4k × 2k × (12B + 4B) / 3ms ≈ 45GB/s, which is about 70% of the maximum. Most of the remaining bandwidth is consumed by hardware prefetching of cache lines along the image rows that are evicted before they are actually used and thus need to be reloaded.</p>
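The back-of-the-envelope figure spelled out, for checking variations (names are illustrative): each pixel moves 12 bytes of rgb input plus 4 bytes of CDF output, over the measured run time.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s: total bytes moved divided by run time.
inline double effectiveGBs(std::size_t w, std::size_t h,
                           std::size_t bytesPerPixel, double seconds) {
    return double(w) * double(h) * double(bytesPerPixel) / seconds / 1e9;
}
```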
<p>This brings me to a point that probably stood out to you from the start: Swapping the axes for marginal and conditional CDFs would linearize memory access and align with the hardware prefetcher, and in addition improve load balancing because rows could be processed individually instead of in sets of 8 or 16.
The simple justification for not doing this is that the format is decided by the sampling code, and for performance reasons I use the long axis for the marginal CDF.
Just out of interest I tested swapping the axes for the CDF computation, and achieved 25-30% shorter run times and 95% effective bandwidth utilization due to the more efficient memory access.</p>
<p>Concluding this rather technical (tedious?) post, my initial performance issue has been resolved and I can happily adjust exposure and rotation of my full resolution HDR environment maps in real time now. Considering the severe bandwidth bottleneck and the large number of threads in my test machine, the AVX implementation was not strictly necessary but a nice exercise nevertheless.</p>
<p>The next post will introduce the first version of the raPT path tracer featuring my <a href="/posts/fast-ray-tracing-kernels/">fast ray tracing kernels</a> and this HDR map processing implementation of course, so expect enjoyable pretty pictures!</p>
Sun, 24 Jan 2016
http://rapt.technology/posts/processing-hdr-maps-for-importance-sampling/

Fast ray tracing kernels

<table class="table-img">
<tr>
<td class="td-img"><a href="/images/boeing_front.png"><img src="/images/boeing_front_small.png" alt="Front view of a path-traced Boeing 777" /></a></td>
<td class="td-img"><a href="/images/boeing_bottom.png"><img src="/images/boeing_bottom_small.png" alt="Bottom view of a path-traced Boeing 777" /></a></td>
<td class="td-img"><a href="/images/boeing_side.png"><img src="/images/boeing_side_small.png" alt="Side view of a path-traced Boeing 777" /></a></td>
</tr>
</table>
<p>This is my first <abbr title="Rapid Path Tracing">raPT</abbr> blog post so welcome and thanks for reading. A couple of days ago <a href="http://jcgt.org" title="Journal of Computer Graphics Techniques">JCGT</a> published my paper <a href="http://jcgt.org/published/0004/04/05/">Efficient Ray Tracing Kernels for Modern CPU Architectures</a> which introduces refined algorithms for <abbr title="4-ary Bounding Volume Hierarchy">BVH4</abbr> traversal of coherent and incoherent rays, named <abbr title="Coherent Large Packet Traversal">CLPT</abbr> and <abbr title="Ordered Ray Stream Traversal">ORST</abbr> respectively.</p>
<!--more-->
<p>The results show that CLPT can outperform <a href="http://embree.github.io/">Embree</a> up to 4x for primary ray traversal! For incoherent rays ORST can achieve up to 60% higher traversal speed compared to the previously fastest method <a href="http://fileadmin.cs.lth.se/graphics/research/papers/2014/drst/" title="Dynamic Ray Stream Traversal">DRST</a>, which makes ORST about twice as fast as Embree for secondary rays.</p>
<p>Unsurprisingly, the motivation of the paper has been to push traversal speed even further compared to previous approaches, but also to have a unified pair of algorithms with best-in-class performance for a single acceleration structure. Previously, a <abbr title="binary Bounding Volume Hierarchy">BVH2</abbr> was best for primary rays and a BVH4 best for secondary rays <em title="Some would argue for a BVH8, but...">in my experience</em>. With CLPT, a BVH4 is always the right choice. While primary rays are not such a big deal in a path tracer, CLPT still provides a noticeable speed-up to overall frame time. More importantly, my ray tracing kernels are also used in scientific rendering applications where local shading is sufficient and performance is dictated by primary ray traversal.</p>
<p>The images above demonstrate both CLPT and ORST in action with diffuse path tracing of a Boeing 777 model with <em title="I believe the original model has 360 Million triangles, but some went missing apparently. If you have them please contact me!">300 Million triangles</em>. The high-quality BVH4 of the Boeing has been constructed within 15 seconds with an optimized, parallel <a href="https://graphics.cg.uni-saarland.de/2009/stichhpg2009/" title="Spatial Splits in Bounding Volume Hierarchies">SBVH</a> implementation, which will be the topic of a future blog post. For now, you can find the paper Abstract below and the link to the full paper (and other publications) in the <a href="/about/">About</a> section.</p>
<blockquote>
The recent push for interactive global illumination (GI) has established the 4-ary bounding volume hierarchy (BVH4) as a highly efficient acceleration structure for incoherent ray queries with single rays. Ray stream techniques augment the fast single-ray traversal with increased utilization of CPU vector units and leverage memory bandwidth for batches of rays. Despite their success, the proposed implementations suffer from high bookkeeping cost and batch fragmentation, especially for small batch sizes. Furthermore, due to the focus on incoherent rays, optimization for highly coherent BVH4 ray queries, such as primary visibility, has received little attention. Our contribution is twofold: For coherent ray sets, we introduce a large packet traversal tailored to the BVH4 that is faster than the original BVH2 variant, and for incoherent ray batches we propose a novel implementation of ray streams which reduces the bookkeeping cost while strictly maintaining the preferred traversal order of individual rays. Both algorithms are designed around a fast traversal order look-up mechanism. We evaluate our work for primary visibility and diffuse GI and demonstrate significant performance gains over current state-of-the-art implementations.
</blockquote>
Wed, 13 Jan 2016
http://rapt.technology/posts/fast-ray-tracing-kernels/