It was the week between Xmas and the New Year, and I was enjoying my free time, hacking on my renderer (a life-time obsession). Up till then, my renderer only included rasterizers, with soft-shadows and pre-calculated ambient occlusion:

I wanted more, however. I had more or less completed everything I ever wanted to do with my rasterizers, since (a) all the features I ever wanted to implement were done (b) the code used a common C++ template to create a "family" of rasterizers (since all rasterizers share the same trait: they interpolate "stuff" linearly across the scanlines, per-pixel).

Phase 1: Raycasting

Raytracing has nothing to do with rasterizing - it is a completely different algorithm. It sends a ray from the viewpoint, that "pierces" the screen at each pixel, and checks whether the ray intersects the scene's triangles:

for each screen pixel
throw a ray from the viewpoint to the pixel
find the closest triangle pierced by the ray
at the intersection point, perform Phong lighting

Or, if we also want shadows:

for each screen pixel
throw a ray from the viewpoint to the pixel
find the closest triangle pierced by the ray
at the intersection point, throw another ray towards the light
if this shadow ray finds any triangle, the pixel is in shadow
Otherwise, perform Phong lighting

Raytracers can in fact work with more than just triangles - spheres and cylinders and other "perfect" objects can be traced. I have, however, always coded my renderer to work with triangle meshes - for a number of reasons (huge availability of ready-made 3D models being a good one). I therefore Googled for ray/triangle intersections. I knew I would need acceleration structures - if I was to make this run at real-time speeds, I had to avoid checking each ray against all the scene's triangles.

A thesis drew my attention: two guys had implemented irregular Z-buffers in CUDA, using "bins" of triangle lists. It was simple enough: the screen was split in a grid of "bins", and after calculating the 2D bounding box of the projected (screen-space) triangle, we know which "bins" the triangle "spans" over...

The triangles are placed in "bin" lists.

...and we place the triangle in a simple std::list<Triangle*> (stored per each bin). When sending our rays, we only check for intersections with the triangles that belong in the ray's bin, not with all of them.

Well, CUDA didn't have STL lists of course, but I coped the old fashioned way, via heap (cudaMalloc-ed). And this worked - after about a day of coding, I had my first ever raycaster written:

A dragon... in C++ and in CUDA.

A 50K triangles "dragon" was raycasted at about 20 frames per second. Not bad for a days work!

Unfortunately, as the video shows, the speed nose-dived from 20 to about 1.5 frames per second, as soon as I added shadows. I was aghast. Why? I was using the same trick, that is, "bins" that store triangle lists, only this time I was calculating them in "light-screen" space. The same algorithm, but speed was devastated... because CUDA1.2 (i.e. my 70$ GT240) has no cache, and also suffers a lot from thread divergence.

So I posted this to Reddit, hoping for advice. In 2 days, it became the most popular CUDA post ever in Reddit/ programming :‑) To be honest, some people there wanted to skin me alive. "How dare you implement shadows without OpenGL, you bloody idiot!" (or something to that effect).

But I persevered :‑)

Googling and reading papers, I found out about the balancing of global memory latency with more threads, so I wrote a quick Python script to find the optimal balance between the number of bins I was using, and the number of threads per CUDA block. Rendering speed improved by a factor of 4!

Using textures was next: CUDA1.2 (i.e. my GT240) has no cache, but if the data were placed in textures, they would be cached. A bit. I tried it, and only got 3% speedup...

So I went back to the code, and tried to make my global memory accesses more coalesced: I changed the screen size of the bins to directly mirror the number of CUDA blocks - this meant that each block of threads was accessing exactly the same triangle list, so global memory accesses became a lot more "streamlined". Speed soared by 2.5x.

And at that point, I realized it - I was doing raycasting, with Phong lighting and shadows, rendering a colorful train object at 21 frames per second!

Oh, the joy! :‑)

Phase 2: Raytracing

Two weeks passed. And as you can probably guess, I wanted more. Again :)

It turned out, that the "bins" trick is not good enough. After all, a raytracer doesn't stop at the first intersection. It casts reflected, refracted and my favourite, ambient occlusion rays. Bins can't accomodate that. I needed a better data structure, a space partitioning scheme.

So I Googled again... and read about the Bounding Volume Hierarchy, and how it can be used to accelerate any ray-intersection checking in world space. As I always do with a new graphics algorithm, I first coded it in my SW-only version of the renderer. Furious coding ensued... and the results were beautiful, beyondanything my rasterizers ever did. Were these images really created by my renderer? I couldn't believe it.

What I could believe, unfortunately, was the speed. Abysmal. The pure-C++ version gave me one frame per second on the train object (sigh).

This begged for the CUDA treatment.

Moving the BVH to CUDA land

A bounding volume hieararchy is a simple data structure, that "groups" objects together into bounding volumes. If a ray doesn't hit the volume, then you don't need to check whether it hits any objects inside it, and you are gaining tremendous speed.

The structure itself, however, is hierarchical - which is another way of saying "recursive". And guess what - CUDA doesn't have recursion. I had to switch to a stack-based recursive implementation for my ray traversals into the BVH. This turned out to be easier than I thought. I also changed the C++ version of BVH nodes (that had pointers to inner/leaf nodes inside them) into a far less C++-y version, storing indices in the CUDA arrays instead of pointers.

I compiled, typed the command to spawn the raytracer, looked away from the screen, and hit ENTER...

Would you believe me? It worked the first time. No joke. I spent two hours to change the C++ code into its CUDA equivalent, and it worked the first time I tried. Has only happened once or twice in my programming life...! Probably it was the side effect of the additional experience I had amassed in the meantime - regardless, I was positively giddy - this first attempt was running 3.5x times faster than the C++ code, showing reflections and shadows. I had 3.5 frames per second on the train now...

More! I needed more! :‑)

Looking at the code, I remembered about textures again. I had tried them in the "bins" version, two weeks ago, but they hadn't helped much there. I decided to try it again, and did a rather ugly hack of storing successive floating point numbers from my structures into successive float4 elements of a CUDA texture. I did it for everything - vertices data texture, triangles data texture, pre-computed triangle intersection data texture, BVH data texture...

And at each step... Oh... my... God.

The speed was increasing by more than 50% each time I stored a new category of data in textures! I have no explanation as to why it had failed so miserably when I tried it with the bins - maybe I goofed somehow (most probably). The point is, that I finally had it - I had a raytracer running in real-time, showing me reflections and shadows, at about 15 frames per second!

CUDA recursion via C++ templates

So far, I was using a stack-based recursion for traversing the BVH, but I was NOT using anything similar for the raytracer itself. I had something else in mind: C++ template meta-programming.

Now, I don't know if you've seen template meta-programming before, but I consider this one of the best hacks I've ever written. I force the C++ compiler to create different versions of the Raytrace function, and each one calls another, not itself - so no problem compiling this with CUDA!

The recursion ends when Raytrace<MAX_RAY_DEPTH> is called, which returns the ambient black I am using.

Fix your BVH, stupid

The raytracer was already running in real-time. But I was noticing some strange things while flying in my scenes - sometimes, even when I was staring in a part of space that was decidedly empty, it was slow - when in fact it should have been at 30fps or more.

Debugging this was a LOT more difficult. It required logging (see the #ifdef DEBUG_LOG_BVH sections in the code) and extensive testing. I eventually found I had a bug in how I was building the BVH, and when I fixed it, speed almost doubled. Again :‑)

The train was now running at the same speed as the "bins" version, with Phong lighting, Phong normal interpolation, and both shadows and reflections enabled. And the code was actually... much better looking.

Except for one thing - the right BVH... took more time to build. The chessboard needed 34 seconds to load...

BVH-building with SSE

I had coded SSE before (in my Mandelbrot zoomer), so it was clear to me that building the BVH and continually taking the min and max of floating point coordinates while building bounding boxes, was a prime candidate for _mm_min_ps and _mm_max_ps. If you are not aware of them, SSE instructions work on 4 floats at a time, and could therefore be used to accelerate my BVH building a lot.

I had worked with SSE before, but I had not used intrinsics - this time, I chose to be more portable and avoid inline assembly. A day's worth of work later (because turning something to SSE is not as easy as writing it in CUDA) I finished it - and the chessboard was now loading in 10 seconds.

As fate would have it, as soon as I finished with that, I realized I could simply store the calculated BVH data the first time I load an object, and re-load them (from the saved "cache") next time. Even the slow version of the BVH calculation would be fine with that.

Sometimes the simple ideas come too late :‑)

(Future plans: dynamic scenes via runtime creation of BVH with CUDA)

Templated features

Since C++ templates are compile-time code generators, I realized I could move beyond the "rendering modes" of my SW-only renderer. I could choose different rendering parameters at runtime, by changing the core CUDA kernel into a template:

What this means, is that each time the compiler sees a PAINT macro invocation, a specific template is instantiated,
with just the features we want - at compile-time! And the result is that we can switch raytracer features on and off,
at run-time, without the penalty of executing these if statements at run-time, in the body of the CUDA kernel.

Have I mentioned how I love C++ templates?

OK, here's a gripe about them, so you won't call me a fanboy of Stroustrup: When the template params are booleans and/or enumerants, I shouldn't have to resort to such hacks - I should be able to say...

More speed!

A result to compare with: the latest version of my raytracer, renders the well-known "Conference" scene with primary rays, diffuse lighting and shadows at 22 frames per second. Not bad, I think. The train object is also running at 18 frames per second with reflections, shadows, Phong normal interpolation and specular lighting.

Features of the raytracer

For those of you that skipped the developer gobbledygook above, here's a short list of my raytracer's features:

A Bounding Volume Hierarchy using axis-aligned bounding boxes is created and used for ray/triangle intersections. The BVH is created via the surface-area heuristic, and is stored for fast re-use. If SSE extensions are detected during compilation (from configure), then a SIMD implementation is used at load-time that builds the BVH faster.

CUDA 1.2 cards like my GT240 have no support for recursion, so I used C++ template magic to implement compile-time recursion - see cudarenderer.cu in the source tarball for details.

...and the second is also quite easy - add the contrib and non-free
collections to /etc/apt/sources.list - that is, at the end of your distro's
repository line. In my case, it looks like this after editing:

The comments on this website require the use of JavaScript. Perhaps your browser isn't
JavaScript capable or the script is not being run for another reason. If you're
interested in reading the comments or leaving a comment behind please try again with a
different browser or from a different connection.