GPU Hacks: Fractals, a Raytracer and Raytraced Quaternion Julia Sets

Nils L. Corneliusen10 June 2016

Introduction

The NVidia Tegra X1
has a Maxwell-based GPU with a theoretical FP32 peak of 512 GFLOPS per second. It can easily be programmed
using OpenGL GLSL shaders. However, making fast GPU code is different from making fast CPU code.
3 non-typical GPU jobs are implemented in fragment shaders and optimized for better performance.
Videos and source code is included.

GPU Fractals

Note: The theory of this section is still good, but has been revised numerous times since the
publication of this article. Refer to the
Pipeline
in 2017 and 2018 for more info.

The code for both is essentially the same; two conditionals is enough to separate them.
Iterations as input is not a good idea if the main loop can be unrolled. Currently it can't,
but we'll get to that later. Let's start by defining an interface:

And, finally, it's time to roll things out. Just duplicating the calculations + if() won't work.
Let's try something different. Instead of optimizing for early cutoff, let's optimize for the worst
case, ie. max iterations is expected. As long as it's possible to quickly identify the first
value above 4.0 later, it doesn't matter if the code overcalculates several values. The trick is to figure out
the optimal depth and make the compiler stuff the comparisons between the calculations to avoid
stalls.

As it turns out, 32 seems to be the optimal depth. And packing the calculations into
a vec4 and comparisons into a bvec4 array seems to work ok. What doesn't seem to work
is using greaterThan(), so skip that:

Then it's a matter of figuring out which value is the first to be greater than 4.0. Let's bundle up the
comparison results as bits in a 32-bit uint and use findLSB() to find the lowest bit in the current block
of 32 results:

Next is the main reflection loop and the crucial first trace_ray() call, which is inlined here.
trace_ray() will stall badly since every calculation depend on the previous. And the compiler doesn't
unroll or interleave since there's an if() at the end, so it has to be done manually.

The plane is not reflected and only has non-reflective shadows. So let's exit if we didn't hit anything and it's
not the first reflection:

if( rt_hit == -1 && refl > 0 ) break;

That means checking for plane intersection only happens on the first iteration when all spheres are missed.
If the plane is missed, generate a pulsing skyline instead. obj_pos[1].z should of course vary a bit, or
it gets boring quick.

The plane is hit, so determine if there's a shadow. Trace a ray back to the light pos
and see if anything is hit. Since only hit/no hit is interesting, it can be simplified compared to the
original trace_ray().

GPU Raytraced Quaternion Julia Sets

Note: This section is still technically correct, but some spare cycles were found and detail was increased in 2017-2018. Also
proper rotation added. Please refer to the quaternion code in the
Revised 2017 Xmas demo.

22 June 2017: Plane projection:

6 June 2017: Slightly revised color scheme in new video:

Original video:

The natural extension of a fractal routine and a raytracer. Naah, I just came across it randomly
on the net and it looked cool.

Keenan Crane
has written a paper about the theory behind it, and he's also made a GPU implementation. This
was way back in 2004. Is it possible to get it to run quickly on the Tegra X1? Let's give it a shot.

For simplicity, I recycled some of the raytracer stuff, so the interface is similar:

Iterations is moved to a define to make unrolling more viable later. startpos, ax, ay should
be calculated as in the raytracer. An interesting position would be mu_pos=(-0.125, -0.256, 0.847, 0.0895), epsilon=0.02
where the left half resembles Doh from the Arkanoid breakout game. More positions can
be found on Paul Bourke's page.

Let's go through the routines from top to bottom. quatMult() and quatSq() are ok, except there's
performance issues with using the GLSL functions dot()/cross() there. I have no idea why,
so let's skip using them in these two crucial functions.
Paul Bourke has a good explanation of the mathematics involved.

inout vars seems to cost a bit, so the ones not needed are eliminated. Some of the loops have constant
inputs for first iteration, so they can be trivially reduced. (Not by the compiler, of course.)
And again manual interleave-unrolling must be done by hand. This is becoming routine:

Obviously, this code is heavier. But there's just one object, and the lowest result seems to be about 62 fps when morphing
through some of the points from Paul Bourke's article.

FPS measurement, rendering to PBuffer (no vsync), same path as in the video:

Comments are always appreciated. I prefer being contacted on LinkedIn.
Email is also available, you can figure it out from the front page.
I switched to a disposable email address system, so it will change regularly.