Pipeline, 2016 Archive

22 December 2016

Video with 96 spheres. Doing 16x16 block probing on the CPU before rendering it on the GPU:

21 December 2016

In the never-ending hunt for more spheres, it's time to turn to the CPU for answers. It's mostly idling anyway.
Let the CPU check the corners of each 16x16 block of the display and determine if the area is covered by only spheres,
only shadow plane/sky or both. This makes the GPU code significantly quicker, since half the code can be eliminated for most of the screen:
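The CPU code isn't published yet, so here is only a minimal sketch in C of what the corner check could look like. All names (`classify_block`, `pixel_hit_fn`, the `disc` stand-in scene) are hypothetical, and the per-pixel test is passed in by the caller rather than being the real raytracer.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is not the actual CPU code. */
enum block_class { BLOCK_BACKGROUND = 0, BLOCK_SPHERES = 1, BLOCK_MIXED = 2 };

#define BLOCK_SIZE 16

/* Caller-supplied test: does the primary ray through pixel (x, y) hit a sphere? */
typedef bool (*pixel_hit_fn)(int x, int y, void *scene);

/* Classify one 16x16 block by probing only its four corner pixels. */
enum block_class classify_block(int bx, int by, pixel_hit_fn hit, void *scene)
{
    int x0 = bx * BLOCK_SIZE, y0 = by * BLOCK_SIZE;
    int x1 = x0 + BLOCK_SIZE - 1, y1 = y0 + BLOCK_SIZE - 1;
    int hits = hit(x0, y0, scene) + hit(x1, y0, scene)
             + hit(x0, y1, scene) + hit(x1, y1, scene);

    if (hits == 4) return BLOCK_SPHERES;     /* sphere-only shader path */
    if (hits == 0) return BLOCK_BACKGROUND;  /* plane/sky-only shader path */
    return BLOCK_MIXED;                      /* needs the full shader */
}

/* Stand-in scene for trying it out: a single disc in screen space. */
struct disc { int cx, cy, r; };

bool disc_hit(int x, int y, void *scene)
{
    const struct disc *d = scene;
    int dx = x - d->cx, dy = y - d->cy;
    return dx * dx + dy * dy <= d->r * d->r;
}
```

Probing only the corners is a heuristic: anything smaller than a block that sits between the corners would go unnoticed in this sketch.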

Corner checking takes around 6ms CPU time, so let a separate thread calculate that for the next frame while the current one
is rendering. Synchronize it all with a barrier. Simple. This calls for an increase to 96 spheres at 60 fps (barely):

Code needs some work before publication.

14 December 2016

There was a slight mishap in the raytracer code which made some shadows disappear. It's noticeable in the video if you
pay close attention to the shadows. The measurements are a bit off too, but it's still above 60 fps after fixing it, so
I'm not gonna bother updating them. The GLSL source code in the GPU Hacks Article has been updated.

14 December 2016

GPURay 1.7, 80 spheres at 60 fps on NVidia Tegra X1:

Music: 1992 Amiga Oktalyzer module by Jan Wegger.

There's an interesting pragma that can be used in NVidia GLSL code: #pragma optionNV (unroll all).
Trying it on the raytracer gives interesting results: A huge assembler file that has no REP loops at all,
and slightly quicker execution time.

So the trick is to find the correct unroll level and do it manually. Earlier attempts were clearly too conservative.
Automatic unrolling doesn't work if there are if() statements in the loop. After some trial and
error it seems that 16 is a good unroll value for a target of 80 spheres. Since the unroll is greater, using vec4
in the second loop starts paying off too. Let's increase to 80 spheres and do a test run. Blue is old code, orange is unroll
all and green is unroll 16 + vec4:

12 December 2016

An attempt to maximize the Tegra X1 GPU power usage. Around 11 watts is the current highscore while still looking cool:

4 November 2016

One issue with the solution below is the excessive use of if() statements. By wrapping things up in vectors
we can get most of them out of the way. greaterThan() etc. in GLSL compile to the corresponding sgt etc. instructions.

The if() set can be reduced with some mix() and min() and stuff. The generated code is neat,
but it's not very quick yet. Have to investigate that further. Luckily, this is not like in
the old days.
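To illustrate the idea, here is a small scalar sketch in C. It is not the shader code: `mixf` and `greater_than` are hypothetical stand-ins for the GLSL built-ins mix() and greaterThan(), and the "keep the nearest hit" update is just an assumed example of an if() pair that can be folded away.

```c
/* Scalar stand-ins for GLSL built-ins; in a shader these work on vec4. */
static inline float mixf(float a, float b, float t) { return a * (1.0f - t) + b * t; }
static inline float greater_than(float a, float b) { return (float)(a > b); }

struct hit { float t; float id; };

/* Branchy version: two nested tests per candidate. */
struct hit nearest_branchy(struct hit best, float t, float id)
{
    if (t > 0.0f) {
        if (t < best.t) { best.t = t; best.id = id; }
    }
    return best;
}

/* Same result with the tests turned into a 0/1 mask that drives mix().
 * Start best.t at a large finite value, not infinity: inf * 0 is NaN. */
struct hit nearest_branchless(struct hit best, float t, float id)
{
    float closer = greater_than(t, 0.0f) * greater_than(best.t, t);
    best.t  = mixf(best.t,  t,  closer);
    best.id = mixf(best.id, id, closer);
    return best;
}
```

Whether this is faster depends on the compiler and hardware, which matches the observation above that neat code is not automatically quick code.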

The cool part is that it can now do 72 spheres:

2 November 2016

Video with 64 spheres:

1 November 2016

While perusing more generated code and trying out other unroll structures, it's clear that the
one from 28 October is not totally optimal for larger numbers of spheres. A different approach yielding
better results seems to be turning down the lower unroll to 4 (not shown), and increasing the upper to 4
while moving the sqrts out of the way. So there are some double tests, but the compiler seems to figure
it out. The whole point of that is to insert a continue if none of the 4 hit at all. The miss ratio
is pretty high and we're going through everything anyway:

The good news is that the number of spheres can be increased to 64 with plenty of cycles to spare.
The bad news is that there aren't enough cycles for the next multiple of 4. Have to find a way to get
around that.

28 October 2016

In the never-ending hunt for more spheres in the GPU Raytracer, I noticed that the glGetProgramBinary() call
outputs assembler code too, in NV_gpu_program5 format. That makes it reasonably easy to see where improvements can be made.

The conditional construct in the first loop seems to confuse the compiler, and blocking negative
square roots is pointless when they're basically free. Wrap it all up in a simple if()
that reduces nicely in the generated assembler code. Still need to unroll 2 by hand, though,
since the compiler still cannot unroll automatically when there are if() expressions in the loop:

Unlike gcc, the NVidia compiler seems to be able to look ahead far enough to convert this into
a coherent set of dp3/mad/or/trunc instructions. It's now considered wise to keep the number
of spheres a multiple of 10. Looks like it can finally do 60 spheres at 60 fps:

New video:

20 September 2016

I missed a trivial optimization in the shadow calculation in the GPU Raytracer.
There's no point in calculating the distance, it's enough to know whether any sphere is hit or not.
That simplifies the loop enough so it can be unrolled further. A limitation is that the sphere count
has to be a multiple of 4, not 2 as earlier:
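As a sketch of the idea in C (not the shader itself): an any-hit shadow test only needs the sign of the discriminant, so there is no sqrt and no distance bookkeeping. The names are hypothetical, and the sketch assumes the shaded point lies outside every sphere and that the light direction is normalized.

```c
#include <stdbool.h>

struct vec3 { float x, y, z; };
struct sphere { struct vec3 c; float r; };

static inline float dot3(struct vec3 a, struct vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static inline struct vec3 sub3(struct vec3 a, struct vec3 b)
{
    struct vec3 v = { a.x - b.x, a.y - b.y, a.z - b.z };
    return v;
}

/* True if the ray p + t*l (t > 0) touches the sphere. Assumes p is outside it,
 * so "discriminant >= 0 and sphere in front" is enough; no distance needed. */
static inline bool occludes(struct sphere s, struct vec3 p, struct vec3 l)
{
    struct vec3 oc = sub3(p, s.c);
    float b = dot3(oc, l);
    float disc = b * b - (dot3(oc, oc) - s.r * s.r);
    return disc >= 0.0f && b < 0.0f;
}

/* Any-hit test, four spheres per iteration; n must be a multiple of 4. */
bool in_shadow(const struct sphere *s, int n, struct vec3 p, struct vec3 l)
{
    for (int i = 0; i < n; i += 4) {
        if (occludes(s[i], p, l) | occludes(s[i + 1], p, l) |
            occludes(s[i + 2], p, l) | occludes(s[i + 3], p, l))
            return true;
    }
    return false;
}
```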

5 September 2016

Obviously, the same is true for the input data in the fractal and quaternion code too. The results
are less dramatic, but the quaternion routine is now stable above 60 fps.
Updated the article and made new measurements.

31 August 2016

The NVidia article How About Constant Buffers gives some interesting information about how data is stored and accessed on the GPU.
The scene data in the GPU Raytracer is constant and the usage pattern fairly regular, so storing it in a constant buffer sounds like a good idea.
Constant buffers translate to uniform buffers in OpenGL, so it's a matter of just replacing
"shader storage" with "uniform" in the right places.

The results are spectacular. Old version (SSBO) versus new version (UBO), 32 spheres:

Video below. Can now have 52 spheres and keep fps above 60 all the time:

2 August 2016

27 July 2016

It's been a slow summer, so I finally had time to update the SRTP AES Optimization
article from 2010. Should now support all CPU types and packet sizes larger than 4096.
It's available here: SRTP AES Optimization Revisited

13 May 2016

April 2016

April is GPU appreciation month! Some attempts at making old cr... err, old stuff run on an
NVidia Tegra X1.

GPU Raytracing

36 objects, 3 reflections, non-reflective shadows on center sphere. 126 lines shader code, could be
less if nvcc would stop screwing up unrolling.

GPU Julia

Based on a GPU implementation by John Tsiombikas.
I spent some time optimizing it and used an HSV-ish palette instead. 256 iterations. 45 fps if all pixels are
at max iterations, aka black screen. So 60 fps for all normal cases. Need bigger GPU.
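For reference, a minimal escape-time loop in C shows why the black screen is the worst case: points that never escape run the full iteration count. This is a generic Julia iteration with an assumed escape radius of 2, not the optimized shader, and `julia_iterations` is a made-up name.

```c
/* Escape-time iteration for z -> z*z + c, capped at max_iter.
 * Returns max_iter for points that never escape (drawn black). */
int julia_iterations(float zx, float zy, float cx, float cy, int max_iter)
{
    int i = 0;
    while (i < max_iter && zx * zx + zy * zy <= 4.0f) {
        float t = zx * zx - zy * zy + cx;  /* real part of z*z + c */
        zy = 2.0f * zx * zy + cy;          /* imaginary part of z*z + c */
        zx = t;
        i++;
    }
    return i;
}
```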

GPU Plasma Effect

HSV to RGB Conversion

Using an HSV-based palette when making effects is useful, since it wraps without any edges.
But the color has to be converted back to RGB before displaying, and this can be complicated. Several methods of
assorted quality can be found on the net. This one seems popular, and this one is also interesting.
Donald Reynold's implementation uses cosines for everything,
but then the hexagonal shape is gone. So let's do a twist: Use a cosine for the basic H slope, clamp() for the flat
top/bottom and mix() or similar for S and V: