I just did a “hey kids, let me tell you about demoscene” event at work, where I talked about
and showed some demos I think were influential over the years, roughly sorted chronologically.

Here’s that list, in case you also want to see some demoscene
things. There are a whole bunch of excellent demo productions I did not show (due to time constraints), and I
mostly focused on Windows/PC demos. A decent way of finding others is searching through the “all time top”
list at pouët.net.

I’m giving links to YouTube because, let’s be realistic, no one’s gonna actually download and run
the executables. Or if you would, then you most likely have already seen them anyway :)

Future Crew “Second Reality”, 1993, demo

Tim Clarke “Mars”, 1993, 6 kilobytes

Psychic Link “Paper”, 1996, 64 kilobytes

Exceed “Heaven Seven”, 2000, 64 kilobytes

farbrausch “fr-08: .the .product”, 2000, 64 kilobytes

Alex Evans “Tom Thumb”, 2002, wild demo

TBC & Mainloop “Micropolis”, 2004, 4 kilobytes

mfx “Aether”, 2005, demo

Kewlers & mfx “1995”, 2006, demo

mfx “Deities”, 2006, demo

farbrausch “fr-041: debris”, 2007, 144 kilobytes

Fairlight & CNCD “Agenda Circling Forth”, 2010, demo

Fairlight & CNCD “Ziphead”, 2015, demo

Eos “Oscar’s Chair”, 2018, 4 kilobytes

Conspiracy “When Silence Dims The Stars Above”, 2018, 64 kilobytes

Sailing out to sea | A tale of woe is me
I forgot the name | Of where we’re heading
– Versus Them “Don’t Eat the Captain”

So! This whole series on pathtracing adventures started
out without a clear goal or purpose; “I’ll just play around and see what happens” was pretty much it.
Looks like I’ve run out of steam, so I’ll pause further work on it. Maybe sometime later
I’ll pick it up again, who knows!

One nice thing about 2018 is that there’s a lot of interest in ray/path tracing again, and
other people have been writing about various aspects of it. So here’s a collection of
links I saved on the topic over the past few months:

Planning

Our kids are five years older (15 and 9yo), which makes it easier! We are five years older too though :/

Six people in total, since this time we also took my parents along. This meant renting two cars.

Similar to last time, I used the internets and Google Maps to scout for locations
and do rough planning. It was basically “go around the whole country” (on the main Route 1),
cutting in one place via the highland Route F35, and then a detour
into Snæfellsnes peninsula.

Total driving distance ended up being ~2600km (200-300km per day). That does not sound like a lot, but we did not end up
having “lazy days”; there is a lot to see in Iceland, and every stop along the way is basically an hour or two.
For example you might want to hike up the waterfall, or get down to some cliffs in the water, etc.
The map on the right shows all the places we did end up stopping at. I had a dozen more marked up, but we skipped
some.

I booked everything well in advance (4 months), either via Booking.com or Airbnb.
Since we were a party of six, in some of the more remote places there were actually not that many choices. Having a camper
or tents might be much cheaper and allow more freedom, at the expense of comfort.

Late June is basically “early summer” in Iceland. Most/all of the highland roads are already open. There can be quite a lot of rain; I was looking at the forecasts
and it did not look very good. Luckily enough, we only got serious rain for like 3 days; most other days there was relatively little
rain. Temperature was mostly in +8..+15°C range, often with a really cold wind. There were moments when I wished I’d taken gloves :)

Photo Impressions

Most of the photos are taken by my wife.
Equipment: Canon EOS 70D with Canon 24-70mm f/2.8 L II and Sigma 8-16mm f/4.5-5.6. Some taken with iPhone SE.

Day 1, South (Selfoss to Kirkjubæjarklaustur)

The southern part is quite crowded with tourists; the stretch up to Dyrhólaey/Vik has plenty of sights and makes a good trip for
a day. We also started the first day with “ok, there’s a million things to see today!”.

Fun fact! The Unity codebase has a text = "Eyjafjallajökull-Pranckevičius"; line
in one of the tests, which checks whether something deals with non-English characters properly. I think
@lucasmeijer added that.

End of June is the blooming time of the Nootka lupine;
there are vast fields full of them. People go to take wedding photos and whatnot in there.

Next up, we can go to the tongue of the Sólheimajökull glacier
(which is a bit redundant; “jökull” already means “glacier”). I’ve never seen a glacier before, and
the photos of course don’t do it justice. This is a tiny piece at the end of the glacier. Very impressive.

Day 2, South/East (Kirkjubæjarklaustur to Höfn)

Driving up to another glacier, Svínafellsjökull. Again, the scale is hard to comprehend; many glaciers in Iceland
are 500 meters high, some going up to a kilometer. A kilometer of ice!

A short (but very bumpy) road to the side, and we are close to it:

Next up, the Jökulsárlón glacial lake. It was
a setting for a bunch of movies! The lake is just over a hundred years old, and is growing very fast,
largely due to melting glaciers.

Right next to it there is the so-called “Diamond Beach”, where icebergs, after being flushed out into the sea
and eroded by salt, come ashore as tiny pieces of ice. The sand is black of course, since it was originally
pumice and volcanic ash.

Day 3, East (Höfn to Egilsstaðir)

The eastern side of Iceland is where there are no tourist crowds, and no big-name attractions either. Even the main
highway becomes gravel for a dozen kilometers in one place :) Most of Route 1 goes along the coastline
that is full of fjords, which makes for a fairly long drive. There is a shortcut
(route 939 aka Öxi) that lets you cut some 80km, but it’s
gravel and very steep (here’s a random YouTube video showing it).
I thought “let’s do the coastline instead, we’ll watch plenty of sea and cliffs”. Not so fast!
Turns out, the coastline can mean that there’s a literal cloud right on the road, and you basically don’t see
anything. Oh well :)

There were some lighthouses (barely visible due to mist/fog/clouds), a nice waterfall (Sveinsstekksfoss),
and also here’s a photo of our typical lunch:

We stayed in a lovely horse ranch, and also found an old car dump nearby.

Day 4, North/East (Egilsstaðir to Mývatn)

Most of the day was driving on Route 1 through the Norður-Múlasýsla region. First you see towns and villages disappear,
then farms disappear, and then even sheep disappear (whereas normally sheep are everywhere in Iceland).
What’s left is a volcanic desert with basically a single road cutting through it.

There was a waterfall (Rjúkandi) near the start of that drive, and lava fields towards the end, close to Dettifoss.

Here’s Dettifoss, which is 100m wide, has a 44m drop, and other
measurements as well (ref).

Nearby, the Krafla area with the Víti crater, Krafla power station
and Hverir geothermal area with fumaroles and mudpots.

Lake Mývatn nearby has a flying mountain (not really, just low fog) and a lot of birds.

Day 5, North (Mývatn to Akureyri)

Mývatn to Akureyri is a very short drive, so we did a detour through Husavik towards
Ásbyrgi canyon.
Last time we were in Iceland, Husavik was lovely and Ásbyrgi
was quite impressive. However this time, pretty much the whole day was heavy rain. Not much visibility, and
not too pleasant to hike around and enjoy the sights. Oh well! Here’s Ásbyrgi and
Goðafoss:

Akureyri has an excellent botanical garden; more photos from it at
my wife’s blog.

Day 6, Highlands (Akureyri to Kerlingarfjöll)

This was where we turned off the main highway and onto the F35/Kjalvegur
gravel road. A bunch of people had suggested something along the lines of “OMG, you have to go along one of the
highland roads”, and so that’s why we did it. F35 is the easiest of those; legally it requires a 4x4/AWD car,
but I think technically any car should be able to do it. Most other highland roads actually have river crossings, whereas
F35 only has one or two small streams to cross. Most of the road is actually in very good condition (at least at the start
of July), with only a couple dozen kilometers that have enough stones and pits to make you go at 20-30km/h.

We decided to hike towards a nearby rhyolite mountain area (Hveradalir). I must have misread something somewhere,
since what I thought was 3km turned out to be 5km one way (a mixup of miles vs kilometers in my head?),
the path was steep, with blobs of snow along the way, really strong wind and a descending cloud. At some point
we decided to declare ourselves losers and just turn back. Oh well :/

Turns out, you can just drive up to the same area via some mountain road. It’s steep and bumpy, and there was still
tons of snow on the side, but the views up there were amazing. The wind almost blew us away though; maybe it’s
good that we did not hike all the way.

Day 7, Part of Golden Circle (Kerlingarfjöll to Reykjavik)

“Golden Circle” is a marketing term for probably
the most touristy route in Iceland. But parts of it did happen to be on our way, so we went straight
from the highlands where there’s no one around, into “all the tourists in one spot” types of places
like Gullfoss.

Day 8, Part of Golden Circle (around Reykjavik)

Þingvellir national park, most famous for being
a place where you can actually see the rift between Eurasian and North American tectonic plates, and also
for being a place of Alþingi, one of the oldest parliaments in
the world.

Next up, the Kerið crater. Similar to Krafla’s Víti, except with more
tourists, and you can get down to the lake itself.

Then we went to the Raufarhólshellir lava cave.
Things I learned: “skylight” is not just a computer graphics term (it also means places where underground caves have openings
towards the surface); lava flow produces really intricate “bathtub ring” patterns; and complete darkness feels
eerie.

Day 9, West (Reykjavik to Snæfellsnes)

Driving up to Snæfellsnes takes a good chunk of time,
with generally nothing to see along the way (in relative terms of course; in many other countries these
valleys and horizons would be amazing… but Iceland has too many more impressive sights). There are
Gerðuberg basalt columns midway:

…but apart from that, not much. I was starting to think “ohh maybe this will be a low point of the trip”, and
then! Rauðfeldsgjá gorge was very fun; you try to find
your way across a water stream in a very narrow gorge, with huge chunks of snow right above you.

Just a couple minutes from there, Arnarstapi village has really
nice cliffs at the water.

Five minutes from that, Hellnar village has even more impressive cliffs.
I mean look at them! That layout and flow of the rocks should not exist! :)

Day 11, Reykjanes Peninsula (around Keflavík)

Then there’s the famous Bláa Lónið (Blue Lagoon),
but we decided not to go inside (too many people, and we didn’t feel the need either). There’s a power station
right next to it, and some tractors doing cleaning. Much romance, wow :)

Next time?

I have no doubt that we’ll go to Iceland again (seriously, it’s amazing). One obvious thing would be going in
the winter. So maybe that!

I wanted to check out what the performance is like on a mobile device. So, let’s take what we ended up with in the
previous post, and make it run on iOS.

Initial port

Code for the Mac app is a super
simple Cocoa application that either updates a Metal texture from the CPU and draws it to screen, or produces
the texture with a Metal compute shader. I know almost nothing about Mac or Cocoa programming, so
I just created a new project in Xcode, picked a “Metal game” template, removed things I don’t need and added
the things I do need.

“Porting” that to iOS basically involved these steps (again, I don’t know how it’s supposed to be done;
I’m just doing a random walk):

Created two projects in Xcode, using the “Metal game” template; one for Mac (which matches my current code setup),
and another one for “Cross Platform” case.

Looked at the differences in file layout & project settings between them.

Made some tweaks to the existing app code
to make it compile on iOS – mostly temporarily disabling all the SSE SIMD code paths
(iOS uses ARM CPUs; SSE does not exist there). Other changes were mostly down to differences in Metal functionality between
macOS and iOS (the MTLResourceStorageModeManaged buffer mode and the didModifyRange buffer method only exist on macOS).

Xcode tools for iOS GPU performance

I wanted to look at what sort of tooling Xcode has for investigating iOS GPU performance these days. The last time I did this was
a couple of years ago, and it was also not related to compute shader workloads. So here’s a quick look into what I found!

Update: this post was about Xcode 9 on A9 hardware. At WWDC 2018 Apple announced
big improvements to Metal profiling tools in Xcode 10, especially when running on A11 or later hardware.
I haven’t tried them myself, but you might want to check out the WWDC session
and “Optimizing Performance” doc.

TL;DR: it’s not bad. Too bad it’s not as good as PS4 tooling, but then again, who is?

Most of Xcode GPU analysis is under the “Debug Navigator” thingy where, with an app running, you can select the “FPS” section
and it displays basic gauges of CPU & GPU performance. When using Metal, there is a “Capture GPU Frame” button near the bottom
which leads to the actual frame debugging & performance tools.

The default view is more useful for debugging rendering issues; you want to switch to “View Frame By Performance” instead:

The left sidebar then lists various things grouped by pipeline (compute or graphics), and by shader. It does not list them
by objects rendered, which is different from how GPU profiling on desktop usually works. In my case obviously the single
compute shader dispatch takes up almost all the time.

The information presented seems to be a bunch of GPU counters (number of shader invocations, instructions executed, and so on).
Some of those are more useful than others, and what kind of information is being shown probably also depends on the device & GPU model.
Here are screenshots of what I saw displayed about my compute shader on an iPhone SE:

Whole frame overview has various counters per encoder. From here: occupancy is not too bad, and hey look my shader is not using any
half-precision instructions:

“Performance” section has more stats in number form:

The “Pipeline Statistics” section has some useful performance hints and overview graphs of, uhm, something. This is probably telling me I’m ALU
bound, but what are the units of each bar, and are they even all on the same scale? I don’t know :)

If the shader was compiled with debugging information on, then it can also show which places of the shader actually took time. As far
as I can tell, it just lies – for my shader, it basically says “yeah, all these lines took zero time, and there’s one line that took 6%”.
Where are the other 94%?!

Xcode tools for Mac GPU performance

In the previous post I ranted about how the Mac has no GPU performance
tools at all, and while that is somewhat true (i.e. there’s no tool that would have told me “hey Aras, use by-value local variables instead
of by-reference! twice as fast!”)… some of that “Capture GPU Frame” functionality exists for Mac Metal applications as well.

The “compute kernel” part has way fewer counters, and I don’t quite believe that the ALU active time was exactly zero.

“Pipeline Statistics” section on the other hand… it has no performance hints, but it does have more overview graphs!
“Register pressure”, “SIMD group occupancy” and “threadgroup memory” parts sound useful!

Let’s do SIMD NEON code paths for CPU

Recall how in part 8 I played around
with SSE intrinsics for the CPU HitSpheres function? Well, now that code is disabled, since iOS uses ARM CPUs and Intel-specific
instructions don’t even compile there.

However, ARM CPUs do have their own SIMD instruction set: NEON.
I know! Let’s use NEON intrinsic functions to implement our
own float3 and float4 helpers, and then the SIMD HitSpheres should more or less work.

Caveat: as usual, I basically have no idea what I’m talking about. I have read some NEON code in the past,
and perhaps have written a small NEON function or two at some point, but I’m nowhere near being “proficient”
at it.

NEON float3

First off, let’s do the float3 helper class implementation with NEON. On x64 CPUs that did improve performance
a bit (not much though). NEON intrinsics overall seem to be way more orthogonal and “intuitive” than SSE ones;
however, SSE has way, way more information, tutorials & reference material out there. Anyway, the NEON
float3 part is this commit,
and my summary of NEON is:

#include <arm_neon.h> to get intrinsics & data types,

float32x4_t data type is for 4-wide floats,

NEON intrinsic functions start with v (for “vector”?), have a q in there for things that operate on full 128-bit registers (i.e. four floats),
and a suffix indicating the data type. For example, a 4-wide float add is vaddq_f32. Simple and sweet!

Getting at individual SIMD lanes is much easier than on SSE (just vgetq_lane_f32); however, doing arbitrary
swizzles/shuffles is harder – you have to dance around with extracting low/high parts, “zipping” various
operands, etc.
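To make the naming concrete, here’s a tiny standalone sketch of mine (ARM-only; it won’t even compile on x86):

```c
#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    // Load four floats each into a 128-bit NEON register.
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);

    // v (vector) + add + q (full 128-bit register) + _f32 (element type).
    float32x4_t vsum = vaddq_f32(va, vb);

    // Extracting a single lane is easy; the lane index must be a constant.
    printf("lane 2 of sum = %f\n", vgetq_lane_f32(vsum, 2)); // 33.0
    return 0;
}
```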

Doing the above work did not noticeably change performance though. Oh well, that was actually quite expected. I did learn/remember
some NEON stuff though, so a net positive :)

NEON HitSpheres & float4

Last time, the actual performance gain from SIMD came from the SSE HitSpheres,
with data laid out in struct-of-arrays fashion. To get the same working on NEON, I basically had to implement a float4
helper class, and touch several places in the HitSpheres function itself that use SSE directly. It’s all in
this commit.

That got CPU performance from 5.8 Mray/s up to 8.5 Mray/s. Nice!

Note that my NEON approach is very likely suboptimal; I was basically doing a direct port from SSE. Which means:

The “mask” calculation for comparisons, for example: on SSE that is just _mm_movemask_ps, but on NEON it takes several instructions.
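Here’s a sketch of a NEON movemask equivalent — my own take on the common idiom, not necessarily the exact code from the commit: keep one distinct bit per lane, then horizontally add the lanes.

```c
#include <arm_neon.h>

// 'cmp' is a comparison result (e.g. from vcltq_f32): each 32-bit lane
// is all-ones or all-zeros.
static inline int neon_movemask_f32(uint32x4_t cmp)
{
    // Keep a distinct bit (1,2,4,8) in each lane that passed the comparison...
    static const uint32_t kBits[4] = { 1, 2, 4, 8 };
    uint32x4_t masked = vandq_u32(cmp, vld1q_u32(kBits));
    // ...then horizontally add the lanes: two pairwise adds get the total.
    uint32x2_t sum2 = vpadd_u32(vget_low_u32(masked), vget_high_u32(masked));
    uint32x2_t sum1 = vpadd_u32(sum2, sum2);
    return (int)vget_lane_u32(sum1, 0); // 0..15, like _mm_movemask_ps
}
```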

Current status

So the above is a basic port to iOS, with a simple NEON code path, and no mobile-specific GPU tweaks/optimizations
at all. The code is over at the 14-ios tag on github.

Performance:

iPhone SE (A9 chip): 8.5 Mray/s CPU, 19.8 Mray/s GPU.

iPhone X (A11 chip): 12.9 Mray/s CPU, 46.6 Mray/s GPU.

I haven’t looked into how many CPU threads the enkiTS task scheduler
ends up using on iPhone X. I suspect it still might be just two “high performance” cores, which would be within
my expectations of “roughly 50% more per-core CPU perf in two Apple CPU generations”. Which is fairly impressive!

Oh, last post was exactly a month ago… I guess I’ll remove “daily” from the titles then :)

So the previous approach of
“let’s do one bounce iteration per pass” (a.k.a. “buffer oriented”) turned out to add a whole lot
of complexity, and was not really faster. So you know what, let’s park that one for now; maybe we’ll
return to something like it once (if ever) we actually need it, or perhaps when we work
on smaller ray packets that don’t need hundreds of megabytes of ray buffers.

Scott Bean (@gfxbean) sent a little hint that in my “regular, super simple”
GPU implementation I might get much better
performance by moving scene/material data into groupshared memory. As we’ve seen in the
previous post, using group shared
memory can speed things up quite a lot, and in this case all threads will be going through exactly
the same spheres to check rays against.

All that work is completely isolated inside the compute shader (nice!), and conceptually goes like this:

groupshared Foo s_GroupFoo[kMaxFoos];
// at start of shader:
CopyFoosFromStructuredBuffersInto(s_GroupFoo);
ThreadGroupMemoryBarrier(); // sync threads in the group
// proceed as usual, just use s_GroupFoo instead
// of StructuredBuffer<Foo> variable

D3D11

The actual commit for D3D11 is here,
and is pretty self-explanatory. At the start of the shader I make each thread do a little bit of the “copy” work like this:
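The commit has the actual code; here’s the idea sketched in HLSL — the names are illustrative, not the ones from the repo:

```hlsl
// Sketch: cooperative copy into groupshared memory (illustrative names).
StructuredBuffer<Sphere> g_Spheres : register(t0);
groupshared Sphere s_GroupSpheres[kMaxSpheres];

[numthreads(8, 8, 1)]
void main(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    // The 64 threads of the group copy the sphere array together;
    // thread i copies elements i, i+64, i+128, ...
    uint threadIndex = gtid.y * 8 + gtid.x;
    for (uint i = threadIndex; i < g_SphereCount; i += 8 * 8)
        s_GroupSpheres[i] = g_Spheres[i];
    GroupMemoryBarrierWithGroupSync(); // wait for all copies before tracing

    // ... proceed with tracing, reading s_GroupSpheres instead of g_Spheres
}
```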

I also reduced the thread group size from 16x16 to 8x8 since that was a bit faster on my GPU (it may or may not be faster
on any other GPU…). What’s the result? NVIDIA GeForce 1080 Ti: 778 -> 1854 Mray/s.

So that’s 2.4x faster for a fairly simple (and admittedly not trivially scalable to large scenes) change! However…
a quick test on a Radeon Pro WX 9100 says: 1200 -> 1100 Mray/s, so a bit slower. I haven’t investigated why, but I guess the takeaways
are:

Pre-caching compute shader data into thread group shared memory can make it a lot faster!

Or it might make it slower on a different GPU.

Good luck!

Metal

I did the same change in the Metal implementation; here’s the commit –
pretty much the same as the D3D11 one.
The result? MacBook Pro (2013) with Intel Iris Pro: 60.8 -> 42.9 Mray/s. (oꆤ︵ꆤo)

Why? No idea; the Mac has no tooling to answer this question, as far as I can tell.

And then I made a change that I thought of totally at random, just because I was modifying these lines of code and started to think
“I wonder what would happen if I…”. In the shader, several places had code like const Sphere& s = spheres[index] – which initially
came from the code being a direct copy of the C++ version. I changed these places to copy into local variables by value instead
of taking a const reference, i.e. Sphere s = spheres[index].
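In (simplified) Metal Shading Language terms, the change was along these lines; the surrounding names are illustrative:

```cpp
// Before: a const reference into the spheres array; depending on the address
// space the array lives in, each field access may go back out to that memory.
const Sphere& sRef = spheres[index];

// After: copy the whole struct into a thread-local variable once; all
// further field accesses stay local.
Sphere sVal = spheres[index];
```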

Here’s the commit, and that tiny
change got the performance up to 98.7 Mray/s on Intel Iris Pro.

Why? Who knows! I would have expected any “sufficiently smart compiler”
to compile both versions of the code into the exact same result. Turns out, nope, one of them is 2x faster, good luck!

Metal shaders are a bit of a black box, with not even the intermediate representation being publicly documented. The good thing is…
turns out the IR is just LLVM bitcode (via @icculus).
So I grabbed a random llvm-dis I had on my machine (from the Emscripten SDK, of all places), checked which output file Xcode
produces for the *.metal inputs, and ran it on both versions.
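For reference, the rough flow looks something like this — the exact paths and tool locations are from my setup, so treat them as assumptions:

```shell
# Compile a Metal shader into the intermediate .air file
# (Xcode does this automatically for *.metal files during a build;
# look inside the build products for the .air outputs).
xcrun -sdk macosx metal -c Shader.metal -o Shader.air

# The .air file is LLVM bitcode, so a stock llvm-dis can disassemble it.
llvm-dis Shader.air -o Shader.ll
```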

The resulting LLVM IR disassembly is not very easy on the eyes, looking generally like this:

I’m not fluent in reading it, but by diffing the two versions, it’s not immediately obvious why one would be slower
than the other. The slow one has some more load instructions with addrspace(3) on them, whereas the fast one
has more calls into alloca (?) and llvm.memcpy.p0i8.p3i8.i64. Ok I guess? The alloca calls are probably not “real”
calls; they just end up marking how much thread-local space will be needed after all inlining. The memcpy probably
ends up being a bunch of moves in exactly one place, so if the GPU has any sort of load coalescing, then it gets used
there. Or that’s my theory for “why faster”.

So Metal takeaways might be:

Passing things by value instead of by const reference might be much more efficient.

Metal bytecode is “just” LLVM IR, so peeking into it with llvm-dis can be useful. Note that this is still
a machine-independent, very high-level IR; you have no visibility into what the GPU driver will make of it
in the end.

Current status and what’s next

So this simple change to pre-cache sphere/material/emissive data into thread group shared memory got GPU performance
up to: