Category: Uncategorized

Introduction

I previously wrote about ShaderSet, which was my attempt at making a clean, efficient, and simple shader live-reloading interface for OpenGL 4.

Since ShaderSet was so fun to use, I wanted to have the same thing in my D3D12 coding. As a result, I came up with PipelineSet. This class makes it easy to live-reload shaders, while encapsulating the complexity of compiling pipeline state in a multi-threaded fashion, and also allowing advanced usage to fit your rendering engine’s needs.

Show Me The Code

In summary, the interface looks something like what follows. I tried to show how it fits into the design of a component-based renderer.

The big idea is to add pipeline descs to the PipelineSet, and those descs don’t need to specify bytecode for their shader stages. Instead, the names of the compiled shader objects for each shader stage are passed through the “GraphicsPipelineFiles” or “ComputePipelineFiles” struct.

Each added shader returns a double-pointer to the root signature and pipeline state. This indirection allows the root signature and pipeline state to be reloaded, and also allows code to deal with the PipelineSet in an abstract manner. (It’s “just a double pointer”, not a PipelineSet-specific class.)

From there, BuildAllAsync() will build all the pipelines in the PipelineSet in a multi-threaded fashion, using the Windows Threadpool. When the returned handle is signaled, that means the compilation has finished.

Finally, you must call UpdatePipelines() at each frame. This does two things: First, it’ll update any pipelines and root signature that have been reloaded since the last update. Second, it garbage-collects any root signature and pipelines that are no longer used (ie. because they have been replaced by their new reloaded versions.) This garbage collection is done by deleting the resources only after kMaximumFrameLatency updates have passed. This works because it’s guaranteed that no more frames are in flight on the GPU with this pipeline state, since it exceeds the depth of your CPU to GPU pipeline.

The Workflow

IPipelineSet is designed to work along with Visual Studio’s built-in HLSL compiler. The big idea is to rebuild your shaders from Visual Studio while your program is running. This works quite conveniently, since Visual Studio’s default behavior for .hlsl files is to compile them to .cso (“compiled shader object”) files that can be loaded directly as bytecode by D3D12.

Normally, Visual Studio will force you to stop debugging if you want to rebuild your solution. However, if you “Start Without Debugging” (or hit Ctrl+F5 instead of just F5), then you can still build while your program is running. From there, you can make changes to your HLSL shaders while your program is running, and hit Ctrl+Shift+B to rebuild them live. The IPipelineSet will then detect a change in your cso files, and live-reload any affected root signatures and pipeline state objects.

To maintain bindings between shaders and C++, I used a so-called “preamble file” in ShaderSet. This preamble is not necessary with HLSL, since we can use its native #include functionality. Using this feature, I create a hlsli file (the HLSL equivalent of a C header) for the shaders I use. For example, if I have two shaders “scene.vs.hlsl” and “scene.ps.hlsl”, I create a third file “scene.rs.hlsli”, which contains two things:

The Root signature, as #define SCENE_RS “RootFlags(0), etc”

The root parameter locations, like #define SCENE_CAMERA_CBV_PARAM 0

I include this rs.hlsli file from my vertex/pixel shaders, then put [RootSignature(SCENE_RS)] before their main. From there, I pick registers for buffers/textures/etc using the conventions specified in the root signature.

I also include this rs.hlsli file from my C++ code, which lets me directly refer to the root parameter slots in my code that sets root signature parameters.

As an example, let’s suppose I want to render a 3D model in a typical 3D scene. The vertex shader transforms each vertex by the MVP matrix, and the pixel shader reads from a texture to color the model. I might have a scene.rs.hlsli as follows:

This code defines the root signature for use in HLSL. (See: Specifying Root Signatures in HLSL) The defines at the bottom correspond to root parameter slots, and they match the order of root parameters specified in the root signature string.

Again notice that the registers for the texture and sampler match those specified in the root signature, and notice the RootSignature attribute above the main.

Finally, I call this shader from my C++ code. I include the header from the source file of the corresponding renderer component, I set the root signature parameters, and make the call. It might be something similar to this:

There’s things going on here that aren’t strictly the topic of this article, but I’ll explain them anyways because I think it’s very useful for writing D3D12 code.

I use a big upload buffer each frame to write all my CBV allocations to, that’s the purpose of pPerFrameAlloc. Its allocate() function returns both a CPU (mapped) pointer and the corresponding GPU virtual address for the allocation, which allows me to write to the allocation from CPU, then pass the GPU VA while writing commands.

In this case, the per-frame allocation is an upload buffer, so I don’t need to explicitly copy from CPU to GPU (the shader will just read from host memory.) An alternate implementation could use an additional allocator for a default heap, and explicitly make a copy from the upload heap to the default heap.

The per-frame allocator is a simple lock-free linear allocator, so I can use it to make allocations from multiple threads, if I’m recording commands from multiple threads.

I could do something similar to the per-frame allocator for descriptors for the Tex0SRV_GPU, or I could create the descriptor once up-front in the Init(). It’s up to your choice, really.

When the time comes to finally specify the root parameters, I do it using the defines from the included scene.rs.hlsli, such as SCENE_RS_MVP_CBV_PARAM. This makes sure my C++ code stays synchronized to the HLSL code.

In Summary

IPipelineSet implements D3D12 shader live-reloading. It encapsulates the concurrent code used to reload shaders, and encapsulates the parallel code that accelerates PSO compilation through multi-threading. It integrates with code without that code needing to be aware of PipelineSet (it’s “just a double-pointer”), and garbage collection is handled efficiently and automatically. Finally, PipelineSet is designed for a workflow using Visual Studio that makes it easy to rebuild shaders while your program is running, and allows you to easily share resource bindings between HLSL and C++.

There are a bunch more advanced features. For example, it’s possible to supply an externally created root signature or shader bytecode, and it’s possible to “steal” signature/pipeline objects from the live-reloader by manipulating the reference count. See the comments in pipelineset.h for details.

Side note: Disabling driver updates seems to be a really convoluted process. Windows is relentless in trying to stop me from using Beta drivers. It’s annoying, but hopefully this won’t be a problem in the future when mainstream drivers have the required features for PIX.

Next, uninstall your current graphics drivers. Open up the Device Manager, go under “Display adapters”, right click your drivers and choose “Uninstall”.

Download the zip version of the drivers (not the exe), and unzip them.

Back in the Device Manager, click “Action>Add legacy hardware”.

Choose “Install the hardware that I manually select from a list (Advanced)”.

Choose “Display adapters”.

Click “Have Disk”.

In the “Install From Disk” window, click “Browse”.

Pick the inf file (eg: “igdlh64.inf”) from the “Graphics” folder of the drivers.

Click “OK”, then pick the GPU model that corresponds to your computer.

Keep clicking Next until it’s done installing.

Step 3: Disassembly in PIX

Run your program with PIX by setting the executable path and working directory. Unless you have a UWP app, you probably want to “Launch Win32”.

From there, click “Launch”.

Next, press print screen, or click the camera button in PIX, to capture a frame of rendering. Double-click on the small picture of your capture that appears in PIX.

If you see an error popup here, it seems probably because your drivers are not updated enough (or more likely, that Windows automatically reverted your Beta update behind your back.)

Once your capture is open, click on the “Pipeline” tab. Then, click the “Click here to start analysis” text that appears in the window in the bottom half of PIX

Next, click on the “Dispatch” or “Draw” event in the Events window at top left (seen in previous screenshot) for which you are interested in seeing the disassembled shaders. Click on the shader stage you want in the bottom left of the window, then click on “Disassembly” (as below). And voila! Gen assembly for your shader!

Introduction

Direct3D 12 opens up a lot of potential by making it possible to write GPU programs that make use of multiple GPUs. For example, it’s possible to write programs that distribute work among multiple GPUs from linked GPUs (eg: NVIDIA SLI or AMD Crossfire), or even between GPUs from different hardware vendors.

There are many ways to make use of these multi-adapter features, but it’s not obvious yet (at least to me) how to best make use of it. In theory, we should try to make full use of all available hardware on a given computer, but there are difficult problems to solve along the way. For example:

How can we schedule GPU tasks to minimize communication overhead between different GPUs?

How can we distribute tasks among hardware that vary in performance?

How can we use special hardware features? eg: “free” CPU-GPU memory sharing on integrated GPUs.

D3D12 Multi-Adapter Features Overview

To better support multiple GPUs, Direct3D 12 brings two main features:

Cross-adapter memory, which allows one GPU to access memory of other another GPU.

Cross-adapter fences, which allows one GPU to synchronize its execution with another GPU.

Working with multiple GPUs in D3D12 is done explicitly, meaning that sharing memory and synchronizing GPUs must be taken into consideration by the rendering engine, as opposed to being “automagically” done inside GPU drivers. This should lead to more efficient use of multiple GPUs. Furthermore, integrating shared memory and fences into the API allows you to avoid making round-trips to the CPU to interface between GPUs.

Linked Display Adapters (LDA) refers to linked GPUs (eg: NVIDIA SLI/AMD Crossfire). They are exposed as a single ID3D12Device with multiple “nodes”. D3D12 APIs allow you to specify a bitset of nodes when the time comes to specify which node to use, or which nodes should share a resource.

Multiple Display Adapters (MDA) refers to multiple different GPUs installed on the same system. For example, you might have both an integrated GPU and a discrete GPU in the same computer, or you might have two discrete GPUs from different vendors. In this scenario, you have a different ID3D12Device for each adapter.

Another neat detail of D3D12’s multi-adapter features is Standard Swizzle, which allows GPU and CPU to share swizzled textures using a convention on the swizzled format.

Central to multi-adapter code is the fact that each GPU node has its own set of command queues. From the perspective of D3D12, each GPU has a rendering engine, a compute engine, and a copy engine, and these engines are fed through command queues. Using multiple command queues can help the GPU schedule independent work, especially in the case of copy or compute queues. It’s also possible to tweak the priority of each command queue, which makes it possible to implement background tasks.

Use-Cases for Multi-Adapter

One has to wonder who can afford the luxury of owning multiple GPUs in one computer. Considering that multi-adapter wasn’t properly supported before D3D12, it was probably barely worth thinking about, other than scenarios explicitly supported by SLI/Crossfire. In this section, I’ll try to enumerate some scenarios where the user might have multiple GPUs.

“Enthusiast” users with multiple GPUs:

Linked SLI/Crossfire adapters.

Heterogeneous discrete GPUs.

Integrated + discrete GPU.

“Professional” users:

Tools for 3D artists with fancy computers.

High-powered real-time computer vision equipment.

“Datacenter” users:

GPU-accelerated machine-learning.

Engineering/physics simulations (fluids, particles, erosion…)

Another potentially interesting idea is to integrate CPU compute work in DirectX by using the WARP (software renderer) adapter. It seems a bit unfortunate to tie everyday CPU work into a graphics API. I guess it might lead to better CPU-GPU interop, or it might open opportunities to experiment with moving work between CPU and GPU and see performance differences. This is similar to using OpenCL to implement compute languages on CPU.

Multi-adapter Designs

There are different ways to integrate multi-adapter into a DirectX program. Let’s consider some options.

Multi-GPU Pipelining

Pipelining with multiple GPUs comes in different flavors. For example, Alternate Frame Rendering (AFR) consists of alternating between GPUs with each frame of rendering, which allows multiple frames to be processed on-the-fly simultaneously. This kind of approach generally requires the scene you’re rendering to be duplicated on all GPUs, and requires outputs of one frame’s GPU to be copied to the inputs to the next frame’s GPU.

AFR can unfortunately limit your design. For example, dependencies between frames can be difficult to implement efficiently. To solve this problem, instead of pipelining at the granularity of frames with AFR, one might pipeline within a frame. For example, half of the frame can be processed on one GPU, then finished on another GPU. In theory, these pipelining approaches should increase throughput, while possibly increasing latency due to the extra overhead of copying data between GPUs (between stages of the pipeline.) For this reason, we have to be careful about the overhead of copies

Task-Parallelism

With a good data-parallel division of our work, we can theoretically easily split our work into tasks, then distribute them among GPUs. However, there’s fundamentally a big difference in the ideal level of granularity of parallelism between low-latency (real-time) users and high-throughput (offline) users. For example, work that can be done in parallel within one frame is not always worth running on multiple GPUs, since the overhead of communication might nullify the gains. In general:

Real-time programs don’t have much choice outside of parallelism within one frame (or a few frames), since they want to minimize latency, and they can’t predict future user controller inputs anyways.

Offline programs might know the entire domain of inputs ahead of time, so they can arbitrarily parallelize without needing to use parallelism within one frame.

If our goal is to render 100 frames of video for a 3D movie, we could split those 100 frames among the available GPUs and process them in parallel. Similarly, if we want to run a machine learning classification algorithm on 1000 images, we can also probably split that arbitrarily between GPUs. We can even deal with varying performance of available GPUs relatively easily: Put the 1000 tasks in a queue, and let GPUs pop them and process them as fast as they allow, perhaps using a work-stealing scheduler if you want to get fancy with load-balancing.

In the case of a real-time application, we’re motivated to use parallelism within each frame to bring content to the user’s face as fast as possible. To avoid the overhead of communication, we might be motivated to split work into coarse chunks. Allow me to elaborate.

Coarse Tasks

To minimize the overhead of communication between GPUs, we should try to run large independent portions of the task graph on the same GPU. Parts of the task graph that run serially are an obvious candidate for running on only one GPU, although you may be able to pipeline those parts.

One way to separate an engine into coarse tasks is to split them based on their purpose. For example, you might separate your project into a GUI rendering component, a fluid simulation component, a skinning component, a shadow mapping component, and a scene rendering component. From there, you can roughly allocate each component to a GPU. Splitting code among high-level components seems like an obvious solution, but I’m worried that we’ll get similar problems as the “system-on-a-thread” design for multi-threading.

With such a coarse separation of components, we have to be careful to allocate work among GPUs in a balanced way. If we split work uniformly among GPUs with varying capabilities, then we can easily be bottlenecked by the weakest GPU. Therefore, we might want to again put our tasks in a queue and distribute them among GPUs as they become available. In theory, we can further mitigate this problem with a fork/join approach. For example, if a GPU splits one of its tasks in half, then a more powerful GPU can pick up the second half of the problem while the first half is still being processed by the first GPU. This approach might work best on linked adapters, since they can theoretically share memory more efficiently.

An interesting approach to load-balancing can be found in GPU Pro 7 chapter 5.4: “Semi-static Load Balancing for Low-Latency Ray Tracing on Heterogeneous Multiple GPUs”. It works by roughly splitting the framebuffer among GPUs to ray trace a scene, and alters the distribution of the split dynamically based on results of previous frames.

One complication of distributing tasks among GPUs is that we might want to run a task on the same GPU at each frame, to avoid having to copy the input state of the task to run it on a different GPU. I’m not sure if there’s an obvious solution to this problem, maybe it’s just something to integrate into a heuristic cost model for the scheduler.

A Note On Power

One quite difficult problem with multi-adapter has to do with power. If a GPU is not used for a relatively short period of time, it’ll start clocking itself down to save power. In other words, if you have a GPU that runs a task each frame then waits for another GPU to finish, it’s possible for that first GPU to start shutting itself down. This becomes a problem on the next frame, since the GPU will have to spin up once again, which takes a non-trivial amount of time. As a final result, the code ends up running slower on multi-adapter than it does in single-adapter, despite even the most obvious opportunities for parallelism.

One might suggest to force the GPU to keep running at full power to solve this problem. It’s not so obvious, since drawing power from idle cores takes away power from the cores that need it. This is especially an issue on integrated GPUs, since the GPU would steal juice from the CPU, despite the CPU probably needing that power to run non-GPU code during the rest of the frame. Of course, power-hungry applications are also generally not welcome on battery-operated devices like laptops or phones.

Does this problem have a solution? Hard to say! As a guideline, it might be important to use GPUs only if you plan to utilize them well, and be careful about CPU-GPU tradeoffs on integrated GPUs. We might need help from hardware and OS people to figure this out properly.

NUMA-aware Task Scheduling

An important challenge of multi-adapter code is that memory allocations have an affinity to a given processor, which means that the cost of memory access increases dramatically when the memory does not belong to the processor accessing it. This scenario is known as “Non-uniform memory access”, aka. “NUMA”. It’s a common problem in heterogeneous and distributed systems, and is also a well-known problem in server computers that have more CPU cores than a single motherboard socket can support, which result in multi-socket CPU configurations where each socket/CPU has a set of RAM chips closer to it than others.

There exist some strategies to deal with scheduling tasks in a NUMA-aware manner. I’ll list some from the literature.

Deferred allocation is a way to guarantee that output memory is local to the NUMA node. It simply consists of allocating the output memory only at the time of the task being scheduled, which allows the processor that was scheduled to perform the allocation right-then-and-there in its local memory, thus guaranteeing locality.

Work-pushing is a method to select a worker to which a task should be sent. In other words, it’s the opposite of work-stealing. The target worker is picked based on a choice of heuristic. For example, the heuristic might try to push tasks to the node that owns the task’s inputs, or it might try to push work to the node that own’s the task’s outputs, or the heuristic might combine ownership of inputs and outputs in its decision.

Work-stealing can also be tweaked for NUMA purposes, by tweaking the work-stealing algorithm to first steal work from nearby NUMA nodes first. This might apply itself naturally to the case of sharing work between linked adapters.

Conclusion

Direct3D 12 enables much more fine-grained control over use of multiple GPUs, whether though linked adapters or through heterogeneous hardware components. Enthusiast gamers, professional users, and GPU compute datacenters stand to benefit from good use of this tech, which motivated a search for designs that use multi-adapter effectively. On this front, we discussed Alternate-Frame-Rendering (AFR), and discussed the design of more general task-parallel systems. The design of a task-parallel engine depends a lot on your use case, and there are many unsolved and non-obvious areas of this design space. For now, we can draw inspiration from existing research on NUMA systems and think about how it applies to the design of our GPU programs.

In the last post, I showed a proof of concept to implement a “cont” object that allows creating dependencies between TBB tasks in a dynamic and deferred way. What I mean by “dynamic” is that successors can be added at runtime (instead of requiring the task graph to be specified statically). What I mean by “deferred” is that the successor can be added even after the predecessor was created and spawned, in contrast to interfaces where successors need to be created first and hooked into their predecessor secondly.

The Goal

The goal of this post was to create an interface for cont that abstracts TBB details from everyday task code. TBB’s task interface is low level and verbose, so I wanted to have something productive and concise on top of it.

Extending tbb::task_group

tbb::task_group is a pretty easy way to spawn a bunch of tasks and let them run. An example use is as follows:

I wanted to reuse this interface, but also be able to spawn tasks that depend on conts. To do this, I made a derived class from task_group called cont_task_group. It supports the following additional syntax:

I’m thinking about how to implement DAGs of tasks that can be used in a natural way. The problem is that a DAG is a more general construct than what task systems usually allow, due to valid performance concerns. Therefore, if I want to implement a DAG more generally, I need to come up with custom hacks.

Trees vs DAGs

The major difference between a tree and a DAG of tasks is that a DAG allows one task to have an arbitrary number of successor tasks. Also, with a DAG, a task from one subtree of tasks can send data to a task from a different subtree of tasks, which allows you to start tasks when all their inputs are ready rather than when all the previous tasks have run and finished. (Hmm… That makes me think of out-of-order versus in-order processors.) This expands on the functionality of a tree of tasks, since trees only allow outputs to be passed to their immediate parent, whereas DAGs can pass data to grandparents or great-grandparents, or to tasks in subsequent trees.

By default, tasks in task systems like TBB and Cilk are designed to have only one successor: a parent task, or a continuation task. Having a parent task makes it possible to have nested tasks, which is useful for naturally spawning tasks from tasks, similarly to how functions can call other functions. Continuation tasks make it potentially more efficient to spawn a follow-up task to handle the results of a task, and they can do so without affecting the reference count of the parent task.

DAG Implementation

To implement a general DAG, you need to (in one way or other) keep track of the connections between the outputs of some tasks and the inputs of other tasks in a more general way. There are two aspects to this connection:

How is memory allocated for the data passed from predecessor task to successor task?

How is the successor task spawned when its inputs are all satisfied?

According to Intel’s documentation (See: General Acyclic Graphs of Tasks), it’s suggested that the memory for the passed data is stored within the successor task itself. Each task object contains the memory for its inputs, as well as a counter that keeps track of how many inputs need to be received before the task can be spawned. In the TBB example, each task also keeps a list of successors, which allows predecessors to write their outputs to their successor’s inputs, and allows the predecessor to decrement their successor’s count of missing arguments (and can finally spawn the successor task if the predecessor finds that it just gave the successor its final missing input.)

The Problem with DAGs

In the TBB DAG example, all tasks are spawned up-front, which is easy to do in their example since the task graph is structured in a formal way. In my case, I want to use a DAG to implement something that looks like a sequence of function calls, except using tasks instead of functions, to allow different parts of the code to be executed in parallel. I want to use a DAG to make it possible to establish dependencies between these sequential tasks, to allow the programmer to create tasks that start when their inputs are available. In short, instead of creating the tasks up front, I want to create the tasks in roughly sequential order, for the purpose of readability.

The problem with what I want to do is that I can’t directly store the outputs of predecessors inside the successor. Since the predecessors need to be passed a pointer to where their outputs should be stored, the successor (which stores the inputs) needs to be allocated before the predecessors. This means you roughly need to allocate your tasks backwards (successor before predecessor), but spawn the tasks forward (predecessors before successors). I don’t like this pattern, since I’d rather have everything in order (from predecessor to successor). It might not be absolutely as efficient, but I’m hoping that the productivity and readability improvement is worth it.

The Interface

Instead of spawning successor tasks up front, I’d like to allocate the storage for the data up front, similarly to Cilk. Cilk has a “cont” qualifier for variables, which can be used to communicate data from one task to another. For example, the fibonacci example from the Cilk paper contains the following code:

This code computes fib(n-1) and fib(n-2), then passes the results of those two computations to a sum continuation task, which implements the addition within fib(n) = fib(n-1) + fib(n-2). The data is passed through the x and y variables, which are marked cont. I don’t know why the Cilk authors put the call to spawn sum before the calls to fib in this example, but perhaps this alternate arrangement of the code is possible:

With this alternative arrangement of the code, the order of the tasks being spawned mirrors the equivalent code in plain sequential C:

int x, y;
fib(&x, n-1);
fib(&y, n-2);
sum(&k, x, y);

Implementing cont

If the successor task is allocated before the predecessors, the predecessor that supplies that final missing input to the successor can also spawn the successor. However, if the successor is allocated after the predecessors, it’s possible that all predecessors finish their work before the successor is allocated. If we allocate the successor and find that all its inputs are already available, we can simply immediately spawn it. However, if some inputs of the successor are still not available, then we need to find a way to spawn the successor when those inputs become available.

To spawn successors when a cont becomes available, each cont can keep a list of successors that are waiting for it. When the cont is finally set, it can pass that input to its successors, and potentially also spawn successors if this cont was their final missing input. The difficulty with implementing this system lies in resolving the race condition between the predecessor, the cont, and successors.

Here’s an example of a race condition between predecessor/cont/successor. Suppose a successor task is spawned with some cont inputs. The successor might see that a cont input has not yet been set, so the successor adds itself to the list of successors in the cont. However, it might be possible that the cont suddenly becomes set in a different thread while the successor is adding itself to the cont’s successor list. The thread that is setting the cont might run before it sees the new successor added, so it might not notify the successor that the input is now complete (by decrementing its counter and possibly spawning it.)

cont successor linked list

It might be possible to solve the problem described above if it’s possible for the successor to atomically check that the cont is not satisfied and if so add itself to the cont’s list of successors. This might be possible if the cont uses a linked list of successors, since the successor could do a compare-and-swap that sets a new head for the list only if the input is unsatisfied.

If that compare-and-swap fails because the cont became set just before the CAS, the successor can just treat the input as satisfied and move on to checking the next input. If the CAS fails because another successor registered themselves concurrently, then the registration needs to be tried again in a loop. On the other side of that race condition, if the cont’s compare-and-swap to set its completion fails because a successor added themselves to the list concurrently, then the cont just tries again in a loop, which allows the cont to notify the successor that interrupted it of completion.

If the cont becomes set while the successor is hooking up other inputs, the cont will pass the input and decrement the counter, which itself is not a problem. However, the last cont might finish and spawn the successor before the function hooking the successor finishes, which shouldn’t be a problem as long as nothing else needs to happen after the successor finishes hooking itself up to its inputs. If something does need to happen immediately after the hookups, the successor can initialize its counter with an extra 1, which it only decrements at the end of the hookup (and potentially then spawns the task.)

A detail of this implementation is the allocation of the nodes of the linked list. The successor task needs to have allocated with it at least one linked list node per cont. Since this is a fixed number, the allocation can be done more efficiently.

There’s a difficulty with the compare-and-swap, which is that two different compare-and-swaps need to be done. First, the successor should only be added to the list if the cont is not yet set. Second, appending to a linked list in a CAS can fail if two successors try to add themselves to the list simultaneously. To solve this problem, I propose that the cont’s “is set” boolean is stored as the least significant bit of the head of the linked list. This allows operations on the cont to both switch the head of the list and compare and set completeness simultaneously. If pointers are aligned then the least significant bit is always 0, so no information is lost by reusing that bit. We just need to make sure to mask out that bit before dereferencing any pointers in the linked list.

cont: Derived from cont_base, stores data in “std::optional-style” to pass data from predecessor to successors.

spawn_when_ready: Spawns a task when a list of conts are all set. Interface still a bit low-level.

I’ve only done a few tests, so I don’t know if it’s 100% correct. I only really have one test case, which I’ve debugged in a basic way by adding random calls to sleep() to test different orders of execution. I wouldn’t consider using this in production without a thorough analysis. It’s honestly not that much code, but I’m not an expert on the intricacies of TBB.

Also, I was lazy and used sequential consistency for my atomic operations, which is probably overkill. Any lock-free experts in the house? 🙂 (Update: I’ve replaced the seq_cst with acquires and releases. Fingers crossed.)

I’ve also not sugar-coated the syntax much, so there’s still lots of low-level task management. I’d like to come up with syntax to spawn tasks in a much simpler way than manually setting up tbb::task derivatives, setting reference counts, allocating children, etc. This is a topic for another post.

With this implementation, I was able to build the following task dependency graph, with each edge annotated with its type of of dependency.

I’m thinking more about how one can use TBB to write task code that looks similar to existing C code. Of course, people have tried to do this before, and made languages that integrate task parallelism naturally. This article takes a look at these existing solutions, looking for inspiration.

a built-in “?” operator allows declaring the dependency of a successor on a predecessor.

This is some pretty cute syntax. My main worry is that there might be overhead in the automation of passing around continuation variables. In contrast, TBB also allows creating continuation tasks, but it requires you to pass continuation arguments by reference manually. For example, TBB users can create a continuation task with inputs as member variables, and the address of these member variables are used as destination addresses for the childrens’ computation. See Continuation Passing. Still, the TBB continuation syntax is pretty tedious to type (and probably error-prone), and I wonder if we can do some C++ magic to simplify it.

The “spawn” and “spawn_next” syntax makes spawning tasks look a lot like calling functions, which is consistent with the goals I described in the previous post. The “cont” variables might be possible to implement by wrapping them in a C++ type, which could implement operator= (or a similar function) for the purpose of implementing an equivalent to “send_argument”. Cilk allows cont variables to be passed as arguments that are declared non-cont (such as sum’s x/y above), and automatically unwraps them when the time comes to actually call the function. In a C++ implementation, this automatic unwrapping might be possible to implement with a variadic function template that unwraps its arguments before passing them to the desired function. If that’s too difficult, we can fall back to defining the continuation function with explicit “std::future”-like arguments, requiring using a special function to unwrap them at the usage site.

I think one of the best things about Cilk is implicitly generating dependencies between tasks by passing arguments. This is much less work and is more maintainable than explicitly declaring dependencies. It does not deal with running two tasks in an order based on side-effects, like if you want printf() calls in two tasks to always happen in the same order. This might be possible to mitigate your in design by factoring out side-effects. Alternatively, we could create a useless empty struct and use that to indicate dependencies while reusing the syntax used to pass meaningful data. This is very similar to the tbb::flow::continue_msg object used in TBB flow graph.

By the way, Cilk’s dependencies are implemented by keeping a counter of missing arguments for tasks. When the counter reaches 0, the task can be executed. This is very similar to how TBB tasks implement child-parent dependencies. The awkwardness is that TBB normally only supports a single parent task, so a task with multiple parents need to be handled specially. See General Acyclic Graphs of Tasks.

Cilk Plus

Cilk Plus is an extension of C/C++ available in some compilers. It enables features similar to Cilk in a way that interops with C/C++. However, instead of any continuation passing, it defines a keyword “cilk_sync”, which waits for all child tasks to finish executing before proceeding. This is probably perfect for fork-join parallelism (a tree of tasks), but I’m not sure if it’s possible to implement a general directed acyclic graph with these features.

ISPC

The ISPC language is mainly useful for writing high-performance SIMD code, but it also defines some built-in syntax for task parallelism. Namely, it supports built-in “task”, “launch”, and “sync” keywords. Again, this seems limited only to fork-join parallelism.

Others?

I’ve seen a few other languages with task-parallelism, but they usually seem to stop at fork/join parallelism, without talking about how continuations or DAGs might be implemented. If you know about a programming language interface that improves on the syntax of Cilk for creating continuations, please tell me about it.

Conclusions

I like Cilk’s ideas for passing data from child task to parent task. Implementing an interface similar to it in C++ using TBB might allow a pretty natural way of implementing task parallelism both for fork/join task trees or more general task DAGs. My main concern is making an interface that makes it easy to do common tasks.

I think that continuation passing might be an elegant way to implement sequential-looking code that actually executes in a DAG-like fashion, which would make it easy to reuse the average programmer’s intuition of single-threaded programming. I want the DAG dependencies to be built naturally and implicitly, similar to how Cilk implements “cont” variables. I want to make it easy to create placeholder “cont” variables that are used only to build dependencies between tasks with side-effects that need to operate in a specific order, similarly to tbb::flow::continue_msg. I also want a way to have a node with multiple parents (to implement general DAGs), and I’d like to minimize the overhead of doing that.

One of my main concerns is how to encapsulate the reference count of tasks. TBB sample programs (in the TBB documentation) all work with reference counts in a pretty low-level way, which may be suitable for when you want to carefully accelerate a specific algorithm, but seems error-prone for code that evolves over time. I hope that this logic can be encapsulated in objects similar to Cilk’s “closure” objects. I think these closure objects could be implemented by creating a derived class from tbb::task, and some C++ syntactical sugar (and maybe macros) could be used to simplify common operations of the task system. From there, I’m worried about the potential overhead of these closures. How can they be allocated efficiently? Will they have a lot of atomic counter overhead? Will their syntax be weird? I’ll have to do some experimentation.

Introduction

In the last post, we talked about the motivation to write data-parallel code in order to scale the performance of our programs with hardware resources. We also saw some basic designs for parallel algorithms, mostly in the abstract.

In this post, we’ll go into more detail about what aspects of hardware design exist to increase the performance of data-parallel code. There is a lot of overlap between both CPU and GPU design in this area, so this is quite generally applicable knowledge when writing parallel code. The goal is to identify aspects of hardware design that we can rely on without knowing too much about the underlying architecture, since this allows us to write code that stands the test of time. Naturally, there are still differences between CPU and GPU hardware, so these differences will be highlighted too.

Instruction Parallelism and SIMD

For the purpose of discussion, let’s consider the following parallel loop:

Since this is a trivially parallel loop, we can straightforwardly apply parallelism techniques to accelerate it. However, before we actually modify the code ourselves, let’s consider what the processor could do to run this code in parallel for us automatically.

Automatic Instruction Parallelism

In theory, the underlying processor could automatically determine that this code can run in parallel by observing the reads and writes being made. At each iteration, it would see a read from data[i] into a temporary register, some math on that temporary register, then a write back to data[i]. In theory, the processor could internally build dependency graphs that represent the dependencies between all reads and writes, defer the evaluation of this graph until a lot of work has been accumulated, then evaluate the work in the dependency graph by executing different code paths in parallel.

The above can sound a little bit like science fiction, but it does happen to a limited extent in processors today. Processors like CPUs and GPUs can automatically execute instructions in parallel if there does not exist a data hazard between their inputs and outputs. For example, if one instruction reads from memory and a subsequent instruction writes that memory, the processor will wait for the read to finish before executing the write, perhaps using a technique like scoreboarding. If there does not exist such a data hazard, the processor might execute the two instructions in parallel. Additionally, some processors may be able to automatically remove superfluous data dependencies using register renaming, or by making guesses on the future state of the data using speculative execution to avoid having to wait for the results to arrive.

Of course, relying on processor smarts comes at a cost. The hardware becomes more expensive, these features come with their own overhead, and it’s hard to trust that these optimizations are really happening unless you understand the processor at a very deep level (and perhaps have professional tools for verifying it, like Intel VTune). Two instructions that can execute in parallel might also be separated by enough code in between them that the processor is not able to see that they can be executed in parallel, and that the compiler isn’t allowed to perform the optimization safely either.

For example, the following addition of a print statement to “times_two” might make it too complicated for the compiler and processor to safely execute iterations of the loop in parallel, since it can’t know if the implementation of “printf” might somehow affect the contents of the “data” array.

Manual Instruction Parallelism

In theory, we might be able to help the compiler and processor identify instructions that can run in parallel by explicitly writing them out. For example, the following code attempts (possibly in vain) to help the compiler and processor to recognize that the operations on the data array can be done in parallel, by iterating over it in steps of 4.

Assuming the processor can actually understand the intention of this code, the situation is still not great. This setup outputs much more bytecode, which may hurt the efficiency of the instruction cache, and still relies on the processor’s ability to dynamically identify instructions that can execute in parallel.

In the face of the difficulties in automatically executing instructions in parallel, hardware designers have created instructions that allow explicitly declaring operations that run in parallel. These instructions are known as SIMD instructions, meaning “Single Instruction Multiple Data”. As hinted by “multiple data”, these instructions are very well suited to exploit data-parallelism, allowing either the compiler or the programmer to assist the processor in recognizing work that can be done in parallel.

SIMD instruction sets include parallel versions of typical arithmetic instructions. For example, the ADDPS instruction on Intel processors computes 4 additions in one instruction, which allows explicit indication to the processor that these 4 additions can be executed in parallel. Since this ADDPS instruction needs quadruple the inputs and outputs, it is defined on 128-bit registers as opposed to the typical 32-bit registers. One 128-bit register is big enough to store four different 32-bit floats, so that’s how the quadrupled inputs and outputs are stored. You can experiment with SIMD instructions sets using so-called compiler intrinsics, which allow you to use these SIMD instructions from within your C/C++ code as if they were ordinary C functions. You can use these by including a compiler-supplied header like xmmintrin.h.

As an example of applying SIMD, consider the reworked “times_two” example:

This code implements the times_two example at the start of this section, but computes 4 additions in every iteration. There are a few complications:

The syntax is ugly and verbose.

It requires the input array to have a multiple of 4 in size (unless we add a special case.)

It requires the input array to be aligned to 16 bytes (for best performance.)

The code is less portable.

Writing code that assumes a SIMD width of 4 is also non-ideal when you consider that newer Intel processors can do 8-wide or 16-wide vector operations. Different GPUs also execute SIMD code in a variety of widths. Clearly, there are many good reasons to want to abstract the specifics of the instruction set in order to write more portable code more easily, which can be done automatically through the design of the programming language. This is something we’ll talk about later.

I hope we can agree that SIMD adds another level to our data-parallel decomposition. Earlier, we talked about splitting work into data-parallel pieces and distributing them among cores, and now we know that we can get more parallelism by running SIMD code on each core. SIMD compounds with multi-core in a multiplicative way, meaning that a 4-core processor with 8-wide SIMD has a maximum theoretical performance improvement of 4 * 8 = 32.

Memory Access Latency Hiding

With the compute power of multi-core and SIMD, our execution time quickly becomes bound by the speed of memory access rather than the speed of computation. This means that the performance of our algorithm is mostly dictated by how fast we can access the memory of the dataset, rather than the time it takes to perform the actual computation.

In principle, we can use our newly found computation power of multi-core and SIMD to lower the overhead of memory access by investing our spare computation in compression and decompression. For example, it may be highly advantageous to store your data set using 16-bit floating point numbers rather than 32-bit floating point numbers. You might convert these floats back to 32-bit for the purpose of running your math, but it might still be a win if the bottleneck of the operation is memory access. This is an example of lossy compression.

Another way to deal with the overhead of memory access is to create hardware which hides the latency of memory access by doing compute work while waiting for it. On CPUs, this is done through Hyper-Threading, which allows you to efficiently create more threads than there physically exists cores by working on the second thread of work where the core would normally wait for one thread’s memory access to complete.

A significant factor of improving the latency of memory access lies in the design of your algorithm and data structures. For example, if your algorithm is designed to access memory in predictable patterns, the processor is more likely to guess what you’re doing and start fetching memory ahead of your requests, which makes it more likely to be ready when the time comes to access it. Furthermore, if successive memory operations access addresses that are close to each other, it’s more likely that the memory you want is already in cache. Needless to say, cache coherency is very important.

Beyond cache effects, you may also be able to hide the latency of memory access by starting many memory operations as early as possible. If you have a loop that loops 10 times and does a load from memory at the beginning of every loop, you might improve the performance of your code by doing the loads “up front”. For example:

This optimization is more likely to be beneficial on an in-order core rather than an out-of-order core. This is because in-order cores execute instructions in the order in which they are read by the program, while out-of-order cores execute instructions based on the order of data-dependencies between instructions. Since out-of-order cores execute instructions in order of data-dependency, the loads that were pulled out of the loop might not execute early despite our best efforts. Instead, they might only execute at the time where the result of the load is first used.

Optimizations based on playing around with the order of instructions is generally less useful on out-of-order cores. In my experience, these optimizations on out-of-order cores often having no noticeable effect, and are often even being a detriment to the performance. CPU cores these days are typically out-of-order, but GPU cores are typically in-order, which makes these kinds of optimizations more interesting. An optimization like the one above might be already done by the GPU shader compiler without your help, but it might be worth the experiment to do it yourself, by either manually changing the code or by using compiler hints like HLSL’s “loop unroll” hint.

That said, I’ve seen cases where manually tweaking memory access has been extremely beneficial even on modern Intel processors. For example, recent Intel processors support a SIMD instruction called “gather” which can, for example, take as input a pointer to an array and four 32-bit indices into the array. The gather instruction performs these multiple indexed loads from the array in a single instruction (by the way, the equivalent for parallel indexed stores to an array is called “scatter”). As expected from its memory access, the gather instruction has a relatively high latency. This can be a problem in the implementation of, for example, perlin noise. Perlin noise implementations use a triple indirection into an array to produce seemingly random numbers, in the style of “data[data[data[i] + j] + k]”. Since each gather depends on the result of a previous gather, the three gathers need to happen completely sequentially, which means the processor basically idles while waiting for 3x the latency of memory access of gather. Manually factoring out redundant gathers or tweaking the perlin noise algorithm can get you a long way.

On the topic of dependent loads, a common target of criticism of memory access patterns is the linked list. If you’re traversing a linked list made out of nodes that are randomly scattered in memory, you can certainly expect that the access of each successive linked list node will cause a cache miss, and this is generally true. However, there’s another problem relating to linked lists, related to memory access latency. This problem comes from the fact that when iterating over a linked list, there’s no way for the processor to go to the next node until the address of the next node is loaded. This means you can bear the full brunt of memory access latency at every step of the linked list. As a solution, you might want to consider keeping more than one pointer per node, for example by using an octree rather than a binary tree.

CPUs vs GPUs?

I’ve been mostly talking CPUs so far, since they’re generally more familiar and easier to work with in example programs. However, pretty much everything I’ve said generally applies to GPUs as well. GPUs are also multi-core processors, they just have smaller cores. The cores on GPUs run SIMD code similarly to CPUs, perhaps with 8-wide or 16-wide SIMD. GPUs have much smaller caches, which makes sense, since synchronizing caches between multiple processors is inherently a very ugly problem that works against data-parallelism. By having smaller caches, GPUs pay a much higher cost for memory access, which they amortize by heavy use of executing spare computation while waiting for their memory access to complete, very similar in principle to hyper-threading.

Conclusion/To Be Continued

In the next post, we’ll talk about how the data-parallel uses of multi-core and SIMD are abstracted by current compute languages, so we can finally understand the idiosyncracies of compute languages.