Principles of Compute (Part 2)

Introduction

In the last post, we talked about the motivation to write data-parallel code in order to scale the performance of our programs with hardware resources. We also saw some basic designs for parallel algorithms, mostly in the abstract.

In this post, we’ll go into more detail about which aspects of hardware design exist to increase the performance of data-parallel code. There is a lot of overlap between CPU and GPU design in this area, so this is quite generally applicable knowledge for writing parallel code. The goal is to identify aspects of hardware design that we can rely on without knowing too much about the underlying architecture, since this allows us to write code that stands the test of time. Naturally, there are still differences between CPU and GPU hardware, so these differences will be highlighted too.

Instruction Parallelism and SIMD

For the purpose of discussion, let’s consider the following parallel loop:
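The original snippet isn’t preserved here; based on the name used later in the post (“times_two”) and the reads and writes described below, a minimal reconstruction might look like:

```c
#include <stddef.h>

// Each iteration reads data[i], does some math on it, and writes it
// back. No iteration depends on any other, so the loop is trivially
// data-parallel.
void times_two(float *data, size_t count)
{
    for (size_t i = 0; i < count; i++)
        data[i] = data[i] * 2.0f;
}
```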

Since this is a trivially parallel loop, we can straightforwardly apply parallelism techniques to accelerate it. However, before we actually modify the code ourselves, let’s consider what the processor could do to run this code in parallel for us automatically.

Automatic Instruction Parallelism

In theory, the underlying processor could automatically determine that this code can run in parallel by observing the reads and writes being made. At each iteration, it would see a read from data[i] into a temporary register, some math on that temporary register, then a write back to data[i]. In theory, the processor could internally build dependency graphs that represent the dependencies between all reads and writes, defer the evaluation of this graph until a lot of work has been accumulated, then evaluate the work in the dependency graph by executing different code paths in parallel.

The above can sound a little bit like science fiction, but it does happen to a limited extent in processors today. Processors like CPUs and GPUs can automatically execute instructions in parallel if there does not exist a data hazard between their inputs and outputs. For example, if one instruction reads from memory and a subsequent instruction writes that memory, the processor will wait for the read to finish before executing the write, perhaps using a technique like scoreboarding. If there does not exist such a data hazard, the processor might execute the two instructions in parallel. Additionally, some processors may be able to automatically remove superfluous data dependencies using register renaming, or by making guesses on the future state of the data using speculative execution to avoid having to wait for the results to arrive.

Of course, relying on processor smarts comes at a cost. The hardware becomes more expensive, these features come with their own overhead, and it’s hard to trust that these optimizations are really happening unless you understand the processor at a very deep level (and perhaps have professional tools for verifying it, like Intel VTune). Two instructions that could execute in parallel might also be separated by so much intervening code that the processor can’t see the opportunity, and the compiler may not be able to prove that reordering them is safe either.

For example, adding a print statement to “times_two” might make it too hard for the compiler and processor to safely execute iterations of the loop in parallel, since neither can know whether the implementation of “printf” somehow affects the contents of the “data” array.
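The exact snippet isn’t preserved; a sketch of what it likely looked like (the function and array names follow the post’s own):

```c
#include <stddef.h>
#include <stdio.h>

void times_two(float *data, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        data[i] = data[i] * 2.0f;
        // Neither the compiler nor the processor can easily prove that
        // printf doesn't touch `data`, so the iterations can no longer
        // be safely reordered or overlapped.
        printf("%f\n", data[i]);
    }
}
```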

Manual Instruction Parallelism

In theory, we might be able to help the compiler and processor identify instructions that can run in parallel by explicitly writing them out. For example, the following code attempts (possibly in vain) to help the compiler and processor recognize that the operations on the data array can be done in parallel, by iterating over it in steps of 4.
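A reconstruction of that manually unrolled loop might look like this (again assuming the “times_two” shape from earlier):

```c
#include <stddef.h>

// Process the array in steps of 4, spelling out four independent
// operations per iteration. `count` is assumed to be a multiple of 4.
void times_two(float *data, size_t count)
{
    for (size_t i = 0; i < count; i += 4) {
        data[i + 0] = data[i + 0] * 2.0f;
        data[i + 1] = data[i + 1] * 2.0f;
        data[i + 2] = data[i + 2] * 2.0f;
        data[i + 3] = data[i + 3] * 2.0f;
    }
}
```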

Assuming the processor can actually understand the intention of this code, the situation is still not great. This version generates much more machine code, which may hurt the efficiency of the instruction cache, and it still relies on the processor’s ability to dynamically identify instructions that can execute in parallel.

In the face of the difficulties in automatically executing instructions in parallel, hardware designers have created instructions that allow explicitly declaring operations that run in parallel. These instructions are known as SIMD instructions, meaning “Single Instruction Multiple Data”. As hinted by “multiple data”, these instructions are very well suited to exploit data-parallelism, allowing either the compiler or the programmer to assist the processor in recognizing work that can be done in parallel.

SIMD instruction sets include parallel versions of typical arithmetic instructions. For example, the ADDPS instruction on Intel processors computes 4 additions in one instruction, which allows explicit indication to the processor that these 4 additions can be executed in parallel. Since this ADDPS instruction needs quadruple the inputs and outputs, it is defined on 128-bit registers as opposed to the typical 32-bit registers. One 128-bit register is big enough to store four different 32-bit floats, so that’s how the quadrupled inputs and outputs are stored. You can experiment with SIMD instruction sets using so-called compiler intrinsics, which allow you to use these SIMD instructions from within your C/C++ code as if they were ordinary C functions. You can use these by including a compiler-supplied header like xmmintrin.h.

As an example of applying SIMD, consider the reworked “times_two” example:
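The original listing isn’t preserved; a minimal SSE reconstruction, consistent with the “4 additions per iteration” description below (doubling via x + x), might be:

```c
#include <stddef.h>
#include <xmmintrin.h>

// `count` is assumed to be a multiple of 4, and `data` is assumed to
// be aligned to 16 bytes.
void times_two(float *data, size_t count)
{
    for (size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(&data[i]); // load 4 floats at once
        v = _mm_add_ps(v, v);             // 4 additions: x + x == 2x
        _mm_store_ps(&data[i], v);        // store 4 floats at once
    }
}
```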

This code implements the times_two example at the start of this section, but computes 4 additions in every iteration. There are a few complications:

The syntax is ugly and verbose.

It requires the input array to have a size that is a multiple of 4 (unless we add a special case).

It requires the input array to be aligned to 16 bytes (for best performance).

The code is less portable.

Writing code that assumes a SIMD width of 4 is also non-ideal when you consider that newer Intel processors can do 8-wide or 16-wide vector operations. Different GPUs also execute SIMD code in a variety of widths. Clearly, there are many good reasons to want to abstract the specifics of the instruction set in order to write more portable code more easily, which can be done automatically through the design of the programming language. This is something we’ll talk about later.

I hope we can agree that SIMD adds another level to our data-parallel decomposition. Earlier, we talked about splitting work into data-parallel pieces and distributing them among cores, and now we know that we can get more parallelism by running SIMD code on each core. SIMD compounds with multi-core in a multiplicative way, meaning that a 4-core processor with 8-wide SIMD has a maximum theoretical performance improvement of 4 * 8 = 32.

Memory Access Latency Hiding

With the compute power of multi-core and SIMD, our execution time quickly becomes bound by the speed of memory access rather than the speed of computation. This means that the performance of our algorithm is mostly dictated by how fast we can access the memory of the dataset, rather than the time it takes to perform the actual computation.

In principle, we can use our newly found computation power of multi-core and SIMD to lower the overhead of memory access by investing our spare computation in compression and decompression. For example, it may be highly advantageous to store your data set using 16-bit floating point numbers rather than 32-bit floating point numbers. You might convert these floats back to 32-bit for the purpose of running your math, but it might still be a win if the bottleneck of the operation is memory access. This is an example of lossy compression.
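The post doesn’t show code for this; as one simple illustration, here is a bfloat16-style 16-bit float (truncating a 32-bit float to its top 16 bits, which is only one of several possible 16-bit formats):

```c
#include <stdint.h>
#include <string.h>

// Lossy compression: keep only the sign, exponent, and top 7 mantissa
// bits of an IEEE 754 float (the bfloat16 format).
static uint16_t compress_f32(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); // type-pun safely via memcpy
    return (uint16_t)(bits >> 16);
}

// Decompression just shifts the bits back into place; the discarded
// mantissa bits become zero.
static float decompress_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Halving the number of bytes read per element can pay for the conversion cost many times over when the loop is memory-bound.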

Another way to deal with the overhead of memory access is to build hardware that hides the latency of memory access by doing compute work while waiting for it. On CPUs, this is done through Hyper-Threading, which lets you efficiently run more threads than there are physical cores: where a core would normally stall waiting for one thread’s memory access to complete, it switches to work from a second thread instead.

A significant factor in improving the latency of memory access lies in the design of your algorithm and data structures. For example, if your algorithm accesses memory in predictable patterns, the processor is more likely to guess what you’re doing and start fetching memory ahead of your requests, making it more likely to be ready when the time comes to use it. Furthermore, if successive memory operations access addresses that are close to each other, it’s more likely that the memory you want is already in cache. Needless to say, cache locality is very important.

Beyond cache effects, you may also be able to hide the latency of memory access by starting many memory operations as early as possible. If you have a loop that loops 10 times and does a load from memory at the beginning of every loop, you might improve the performance of your code by doing the loads “up front”. For example:
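The original example isn’t preserved; a hypothetical sketch of the idea (summing ten floats, with the loads hoisted ahead of the arithmetic) might be:

```c
// Loads issued one per iteration, each consumed immediately after it
// is issued.
float sum10(const float *a)
{
    float sum = 0.0f;
    for (int i = 0; i < 10; i++)
        sum += a[i];
    return sum;
}

// The same work with the loads "up front": all ten loads are issued
// before any result is consumed, so on an in-order core they can be
// in flight simultaneously instead of serializing.
float sum10_hoisted(const float *a)
{
    float v[10];
    for (int i = 0; i < 10; i++)
        v[i] = a[i];
    float sum = 0.0f;
    for (int i = 0; i < 10; i++)
        sum += v[i];
    return sum;
}
```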

This optimization is more likely to be beneficial on an in-order core rather than an out-of-order core. This is because in-order cores execute instructions in the order in which they are read by the program, while out-of-order cores execute instructions based on the order of data-dependencies between instructions. Since out-of-order cores execute instructions in order of data-dependency, the loads that were pulled out of the loop might not execute early despite our best efforts. Instead, they might only execute at the time where the result of the load is first used.

Optimizations based on playing around with the order of instructions are generally less useful on out-of-order cores. In my experience, such optimizations often have no noticeable effect on out-of-order cores, and are sometimes even a detriment to performance. CPU cores these days are typically out-of-order, but GPU cores are typically in-order, which makes these kinds of optimizations more interesting there. An optimization like the one above might already be done by the GPU shader compiler without your help, but it may be worth experimenting with it yourself, either by manually changing the code or by using compiler hints like HLSL’s “[unroll]” attribute.

That said, I’ve seen cases where manually tweaking memory access has been extremely beneficial even on modern Intel processors. For example, recent Intel processors support a SIMD instruction called “gather” which can, for example, take as input a pointer to an array and four 32-bit indices into the array. The gather instruction performs these multiple indexed loads from the array in a single instruction (by the way, the equivalent for parallel indexed stores to an array is called “scatter”). As expected from its memory access, the gather instruction has a relatively high latency. This can be a problem in the implementation of, for example, Perlin noise. Perlin noise implementations use a triple indirection into an array to produce seemingly random numbers, in the style of “data[data[data[i] + j] + k]”. Since each gather depends on the result of the previous gather, the three gathers have to happen completely sequentially, which means the processor basically idles for three times the gather’s memory-access latency. Manually factoring out redundant gathers or tweaking the Perlin noise algorithm can get you a long way.
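A scalar sketch of that dependent-lookup pattern (the function name is hypothetical; a SIMD version would gather four of these chains at once, with the same serial dependency):

```c
// Three chained lookups: each load's address depends on the result of
// the previous load, so the three loads cannot overlap, and the full
// memory latency is paid three times back to back.
int triple_lookup(const int *data, int i, int j, int k)
{
    int a = data[i];     // first load
    int b = data[a + j]; // must wait for a
    int c = data[b + k]; // must wait for b
    return c;
}
```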

On the topic of dependent loads, a common target of criticism of memory access patterns is the linked list. If you’re traversing a linked list made out of nodes that are randomly scattered in memory, you can certainly expect that the access of each successive linked list node will cause a cache miss, and this is generally true. However, there’s another problem relating to linked lists, related to memory access latency. This problem comes from the fact that when iterating over a linked list, there’s no way for the processor to go to the next node until the address of the next node is loaded. This means you can bear the full brunt of memory access latency at every step of the linked list. As a solution, you might want to consider keeping more than one pointer per node, for example by using an octree rather than a binary tree.
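A minimal illustration of why traversal latency serializes (names are hypothetical):

```c
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

// The load of n->next must complete before the address of the next
// node is even known, so the full memory-access latency is paid
// serially at every step of the traversal.
int list_sum(const struct node *n)
{
    int sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}
```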

CPUs vs GPUs?

I’ve been mostly talking about CPUs so far, since they’re generally more familiar and easier to work with in example programs. However, pretty much everything I’ve said applies to GPUs as well. GPUs are also multi-core processors, they just have smaller cores. The cores on GPUs run SIMD code similarly to CPUs, perhaps with 8-wide or 16-wide SIMD. GPUs have much smaller caches, which makes sense, since synchronizing caches between multiple processors is inherently a very ugly problem that works against data-parallelism. By having smaller caches, GPUs pay a much higher cost for memory access, which they amortize by heavy use of executing spare computation while waiting for their memory access to complete, very similar in principle to hyper-threading.

Conclusion/To Be Continued

In the next post, we’ll talk about how the data-parallel uses of multi-core and SIMD are abstracted by current compute languages, so we can finally understand the idiosyncrasies of compute languages.