Software developer
Twitter @savage309

Published

2. GPGPU 101 – GPU thread model

This is the next part in the series of GPGPU 101 posts I started some time ago. If
you haven’t checked part 0 or part1, please do so. In part 1 we’ve talked how the x86 processor approaches the problems of efficiency (locality) and performance (parallelism). Now, we will discuss how the modern GPUs are approaching those. In this part we will talk more about parallelism and in the next we will focus on locality.

As the name of the GPU (Graphics Process Units) suggests it was designed for processing graphics. And graphics tends to be an embarrassingly parallel problem – you have to calculate the color for tens of millions of pixels, and (most of the time) the color of every pixel is independent from the other pixels. But graphics are not the sole embarrassingly parallel problem and at one point people started using GPUs for stuff different from graphics. First it was used in a hacky fashion, since there was no API for general purpose programming and after that CUDA and OpenCL were born. As we mentioned, there were multiple GPUs designs and approaches, but today more or less all desktop GPU vendors (AMD, Intel and nVidia) have converged to similar architectures. They do have differences which should be taken into consideration when optimizing a specific program, but the general approach to writing programs for those GPUs is the same. Here are the overviews of the GPU approaches from Intel, nVidia and AMD.

CPU architecture – optimized for single thread performance

The CPU is designed to perform a single task as fast as possible (or “optimized for low latency“). In order to do so, it uses complex control logic (superscalar pipeline, branch predictor, register renaming) and complex (and big) data and instruction caches. On the other hand the GPUs have many simple cores, that are lacking all of the complex stuff in the CPUs (or “optimized for high throughput“). So, what does this means for us? …

In a nutshell, the GPU is a huge SIMD machine, so they are well suited for data parallelism tasks. The width of the SIMD unit depends on the vendor (4 in Intel GPUs, 32 in nVidia GPUs, 64 in AMD ones – wider lanes do not mean better performance of course, so keep that in mind). Some of the vendors are calling each lane of the SIMD unit a ‘core’. Usually, the vendors are combining some SIMD units into groups (4 SIMD units in nVidia are forming a “Streaming Multiprocessor”, 2 SIMD units in Intel are forming are “Execution Unit”, and in AMD 4 SIMD units are forming a “GCN Compute Unite (CU)”). We will call that “SIMD Group“. The idea of those groups is that the SIMD units inside them are sharing some kind of hardware (most common a cache unit, but also instruction fetch/decode system and others). This also provides some locality, which is good.

GPU architecture – optimized for throughput

In the basic diagram above, every little green square is a lane from a SIMD unit, which can perform a floating point (or other) operation over some data. All green boxes (or SIMD lanes, or “cores”) in one SIMD unit (surrounded by the whitest box) are executing the same instruction. We’ve grouped 4 SIMD units in a SIMD group and we have 2 such groups here. So instead of multiple complex hardware, we have many simple cores. By the way, if you ever wondered how the vendors are calculating the FLOP/s on their processors – take the number of the SIMD lanes (cores), multiply that by 2 (since every core can executed up to 2 floating point operations per cycle with something called multiply-add or ‘mad‘) and multiply that by the frequency of the core. It is not hard to see that this theoretical FLOP/s peak is really, really very theoretical – there is no way to keep all units busy all the time and make them do mads all the time in any meaningful app.

But, as we discussed in the previous chapter, this complex hardware (branch predictor, register renaming etc.) is very important for the performance, since it keeps the pipelines busy. So if we are lacking that in the GPUs, isn’t the performance going to suffer a lot?

Well, obviously not. The trick that the GPU does is, that instead of trying to minimize the latency, it is trying to hide it with parallelism. It works as follows – every SIMD unit has a number of registers that it can use. The tasks we give to the GPU are in amount of hundreds of thousands. The GPU prepares some of those tasks in the hardware, so at any given moment there are multiple tasks ready to be executed (they share those registers). So, let say that we execute an instruction on a SIMD unit, and this instruction needs some memory from the DRAM also called global memory. In fact it is a bit different from the CPU DRAM and is named GDDRAM, but for our needs it is DRAM enough. On the CPU, if this instruction is not in the cache, we will have to wait and CPU will stall. If you have other thread ready to run, possibly the OS will switch to it (but this switch is expensive, since all the resources that were used by the stalled thread have to be swapped with the new one). The GPU on the other hand can simply choose another task to execute, since the data needed for it is already in the registers and waiting. So in some sense, we can say that the thread switch on the GPU is hardware implemented and basically free. The key here is that you have to give the GPU enough tasks in order to achieve high level of parallelism, that will be used to hide the latency.

But everything comes at a price, right. The price here are the registers. If your threads need a lot of data, at one point the registers would not be able to fit it all. At this point we have a so-called “register spill” to the global memory. Compared to the register, reads and writes to the global memory are ridiculously slow, so it is best to avoid that situation and keep the memory that every thread needs low.

Now, what about synchronization. If you have hundreds of thousands of threads, probably synchronizing them will be hell (after all synchronizing even 2 threads on the CPU is pretty hard). The answer is that the GPU is suited for tasks that don’t need synchronization at all – like graphics (real time or photorealistic), deep learning, simulations, etc. If you need synchronization on a global level – it is impossible. The only possible synchronization can be done is on a SIMD group level. But there are atomics on global level available. Tasks that require global synchronization usually are split into multiple GPU programs, thus offloading the synchronization job to the CPU.

And lastly, some programming. The thread model of the general purpose programming languages for the GPUs follows more or less the thread model of the hardware. For example in CUDA, threads are grouped in “thread blocks”. Threads in a block can talk to each other, even more they run on the same SIMD unit. How many threads are in a block is programmable, so if you have more than 32 (the size of the SIMD unit in nVidia), the thread block will be executed on multiple runs. And you can run multiple of those thread blocks (grouped in so-called “grid”). Both CUDA and OpenCL are offering multi-dimensional entities like groups of thread blocks and grids for convenience, but I’ve always preferred to use 1D spaces. Enough talking, lets do some coding, since it will make it much easier to understand.

Here is a program that sums up two arrays in C++.

1

2

3

4

5

6

7

8

template<typenameT>

voidsum(constT*A,constT*B,T*C,size_t size){

for(size_ti=0;i<size;++i){

C[i]=A[i]+B[i];

}

}

sum(A,B,C,size);

And here it is in CUDA. Instead of having a for loop, every thread (or core) will sum one of the array fields.

1

2

3

4

5

6

7

8

9

10

11

12

13

template<typenameT>

__global__//this tells the compiler to compile the function for the GPU

voidsum(constT*A,constT*B,T*C,intnumElements){

intglobalId=blockDim.x*blockIdx.x+threadIdx.x;//this gives us a unique thread id. threadIdx, blockDim and blockIdx are automatically available in the whole source of the GPU program. threadIdx is the unique index of the thread within the thread block, blockIdx is the index of the block groups, and blockDim tells the dimensions with which we've called the kernel

//every thread sums one of the array values. Make sure that we are not going out of the array bounds. Some threads would do nothing, but this is perfectly okay - usually we have tens of thousands of threads, and if some of them are going to just return this would not affect the performance

if(globalId<numElements){

C[globalId]=A[globalId]+B[globalId];

}

}

intthreadsPerBlock=32;//we want threadsPerBlock * numBlocks to be >= size, because we need one thread for each array element

intblocksPerGrid=(size+threadsPerBlock-1)/threadsPerBlock;

sum<<<blocksPerGrid,blocksPerGrid>>>(A,B,C,size);//use this special syntax to call the GPU function

Finally, some common keywords with the GPGPU apps. The CPU is usually called the “host“. The GPU is called the “device“. The host and the device are having physically different memories, and we have to do explicit copies from one to the other if we want to move stuff (more on that in the next part). The functions written for the GPU, that can be called from the host are called “kernels” (I’ve no idea why). Kernels can be only void functions (this makes sense, since they are executed from many threads, and if every one of them returns a result it could get messy). Functions, that can only be called from the GPU are called “device functions“.

Both OpenCL and CUDA are C based languages, and both of them offer extensions to get a unique per-thread id. CUDA has a C++ support too (though you don’t want to use virtual functions and other advanced stuff, but more on that in the next parts). Malloc and new can’t be used on the device. You have to allocate memory on the device before you start the kernel. This happens with a API call from the host (in fact, there is a way to use them with CUDA, but it is something that we will avoid and will talk about in the next parts).

How to synchronize the work between the host and the device ? What about having multiple devices ? How does a SIMD unit works, when there is a branch in the source and some threads go into the ‘true’ path, while others into the ‘false’ path? We look at all that (and more), again, in the next part.

CPU thermal profile during runtime

P.S. As you may have noticed the CPU has these very big cores located on one part of the chip, caches and memory on the other. In the GPUs on the other hand, you have kind of a better mixture of cores and memory (and the core are using much less power, since they don’t have any advanced features and a running at around 1GHZ). And we’ve talked how cooling/power is a major problem with current (and next) processors. But it is not only ‘heat’, it is heat per square mm. So in the CPUs, you have parts of the chip that are getting way hotter than the CPU vendors would like (caches require less power compared to ALU units). This makes the life (and the future) for the GPUs even brighter.