In my earlier blog post I quickly went through how CPUs and GPUs each scale out their performance. I also mentioned how the APU tries to harness the goodness of both worlds. This time, let me quickly go through a simple example and show how APUs present an excellent platform for solving it.

Consider the problem of parallel summation across a very large array. How would you solve this problem on a CPU? Here is the pseudo code:

Take an input array.

Block it based on the number of threads (usually one per core – 4 or 8 cores).

Iterate to produce a sum in each block.

Reduce across threads.

Vectorize your execution step through the SIMD ISA.

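Have a look at the code below:

//Summation within each thread (assumes n is a multiple of 4 * threads_count)

float4 sum(0, 0, 0, 0);

for (i = (n / threads_count) * thread_num; i < (n / threads_count) * (thread_num + 1); i += 4)

    sum += float4(input[i], input[i+1], input[i+2], input[i+3]);

float scalarSum = sum.x + sum.y + sum.z + sum.w;

//Reduction stage to aggregate the threads' results (t_sum[t] is thread t's scalarSum)

float reductionValue(0);

for (t = 0; t < threads_count; t++)

    reductionValue += t_sum[t];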
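To make the CPU steps concrete, here is a minimal runnable sketch in Python. The names `block_bounds`, `partial_sum`, and `cpu_style_sum` are my own, chosen for illustration. The four accumulators only mimic a 4-wide SIMD register, and CPython threads will not truly run this loop in parallel, so read it as a model of the structure, not of the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def block_bounds(n, threads_count, thread_num):
    """Half-open index range [start, end) owned by one thread."""
    start = (n * thread_num) // threads_count
    end = (n * (thread_num + 1)) // threads_count
    return start, end


def partial_sum(data, start, end, lanes=4):
    """Sum data[start:end] with `lanes` accumulators (a stand-in for float4)."""
    acc = [0.0] * lanes
    for i in range(start, end):
        acc[(i - start) % lanes] += data[i]
    # Horizontal add, like sum.x + sum.y + sum.z + sum.w.
    return sum(acc)


def cpu_style_sum(data, threads_count=4):
    """Block the array per thread, sum each block, then reduce across threads."""
    n = len(data)
    with ThreadPoolExecutor(max_workers=threads_count) as pool:
        futures = [
            pool.submit(partial_sum, data, *block_bounds(n, threads_count, t))
            for t in range(threads_count)
        ]
        # Reduction stage: aggregate the per-thread results.
        return sum(f.result() for f in futures)
```

For example, `cpu_style_sum([float(i) for i in range(1000)])` returns `499500.0`. Computing the bounds as `n * t // threads_count` lets the blocks cover the array even when `n` is not a multiple of the thread count.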

Think now of an efficient implementation on the GPU:

Take the input array.

Block it based on the number of threads (16 per core; it could be up to 64 per core).

Iterate to produce a sum in each block.

Reduce/Sum across threads.

Vectorize through a different kernel call due to the limitations of the current execution models.

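//Summation within each thread (each float64 holds 64 lanes)

float64 sum(0,…,0);

for (i = (n / threads_count) * thread_num; i < (n / threads_count) * (thread_num + 1); i += 64)

    sum += float64(input[i],…,input[i+63]);

//Reduction stage to aggregate the threads' results (t_sum[t] is the total of thread t's 64 lanes)

float reductionValue(0);

for (t = 0; t < threads_count; t++)

    reductionValue += t_sum[t];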
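The GPU flavour can be sketched the same way. The function below is a plain-Python simulation of one 64-wide group of lanes, not real GPU code; `gpu_style_sum` is a hypothetical name, and I assume `lanes` is a power of two. It folds both phases into one function for brevity, whereas on a real device the final reduction may be a separate kernel call, as noted in the steps above.

```python
def gpu_style_sum(data, lanes=64):
    """Simulate a 64-wide group of lanes summing `data` (lanes: power of two)."""
    # Phase 1: lane k accumulates elements k, k + lanes, k + 2 * lanes, ...
    # Adjacent lanes read adjacent elements in every step.
    partial = [0.0] * lanes
    for base in range(0, len(data), lanes):
        for lane in range(lanes):  # on a GPU these lanes run in lockstep
            i = base + lane
            if i < len(data):
                partial[lane] += data[i]

    # Phase 2: tree reduction, halving the number of active lanes each step.
    stride = lanes // 2
    while stride > 0:
        for lane in range(stride):
            partial[lane] += partial[lane + stride]
        stride //= 2
    return partial[0]
```

With 64 lanes the tree reduction finishes in 6 steps instead of 63 sequential additions, which is the kind of fine-grained parallelism the many GPU threads are there for.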

They don’t look so different from each other, right? You do basically the same steps; the main differences are the number of cores and the number of threads. On GPUs you have way more threads to do the summation, which may complicate your model. In addition, these many threads bring with them a lot of state management overhead, context switching, and problematic stack management. On the CPU you get data parallelism through a limited number of cores and threads. Narrow SIMD units simplify the problem. High clock rates and caches make serial execution efficient for each single thread. Also, the simple mapping of tasks to threads allows us to create complex task graphs. However, this comes at the cost of many loop iterations per thread. In other words, GPUs support very fine-grained data parallel execution, while CPUs provide a coarse-grained data parallel execution model.

APUs combine these by supporting nested data parallel code. Basically, CPUs take coarse-grained tasks and break them down into finer-grained tasks that the on-chip GPU executes faster. Close coupling of the CPUs and GPUs eliminates the cost of moving data between them to execute this nested data parallel model. Also, CPUs can handle conditional data parallel execution much better than GPUs, and offloading computations becomes more efficient since there is virtually zero data copying for this offloading process.

Applications can now combine high and low degrees of threading at almost zero cost. Interesting execution models are also possible: you can have multiple kernels executing simultaneously and communicating through shared buffers with relatively low synchronization overhead. So, back to our example: we can now divide our array among the four CPU cores, and each core can then offload the summation to the GPU threads and do the reduction at its level; then all the CPU cores can synchronize and do the final reduction with very low overhead.
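Here is a rough sketch of that nested scheme, again simulated in plain Python. `fine_grained_sum` is a hypothetical stand-in for the kernel a CPU core would offload to the GPU, and `nested_sum` plays the role of the four CPU cores. Each worker gets only index bounds into the same list, which loosely mirrors passing pointers into shared memory instead of copying blocks.

```python
from concurrent.futures import ThreadPoolExecutor


def fine_grained_sum(data, start, end, lanes=64):
    """Stand-in for an offloaded kernel: per-lane strided sums, then a tree reduction."""
    partial = [
        sum((data[i] for i in range(start + lane, end, lanes)), 0.0)
        for lane in range(lanes)
    ]
    while len(partial) > 1:  # lanes is assumed to be a power of two
        half = len(partial) // 2
        partial = [partial[k] + partial[k + half] for k in range(half)]
    return partial[0]


def nested_sum(data, cpu_cores=4):
    """Coarse-grained split across CPU cores, fine-grained sum inside each block."""
    n = len(data)
    bounds = [(n * c // cpu_cores, n * (c + 1) // cpu_cores) for c in range(cpu_cores)]
    with ThreadPoolExecutor(max_workers=cpu_cores) as pool:
        # Each "core" offloads its block; no block is copied, only bounds are passed.
        per_core = list(pool.map(lambda b: fine_grained_sum(data, b[0], b[1]), bounds))
    # Final reduction across the CPU cores.
    return sum(per_core), per_core
```

Calling `nested_sum([float(i) for i in range(1000)])` gives a total of `499500.0` and four per-core partial sums, the first of which is `31125.0`.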

So much for the possibilities of the APU architecture.

The question now is: how can we easily use all these capabilities without sacrificing performance? Moving from explicit data movement between CPUs and GPUs to shared memory spaces is tricky. CPUs use an explicit vector ISA and explicit memory access patterns, but GPUs depend on implicit vectors formed by multiple threads scheduled to access adjacent memory locations simultaneously. How can these two models be targeted by an easy, clear programming model with acceptable efficiency and true shared memory, so that we can freely pass pointers between the CPU and GPU cores? This will be the topic of my next blog post. Stay tuned!

As I’m heading home after three exciting days at AMD’s Fusion Developer Summit 2011, I’d like to share with you the findings, thoughts, and ideas I got out of this event. It had five fascinating tracks, each with around 10 sessions over the four days. The Programming Models track was the most interesting and exciting, at least to me. It is tightly coupled with the new AMD Fusion System Architecture (FSA), which brings with it a lot of new concepts. I can also see a lot of interesting challenges.

Let me share the excitement of these new innovations from AMD with you in a series of posts. I’ll start with a quick background on why APUs are a good answer to many computation problems, and then I’ll talk about their programming model.

So, the Fusion architecture is a reality now. It starts the era of heterogeneous computing for the common end user. It combines the heavy-lifting x86 cores with super-fast, simpler GPU cores on the same chip. You have probably come across articles or research papers advertising the significant performance improvement that GPUs offer compared to CPUs. Such results often come from a combination of poorly optimized CPU code and the inherently massive parallelism of the algorithms involved.

The APU architecture offers a balance between these worlds. GPU cores are optimized for arithmetic workloads and latency hiding, whereas CPU cores deal with the branchy code for which branch prediction and out-of-order execution are so valuable. They were built with different design goals in mind:

CPU design aims to minimize the latency of each single thread, spending chip area on branch prediction, out-of-order execution, and large caches.

GPU design aims to maximize throughput at the cost of lower performance for each thread. It spends the area on more cores of simpler design by not implementing branch prediction, out-of-order execution, or large caches.

Hence, these architectures hide memory latency in different ways.

So, in the CPU world memory stalls are costly and harder to cover. Because of the several cache levels, it takes many cycles to service a cache miss. That’s why a larger cache is necessary to reduce memory stalls. Also, out-of-order execution keeps the pipeline busy doing useful computations while cache misses are served for some other instructions.

GPUs, however, use different techniques to hide memory latency. They issue an instruction over multiple cycles; for example, a large vector executes on a smaller vector unit. This reduces instruction decode overhead and improves throughput. Executing many threads concurrently and interleaving their instructions fills the gaps in the instruction stream. So, they depend on the aggregate performance of all executing threads rather than on reducing the latency of a single thread. The GPU’s cache, however, is designed to exploit spatial locality rather than temporal locality. That’s why GPUs are very efficient at retrieving large vectors through the many banks they offer for SIMD-fashioned data fetching.
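A toy back-of-the-envelope model may help here. It is my own simplification, not an AMD formula, and the numbers in the usage note are made up for illustration. `cycles_per_instruction` shows how a wide vector is issued over several cycles on a narrower unit, and `alu_utilization` assumes every thread group alternates between a burst of compute cycles and a fixed memory wait.

```python
import math


def cycles_per_instruction(vector_width=64, simd_width=16):
    """Cycles needed to push one wide vector through a narrower SIMD unit."""
    return math.ceil(vector_width / simd_width)


def alu_utilization(groups_in_flight, compute_cycles, memory_latency):
    """Fraction of cycles the ALUs stay busy when thread groups are interleaved.

    Assumes each group computes for `compute_cycles`, then waits
    `memory_latency` cycles, and that switching between groups is free.
    """
    busy = groups_in_flight * compute_cycles
    return min(1.0, busy / (compute_cycles + memory_latency))
```

For instance, `cycles_per_instruction(64, 16)` is `4`. With a hypothetical 396-cycle memory wait after every 4 compute cycles, one group in flight keeps the ALUs only about 1% busy, while 100 interleaved groups are enough to cover the wait completely in this model.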

So choosing either of these two worlds comes with a cost. For example, the CPU’s large caches (to maximize the number of cache hits) and its support for out-of-order execution consume a large share of the transistor budget available on the chip. GPUs, however, cannot handle branchy code efficiently; they are most effective on massively parallel algorithms that can be expressed as vectors and many independent threads. So, each one suits a specific type of algorithm or problem domain. For a concrete case study, have a look at the comparison below of representatives of the CPU and GPU sides.

AMD Phenom II – x86: 6 cores, 4-way SIMD (ALUs); a single set of registers per core; deep pipeline supporting out-of-order execution.

AMD Radeon HD6070: 24 simple cores, 16-way SIMD; 64-wide SIMD state (threads count per CU); multiple register sets shared; 8 or 16 SIMD engines per core.

And this is when the Eureka! moment came to the AMD engineers & researchers: reconsider the design of microprocessors and build the Accelerated Processing Units (APUs). Combining both architectures on a single chip may solve many problems efficiently, especially multimedia and gaming related ones. The E350 APU, for example, combines two “Bobcat” cores and two “Cedar”-like cores, which include 2- and 8-wide SIMD engines, on the same chip!

So let me take you through an example in my next post to quickly show you the current and future models on these APUs. I’ll also be writing about the run-time models, the software ecosystem of APUs, and the roadmap of the AMD Fusion System Architecture (FSA).
