Transcript

Preface

The widespread use of so-called General-Purpose Graphics Processing Units (GPGPUs) in modern computers offers massively parallel computing power. These processors are becoming more and more flexible, but their programming still requires special techniques. Efficient utilization is a non-trivial task and requires (1) explicit parallel programming and (2) new compilation techniques to ease programming of these architectures. Moreover, GPGPUs provide parallelism on different levels, i.e. instruction-level and thread-level parallelism at once, each requiring different implementation techniques. This seminar discusses various approaches to creating efficient code for GPGPUs, but also considers specific applications of these parallel computers in scientific areas.

Seminar Paper
GPU Architektur und Programmiermöglichkeiten für GPGPU-Anwendungen
Marius Gräfe
University of Kaiserslautern, Embedded Systems Group
25 October 2012

Alone we can do so little; together we can do so much. (Helen Keller)

1 Introduction

When NVIDIA introduced CUDA in 2006 and thereby finally made general-purpose computing on GPUs simpler and more accessible, a lot of highly data-parallel algorithm implementations began to appear, which could now be run at reasonable performance even on standard consumer hardware. Very soon other vendors, like AMD, began to open up their parallel computing platforms too, and a vast field of different hardware and programming interfaces was the result. The next logical step was to unify all those interfaces into a single framework. This was done when Apple proposed OpenCL to the Khronos Group in 2008, which now maintains it as an open standard that many vendors have implemented by now. Still, CUDA is a few years older than OpenCL, and a lot of work has already been put into the many existing CUDA programs. It is therefore desirable to port these programs to other platforms by using OpenCL instead of reimplementing them. CPUs are also getting more and more cores as well as additional instructions for SIMD-like calculations, making them suitable for parallel computing too. CPUs are not expected to reach the performance that highly data-parallel algorithms achieve on today's GPUs, but it is still interesting whether the characteristics of CUDA code can be used to improve execution efficiency on CPUs.

The following text starts with a quick overview of CUDA, its code and its architecture. Then it describes how CUDA code can be ported to CPUs using the MCUDA framework. Afterwards it continues with an overview of OpenCL and how code can be ported from CUDA to OpenCL. This is followed by a performance analysis of the ported code, and the text concludes with an outlook on the future of CUDA code portation.

2 CUDA

In 2006 NVIDIA introduced CUDA, which allowed general-purpose computing to be done on NVIDIA GPUs [9]. Its architecture closely resembles the actual hardware architecture of NVIDIA graphics cards. CUDA also came with a language extension for C called CUDA C and an API that hides most of the device management, to make the GPU easily programmable.

2.1 Architecture

A CUDA program consists of host code, which is run on the CPU, and device code, which is run on the GPU and consists of so-called kernel functions. The host code is compiled and run like a normal C program; the kernels, however, are executed multiple times in parallel as CUDA threads, each having its own thread-local memory. Threads are organized in blocks, which are assigned to one of the GPU's multiprocessors. All blocks are organized in the grid. Each thread and block has its own index, which is available from within the kernel code as threadIdx and blockIdx. All threads in a block can access the block-local shared memory. Variables declared as __shared__ are copied into each block's shared memory and allow efficient cooperation between the threads in a block. Threads in a block are further grouped into warps, which are executed by the multiprocessor like a SIMD instruction. [9] Another type of memory on CUDA devices is the constant memory. Variables in constant memory cannot change during program execution, but are broadcast to a certain number of threads and cached more aggressively, therefore saving a lot of memory bandwidth. Finally there is the texture memory. It is also read-only, but its caching is optimized for two-dimensional access. [11]
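To make these notions concrete, the following is a minimal sketch (not taken from the referenced papers; the names, sizes and the launch parameters are illustrative): each thread computes its global index from blockIdx, blockDim and threadIdx, the __shared__ buffer is visible to all threads of one block, and __syncthreads() acts as a block-wide barrier. Staging the value in shared memory brings no benefit here and only serves to illustrate the constructs; it is assumed that the data already resides in device memory.

__global__ void scale(const float *in, float *out, float factor, int n)
{
    __shared__ float tile[256];                     // block-local shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        tile[threadIdx.x] = in[i];                  // stage the element in shared memory
    __syncthreads();                                // barrier across all threads of the block
    if (i < n)
        out[i] = factor * tile[threadIdx.x];
}

void launch(const float *d_in, float *d_out, float factor, int n)
{
    dim3 block(256);
    dim3 grid((n + 255) / 256);
    scale<<<grid, block>>>(d_in, d_out, factor, n); // kernel invocation syntax
}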

Figure 1: The CUDA device model. [7]

2.2 Language extensions

CUDA C is basically standard C with a few keywords added. The keywords __device__, __global__ and __host__ help the CUDA compiler to distinguish between code written to be run on the host and on the device. Host code and device code are then fed into the appropriate compilers, which is, in the case of the host code, the default C compiler. From the device code the CUDA compiler generates an intermediate, assembly-like language called PTX, which is fed to the graphics card driver and translated to the target hardware instruction set at runtime [10]. Another set of keywords enables the programmer to declare in which part of the device memory a variable should reside, in order to efficiently use the different types of memory described above [11]. NVIDIA also provides a handful of prebuilt data types and functions that make GPU programming a lot easier, but are not necessary to use the GPU [11] [9]. Important to mention here is the __syncthreads() function, which provides a barrier synchronization across all threads of the same block, meaning all threads have to enter and leave it at the same time. Also, a special syntax for kernel invocation (<<< >>>) was added. These very small extensions to the C language make CUDA not only quite easy to learn, but also simplify the source-to-source translation into other languages, as we will see in the later sections.

3 CUDA on Multicore Processors

The CUDA toolset already comes with an emulator for running GPU code on CPUs, but it is meant for debugging purposes rather than for running CUDA code on CPUs efficiently. It runs one OS thread for every CUDA

thread and uses native mutexes for synchronization. Therefore a massive amount of overhead is generated for the number of threads a typical CUDA program creates [1]. So another approach to efficiently map the CUDA architecture to the CPU architecture has to be found.

3.1 MCUDA

The MCUDA framework [12] provides a source-to-source translator, based on the Cetus framework, to map CUDA kernels to standard C, and a runtime framework to run these kernels on a common CPU. Both tools are designed to carry over as much as possible of the performance that comes from the special characteristics of CUDA code and the features of CUDA devices. CUDA programs gain a lot of performance from kernel functions with very similar control flow, and likewise lose much performance from accesses to the global memory, which has very high latency compared to the local memory spaces. This encourages programmers to write code with very regular control flow and high data locality. Furthermore, all threads in a block can be executed independently, so the basic idea is to run each block, not each thread, per CPU core. The regular control flow in the thread blocks makes it likely that the SIMD instructions available on current CPUs can be used. Thread-local and shared memory spaces also roughly fit into the CPU's L1 cache, therefore maintaining the data locality on the CPU.

Code Portation

Since the host code is already running on the CPU and no device-specific initializations have to be made, most of the translation process focuses on the kernel functions. The first step is to translate the control flow by transforming the thread-level kernel functions into block-level functions. This is done by serializing the kernel functions in each block using a so-called thread loop and explicitly introducing the threadIdx variable (cf. Figure 2).

void add(float *a, float *b, float *c)
{
    int i = threadIdx.x;
    if (i < VECTOR_SIZE)
        c[i] = a[i] + b[i];
}

void add(float *a, float *b, float *c, dim3 blockDim, dim3 blockIdx, dim3 gridDim)
{
    dim3 threadIdx;
    // Thread loop start
    for (threadIdx.y = 0; threadIdx.y < blockDim.y; threadIdx.y++) {
        for (threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++) {
            // Kernel code
            int i = threadIdx.x;
            if (i < VECTOR_SIZE)
                c[i] = a[i] + b[i];
        }
    }
    // Thread loop end
}

Figure 2: The add function of a simple vector addition without and with a thread loop.

Thread-local variables are now effectively reused in each iteration of the thread loop, and the shared variables stay visible for all iterations since they are declared outside the loop. If the kernel contains a synchronization statement, meaning a statement which all threads have to enter and leave at the same time, further transformations have to be done. This transformation is called loop fission when applied directly to the thread loop and deep fission when applied to a scope within the thread loop. If the synchronization statement is directly within the scope of a thread loop, the thread loop is simply split around it (see Figure 3).

void kernel(...)
{
    ...
    thread_loop {
        // Code before barrier
        __syncthreads();
        // Code after barrier
    }
}

void kernel(...)
{
    ...
    thread_loop {
        // Code before barrier
    }
    thread_loop {
        // Code after barrier
    }
}

Figure 3: Loop fission applied directly to the thread loop.

If not, the scope around the synchronization statement is split into two thread loops, side effects in the control structures are removed, and the scope itself is declared as a new synchronization statement, as demonstrated in Figure 4. This is called deep fission and can safely be done because the CUDA model requires the control flow affecting synchronization to be thread-independent within a block. All early-exit and irregular control statements are also marked as synchronization points. This is not always needed, but it secures the program's consistency. These steps are repeated until all synchronization statements are converted. For the serialized threads, these loop fissions have the same effect as a barrier synchronization, since the second thread loop is entered only after all threads have completed the first one.

After the control flow is transformed, all variables used in more than one thread loop are buffered by creating an array containing the values for the different threads. Variables used only in a single loop can safely be reused. References to variables outside thread loops are represented by buffer element 0; since they stay the same across all threads, using any of the elements is sufficient. Shared variables simply have the __shared__ keyword removed, as they are visible to all logical threads anyway. (cf. Figure 5)

void kernel(...)
{
    ...
    int k;
    int a, b, c;
    __shared__ float data[16];

    thread_loop {
        b = data[threadIdx.x] / 2;
        a = 0;
    }
    while (a < 16) {
        thread_loop {
            for (k = 0; k < 64; k++) {
                c += k * b;
            }
        }
        thread_loop {
            a++;
        }
    }
    thread_loop {
        data[threadIdx.x] = c;
    }
}

void kernel(...)
{
    ...
    // Variables only used in a single thread loop
    int k;
    // Variables used in multiple thread loops
    int a[], b[], c[];
    // Shared variables
    float data[16];

    thread_loop {
        b[tid] = data[threadIdx.x] / 2;
        a[tid] = 0;
    }
    // Variable outside of thread loop
    while (a[0] < 16) {
        thread_loop {
            for (k = 0; k < 64; k++) {
                c[tid] += k * b[tid];
            }
        }
        thread_loop {
            a[tid]++;
        }
    }
    thread_loop {
        data[threadIdx.x] = c[tid];
    }
}

Figure 5: Code before and after replicating variables.

In the host code, the kernel launch statements and some of the basic API functions (e.g. memcpys) can stay untouched, since they are reimplemented by the MCUDA framework; the remaining API and library calls have to be ported manually.

Execution

When the kernel launch function is invoked, the host thread stores the launch parameters to global variables and enters a barrier synchronization point. The worker threads, which represent the device, also enter this barrier when they become idle. On exit, the host thread advances to a second barrier, while all worker threads begin executing the block functions, using the launch parameters, and enter the second barrier

when there are no blocks left to execute. When leaving the second barrier, the host thread returns to the host code, and the worker threads enter the first barrier again. The blocks can be assigned either statically or dynamically to the worker threads. When scheduled statically, each worker thread gets a fixed set of blocks, at most one block larger than the set of any other worker, and executes it. With dynamic scheduling, each worker thread acquires the next block to execute after it has finished executing one, until there are no blocks left.

Performance

To test the performance of MCUDA, implementations of algorithms which have proven to be very efficient on GPUs (matrix multiplication, Coulombic Potential, Magnetic Resonance Imaging) were compared to their highly optimized CPU counterparts. The results (cf. Figure 6) show that the ported code is at least half as fast, therefore giving a reasonable performance for code that has not been specially tuned. The MCUDA implementations also scale very well, nearly linearly, with the number of CPU cores, at least for a small number of cores. This should be beneficial for future CPUs with more cores. Dynamic block scheduling is marginally faster than the static method and is expected to give more distinct

improvements for a considerably larger number of threads.

Figure 6: Performance of MCUDA with different numbers of worker threads and scheduling techniques. [12]

Optimizations

In the same way that CUDA kernels are typically fine-tuned to perform better on specific GPU hardware, the kernels can be optimized for the CPU architecture (e.g. varying the number of threads per block, loop unrolling, etc.). Experiments showed that the optimal optimization points for GPU and CPU are very different. For example, loop unrolling boosts CUDA implementations, since branches cost a lot of time on GPUs; on CPUs, however, loop unrolling prevents the compiler from using SSE or MMX instructions. Therefore, tuning the CUDA code for CPUs before portation can improve the performance. Also, a liveness analysis on the variables before replicating them could reduce the amount of memory used, by only buffering those variables which have a live value at the end of a thread loop.

Limitations

The MCUDA framework can only translate the kernel functions automatically; the host code has to be ported manually. For kernel invocation, however, MCUDA uses the CUDA syntax, and the basic CUDA memory management functions are also reimplemented. If the host code only uses these, no further portation has to be done. Also, with the introduction of OpenCL, MCUDA itself became obsolete, since there exist OpenCL implementations for CPUs. Still, it should be possible to use the techniques developed for MCUDA in the OpenCL CPU implementations.
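The static versus dynamic block scheduling described above can be pictured with a small sketch. The following C fragment is not MCUDA code but an illustration under assumed names (dim3_t, add_block, worker): each worker thread repeatedly claims the next block index from a shared atomic counter and calls the serialized block-level function until the grid is exhausted.

#include <stdatomic.h>

typedef struct { int x, y, z; } dim3_t;          /* stand-in for dim3 (assumed) */

static atomic_int next_block;                     /* shared between all worker threads */

/* Assumed signature of a translated block-level kernel, cf. Figure 2. */
void add_block(float *a, float *b, float *c,
               dim3_t blockDim, dim3_t blockIdx, dim3_t gridDim);

/* Body run by every worker thread: dynamic scheduling via an atomic counter. */
void worker(float *a, float *b, float *c, dim3_t blockDim, dim3_t gridDim)
{
    int total = gridDim.x * gridDim.y;
    for (;;) {
        int b_idx = atomic_fetch_add(&next_block, 1);   /* claim the next block */
        if (b_idx >= total)
            break;                                       /* no blocks left */
        dim3_t blockIdx = { b_idx % gridDim.x, b_idx / gridDim.x, 0 };
        add_block(a, b, c, blockDim, blockIdx, gridDim);
    }
}

With static scheduling, the loop would instead iterate over a fixed, precomputed range of block indices per worker.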

4 OpenCL

OpenCL is a framework whose specification was released in 2008 by the Khronos Group. It is used to write programs that can execute on many different platforms, like GPUs, CPUs or even DSPs, and is not restricted to NVIDIA hardware like CUDA. However, many aspects of OpenCL's code and architecture are quite similar to CUDA.

4.1 Architecture

The OpenCL model (cf. Figure 7) specifies that there is one processor to coordinate the execution, the host, and one or more processors to execute the kernels, the devices. Like in CUDA, kernels are executed multiple times in parallel as work-items. These are grouped in work-groups; the whole of all work-groups is called the NDRange.

Figure 7: The OpenCL device model. [7]

The OpenCL memory model consists of global memory and read-only constant memory, accessible by all work-groups, local memory restricted to its work-group, and work-item-local private memory. In contrast to CUDA there is no texture memory specified, since OpenCL is not designed to run only on GPUs. [3] The host code is compiled by the corresponding compiler for the host, but since the implementation of the OpenCL API is platform-specific, the kernel code is compiled at runtime. [2] One can easily see that, apart from a few device-specific features, the basic architectures of CUDA and OpenCL are very alike.
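To illustrate these terms, here is a minimal sketch (not from the cited papers) of the vector addition from Figure 2 expressed as an OpenCL kernel, together with the bare host-side calls needed to launch it. The context ctx, command queue, kernel object, host arrays h_a and h_c and the size n are assumed to exist already, and error handling is omitted.

__kernel void add(__global const float *a, __global const float *b,
                  __global float *c, const int n)
{
    int i = get_global_id(0);      /* global work-item index */
    if (i < n)
        c[i] = a[i] + b[i];
}

On the host side, buffer management, argument binding and the NDRange launch are all explicit:

/* Create device buffers and bind the kernel arguments. */
cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            n * sizeof(float), h_a, NULL);
cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
/* ... arguments 1 and 2 analogously ... */
clSetKernelArg(kernel, 3, sizeof(int), &n);

/* NDRange rounded up to a multiple of the work-group size; the kernel's bound
   check handles the padding work-items. */
size_t local = 64, global = ((n + 63) / 64) * 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, n * sizeof(float), h_c, 0, NULL, NULL);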

CUDA              OpenCL
thread            work-item
thread block      work-group
grid              NDRange
local memory      private memory
shared memory     local memory
global memory     global memory
constant memory   constant memory
texture memory    -

Table 1: Comparison between CUDA and OpenCL terminology.

4.2 Language extensions

OpenCL is also an extension to standard C. It adds qualifiers to define in which memory region a variable resides (__global, __local, __private and __constant) and to mark kernel functions (__kernel). OpenCL also provides an API for memory management and kernel invocation, but since OpenCL supports a variety of different platforms, the API also has to offer functions that are more low-level than those in the CUDA API. All of the device and kernel management that CUDA does implicitly, like managing the command queues that control the devices or compiling and managing kernels, which are objects in OpenCL and cannot be used in the same function-like way as in CUDA, has to be done explicitly in OpenCL, making the code a lot more verbose. This is also the reason why kernel code should be put into separate files: since kernels are compiled from strings, the code becomes rather unreadable when directly embedded in the host code. [5]

5 CUDA to OpenCL

5.1 Code Portation

As we can see, OpenCL's architecture and code are relatively similar to CUDA. Thus, porting the code mainly means replacing keywords and API functions [7][2]. Furthermore, the additional setup procedures needed by OpenCL have to be added, and kernel and host code have to be split into different files. However, manually porting the code is very tedious and time-consuming work, so projects to automate the process were created.

Swan

A first step towards simplifying the portation of existing CUDA code to OpenCL is the Swan tool [5]. Code ported with Swan can be built for CUDA and OpenCL targets, making it easy to support and maintain multiple platforms. It consists of the two components swan and libswan. swan is a source-code processing tool for CUDA kernel sources. It takes the sources, which have to be in their own source files, does a source-to-source translation for OpenCL targets and then passes the code to the appropriate compiler. Because of the close similarities between CUDA C and OpenCL C, the source-to-source translation is done with a set of regular expressions and does not need a complex C parser. The result is a C header file containing the compiled source and an entry-point function to invoke the kernel, taking the kernel launch parameters as additional parameters.

libswan provides functions similar to those of the CUDA API (e.g. for memory management) and is implemented for both CUDA and OpenCL. The correct implementation is chosen at compile time. To port existing CUDA programs (cf. Figure 8), each kernel is put into a separate source file and the corresponding header files are included in the host code. Then the kernel calls are replaced by entry-point functions from the header files. Lastly, the CUDA API calls are replaced by the appropriate Swan API calls.

__global__ void kernel(int *param) { ... }

void host()
{
    int *param;
    int hparam = 42;
    // CUDA accepts int launch parameters
    int grid, block;

    // Allocate memory on the GPU
    cudaMalloc((void **)&param, sizeof(int));
    cudaMemcpy(param, &hparam, sizeof(int), cudaMemcpyHostToDevice);

    // Launch the kernel
    grid = 4; block = 16;
    kernel<<<grid, block>>>(param);
    ...
    // Clean up
    cudaFree(param);
}

#include "kernel.kh"

void host()
{
    int *param;
    int hparam = 42;
    // Swan needs vectors which are filled using swanDecompose
    dim3 grid, block;

    // Allocate memory on the GPU
    param = (int *)swanMalloc(sizeof(int));
    swanMemcpyHtoD(&hparam, param, sizeof(int));

    // Launch the kernel
    swanDecompose(&grid, &block, 4, 16);
    kernel(grid, block, 0, param);
    ...
    // Clean up
    swanFree(param);
}

Figure 8: Porting CUDA to Swan. [5]

The resulting code can now be used on both CUDA and OpenCL platforms, but since it is neither CUDA nor OpenCL code, Swan is more an additional abstraction layer than a source-to-source translator. Its close resemblance to CUDA C and the hiding of the OpenCL setup procedures make Swan quite easy to use for CUDA programmers, but even if the translation of the kernels is done automatically, porting most of the host code by hand and the missing equivalents of some CUDA API functions mean a lot of work, especially for larger programs. The performance of the OpenCL version is about 50% of the CUDA version on the same hardware. Examination of the PTX code produced by the different compilers showed that the OpenCL compiler produced a lot less efficient code.

CU2CL

A project which aims to fully automate the code translation process is the CU2CL framework [8]. It is a plugin for the Clang compiler framework and already provides automatic translation for the most commonly used parts of the CUDA API. Clang was chosen since it already provides all the tools needed for code analysis and rewriting, therefore requiring little additional code and reducing the possibility of

errors. CU2CL recursively walks and analyzes the Abstract Syntax Tree (AST) generated from the original source by Clang and then does a string-based rewrite directly on the source file, not the AST, using Clang's rewriting mechanism. Since most of the code of a CUDA program is normal C code, and both CUDA and OpenCL are C-based, this approach only touches the CUDA-specific parts of the code. This leaves the original structure, especially comments, intact and simplifies maintainability and further development of the generated OpenCL code. Rewriting itself is based on common patterns. Recurring types of rewrites (e.g. CUDA API calls, see Figure 9) are generalized, making the framework more modular and easy to expand.

// CUDA
float *newDevPtr;
...
cudaMalloc((void **)&newDevPtr, size);

// OpenCL
cl_mem newDevPtr;
...
newDevPtr = clCreateBuffer(clContext, CL_MEM_READ_WRITE, size, NULL, NULL);

Figure 9: Rewriting a common CUDA API call. [8]

To completely translate the CUDA program, some #include directives must also be rewritten. Because #includes are not present in the AST, CU2CL registers a callback with the Clang preprocessor, which then provides all necessary information for rewriting them. CUDA-specific headers are removed entirely; system headers, like stdio.h, are removed from the OpenCL kernel files, since they cannot be used there. Included CUDA sources are split into two new files for host and device code, and the #includes are rewritten to point to the host code files; device code is not included, since it is only used at runtime. CU2CL already supports the most commonly used CUDA API calls, therefore only very few to no lines have to be ported manually after the translation process. These manual changes are quite simple and can easily be added in the future through CU2CL's modular architecture. On the performance side, the automatically translated code performs just as well as its manually ported counterpart. In comparison to the original CUDA code it again performs noticeably worse, which again is explained by the NVIDIA OpenCL compiler not performing as many optimizations as the CUDA compiler.

5.2 Performance Portability

After a CUDA program is ported over to OpenCL, the next question one may ask is: how does the OpenCL implementation perform in comparison to the CUDA implementation? Thus, performance comparisons between those implementations were made on the same platform, to keep the comparison fair, as well as on platforms from different vendors, which is one of the main reasons for porting programs over in the first place. Since code ported by the methods described above performs almost the same [8][5], analyzing the manually ported versions should be enough to draw reasonable conclusions.

Performance of CUDA and OpenCL on the same platform

In [7] several algorithms were ported and compared. From the NVIDIA GPU Computing SDK the bandwidthTest benchmark, which just uses API calls and does no additional computation, and the matrixMul benchmark were chosen. Also selected were the Coulombic Potential (CP) and the Magnetic Resonance Imaging Q and FHD (MRI-Q and MRI-FHD) benchmarks from the Parboil benchmark suite [4], because their memory access patterns are very well-suited for GPUs, so their kernels should be able to run without having to wait for memory accesses.

Figure 10: Performance comparison between CUDA and OpenCL. [7]

As Figure 10 shows, the OpenCL versions are a lot slower than the CUDA versions. Only the bandwidthTest's execution time is about the same, which shows that the additional execution time does not come from the different API calls but from the execution of the kernels. This further narrows the problem down to the different compilers used to build the kernels. Analyzing the PTX code generated by both compilers showed that the CUDA compiler applies several optimizations to the code while the OpenCL compiler by default does not. The most important optimizations the CUDA compiler uses are loop unrolling, to reduce the number of branches and index calculations, common subexpression elimination, where recurring expressions are replaced by a variable holding the computed value, and loop-invariant code motion, which moves calculations outside loops if they are unchanged by the loop. Additionally, the OpenCL compiler tends to group similar instructions, whereas the CUDA compiler interleaves memory access and calculation instructions, allowing I/O and computation to overlap through pipelining [2]. The CUDA compiler also makes use of NVIDIA device-specific instructions, such as mad and rsqrt. Manually applying these to the PTX code puts the performance of the OpenCL programs within close range of the CUDA ones (cf. Figure 11). Nevertheless, the OpenCL compiler supports optimizations for floating-point calculations, invoked by the -cl-fast-relaxed-math option. The performance of the programs compiled with this

option can also be seen in Figure 11. With optimizations turned on, some of the OpenCL variants also come within range of CUDA, but the OpenCL compiler's optimizations are still not as mature and complete, which explains the small remaining differences in performance. The remaining performance gap of the CP algorithm comes from the OpenCL compiler not using NVIDIA's native rsqrt instruction, which saves a lot of time-consuming division operations in this case. OpenCL does, in fact, support these native instructions by using the native_ prefix before certain functions, but then the implementation and therefore the accuracy of the calculation becomes platform-specific [6]. This leads to another drawback of the -cl-fast-relaxed-math option: when used, the precision of floating-point calculations is no longer IEEE 754 compliant. [6]

Figure 11: Performance of optimized OpenCL code. [7]

Performance of ported OpenCL code on other platforms

In [7] the ported CUDA code was also tested on different platforms. It was run on an Intel Core i7 with 4 cores, an NVIDIA Tesla C1060 and a Radeon HD 5870, which, in theory, has a much higher peak performance (2720 Gflop/s for the Radeon versus 933 Gflop/s for the Tesla). NVIDIA's OpenCL compiler was used for the Tesla and AMD's OpenCL implementation was used for both the Radeon and the CPU. On all platforms both the automatically optimized and the unoptimized versions were tested; the results can be seen in Figure 12.

Figure 12: Performance of OpenCL on different platforms. [7]

Because the benchmarks make extensive use of data parallelism and regular memory access patterns, which GPUs are optimized for, the CPU performs a lot worse than the GPUs. Also, the -cl-fast-relaxed-math parameter doesn't affect the performance on the Intel and AMD platforms, which suggests that the optimizations do not work with the AMD compiler yet. Comparing the unoptimized benchmarks shows that it depends on the application whether the Tesla or the Radeon GPU performs better.
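Returning to the compiler options discussed above, the following short sketch shows where they enter in practice (the program and device objects are assumed to exist; this is an illustration, not code from [7]). The relaxed-math optimizations are requested through the build options passed to clBuildProgram, and the device-specific reciprocal square root can be requested explicitly in the kernel; both trade IEEE 754 accuracy for speed.

/* Host side: build the kernel program with relaxed floating-point optimizations. */
clBuildProgram(program, 1, &device, "-cl-fast-relaxed-math", NULL, NULL);

/* Kernel side: explicitly use the platform-specific fast reciprocal square root. */
float r = native_rsqrt(dx * dx + dy * dy + dz * dz);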

To analyze the sustained performance, the benchmarks were run with different work-group sizes. The result for MRI-FHD is shown in Figure 13. The normalized execution time is the kernel execution time divided by the best kernel execution time for the corresponding hardware. This shows that the optimal parameters for each algorithm still depend on the platform. The work-group sizes have to be large enough to hide I/O, but not too large, since hardware resources are limited. Additionally, in the case of GPUs the work-group sizes should be a multiple of the warp sizes (wavefront sizes for AMD) supported by the hardware, because that is the number of threads that are executed in parallel on the hardware and therefore the only way to fully utilize the GPU.

Figure 13: Normalized performance for different work-group sizes. [7]

Optimizations

As we can see, even though OpenCL is designed with code portability in mind, OpenCL can perform just as well as CUDA. The major problem is the current implementations of the OpenCL compilers, which do few to no optimizations. But since OpenCL is quite young in comparison to CUDA, it is to be expected that the implementations will get better in the next few years. Still, as the comparison of the sustained performance shows, the code has to be tuned to reach optimal performance on each hardware platform. One approach to automatically perform these hardware-dependent optimizations is called auto-tuning. The idea of auto-tuning is to have a large number of code variants of the ported program and then empirically select the one that performs best on the given hardware. The auto-tuning infrastructure typically consists of a code generator, which produces the different code variants based on templates by applying different parameters and optimization techniques, and a heuristic search engine, which tries to find the best variant out of the previously generated ones. The search engine itself limits its search space, if possible, by using knowledge of the target hardware and previously evaluated results. [2]

The major problem with auto-tuning is that generating the templates automatically is hard, since the parameters and parameter spaces depend on the algorithm that is to be tuned and have to be selected carefully so as not to affect the algorithm's correctness. Another possibility is to reduce the additional setup time needed by OpenCL. This can be done by compiling the kernels only once, on deployment, instead of every time the program is executed, using OpenCL's clGetProgramInfo to get the intermediate code and saving it to disk. This especially saves time when there are a lot of different kernels or when the technique is applied to a library. [2]
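A minimal sketch of this caching idea (single-device case, assumed names ctx, device and program, cache file name chosen arbitrarily, error handling omitted): after the first clBuildProgram the program binary is read back via clGetProgramInfo and written to disk; on later runs it is loaded again with clCreateProgramWithBinary instead of recompiling the source.

/* First run: retrieve and cache the compiled binary. */
size_t bin_size;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);
unsigned char *bin = malloc(bin_size);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
FILE *f = fopen("kernel.bin", "wb");   /* assumed cache file name */
fwrite(bin, 1, bin_size, f);
fclose(f);

/* Later runs: rebuild the program object from the cached binary. */
cl_int status, err;
cl_program cached = clCreateProgramWithBinary(ctx, 1, &device, &bin_size,
                                              (const unsigned char **)&bin,
                                              &status, &err);
clBuildProgram(cached, 1, &device, NULL, NULL, NULL);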

6 Future

As we can see, the greatest performance losses are induced by the OpenCL compilers. Since OpenCL is relatively young and is being actively developed at the moment, it is to be expected that OpenCL compilers will mature further. Modern optimization techniques and tuning for the target hardware are likely to be implemented. Additionally, some improvements to the translation frameworks can be made. For CU2CL, optimizing the generated code for the target platform is planned [8], and first experiments with auto-tuning have already been conducted [2]. On the mere portation side, most of the work is done, as CU2CL showed. There is still a bit of the CUDA API left to be supported, but the groundwork for easily adding additional transformations has been laid, and implementing them is planned for the near future.

Another interesting project, but not in the scope of this work, is Ocelot [1]. It does not translate or even touch the original source code; rather, it implements an alternative CUDA driver API which can execute the PTX code generated by the NVIDIA compiler. Thus it can even be used on programs where the original source code is not available. The project is already implemented for quite a few backends and is still in active development.

7 Conclusion

As we can see, porting CUDA code to other platforms is no big problem anymore. Especially porting to OpenCL is rather simple, since its principles and architecture are very similar to CUDA's. Porting the code manually is possible, but very time-consuming. Frameworks for automatic code portation have been implemented and are working, but still require a certain amount of manual work in most cases. The remaining step towards a fully automatic translation is mainly to support the remaining parts of the CUDA API. The portation, however, only works on the functional side as of now. The ported code performs very poorly on other platforms. It can be tuned to perform as well as CUDA, but currently not automatically. Until this is the case, it is unlikely that code portation from CUDA will play a big role in real-world applications. A lot of work remains to be done, both in improving the compilers and in adding optimizations to the portation frameworks.

OpenACC and the PGI Compiler
Dimitri Blatner
University of Kaiserslautern, Embedded Systems Group

Abstract

In the field of High Performance Computing (HPC) there is a big movement towards hybrid systems, consisting of accelerators such as GPGPUs (General-Purpose Graphics Processing Units) that share computations with CPUs. In fact, future systems are becoming more and more integrated for efficiency reasons. Applications for these systems need to be programmed in an effective and easy way, that is, understandably, quickly, and abstracting from technical details. Therefore the following pages present the OpenACC API (Application Program Interface), which targets the acceleration of programs via compiler directives and provides easy mechanisms for both accelerators and many/multi-core processors. Also, the PGI compiler for OpenACC is presented, which is one of the first OpenACC compilers and widely used in the HPC world.

1 Introduction

Since the beginning of general-purpose GPUs, the market for hybrid systems has grown very fast. They now play an important role in the world of HPC, not only for workstation or desktop PCs. The reasons are simple: they can process more parallel computations, with 500 and more cores in modern GPUs compared to up to 8 physical cores in today's CPUs, they have fast GPU memory, and they also have a better performance-per-watt ratio. In fact, power consumption is one of the biggest problems today when maintaining HPC clusters and computing centers. The first GPGPU cards came from Nvidia in mid 2007, which currently is also the market leader. These cards are based on GPU technology, but are dedicated to the acceleration of massively parallel computations. Today, more and more of the world's fastest supercomputers in the Top500 list (June 2012: ranks 5, 6, 10, 14, etc.) contain GPGPUs to speed up computations, mainly Tesla cards from Nvidia. The upcoming Nvidia GPGPU, the Tesla K20, has almost 2500 small ALUs (Arithmetic Logic Units) with about 1.17 TFLOPS (tera, i.e. 10^12, floating-point operations per second) peak performance in double precision.

Hybrid systems are usually not limited to GPGPUs; other coprocessors or accelerator cards for certain special arithmetic operations or other complex instructions exist as well. Programmers now have the quite complex task of producing code that can exploit the additional speed of different accelerator cards. Accelerator cards usually cannot access the application memory directly; copying data to the accelerator's memory via the CPU is very time-consuming and the main challenge for high-performance applications. Since different accelerators may have different instruction sets or computation possibilities, a deep knowledge

of those instructions and of the hardware used is often required in order to develop applications for them. In the case of GPGPUs the instruction set is often limited to quite fundamental arithmetic operations on scalars and vectors. Therefore a CPU is still needed for the main application logic. Recently, Intel has announced a new coprocessor, the Xeon Phi, based on their Xeon server CPUs, with about 62 cores and 1 TFLOPS peak performance in double precision, understanding the x86 instruction set. The HPC and multimedia electronics markets (mobile phones, tablets, etc.) show that there is a trend towards more and more integrated logic circuits and acceleration units in the future, for power and performance reasons. This trend can be seen in the newer AMD APUs (accelerated processing units) and Intel CPUs (e.g. Quick Sync Video), and in most of today's smartphones.

Together with Nvidia's GPGPUs, the Compute Unified Device Architecture (CUDA) [24] was introduced, providing a toolkit containing libraries, development tools, language extensions and compiler directives. Although CUDA delivers good performance and is widely used in HPC-based and even consumer systems, the effort of writing CUDA applications is very high and the applications run only on cards from Nvidia. This is one big reason why, a year later, the open standard Open Computing Language (OpenCL) [12] was published by the Khronos Group, a consortium for open IT standards with leading industry members, providing a cross-platform development framework supporting heterogeneous hardware. OpenCL is cross-platform compatible and has a rich feature set with many built-in functions, but slower performance than CUDA due to a higher abstraction from specific hardware. It turns out that programming with OpenCL is also quite challenging, since its abstraction level is still not high enough to make programming substantially easier. This leads to the idea of OpenACC, which this paper presents.

The rest of the paper is structured as follows: Section 2 Related Work presents the references used in this work and additional approaches related to OpenACC. Section 3 Hybrid Architectures gives an overview of hybrid architectures and their limits as a foundation for understanding current research efforts establishing new standards for future systems. Section 4 OpenACC presents the OpenACC standard in detail with examples, compiler techniques and a performance comparison. Section 5 PGI Accelerator OpenACC Compiler presents the PGI Accelerator compiler for OpenACC as well as some compiler techniques and a performance analysis. Section 6 HSA shows a recently introduced foundation for developing standards for heterogeneous architectures. Section 7 Conclusions summarizes the information from this paper and gives a concluding evaluation.

2 Related Work

Although this paper presents only the OpenACC standard and the PGI Accelerator compiler, other works have to be considered to give a good overview of the current state of the art. This section presents some common or alternative approaches for multicore and accelerator programming, namely OpenMP, OpenMPC, hiCUDA and some high-level aspects of OpenCL. Afterwards, the papers used in this work for

OpenACC and the PGI compiler are presented.

2.1 OpenMP

OpenMP [8, 1, 7] is currently the de facto standard programming model for multicore and SMP systems in industry. It was developed by a group of hardware and software manufacturers, including Oracle, Intel, Hewlett-Packard and IBM. The programming model extends the languages C/C++ and Fortran by compiler directives, #pragma omp ... for C/C++ and !$omp ... for Fortran. The main focus of OpenMP is to produce highly portable code that keeps both parallel programming and the parallelization of existing sequential code as simple as possible for the programmer. This makes it a cross-platform solution, since it does not depend on a specific compiler or hardware. OpenMP implements a fork-join based concept of threads, forking the main application thread into as many sub-threads as needed that work in parallel whenever the corresponding compiler directive is given. At the end of a parallel region the sub-threads are joined together. OpenMP also supports the concept of tasks explicitly via compiler directives, meaning that it can define code blocks which are called frequently and can be executed independently from the rest of the code with the use of a dynamic dispatcher and a task queue. The basic OpenMP programming model supports neither heterogeneous multicore systems nor the usage of accelerators. This is where OpenACC wants to score.

2.2 OpenCL

OpenCL [18, 12] defines itself as a programming framework for heterogeneous compute resources, see Figure 1. As mentioned in the introduction, it was published by the Khronos Group, with Nvidia as the chair and Apple as specification editor. The first implementation of OpenCL was within the MacOS 10.6 operating system.

Figure 1: The application field of OpenCL. [12]

OpenCL is mainly a huge API, which makes it very portable. The memory architecture is hierarchical and very similar to that of CUDA except for some differences in terminology. It was built around the GPU architectures existing today. Memory management is explicit; accelerator functions are again called kernels. OpenCL applications are compiled in two phases, first into an Intermediate Representation (IR) at compile time and second into binary code at runtime, which results in a higher

initialization time. Nevertheless, OpenCL is more abstract (or in a sense high-level), because it allows generating code for completely different accelerator architectures, unlike CUDA at the moment. With version 1.2 of OpenCL it is possible to treat embedded kernels, e.g. implemented on FPGAs, as normal OpenCL kernels, without requiring knowledge of how to invoke the embedded kernels. This demonstrates the power and possibilities of abstract programming models, with OpenCL as an example.

2.3 OpenMPC

OpenMPC (OpenMP Extended for CUDA) [21], also available online at https://engineering.purdue.edu/paramnt/openmpc/, shows an approach to using the OpenMP programming model along with CUDA accelerators, but differs slightly in the execution model [6]. Additional compiler directives and environment variables help the compiler to split the application code into host and accelerator code. It provides a compilation system and an API that highly abstracts from the CUDA programming model. The steps of the compilation flow are shown in Figure 2.

Figure 2: OpenMPC compilation flow. The (A) marks additional compile passes for automatic code tuning. [21]

Additionally, OpenMPC provides several optimization tools, e.g. for memory management, and a so-called search space pruner tool that analyzes the given OpenMP application and suggests possible optimization settings for CUDA-related parameters. Figure 3 shows the main compiler directives of OpenMPC. With clauses the programmer can fine-tune and optimize the code; the clauses are described in detail in [21].

#pragma cuda gpurun [clause [[,] clause]...]
#pragma cuda cpurun [clause [[,] clause]...]
#pragma cuda nogpurun
#pragma cuda ainfo procname(pname) kernelid(kid)

Figure 3: OpenMPC compiler directives. [21]

The evaluation shows that OpenMPC applications reach over 80 percent of the performance of hand-written CUDA applications, tested with different common algorithms. This is a very good performance result with regard to the high abstraction level of this programming model.
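As a rough, hedged illustration of how the directives from Figure 3 might annotate existing OpenMP code (this sketch is not taken from [21]; the loop and names are invented, and the exact placement rules are those of the OpenMPC documentation):

/* Hypothetical use: mark an existing OpenMP parallel loop for GPU execution. */
#pragma cuda gpurun
#pragma omp parallel for
for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];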

2.4 hiCUDA

hiCUDA [14] is another approach, similar to OpenMPC's, for a high-level abstraction of the CUDA programming model. It provides compiler directives together with a so-called directive handler and a source-to-source compiler for CUDA code generation. The directives allow many customization options via clauses, as in OpenMPC or OpenMP, for data partitioning, memory management and the definition of kernel/host parts and loops. These clauses are abbreviations for CUDA-related language constructs and preferences. The compilation flow is shown in Figure 4.

Figure 4: hiCUDA compilation toolchain. [14]

First, the C/C++ code, annotated with compiler directives, is preprocessed by the directive handler, so that accelerator functions and host functions are split and CUDA-related preferences are set. Afterwards, the source-to-source compiler translates the intermediate code to valid CUDA source code, which is later compiled by the native CUDA C compiler toolchain. During these steps no special analysis or optimization process is run. In other words, hiCUDA simply provides abbreviations for CUDA language constructs via C/C++ compiler directives, so that the programmer does not write CUDA code directly. It does not free the programmer from the CUDA programming model, but it saves many lines of code. Programmers still have to know how data should be moved between the host and the accelerator, which code has to form the kernel, and so on. The evaluation of hiCUDA was done on an old GeForce 8800GT graphics card with CUDA version 1.1 compatibility, using different standard parallelization algorithms. This, of course, is not representative of the whole potential of hiCUDA, but it shows that a speedup similar to hand-written CUDA code is possible. In the end, it is also a very flexible approach that can benefit from further improvements of the native CUDA compiler and additionally gives the freedom to implement one's own compiler optimizations.

2.5 OpenMP for Accelerators

OpenMP for Accelerators [6] extends the OpenMP programming model with new compiler directives. The programming model and the directives are very similar to those of OpenACC, providing clauses for memory management and other customizations. It targets ease of programming when accelerating applications, without altering existing code. Applications may still run without an accelerator. The additional compiler directives identify regions that can be offloaded to the accelerator without being restricted to specific accelerators. This implies very portable code as well as an easy and flexible mechanism for accelerating existing code. The main work has to be done by the compiler, since OpenMP for Accelerators does not say how exactly to implement the directives. The evaluation of the accelerator-supported code was done with an unspecified compiler and shows a speedup comparable to a pure OpenMP implementation when no additional accelerator is used. It

states that PGI's CUDA compiler (see Section 5) achieves 5 times the speed with hand-written CUDA code, which is of course not the target of this approach.

2.6 Papers used in this work

A Comparative Study of OpenACC Implementations [25]
accULL: An OpenACC Implementation with CUDA and OpenCL Support [27]
Directive-based Programming for GPUs: A Comparative Study [26]
Experiences with High-Level Programming Directives for Porting Applications to GPUs [16]
Moving Heterogeneous GPU Computing into the Mainstream with Directive-Based, High-Level Programming Models (Position Paper) [22]
OpenACC First Experiences with Real-World Applications [28]
OpenACC Implementations Comparison [23]
The OpenACC Application Programming Interface [4]
Using Compiler Directives for Accelerating CFD Applications on GPUs [17]
Generalized parallelization methodology for heterogeneous HPC platforms [20]
Performance of FORTRAN and C GPU Extensions for a Benchmark Suite of Fourier Pseudospectral Algorithms [9]
Towards high performance and usability programming model for heterogeneous HPC platforms [19]
PGI Accelerator Compilers OpenACC Getting Started Guide [13]
PGI Accelerator Programming Model for Fortran & C [3]
Porting and scaling OpenACC applications on massively-parallel, GPU-accelerated supercomputers [15]

3 Hybrid Architectures

Most computer systems, whether personal or high-performance, are based on a hybrid architecture consisting of one or more CPUs and a coprocessor with its own dedicated memory. It can be compared to hybrid fuel/electric engines in newer cars. The opposite is an architecture consisting of a single type of processor in a single or clustered system, e.g. homogeneous SMP clusters. Figure 5 shows an overview of a single computer system based on a hybrid architecture. In fact, almost all systems are based on this architecture, from personal computers using a CUDA/OpenCL-compatible graphics card to high-performance clusters like the Cray XK7, which uses AMD Opteron processors along with Nvidia Tesla GPGPUs as coprocessors. In this context people also talk about heterogeneous computing, because the computation of an application is done by different architectures in collaboration.

Figure 5 also shows the big bottleneck of this architecture, the connection between CPU and accelerator. Not only data is transferred via the PCIe-2 connection, but also all control information. Therefore, one has to focus on minimizing data transfers to the accelerator, using asynchronous communication, and keeping as much data as possible resident in the accelerator's memory for as long as possible. The asynchronous communication is very important, so that the CPU and the accelerator may continue their computation while data is transferred between them in parallel. The communication is typically implemented via the DMA controller, which is also used for writing to external storage devices without requiring the CPU to intervene. The main reason for this bottleneck is the existence of two separate memories for the CPU and the accelerator, because a completely shared memory would be very expensive.
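As a small illustration of overlapping transfers with computation (a sketch in CUDA C, since the same idea applies to any accelerator API; the kernel process, the buffers d_buf/h_buf, the chunk size and the launch configuration are assumed): chunks are staged through two streams so that the asynchronous copy of one chunk overlaps with the kernel working on the previous one. For the copies to be truly asynchronous, h_buf should be pinned host memory (e.g. allocated with cudaHostAlloc).

cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);

for (int c = 0; c < nchunks; c++) {
    int k = c % 2;   /* alternate between the two streams */
    /* asynchronous host-to-device copy of chunk c (DMA; the CPU keeps running) */
    cudaMemcpyAsync(d_buf[k], h_buf + c * chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[k]);
    /* kernel in the same stream: runs after its copy, overlaps the other stream */
    process<<<grid, block, 0, s[k]>>>(d_buf[k], chunk);
}
cudaDeviceSynchronize();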

Figure 5: Overview of a single hybrid computer architecture found in most systems today, with example performance values: a CPU (~150 GFLOPS) with 32 GB ECC DDR3 main memory (~42 GB/s) connected via PCIe-2 (8 GB/s) to an accelerator (~1.17 TFLOPS) with 5 GB ECC GDDR5 memory (~200 GB/s).

It can easily be observed that hybrid architectures are a short-lived solution as long as the communication between the CPU and the accelerator is so limited. Nevertheless, we can already see a great change in future architectures that is accompanied by a change in programming paradigms. Hence, there is a need for a flexible and adaptable way of programming highly parallel applications for various system architectures [19]. As mentioned in the introduction, a more integrated architecture is also the trend in electronics for tablet PCs and mobile phones, making the traditional hybrid architecture obsolete.

4 OpenACC

At the end of 2011, the OpenACC group was founded by CAPS Enterprise, Cray Inc., The Portland Group Inc. and Nvidia. They developed the standard in cooperation, where all companies but Nvidia provide compilers for OpenACC (more in Section 5.2 Alternative Compiler). Then, in November 2011, the OpenACC API specification version 1.0 [4] was released, based on the popular OpenMP programming model for multicore and SMP architectures. OpenACC is developed for the acceleration of existing applications by offloading parts of the code to an accelerator device, e.g. a GPGPU, while being portable and platform-independent. The future idea is to merge OpenMP and OpenACC into one standard for heterogeneous architectures, utilizing both CPUs and accelerators. This approach also reduces the very cost-intensive development time for heavily parallel applications, see Figure 6.

The paper [20] gives an overview of today's technology and programming methodology for hybrid or heterogeneous architectures, including OpenACC. Although OpenACC is very abstract, the paper states that OpenACC is oriented towards current technology without a generalized view of hybrid or heterogeneous architectures, and it shows OpenACC's limitations in comparison with other programming models. In the authors' opinion, the division of the code into host code and offloaded accelerator code is similar to CUDA, which does not allow the host to support the accelerator by doing some of the accelerator's computations, but instead

just lets the host manage the control flow. Therefore, the paper tries to find a more generalized view and methodology for programming different architectures, exploiting both multicore and accelerator performance. It suggests extensions to the OpenMP and OpenACC programming models for a better generalization/abstraction for future systems. Unfortunately, no concrete approaches for the implementation are proposed. The problem is that current hardware architectures already work in a specific way, so the offloading principle is a first step in the right direction. Since OpenMP and OpenACC are meant to be merged in the future, the standard is developing along the lines of the paper's idea.

a) Serial (CPU) code

real(8) :: a(m,n), b(n,m)
do i = 1,m
  do j = 1,n
    b(j,i) = a(i,j)
  end do
end do

b) ACC directive code

real(8) :: a(m,n), b(n,m)
!$acc region
!$acc do
do i = 1,m
!$acc do
  do j = 1,n
    b(j,i) = a(i,j)
  end do
end do
!$acc end region

c) CUDA kernel

attributes(global) subroutine mt_kernel(m,n,a,b)
  real(8) :: a(m,n), b(n,m)
  integer, parameter :: bsize = 16
  j = (blockidx%x-1)*bsize + threadidx%x
  i = (blockidx%y-1)*bsize + threadidx%y
  b(j,i) = a(i,j)
end subroutine mt_kernel

d) CUDA host code

real(8), device, allocatable, dimension(:,:) :: a_dv, b_dv
integer, parameter :: bsize = 16
type(dim3) :: dgrid, dblock
allocate(a_dv(m,n), b_dv(n,m))
a_dv = a                          ! copy data to device
dblock = dim3(bsize,bsize,1)
dgrid = dim3(m/bsize,n/bsize,1)
call mt_kernel<<<dgrid,dblock>>>(m,n,a_dv,b_dv)
b = b_dv                          ! copy data back to host
deallocate(a_dv, b_dv)

Figure 6: Comparison between OpenACC-annotated and CUDA code. a) shows a matrix transposition example and b) the same with OpenACC directives. The corresponding CUDA code is in c) and d). [17]

Usually a programmer writing CUDA [14, 24] or OpenCL [12] has to know and consider the underlying hardware used by his application and how it behaves, even though OpenCL offers a higher abstraction than CUDA. He also determines the parts of the code that need to be accelerated, starts up and shuts down the accelerator, and writes a CUDA/OpenCL kernel by hand, considering memory allocation, memory copying, arranging input and output data the right way for fast computations, coalescing memory accesses, and so on. Now, if just the problem size changes, sometimes the kernel code must also be changed to be compatible with the hardware. The programmer has to focus more on device-specific code instead of implementing algorithmic enhancements [27]. It is easy to see that this is not an optimal approach.

Now, the question is how to effectively program applications for these systems. There typically exist three concepts, as illustrated in Figure 7: (1) using libraries that are optimized for certain tasks, (2) using compiler directives that guide the compilers, and (3) using new language constructs, which often produce the best performance but need more time to be written. The first two concepts are more convenient for the programmer. Programming libraries are often tailored to a problem, using optimized data structures and even considering architecture-dependent hardware characteristics (for cross-platform compatibility).

Figure 7: Typical ways to improve the performance of an application: programming libraries, compiler directives, and language constructs, ordered from easier to use towards faster.

From the developer's point of view, the focus can stay on the algorithm. Language constructs are widely used by CUDA and OpenCL, providing the possibility to write more performant, architecture-specific code. Usually this results in faster code; speedups of approximately 20 to 40 percent are common. Sometimes the speedup of hand-written code compared to compiler-generated code is even zero or negative, because today's compilers are becoming more and more powerful. CUDA and OpenCL both provide language constructs and programming libraries, where CUDA is technology-specific and runs on Nvidia GPGPUs only. Both languages are quite hard to program: the code must be structured into special accelerator functions (usually called kernels) invoked by the main (or host) application. Although OpenCL has a high-level API, it does not add much comfort for programming. After writing some kernel functions, the code has to be tuned and optimized to gain a performance benefit. Now, if the underlying system changes due to an upgrade, or if the application is to run on a different system, the code must usually be adapted, which is not always an easy task due to different CUDA compute capabilities on different hardware. Furthermore, debugging is one of the most challenging tasks for CUDA and OpenCL applications. Some alternative approaches to managing these difficulties are OpenMPC [21], hiCUDA [14] and OpenMP for Accelerators [6], as mentioned in Section 2.

OpenACC provides compiler directives and library functions for the programming languages C/C++ and Fortran, very similar to the well-known and widely used OpenMP API [1, 7] for Shared Memory Processor (SMP) systems. This approach is cross-platform compatible and may also result in a complete framework for heterogeneous systems in the future. The API is also portable; so-called runtime routines can run in different environments, in the presence as well as the absence of an accelerator [4]. OpenACC moves the acceleration and parallelization problems to library functions and the compiler, while still giving programmers the ability to guide the compiler via directives. The compiler itself has to manage cache coherency, data movement and so on, as OpenACC requires them to be done implicitly. In real-world scenarios, explicit guidance of the compiler via directives is necessary to achieve the best performance. The programmer uses OpenACC directives to mark compute-intensive code regions, which are then offloaded to the accelerator, i.e. the enclosed code is run only by one or more of the available accelerators. Only marked regions are accelerated. OpenACC distinguishes between several different regions; the most important ones are parallel regions and kernels regions. Typically, parallel regions contain work-sharing loops, where each iteration of the loop computes a fixed piece of work and is (at most)

75 10 independent of the other iterations. Kernels regions execute the code region as a kernel, i. e. typically one or more nested loops that are divided into domains and are executed by N threads in any order in parallel. In CUDA or OpenCL a single, but specific, function is called a kernel. OpenACC does not specify how the compiler has to partition the loops, so in the end the compiler is the factor responsible for the overall performance. In fact, it is still important that the programmer knows the capabilities of the accelerator, since current compilers do not automatically produce performant code. If the host and the accelerator do not share the same memory address space, which is the case most of the time, the programmer has to give more precise compiler directives in order to reduce data transfers between the host and the accelerator. Compilers cannot easily reduce the amount of data transfers by themselves. Additionally, limited accelerator memory can prevent the compiler from offloading regions to the accelerator [4].

4.1 Directives

The following definitions are taken from the OpenACC API v1.0 specification [4]. Compiler directives are specified using the preprocessor keyword #pragma in C and C++, and the comment keyword !$ followed by acc and the directive name in Fortran, so the compiler knows it is an OpenACC directive. The syntax for C and C++ is defined as:

  #pragma acc directive-name [clause [[,] clause]...] new-line

The square brackets [ ] indicate that the argument is optional. For Fortran the syntax is:

  !$acc directive-name [clause [[,] clause]...]

There are seven kinds of directives, sorted by their functionality, namely (1) parallel, (2) kernels, (3) data / host_data, (4) loop, (5) cache, (6) declare and (7) update / wait. Every directive has a validity domain that is called a region and is usually indicated by a structured code block, e. g. a for-loop construct.

(1) The most important construct is the parallel directive. Its syntax is:

  #pragma acc parallel [clause [[,] clause]...] new-line
    structured block                              (C/C++)

or

  !$acc parallel [clause [[,] clause]...]
    structured block                              (Fortran)
  !$acc end parallel

Whenever a parallel region is reached, groups of workers (also called gangs) are created. Then, each worker (that is, a thread or a group of threads) starts the execution of the structured block. During the execution of the parallel region, the number of gangs and workers is fixed. No other parallel region or kernels region can be executed inside a parallel region. The optional clause can be one of 72

76 11
  if( condition )
  async [( scalar-integer-expression )]
  num_gangs( scalar-integer-expression )
  num_workers( scalar-integer-expression )
  vector_length( scalar-integer-expression )
  reduction( operator:list )
  copy( list )
  copyin( list )
  copyout( list )
  create( list )
  present( list )
  present_or_copy( list )
  present_or_copyin( list )
  present_or_copyout( list )
  present_or_create( list )
  deviceptr( list )
  private( list )
  firstprivate( list )

Each parallel region has an implicit barrier at its end; execution continues after the parallel region only once all gangs have completed their computation. The async clause removes this barrier, so that the host does not have to wait for the gangs to finish their evaluations (like the nowait clause in OpenMP) and can continue its execution asynchronously to the accelerator's execution. This is useful if the host prepares new data to be processed on the accelerator. The if clause causes the compiler to create two copies of the parallel region, one on the host and one on the accelerator. If the condition within the clause evaluates to true, the copy on the accelerator will be executed, otherwise the parallel region is executed on the host side. The num_gangs clause and, analogously, the num_workers clause define the number of gangs and workers per gang in the parallel region, which can help to exploit all available accelerator compute cores. The clause vector_length simply defines the vector length for SIMD 15 instructions or automatically vectorized loops for each worker in a gang. The clause reduction is again the OpenACC counterpart of the OpenMP reduction clause. It is mostly used when, at the end of a parallel computation, all partial evaluations have to be combined into one result variable, e. g. summing up partial sums or finding the maximum/minimum over all partial evaluations. The variables listed in the deviceptr clause indicate that the variables point to the device's memory, so that no data has to be copied from the host to the accelerator. This is useful when several functions are applied several times to the same data that is already on the device. As for reduction, the private clause causes each gang to have its own copy of the specified variable, like in OpenMP. This is typically used for index variables or other variables that do not have to be shared between different loop iterations or gangs, so that no synchronization overhead arises due to cache/memory coherency. 15 Single Instruction Multiple Data 73
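To make the interplay of these clauses more concrete, the following small sketch annotates a dot product with a combined parallel loop directive, a reduction and explicit data clauses. It is an illustrative example only (the function name dot, the vector length N and the choice of the combined directive are assumptions, not taken from [4]):

  #define N 1000000

  /* Dot product offloaded with OpenACC.
   * copyin(...)      : x and y are copied to the accelerator and never copied back,
   * reduction(+:sum) : every gang/worker accumulates a private partial sum and the
   *                    partial sums are combined at the implicit barrier.          */
  double dot(const double * restrict x, const double * restrict y)
  {
      double sum = 0.0;
      #pragma acc parallel loop reduction(+:sum) copyin(x[0:N], y[0:N])
      for (int i = 0; i < N; i++)
          sum += x[i] * y[i];
      return sum;   /* the host sees the reduced value after the region */
  }

The loop index i needs no explicit private clause because loop control variables are private by default; any other per-gang temporaries would be listed in the private clause described above.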

77 12 The firstprivate clause is similar to the private clause except that each gang's copy of the private variable is initialized with the value it last had on the host before the parallel region was reached. For correct data movement between the host and the accelerator, several clauses give hints to the compiler. These clauses are very important to prevent unnecessary memory transfers, as such transfers are very time consuming and slow down the execution on the accelerator. E. g. in the simple case of a matrix multiplication of an NxM with an MxN matrix with M >> N, the resulting matrix has the small size NxN, so just a small part of the accelerator's memory has to be copied back to the host. The following clauses can be applied to variables, complete arrays or subarrays. The variable list inside the copy clause tells the compiler which data has to be copied from the host to the accelerator memory (in case they do not share the same address space) before the execution of the parallel region begins. After completing the execution, all this data is then copied back to the host. The clause variant copyin tells the compiler which data to copy only from the host to the accelerator memory before executing the parallel region, and copyout analogously tells which data only has to be copied back to the host after the execution. Subarrays can be defined with the syntax arr[ lower-index : length ], where the lower array index has to be constant. The subarray arr[5:n] denotes the n elements arr[5], arr[6], ..., arr[5+n-1]. The create clause results in memory allocation for the specified variable list on the accelerator, but with the difference that no data will be copied from the host to the accelerator or vice versa. This can be used for storing intermediate results on the accelerator which the host does not need to know about. The present clause indicates variables or arrays that are already available in the accelerator's memory, avoiding data movement, e. g. if the application defines variables which point to, or are part of, a larger dataset that has already been copied. The four clauses present_or_copy, present_or_copyin, present_or_copyout and present_or_create are used to first test whether the variables or arrays are already available in the accelerator memory and to copy(in/out) or create them if they are not available.

(2) The next important construct is the kernels directive. Kernels regions are typically used for multiple nested loops. In the following, only the C/C++ syntax is used for clarity; the Fortran syntax is analogous to that of the first construct.

  #pragma acc kernels [clause [[,] clause]...] new-line
    structured block

Whenever a kernels region is reached, the structured block is compiled and divided into a sequence of kernels that are executed in order. A kernel is simply a function that runs on OpenACC-compatible accelerators, just like kernels in OpenCL and CUDA, but with the property of being highly parallelizable. Usually, one nested loop is mapped to one kernel, where the body of the loop maps to the body of the kernel function. Kernels regions allow the same clauses as parallel regions except for num_gangs, num_workers, vector_length, reduction and (first)private. The semantics of the clauses also remain the same. The number of gangs and workers may differ for each kernel.

(3) The data construct can define variables, arrays or subarrays to be allocated on the device's memory.
For this data region one can specify how the data is transferred (if it is desired), just like the data clauses for the parallel and kernels directives. The allocation is valid for the duration of the region 74

78 13 where the data directive is specified, not for the surrounding region. The directive may appear inside other directives, commonly inside parallel or kernels regions, but can also enclose other directives.

  #pragma acc data / host_data [clause [[,] clause]...] new-line
    structured block

The host_data construct makes the address of the device's data available to the host and has only one possible clause, use_device(list), where the variables used in list must be present on the accelerator. The host_data construct may only appear within other regions.

(4) The loop construct only applies to a for-loop (do-loop in Fortran) that must immediately follow the directive. Its clauses can precisely describe the way the loop shall be parallelized.

  #pragma acc loop [clause [[,] clause]...] new-line
    for loop

Supported clauses are collapse, gang, worker, vector, seq, independent, private and reduction. The collapse clause is used when the loop contains other loops; it takes a natural number as an argument defining the number of nested loops to be associated with the loop region. The seq clause forces the loop to be processed sequentially, while the independent clause indicates that each iteration is independent. The clauses gang, worker and vector indicate whether the loop shall be parallelized among gangs, workers or vector operations. Some of the clauses may only appear in the context of parallel or kernels regions, see [4] for more details. The loop construct can be combined with the parallel or kernels construct: #pragma acc parallel loop ... and #pragma acc kernels loop ..., respectively.

(5) The cache construct may appear right before or within a loop.

  #pragma acc cache( list ) new-line

Array elements or subarrays can be specified to be kept in the highest possible cache level for processing the loop.

(6) The declare construct is used at the declaration of variables or arrays (but not subarrays), allocating them on the device's memory for the duration of the region in which the declare directive appears.

  #pragma acc declare declclause [[,] declclause]... new-line

All data clauses are valid for this construct, as well as the new device_resident clause, which indicates that the variables are only allocated in the device memory and not on the host, so the host may not be able to access them.

(7) There are two execution directives, namely the update and the wait constructs. The update construct may appear within an explicit or implicit data region, causing variables or arrays on the host to be updated with the values from the device, or vice versa.

  #pragma acc update clause [[,] clause]... new-line 75

79 14 The available clauses are host (specifying the variables/arrays to be updated on the host), device (specifying the variables/arrays to be updated on the device), if (updating only for a true condition) and async (updating the data asynchronously). The wait construct may appear anywhere in the application and causes the application to wait for an asynchronous task to finish before the next operations are executed.

  #pragma acc wait [( scalar-integer-expression )] new-line

If an argument is specified with the wait directive, the application waits for the asynchronous operation with the specified number. If no argument is specified, the wait directive causes the application to wait for all asynchronous activities to finish.

4.2 Library Routines and Environment Variables

Besides compiler directives, OpenACC provides the programmer with a number of library routines that are not mandatory. Programmers shall include openacc.h for C/C++, and openacc_lib.h or the openacc module for Fortran. When used, the application may be less portable in case a system does not support the OpenACC API. This can be bypassed by testing the _OPENACC preprocessor macro at compile time. In the following, the library functions are listed with a short description. The data type acc_device_t defines a type for accelerator devices. For convenience only the C/C++ functions are considered:

  int acc_get_num_devices( acc_device_t );   - returns the number of attached devices of the given accelerator type
  void acc_set_device_type( acc_device_t );  - sets the accelerator type to be used for parallel or kernels regions
  acc_device_t acc_get_device_type( void );  - returns the accelerator type used for the next regions
  void acc_set_device_num( int, acc_device_t ); - sets which device to use for the next regions
  int acc_get_device_num( acc_device_t );    - returns the number of the used device of the given device type
  int acc_async_test( int );                 - tests if all asynchronous operations associated with the given number have finished execution
  int acc_async_test_all( );                 - tests for completion of all asynchronous operations
  void acc_async_wait( int );                - waits for completion of all asynchronous operations associated with the given number
  void acc_async_wait_all( );                - waits for completion of all asynchronous operations
  void acc_init( acc_device_t );             - initializes the runtime for the given accelerator type
  void acc_shutdown( acc_device_t );         - shuts down the connection to the given accelerator type
  int acc_on_device( acc_device_t );         - tells whether the code is running on a particular device of the given type 76

80 15
  void* acc_malloc( size_t );                - allocates memory on the accelerator
  void acc_free( void* );                    - frees memory on the accelerator

Right now, there exist only two environment variables for OpenACC:

  export ACC_DEVICE_TYPE=NVIDIA  - defines the default accelerator type used when executing parallel or kernels regions, if the application is compiled to use multiple accelerator types
  export ACC_DEVICE_NUM=1        - defines the default device number to use when executing parallel or kernels regions

4.3 Limitations

The papers [22] and [28] state that all the directive-based approaches have more or less severe limitations, as they were developed for hybrid host+accelerator architectures and not for clusters or distributed systems. For OpenACC the following limitations apply:

(1) Only scalar reductions: there is no way to define and express complex or complete custom reductions, e. g. for finding the maximum over all computation results and storing the index together with this maximum.

(2) No critical sections or atomic operations (like, for example, in OpenMP): code that must not be parallelized due to side effects therefore has to be handled sequentially. OpenMPC implements this for a special case, when a reduction pattern is also defined.

(3) No fine-grained synchronization: the programmer can define updates on variables or wait for asynchronous operations, but there is no way to control fine-grained synchronization, e. g. within accelerated loops. It is not best practice to rely on this kind of synchronization, but for certain problems it may not be possible to avoid it.

(4) No function calls within accelerator regions: since current accelerators do not support function calls within highly parallel computations, OpenACC (or at least OpenACC compilers) does not allow function calls inside parallelized regions if they cannot be inlined.

(5) Limited pointer operation support: most instructions operate on array-based variables, but extensive pointer arithmetic is not supported, e. g. calculating addresses for future computations.

(6) Scalability only for host+accelerator architectures: current research and industry computer systems are large-scale distributed systems. To increase scalability, some sort of MPI 16 capability is needed to integrate data distribution, synchronization and parallelization for these systems.

(7) Non-transparent debugging: the abstraction of OpenACC implies hiding of information, so the developer does not have to see how his code is accelerated. When it comes to unwanted side effects, wrong results during computation or incorrectly translated directives, it is desirable to get a proper overview of how the directives are translated (if they are translated at all) and how the code is then executed. The compiler has to face this problem and generate appropriate information for the programmer, even for high-dimensional problems.

(8) No asynchronous data transfers: data transfers between host and accelerator cannot be asynchronous relative to each other; they can only be asynchronous with respect to the computation of the host or the accelerator. Some- 16 For more details, see 77

81 16 times it is desirable to have more than one transfer to or from the accelerator while it already processes data.

(9) No automatic exploitation of multiple accelerator cards: OpenACC only allows offloading regions to a specific accelerator card. If more than one accelerator is attached, the programmer cannot dynamically offload code to one of them, but instead has to define the accelerator he wants the next region to be offloaded to.

Not a theoretical but a practical limit is the dependency of OpenACC on good compilers. OpenACC hides the technical details and trusts the compiler to produce optimal code. Today, one may think that compilers are so powerful that they can optimize code in almost every case. This may be true for sequential applications, where compilers have been developed for over 20 years, but for highly parallel applications this is not the case, as the performance analysis in Section 5 shows. For parallel code the compilers have to be much more complex to parallelize non-trivial nested loops. Probably the biggest problems of almost all OpenACC compilers today are non-coalesced memory accesses, memory alignment and the reduction of memory transfers.

5 PGI Accelerator OpenACC Compiler

Before OpenACC was published, PGI already had a directive-based programming model, namely PGI Accelerator [2, 13, 3], with its own compiler supporting both C and Fortran. With version 12 of the compiler the OpenACC directives were integrated and the programming model (version 1.3) was slightly changed to obtain full compatibility. Many OpenACC directives were already part of the PGI compiler under other names, e. g. the PGI region directive matches the OpenACC kernels directive [3]. In the following, some aspects of the PGI compiler are mentioned from [3]. Currently, not all available OpenACC directives are implemented by PGI; missing are the host_data directive and three clauses of other directives [13]. Targeting other accelerator devices than the one used after acc_shutdown( acc_device_t ); is also not supported right now. On the other hand, the PGI compiler supports two clauses, mirror and reflected, that are similar to the OpenACC present data clause but offer more information to the compiler for automatically checking data availability. It further extends OpenACC by supporting non-linear arrays in the accelerator's memory. The binaries for invoking the compiler are pgcc and pgfortran for C and Fortran programs, respectively. OpenACC directives are enabled by adding the flags -acc or -ta=nvidia to the PGI compiler. The compiler can give additional feedback with the -Minfo flag and can also generate multiple versions of an application with the -fast flag for CUDA devices, one version for CUDA compute capability 1.0 (and higher) and one for compute capability 2.0 (and higher). In many benchmarks the PGI compilers show a very good performance compared to other compilers, see Section 5.3.

5.1 Compiler Techniques

One speciality of the PGI Accelerator programming model is the implicit and automatic reduction detection by the compiler and the implicit cache management. With the introduction of explicit OpenACC directives the PGI compiler now supports implicit and explicit reductions and cache behaviors [13]. The compiler supports offloading to an accelerator only for explicitly marked regions, no automatic 78

82 17 offloading is supported. In the absence of accelerator directives the application code can be parallelized among CPU threads with the -mp option, utilizing multicore architectures. As is standard in today's compilers and required by OpenACC, the PGI compiler supports implicit and explicit loop unrolling, parallelization and vectorization. Loops are also parallelized using loop tiling or strip-mining, i. e. loops are segmented into smaller chunks so that the loop chunks can be directly mapped to the accelerator hardware [25]. Concurrency is increased by using data-level (according to data dependencies) and task-level parallelism, generating tasks out of frequently called independent code blocks and executing as many of them concurrently as possible.

Figure 8: Problem of non-coalesced memory accesses of most accelerator compilers (coalesced vs. non-coalesced accesses).

The PGI OpenACC compiler analyses the code in different stages and processes all the gathered control data with a planner module [25]. How exactly the analysis and compile techniques are applied to the code is of course a business secret of PGI, but Section 5.3 shows that the PGI compiler has a lot of improvement potential, as other compilers can achieve much better results in special cases. The paper [17] states that one of the biggest problems of today's compilers are non-coalesced memory accesses. Figure 8 illustrates the problem. Achieving a beneficial alignment requires manipulating the code, but if this succeeds, the application's performance usually increases dramatically. The compiler also does not support function calls within parallelized regions that cannot be inlined. 79
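As a generic illustration of what coalescing means in practice (a sketch, not taken from [17] or the PGI documentation, and assuming the compiler maps the innermost loop to the vector/thread dimension), the iteration order of a nested loop determines whether neighbouring threads touch neighbouring memory addresses:

  #define N 4096

  /* Row-major matrix: element (i,j) is stored at a[i*N + j].
   * Variant 1: the inner loop runs over j, so consecutive vector lanes
   * access consecutive addresses -> coalesced accesses.               */
  void scale_coalesced(float * restrict a, float s)
  {
      #pragma acc kernels loop independent
      for (int i = 0; i < N; i++) {
          #pragma acc loop independent
          for (int j = 0; j < N; j++)
              a[i * N + j] *= s;
      }
  }

  /* Variant 2: the inner loop runs over i, so consecutive vector lanes
   * are a full row (N elements) apart -> strided, non-coalesced accesses
   * that waste most of the available memory bandwidth.                 */
  void scale_strided(float * restrict a, float s)
  {
      #pragma acc kernels loop independent
      for (int j = 0; j < N; j++) {
          #pragma acc loop independent
          for (int i = 0; i < N; i++)
              a[i * N + j] *= s;
      }
  }

Both variants compute the same result; whether a compiler can transform the second form into the first automatically is exactly the kind of optimization the cited papers find lacking.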

83 18 5.2 Alternative Compilers

CAPS HMPP 17 [11, 10, 25] provides a rich toolkit consisting of its own set of compiler directives, similar to the PGI Accelerator model, a runtime environment and compiler tools for C and Fortran. In HMPP-only applications, the programmer has to define so-called codelets, functions to be run on an accelerator, either generated by a tool or hand-written. This programming model of course requires more development effort than purely directive-based approaches like OpenACC or the PGI Accelerator model. CAPS is also part of the OpenACC effort, and the new HMPP version now supports OpenACC directives. The Cray Compiling Environment 18 is a set of compilers for C/C++ and Fortran, libraries and additional tools for code analysis and profiling. The compilers are mainly used in Cray supercomputer systems, but are not limited to them. The compilers support OpenACC and OpenMP directives. Not all OpenACC directives and clauses are supported, and the implementation is at a non-final stage, i. e. its behavior may change in the future. The accULL [25, 27] OpenACC compiler was developed by the HPC group at the University of La Laguna in Spain and is, unlike other OpenACC compilers, the first open OpenACC compiler with both CUDA and OpenCL support. It is a two-layer approach consisting of a source-to-source compiler and a runtime library called Frangollo. The compiler is based on their own YaCF compiler framework 19. It generates a hierarchical project structure with compile instructions ready for compilation instead of generating a binary file. This gives skilled programmers the possibility to apply further optimizations. The OpenACC annotations are translated into calls to the runtime library, which itself generates OpenCL and CUDA structures based on an analysis of the code. Currently, not all OpenACC directives and clauses are supported, but the most common ones are (see [27]). The evaluation of the compiler was done on different server, workstation and desktop environments with Tesla GPGPUs, onboard GPUs and non-accelerated multicore CPUs. A molecular dynamics simulation and a Mandelbrot set computation were benchmarked, among others [27]. accULL surprises with a performance comparable to OpenMP on systems with no GPU or onboard graphics. Although it is not a commercial compiler, other performance comparisons between accULL and the PGI and CAPS compilers show that accULL does not need to hide behind the commercial compilers, see Section 5.3 for details. The compiler framework supports the integration of other commercial compilers to take advantage of pre-existing features like vectorization and memory allocation techniques.

5.3 Benefits and Performance Analysis

This section presents some performance comparisons between the PGI Accelerator compiler and others. The paper [25] tried (from the authors' point of view) to create real-world implementation scenarios that an average scientist or engineer would produce. The evaluation covers a simple matrix multiplication, the HotSpot thermal simulation, non-linear DNA sequence alignment optimization and LU decomposition. Evaluated compiler environments were OpenMP, PGI Accelerator, hicuda and accULL. The PGI compiler shows the best performance for simple matrix multiplication and DNA sequence alignment, but 17 See product website, 18 Cray Compiling Environment Release Overview and Installation Guide at 19 For more information see 80

84 19 scales badly for the HotSpot problem. No compiler was able to produce faster code than natively compiled hand-written CUDA code. The papers [23] and [26] by the accULL developers show a direct performance comparison between the CAPS, the PGI and the accULL compiler for the LU decomposition, HotSpot, Path Finder and Matrix Multiplication problems. After the first release of the accULL compiler, the evaluation shows a great increase in the performance of accULL, so that PGI is slower on average. The CAPS compiler is the slowest of all three and is about half as fast as the accULL compiler in the tested benchmarks. Figure 9 shows the implementation and Figure 10 the result of the performance measurement.

  #pragma acc data copyin( power[0:row*col], _resultado[0:row*col] ) copy( _temp[0:row*col] )
  {
    for ( i = 0; i < num_iterations; i++ ) {
      #pragma acc kernels loop private( r ) independent
      for ( r = 0; r < row; r++ ) {
        #pragma acc loop private( c ) independent
        for ( c = 0; c < col; c++ ) {
          double delta; // Start computation

Figure 9: Annotated loops in the HotSpot problem using OpenACC directives. [23]

Figure 10: Average performance of all tested benchmarks. [23]

The paper [17] compares different directives for optimizing code with the PGI compiler version 11 against native CUDA code. It shows that the best directive-based approach is about 30-40% slower than 81

85 20 native CUDA code for two different computing environments. It states that the performance achieved on GPUs is far away from being satisfactory, but is acceptable due to the fact that the development effort is very small compared to CUDA. The key to performance is adjusting the directives to minimize data transfers and memory accesses, but this often also requires a change in the code. Figure 11 shows one performance analysis of the directive-based approach, which OpenACC in version 12 of the PGI compiler should also fulfill, compared to CUDA-accelerated code.

Figure 11: Performance comparison of a double precision matrix multiplication (Gflop/s over problem size) on both GPUs and CPUs for a) Pleiades-GPU and b) hyperwall-GPU; the curves compare cuda (simple, cached), acc directive and host (simple, blocked) versions. [17]

The paper [16] took two existing CUDA applications, annotated them with OpenMP, CAPS HMPP and PGI Accelerator directives and measured the performance. Then, they tried to adjust the directives so that the translated code more or less matches the hand-written CUDA code. Since OpenACC is related to the PGI and the HMPP programming models, the results should be valid for OpenACC as well. The evaluation confirms the conclusions of the other papers that directive-based programming leaves much room for improvement and achieves about 30-40% of the CUDA performance on average. Figures 12 and 13 show the results of the performance comparison after all adjustments were made to optimize the code produced by OpenMP, CAPS HMPP and PGI. The paper [28] presents another performance analysis of OpenACC with two real-world applications, namely a Bevel Gear Cutting Simulation used in engineering and a computation of the Neuromagnetic Inverse Problem from the field of medicine. The evaluation was done on a 12-core AMD processor with a Tesla C2050 GPGPU and on a 4-core Intel Westmere processor. They compared an OpenCL implementation with an OpenACC and a PGI Accelerator annotated version and also showed the performance gap between PGI Accelerator and OpenACC directives. The results can be seen in Figures 14 and 15. They also measured the programmability/productivity by the number of changed lines of code for each programming model. In the engineering application, OpenACC reaches about 80% of the performance of the best-effort OpenCL version, which is quite high considering the programmability. The medicine application is more complex, so OpenACC loses much performance and achieves about 40% of the OpenCL implementation. The paper states that this distressing result may be improved by compiler optimizations and is still encouraging, because the OpenACC implementation 82

87 22
Figure 14: Simulation of the Bevel Gear Cutting in engineering. [28]

Figure 15: Neuromagnetic Inverse Problem in medicine. [28]

Tesla X2090 GPGPU connected via a high-speed interconnect by Cray, utilized by MPI/OpenMP. The Himeno benchmark was annotated and optimized with OpenACC directives enabling asynchronous data transfers and compiled with a pre-release Cray compiler. The results show a nice scalability curve for the benchmark, with the most optimized OpenACC version with asynchronous data transfers as the fastest version. With an increasing number of nodes, the time needed for data transfers becomes almost as high as the actual kernel execution time on the accelerators, which is less efficient. Nevertheless, the performance is in the range of 0.5 to 4.5 teraFLOPS for the whole system, which is far away from the peak performance of the Cray XK6. The problem may of course be the used benchmark application. It would be nice to see the scalability for several thousands of nodes, as exa-scale computing expects multiple orders of magnitude more than the benchmarked performance, and whether OpenACC can really maintain the same scalability as for up to 128 nodes. A very interesting point is that the whole benchmark (consisting of about 670 lines of code) was entirely ported to the GPU using only 26 additional directives for the blocking version and 29 for the asynchronous version. The asynchronous clause automatically increased the overall performance by 5 to 84

88 23 10%. 6 HSA This chapter covers an effort of AMD to develop a highly integrated heterogeneous computing architecture and therefore both the hardware and the software. The Heterogeneous System Architecture (HSA) came out of the AMD Fusion architecture, which integrates CPU and GPU on one single chip. HSA is another name and targets the generalization of the fusion approach developing new methods under an open industry standard to create and efficiently exploit parallelism of future heterogeneous architectures. Therefore, the HSA foundation 23 was initiated in mid 2012 by AMD together with ARM, Samsung, Texas Instruments among others and forming working groups for different application domains. Right now, this initiative gains more and more interest by the industry and developers since the market leader in embedded systems ARM has become a member of HSA. Figure 16: Desired HSA position in software development. [5] One interesting aspect of HSA is the combined address space for the CPU and the GPU using clever address mapping mechanisms 24. The shared address space enables a great performance boost of GPGPU accelerated applications since the memory transfer time, the largest limiting performance factor, is drastically reduced. Developing applications for these systems will be much easier than for current hybrid architectures. For the upcoming 3rd generation of AMD s APUs and the future, much effort goes into increasing the performance for HPC application domains. For exploiting future heterogeneous architectures, HSA realized the importance of having a big developer community and supports them with applications, compilers, profiling and optimizer tools, runtime 23 See for more information 24 AMD I/O Virtualization Technology (IOMMU) Specification, see TechDocs/48882.pdf 85

89 24 libraries and lots of learning resources. As many developers have experiences in many different programming languages, HSA wants to support a wide range of programming languages (currently C/C++, C#, Java and some functional languages are supported). On the software side, HSA wants to be as easy to program as possible and therefore adapts even its hardware architectures (as seen in current AMD APUs and Opteron Server CPUs). It enables automatic work sharing between CPU and GPU and provides extensions to OpenCL, which are already used by Adobe products like Photoshop. Currently, a Visual Studio plugin C++ AMP is available, where programmers can offload code regions to the accelerator by two additional language keywords. Programmers also do not have to worry about cache coherency anymore. HSA provides context switching of threads on accelerators, so accelerators can be programmed a lot like CPUs. Also a new intermediate language (IL) called HSAIL was developed. HSA should not be mixed up with OpenCL, as it provides lots of own extensions to OpenCL and has its own development and runtime environment. All in all, it is a trend-setting approach and the support from the industry is definitely very promising. 7 Conclusions This paper presented the OpenACC programming model in detail with remarks on its application field, benefits, limits and tradeoffs and showed the potential that lies in this approach. Additionally, the PGI compiler was presented among alternative compilers and performance measurements. Without exceptions, all used papers have a positive opinion about OpenACC. Although the performance is in average 30-40% less than hand-written CUDA or OpenCL code, it is very good with respect to higher productivity and low knowledge requirements for the programmer. As the market is changing continuously it is open whether OpenACC gets the state of the art or not, but its foundations are solid and enjoy a big developer base. This also depends on the improvement of available OpenACC compilers. The approach is very promising and in fact demonstrates many improvements and possibilities over existing approaches. 86

92 Dataflow Programming on GPUs Maximilian Senftleben University of Kaiserslautern, Embedded Systems Group m 1 Introduction The ongoing paradigm shift towards parallel programming and computation offers much potential to improve computation performance, but likewise requires more advanced knowledge of parallel programming. Heterogeneous systems with multi-core CPUs, GPUs, and FPGAs offer access to different forms of parallelism (data and task parallelism), but there are few high-level parallel programming models available which do not require extensive knowledge of the devices. The programming of hybrid systems is very complex, often too difficult for mainstream programmers. The massive parallelism introduced by GPUs can be better exploited by using dataflow programming. This document gives an overview of different dataflow programming models which abstract from the most complex programming aspects. Each model's key characteristics are described and its results concerning performance are presented. The document closes with a brief comparison of the presented models and their benefits. 1.1 Dataflow programming Dataflow programming models a program as a directed graph consisting of nodes, which represent computations, and edges, which represent data connections. Dataflow focuses on the connection structure of a program, the path the data takes through it. Dataflow programming reduces the need for global state information of a program, as the data flowing between the nodes and their internal state characterize the system's state. This paradigm is inherently parallel, as each node can operate as soon as it has its input data, not requiring any state information from the rest of the system. To run a dataflow program, only a method to coordinate and buffer the messages passed around is needed to maintain the system's state, which is handled by the language's runtime. 89

93 2 2 General This chapter gives a brief overview of the matter to be discussed and the foundations it is based on. 2.1 SystemC SystemC is a high-level system design language based on C++. It extends C++ by constructs to model properties of hardware systems: parallelism, scheduling and synchronization via modules, processes and channels. SystemC also provides its own event-driven real-time simulation kernel. The language, being open-source and very similar to C++, is widely used at universities to model hardware systems. 2.2 CUDA [9] CUDA (Compute Unified Device Architecture) is an architecture developed by NVIDIA that allows writing programs which can be executed on GPUs. Most often C FOR CUDA is used to program code for a GPU. C FOR CUDA is based on Standard C, extended/restricted with NVIDIA modifications. Figure 1: CUDA processing flow Figure 1 shows the processing flow of CUDA applications: first, all required data is copied into the GPU memory, then the CPU instructs the GPU to start the processing, then the GPU executes the program in parallel on each core, and finally the results are copied back to the main memory. 2.3 Message Passing Interface (MPI) [11] The Message Passing Interface (MPI) is a standard for message exchange in parallel computations among (potentially) distributed systems, first published in 1994. It aims to provide high performance, scalability and portability. MPI does not define a concrete implementation but defines a set of operations and their semantics (the interface). Implementations are available on a wide range of machines, and bindings for different programming languages exist (e.g. C, Fortran, Python, OCaml). MPI supports point-to-point and collective communication. 90
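As a minimal illustration of blocking point-to-point communication (a generic sketch, not taken from the paper), two MPI processes can exchange a value as follows:

  #include <mpi.h>
  #include <stdio.h>

  /* Rank 0 sends an integer to rank 1 with a blocking send;
   * rank 1 receives it with a blocking receive.              */
  int main(int argc, char **argv)
  {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* blocking */
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);                                  /* blocking */
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }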

94 3 An MPI program consists of multiple processes, each executing its own code, which communicate via calls to MPI communication primitives. MPI calls can be local, non-local, blocking, nonblocking, or collective. 2.4 Polyhedral Process Network [13] Polyhedral Process Networks are a subclass of Kahn Process Networks. Kahn Process Networks consist of processes and channels (unbounded FIFOs); synchronization happens via blocking reads. PPNs, on the other hand, have only bounded FIFOs/memories and use blocking writes to cover this restriction. Figure 2: PPN Example (a), example process P2 (b) [6, Fig. 3] Figure 2 shows an example PPN and an example of what a process may look like. Definition [13]: A polyhedral process network is a directed graph with a set of processes P as vertices and communication channels C as edges. Each process P_i ∈ P has the following characteristics: a statement identifier s_i, a dimension d_i, and an iteration domain D_i ⊆ Z^(d_i). Each channel C_i ∈ C has the following characteristics: a source process S_i ∈ P, a target process T_i ∈ P, a source access identifier corresponding to one of the accesses in the statement s_(S_i), a target access identifier corresponding to one of the accesses in the statement s_(T_i), a polyhedral relation M_i ⊆ D_(S_i) × D_(T_i) mapping iterations from the source domain to the target domain, a type (e.g., FIFO), and a piecewise quasi-polynomial buffer size. 91

95 Static Affine Nested-Loop Programs (SANLP) [14] SANLP is a subset of the C language. An SANLP consists of a set of statements, each possibly enclosed in loops and/or guarded by conditions (nested). Its control flow is known at compile time (static) and it only consists of expressions of the form ax+b (affine). The integer set called iteration domain is the set of iterator vectors for which a statement is executed. Its linear inequalities express the lower and upper bounds of the enclosing loops PPN Extraction from SANLP [13] [14] [12] Figure 3: Derivation of a PN [12, Fig. 2] As shown in Figure 3 the extraction of a PPN from a SANLP is achieved in four steps. 1. Preprocessing: The SANLP is converted to a network representation. A single process representing all executions of one assignment statement. 2. Consumption Restructuring: The data consumption is restructured such that each array written to by different processes is replaced by separated memory arrays for each producer process. 3. Production Restructuring: The data production is restructured such that each array read by different processes is replaced by separated memory arrays for each consumer process. 4. Communication Model Selection: Depending on the producer/consumer pair different types of communication and synchronization mechanism are used to derive a valid PPN. 2.5 Concurrent Collections (CnC) [5] Concurrent Collections (CnC) is a parallel programming model developed by Intel. It is influenced by stream processing, dynamic dataflow, and tuple spaces. CnC defines three main constructs step collections, data item collections, and control tag collections. Each collection represents a set of dynamic instances. A step collection corresponds to a computation and its instances correspond to an invocation of the computation. A data collection consists of a set of data items indexed by item tags. Data items are accessed via get/put operations on the collection. They are required to be immutable and can only be put once. Control tag instances are used for control. A 92

96 5 put operation on a control collection prescribes (creates) step instances of some step collections with the control tag as input. A CnC program is defined statically as a CnC (specification) graph which defines the collections and their relationships. In a CnC graph a node represents a collection while a directed edge represents a put, get or prescribe operation. Figure 4: CnC graph example [5, Fig. 1] Figure 4 shows an example of a CnC graph. Rectangles represent data item collections, ellipses step collections, and hexagons control tag collections. Dotted edges represent prescription operations, and arrows represent get/put operations of data items (production/consumption). Environment communication is represented by squiggly edges. A whole CnC program consists of the specification, code for each step, implementing the computation for each node, and the environment, the user code which interacts with the CnC graph. Data instances can be produced and consumed by the environment, control instances can be produced by the environment and used to prescribe conditional execution. The collection tag usage is defined as follows: Putting a tag into a control collection will cause the corresponding steps to eventually execute, when their input is ready. The execution of a step takes the tag indexing the step instance as input argument which contains the information to compute the tags of all its input and output data. Data collection tags are used as indices in an associative container, in which an element indexed by one tag can only be written once. This immutability provides determinism. 93

97 6 3 SysCellC [8] 3.1 Overview In this publication a compile flow is presented which constructs an implementation on a multi-GPU cluster system for a given SystemC program. To this end, the program is mapped to the GPU API while SystemC channels abstract the communication between GPUs. 3.2 Approach The aforementioned compile flow is described in seven steps, shown in the overview in Figure 5. Figure 5: SysCellC design flow [8, Fig. 1] Step 1 Starting with application code in SystemC (with sc_module for modeling computation processes and the SystemC primitive channels sc_signal and sc_fifo modeling streams), the processes are divided into two types: computation-intensive ones and others which are dedicated to application monitoring, environment communication and (CPU) memory management. The computation-intensive ones are mapped onto the GPUs, while the other ones are mapped to the CPUs. Due to the (usually) great amount of data to be processed in comparison with the smaller GPU video memory, the data has to be sized and tiled to optimize the overlapping of communication and computation. Sometimes applications require data to be prefetched from CPU memory; the SystemC application has to take care of all these prefetching processes. The processes mapped onto GPUs are subject to some restrictions to express the synchronization between concurrent components: they are not allowed to have wait() primitives and they should only be sensitive to their sensitivity list. The processes are only sensitive to a signal that can be viewed as a clock. Therefore, a process may only block when it is finished. 94

98 7 Step 2 The next step is the manual partitioning of the SystemC code into a computational, data-parallel part, which is mapped to the GPUs, and the remaining part, which is mapped to the CPUs, guided by profiling information. Step 3 This step consists of the transformation of the SystemC code into an XML intermediate representation by the provided SCXML parser. Each XML file represents a SystemC component and contains its most important characteristics:
  - in/out ports (name, type and size), declared processes (name and type), sensitivity list
  - structural information: name and type of components in a hierarchical tree, names of subcomponent connections and component port bindings
SystemC's sc_signal and sc_fifo channels used for intra- and inter-cluster-node communication are overloaded and implemented with the MPI version 2 (MPI-2) standard. Step 4 The SystemC components are allocated to the different GPUs and CPUs using the SYNDEX tool, which in turn uses the XML files and profiling reports. SYNDEX inputs:
  - a hierarchical conditioned data-flow graph of computing and communication operations with their data types and sizes and their components' execution times,
  - a graph representing the architecture specification, composed of processors and communication media.
Each processor is characterized by its supported tasks, execution times (obtained during profiling) and worst-case transfer times for each type of data on the interconnect (obtained by data size estimations). SYNDEX uses a heuristic for mapping and scheduling asynchronous tasks (i.e. communication through sc_fifo). Step 5 Using the previously gathered information, the C code for CPUs and GPUs is generated. It embeds a lightweight SystemC scheduler on the CPUs to preserve the SystemC model's operational semantics. The code is architecture-independent due to GPU library overloading and the implementation of the MPI-based SystemC channel interface library. The GPU kernel launcher function (on the SystemC level) can call a GPU kernel or launch a CPU multithreaded version for code verification in a CPU environment. Step 6 A single multithreaded binary for the CPUs and GPUs is compiled from the C code using the tool SYSCELLC. The MPI standard is used to implement the SystemC channel interfaces. Step 7 In this step the implemented system is used to generate profiling information for reuse and optimization in step 4. 95

99 8 3.3 Results The described approach was applied on three test cases (a producer/consumer case, a CDMA radiocommunication system and a visual attention model) and the resulting code was compared with the native SystemC execution on 1 CPU. The sizes of the generated C source code and the original SystemC code are similar, which means the described technique does not introduce bloated code. The execution time of the resulting code was between 10 and 35 times faster ( 10 for the CDMA test case, 35 for the other test cases) than the native implementation on 1 CPU. 96

100 9 4 Efficient Stream Buffer Mechanism for Dataflow Execution on Heterogeneous Platforms with GPUs [2] 4.1 Overview The publication describes the approach to map streaming applications using a Process Network (PN) model of computation onto heterogeneous architectures. As Figure 6a shows, the PN model provides coarse-grained task and pipeline parallelism while optimization of individual nodes may provide finegrained data parallelism. Figure 6: Data-Driven Execution: Asynchronous Processing + Stream Buffer Communication [2, Fig. 3] 4.2 Approach There exist compiler techniques that derive Polyhedral Process Networks (PPN) from a class of sequential nested loop programs, e.g. the pn compiler [14] which works on Static Affine Nested-Loop Programs (SANLP, subset of C language). A prototype framework for mapping the PPN onto a hybrid system was build using the pthread library for the multi-core CPU workload and CUDA 4.0 API for access to the GPU. The framework is based on the asynchronous execution of dataflow independent computations and coarse-grain pipeline parallelism in a PN model. Furthermore, a Stream Buffer (SB) mechanism is introduced which provides the functionality to enable pipelined CPU-GPU communication/execution. For each nested loop body statement a process is generated in a PPN. The node domain of a process is constructed by its iteration space. Channels are used for passing the input and output of processes as data tokens between processes, which are blocked until input data becomes available. The processes are mapped to threads on either GPU or CPU and execute asynchronously and pass the following 97

101 10 execution phases for each token (single frame): stream data in - blocking read, computation - function execution, stream data out - blocking write. The computation corresponds to the execution of the nested loop body. It can optionally be optimized by polyhedral techniques (e.g. locality or special GPU support) [4, 1, 3] The iteration domain of GPU nodes can be mapped to the N-dimensional range of the CUDA-kernel, s.t. the computation function is executed by a large number of lightweight CUDA threads in parallel. Different kernels are executed in different and parallel CUDA streams if the target device supports concurrent kernel execution. The SB mechanism uses dataflow to exploit pipeline parallelism on a hybrid platform. It allows data transfer between host and GPU device to happen concurrent to computations. As all communication of the PPN model is point-to-point each channel is implemented by a Stream Buffer. The implementation itself is based on a Circular Buffer and pointers to reduce data movement in FIFO usage. Semaphores emptycount and fullcount realize blocking write to and blocking read from the channel. The Stream Buffers are implemented using a distributed memory approach using double buffering and its communication by asynchronous memory transfers. Additional threads monitor the transfers and signal data availability to the blocked (waiting) processes. Figure 6b-e illustrates the implementation of the Stream Buffer mechanism. 4.3 Results Using the asynchronous execution model and stream support an I/O transfer and computation pipeline is realized. The results obtained from a synthetic producer-transformer-consumer streaming application predict that synchronization overheads are rather low and good overlap of memory transfers and computations can be achieved. 98
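The blocking-channel behaviour described for the Stream Buffer can be sketched in plain C with a circular buffer guarded by two counting semaphores. This is a simplified, generic illustration under a single-producer/single-consumer assumption; the names and sizes are not taken from the paper, and the paper's actual implementation replaces the copies by pointer exchange, double buffering and asynchronous host-device transfers:

  #include <semaphore.h>
  #include <string.h>

  #define SLOTS 8            /* number of token slots in the channel       */
  #define TOKEN_BYTES 4096   /* payload size of one token (e.g. one frame) */

  /* One point-to-point channel: emptycount blocks the writer when the
   * buffer is full, fullcount blocks the reader when it is empty.      */
  typedef struct {
      char  data[SLOTS][TOKEN_BYTES];
      int   head, tail;
      sem_t emptycount;      /* initialised to SLOTS */
      sem_t fullcount;       /* initialised to 0     */
  } stream_buffer;

  void sb_init(stream_buffer *sb)
  {
      sb->head = sb->tail = 0;
      sem_init(&sb->emptycount, 0, SLOTS);
      sem_init(&sb->fullcount, 0, 0);
  }

  /* Blocking write: waits until a slot is free, then publishes a token. */
  void sb_write(stream_buffer *sb, const void *token)
  {
      sem_wait(&sb->emptycount);
      memcpy(sb->data[sb->head], token, TOKEN_BYTES);
      sb->head = (sb->head + 1) % SLOTS;
      sem_post(&sb->fullcount);
  }

  /* Blocking read: waits until a token is available, then consumes it. */
  void sb_read(stream_buffer *sb, void *token)
  {
      sem_wait(&sb->fullcount);
      memcpy(token, sb->data[sb->tail], TOKEN_BYTES);
      sb->tail = (sb->tail + 1) % SLOTS;
      sem_post(&sb->emptycount);
  }

With exactly one producer and one consumer per channel, head and tail need no additional locking, which matches the point-to-point nature of PPN channels.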

102 11 5 CnC-CUDA [7] 5.1 Overview In this publication Intel's Concurrent Collections (CnC) programming model is extended to a model called CnC-CUDA to address hybrid systems as well. The paper includes a definition of multithreaded steps for GPU execution and the automatic generation of data and control flow between CPU and GPU steps. Furthermore, a CnC implementation based on Java is presented and used as the foundation for the CnC-CUDA implementation. 5.2 Approach CnC was implemented in Habanero-Java (HJ), a programming language developed at Rice University, because it includes useful constructs to implement CnC primitives, as shown in Table 1.

  CnC construct              | Translation to HJ
  ---------------------------|----------------------------------------------------
  Tag                        | Java String object / point object
  Prescription               | async or delayed async
  Item Collection            | java.util.concurrent.ConcurrentHashMap
  put() on Item Collection   | Nonblocking put() on ConcurrentHashMap
  get() on Item Collection   | Blocking or nonblocking get() on ConcurrentHashMap

Table 1: Summary of mapping CnC primitives to HJ primitives [7, Tbl. 1]

The mentioned CnC programming model was extended in order to support CUDA steps efficiently. First, the graph syntax was extended to support GPU steps in addition to CPU steps and to specify constants in the graph file, which can later be used in CPU (HJ) and GPU (CUDA) code. These constants are used to declare correctly sized item collections for the exchange between CPU and GPU. The CnC parser generates access functions for each item collection. The item collections maintain the standard Put and Get access methods for each individual data item, which are put into a ConcurrentHashMap. As soon as enough tags have been put, the corresponding items in the ConcurrentHashMap are collected, converted to a C-friendly format (e.g. replacing Java primitive datatype wrapper classes by their primitive types) and passed to CUDA. Because those single Put and Get operations incur a significant performance overhead, the primitives PutRegion and GetRegion are introduced as a much more efficient alternative. PutRegion/GetRegion allow the programmer to put/get a (potentially multidimensional) region of integers associated with a similarly dimensioned array of items. This array can be directly passed to the CUDA kernel and eliminates the get/put operations for each individual item. Tag collections are automatically generated using the type definitions in the graph file. Tag collections control the execution and synchronization of computation steps. Depending on the functionality of the device, a mutex indicating the usage, and therefore the availability for new computations, is used. The number of CUDA computation steps that can be prescribed by another CUDA computation step is limited to one. Therefore, if one CUDA step prescribes another one, the second one is invoked immediately after the first one on the GPU without returning to the CPU. Synchronization between host and device computation steps is achieved by calling a CUDA tag collection's Wait() method, which blocks 99

103 12 until all launched kernels returned and their result was transfered to the main memory. The PutRegion operation places a region of integer tags into the tag collection and immediately launches a CUDA kernel for all tags in the range once all required data items are available. The item collection property One-For-All passes the same data to each thread on a device and can result in better memory usage and performance. Only the actual CUDA kernel must be written, the translator generates stub codes that allocate memory and copy the data structures to the device before a step is executed and free the device memory after the kernel finishes. For now CUDA kernels can only put a single item on each item collection of its outputs. 5.3 Results Four different benchmarks were used to compare the runtime on different programming models: Fourier coefficient analysis (Series, Java Grande Forum (JGF) benchmark suite), successive overrelaaxation (SOR, JGF), IDEA encryption (Crypt, JGF), Heart Wall Tracking program (Rodinia benchmark suite). Each was run on varying data sizes using CnC-CUDA, CnC-HJ, Serial C, hand-coded CUDA, and the original single-threaded Java benchmark. The measured runtime included the memory transfer overhead. The execution of the GPU CnC-CUDA code compared to the CPU CnC-HJ code led to a speedup in almost every benchmark run between from factor 2 up to 400. In most benchmarks the speedup grows with the data size, except for the JGF Crypt benchmark, where the speedup for varying data sizes stays between 2 and 3. The JGF Series benchmarks profit most from CnC-CUDA in greater data sizes: the speedup grows from to when the data size is increased by a factor of 100. The benchmark results show that the performance speedup of excessive parallelism on GPUs can be used by non-device-expert programmers as well. 100

6 PTask: operating system abstractions to manage GPUs as compute devices [10]

6.1 Overview A set of OS abstractions called the PTask API is introduced. The dataflow programming model it supports uses a directed acyclic graph to assemble individual tasks: vertices are called ptasks and represent executable code, and edges represent the data flow between the vertices. PTask's main goals are the use of a single resource manager that provides fairness and isolation guarantees, a dataflow programming model that abstracts from device management, and a programming environment that allows code to be both modular and fast.

6.2 Approach The PTask API is built on different OS-level abstractions: PTask, Port, Channel, Graph, Datablock and Template. A ptask is similar to the well-known OS process abstraction; while it is mainly executed on a GPU (or a similar device), it is managed by the OS and provides input and output resources which can be bound to ports. A port is a kernel namespace object which can be bound to ptask input and output resources and represents a data source or sink. A channel connects a port to another port, or to other data sources and sinks in the system. The collection of ptasks connected via their ports by channels represents a graph. A datablock represents a unit of data flow in a graph. A template provides meta-data describing datablocks to assist in mapping datablocks to threads on a GPU. New system calls address these new abstractions, e.g. sys_push inserts data into a channel, blocking if its capacity is reached, and sys_pull retrieves data from a channel, blocking if it is empty. The stand-alone user-mode library implementation supports ptasks coded in HLSL (DirectX), CUDA, and OpenCL. The PTask runtime can schedule multiple independent graphs in parallel and takes care of fairness and efficiency. A ptask can be in one of four states: Waiting, Queued, Executing, or Completed. The prototype implementation supports four different scheduling modes: first-available, FIFO, priority, and data-aware. In first-available mode every ptask is assigned a manager thread, and these threads compete for available accelerators. The FIFO mode enhances the first-available mode with queuing. Priority mode enhances ptasks with a static priority and a proxy priority (the OS manager thread priority). Data-aware mode works similarly to priority mode, but takes data memory spaces into account, such that it prefers accelerators where most of the data is already up to date.
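To make the dataflow model concrete, the following is a small, self-contained mock of the graph, port and channel idea in plain C++: queues stand in for channels and ordinary host functions stand in for GPU code. It is only an illustration of the programming model, not the actual PTask API of [10], whose objects live in the OS kernel and are accessed through system calls such as sys_push and sys_pull.

// Conceptual mock of the PTask dataflow model (not the real PTask API from [10]):
// ptasks are vertices, channels connect their ports, and a ptask fires once all
// of its input channels hold a datablock.
#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

using Datablock = std::vector<float>;
struct Channel { std::deque<Datablock> q; };              // edge: carries datablocks

struct PTaskMock {                                        // vertex: executable code
    std::vector<Channel*> inputs, outputs;
    std::function<Datablock(const std::vector<Datablock>&)> body;  // stands in for GPU code
    bool ready() const {
        for (auto* c : inputs) if (c->q.empty()) return false;
        return true;
    }
    void fire() {                                         // consume one datablock per input port
        std::vector<Datablock> args;
        for (auto* c : inputs) { args.push_back(c->q.front()); c->q.pop_front(); }
        Datablock out = body(args);
        for (auto* c : outputs) c->q.push_back(out);      // forward the result downstream
    }
};

int main() {
    Channel src, mid, sink;                               // host->filter, filter->detect, detect->host
    PTaskMock filter{{&src}, {&mid}, [](const std::vector<Datablock>& in) {
        Datablock out = in[0];
        for (float& v : out) v *= 0.5f;                   // "filter" stage
        return out;
    }};
    PTaskMock detect{{&mid}, {&sink}, [](const std::vector<Datablock>& in) {
        return Datablock{in[0][0] > 1.0f ? 1.0f : 0.0f};  // "detect" stage
    }};

    src.q.push_back({4.0f, 2.0f});                        // analogous to sys_push on an input channel
    std::vector<PTaskMock*> graph{&filter, &detect};
    for (bool progress = true; progress; ) {              // naive scheduler: fire any ready ptask
        progress = false;
        for (auto* t : graph) if (t->ready()) { t->fire(); progress = true; }
    }
    std::printf("result: %.1f\n", sink.q.front()[0]);     // analogous to sys_pull on the output channel
}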

6.3 Results The PTask implementation was evaluated with a gestural interface on Windows 7, an encrypted file system on Linux, and microbenchmarks. Five different implementations of the gestural interface are compared: a host-based CPU-only version, a hand-coded GPU-optimized version, a piped version (four different processes connected by pipes), a modular version which combines all processes of the piped version, and a version implemented with the PTask API. The evaluation uses a Core2-Quad CPU and an NVIDIA GTX 580 GPU. The PTask version achieves a higher maximum throughput (275.3 MB/s) than even the hand-coded version (248.2 MB/s) and supports real-time data rates at low CPU utilization. EncFS, a FUSE-based encrypted file system for Linux, was modified to use the GPU for AES encryption. A sequential read and write of a 200 MB file are 17% and 28% faster, respectively, than the version using the SSL software library implementation. This evaluation uses an NVIDIA GTX 470 GPU, an Intel Core i5 3.20 GHz CPU, 12 GB RAM, and two 80 GB SATA SSDs in a striped RAID. Multiple concurrent GPU tasks can render GPU scheduling ineffective (e.g. a 30x slowdown), therefore the GPU scheduling mechanism in the kernel (PTSched) is used to eliminate this problem. The micro-benchmarks include bitonic sort, matrix multiplication, matrix addition, and matrix copy kernels with input matrix and image sizes ranging from 64x64 to 1024x1024. The mean speedup over a single-threaded, modular GPU-based implementation is 93%, and 10% over a hand-coded version.

7 HyperFlow: A Heterogeneous Dataflow Architecture [15]

7.1 Overview HyperFlow is a dataflow architecture which provides different abstraction layers over computation resources. It supports heterogeneous computation resources and can provide an optimized implementation for each task depending on the resource on which it is executed. This enables a high degree of portability, as a task does not have to be assigned to a resource statically but can be executed efficiently on different resources.

7.2 Approach The following abstraction layers are provided by HyperFlow: Interconnected Task-Oriented Modules (TOMs) represent pipelines and are executed as flows in a token-based way. A TOM consists of several parameters such as the number of input and output ports. A TOM does not contain an implementation of the task it represents, but refers to a list of task implementation objects, which perform the actual computation. The execution of a TOM at runtime requires the presence of a task implementation that matches the system resources. Pipelines are executed by sending instruction tokens to processing units and retrieving data tokens as results. In HyperFlow these tokens are modeled as flows between connected TOMs. Flows may be generated on the completion of module executions and are classified as waiting, live, or dead. HyperFlow maintains a flow cache to store incoming flows until all required input data is available, and then executes the corresponding module. The actual computing resources are encapsulated by Virtual Processing Elements (VPEs). A VPE manages the execution on a specific computing resource and provides the required context. A VPE waits for tasks to execute as soon as its managed resource becomes available; it then ensures that the required input data resides in its current context. HyperFlow provides a data-transfer path between any two VPEs by assuming that each VPE has access to main CPU memory. Figure 7 gives an overview of the HyperFlow architecture, which consists of the aforementioned TOMs and VPEs as well as the Execution Engine and the VPE Scheduler.

Figure 7: HyperFlow architecture [15, Fig. 1]

The Execution Engine (EE) is the main controlling component of HyperFlow. It is responsible for VPE initialization and assignment to the corresponding resources as well as for the management of the flows. The EE dispatches newly generated waiting flows, together with the state of the VPE resources, to the VPE Scheduler. The VPE Scheduler manages waiting flows and schedules them for execution on available VPEs. It uses two queues, one for waiting flows and one for those currently executing. Flows are assigned an identification number which is used to determine the execution order and to support global scheduling strategies. The VPE Scheduler also ensures that multiple data instances for the same module are not mixed up. HyperFlow's memory management requires each data object to inherit from the predefined DATA class, which implements reference-counted objects with a copy-on-write approach. This enables the use of references instead of copies on the one hand, and eliminates read-after-write and write-after-write hazards on the other. Alternatively, different approaches can be implemented by overriding the DATA class, e.g. if the default implementation performs badly for a given data-copy-intensive application.
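The reference counting and copy-on-write behaviour can be sketched in a few lines of C++. This is an illustration of the idea only, not HyperFlow's actual DATA class; the class and member names are invented.

// Illustrative copy-on-write, reference-counted data object in the spirit of
// HyperFlow's DATA base class (not the original implementation).
#include <cstdio>
#include <memory>
#include <vector>

class Data {
public:
    explicit Data(std::vector<float> values)
        : payload_(std::make_shared<std::vector<float>>(std::move(values))) {}

    // Reading never copies: many downstream modules can share one buffer.
    const std::vector<float>& read() const { return *payload_; }

    // Writing first ensures exclusive ownership, so a writer cannot disturb
    // readers of the old version (avoids read-after-write / write-after-write hazards).
    std::vector<float>& write() {
        if (payload_.use_count() > 1)
            payload_ = std::make_shared<std::vector<float>>(*payload_);  // copy on write
        return *payload_;
    }

    long refcount() const { return payload_.use_count(); }

private:
    std::shared_ptr<std::vector<float>> payload_;  // reference-counted shared buffer
};

int main() {
    Data a(std::vector<float>{1, 2, 3});
    Data b = a;                                                 // cheap: only the refcount grows
    std::printf("shared refcount: %ld\n", a.refcount());        // 2
    b.write()[0] = 42;                                          // b detaches before modifying
    std::printf("a[0]=%.0f b[0]=%.0f\n", a.read()[0], b.read()[0]);  // 1 and 42
}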

7.3 Results One of the main differences to similar approaches is the separation of a task's specification from its implementation. In addition, the pipelines described in HyperFlow are allowed to have feedback communication (cyclic graphs). The first real-case application evaluated is an image-based edge detection pipeline. Compared to the visualization API VTK, HyperFlow runs between 4x (using 1 CPU) and 6x (using 8 CPUs) faster than VTK, and between 2x (using 8 CPUs) and 4x (using 1 CPU) faster than a hand-tuned VTK version. The next evaluation treats the application of Streaming Multigrid Gradient-Domain Processing, which showed a speedup over the original implementation of at least 1.26x. Another real-case application was parallel isosurface extraction, which was compared to the approach of Isenburg et al. The performance comparison using between 1 and 64 CPUs showed that HyperFlow consistently outperforms Isenburg et al.'s approach.

8 Conclusion

8.1 Remarks All of the presented models yield a significant speedup of parallel applications over their sequential implementations, but it should be noted that the results show potential speedups for selected applications, as most of the models only apply to a specific class of problems and do not perform well on others.

8.2 Comparison of results The SysCellC approach describes a compile flow that uses the SystemC constructs channel, fifo, and module to describe a dataflow program; the SystemC channels are implemented with the MPI standard. The approach of Balevic, on the other hand, is based on PPNs, which can be obtained directly from SANLP programs. The dataflow is implemented by a StreamBuffer mechanism which enables pipelined execution between CPU and GPU. CnC-CUDA uses programs in the form of a CnC graph, which can be compared to PPNs, but the execution of a step depends on control tags. PTask is an API which works on dataflow graphs and executes them under one resource manager, which enables it to give fairness and performance guarantees. HyperFlow also works on a dataflow network of connected TOMs, which may have different implementations for different devices; HyperFlow additionally supports programs with data feedback. From the author's point of view, the PPN approach seems to be one of the more interesting approaches in terms of academic research, as it rests on a more formal underlying model and can still represent a wide set of problems.

Abstract Together with [13], this paper aims to be an introduction to the concept of and a summary of the current research on Scratchpad Memory (SPM). The topics I focus on are the hardware aspects, energy efficiency, both the general concept of SPM and known implementations and applications, as well as the issue of worst-case execution time in hard real-time systems.

1 Introduction

1.1 Definition: Scratchpad Memory Scratchpad Memory (SPM) is the term chosen for cache-like, software-managed memory. It is significantly smaller than the main memory, ranging from below 1 KB to several KB in research applications, and reaching 256 KB in the SPEs of the Cell multiprocessor (Section 3.4). Being located on the same chip as, and close to, the CPU core, its access latencies are negligible compared to those of the main memory. Unlike caches, SPM is not transparent to software. It is mapped into an address range different from the external RAM, which is outlined in Figure 1.

Figure 1: Cache and SPM memory model [6]

Some implementations make it possible for the CPU to continue its calculations while data is transferred from RAM to SPM or vice versa by employing an asynchronous DMA controller. Even when the transfer is not asynchronous, transfers from or to RAM are often handled by a special controller that moves data in blocks rather than having the CPU issue individual load and store instructions. There are also approaches that use both an SPM and a regular cache.
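To make the software-managed model concrete, the following minimal sketch uses a static buffer to stand in for the SPM address range and a blocking copy routine to stand in for the DMA-style block transfer; the buffer size, the function names and the processing loop are invented for illustration.

// Sketch of explicit SPM management: the SPM occupies its own address range,
// and data is staged into it in blocks before the hot loop touches it.
// The "SPM" here is just a static buffer standing in for on-chip memory, and
// spm_dma_copy stands in for a (possibly asynchronous) DMA transfer.
#include <cstddef>
#include <cstdio>
#include <cstring>

constexpr std::size_t SPM_SIZE = 4 * 1024;          // e.g. a 4 KB scratchpad
alignas(16) static unsigned char spm[SPM_SIZE];     // stand-in for the on-chip SPM range

// Blocking block transfer RAM -> SPM (a real system would program a DMA engine).
void spm_dma_copy(void* spm_dst, const void* ram_src, std::size_t n) {
    std::memcpy(spm_dst, ram_src, n);
}

float sum_with_spm(const float* ram_data, std::size_t count) {
    float* tile = reinterpret_cast<float*>(spm);
    const std::size_t tile_elems = SPM_SIZE / sizeof(float);
    float acc = 0.0f;
    for (std::size_t base = 0; base < count; base += tile_elems) {
        std::size_t n = (count - base < tile_elems) ? count - base : tile_elems;
        spm_dma_copy(tile, ram_data + base, n * sizeof(float));  // stage one block
        for (std::size_t i = 0; i < n; ++i)                      // work on fast memory only
            acc += tile[i];
    }
    return acc;
}

int main() {
    float data[3000];
    for (int i = 0; i < 3000; ++i) data[i] = 1.0f;
    std::printf("sum = %.0f\n", sum_with_spm(data, 3000));       // 3000
}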

In multicore processors there may be a separate SPM per core, which can, depending on the implementation, be used as private buffer memory, to ease communication between cores, or both (see Section 3.4 for an example).

1.2 Contents of This Paper After the definition and motivation of SPM, in Section 2 I will discuss the hardware details and the impact on energy efficiency. Section 3 will name examples of known applications, with a focus on the Cell multiprocessor. A short explanation of cache locking, as well as some implemented examples, is given in Section 4, together with a comparison of the two concepts. Finally, Section 5 gives an introduction to SPM in WCET optimisations, summarising three papers on static allocation of data and code as well as dynamic cache locking.

1.3 Motivation Modern computer applications require more RAM to perform their tasks than can be embedded into the processor core. Apart from some low-power embedded systems, most processors utilise cache hierarchies to lessen the speed penalty caused by accesses to external memory. A cache is a small temporary buffer managed by hardware, usually employing hard-wired displacement strategies like least-recently-used (LRU), first-in-first-out (FIFO) or randomised approaches. Since these displacement strategies are designed to perform well for a wide spectrum of use cases, they are less optimal than a strategy that is tailored to a specific application by a compiler that knows about the whole program structure and may even employ profiling data. Furthermore, because it lacks most of the management logic a cache requires, SPM is less demanding both in chip area and in complexity. This will be further explained in Section 2. WCET calculation in hard real-time systems is easier and provides tighter estimates when employing SPM, since it is more predictable and gives developers or compilers more possibilities to optimise.

2 Energy Efficiency The main advantage of cache over SPM is that it is transparent to software. To achieve this, it needs to know which memory addresses lie within blocks that are currently stored in the cache. Tags are the parts of memory addresses that are required to map a block of cache memory to the address in RAM it belongs to. They are stored next to the cache blocks they belong to in so-called tag lines, see Figure 2. Next to the tag lines and the necessary logic to determine whether a memory access is already cached, a controller is required that fetches previously uncached memory blocks and takes care of block displacement. Depending on the implementation, there may be different mechanisms like write-back and write-through as well as several displacement strategies available that applications or the operating system can choose from. Since the on-chip cache is a major contributor to the energy consumption of a modern processor, accounting for 25% to 45% of it, increasing its efficiency or replacing it with SPM has a significant impact on the energy consumption of the whole processor. To compare the energy and area efficiency of SPM and cache, [16] modifies an existing processor, the ARM7TDMI, to use an SPM instead of the previously built-in cache. They employ the energy-aware encc compiler with its post-pass option of assigning code and data blocks using the knapsack algorithm. After optimised compilation, the resulting executable is emulated using ARMulator, which emits a trace of all memory access operations.

Figure 2: Cache memory organisation [16]

These traces can be used to determine the energy consumption of both the SPM and the cache. The observed average reductions in time and area are 34% and 18% for constant cycle times, and the SPM is about three times more efficient than a 4-way set-associative cache where energy is concerned. Utilising profiling and graph partitioning to optimise the SPM allocation, as well as a custom management unit with an instruction to load code into the SPM, [6] achieved a 50.7% energy reduction and a 53.2% improvement in performance.

3 Known Applications / Implementations of Scratchpad Memory Systems This section is dedicated to widespread architectures whose memory model fits the definition of scratchpad memory.

3.1 Microcontrollers Both the Atmel megaAVR [1] and the STMicroelectronics STM32 ARM-Cortex-based microcontrollers have interfaces to connect external memory. These extend the internal RAM and are mapped to their own address ranges in the address space available to the processor. Because the on-chip memory of these devices is an order of magnitude faster and smaller than the external RAM, it can be considered a scratchpad memory.

3.2 CUDA and OpenCL CUDA is a framework for general-purpose GPU programming (GPGPU) developed by NVIDIA and is restricted to their GPU architectures. OpenCL is an open standard designed to be portable to more architectures, including even multicore CPUs and the Cell (Section 3.4). Both can benefit from scratchpad memory optimisation as well. The GPU architectures are outlined in Figure 3 and Figure 4.

Figure 3: GPU architecture in CUDA [8]
Figure 4: GPU architecture in OpenCL [8]

Software written for either CUDA or OpenCL has to explicitly manage the allocation of contents and the transfers of those contents between the different memory levels. Before being used for calculations, data has to be transferred between the RAM of the host system and the memory built onto the graphics adapter or other accelerator device. A main problem of GPGPU calculations is the latency and bandwidth of those transfers, since, especially with GPUs, they are an order of magnitude slower than on-device operations.
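As an illustration of this explicit management, the following minimal CUDA sketch allocates device memory, copies the input from host RAM to the device, and stages a tile of it in the software-managed __shared__ memory before operating on it. The kernel, the sizes and the scaling operation are invented; only the CUDA runtime calls themselves (cudaMalloc, cudaMemcpy, cudaFree) are real API.

// Minimal CUDA example of explicit memory management: host -> device transfer,
// then staging a tile in the software-managed on-chip (__shared__) memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(const float* in, float* out, int n) {
    __shared__ float tile[256];                    // explicitly managed on-chip memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];          // stage global -> shared
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];  // operate on the fast copy
}

int main() {
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = float(i);

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));                      // allocate device memory
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device

    scale<<<n / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
    std::printf("out[10] = %.1f\n", h_out[10]);                           // 20.0
    cudaFree(d_in); cudaFree(d_out);
}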

3.3 Emotion Engine [10] The Emotion Engine powering the PlayStation 2 featured a 16 KB scratchpad memory, used mainly to improve communication between the CPU and the two floating-point SIMD-VLIW processors.

3.4 Cell Architecture The Cell multiprocessor, employed in the PlayStation 3 as well as in high-performance clusters, features two approaches to a software-managed memory hierarchy. It was developed by Sony, IBM and Toshiba with the purpose of establishing a platform that provides high performance, energy efficiency and support for real-time applications. One Cell processor consists of one PowerPC Element (PPE) and eight Synergistic Processing Elements (SPEs), the structure of which is outlined in Figure 5. A single SPE consists of 256 KB of local storage, a SIMD Synergistic Processing Unit (SPU) and a DMA controller. The local storage fits the definition of scratchpad memory since both its contents and the DMA transfers from and to RAM are software-controlled. Those DMA transfers are asynchronous, which means there can be up to 16 queued DMA transfers without forcing the SPU to wait for their completion. Being unable to access memory outside of its local storage without DMA transfers, the code running on an SPU may have to be split into overlays, the details of which are discussed in the Code Overlay subsection below.

Figure 5: Schema of the Cell architecture [7]

The local storage of each SPE is mapped into the global address space to allow both the other SPEs and the PPE to transfer data from and to it. This allows programmers to choose the data flow model that best fits the application; for example, they can organise the SPEs into a chain in which data is efficiently streamed from one element to the next, being processed at every station. The PowerPC Element embedded in the Cell multiprocessor is a general-purpose CPU that allows an operating system to manage the other processors. It possesses 32 KB first-level instruction and data caches as well as a 512 KB second-level cache, the latter of which employs replacement management tables that allow the operating system to lock contents into the cache.

Code Overlay Without transferring data through the DMA controller of its SPE, each SPU can only access and execute the contents of the local storage associated with it [7]. Code overlay [3] is a mapping technique to execute programs that are larger than the available memory (local storage) by splitting that memory into regions and the program code into corresponding segments. Multiple segments can be linked to the address of one region, meaning they can be swapped at run time by transferring another segment with its functions into the region. This leads to a behaviour known from caches: when a called function is not mapped to its region, there is an overlay miss and the segment has to be loaded before execution can continue. There is a toolchain integrated into the GNU Compiler Collection that can automatically generate overlays for SPU applications [2]. [3] introduces a Code Overlay Generator (COG) that aims at producing a better overlay mapping than the default IBM Cell SPU compiler; its approach relies on a heuristic instead of constructing and solving an ILP. A more detailed report on optimisations of instruction SPM for average-case execution time is given in [13].
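The overlay mechanism can be sketched as follows, under the simplifying assumption that ordinary functions stand in for code segments and a counter records overlay misses; a real SPU overlay manager would transfer the actual code of a segment into local storage via DMA before jumping to it.

// Conceptual sketch of code overlays: two "segments" share one overlay region.
// When a function from a segment that is not currently resident is called, there
// is an overlay miss and the segment has to be brought in first. Loading is only
// recorded here; on a real SPE it would be a DMA transfer into local storage.
#include <cstdio>

using Fn = int (*)(int);

int seg_a_code(int x) { return x + 1; }   // stands in for the machine code of segment A
int seg_b_code(int x) { return x * 2; }   // stands in for the machine code of segment B

struct Region {
    Fn  resident = nullptr;               // which segment currently occupies the region
    int misses   = 0;

    int call(Fn segment, int arg) {
        if (resident != segment) {        // overlay miss: bring the segment in first
            ++misses;
            resident = segment;           // a real system would copy code into local storage here
        }
        return resident(arg);
    }
};

int main() {
    Region region;
    std::printf("%d\n", region.call(seg_a_code, 1));   // miss, loads A
    std::printf("%d\n", region.call(seg_a_code, 2));   // hit
    std::printf("%d\n", region.call(seg_b_code, 2));   // miss, B replaces A
    std::printf("overlay misses: %d\n", region.misses);
}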

4 Cache Locking Cache locking is a technique available in some systems that allows the operating system or the application to control cache behaviour. Locking data in the cache, i. e. keeping it from being evicted, can be used to make memory accesses more predictable or even to optimise the average execution time. Since the approaches and algorithms targeted at optimising by cache locking are similar to those for scratchpad memory, I included this section.

4.1 Architectures [12] lists the Coldfire MCF5249, PowerPC 440, MPC5554, ARM 940 and ARM 946E-S as architectures that support cache locking.
Cell: The PowerPC element of the Cell multiprocessor allows software to lock cache contents in place, which allows it to optimise memory accesses, among others, for fast, predictable response times in hard real-time systems.
x86: Cache locking is not a supported feature of the x86 architecture, but it is still possible through processor-specific cache control mechanisms. While the use of those is discouraged in performance-critical scenarios, there are some applications that make use of this feature:
CAR: Cache-as-RAM is a technique employed by the open-source BIOS coreboot to increase the amount of memory available to the CPU before the initialisation of the RAM controller.
CARMA: CARMA is a framework to establish a trusted computing base requiring only the CPU of a computer to be trusted. It is motivated by the fact that PCI allows hardware to access the RAM and memory-mapped regions in other peripherals. This provides an entry point for malware nested in PCI devices (e. g. NIC firmware flash) as well as for cold-boot attacks, where sensitive information is gained by freezing RAM chips and thus being able to read their contents after removing them from the system. The technique employed by CARMA is based on the approach of CAR, but instead of using the L1 cache, it maps and locks a portion of RAM into the L2 cache, which yields a piece of general-purpose memory instead of splitting instruction and data memory.

4.2 Cache Locking vs. Scratchpad For example, in [12], Isabelle Puaut and Christophe Pais compare the effects of the instruction cache locking optimisation from [11] for both SPM and locked cache. The difference between the two mechanisms is hidden by a function Load that takes care of the following:
Cache locking: When relying on cache locking, Load scans all the program lines of the basic block to load and checks whether there are free cache lines in each set a line is mapped to. If there are available cache lines, it locks the scanned content in the cache. This approach requires little or no modification of the original memory layout.
Scratchpad: For SPM, Load uses a first-fit strategy to allocate an entire basic block. This has to be done at compile time to determine the memory address the block will be executed at.
This is an example of the portability of algorithms between SPM and cache-locking optimisation. The detailed results are given in Figure 6; while scratchpad memory is not generally better than cache in the given scenario, the WCET is not significantly higher.

Figure 6: On-chip/off-chip/reload ratios for locked caches and scratchpad memories [12]

Furthermore, the authors voice the following concerns:

Cache pollution: The granularity of caching, meaning cache lines of fixed size, may lead to cache pollution, the consequence of which is that data is unintentionally locked into the cache because it is located in the same line as data that is intentionally locked.
SPM fragmentation: As with any memory management, scratchpad memory may become fragmented when exposed to continuous allocations and deallocations.
Cache pollution directly affects the WCET because the time needed to load and lock a cache block is longer than necessary when it includes data that is not needed. A cache is able to handle large basic blocks without performance loss because it works at the independent granularity of its own block size. SPM, on the other hand, is very susceptible to basic blocks that are too large for the available memory, which may be worsened by the aforementioned fragmentation. An example is given in Figure 7, where the jfdctint benchmark shows the explained behaviour.

Figure 7: Impact of basic block size [12]

5 Scratchpad in real-time systems When developing software for hard real-time systems, it is crucial to prove that the software can meet specified reaction times. This is proven by calculating the worst-case execution time (WCET). Naturally, it is in the interest of developers, and of those who evaluate hard real-time systems, to give as tight an upper bound as possible. Optimising for WCET differs from average-case execution time (ACET) optimisation because it has to rely on the formal calculation and guarantee of the execution time along the worst-case execution path (WCEP). Because of this, most of the algorithms geared towards reducing the ACET are not suitable for WCET reduction, especially those relying on profiling.

5.1 Caches There are several aspects of cache behaviour that make it difficult to guarantee tight WCET bounds. Having a processor with a cache leads to unpredictable timing, since the cache's internal state at a specific point in time is unknown to compiler and developer. This leads to WCET estimates being pessimistic, assuming cache misses wherever it is not clear whether a block is cached, or even having to ignore the cache entirely [9]. A further aspect of those uncertainties, timing anomalies, may lead to a cache miss being better for the overall WCET than a cache hit, which further complicates giving tight bounds [14]. Using scratchpad memory alleviates those problems, as well as providing a method to specifically optimise the execution time of the worst-case execution path.

5.2 WCET Centric Data Allocation to Scratchpad Memory [15] In [15], the authors analyse different approaches to lower the WCET of an application by statically allocating data to SPM. First, they formulate an ILP which, when solved, produces an optimal allocation of variables to the scratchpad. After that, there is an analysis of both knapsack and branch-and-bound approaches to solving the problem faster, as well as a greedy heuristic to be able to allocate larger amounts of data within reasonable time. While the approaches given in this paper can also be used to allocate variables on the stack, this is only possible for non-recursive ones that can be treated as global variables.

5.2.1 ILP The basis of the ILP to optimise the WCET is a set of decision variables S_v ∈ {0, 1} that determines for each variable v ∈ allvars whether it is allocated to the SPM (S_v = 1) or to conventional memory (S_v = 0). To ensure that all allocated variables actually fit into the SPM, there is the constraint

Σ_{v ∈ allvars} S_v · area_v ≤ scratchpad_size

which ensures that the sum of the areas area_v needed for the allocated variables stays below the size of the SPM, scratchpad_size. The calculation of the WCET of a loop is done through analysis of the directed acyclic graph (DAG) representing its control flow. It is assumed that the DAG of each loop has exactly one source and one sink node, the latter of which may be a virtual sink node when none is given. The WCET of the DAG rooted in basic block i is called W_i. It is calculated from the sink node of the DAG towards the source node, the first being

W_sink = cost_sink − Σ_{v ∈ allvars} S_v · gain_v · n_{v,sink}

where cost_i is the execution time of basic block i; the reduction is calculated by multiplying the gain in access time gain_v of each variable allocated to the scratchpad with its number of occurrences n_{v,i}. Similarly, for every edge i → j in the DAG:

W_i ≥ W_j + (cost_i − Σ_{v ∈ allvars} S_v · gain_v · n_{v,i})

This gives the WCET of the loop body as W_source. The WCET of the whole program can be determined by multiplying the WCET of each loop with a known constant lb, the known maximum number of loop iterations.

This gives us the maximum execution time of the innermost loops, which can now be used to construct the same constraints for the next level of loop nesting, until the WCET W_entry of the unique entry node of the program is known. Thus, the objective function of the ILP is the WCET of the entire program: W_entry.

5.2.2 Knapsack While formulating memory allocation as a knapsack problem may be an intuitive choice, it is not appropriate for WCET-optimising SPM allocation. The problem is that the reduction achieved through the allocation of variables to the scratchpad is not additive: whenever one path is optimised, it might become faster than another path, which then becomes the new worst-case execution path and lessens the achieved reduction in worst-case execution time. A graphical example of changing WCEPs is given in Figure 15 in Section 5.5. Since the WCET reduction achieved through the allocation of each variable depends heavily on the allocation of the other variables, knapsack cannot be used to solve the allocation problem in this instance.

5.2.3 Branch and Bound Branch and bound is the improved approach chosen to find an exact solution to the ILP given in Section 5.2.1. To avoid having to try all possible combinations of variable allocations, branch and bound utilises a heuristic to discard large numbers of possible solutions without calculating the result of all of them. This is achieved by representing the allocation decision for each variable as a layer in a tree structure. While traversing the tree, after each WCET that is lower than the WCETs found before, all those not-yet-traversed subtrees are discarded for which a heuristic shows that their lowest possible WCET lies above the minimum already found. See Figure 8 for a graphical example.

Figure 8: Pruning the branch-and-bound search tree [15]

A good heuristic for the upper bound of a subtree, referred to as UB() from now on, is essential for the branch-and-bound approach to be able to cut away large portions of the decision tree, and thereby to lessen the computational effort required for finding an optimal solution.
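The upper bound UB() is obtained from the knapsack relaxation described in the following paragraph. As a minimal sketch, assuming (for the bound only) that the per-variable gains were additive, which is exactly the assumption that makes knapsack unsuitable as an exact method:

// Sketch of a knapsack-style upper bound UB() for branch-and-bound pruning:
// treat each variable's best-case gain as additive (an over-approximation) and
// solve a 0/1 knapsack over the remaining SPM capacity by dynamic programming.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Var { int area; long gain; };   // size in bytes, maximum possible WCET reduction

// Maximum additive gain obtainable with the given capacity (classic 0/1 knapsack DP).
long knapsack_upper_bound(const std::vector<Var>& vars, int capacity) {
    std::vector<long> best(capacity + 1, 0);         // best[c] = max gain using c bytes
    for (const Var& v : vars)
        for (int c = capacity; c >= v.area; --c)     // iterate backwards: each var used once
            best[c] = std::max(best[c], best[c - v.area] + v.gain);
    return best[capacity];
}

int main() {
    std::vector<Var> undecided = {{64, 120}, {128, 300}, {32, 80}};
    long already_gained = 500;                        // gain fixed by decisions made so far
    int  spm_left       = 160;                        // remaining scratchpad bytes
    // UB() for this subtree: fixed gain plus the optimistic knapsack bound.
    std::printf("UB = %ld\n", already_gained + knapsack_upper_bound(undecided, spm_left));
}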

While it cannot reliably optimise the worst-case execution time itself, knapsack optimisation provides a definite upper bound for the possible WCET reduction through variable allocation. It is solved through dynamic programming and requires the following parameters:
1. the variables v_i to allocate,
2. the size area_{v_i} of each variable to allocate,
3. the limit given by the size of the SPM, scratchpad_size, and
4. the maximum possible execution time reduction through the allocation of each variable, given by the maximum reduction achieved on any execution path: bound_{v_i}.
The effort necessary is still exponential, which is why even branch-and-bound is not feasible for larger programs and large SPMs.

5.2.4 Greedy Heuristic To provide a feasible solution to the problem of static data allocation to SPM, a greedy heuristic with a significantly lower complexity is given in [15].

Figure 9: Greedy heuristic given in [15]

The results of the comparison given in Figure 10 indicate that the given greedy heuristic comes close to the optimal case with respect to the achieved WCET reduction. [15] incorporates infeasible-path detection to further reduce the WCET. Infeasible paths are detected by searching for conflicting assignments of the form variable := constant and conditional branches of the form variable <relational operator> constant. For further information, the reader is referred to [4].

5.3 Optimal Static WCET-aware Scratchpad Allocation of Program Code [5] While the previous paper focused on the allocation of data to SPM, [5], as the title indicates, focuses on the allocation of program code.

Figure 10: WCET reduction through optimisation with the original ILP, branch-and-bound and the greedy heuristic for various applications [15]

ILP The ILP defined in [5] is very similar to the one given in [15], which is why I will narrow the explanation down to the differences between the two. First, program code is usually handled at the granularity of basic blocks, but defining the decision variables for those makes no difference to the general ILP layout. A more important consideration in [5] is the size and speed penalty of jump and branch instructions. The instruction encoding limits the address distance an instruction can jump or branch over, which, together with the SPM being mapped to a different address space than the RAM, leads to more instructions being required when control is transferred from basic block to basic block across the two types of memory.

Figure 11: Possible jump scenarios [5]

x_i, x_j and x_k are the decision variables for the basic blocks b_i, b_j and b_k. There are three scenarios for transferring control between basic blocks on typical embedded processors, depicted in Figure 11. Implicit jumps transfer control between consecutive blocks without a jump or branch instruction, when the end of one basic block is reached and the next one begins. The execution time penalty for placing basic blocks connected by an implicit jump in different memory spaces can be modelled by

jp^i_impl = (x_i ⊕ x_j) · P_high

where ⊕ represents the logical XOR and P_high is the jump penalty for jumping across memory spaces. Unconditional jumps are penalised with P_high when the two basic blocks lie in different memory spaces. When they share the same memory space, there is the much smaller penalty P_low. The basic blocks may even be consecutive in the memory they are allocated to, which happens when all in-between blocks are allocated differently, and which causes a penalty of 0. The general penalty caused by an unconditional jump from block b_i to b_j is

jp^i_uncond = (x_i ⊕ x_j) · P_high + (1 − (x_i ⊕ x_j)) · (1 − Π_{b_k ∈ Fig. 11b} (x_i ⊕ x_k)) · P_low

A conditional jump is regarded as a combination of an implicit and an unconditional jump, its penalty jp^i_cond being the implicit-jump penalty towards its fall-through successor plus the unconditional-jump penalty towards its jump target (cf. Figure 11c). These three penalties are then integrated into the ILP, being added to the WCET at every edge of the directed acyclic graph that represents the given code. Aside from these performance impacts, the memory consumption of the additional jump instructions has to be considered when allocating code to SPM. [5] gives the following size penalty for the different jump scenarios:

s_i = (x_i ⊕ x_j) · S_impl,   if the jump scenario of b_i is implicit
s_i = (x_i ⊕ x_j) · S_uncond,   if the jump scenario of b_i is unconditional
s_i = (x_i ⊕ x_k) · S_impl + (x_i ⊕ x_j) · S_uncond,   if the jump scenario of b_i is conditional
s_i = 0,   otherwise

Compiler To use the given ILP, [5] relies on the architecture of their WCET-aware C compiler WCC; the ILP-based WCET-aware SPM code allocation is performed after all other optimisations. The entire architecture is outlined in Figure 12.

Figure 12: The WCET-aware compiler WCC [5]
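For illustration, the penalty terms reconstructed above can be evaluated for concrete allocation decisions. The following small sketch follows that reconstruction; the values of P_high and P_low are invented and not taken from [5].

// Evaluates the jump penalties from the formulas above for concrete 0/1 allocation
// decisions (1 = basic block placed in the SPM, 0 = placed in main memory).
// P_HIGH and P_LOW are invented example values.
#include <cstdio>
#include <vector>

constexpr int P_HIGH = 4;   // penalty for a jump across memory spaces (illustrative)
constexpr int P_LOW  = 1;   // penalty for a long jump within one memory space (illustrative)

// Implicit jump b_i -> b_j: only penalised when the blocks end up in different memories.
int jp_impl(int xi, int xj) { return (xi ^ xj) * P_HIGH; }

// Unconditional jump b_i -> b_j; xk_between are the decisions for the blocks laid
// out between them (cf. Figure 11b). If all of them end up in the other memory,
// b_i and b_j become consecutive there and the penalty drops to zero.
int jp_uncond(int xi, int xj, const std::vector<int>& xk_between) {
    int all_between_differ = 1;
    for (int xk : xk_between) all_between_differ &= (xi ^ xk);
    return (xi ^ xj) * P_HIGH
         + (1 - (xi ^ xj)) * (1 - all_between_differ) * P_LOW;
}

int main() {
    std::printf("implicit, split across memories   : %d\n", jp_impl(1, 0));          // P_HIGH
    std::printf("uncond, same memory, block between: %d\n", jp_uncond(1, 1, {1}));   // P_LOW
    std::printf("uncond, same memory, consecutive  : %d\n", jp_uncond(1, 1, {0}));   // 0
}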

Evaluation For the evaluation, [5] uses 73 different real-life benchmarks, the results for some of which are given in Figure 13.

Figure 13: WCET reduction estimates for different benchmarks [5]

The size of the scratchpad is 48 KB, 47 KB of which are usable after reserving 1 KB for system code. The program code sizes of all benchmarks lie between 52 bytes and 18 KB, so all of them fit into the SPM. Each of the five benchmarks shown was run with the scratchpad size restricted to 10%, 20% and so on, up to 100% of the program size.

5.4 WCET-Centric Software-controlled Instruction Caches for Hard Real-Time Systems [11]

Definition of Problem [11] provides WCET optimisation through dynamic cache locking. Puaut first gives a greedy algorithm to find near-optimal solutions, which are then used as the initial population for a genetic algorithm that further optimises the result. Unlike the previously discussed static SPM allocations, dynamic SPM allocation or cache locking changes the allocation of variables or basic blocks during program execution. The dynamic allocation can be split into two problems: first, the choice of the reload points at which new content gets locked into the cache and old content may be evicted; second, the content to load at these reload points has to be determined.

Choice of Reload Points While reload points (RP) could be placed at every instruction in the program, the vast number of possibilities would cause an enormous complexity. In [11], the author chooses to limit the placement of reload points to natural locations, which are the headers of functions and loops. To limit the number of reload points, the user may specify max_reload_points. To decide whether an RP is worth choosing, an estimate of the possible WCET reduction is calculated. For those calculations it is necessary to give an overview of the cache model used: the instruction cache is W-way set associative. It contains B blocks of S_B bytes each, with the total size being S_C = B · S_B. Because the instruction size of the CPU is fixed, each cache block can hold exactly ipcl instructions. There are several constants to describe the timing behaviour of the cache:
1. t_hit and t_miss are the latencies caused by cache hits and misses.
2. t_i and t_l are the latencies caused by loading and locking the cache, respectively. The cost of loading and locking pl instruction lines is t_i + t_l · pl.
To evaluate whether a reload point is worth being placed before the loop L, the greedy algorithm uses the cost function CF(L) given below. Furthermore, it relies on a function f(bb) or f(pl) that gives the number of executions of a basic block bb or instruction line pl on the WCEP of the loop L. While pl(L) returns all instruction lines of the loop, mfpl(L) returns the most frequently executed instruction lines that fit into the cache.

WCET_cache(L) = Σ_{pl_i ∈ pl(L)} f(pl_i) · (t_miss + (ipcl − 1) · t_hit)

WCET_locked(L) = Σ_{pl_i ∈ mfpl(L)} f(pl_i) · ipcl · t_hit + Σ_{pl_i ∈ pl(L) \ mfpl(L)} f(pl_i) · ipcl · t_miss + Σ_{ph ∈ pre-head(L)} f(ph) · (t_i + t_l · |mfpl(L)|)

CF(L) = WCET_cache(L) − WCET_locked(L)

CF(L) indicates the benefit of placing a reload point in the pre-header of loop L. If CF(L) is positive, the WCET of the loop L is expected to be lower when cache content is locked at the reload point preceding it. The greedy algorithm sorts the reload points by this benefit and selects the first max_reload_points of them.

Choice of Content to Load&Lock For each RP that is deemed worth using, the benefit of the load&lock of each basic block (BB) following that RP is considered. Similarly to the benefit of an RP, this is done through a formula that weighs the estimated WCET using the cache in regular LRU mode against the WCET achieved while locking the BB into said cache. The actual algorithm defined in [11] for this step is given in Figure 14. It chooses the N most beneficial basic blocks, iterates over the reload points and loads all program lines of each basic block at all the reload points that precede it. This is repeated until there are no more beneficial basic blocks or the WCET after an iteration is worse than that of the iteration before. A lower value for N means that the WCEP and WCET are re-evaluated more often, which increases the effect of the optimisation.
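A small sketch of the reload-point benefit CF(L) defined above, evaluated for one loop; the timing constants, the line counts and the choice of mfpl(L) by simple sorting are invented for illustration and do not reproduce the exact procedure of [11].

// Sketch of the reload-point benefit CF(L) from the formulas above, for one loop L.
// f[i] is the execution count of instruction line i on the loop's WCEP; the most
// frequently executed lines that fit into the cache form mfpl(L).
#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

constexpr long T_HIT = 1, T_MISS = 20, T_I = 50, T_L = 2;  // illustrative latencies
constexpr long IPCL  = 4;                                   // instructions per cache line

long cf(std::vector<long> f, std::size_t cache_lines, long preheader_execs) {
    // WCET with the cache in normal mode: every line fetch is a miss, the
    // remaining instructions of the line hit.
    long wcet_cache = 0;
    for (long fi : f) wcet_cache += fi * (T_MISS + (IPCL - 1) * T_HIT);

    // mfpl(L): the most executed lines that fit into the cache.
    std::vector<long> sorted = f;
    std::sort(sorted.begin(), sorted.end(), std::greater<long>());
    std::size_t locked = std::min(cache_lines, sorted.size());

    long wcet_locked = 0;
    for (std::size_t i = 0; i < sorted.size(); ++i)
        wcet_locked += sorted[i] * IPCL * (i < locked ? T_HIT : T_MISS);
    wcet_locked += preheader_execs * (T_I + T_L * long(locked));  // load&lock cost per loop entry

    return wcet_cache - wcet_locked;                // positive => the reload point pays off
}

int main() {
    // A loop with four instruction lines, two of which dominate the WCEP.
    std::printf("CF(L) = %ld\n", cf({100, 100, 5, 5}, /*cache_lines=*/2, /*preheader_execs=*/1));
}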

As previously discussed in Section 5.2.2, allocating contents on the WCEP to either SPM or locked cache may change the WCEP, so the achieved reduction in WCET is not additive and may be smaller than the reduction estimated on the original path.

Implementation Details Instead of restructuring the code or the executable of the given application, one approach for inserting reload points given in [11] relies on the debug facilities of the processor platform being used. A breakpoint is placed at every reload point; the exception it causes is captured by the processor and handled by an external manager that takes care of the cache locking.

Genetic Algorithm A genetic algorithm uses Darwin's theory of evolution to find solutions to a problem. It does so by evaluating the fitness of individuals in a pool of specimens and applying evolutionary mechanisms, e. g. crossover and mutation. There is no guarantee that an evolutionary algorithm finds a good solution, and the time it takes for a random initial population to become reasonably fit is too long. Because of this, the genetic algorithm is only used to increase the fitness of the results of the greedy heuristic given before. The formal definition of the parameters of said algorithm is as follows:
1. Codification (representation of an individual): individuals are represented by chromosomes, which are arrays of tuples of the form (rp, contents), where rp identifies a reload point and contents the cache contents to be locked at that point.
2. Fitness: the fitness of an individual is the WCET it achieves.
3. Selection: the probability of selecting an individual depends linearly on its WCET.
4. Crossover and mutation: crossover is done by randomly selecting a point in the chromosome; everything before it comes from one parent, everything after it from the other. Three mutation mechanisms are implemented: M_rem removes one randomly selected reload point, M_add adds a random reload point and M_chg randomly changes the content of one reload point.

5.5 Evaluation To evaluate the performance of both the greedy heuristic and the genetic algorithm, Puaut compares the hit-to-miss ratios that the Heptane open-source cache-aware WCET estimation tool yields for compiled MIPS R2000/R3000 binary code.

Figure 16: Performance results given in [11]: miss ratio of LRU, PRR, the heuristic and the genetic algorithm

The results are compared to a regular cache using the replacement strategies LRU and pseudo round-robin (PRR), the latter of which is chosen because it is a hard-to-predict strategy. Figure 16 shows that neither the results of the greedy algorithm nor their genetically optimised versions yield a better miss ratio than LRU. There are, however, cases in which the hard-to-predict PRR strategy performs worse, which indicates that hard-to-predict cache replacement strategies could be an application for the algorithms proposed in [11].

6 Conclusion The basic message carried by every paper on SPM that I read is: SPM is able to compete with a regular cache, and not seldom exceeds the efficiency a cache can offer. Development of optimising compilers continues to make advances, so it should only be a matter of time before optimisations for explicitly managed memory hierarchies are implemented outside of research compilers. While general-purpose systems require more work to make existing applications work with SPM, embedded systems have the benefit of running software that is often tailored to, or at least specifically compiled for, the specific architecture. Together with the increased energy efficiency and, compared to cache, decreased complexity, this makes SPM a well-suited mechanism to optimise embedded systems. Even though most of the interesting optimisations for general-purpose and high-performance multicore systems are explained in the paper of my colleague Axel Ratzke [13], there are conclusions I can draw from my research on the known applications of scratchpad memory and explicitly managed memory hierarchies: the success of GPGPU programming for high-performance and especially high-efficiency calculations, and the success of the Cell architecture, indicate that the concept of scratchpad memory is a worthwhile and important topic for research on the optimisation of those applications. Mobile devices, specifically Android smartphones with their Java-based portable applications, would require a JIT compiler able to optimise software for SPM, which is something I would like to see being researched. In conclusion: being able to automatically compile and optimise code to properly use explicitly managed memory hierarchies appears to be an important step towards increasing the efficiency of computing applications.

7 Bibliography
References
[1] ATmega640/1280/1281/2560/2561 Datasheet, revision P, updated: 10/2012. Available at atmel.com/devices/atmega1280.aspx?tab=documents.
[2] Software Development Kit for Multicore Acceleration, Version 3.1, Programmer's Guide. Available at http://www.ps3devwiki.com/wiki/cell_programming_ibm.
[3] Michael A. Baker, Amrit Panda, Nikhil Ghadge, Aniruddha Kadne & Karam S. Chatha (2010): A performance model and code overlay generator for scratchpad enhanced embedded processors. In: Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES/ISSS 10), ACM, New York, NY, USA.
[4] Ting Chen, Tulika Mitra, Abhik Roychoudhury & Vivy Suhendra (2005): Exploiting branch constraints without exhaustive path enumeration. In: 5th International Workshop on Worst-Case Execution Time Analysis (WCET).
[5] H. Falk & J.C. Kleinsorge (2009): Optimal static WCET-aware scratchpad allocation of program code. In: Design Automation Conference (DAC 2009), ACM/IEEE.

An Introduction to the Research on Scratchpad Memory with Focus on Performance Improvement - Instruction SPM, SPM on Multicore Systems and SPM on Multitasking Systems
Axel Ratzke
University of Kaiserslautern, Embedded Systems Group

Abstract In this paper a short introduction to the broad field of scratchpad memories is given. The main focus is on the improvement of performance through the proper use of scratchpad memories. Several techniques for automatically making optimal use of the available on-chip memory space are presented.

1 Introduction In modern embedded systems the number of processing elements increases steadily. Therefore modern architectures have to deal with the bottleneck of limited memory: as the systems continuously get faster, the memory remains the weak point. Common solutions to this problem are hierarchical cache memory structures. Since caches are hardware-controlled, it is quite difficult to use them efficiently in a system with only limited resources. Furthermore, it can remain unpredictable which information will be stored in the caches and which will be displaced again. To avoid these kinds of problems, the use of scratchpad memories has become more and more popular in embedded systems. Scratchpad memories basically are small, fast and explicitly software-managed local on-chip memories. Since they are controlled by software, they offer a number of advantages for special-purpose systems. Omitting the additional management hardware saves space on the chip, which is usually limited. Second, software-managed memories offer the advantage of considerably faster access, since it can be assumed that a request hits. Additionally, software solutions are simpler to implement, and design errors can be addressed efficiently. This paper puts its main focus on the performance improvement achievable with scratchpad memories (SPM). To consider the broad spectrum of their use appropriately, this work is divided into the aspects of storing instructions in the SPM, the extension of the problem of optimal SPM usage to multicore systems with several separate SPMs, and the usage of SPMs in systems running several distinct processes. For each of these aspects, some approaches that solve one or several of the problems it raises are presented. The rest of this work is structured as follows: Section 2 presents the problem of optimal usage of the SPM considering only instruction storage; two approaches to this problem are presented. Section 3 extends the problem to the proper usage in multicore environments; three approaches to performance improvement are presented. Section 4 presents one possibility to adapt an SPM-based memory architecture to support several distinct processes. Finally, Section 5 concludes.

2 Instruction SPM As stated above, the main focus of this paper is on the performance improvement of systems using scratchpad memories. By the use of a scratchpad memory, the performance of a system can be increased via two options. The first one is data allocation, but this paper will focus on the second option, i. e. instruction allocation. The reason for this is that embedded systems are often designed for special purposes and therefore only a few applications run on them. This makes it possible to save the frequently executed instruction blocks in the scratchpad memory in order to obtain a performance gain. Usually there are again two options for doing so. The first is the static approach: here the SPM is filled with instructions before the execution of an application, and these instructions are not removed until the execution is finished. It is obvious that this approach wastes a lot of the available memory space and therefore does not achieve the best results. The second approach is dynamic: here instructions are not transported into the SPM until they are needed for execution. On the one hand this enables a better usage of the available memory space, but on the other hand it is much more difficult for the programmer to manage the SPM. That is the reason why solutions are sought that hand this problem over to the compiler, which should be able to handle it much more efficiently than humans. For this reason, two different approaches to dynamic instruction allocation are introduced in the following.
In the first approach presented here, the authors of [29] propose in "Dynamic Overlay of Scratchpad Memory for Energy Minimization" a profile-based approach which places instructions as well as data into the SPM by considering their live ranges. The technique has to find the points where the code that stores objects to, respectively loads them from, the SPM needs to be inserted. These points have to be placed optimally to generate as little overhead as possible. The technique also has to generate the addresses of variables and code in the SPM in order to place as many variables and code segments in the SPM as possible. The actual problem is to store memory objects into the SPM when they are needed and to write them back when they are not needed anymore. The authors state that this problem has been proven to be NP-complete. To solve it efficiently, the authors divide it into two smaller subproblems. The first subproblem is to decide whether a memory object should be assigned to the SPM or to the main memory and to find the optimal position for the placement of the corresponding transfer code. The second subproblem is to compute optimal addresses in the SPM for these memory objects. Both subproblems have been proven to be NP-complete. An algorithm consisting of four steps is used to solve the problem. First, variables and code segments of the application are identified as memory objects. In the second step, a liveness analysis is executed to find the live ranges of the memory objects. The third step is to decide whether these objects are stored in the SPM or in the main memory. In the final step, the addresses are computed for all objects that are stored in the SPM.
The following types of variables and code segments are regarded as potential candidates (called memory objects): global variables (scalar and non-scalar), since they need space in the data memory in either case; non-scalar local variables, since the authors assume that frequently accessed scalar variables are kept in registers at execution time and so do not need to be stored in the SPM; and frequently executed code segments, called traces, which are identified with the trace generation technique. A trace is a frequently executed straight path of basic blocks which improves the processor's performance through its spatial locality. In addition, the authors state that traces always end in an unconditional jump, so that they form atomic units of instructions that are free to move. Afterwards the liveness analysis is executed on the control flow graph of every function. Here the set of basic blocks of a function forms the set of nodes, and the edges are given by the flow of control during the function's execution.

The authors extend the principle of DEF-USE chains from [21] to determine the liveness of the memory objects. A reference to a memory object can be classified as DEF, MOD or USE. If a reference assigns a new value to every element of a memory object, it is a DEF; if it does so to only some of the elements of a memory object, it is a MOD; and a reading reference is a USE. These attributes are assigned to the nodes in the control flow graph. After that, a combination of static and profiling-based analysis methods is applied in order to find the basic blocks which contain references to variables. Static methods are used to find the blocks which contain references, and profiling is used to separate DEF references from MOD references (and vice versa). Traces are treated like variables, but USE is assigned to their corresponding blocks. Finally, a fixed-point algorithm is used to find the live range of every object.
Now the memory assignment problem is formulated as an integer linear programming problem. In this formulation, memory objects are mapped to the SPM on the edges of the control flow graph. According to the authors, this allows an efficient determination of the optimal points for the insertion of the transport code. To every object on every edge of the control flow graph, an element of the set of static attributes Attrib_STATIC = {DEF, MOD, USE, CONT} is assigned. DEF is assigned to every edge emerging from a node with the DEF attribute. In contrast, MOD respectively USE is assigned to every edge leading to a node with the MOD respectively USE attribute. If a memory object is live on an edge, CONT is assigned to the edge. In addition, spill attributes Attrib_SPILL = {LOAD, STORE} are assigned to the edges to model the transport of memory objects. The LOAD attribute is assigned to edges where the corresponding object can be loaded from the main memory into the SPM. Accordingly, the STORE attribute is assigned to edges where the object can be stored back from the SPM to the main memory. The LOAD attribute is assigned to the edges which are marked with MOD, USE or CONT, or which emerge from a diverge-node (i. e. a node with out-degree greater than one). Accordingly, the STORE attribute is assigned to edges marked with DEF or leading to a merge-node (i. e. a node with in-degree greater than one). So spill attributes can only be assigned to edges which are already marked as DEF, MOD, USE or CONT. To enforce this, the binary decision variable X^i_{j,k} is defined, which models the assignment of memory object mo_k to the SPM on edge e_i. X^i_{j,k} = 1 if and only if mo_k is present in the SPM at edge e_i and an operation corresponding to attribute at_j is performed, and 0 otherwise. Here e_i ∈ E, at_j ∈ Attrib_STATIC ∪ Attrib_SPILL and mo_k ∈ MO. The objective function is represented by the energy savings that are possible through this technique; the savings have to be maximised:

Σ_{e_i ∈ E} Σ_{mo_k ∈ MO} { E_profit(i, j, mo_k) · X^i_{j,k} − E_load_cost(i, mo_k) · X^i_{LOAD,k} − E_store_cost(i, mo_k) · X^i_{STORE,k} }    (1)

Here E_profit(i, j, mo_k) are the savings obtained by assigning mo_k to the SPM at e_i, and E_load_cost(i, mo_k), E_store_cost(i, mo_k) are the energy costs for transporting memory object mo_k to, respectively from, the SPM at edge e_i. The authors take the savings values from [26]. Now the constraints of the linear programming problem are built. First come the constraints which enforce a correct flow of the liveness of the memory objects:
X^i_{DEF,k} − X^j_{CONT,k} − X^i_{STORE,k} = 0   for all mo_k ∈ MO    (2)

X^i_{USE,k} − X^j_{CONT,k} − X^i_{LOAD,k} = 0   for all mo_k ∈ MO    (3)

X^i_{MOD,k} − X^j_{CONT,k} − X^i_{LOAD,k} = 0   for all mo_k ∈ MO    (4)

X^i_{CONT,k} − X^j_{CONT,k} − X^i_{LOAD,k} = 0   for all mo_k ∈ MO    (5)

The constraints (6) to (9) are added to ensure this also at merge- and diverge-nodes.

X^i_{LOAD,k} − X^i_{j,k} ≤ 0   for all e_i ∈ {e_{i1}, …, e_{in}}, at_j ∈ {at_{j1}, …, at_{jn}}    (6)

X^{i1}_{j1,k} = … = X^{in}_{jn,k}   such that at_{j1}, …, at_{jn} ∈ Attrib_STATIC    (7)

Here e_{i1} to e_{in} are the edges leading into a merge-node.

X^i_{STORE,k} − X^i_{j,k} ≤ 0   for all e_i ∈ {e_{i1}, …, e_{in}}, at_j ∈ {at_{j1}, …, at_{jn}}    (8)

X^{i1}_{j1,k} = … = X^{in}_{jn,k}   such that at_{j1}, …, at_{jn} ∈ Attrib_STATIC    (9)

Here e_{i1} to e_{in} are the edges coming from a diverge-node. Finally, the total space of the available SPM is added to the constraints, so that the memory objects assigned to the SPM cannot exceed it:

Σ_{mo_k ∈ MO} X^i_{j,k} · Size(mo_k) ≤ ScratchpadSize   for all e_i ∈ E    (10)

The problem posed above is solved with a commercial ILP solver [4]. The number of variables in the formulation lies in O(|MO| · |E|) according to the authors. Finally, the address assignment problem is solved. In the ILP formulation above it was implicitly assumed by the authors that the total size of all memory objects assigned to the SPM can be accommodated whenever the size on every single edge is smaller than the SPM, so that addresses can always be computed. With a bad assignment strategy this assumption can be violated due to SPM fragmentation. As a consequence it can happen that addresses cannot be assigned to objects although there would be enough space for them. If all objects have the same size, this problem is trivial; in other cases it becomes NP-complete. The problem is now formulated as a mixed integer linear programming problem. To compute the address of a memory object, the authors first compute the offset of its start address relative to the base address of the SPM. The integer variable O^i_j models the offset of memory object mo_j at edge e_i, and

0 ≤ O^i_j ≤ ScratchpadSize − Size(mo_j)    (11)

holds. Next the constraints of the problem are formulated. The memory areas of two memory objects which are present on the same edge must not overlap. This is enforced with

O^i_j − O^i_k + L · u^i_{jk} ≥ Size(mo_k)   for all e_i ∈ E    (12)

O^i_k − O^i_j + L · (1 − u^i_{jk}) ≥ Size(mo_j)   for all e_i ∈ E    (13)

where L is a sufficiently large constant and u^i_{jk} = 1 if and only if O^i_k − O^i_j ≥ Size(mo_j) is satisfied, and 0 if and only if O^i_j − O^i_k ≥ Size(mo_k) is satisfied. These constraints have to be repeated for every pair of memory objects which are assigned to the SPM at the same edge.

In addition, constraints are added to ensure that for a given memory object mo_k the offset does not change across the edges on which it is assigned to the SPM:

O^i_k - O^j_k - L \cdot v^{ij}_k = 0 \quad \forall e_i, e_j \in E    (14)

with v^{ij}_k = 1 if and only if O^i_k \ne O^j_k, and 0 otherwise. This constraint, too, is repeated for every valid pair of edges. A valid solution can be recognized by the offsets of memory objects remaining the same on pairs of edges. So the objective function is the sum of the binary variables v^{ij}_k over every valid pair of edges and every memory object; this function has to be minimized:

\sum_k \sum_{i,j} v^{ij}_k    (15)

This problem is also solved with the help of the ILP solver [4], but with its branch-and-bound procedure. According to the authors this can take a long time for some instances of the problem.

The technique was tested with a system consisting of an ARM7T processor core, an on-chip SPM and an off-chip main memory. The algorithm presented here for solving the overlay problem was compared to the static allocation technique from [25]. First the benchmarks were compiled using an energy-optimizing C compiler, then the trace generation [28] was applied, before the presented allocation technique was used. The machine code generated in this way was executed by the ARMulator [1]. The tests were executed with benchmarks from MediaBench II [2], the UTDSP benchmark suite and a benchmark consisting of sorting routines from [25]. As a result, an application needs on average 21% fewer CPU cycles when the technique presented above is used. This technique seems quite conservative, as using profiling data and formulating optimal SPM use as a linear programming problem is a common approach; moreover, the problem is identified as NP-complete. On the other hand, this paper describes a way to use the existing SPM space for both instructions and data, which is quite an advantage on systems with only limited die area.

Figure 1: Edge Detection: Comparison with a static allocation approach

In the paper presented next, a quite different approach is shown, in which the content of the SPM is not determined before execution but rather at runtime. In "Scratchpad Memory Management for Portable Systems with a Memory Management Unit" [15] a dynamic strategy for horizontally partitioned memory subsystems of contemporary embedded processors was developed. The memory subsystem is fitted with a memory management unit and an SPM that is physically addressed and mapped into the virtual address space. Furthermore, to save energy and further increase

Figure 2: Performance and Code Size: Comparison with a static allocation approach

the system's speed, a small minicache is implemented. The procedure is based on using the page fault exceptions of the memory management unit to track page accesses and to copy frequently used code into the SPM before it is executed. Because the smallest unit for copying code into the SPM is a memory page, the authors state that good code placement is of the utmost importance in this procedure. A postpass optimizer is used to divide the application's binaries, with the help of profile data, into three categories: pageable, cacheable and uncacheable. Pageable code is aggregated into pages with the same size as a physical memory management unit page and copied into the SPM as needed, while the other two categories are stored at fixed positions in the external memory.

They describe their memory system as follows. To avoid the difficulties of today's standard architectures (e.g. architectures where the MMU first translates the virtual address to a physical address which is compared to the SPM base register, so that SPM accesses occur only if the address belongs to the SPM's address range, or architectures where the SPM and the cache are accessed at the same time), the authors propose the following architecture: a horizontally partitioned on-chip memory subsystem for the instruction side of a Harvard architecture, consisting of a micro-TLB, an SPM and a direct-mapped minicache. The micro-TLB's task is to translate the virtual addresses coming from the core's instruction fetch. The resulting physical address is checked against the register describing the SPM area to decide whether the request should be forwarded to the SPM or to the minicache; this means that if the address lies outside the SPM, the minicache is addressed. The purpose of the minicache is to reduce the costs of requests outside the SPM. As a disadvantage it has to be mentioned that serializing the virtual-to-physical translation with the SPM/cache access increases the latency of the instruction fetch by the time of a micro-TLB access. For cores up to 500 MHz this can be compensated by an additional cycle; for cores beyond 500 MHz it is possible, according to the authors, to split the instruction fetch pipeline accordingly.

The Scratchpad Memory Manager (SPMM), which manages the SPM as a global resource, depends neither on the size of the SPM nor on a certain number of running applications and is fully integrated into the runtime environment. Pageable code is aggregated into pages whose size equals the size of a virtual memory page. As soon as an application's binary is loaded, the runtime environment creates the virtual-to-physical mapping by building the MMU tables which map physical addresses to virtual addresses. At first all mappings of pageable code are rendered unusable by marking the corresponding page table entries as invalid. Now the application starts by setting the PC to the corresponding entry point, and as soon as the PC reaches pageable code, the MMU raises a prefetch abort exception because the mapping in the page table is missing. Once this exception is forwarded to the SPMM by the runtime environment, the SPMM loads the necessary page into the SPM, creates an appropriate mapping for the virtual page in the page table, and the aborted instruction fetch is restarted. Pages already residing in the SPM are

not affected by this procedure. If there is more code to be stored in the SPM than free pages are available, already loaded pages have to be replaced. Since these pages are only read and never written, they do not need to be written back to the external memory; to overwrite a page, only its mapping in the page table has to be invalidated. The SPMM keeps track of which pages are free and which are occupied and decides which page should be used next by a simple round-robin strategy.

The postpass optimizer used in this paper belongs to the Seoul National University Advanced Compiler tool kit, which is introduced in [22]. It is simply called SNACK-pop and works with the ARM/Thumb instruction set including the DSP extensions, while ARM floating-point instructions are not supported. According to the authors, using a postpass optimizer offers three advantages: every binary can be optimized for the SPM allocation technique without access to the source code and without recompiling the whole application; the optimizer allows optimizing the whole program including libraries, which would not be possible at the source level; and, last but not least, postpass optimization is perfectly suitable for low-level code layout optimizations. The optimizer takes the application's binaries and libraries in the ARM ELF format as input, which are disassembled into code sections and data sections. All unidentified symbols are resolved, and in the next step the code blocks are further partitioned into functions consisting of basic blocks; branches with hard-coded offsets are resolved and replaced by relocation information, so that the optimizer is free to move code as needed. As soon as SNACK-pop encounters a pointer pointing to constant data in its pool, it removes it and moves the data to a global data area which is not marked as pageable, before adjusting the pointer. This is necessary since thrashing can occur if large constants displace each other in the SPM due to their size. In order to gather profile data of the application, instrumentation code is added to every function and the image is simulated on an instruction set simulator to compare it with an unaltered reference image; in this way a set of profiles and instruction traces is created with different training data sets. In the next step the newly created profile data is fed back to the optimizer to determine the average number of accesses of a code block, and for each block at function level it is decided which of the categories pageable, cacheable or uncacheable it falls into, as follows. Code which is read on average less than once is classified as uncacheable, since it would mostly produce cache misses; occasional hits based on spatial locality are negligible due to the small access count. For every other block the energy which would be necessary to execute the code from the cache is calculated and compared to the energy which would be needed to execute the block from the SPM, plus the energy needed to copy the code from main memory to the SPM. If the first value is smaller, the block is assigned to the cache; otherwise it is assigned to the SPM.
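In simplified form, the block classification just described might look like the following sketch; the energy model, parameter names and numbers are invented placeholders, not the authors' actual cost model.

def classify_block(avg_reads, block_size, e_cache_access, e_spm_access, e_copy_per_byte):
    # Return 'uncacheable', 'cacheable' or 'pageable' for one code block.
    if avg_reads < 1.0:
        # Read less than once on average: executing it from the cache would
        # mostly produce misses, so keep it uncached in external memory.
        return "uncacheable"
    energy_from_cache = avg_reads * e_cache_access
    energy_from_spm = avg_reads * e_spm_access + block_size * e_copy_per_byte
    # Whichever placement needs less energy wins.
    return "cacheable" if energy_from_cache < energy_from_spm else "pageable"

# Example: a hot 256-byte block read 40 times on average (invented numbers).
print(classify_block(40, 256, e_cache_access=2.0, e_spm_access=1.0, e_copy_per_byte=0.5))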
Thereafter, function splitting similar to the one in [23] is used. At first the code blocks are re-sorted according to their intended position in memory, and new branch instructions are added as needed to keep the control flow graph valid. The next step is to split the functions into partial functions according to the partitioned blocks, in order to place them in the SPM, the cache or the external memory; the code of each partial function is then placed accordingly. Now the code has to be placed into the pages optimally. Since this problem is NP-hard, the authors apply the following heuristic. At first all loops in the dynamic call graph are found. Here the authors do not only mean loops in the traditional sense, but rather functions where the number of calls by a parent function divided by the number of accesses of the parent function exceeds a certain threshold; at the source level the effect is just the same, according to the authors. For each of these loop headers the loop members are identified by computing the loop's closure, which means the loop contains all functions in its body that are called at least as often as the loop header itself. Now the loop call graph is built (i.e. a

simple directed graph with loops as its nodes and edges between them if one loop is a subloop of the other) and traversed from the innermost loop to the outermost. For every pageable function that is not already placed in a bin, a bin is created in which it is placed. After every node in the loop call graph has been processed, the size of the bins is calculated (i.e. bounded) so that they cannot grow indefinitely. Next, all loops which contain inner loops (i.e. non-leaf nodes) are considered. Every function from the outer loops is moved to the bins of the inner loops, as long as it fits, by using the best-fit algorithm from Introduction to Algorithms [14]; the principle here is to aggregate functions with strong temporal vicinity to obtain better bin usage and avoid internal fragmentation. Once there is no function left to be moved, the size of the bins is recalculated. Functions which are marked as pageable but do not belong to any of the loops are processed last: the loop call graph is processed towards its root, and for every access to such a function a fictitious loop with a threshold of one is created. If a loop contains both an inner loop and such a function, an attempt is made to place the function into the bin of the inner loop. All functions that are left over are placed into an extra bin. Once SNACK-pop is done with the code arrangement, it builds the new ELF binaries and adds six new symbols which describe size and location of each of the three code regions (paged, cached and uncached). As soon as an ELF binary is loaded, the SPMM searches for these symbols; if it finds them, the memory assignments are established according to them, and if not, only the minicache is used for this unoptimized image.

In order to test the approach, SNACK-armism [22] was used. A fully cached system with 4-way associative, virtually-indexed, physically-tagged instruction and data caches serves as the reference system; the total execution time was used as the performance metric. The tests were run with benchmarks from MiBench [17], MediaBench [19], the official ISO MP3 decoder [6], MPEG-4 XviD encoding/decoding [7] and a combined benchmark consisting of Quicksort, Dijkstra, SHA, ADPCM-enc, ADPCM-dec, and Bitcount. The tests showed that on average 85% of the page space was used, while 15% of a page remained unallocated. On average, the use of the SPMM with a small minicache increased the performance by 12% compared to a conventional instruction cache with similar die area. On the negative side, one page fault costs on average 190 instructions, 270 loads from, and 29 stores to SDRAM. This paper seems quite promising, since it is the starting point for several follow-up works by the authors on this subject. One advantage of the approach is obviously its flexibility: it does not need to be known in advance which program will be running on the system, which might be useful for systems that are created for more than one single purpose. On the other hand, one can clearly see that there are still some disadvantages. First of all, SPM space is wasted through not fully packed pages (15% of a page remain unused on average!). Second, there is the latency incurred when the SPMM has to react, when new pages need to be loaded into the SPM, and when they actually have to be copied from main memory to the SPM.

3 SPM on Multicore Systems

After two ways to exploit the available on-chip memory optimally and improve performance on single-core systems, the problem is now expanded to multicore systems, where several new problems have to be dealt with.
Since several cores execute processes together, communication between them has to be considered. Also, often every core is assigned its own private SPM, so that the data needed for execution either has to be divided optimally between these SPMs or has to be copied to each of them. In this section three approaches for using the available scratchpad memory space to gain additional

performance are considered. The first one uses linear programming to determine the optimal storage of data on a multicore system. A technique based on integer linear programming for integrated task mapping and scheduling, SPM partitioning and data mapping for MPSoCs is introduced in "Integrated Scratchpad Memory Optimization and Task Scheduling for MPSoC Architectures" [27]. Here the authors assume they are working with data only, but the problem formulation can be generalized to data or code blocks in general. In the assumed architecture every processor has its own private SPM, but also has access to the other processors' private SPMs with a higher latency. The starting point is an application and a budget for the total SPM space, with the goal of finding a data mapping and a configuration of the processors' private SPMs that maximizes the application's performance. Since the best configuration of a processor's SPM depends on the tasks mapped to it, task mapping and scheduling on the one hand and the SPM configuration on the other depend on each other.

As said above, the architecture consists of several cores which can communicate with each other through a shared off-chip memory via a bus. A virtually shared scratchpad is used, which means every processor has its own private SPM but can also access the other SPMs with a higher latency (so-called remote SPMs). For simplification, the latency of every off-chip memory access is assumed to be constant, which also absorbs conflicts possibly occurring on the bus; furthermore, it is assumed that every memory area can be mapped to at least one SPM. The authors' goal is to find the optimal SPM configuration, which minimizes the initiation interval, for a given task graph, architectural model and limit on the available SPM space. A task graph is a directed acyclic graph which describes the single tasks of an application as nodes and the communication between them as edges. Every task can be mapped to every processor, so an execution time for every processor is associated with every node. Since the execution time of a task on a processor depends on the position of the data in the SPM, the base execution time is calculated with all data variables placed in the off-chip memory. An edge from one task to another in the graph models a data transfer between them, so the amount of data transported is associated with every edge; in addition, every task is associated with the size of and access frequency to its data variables, obtained by profiling. It is assumed that an application is given as such a graph. The pipelined implementation benefits from different processors executing different iterations of the task graph at the same time. In sequential execution of an application the aim is to minimize the execution time of a single iteration of the task graph, whereas in the pipelined implementation the aim is to minimize the aforementioned initiation interval, the time between the starts of two consecutive iterations of the task graph. So the original problem is divided into three smaller problems: first, the mapping/scheduling of the tasks to processors and of the communication between the tasks; second, the allocation of the optimal size of each private SPM; and third, the allocation of the data variables of each task to the single SPMs. All these problems can be formulated as integer linear programming problems.
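A minimal sketch of the first sub-problem, the task-to-processor mapping with its per-task execution time (constraints (16) and (17) in the text below), could look as follows; the task set, processor set and timing table are invented, and PuLP is only used for illustration.

from pulp import LpProblem, LpMinimize, LpVariable, lpSum

tasks = ["T1", "T2", "T3"]
procs = ["P1", "P2"]
time = {("T1", "P1"): 4, ("T1", "P2"): 6,
        ("T2", "P1"): 3, ("T2", "P2"): 2,
        ("T3", "P1"): 5, ("T3", "P2"): 5}

prob = LpProblem("task_mapping", LpMinimize)
X = {(i, j): LpVariable(f"X_{i}_{j}", cat="Binary") for i in tasks for j in procs}

# (16): every task is mapped to exactly one processor.
for i in tasks:
    prob += lpSum(X[i, j] for j in procs) == 1

# (17): execution time of task i under the chosen mapping.
Time = {i: lpSum(X[i, j] * time[i, j] for j in procs) for i in tasks}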
For simplification, in the following description the authors assume that the given MPSoC architecture consists of four homogeneous processors, so that the execution time of a task is the same on every processor; they also mention, however, that this is no requirement of the problem formulation. The first sub-problem is now described as an ILP formulation that optimizes the performance by task mapping/scheduling. This is the initial situation: if an application has N tasks, they are denoted as T_1, ..., T_N. Without loss of generality, T_N is the last task (i.e. it has no successors in the task graph); if there are several such tasks, a dummy task is added as the last one. Further on, there are M available homogeneous

or heterogeneous processors, described as P_1, ..., P_M, and with every task T_i its execution time on every processor P_j, time_{i,j}, is associated. As mentioned before, it is assumed that every variable is placed in the off-chip memory. One task can be mapped to exactly one processor, which is expressed by X_{i,j}: X_{i,j} = 1 if task T_i is mapped to processor P_j, and 0 otherwise. So

\sum_{j=1}^{M} X_{i,j} = 1    (16)

holds. The execution time of every task T_i is expressed by

Time_i = \sum_{j=1}^{M} X_{i,j} \cdot time_{i,j}    (17)

StartTask_i and EndTask_i describe the points in time at which the task begins and ends respectively, so that

EndTask_i = StartTask_i + Time_i - 1    (18)

holds. The optimization's objective function is the smallest value of EndTask_N for the last task T_N of the application (i.e. the critical path through the task graph is minimized). This function now has to be optimized with respect to the already mentioned constraints (16)-(18) plus the following ones. Every predecessor of a task T_i has to be completely processed before its execution can be started. If a predecessor T_h was additionally mapped to another processor, the task has to wait for the end of the communication between it and its predecessor, and a latency of comm_{h,i} occurs. This inter-task communication is modeled as a task C_{h,i} (where T_h is the predecessor and T_i is the task which is now to be executed) on the shared bus. The points in time StartComm_{h,i} and EndComm_{h,i} are defined analogously to their task equivalents. Now

StartComm_{h,i} \ge EndTask_h + 1    (19)

and

StartTask_i \ge EndComm_{h,i} + 1    (20)

have to hold to ensure that predecessor tasks and the communication with them are completed before the execution of a task. Costs arise only if T_h and T_i are mapped to different processors, which is expressed by the following constraint:

EndComm_{h,i} = StartComm_{h,i} + L_{h,i} \cdot comm_{h,i} - 1    (21)

where L_{h,i} = 1 if and only if T_h and T_i are mapped to different processors. Next it has to be ensured that two independent tasks which are assigned to the same processor have disjoint lifetimes. For two independent tasks T_i and T_{i'}, L_{i,i'} is defined as above, and additionally B_{i,i'} = 0 if T_i and T_{i'} are assigned to the same processor and T_{i'} is executed after T_i; B_{i',i} is defined analogously. This condition can be formulated with the next three constraints:

B_{i,i'} + B_{i',i} - L_{i,i'} = 1    (22)

StartTask_{i'} \ge EndTask_i \cdot B_{i,i'} + 1    (23)

StartTask_i \ge EndTask_{i'} \cdot B_{i',i} + 1    (24)

Of course, communications between single tasks must not overlap either, and so for two different communication tasks C_{h,i} and C_{f,g} it is defined:

V_{h,i,f,g} + V_{f,g,h,i} = 1    (25)

StartComm_{h,i} \ge EndComm_{f,g} \cdot V_{h,i,f,g} + 1    (26)

StartComm_{f,g} \ge EndComm_{h,i} \cdot V_{f,g,h,i} + 1    (27)

where V_{h,i,f,g} = 1 if and only if C_{f,g} happens after C_{h,i} (and 0 otherwise); V_{f,g,h,i} is defined analogously.

Now the formulation of the task scheduling is extended to consider pipelined scheduling: tasks are distributed among pipeline stages of the same size, in a synchronous pipelined execution. The initiation interval, which describes the length of a pipeline stage, is equal to the maximal time needed to handle all tasks of a stage. The objective is to distribute the tasks among the pipeline stages, with regard to their dependencies and required resources, so that the initiation interval is minimized. It is important that every processor can be used in only one pipeline stage, because in the steady state all stages are executed in parallel on different instances of the tasks. Then again, every stage can use more than one processor, so there is a maximum of M pipeline stages. As a consequence of this observation the formulation of task mapping and scheduling is kept as before, and in an additional step the processors are assigned to the different pipeline stages. To model the assignment of a processor to a pipeline stage, the variable W is introduced: W_{j,s} = 1 if and only if processor P_j is assigned to the s-th pipeline stage. So

\sum_{s=1}^{M} W_{j,s} = 1    (28)

holds. With this condition it is possible that no processor is assigned to some stages, which results from the fact that some stages can have more than one processor; such empty stages are ignored. The function which has to be optimized is the minimum of the initiation interval. To describe the maximum amount of time needed to execute all tasks of a stage, the values StartStage_s and EndStage_s are introduced, which mark the points in time at which a stage s starts respectively ends. So

II \ge EndStage_s - StartStage_s + 1    (29)

holds for every s = 1, ..., M. Obviously, a pipeline stage must not overlap with another one, so similar to the tasks it is defined:

B_{s,t} + B_{t,s} = 1    (30)

StartStage_s \ge EndStage_t \cdot B_{s,t} + 1    (31)

StartStage_t \ge EndStage_s \cdot B_{t,s} + 1    (32)

where B_{s,t} = 1 if and only if stage t is executed after stage s, and B_{t,s} = 1 if and only if stage s is executed after stage t. A pipeline stage has to cover the whole execution time of the processors assigned to it, and this is modeled for every stage s = 1, ..., M and every processor j = 1, ..., M by

StartStage_s \le StartProc_j + (1 - W_{j,s})    (33)

and

EndStage_s \ge EndProc_j - (1 - W_{j,s})    (34)

StartProc_j and EndProc_j mark the points in time at which processor P_j starts respectively ends its execution. They are calculated from the earliest start point respectively the latest end point of all tasks assigned to the processor. For every processor j = 1, ..., M and every task i = 1, ..., N

StartProc_j \le StartTask_i + (1 - X_{i,j})    (35)

and

EndProc_j \ge EndTask_i - (1 - X_{i,j})    (36)

have to hold. The communication tasks, which are executed on a shared bus that is used throughout all pipeline stages, must not be forgotten. Communications of different stages are executed at the same time within one II. The constraints (25) to (27) already prevent the communications within one pipeline stage from overlapping, but it also has to be enforced that the communications of different pipeline stages do not overlap. In order to achieve that, the authors normalize the execution intervals of the communication tasks by setting the starting time (respectively ending time) relative to the starting time (respectively ending time) of the pipeline stage to which they are assigned. Now the variable F is defined: F_{h,i,s} = 1 if and only if C_{h,i} is assigned to stage s, and 0 otherwise. So it can be expressed:

\sum_{s=1}^{M} F_{h,i,s} = 1    (37)

Every communication task is included in the interval of the stage to which it is assigned:

StartStage_s \le StartComm_{h,i} + (1 - F_{h,i,s})    (38)

and

EndStage_s \ge EndComm_{h,i} - (1 - F_{h,i,s})    (39)

Finally, the mutual exclusion of all pairs of independent communication tasks C_{h,i} and C_{f,g} is demanded:

(StartComm_{h,i} - StartStage_s) \ge (EndComm_{f,g} - StartStage_t) \cdot V_{h,i,s,f,g,t} + 1    (40)

and

(StartComm_{f,g} - StartStage_t) \ge (EndComm_{h,i} - StartStage_s) \cdot V_{f,g,t,h,i,s} + 1    (41)

In this formulation V_{h,i,s,f,g,t} = 1 if and only if C_{h,i} is scheduled in stage s, C_{f,g} in stage t, and the normalized interval of C_{h,i} is scheduled after the normalized interval of C_{f,g} (V_{f,g,t,h,i,s} analogously).

Now to the heart of the problem: the SPM partitioning and data allocation. The number of all variables is denoted as R, and some of them may be used by several tasks. The number of accesses to a variable is determined by profiling; a variable is associated with this number and with its size in bytes, called area_v. The access count depends on the processor on which the task is executed; it is expressed by the value freq_{v,i,j}, which tells how often a variable v is accessed if task T_i is mapped to processor P_j. Each of these accesses causes a different latency, depending on where v is located: a latency of zero if v is located in processor P_j's private SPM, a constant latency of cross_penalty if it is located on a remote

SPM, and a constant latency of penalty (which will generally be larger than cross_penalty) if it is located in the off-chip memory. Whether a variable v is located in the SPM of a processor P_j is expressed by S_{v,j}. In the described architecture a variable can be allocated to at most one SPM:

\sum_{j=1}^{M} S_{v,j} \le 1    (42)

One constraint of the problem is the SPM area available in total, which is used as input to the problem:

\sum_{v=1}^{R} \sum_{j=1}^{M} S_{v,j} \cdot area_v \le total\_area    (43)

The objective function is one of the two already mentioned, depending on whether a pipelined setting is used or not. The last thing to consider is the execution time, which can only be reduced by allocating data to on-chip memory. To account for this, equation (17) is replaced by the following two constraints:

Time_i = \sum_{j=1}^{M} \left( X_{i,j} \cdot time_{i,j} - \sum_{v=1}^{R} freq_{v,i,j} \cdot gain_{v,i,j} \right)    (44)

gain_{v,i,j} = Y_{v,i,j} \cdot penalty + Z_{v,i,j} \cdot (penalty - cross\_penalty)    (45)

where Y_{v,i,j} = 1 if and only if variable v and task T_i have both been mapped to processor P_j, and Z_{v,i,j} = 1 if and only if task T_i has been assigned to processor P_j and v has been assigned to another processor's SPM.

To test the technique presented here, three strategies are compared. The equal partitioning strategy (EQ) ignores data allocation to the SPM during task scheduling; the available SPM is divided equally between all tasks. This is a simple knapsack problem, for which optimal solutions are known according to the authors. The partially flexible strategy (PF) also ignores data allocation to the SPM during task scheduling; here SPM partitioning and data allocation are computed simultaneously by a simplified version of the ILP in which some variables are already known. The completely flexible strategy (CF) works as described above, with simultaneous task scheduling, SPM partitioning and data allocation. For the actual tests five benchmarks were used. Four of them were taken from MiBench [17] and MediaBench [19]; the fifth benchmark, called enhance, was an altered version of the image enhancement application from [24]. The applications were profiled to find the important execution blocks, and every application was then divided into a certain number of tasks, where every task corresponded to such a block. This information was used to find the dependencies between tasks and to compute the communication costs, and so the task graph of every application was built. For the tests the SimpleScalar cycle-accurate architectural simulation platform [11] was used. An instrumented version of the SimpleScalar profiler was used to determine the size of the variables, the access frequency and the execution time in processor cycles for every single task. As mentioned before, it was assumed that off-chip accesses have constant latency and do not lead to conflicts on the bus. Both scalar variables and array variables were considered. For the solution of the ILP the already mentioned solver CPLEX [4] was used. At first EQ and PF were compared to each other: the flexibility of PF compared to EQ leads to a significant performance increase over the first strategy in most of the cases. While comparing PF

and CF, it was observed that further performance increases depend heavily on the characteristics of the applications. The worst case occurred when the SPM was neither too small nor too big; according to the authors this was expected, because this is the most difficult case for scheduling. Altogether it was shown that flexible SPM partitioning can increase the performance by 60% compared to equal partitioning, and that integrating memory optimization into task scheduling can increase performance by up to 80%. This paper shows quite extensively the interaction between task scheduling and memory optimization, and it also shows that the expensive CF strategy is not always necessary for optimal solutions and better performance, depending on the application.

A somewhat unusual approach to gaining performance on multicore systems is presented in "Exploiting Shared Scratch Pad Memory in Embedded Multiprocessor Systems" [18]. The authors propose an optimization algorithm targeting the elimination of extra off-chip memory accesses caused by inter-processor communication; this approach is especially suited for image-processing embedded systems. They focus on an SoC with an off-chip DRAM which can hold data as well as instructions. The SoC consists of multiple processors with their private SPMs; in addition, there are inter-processor communication/synchronization mechanisms, clock circuitry and some ASICs. The SPMs form a virtually shared SPM (VS-SPM), in which every processor can access its own private SPM as well as the remote SPMs of the other processors quickly, while only an access to the DRAM is much more expensive and slower than an access to one of the SPMs. The system uses a loop-level parallelized application as input, and in the model described every loop nest is parallelized as much as possible. All processors work together on the computation of one parallel loop, and every processor executes a subset of the loop iterations. Once the computation of a loop is completed, the processors synchronize on a construct called a barrier before they start the computation of the next loop; for this (and any other) communication they use fast on-chip links. Because of this strategy every processor works on its own part of the array in the code, and since the local SPM is generally much smaller than the part of the array being worked on, the processor further divides these parts into data tiles for execution. When work on such a tile is finished, it is either dropped or, if it was modified, written back to off-chip memory. To improve the reusability of the data in the VS-SPM, intra-processor reuse or inter-processor reuse can be exploited. Intra-processor reuse targets optimizing the access pattern of a single processor, and according to the authors this strategy does not make much sense on a multicore system, since it totally disregards inter-processor data sharing effects; these effects are very important in environments where the memories of several processors may partially overlap. In contrast, inter-processor reuse considers the application as a whole, i.e. the access pattern of every processor is taken into account. The problem of compiling array-dominated applications for VS-SPM based systems is divided into two smaller problems by the authors: data tile shape/size selection and access pattern selection. The first one is described as the first step in compiling: it is the determination of the shape and size of the data tiles.
Important in this process are the available SPM space and the data access pattern of the application; the authors consider this problem important but do not pursue it any further, since the paper's focus is on the second problem. Throughout the paper all tiles are assumed to be rectangular, and all processors are assumed to have the same SPM size and to work on identically shaped tiles. The access pattern selection is the scheduling step: here, for a known tile shape/size, an access pattern is determined which minimizes additional off-chip accesses by reducing the inter-processor communication. A compilation technique is now presented to address this problem. The degree of freedom of a given tile describes the tile's options to move across the given data space, and an access pattern matrix H is defined which describes in which sequence the data tiles are accessed.

The dimensions of this matrix are dictated by the degree of freedom. The authors limit themselves to two-dimensional arrays, but state that their technique can be adapted to higher dimensions as well. Every column of such a matrix corresponds to an axis, and the value of the vector of this axis determines the direction in which the array will be traversed along this axis. Direction vectors for every processor are defined with respect to its neighbours; the goal is to ensure that whenever a processor needs non-local elements of an array, another processor can deliver them from its SPM. To achieve this goal, the scheduling strategy presented here makes use of the following mathematical law.

Scheduling equality:

H_i^T \cdot v_{i,j} = H_j^T \cdot v_{j,i}    (46)

For two processors i and j, schedules given by matrices H_i respectively H_j that satisfy this equality reduce additional off-chip memory accesses. The algorithm based on this law consists of three steps. A symbolic scheduling matrix is assigned to every processor, its rank equal to the degree of freedom of the data tile. Then the scheduling equalities are built from the direction vectors and the scheduling matrices; together these equalities form the constraints for eliminating additional DRAM accesses caused by inter-processor communication. Finally, the scheduling matrix of an arbitrary processor is initialized with an arbitrary value, and the scheduling matrices of the remaining processors can then be computed using the scheduling equalities. For nested loops which may include flow dependencies, the algorithm is nearly the same as for loops without dependencies; only the scheduling matrix has to respect these dependencies. The approach for this problem consists of two steps: at first the first two steps of the aforementioned algorithm are applied, but the algorithm differs in the initialization of the scheduling matrix. Instead of choosing an arbitrary matrix, all acceptable matrices for one processor, and then all options for the corresponding matrices of every other processor, are considered. From all these solutions the matrix which does not violate any data dependencies is chosen; if there is no such matrix, a default scheduling scheme that does not violate any data dependency is used, which does not necessarily avoid extra DRAM accesses.

The test environment for this approach consisted of a compiler environment and an in-house simulator. The algorithm was implemented with the SUIF (Stanford University Intermediate Format) experimental compiler infrastructure [10]. The simulator takes parallel C code as input and simulates a multicore environment. A local SPM with an access latency of two cycles was assumed for every processor, and inter-processor communication was simulated as well; for every necessary synchronization a latency of one cycle was assumed. In order to parallelize the applications, an aggressive strategy was used: every loop of a nest except the innermost one was parallelized if legal. Four array-dominated applications from the image processing realm were used: 3D, dfe, splat and wave. But the authors failed to deliver the performance results in their work, despite the fact that they stated an increase in performance in their abstract. Also, the reported energy savings are to be regarded with caution, since the authors only compare the energy consumption of a system using their algorithm with that of a system using purely local SPM optimizations. So this might be an interesting approach to the problem of using SPMs on multicore systems, but the results do not show much information.
Also, the brute-forcing in the case of dependencies in the loop nests is rather disappointing.
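To make the scheduling equality (46) from this approach a little more concrete, the following toy check compares its two sides for a pair of neighbouring processors; the scheduling matrices and direction vectors are invented examples, not values from the paper.

import numpy as np

H_i = np.array([[1, 0],
                [0, 1]])    # processor i scans its tile left-to-right, top-to-bottom
H_j = np.array([[-1, 0],
                [0, 1]])    # processor j scans its tile right-to-left, top-to-bottom
v_ij = np.array([1, 0])     # direction from processor i towards its neighbour j
v_ji = np.array([-1, 0])    # opposite direction from j towards i

# The equality H_i^T v_ij == H_j^T v_ji should hold so that border data of one
# processor is present in the neighbour's SPM exactly when it is needed there.
print(np.array_equal(H_i.T @ v_ij, H_j.T @ v_ji))   # prints True for this example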

After two different techniques for simple multicore systems, it is time to expand the problem to several levels of parallelism. With multiple levels of parallelism, the problem of making the best use of the available SPM space gets more complicated.

Figure 3: Energy savings in %: Non-local SPM optimizations compared to local SPM optimizations (sensitivity to the number of processors)

Figure 4: Energy savings in %: Non-local SPM optimizations compared to local SPM optimizations (sensitivity to the shape of the tile and the size of the available SPM space)

In "Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories" [12] the problem of scratchpad management for parallel architectures with more than one level is addressed. The authors developed a framework based on the polyhedron model for loop nest optimization which automatically allocates storage space in the SPM, determines the access functions of references to arrays in SPMs, and generates the code for moving local scratchpad memory data to global off-chip memory (and vice versa). The access functions of all array references and the iteration spaces of all statements in a program block are used as input by the framework, which partitions the set of all data spaces accessed by all references of the array which is to be stored in the SPM; the partitioning is achieved through transformation into an equivalent graph problem. The storage allocation is accomplished by the framework through an algorithm in four steps. At first it generates one local memory array for each partition. For each partition of data spaces the framework determines whether it has adequate reuse in a program block, which is the case if at least one reference accesses the data space irregularly or if there is other noteworthy reuse of the data space; such partitions are marked as beneficial to be copied to the SPM. The next step is to find the local memory storage for the partition. For every partition of data spaces the framework determines the upper and lower bounds of every dimension of its convex hull in the form of an affine function of the parameters of the program block, using parametric integer programming software; these bounds induce the size of the local memory array to be created for the partition. Next, the access functions of the local memory array references have to be determined by the framework. The

aim is to find, for each reference to the original array in the given program block, the corresponding access function for the local memory array reference; for this purpose the bound expressions of each dimension of the aforementioned convex hull are determined. Some dimensions of the original data space do not appear in the convex hull and are represented as affine functions of the dimensions that do appear in the polytope. With the help of CLooG [3] an array access function matrix is built, in which every row represents the array subscript of one dimension of the original data space. From this matrix the rows that belong to dimensions of the original array that do not appear in the local memory array are removed. Now it is possible to calculate the corresponding local memory access from the original global memory access. For every partition of data spaces for which a local memory array is created, the following procedure is used: with CLooG the data spaces that are accessed by read (respectively write) references are scanned and the loop structure of the code that moves data from global memory to local memory (respectively vice versa) is generated. Then the loop body is built by creating a matrix out of an identity matrix and an additional row for each dimension that does not appear in the convex hull, representing it as an affine function of the dimensions that do appear in the convex hull and of the program parameters. The upper bounds on the data moved into (respectively out of) the local memory array are estimated using, among others, techniques already used for the storage allocation. For further calculations over the polyhedra the authors use PolyLib [8]. In future versions of their framework the authors plan to implement further optimizations of the movement code based on data dependency information, in order to save space on the SPM.

Now the authors use this framework for their multi-level tiling approach on multiple levels of parallelism. They use a framework [13] to find the available parallelism in a program by discovering groups of permutable loops as well as time loops and space loops. The available parallelism is distributed over the various levels of parallel units of the system by tiling the space loops. Generally there are as many levels of tiling as there are levels of parallel units, but additional levels of tiling can be added when necessary by tiling the permutable loops. As an example, the authors consider a two-level parallel architecture. The number of parallel processes at the outer and inner levels is fixed to be a multiple of the number of physical parallel processors at the respective level. Then the space loops of the outer level are tiled into equal tiles across the outer-level parallel processors. If one of those tiles requires more local memory than available, an additional level of tiling is introduced, which means that the tile is split into sub-tiles which are processed sequentially within the outer-level tile. With an optimization problem formulated by the authors, an optimal set of tile sizes as the atomic unit of computation in an outer-level tile is then found by an algorithm designed to minimize the data movement cost between local memory and global memory, under the constraint that the active local memory used by the process does not exceed a given upper limit. After the outer-level tiling is completed, the inner-level tiling of the space loops follows.
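Going back to the access-function step, the rewriting of a global array reference into a reference to the per-block local (SPM) array can be illustrated with a toy example; the access function, the convex-hull bounds and the simple shift by the lower bounds are invented simplifications, not actual CLooG output.

import numpy as np

# Original access A[2*i + 1][j] inside a program block, written as an
# access-function matrix F acting on the iteration vector (i, j, 1).
F = np.array([[2, 0, 1],    # first array dimension: 2*i + 1
              [0, 1, 0]])   # second array dimension: j

# Suppose the convex hull of the accessed data space for this block is the
# box [lb0, ub0] x [lb1, ub1]; the local array is sized to that box and the
# local access is the global access shifted by the lower bounds.
lb = np.array([1, 0])

def local_access(i, j):
    global_idx = F @ np.array([i, j, 1])
    return global_idx - lb   # index into the per-block local memory array

print(local_access(3, 5))    # -> [6 5] for this invented example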
The approach was tested on a GPU. The architecture of a GPU offers several levels of parallelism, namely between the processor cores and between the multiple SIMD units of each processor. The processors communicate with each other via an off-chip DRAM, while the SIMD units communicate with each other through a fast local scratchpad. The tests were conducted on an NVIDIA GeForce 8800 GTX GPU. The CUDA kernels were compiled with nvcc to generate code which was launched from the host system; the host system was an Intel Core2 Duo processor at 2.13 GHz with 2 MB L2 cache. For the tests two kernels were used, namely an MPEG-4 Motion Estimation (ME) kernel and a 1-D Jacobi kernel; the first needs no synchronization, while the second needs synchronization between the thread blocks. Compared to an execution without SPM and only with GPU DRAM, MPEG-4 ME and 1-D Jacobi can be executed 8 respectively 10 times faster; compared with the CPU these factors increase to 100 and 15. Generally it was observed

that the performance increased with the number of thread blocks until the point was reached where the synchronization costs took over. So the authors of the paper show an approach to optimally use scratchpad memories in systems with multiple levels of parallelism. They base their technique on the polyhedron model and use linear programming to solve some smaller subproblems, but they also state that their technique reaches its limit when the number of thread blocks passes the point where synchronization becomes too expensive.

Figure 5: Execution Time: 1-D Jacobi for several problem sizes

Figure 6: Execution Time: 1-D Jacobi for smaller problem sizes for varying thread blocks

4 SPM on Multitasking Systems

The last aspect of the use of scratchpad memories for performance improvement is the situation on multitasking systems. In a context where several different applications explicitly compete for the available memory resources, an extension of an approach already introduced will be shown. In "Scratchpad Memory Management in a Multitasking Environment" [16] an SPM manager capable of code allocation for dynamically created processes is presented as an extension of the work from Section 2. The authors designed an SPM manager that loads the code of running processes into the SPM at runtime, which distinguishes their approach from the usual designs that work before execution. For this purpose a new dynamic SPM code allocation technique for systems running an operating system

Figure 7: Execution Time: 1-D Jacobi for larger problem sizes for varying tile sizes

with virtual memory and preemptive multitasking was developed. In short, the code is profiled and a postpass optimizer sorts the code of an application based on the access frequency. Temporally local code is then packed into pages with the size of an MMU page, whereas local data is separated from the code into data pages. Every page's binary contains information about the access frequency and whether it belongs to a loop, added by the postpass optimizer. These binaries are created independently of the available SPM size, so the decision which page is loaded into the SPM is made at runtime: whenever a process changes its status (i.e. is created, terminated or otherwise changed), the SPM manager is notified by the OS; it then loads code pages into the SPM by intercepting the MMU's page fault exceptions and allocates the SPM to the currently running processes. The SPM manager works as follows. The SPMM expects both a traditional cache and a software-managed SPM to be present in the system. It then decides which code pages should be loaded into the SPM and which into the traditional cache. This decision depends highly on the SPM sharing strategy in use; in any case, only SPM-optimized pages are loaded into the SPM, whereas unoptimized pages are loaded into the cache by default. So the SPMM needs to be informed whenever a new process is created, terminated, scheduled, changes its ready-to-run status, or when an MMU page fault occurs. When a new process is created, its binary contains a map listing all code block access frequencies and loop affiliations, if it is SPM-optimized. When it is informed, the SPMM redistributes the SPM between all processes which have access to it, and the SPMM is also able to modify a process's virtual memory mapping, depending on the active sharing strategy, the information in the aforementioned map and the number of available pages. To load the pages into the SPM there is a little trick: if a page needs to be loaded into the SPM before execution, its memory mappings are marked as invalid, so that whenever they are reached in the control flow the MMU triggers a page fault exception; this alerts the SPMM, which loads the pages into the SPM and fixes their memory mapping before restarting the last instruction. The same procedure is used if there is no free page in the SPM: the SPMM chooses a target page according to its sharing strategy and marks its memory mapping as invalid, so that it triggers a page fault when it is accessed the next time. The SPMM uses a round-robin strategy to decide which page has to be replaced, so there is no need for additional hardware and the computation costs can be kept low. The decision which page will be loaded into the cache and which into the SPM before the execution of a process is based on the number of pages available to the process. If the number of needed pages surpasses the number of available pages, thrashing occurs; in this case the SPMM transfers the least used page to the cache.
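A highly simplified sketch of what the page-fault path of such an SPM manager could look like is given below; the class, its data structures and the process interface are invented for illustration, since the real SPMM hooks into the operating system's page fault handler.

class SpmManager:
    def __init__(self, num_spm_pages):
        self.frames = [None] * num_spm_pages   # which virtual page occupies each SPM page
        self.next_victim = 0                   # round-robin replacement pointer

    def on_page_fault(self, process, vpage):
        # Pick a free SPM page if one exists, otherwise evict round-robin.
        if None in self.frames:
            frame = self.frames.index(None)
        else:
            frame = self.next_victim
            victim = self.frames[frame]
            # Code pages are read-only, so no write-back is needed; in a
            # multitasking setting the victim page may belong to another process.
            process.invalidate_mapping(victim)
            self.next_victim = (frame + 1) % len(self.frames)
        process.copy_page_to_spm(vpage, frame)   # copy the code page into the SPM
        process.map_page(vpage, frame)           # fix the MMU page table entry
        self.frames[frame] = vpage
        # the faulting instruction fetch is restarted by the exception handler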

To allow easy integration into existing operating systems, the SPMM is built as a module. It has to be notified about the creation, scheduling or termination of a process, about changes of a process's ready-to-run status, and about page faults, in order to be able to react. Basically, the SPM can be seen as an additional layer in the hierarchy of virtual memory with paging, so it does not interfere with paging to external memory media, and only minor changes to the page fault exception handler are necessary to forward page faults caused by the SPM to the SPMM.

The authors suggest three different SPM sharing strategies; where a page is placed and which page is replaced when a page fault exception occurs depends on the sharing strategy. As soon as a process joins or leaves, the available space is redistributed among the active processes. The shared strategy is simple: the SPM can basically be understood as a fully-associative software-managed cache with a round-robin replacement strategy. All processes share the SPM, and a single pointer indicates which page will be replaced next. As soon as a page fault occurs, the pointer is placed on the next free page, and if there is no free page left in the SPM, the pointer is placed on the page which should be replaced when a new page has to be loaded. This strategy is simple to implement and requires no complex computation, but it is obviously not fair. In the dedicated strategy the SPM is divided among the active processes, where each process has its own area which only it can access. In these areas there is also a pointer pointing at the page to be replaced next, and if there is no free page left, the next page to be replaced is again decided by a round-robin strategy. The size of these areas is determined by one of two division policies. In the maximum-working-set policy, the size of the maximum working set of each process, calculated by the postpass optimizer, is used; the size of the area allocated to a process is determined proportionally to the size of the process's maximum working set. This policy is static, which means that as long as there is no change in the number of active processes, there is no change in the size of each process's allocated area. The on-demand policy divides the SPM according to the current working sets of the running processes by using the average number of page faults over a certain time for each process. The average number of page faults is determined by relating the number of pages reserved for a process to its current working set. So in this strategy the number of page faults is continuously measured and the partitioning adapted to the average case. Shortly before a process is scheduled, the current average number of page faults is compared to the previous average, and if the difference is more than minor, the division of the areas is adjusted again. The last one is the dedicated-with-pool strategy, which can be seen as a mix of the two previous strategies: one part of the SPM is shared among all processes and is always assigned to the currently active process, while the rest of the SPM is divided as in the dedicated strategy. This strategy is the general case, since with a shared area of size zero the SPM is divided as in the dedicated strategy, and with a shared area the size of the whole SPM it is divided as in the shared strategy.
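The division policies for the dedicated (and dedicated-with-pool) strategy could be sketched roughly as follows; the proportional rounding, the process names and the page numbers are invented placeholders, not the authors' implementation.

def divide_spm(total_pages, working_sets, shared_fraction=0.0):
    # working_sets: dict process -> maximum working set size in pages.
    pool = int(total_pages * shared_fraction)     # pages shared by all processes
    dedicated = total_pages - pool
    total_ws = sum(working_sets.values()) or 1
    # Each process gets a dedicated area proportional to its maximum working set.
    areas = {p: max(1, dedicated * ws // total_ws) for p, ws in working_sets.items()}
    return pool, areas

# Example: 32 SPM pages, three processes, a quarter of the SPM as shared pool.
print(divide_spm(32, {"mp3": 8, "xvid": 16, "pgp": 4}, shared_fraction=0.25))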
The information needed by the SPMM is delivered by the postpass optimizer. To obtain the best results and save energy, frequently used code has to be executed from the SPM, whereas instructions which are rarely used have to be executed from the cache or even from external memory. Necessary to that end is the postpass optimizer, which builds the profile data used by the SPMM. The postpass optimizer used in this paper is based on the one in [15], which has already been presented. It first disassembles the ARM ELF binaries, then analyzes the traces from training runs and sorts the code of an application by access frequency. In doing so, temporally local code is aggregated into pages of the size of an MMU memory page; furthermore, it builds the control flow graph of the whole application and thereby discovers loops. Once the whole code is aggregated into

pages, the loop hierarchy is used to determine the maximum working set of the application. Furthermore, to increase the code density within a page the following optimizations are used: at the function level, irregularly used blocks are separated from regularly used ones and assigned to distinct pages, while at the loop level functions are arranged according to their access frequency. Functions of the innermost loop are handled first and assigned to pages first; the outermost loop's functions are handled last, and their code is only placed into those pages to avoid internal fragmentation. Furthermore, the optimizer extracts constant pools, which are small data sections in the ARM code containing constants that are too big to be encoded as immediate operands or global data addresses, because if they were placed in the SPM they would cause further delays. These constant pools are aggregated into pages like normal code and, due to the narrow range of the immediate operands, placed near the corresponding code. After the optimization, the application no longer contains separate text areas and data areas but single pages for code and for data. Besides building the final data and code layout, the postpass optimizer adds a code block map; this map contains every block's access frequency and its loop affiliation. Last but not least, the new ELF binaries, which behave the same on systems without an SPM, are built.

The tests were carried out on a cycle-accurate ARM architecture simulator [20]. A small RTE was built, which consisted of a loader, a preempting round-robin scheduler and an SPMM; all applications had the same priority. The loader loads processes from RAM and assigns stack and heap areas to newly created processes. As soon as a process tried to access an unmapped instruction, a latency of 69 instructions occurred during the loading of the page. 15 benchmarks were used for the tests to ensure a representative selection. Nine of them were taken from MiBench [17] and MediaBench [19]; in addition, an H.264 video decoder [5], the official ISO MP3 decoder [6], MPEG-4 XviD encoding/decoding [7] and a public-key encryption tool, Pretty Good Privacy (PGP) [9], were used. Furthermore, some applications were combined into a benchmark called combine. As a reference system, an RTE was created on an ARM926EJ-S core with virtually-indexed, virtually-tagged caches; its cache was set to the smallest size which allowed a miss ratio of approximately 1%. For the reference case the benchmarks were composed of the original single-process applications. With the dedicated strategy the approach was disappointing: completely separated private SPMs for every application led to many page faults, and on average performance was increased by only 19%. With a rising size of the shared pool, the dedicated-with-pool strategy outperformed the reference system: with pool sizes of 1/4, 2/4 and 3/4 of the SPM, the performance increased by 32%, 39% and 43%, respectively. Due to the overall smallest number of page faults, the shared strategy reached the highest performance; its susceptibility to applications with a big working set (which could lead to the displacement of pages of smaller applications) did not carry much weight, as expected by the authors. It led to up to a 47% performance increase. This work shows a quite useful application of the dynamic allocation technique developed by the authors earlier.
It indicates that dividing the available SPM space into private SPMs for every application does not work effectively and comes up with two better solutions to the problem. Additionally, it is a way of allocating SPM space to several applications without the need to know before execution which applications will be running (and when), which makes the approach quite flexible. However, some of the flaws remain: with the allocation of data and instructions to pages, memory space is still wasted, and there still remains a certain latency for the SPMM to load pages into the SPM when needed.

5 Conclusion

In this paper, first two different ways of storing instructions in an SPM in order to achieve a performance gain were introduced. The first work addressed the overlay problem of storing both instructions and variables optimally; it used linear programming. The second took a new route by using the SPM dynamically through mechanisms of virtual memory management. As shown, it suffers from some problems, like not entirely used SPM space, but it set the stage for the approach to using scratchpad memories in multitasking environments shown later in this paper. In the second part of this paper, the problem was extended to the use of multiple processors. The first paper in this part explored the interaction between task scheduling and memory optimization; it showed that the computationally expensive approach introduced there is not always necessary in order to achieve optimal SPM allocation. Next was a quite original approach to gaining performance by reducing inter-processor communication with optimal access patterns to the scratchpad memories, but it failed to deliver its results in meaningful data. The last paper in this section extended the problem of using scratchpad memories in systems with several processors to multiple levels of parallelism; the technique proposed in this paper reaches its limits when the number of thread blocks passes the point where the synchronization costs overwhelm the performance gain. The last part again picks up the second approach from the first part and considers the aspects of multitasking systems. It shows two solutions that are better than dividing the available SPM into private SPMs for every application, but it still has some flaws, like wasted memory space and a certain latency caused by the scratchpad memory manager used.

6 Bibliography

References

[1] ARM. Available at
[2] Benchmark Suite for Multimedia and Communication Systems. Available at edu/mediabenchii/.
[3] CLooG: The Chunky Loop Generator. Available at
[4] CPLEX. Available at
[5] H.264 Video Codec. Available at
[6] MP3 Reference Decoder. Available at
[7] MPEG-4 Video Codec. Available at
[8] PolyLib - A library of polyhedral functions. Available at
[9] Pretty Good Privacy (PGPi). Available at
[10] S.P. Amarasinghe, J.M. Anderson, M.S. Lam & C.W. Tseng (1995): An overview of the SUIF compiler for scalable parallel machines. Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing.
[11] Todd Austin, Eric Larson & Dan Ernst (2002): SimpleScalar: An Infrastructure for Computer System Modeling. Computer 35(2), pp , doi: / Available at