November 2009

Yesterday the InternetNews.com released a piece of news about the end of the Cell Broadband Engine. David Turek the VP of deep computing at IBM said during an interview with the German site Heise Online that the power XCell-8i will be the last of the Cell line. IBM will be focusing on power7 processor, which is due mid 2010.

In this news article, it is mentioned, by Jon Peddie, that Cell processor had many shortcomings that became apparent, such as lack of direct access to the global memory by its computing engines (the SPEs) and wrongly mentioned that everything has to go through its powerPC core, which creates a bottleneck. This is technically not true. The PowerPC core is not handling any of the requests initiated by the other compute engines (the SPEs). Also, it might be noted by some researchers that its cache should be bigger, but its performance still noted by many to be the best compared to other multi-core processors fall within its category. In addition, the Cell processor taught a lot of developers and researchers the best parallel programming practices for multi-core processors. The fact that everything is controlled by the developer forced all its programmers to think better of the best ways to optimize their algorithm’s execution time.

Although I may somehow believe that IBM may do changes to the Cell processor. It is very difficult to believe that IBM will end its Cell processor line that soon. IBM invested a lot of money and time and also many of its customers invested tons of money adopting the Cell processor.

I think IBM is trying to produce its own line away from Sony and Toshiba without giving away $500 million worth of investment and five years of engineering. It is about business. The cell processor is one of the master pieces in the multi-core processors. And as mentioned by David Turk the future is for hybrid multi-core processors, for a very simple reason: they provide great ratio of processing speed to consumed power.

I think IBM will reuse the SPEs instruction set along with their traditional PowerPC architecture but the change might be in how the cache will be organized and managed. Also, I think IBM is rethinking the cores interconnect network. They may use either dynamic networks or a mix between on-chip-network and shared cache architecture.

I’m talking here again about the multi-core processors for massively parallel systems working on complex scientific applications. However, I’m tackling this area from a different perspective. I would like to think with you of how multi-core processors will look like in five years from now and think of these questions: What are the serious problems that these processors will suffer from (from system’s perspective)? Which of the current solutions or anticipated frameworks may help us solving these problems? I’ll be discussing only one of them here. It is very difficult to predict accuratly what technology advancements will take place in the comming five years. However, there are general trends that we can track and resonably predict their future effects.

Anyways, I mentioned before that multi-core processors will be of thousands of cores maybe even tens of thousands of cores (check the latest AMD GPGPU Radeon HD 5970). These cores will be very simple and solving the same problem but on different data chunks. This of course mandates the existance of shared resources and shared areas contain input data and be able to store results. The path from one core to the data storage and other shared resources will get more complex and involving more shared resources, such as more hirarchies of on-chip and off-chip caches, cores interconnections, I/O buses, etc. The anticipated path and hierarchies through which data will be traveling to reach sysem’s main memory or to rech core’s registers will have a very important effect on data movement latency. It is not about only having higer latency which can be hidden by many veified techniques. It is about the variance of this latency from one core to another and from one request to anothr inside the same core. Current software solutions, such as prefetching in multiple buffers depend on the fact that latency to move data from memory to all processor’s cores will be the same at run-time. However, this is not true even in current multi-core processors. For example, inside the Cell Broadband Engine, the DMA latency (or memory latency) differs from one core to another depending on its physical location inside the chip and how far it is from memory controller. This variance will be even bigger as these processors grow in number of cores and as contension increases on shared resources inside them. Such variance requires solutions that would hide memory latency dynamically at run-time inside each core based on each specific core’s data latency. Current software solutions, such as prefetching and multi-buffering are depending on constant memory latency across all cores.Some hardware-based solutions tried to solve this problem through hyper- or multi-threading. Inside multi-core processors with multi-threaded feature, once a thread is blocked for an I/O or data movement, another thread gets active and resumes execution. Sun Microsystems through its latest UltraSparc T2 & T2 Plus, added up to four threads per core, which gives at the end large number of virtually concurrent threads on the same chip. However, there are two important drawbacks. First, if memory latency is pretty low for any reason these threads will be spending most of their time switching, which would give at the end a semi serial performance because four threads are sharing the same ALU and FP units. On the other hand, if memory latency is really high for all of the working threads inside the same core, we may end up with idle time because all of them will be waiting for data to come from system’s memory or an I/O device. Second, it this solution adds complexity to the hardware and consume space that could be used for bigger cache or even more single threaded cores.

Nano-KernelsOk, what would be the solution then? If we could dynamically create a threading framework that can create and manage threads pretty much to hyper- or multi-threaded architectures, we may be able to solve data latency problem smartly and for massively parallel multi-core processors. As long as each core will have its own data latency, why don’t we create small software threads that would switch their context to core’s local cache instead of switching it to the system’s main memory or second level cache. The context in this case will be the core’s registers (pretty much similar to current hardware based multi-threaded architectures) and few control registers affecting the execution of the thread, such as the program counter. So whenever one of these small threads, let’s call them micro-threads, stalls for a data chunk to be copied from system’s main memory, it will go to sleep mode and another micro-thread is switched to running mode and resume execution. A very small and very fast kernel, we may call it a nano-kernel, should actively run inside these cores to schedule micro-threads and make sure that data movement latency is hidden almost completely inside each core. This idea of having micro-threads has two advantages. First, the number of micro-threads is dynamic, which means number of micro-threads depends on data movement latency. For example, in large data movement latency we may add more micro-threads per core to work longer during other micro-threads wait time for data to be ready in core’s cache. Second, context switching inside each core’s cache makes it very cheap and very fast process, i.e. few of nano seconds. Of course, context saving will consume from each core’s cache but this is already consumed by several magnitudes to implement the hardware based multi-threaded architectures. Also, this will require specific faciltities provided by the ISA. For example, manual cache management and internal interrupting facility inside each core are mandatory for this idea to work. So, if we create nano-kernels doing optimization inside each core, we would reach new performance ceilings. It is scalable since each core has its own nano-kernel working independently and scheduling micro-threads based on resources given to the core. So, with thens of throusand of threads this solution would still work and get the most out of the expected massively parallel multi-core processors.

Multi- and Many-Core processors are here to stay for a really long time. They are microprocessors manufacturers response to the uni-core scalability walls. However, although software communities explored the parallel programming models heavily in the 80s and 90s, but these efforts were directed to less finer grained systems, mainly clusters, parallel machines, and Symmetric Multi Processors (SMP). Multi- and many-core architectures poped to the surface some classical problems, such as memory latency, data synchronization, and threads management. Also, they introduced new problems of massively parallel systems with to a great extent fine grained threading models, such as managing thousands of concurrent threads and inter-thread communication and data sharing. In this posting I would like to pinpoint some of these challenges from my research programming experiences on multi-core architectures.

Maintaining the current increase rate of processing power requires from micro-processors designers to introduce more processing cores per microprocessor. However, two important sides to be considered as more cores are introduced. First, processor cores will increase in high rate to double the speed every 18 months and keep Moore’s law in effect. Hence, it is expected to have, within five years, many cores processors with tens or even hundreds of processing cores on the same chip. Second, as number of cores is increasing, they will be with simpler designs and achieving simple tasks and each core will be faster from current single-core processors. Power and heat management issues will impose such design constraints on micro-processors manufactures. Such design aspects will increase overall processor’s speed while maintaining reasonable power consumption and heat dissipation.. As a result, parallelism will be finer grained. Developers will parallelize their applications at a more fine grained level to take full advantage of the multi or many-cores advancements. This granularity will increase the contention among these threads on shared resources. These resources can be a memory location, i.e. data, or an I/O device. Interdependencies among these threads will increase. In addition, as cores are getting simpler and faster, more data will be moving back and forth between processor and system’s main memory. On the other side, memory latency ration is increasing. Using hardware based techniques to hide this latency, such as branch prediction and embedded algorithms for cache replacement, may not be lucrative and efficient enough to hide this latency for parallel applications. Software based cache management and execution scheduling are now vital to fully utilize multi-core processors. Finally, programming complexity of multi-core processors and inherit complexity of parallel applications require tools to reduce some of these complexities.

Memory Latency Wall

As processors and programs become more parallelized, they will be more data hungry. On the other hand, the number of processor cycles to access system’s main memory grew from few cycles in 1980 to almost a thousand cycles today. Moreover, the cache per core ratio will continue to go down, which will make the memory latency problem worse if cache not managed properly. Although there is a great potential in the DRAM based memory to increase performance, but the growth rate of processors aggregate cycles will continue to be faster. The processor-to-memory performance gap is expected to grow by 50% per year according to some estimates. The good news is that memory latency problem can be solved using efficient software based scheduling for memory access. Multi-core processors are now returning some control back to software developer to manage each core’s cache. Such explicit cache management capabilities provide more space for programmers to maneuver around the processor-to-memory performance gap.

Data Synchronization

Using current synchronization mechanisms to synchronize tens or hundreds of threads access to one resource may lay on the line application’s performance. The whole system or application may suffer from deadlock or starvation due to weak synchronization mechanisms. In worst cases, current synchronization mechanisms will serialize the application in areas that need access to shared resources. As number of hardware threads in multi-core processors and parallelism increase in applications, the resulting performance lose increase as well. For example, implementing parallel shared counting algorithm would require from each of the participating threads to lock the counter before incrementing it. In worst performance case, each thread will have to wait for n-1 threads before it can update the counter, where n is the number of threads. If these processors are on the same die, efficiency of data synchronization can be greatly enhanced if data communication is done using available on-chip facilities, such as cores interconnect, shared cache, etc. Current synchronization techniques are using system’s main memory to write and read shared data, which makes it even worse. Such technique introduces the memory latency delay in addition to the delay of synchronization algorithms.

Programming Complexity

Parallel computing is inherently complex mainly due to the difficulty of design and intricacy of resources sharing and synchronization. Presence of multi-core processors at different scales, starting from embedded systems to super computers, made application’s adaptations to these new hardware platforms a critical issue. However, as multi-core processors are increasing their cores and parallelism is becoming more fine-grained, complexity will increase as well. Instead of designing a parallel application with 10 or 20 concurrent threads, an application may be executing 100s or 1000s of threads working on the same machine. A solution is required in this case to help reducing the programming complexity and also providing excellent scaling for the number of working threads.

Actually, these challenges are the main inspirational pillars for most of multi-core researchers, architects and developers. All microprocessors manufacturers are after faster processors without increasing programming complexity and without loosing developers ability to make best use of their new architectures. That’s why microprocessors manufacturers are now involved aggressively in the programming models. Intel for example created Intel’s Parallel Studio (Open CT framework included) for their general purpose multi-core microprocessors and specialized one as well, such as Larrabee GPGPU. Also, ATI built ATI Stream framework and NVIDIA also built CUDA framework to help developers make the best out of these new microprocessors without getting into the nitty-gritty architectural details of these advanced GPGPUs.

In my previous writing I tried to characterize multi-core processors and quickly pinpoint the major distinction features in multi-core processors. In this blog post I would like to briefly share with you some of the untapped aspects inside the Cell Broadband Engine and The GeForce GTX 295.

The Cell Broadband Engine (CBE)The Cell microprocessor is an innovative heterogeneous multi-core processor built by IBM, Sony, and Toshiba. 400 designers worked closely together for 5 years to invent the new heart for the PlayStation3 (PS3). The PS3 was only a starting point for the CBE. IBM made the first stab and re-introduced the CBE as a highly capable microprocessor for compute intensive jobs.I won’t repeat the explanation of the CBE architecture. You can find it at Wikipedia, IBM’s website, and many other places if you Google it. I implemented several discrete algorithms, such as graphs traversals and integers sorting, and scientific algorithms as the FFT. I also was able to get into its architectural details and build a proprietary threading model, called micro-threading. This framework hides memory latency in most workloads without engaging the developer into the lowest architectural details of the Cell processor. I also made a series of experiments to characterize the effect of it architectural properties on memory latency pattern in different workloads.The beauty of the CBE, from my points of view, is in the great extent of control it gives you to reach the best execution time. All the other architectural properties regarding its compute capabilities, multiple level of parallelism, and its on-chip network discussed in many publications and I experienced all these great features during my PhD journey. However, I would like to note the following from some of my experiences.Although the Element Interconnect Bus (EIB) is of high performance and very low latency, the effect of this topology over cores performance is overlooked by most research efforts. For example, from my memory latency measurement experiments I found out that memory latency differs from one core to the other depending on how far this core is from the memory controller. This does not reflect a flaw in the design, but this is a property that would change the measure of memory latency per each core. As this topology is used more and have more cores connected to the same ring, the physical location effect will be of more importance. The question is: how would this affect my performance as long as I can use techniques such as multi-buffering and data prefetching? The answer is very simple; as long as the memory latency differs from one core to the other, you need to hide it according to each core’s latency measures. For example, in cores with relatively very high memory latency you can use more buffers to prefetch your data compare to other cores with lower memory latency. I already discussed this in my optimum micro-threading scheduling paper, please have a quick look at it to understand this issue more.Also, in the new implementation of the CBE can work now with larger RAM. However, this RAM is based on the DDR3 technology. The initial implementation using the XDR RAM had the limitation of a maximum memory of 1 GB. Although the new CBE implementation has a faster floating point unit and two memory controllers to keep high processor-memory bandwidth, but the memory latency is getting higher due to the high latency of the DDR3 RAM compared to the XDR implementation.

NVIDIA GeForce GPGPU

I worked also on NVIDIA GeForce GTX 295 to implement some algorithms in information retrieval. It is still a work in progress and to be published. However, from my insights of the Cell Broadband Engine, I figured out some architectural properties that worth also sharing with you.NVIDIA GPUs programming framework is abstracting to a great extent the architectural details of the microprocessor. From productivity point of view, this is a great feature. However, it is leaving few options for researchers and experienced developers to explore different options inside it. For example, I couldn’t find a clear way that would help me scheduling threads to processor’s cores. It is done, not sure yet, by the processor’s scheduler. It is following the same policy used by Intel’s hyper-threading and Sun’s multi-threading architecture. I’m now measuring if memory latency differs from one core to another. My initial measures show that memory latency is almost the same across all cores. However, I’m a little bit concerned about the relatively many hierarchies built into the processor. I realize that the shared memory model mandates such hierarchy to have reasonable synchronization overhead. However, adding hierarchies to run away from this problem is not the best solution. NVIDIA is still investing in this hierarchical through their new GPU processor, Fermi.

Although the multi-core processors are providing a smart escape from physical limitations of the uni-core processors, but they need thorough architectural analysis to best utilize their resources. I think this is attainable by monitoring different execution patterns and pinpointing bottlenecks. This should make it easier build efficient programming models, run-time systems, and algorithms for multi- and many-core microprocessors. For example, inside the Cell Broadband Engine microprocessor manual cache management and the Element Interconnect Bus (EIB) formed the opportunity to build run-time systems to get best performance while simplifying the programming model, such as micro-threading, MPI microtask, data prefetching. I think multi- and many-cores processors will evolve through a closed feedback loop, see below.

Whenever a new architecture is being introduced, developers and researchers start implementing different algorithms and applications to get the best out of it. However, performance bottlenecks pops to the surface very soon. Several efforts try to solve these bottlenecks through either tweaking implemented algorithms or building general frameworks that would identify at run-time performance degradation parameters and change them. On the other hand the loop is properly closed if microprocessor designers listen to developers notes and try to hide these bottlenecks through architectural enhancements. This I think was properly handled in the new Fermi architecture of NVIDIA’s GPGPUs. For example, Fermi has now multiple memory controllers each is handling requests of different groups. This should reduce effect of memory requests serialization, which is a serious performance bottleneck.

I believe research teams are moving now from the naive ways of speeding up multi- and many-core processors by doing the old tricks of algorithmic enhancements to digging more into the processor’s architectural features and proposing better programming models backup by run-time libraries and frameworks. This trend is blending compilers, operating systems, parallel programming, and microprocessors architecture together.

Blogroll

Great Lakes Consortium for Petascale Computation
The Great Lakes Consortium for Petascale Computation is a collaboration among colleges, universities, national research laboratories, and other educational institutions. The consortium facilitates the widespread and effective use of petascale computing, t
0

The Hybrid Multicore Consortium
Oak Ridge National Laboratory (ORNL), Lawrence Berkeley National Laboratory (LBNL), Los Alamos National Laboratory (LANL) are leading providers of large scale computational resources for the scientific community and have made substantial investments to de
0