Transitioning Software to Future Generations of Multi-Core

The industry shift to increasing levels of hardware parallelism through higher numbers of execution cores in mainstream processors requires change on the part of software makers. One key requirement is to look ahead to the hardware resources that are likely to be available in the future and to make appropriate architectural decisions in advance to accommodate them. This paper contributes to the discussion about planning for those developments.

Overview

The move to multi-core processing across the computing industry has overturned a long-standing truism: that software achieves performance increases simply by waiting for the next generation of hardware. That expectation was based largely on the progressive increases in clock speed that were the hallmark of processor evolution throughout the history of computing, until recently.

Issues of power consumption and heat generation have steered processor engineering, for the foreseeable future, away from increases in clock speed and toward increases in parallelism, in the form of increasing numbers of execution cores per chip. Increases in processor clock speed had few implications for the fundamental designs of software (other than, for example, the ability to add more features as performance increased). This new way forward, however, requires added flexibility on the part of the software, which means foresight on the part of the programmer. Supporting this expansion will become a fundamental requirement for software to achieve performance and scalability, with direct implications for its competitiveness in its market segment.

Multi-core processing has become ubiquitous, from laptops to enterprise servers, and everywhere in between. This change has caused the software industry at large to enact a certain level of change in product architectures, mostly at a modest level. Many applications are now threaded to some degree, typically with scaling to two, four, or eight processors in mind. Threading at this level, while it can be complex, has a relatively high degree of forgiveness for suboptimal threading methodologies.

As multi-core architectures continue to evolve, however, they will require developers to refine their threading techniques as a core aspect of their solutions, rather than as merely a desirable feature. Overheads associated with operations like thread creation, synchronization and locking, threading granularity, scheduling, and process management will become more pronounced as time goes on, and the necessity of planning for parallel scalability will become more and more important.

For the purposes of this discussion, it is necessary to distinguish two usages of the term "many-core" that are common within the industry. The first refers to processor architectures that consist of a very large number of execution cores; another, slightly less common term for such architectures is "massively multi-core." The second refers to processors based on heterogeneous cores with specialized functions, such as graphics processing or TCP/IP offload. This paper uses "many-core" to denote the latter and "massively multi-core" for large systems of homogeneous CPU-type cores.

This paper surveys some fundamental technical issues and relevant industry trends associated with planning for an increasingly multi-core future by software-development organizations. Its intention is to provide decision makers, influencers, and software architects with some guidelines to use in planning the policies and strategies that will position them for successful results as the multi-core future unfolds.

Getting the Basics Right: Robust Threading Practices

Perhaps the most important advice for the development of a programming strategy for systems with increasing numbers of cores is to institutionalize sound threading practices. In addition to the near-term gains in solution performance this practice engenders, it also positions those solutions for refinements in their threading design in the future, as the number of required threads increases.

The implementation of generalized threading practices should consist of a structured methodology that makes good use of tools and libraries that simplify the process. The Multi-Core Developer Community provides resources that are well-suited to the development of comprehensive threading methodology. The following very simplified version of such a process illustrates important tasks in the lifecycle:

Profile the serial version of the application to identify candidate regions for parallelization. This task can be facilitated by sampling the application using tools such as Intel® VTune™ Performance Analyzer or Microsoft Visual Studio 2005* Sampling Profiler to identify hotspots in the application that account for large portions of processor time. Next, using the call graph functionality in Intel VTune Performance Analyzer, one can identify the critical path (shown in red in Figure 1) and select the most time-consuming (or parallelizable) path. The selected path is then examined by looking at the call sequence, and an appropriate function in which to make threading calls is identified. Analysis at this stage also estimates the possible speed-up that can be obtained through parallelization of the section in question, from which goals are set for a performance increase.

Figure 1. Intel® VTune™ Performance Analyzer call graph output.

Design an appropriate threading model to be used in threading the candidate region. This step involves characterizing tasks as candidates for data decomposition or functional decomposition and restructuring data to accommodate threading. Briefly, data decomposition is the application of the same logic to different datasets, as in the case of applying a filter to a graphical image, where the same algorithm can be applied simultaneously (in parallel) to different regions of the image. Functional decomposition breaks overall logic into discrete functional tasks, such as updating a UI window, performing a data query, and calculating a formula simultaneously. This step also includes determining the value to the algorithm in question of thread-safe library functions such as those in Intel® Integrated Performance Primitives and Intel® Math Kernel Library.

Implement threading. Incorporate the library functions that the design phase determined to have value to the section of the application being threaded. For functional-decomposition problems, use explicit threading: create a thread pool at startup, encapsulate the task in one or more functions, and distribute the work among inactive threads. For data-decomposition problems, use of OpenMP* can dramatically simplify the implementation process, since the implementation details are largely handled by the OpenMP runtime.

Debug and test the threaded code. Once the threading model of choice has been implemented and the code has been instrumented using a compiler or performance analyzer, Intel® Thread Checker can help to iteratively identify and resolve threading errors. It is also necessary to ensure that the code produces results that are consistent with the serial version. This step involves source instrumentation of modules where memory conflicts occur and using Intel Thread Checker to drill down to the source lines associated with them.

Tune the code to reach the performance target. After collecting performance statistics, use a combination of Intel VTune Performance Analyzer and Intel® Thread Profiler to detect thread imbalances. Address those thread imbalances individually and recheck application performance to determine whether the tuning goals have been met, and repeat this process iteratively as necessary.
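The data-decomposition idea from the design step, the thread pool from the implementation step, and the check against the serial version from the debugging step can be illustrated together in a minimal sketch. It is written in Python rather than with the OpenMP or Intel tools named above, and the filter brighten(), the helper split_rows(), and the sample image are hypothetical names invented for illustration. Note also that in CPython, threads running pure-Python, CPU-bound code do not execute in parallel because of the global interpreter lock, so the sketch shows the structure of the technique rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(rows, amount=10):
    """Hypothetical per-pixel filter: add `amount` to each pixel, clamped to 255."""
    return [[min(255, px + amount) for px in row] for row in rows]


def split_rows(image, parts):
    """Split the image into up to `parts` contiguous bands of rows."""
    parts = max(1, min(parts, len(image)))
    size, extra = divmod(len(image), parts)
    bands, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        bands.append(image[start:end])
        start = end
    return bands


def brighten_parallel(image, workers=4):
    """Data decomposition: the same filter is applied to each band of rows."""
    bands = split_rows(image, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input bands.
        results = list(pool.map(brighten, bands))
    return [row for band in results for row in band]


# Hypothetical 10-row by 8-column grayscale image.
image = [[(x * y) % 256 for x in range(8)] for y in range(10)]
serial = brighten(image)
parallel = brighten_parallel(image, workers=4)

# The threaded result must be consistent with the serial version.
print(parallel == serial)
```

Because each band is processed independently and the results are reassembled in order, the threaded output can be compared directly against the serial output, which is the consistency check the debugging step calls for.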

A structured methodology should be developed according to the needs and preferences of the individual organization, and it should be maintained and refined as needed.

Flexible Programming for Future Architectures

The speedup available by threading a specific region of code depends in part upon how many threads that region generates. Typically, it is desirable to create no more threads than the available processor resources can accommodate simultaneously. Thus, for a two-processor quad-core system that supports Hyper-Threading Technology, the maximum number of desirable threads would be 16 (2 processors x 4 cores per processor x 2 virtual processors per core). This is an important consideration, because creating more threads than can practically be used by the execution hardware generates overhead in creating threads that will not provide additional performance benefit.
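The thread-count arithmetic in the example above can be written out as a short calculation. The function name max_useful_threads() and its parameter names are hypothetical, chosen only to mirror the three factors in the text.

```python
def max_useful_threads(processors, cores_per_processor, logical_per_core):
    """Upper bound on threads the hardware can run simultaneously."""
    return processors * cores_per_processor * logical_per_core


# Two quad-core processors with two logical processors per core.
example = max_useful_threads(processors=2, cores_per_processor=4, logical_per_core=2)
print(example)  # 16
```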

One piece of code may be executed on various-sized systems, including systems with processors introduced after the software that have more cores than were available at release. It is therefore important that the application code can adapt to create various numbers of threads when different levels of hardware resources are available. Ideally, that adaptation should be transparent to the user, although depending on the type of application, it may be desirable to give technical users (system administrators for server applications, for example) a degree of control over this behavior.

One means of accommodating this requirement is to use a threading model such as OpenMP that, by default, chooses the number of threads at runtime based on the machine on which the software is being run. Another method is to use the CPUID instruction to count the number of cores and logical processors in a processor package. A useful discussion of this technique that also includes a utility for its implementation is available in the article "Detecting Multi-Core Processor Topology in an IA-32 Platform."
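As a rough illustration of sizing a thread pool to the machine at runtime, the following sketch uses Python's os.cpu_count(), which reports the number of logical processors visible to the operating system. This is a substitute for, not an implementation of, the CPUID technique: it does not distinguish physical cores from logical processors or expose package topology. The function names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def detected_logical_processors():
    """Logical processors reported by the OS, with a fallback of 1."""
    # os.cpu_count() may return None when the count cannot be determined.
    return os.cpu_count() or 1


def make_pool():
    """Create a pool sized to the machine on which the code is running."""
    return ThreadPoolExecutor(max_workers=detected_logical_processors())


with make_pool() as pool:
    squares = list(pool.map(lambda n: n * n, range(5)))
print(detected_logical_processors(), squares)
```

The same binary then creates more threads on a larger machine without any change to the code, which is the adaptation the paragraph above describes.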

In order to ensure that a piece of code does not create more threads than the maximum number for which it has been tested, developers sometimes artificially limit that number by setting an upper limit to the number of threads that the code will create, using mechanisms such as the OpenMP function omp_set_num_threads(). While this is a responsible practice to ensure that your application does not behave unexpectedly or degrade performance, it is also a limiting factor to its support for future hardware architectures. A number of approaches are possible to obviate this issue.

First, it is desirable to test the scalability of software on hardware systems that support large numbers of threads. Ideally, one should seek out hardware in the pre-production stages for such testing, considering venues such as the Intel® Software Partner Program, which provides technical resources that assist in development, including access to development platforms and engineering support. The program also provides planning and strategy resources for executives and decision makers, as well as marketing support.

Another approach is to make the upper limit on the number of threads that the application can create configurable. That approach allows you to easily test the software on hardware with a larger number of cores as that hardware becomes available, and once your organization has verified that the software behaves as desired, you can change the application setting to allow a higher number of threads.
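A configurable upper limit of this kind might be sketched as follows. The environment variable name APP_MAX_THREADS and the default of 8 are assumptions made for illustration; a real application might read the value from a configuration file or an administrative console instead.

```python
import os

# Hypothetical: the largest thread count on which the software has been tested.
DEFAULT_TESTED_MAX = 8


def thread_limit(environ=os.environ, detected=None):
    """Return the number of threads to create.

    The result never exceeds the detected hardware resources or the
    configured cap, and falls back to the tested default when the
    setting is missing or invalid.
    """
    detected = detected or os.cpu_count() or 1
    raw = environ.get("APP_MAX_THREADS")  # hypothetical setting name
    try:
        cap = int(raw) if raw is not None else DEFAULT_TESTED_MAX
    except ValueError:
        cap = DEFAULT_TESTED_MAX
    if cap < 1:
        cap = DEFAULT_TESTED_MAX
    return max(1, min(detected, cap))


# On a hypothetical 64-way machine, the default cap applies until it is raised.
print(thread_limit({}, detected=64))                           # 8
print(thread_limit({"APP_MAX_THREADS": "32"}, detected=64))    # 32
```

Once testing on larger hardware has been completed, raising the setting is enough to allow the higher thread count; no code change or rebuild is needed.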

Programming for Resilient Systems of Unreliable Components

As very large numbers of cores become available per processor, the burden of system reliability will shift from individual cores to the overall aggregate. That is, large systems of cores can tolerate the failure of individual units, since each one corresponds to a relatively small percentage of overall execution resources. Traditionally, burn-in of new components has been an important means of identifying those that would fail early in their life cycles, as a means of isolating the 'longer-lived' units for use in production. The sheer scale of a system of thousands of cores, as opposed to just a few, makes that approach impractical in a massively multi-core future.

Moreover, if a hypothetical mean time to failure for processor components were ten years, we could roughly posit that perhaps half would fail in the first ten years of system life. In a 1,000-core system, that equates to roughly one core per week on average becoming unusable. While that still leaves plenty of processing power to go around, software must be able to dynamically adjust to this environment, which will paradoxically result in highly resilient systems based on unreliable individual components.
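The rough estimate above can be checked with a few lines of arithmetic. The sketch simply restates the paragraph's own assumptions (half of 1,000 cores failing over ten years, spread evenly over time); the function name is hypothetical.

```python
def failures_per_week(cores, fraction_failed, years):
    """Average core failures per week, assuming failures are spread evenly."""
    weeks = years * 52
    return cores * fraction_failed / weeks


# 1,000 cores, half of which fail within the first ten years.
rate = failures_per_week(cores=1000, fraction_failed=0.5, years=10)
print(round(rate, 2))  # 0.96, or roughly one core per week
```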

The necessary error-checking that is required under this model might draw from current high-performance computing models, where distributed systems of computers achieve redundancy by sheer numbers, rather than intra-machine resources. Another analogy is TCP traffic over the Internet, where the failure of a node or transmission path can be routed around without causing processes to fail.
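The idea of routing around a failed unit, rather than letting the whole process fail, can be sketched as a scheduler that resubmits work when a task reports a failure. This is only a minimal illustration under stated assumptions: the exception CoreFailure, the function run_resilient(), and the deliberately unreliable flaky_square() are all hypothetical, and a thread pool stands in for a set of cores, which it does not actually model.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CoreFailure(Exception):
    """Hypothetical signal that the unit executing a task has failed."""


def run_resilient(tasks, run_task, workers=4, max_attempts=3):
    """Run every task, resubmitting those whose execution unit failed."""
    results = {}
    pending = list(enumerate(tasks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_attempts):
            if not pending:
                break
            submitted = [(i, t, pool.submit(run_task, t)) for i, t in pending]
            pending = []
            for i, t, future in submitted:
                try:
                    results[i] = future.result()
                except CoreFailure:
                    pending.append((i, t))  # route around the failure
    if pending:
        raise RuntimeError("tasks still failing after %d attempts" % max_attempts)
    return [results[i] for i in range(len(tasks))]


_seen = set()
_lock = threading.Lock()


def flaky_square(n):
    """Hypothetical task that fails the first time for multiples of three."""
    with _lock:
        first_time = n not in _seen
        _seen.add(n)
    if n % 3 == 0 and first_time:
        raise CoreFailure(n)
    return n * n


outcome = run_resilient(range(10), flaky_square)
print(outcome)
```

The caller sees a complete, ordered set of results even though several individual executions failed, which is the sense in which a resilient system can be built from unreliable components.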

Planning for Many-Core Heterogeneous Architectures

Many observers expect the development of processors that contain a large number of cores designed to support specialized tasks to be increasingly prevalent over the next several years. On-chip heterogeneity allows the processor to better match execution resources to each application's needs and to address a wider spectrum of system loads with high efficiency.

For example, a specialized graphics or TCP/IP offload core may be able to perform certain functions with better power efficiency than a conventional CPU core, while at the same time, the CPU core is more flexible in terms of the implementations for which it is suited. As current technologies that illuminate future trends in this area, consider the use of graphics processing units (GPUs) for non-graphics purposes and the idea of including programmable gate arrays (PGAs) in many-core architectures.

A growing body of research and practice centers on the use of GPUs as general-purpose co-processors. This area, known as General-Purpose Computing on Graphics Processing Units (GPGPU), is not widely used but shows considerable promise and illuminates nicely some concepts involved in the use of software to take advantage of multiple types of execution cores within a processor package. Graphics tasks as a whole are highly parallel in nature. Tasks such as rendering and transcoding offer an obvious means of dividing workloads into discrete tasks, with the same operations being performed simultaneously, for example, on many pixel regions within a still graphic or on time slices within an animation.

Because of that fact, GPU architectures are highly parallel in nature, which makes them well-suited to handling many types of operations with high performance. For example, matrix algebra maps directly to the 2D grid used in rendering operations within the GPU, so matrix operations that require large numbers of Fast Fourier Transforms for implementations such as financial markets analysis are a natural fit to this model, as are problems in fluid dynamics, medical imaging, and geological surveying, to name a few.

In a significant advance for general-purpose programming on GPUs, Nvidia has released CUDA*, a C programming environment designed specifically for solving complex computational problems using its GPUs. The CUDA environment is compatible with Windows* and Linux* programming environments and includes math-function libraries and an SDK to aid in implementation. Programming resources for GPUs have also started to be introduced by other vendors, signaling a maturation in this area that may be expected to directly enable software development for many-core architectures.

Because of their inherent flexibility, PGAs suggest another type of functionality that could be offered by future many-core processors. PGAs commonly provide programmable logic blocks that allow reconfiguration to provide various types of functionality. They are typically configured for specific purposes and used in a static manner, although some facility does exist for dynamic configuration at runtime. By expanding those dynamic capabilities, one could imagine processors that would detect the requirements of a specific workload and dynamically reconfigure the actual hardware to accommodate it, for example, with a desirable balance of power efficiency and performance.

Conclusion

The course of change in the hardware industry is clear: the primary means of evolution for processors for the foreseeable future is through increasingly large numbers of execution cores per processor package. This change requires software to incorporate increasing levels of parallelism, specifically by means of multi-threading. The limited commitment many development organizations have made to threading their software must be replaced by a robust threading methodology that is applied to all of their products, which will help ensure the ability to take advantage of next-generation processors.

In addition to the adoption of conventional threading techniques, developers should also consider a more open-ended number of processing cores than has been the case up to now. Rather than designing a product for a set number of processors, it will become increasingly valuable to be able to accommodate as many cores as are available, dynamically adjusting to the execution environment.

As systems become even larger, the optimal number of threads will shift away from matching all of the available cores to a level set by algorithms within the software. At the same time, processors with very large numbers of cores will create a shift toward resilient systems based on dynamic adjustment among available resources, allowing increased tolerance of failure in individual cores.

In the more distant future, processors are likely to make use of large numbers of specialized, heterogeneous cores that amount to distributed systems on a chip. Offloading specific functions to specialized processing cores is likely to allow greater ability to match resources to tasks optimally, as well as a higher degree of power efficiency. Present-day predecessors of that functionality may be deduced from the use of GPUs as general-purpose co-processors and related implementations that shed light on how development organizations might prepare for a future of diverse processing resources in massively multi-core and many-core systems.

Additional Resources

The following materials provide a point of departure for further research on this topic:

GPGPU.org* collects news and information about the use of graphics hardware for general-purpose computing, including message boards and industry announcements.

About the Author

Matt Gillespie is an independent technical author and editor working out of the Chicago area and specializing in emerging hardware and software technologies. Before going into business for himself, Matt developed training for software developers at Intel Corporation and worked in Internet Technical Services at California Federal Bank. He spent his early years as a writer and editor in the fields of financial publishing and neuroscience.
