Optimizing Software for Multicore Processors

With the potential for real performance gains, multicore processors present the challenge of deciding how to validate and optimize code.

Evaluate Code Performance

Before examining the overall performance data, what should you reasonably expect? Given that we migrated Amide from a single-core to a four-core system, the highest "realistic" expectation is a linear four-fold performance improvement. This is a theoretical maximum because the cores share resources such as buses and memory, which can introduce execution delays. There is also overhead associated with maintaining data coherency among the caches in the system. As a rule of thumb, these factors can reduce system performance by 10 to 20 percent. Therefore, on average, we anticipate that a four-core system will perform 3.2 to 3.6 times faster than a single-core system.
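The arithmetic behind that range can be sketched in a few lines of Python. The function name and the simple fractional-overhead model are illustrative assumptions, not part of the Amide code; the 10 and 20 percent figures are the rule-of-thumb values quoted above.

```python
def expected_speedup(cores, overhead):
    """Ideal linear speedup reduced by a fractional overhead (assumed model)."""
    return cores * (1.0 - overhead)

# Rule-of-thumb overheads of 10 and 20 percent on a four-core system.
best = expected_speedup(4, 0.10)
worst = expected_speedup(4, 0.20)
print(f"expected speedup: {worst:.1f}x to {best:.1f}x")
```

This treats shared-resource and cache-coherency costs as one flat percentage, which is only a first approximation.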

First, we measured overall application performance relative to the original code, using the time it took VolPack to render images along the z- and x-axes (Table 1). Both rotations were rendered from the same image dataset. Migrating from a single-core system to a four-core system produced a speedup of between 3.2 and 3.4 times. These results indicate that the parallelization was effective and reasonably well optimized.

| | 1 Core (seconds) | 2 Cores (seconds) | 4 Cores (seconds) | Speedup Ratio: 1 Core/4 Cores |
|---|---|---|---|---|
| z-axis search | 1.85 | 0.96 | 0.55 | 3.4 |
| x-axis search | 5.48 | 3.18 | 1.71 | 3.2 |

Table 1: Overall performance results.
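The speedup ratios in Table 1 follow directly from the measured render times. The sketch below recomputes them and also derives parallel efficiency (speedup divided by core count), a metric added here for illustration that the table itself does not report. The helper names are hypothetical.

```python
# Render times in seconds, copied from Table 1.
times = {
    "z-axis": {1: 1.85, 2: 0.96, 4: 0.55},
    "x-axis": {1: 5.48, 2: 3.18, 4: 1.71},
}

def speedup(t_single, t_parallel):
    """How many times faster the parallel run is than the single-core run."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(t_single, t_parallel) / cores

for axis, t in times.items():
    for cores in (2, 4):
        s = speedup(t[1], t[cores])
        e = efficiency(t[1], t[cores], cores)
        print(f"{axis}, {cores} cores: speedup {s:.1f}x, efficiency {e:.0%}")
```

The four-core speedups round to 3.4 and 3.2, matching the last column of Table 1.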

Next, we examined the cache hit rate. Table 2 shows the L2 cache hit rate for the images rendered along the z- and x-axes on one core, two cores, and four cores. Along the z-axis, the cache hit rate is fairly consistent across one, two, and four cores at 76 to 78 percent. Along the x-axis, the L2 cache hit rate is much lower, ranging from 26 percent with one core to 34 percent with four cores. This lower L2 cache hit rate may help explain the slower rendering times for x-axis renders in Table 1.

| | 1 Core | 2 Cores | 4 Cores |
|---|---|---|---|
| z-axis search | 78% | 78% | 76% |
| x-axis search | 26% | 28% | 34% |

Table 2: L2 cache hit rates.

Third, we looked at CPU utilization. During image rendering, all cores run at nearly 100 percent utilization in the one-core, two-core, and four-core configurations, so all available cores share the workload evenly. We measured CPU utilization with the Linux top command.
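For readers who want to log per-core utilization rather than watch top interactively, one alternative on Linux is to compare two snapshots of /proc/stat. The sketch below is not the method used in this study. It assumes the usual /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal), and the two sample snapshots are made-up values, not measurements from our system.

```python
def parse_cpu_lines(stat_text):
    """Return {cpu_name: (busy_ticks, total_ticks)} for each per-core line."""
    result = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep "cpu0", "cpu1", ...; skip the aggregate "cpu" line and others.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        ticks = [int(v) for v in fields[1:]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        total = sum(ticks[:8])  # user through steal
        result[fields[0]] = (total - idle, total)
    return result

def utilization(before_text, after_text):
    """Percent busy per core between two /proc/stat snapshots."""
    before = parse_cpu_lines(before_text)
    after = parse_cpu_lines(after_text)
    percent = {}
    for cpu, (busy0, total0) in before.items():
        busy1, total1 = after[cpu]
        elapsed = total1 - total0
        percent[cpu] = 100.0 * (busy1 - busy0) / elapsed if elapsed else 0.0
    return percent

# Hypothetical snapshots taken a short interval apart.
BEFORE = """cpu  150 0 30 1820 0 0 0 0
cpu0 100 0 20 880 0 0 0 0
cpu1 50 0 10 940 0 0 0 0"""
AFTER = """cpu  338 0 39 1823 0 0 0 0
cpu0 190 0 29 881 0 0 0 0
cpu1 148 0 10 942 0 0 0 0"""

for cpu, pct in sorted(utilization(BEFORE, AFTER).items()):
    print(f"{cpu}: {pct:.0f}% busy")
```

On a real system, the two strings would come from reading /proc/stat twice, a second or so apart, while the render is running.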

Our fourth code metric was synchronization overhead, which measures the noncompute resources required to maintain the threads, resources that could otherwise be used to speed up the application. Figure 4, generated by the Intel Thread Profiler, shows Core 1 loading the image for 45 seconds, then spawning threads on Cores 2, 3, and 4. The solid green areas represent individual image renders, which start at around 46, 48, 56, 59, 62, and 68 seconds.

Figure 4: Synchronization overhead: Parallel image rendering.

Synchronization overhead is displayed in red when its duration exceeds 30 microseconds. The only synchronization instance displayed in Figure 4 occurs when Core 1 creates the threads for Cores 2, 3, and 4 at around 46 seconds. We were pleased with this relatively low level of synchronization overhead.

Our last code metric was thread stall overhead, which measures the core idle time caused by cores waiting for system resources or work assignments. Figure 5 shows that a single thread executes for 66 seconds (CL.1); this is the sum of the time Core 1 spends loading the image and the idle time between image renders. For the rest of the time, four threads are executing (CL.4), which corresponds to image rendering. In other words, all four cores are busy during image rendering, which is the desired result.

Figure 5: Measuring thread stall overhead.

There are also periods when only two or three threads are working (CL.2 and CL.3), but these are relatively short. CL.2 and CL.3 periods are better than CL.1 periods (one thread), but not as good as CL.4 (four threads). Thread profilers typically provide more detailed views of thread stalls, and they can map thread stalls to the responsible source code.
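To make the concurrency levels concrete, the sketch below computes how much time is spent at each level (CL.1 through CL.n) from a list of per-thread busy intervals. It illustrates the idea only and is not how the Intel Thread Profiler is implemented. The function name and the interval data are hypothetical and are not taken from Figure 5.

```python
from collections import defaultdict

def time_per_concurrency_level(intervals):
    """Map each concurrency level to the total time spent at that level.

    intervals: (start, end) busy periods in seconds, one per thread activity.
    """
    events = []
    for start, end in intervals:
        events.append((start, +1))  # a thread becomes busy
        events.append((end, -1))    # a thread becomes idle
    # Sort by time; at equal timestamps, process ends (-1) before starts (+1).
    events.sort(key=lambda e: (e[0], e[1]))

    totals = defaultdict(float)
    level = 0
    previous = None
    for timestamp, delta in events:
        if previous is not None and timestamp > previous and level > 0:
            totals[level] += timestamp - previous
        level += delta
        previous = timestamp
    return dict(totals)

# Made-up example: one thread loads data alone, then four threads render.
busy = [
    (0.0, 10.0),  # thread 1: load, then render
    (6.0, 10.0),  # thread 2: render
    (6.0, 10.0),  # thread 3: render
    (6.0, 9.0),   # thread 4: finishes its share early
]
levels = time_per_concurrency_level(busy)
for level in sorted(levels):
    print(f"CL.{level}: {levels[level]:.1f} s")
```

In this example, the last second runs at CL.3 because one thread finishes early, which is the kind of short stall that shows up in a profile like Figure 5.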

Conclusion

Developing parallel code for multicore processors warrants careful consideration of threading approaches and a thorough analysis of the resulting performance. The approach I describe here can be applied to any application transitioning from single-threaded to multithreaded to take advantage of multicore processor performance. This code migration case study showed it is possible to take an application designed to run on a single-core processor and migrate it to a four-core system in a matter of days, while realizing a performance increase of more than three times.
