Parallel

Cache-Friendly Code: Solving Manycore's Need for Faster Data Access

By Chris Gottbrath, November 12, 2012

As the number of cores in multicore chips grows  Intel just announced the 50+ core Xeon Phi  ensuring that program data can be delivered fast enough to be consumed by so many processors is a huge challenge. Optimal use of processor caches is a key solution, and knowing these coding techniques will become a requirement.

The coming wave of manycore processors, such as the Xeon Phi, are many-headed beasts that can chew through a tremendous amount of data  but only if they are fed properly. If data can't be brought in quickly enough, some of the cores will starve and be forced to wait for data to arrive.

The multicore opportunity leads developers to focus on software scalability in unprecedented ways. Before, scalability was primarily of interest to those with access to unusual hardware. Now, it is a critical metric for assessing the long-term performance potential of any code or algorithm.

There are two ways to take advantage of multiple cores: Developers can run a single program and make it multithreaded, or they can run multiple processes at the same time. Scalability is an important characteristic in either case.

The case of a single program needing to be spread across multiple threads has been discussed numerous times in Dr. Dobb's, with extensive coverage of the various techniques.
The scaling of threaded programs is frequently limited by contention for resources. Resources that can limit scalability include processor cache space, memory space, and bandwidth limiting data transfer between different components (processor, memory, interconnect, and storage). With multi- and manycore processors, the key bandwidth limitation is often the link between main memory and the processor  the processor caches. These caches are critical for ensuring that the many cores of our beast are adequately fed.

Homogeneous Workloads

Other significant problems are composed of numerous distinct subproblems that are solved independently. In these cases, processor occupancy  the proportion of busy processors  can be raised by running many copies of the program at the same time. Each individual instance may not run terribly fast, but the workload as a whole should benefit from a net throughput that wouldn't be achievable on older processors. Examples of these workloads, often described as being "embarrassingly parallel," might be serving static Web content, calculating Monte Carlo financial models, and bulk processing of image or audio data.

However, when multicore processors are used to run many instances of simple programs, they are still subject to sharing system resources such as bandwidth, cache, and memory. These shared resources can still be bottlenecks even when the computations are completely distinct and embarrassingly parallel. When conflict occurs, cores must wait for data and so, they are idle part of the time. This can mean that even with embarrassingly parallel workloads, adding tasks (even when cores are available to run them) will not necessarily result in faster throughput.

Heterogeneous Workloads

Virtualization is a way to achieve higher core occupancy at a system-administrative level because a large fraction of front- and back-office software is single threaded. Virtualization allows multi- and manycore processors to support wider workloads with a high degree of transparency to both the application and organization. But the scalability limits mentioned previously still apply, because when two or more virtual machines are placed on a multi- or manycore processor, all those processes share the same memory bandwidth and contend for the same cache space.

Cache and memory bandwidth are limiting factors in all these scenarios and they become progressively more challenging as the number of cores increases.

Cache Memory and a Serial Example

Allocation of space in the processor data cache is managed by the hardware and is based on the processor's access pattern. When the processor accesses a part of memory that is not already in the cache, it loads a chunk of the memory around the accessed address into the cache, in the hope that the processor will need that same data or adjacent data soon, and if it does, then that data will be in the cache and easily accessible.

"Cache" lines are the chunks of memory handled by the cache, and the size of these chunks is called the "cache line size." Common cache line sizes are 64 or 128 bytes.
The limited number of lines that a cache can hold is determined by the cache size. For example, a 64 KB cache with 64-byte lines has 1024 cache lines.

Prefetching in Hardware and Software

It is clear that when data is going to be used multiple times, keeping it in the cache will eliminate the latency problem for the second and subsequent accesses, but nothing discussed so far reduces latency the first time a program needs to access a given cache line. Ideally, the data would already be waiting in the cache for the processor to access the first time. This can be achieved through prefetching lines of data into the cache. Most processors have both hardware and software support for prefetching.

Software prefetching takes the form of instructions that don't operate directly on data, but tell the memory system to preload an address or range of addresses into the cache. Compilers can weave these instructions into computations as an optimization if it is likely that they will help.

Hardware prefetching watches where the program is reading from memory and, if it detects a pattern, it will start preloading what will be needed before the program asks for it. Hardware prefetchers work best with very regular access patterns.

Cache Coherency

Modern processor caches also maintain a property called "coherency." A coherent cache places a copy of data close to the core, then takes steps to ensure that the whole system functions as if there were only one copy of the data, therefore avoiding situations where separate cores see different values for what should be the same data. While vital for correctness, the mechanisms used to ensure coherency between items found in the caches of multiple cores can add substantial overhead for certain types of access patterns. Coherency mechanisms also need to be in place to avoid errors when direct memory access (DMA) technology is used to change data in memory without the CPU being involved.

This routine was run on a system with a cache size (L2) of 512 KB. For comparison purposes, the size of the array was varied while changing the number of iterations to keep the number of increments to the inner array constant. When the array size was greater than 512 KB, it took about 50 seconds to run. When the array size was smaller, so it fit completely within the cache, the calculation ran in about 5 seconds. The magnitude of this effect, approximately 10x speed up, is consistent with the ratio of latency to the cache and latency to main memory.

Multiple Processes, Cache Friendly Programs, and Scalability

So what can be done to fully utilize the performance of a multicore system when many separate processes are running on that system? Examining programs to see if they are using the cache efficiently is an important step. If cache-friendly algorithms are used, even in single-threaded applications, a foundation for efficient virtualization is established: A single greedy process can gobble up all the cache and leave other innocent processes to starve as they wait for every data element to be fetched from main memory. Let's look at the types of problems that can inadvertently make caches inefficient, then examine optimizations that can be made to simple programs to make them more cache friendly.

Inefficient Loop Nesting and Blocking

Dense, multidimensional arrays are common datasets in numerical applications. When a program is processing a multidimensional array, such as pixel data in image processing or elements in a finite-difference program, it is important to pay attention to which dimension is assigned to the inner loop and which dimension is assigned to the outer loop.

If an array is traversed along the wrong axis, only one element will be used from each cache line before the program continues to the next cache line. For example, consider this small snippet of C code iterating over a two-dimensional matrix:

In one iteration of the outer loop, the inner loop will access one element in each row of a specific column. For each element touched by the loop, a new cache line is accessed. If the array is larger than the cache size, the rows from the top of the matrix may already have been evicted from the cache when the bottom of the matrix is reached. The next iteration of the outer loop then has to reload all of the cache lines again, each time paying the full main-memory latency cost.

If the nesting of the loops is changed so that they traverse the matrix along the rows instead of along the columns, the program receives a lot more benefit from the cache:

As the program processes each column along a row, data is moved into the cache and all of the data in that line is used before the cache line ends up being discarded. Accessing data sequentially also allows hardware prefetch logic to engage, placing data in the cache before the processor even asks for it. This maximizes the use of available memory bandwidth. A code transformation like this can result in performance improvements of up to 15x for large array sizes. For example, the LBM code of the SPEC2006 benchmark suite contains such a performance problem and was made to run 2.7-times faster on a quad-core when the simple code transformation suggested above was applied to one of its loops.

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task.
However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Video

This month's Dr. Dobb's Journal

This month,
Dr. Dobb's Journal is devoted to mobile programming. We introduce you to Apple's new Swift programming language, discuss the perils of being the third-most-popular mobile platform, revisit SQLite on Android
, and much more!