Cache-Friendly Code: Solving Manycore's Need for Faster Data Access

By Chris Gottbrath, November 12, 2012

As the number of cores in multicore chips grows (Intel just announced the 50+ core Xeon Phi), ensuring that program data can be delivered fast enough to be consumed by so many processors is a huge challenge. Optimal use of processor caches is a key solution, and knowing these coding techniques will become a requirement.

Sometimes it isn't possible to choose a loop order that puts all the data that needs to be read and written adjacent to one another. If this is the case, then a more-advanced technique called "blocking" can be used. Blocking involves breaking up the computation into a set of subcomputations that each fit within the cache. Blocking does involve reordering the sequence of computations and it is important to verify that this reordering doesn't violate a data dependency.

Padding: The Bane of Cache Line Utilization

Many modern processors require memory allocations to be made to aligned addresses. This generally means that a field must be stored at an address that is a multiple of its size; for example, a 4-byte field must be stored at an address that is a multiple of 4, and an 8-byte field must be stored at an address that is a multiple of 8.

Compilers insert padding between variables and fields to ensure that each field starts at an address with correct alignment for its particular data type. This padding consumes valuable cache space and memory bandwidth. Consider the following code snippet:

struct record {
    char a;
    int b;
    char c;
};

Assume that char is a 1-byte data type and int is a 4-byte data type. The compiler will then lay out the data with three bytes of padding between fields a and b to ensure that b is stored at an offset that is a multiple of 4, and three more bytes of trailing padding after c so that every element of an array of these structures remains correctly aligned. The overall data structure takes up 12 bytes.

If a developer moves field a after b, the alignment requirements of all fields can be satisfied without any interior padding. That reduces the overall size of the structure to 8 bytes: 6 bytes of fields plus 2 bytes of trailing padding. The simplest way to minimize the padding the compiler adds is to sort the fields by their alignment requirements: start with the fields with the greatest alignment requirement and continue in declining order. Clearly, this dense way of allocating the data structure results in both a leaner usage of the cache capacity and a lower bandwidth demand, because less padding data needs to be transferred from the DRAM. These kinds of simple transformations, such as reordering fields to increase data density, can result in significant performance improvements.

Moving Unnecessary Data

Another type of poor cache line utilization arises in code that stores a vector of data objects, each being a structure with a status field and a payload.

The program periodically goes through all the elements of the vector and checks each element's status. If the status indicates that the object is active, some action is taken and the object variable is accessed. The status byte and the five object variables are allocated together in a 48-byte chunk, so each time the status of an object is checked, all 48 bytes are brought into the cache even though only one byte is needed. A more efficient approach is to keep all the status bytes in a separate vector, which reduces the cache footprint and bandwidth needed for the status scan to one-eighth of that of the original layout.

Performance improvements of a factor of two have been achieved for the SPEC CPU benchmark libquantum, and of a factor of seven for the open-source application cigar, when similar optimizations were applied.

Other Cache Issues

Other program behaviors that can limit cache effectiveness (and may represent an occasion for optimization) include:

Separating computations on large arrays into different loops when there isn't a data dependency that requires one loop to be completed before the other one starts

Programmers will need to consider the implications of these behaviors in order to create applications that run fast and efficiently and have the scalability needed to take advantage of multi- and manycore processors. While compilers can recognize some of these patterns and apply some optimizations automatically, there are practical limits to what can be detected at compile time.

Caching with Multithreaded Programs

Multithreaded programs allow a single computation to utilize more than one core of a multicore processor. When developers program with threads, they create multiple execution contexts within a single operating system process. Each execution context can run simultaneously on a separate core. The process itself has a memory image, and the threads operate on that memory image simultaneously. Communication between the threads can happen very naturally by having them read from, and write to, shared data structures.

The cache is involved in almost all movement of data between the cores of a multicore processor, and it always works with data in units of cache lines. Programming with threads is complex enough that it is easy to neglect the cache and operate under the assumption that data is moved back and forth between processors at the finer granularity of individual words and bytes. This oversight can easily result in a bottleneck that limits scalability.

One way that this conceptual challenge manifests is false sharing, in which two cores unwittingly write to distinct data items that happen to reside on the same cache line, thereby creating excess coherency traffic as the line bounces between the cores' caches.

Lessons from Cache Optimization and General Parallelism

The observations that guide cache optimization are closely related to some of the key concepts for thinking about parallelism more generally. Cache optimization draws attention to two aspects of the program data.

The first aspect is the way data is used. Is it used repeatedly within sequential iterations of a calculation or used only one time? How is the program looping over this data? Is there a way of walking through the data that keeps computations performed on the same data close together? Is a given location needed by multiple threads? This awareness of the relationship between data and computations is essential to finding opportunities for parallelism.

The second aspect is the way the data is placed on cache lines. Object-oriented programming encourages programmers to abstract away details such as data placement. This abstraction allows for reuse, but it obscures details that are critical to understanding and optimizing performance. Awareness of bandwidth and data placement is fundamental to cache optimization and is key to scalability in parallel programming.

Conclusion

Running many programs at the same time or single programs with many threads is necessary to maximize the benefit of multi- and manycore processors. This means that the power of next-generation, manycore beasts can be utilized only if the cores can be supplied with a steady diet of data.

The processor cache plays a vital role in feeding the multicore beast. Ideally, it places the required data where it can be easily and quickly fetched, keeping the many cores of the processor fed. When the cache isn't able to function efficiently, the data is not quickly available, and application performance can slow to a crawl. This article has looked at how cache works and discussed a variety of different techniques that can help developers create programs that make better use of the cache. As cache optimization can be both daunting and complex, developers should take advantage of cache memory optimization tools that analyze memory bandwidth and latency, data locality, and thread communications/interaction in order to pinpoint performance issues and obtain guidance on how to solve them.

At the algorithm design level, there are two different approaches that can be taken towards designing cache-friendly programs. Cache-aware algorithms factor in details about the cache as design inputs and are aggressively tuned to perform well at a specific cache size. Cache-oblivious algorithms are tuned for caches more generally and not for a specific cache size. That both of these techniques can yield highly efficient algorithms is not a surprise, since we've seen here what kind of a difference caches can make.

Programmers need these techniques and tools if they want their software to run efficiently on future generations of processors. Those who neglect the cache are likely to end up with programs that try to monopolize it and end up starving other programs sharing that cache.
