An important class of algorithmic changes involves blocking data structures to fit in cache. By organizing data memory accesses, one can load the cache with a small subset of a much larger data set. The idea is then to work on this block of data in cache. By using/reusing this data in cache we reduce the need to go to memory (reduce memory bandwidth pressure).
Topic

Blocking is a well-known optimization technique that can help avoid memory bandwidth bottlenecks in a number of applications. The key idea behind blocking is to exploit the inherent data reuse available in the application by ensuring that data remains in cache across multiple uses. Blocking can be performed on 1-D, 2-D or 3-D spatial data structures. Some iterative applications can further benefit from blocking over multiple iterations (commonly called temporal blocking) to further mitigate bandwidth bottlenecks. In terms of code change, blocking typically involves a combination of loop splitting and interchange. In most application codes, blocking is best performed by the user by making the right source changes with some parameterization of the block-factors.

In this example, data (body2) is streamed from memory. Assuming NBODIES is large, we would have no reuse in cache. This application is memory bandwidth bound. The application will run at the speed of memory to CPU speeds, less than optimal.

In this modified code, data (body22) is kept and reused in cache, resulting in better performance.

For instance, the code snippet above shows an example of blocking NBody code. There are two loops (body1 and body2) iterating over all bodies. The original code on the top streams through the entire set of bodies in the inner loop, and must load the body2 value from memory in each iteration. The blocked code at the bottom is obtained by splitting the body2 loop into an outer loop iterating over bodies in multiple of BLOCK, and an inner body22 loop iterating over elements within the block, and interleaving the body1 and body2 loops. This code reuses a set of BLOCK body2 values across multiple iterations of the body1 loop. If BLOCK is chosen such that this set of values fits in cache, memory traffic is brought down by a factor of BLOCK.

Here is the relevant code-snippet from an OpenMP* version of the NBody benchmark (with blocking applied using a factor of CHUNK_SIZE). In this case, the loop unroll-jam transformation is expressed as a pragma and is left to be done by the compiler. In such cases, studying the output from -opt-report can confirm that the compiler did perform the unroll-jam optimization for your loop(s).

Cache Blocking is a technique to rearrange data access to pull subsets (blocks) of data into cache and to operate on this block to avoid having to repeatedly fetch data from main memory. As the examples above show, it is possible to manually block loop data in such a way to reuse cache. For performance critical loops where a performance analysis indicates memory bandwidth constraints and -opt-report is showing that the compiler is not blocking the loop optimally, you may consider hand-unrolling of the loop to better block data for cache reuse.

NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ coprocessors. The paths provided in this guide reflect the steps necessary to get best possible application performance.