Align and Organize Data for Better Performance

Challenge

Minimize performance losses caused by unaligned data. Unaligned data can seriously degrade performance, so focus on the data elements accessed in the most CPU-intensive parts of your program.

Solution

Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries. For best performance, align data as follows:

Align 8-bit data at any address.

Align 16-bit data to be contained within an aligned four-byte word.

Align 32-bit data so that its base address is a multiple of four.

Align 64-bit data so that its base address is a multiple of eight.

Align 80-bit data so that its base address is a multiple of sixteen.

Align 128-bit data so that its base address is a multiple of sixteen.
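The alignment rules above all reduce to checking whether an address is a multiple of a power-of-two boundary. The following is a minimal sketch of such a check in C; the helper name is_aligned is hypothetical and is not part of any compiler or library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when the address p is a multiple of
 * alignment. The alignment must be a nonzero power of two (1, 2, 4, 8, 16). */
static bool is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* For a power of two, the low bits of an aligned address are all zero. */
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

A debug build could call is_aligned(ptr, 16) before a vector load or store to confirm that a buffer meets the 16-byte recommendation.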

In addition, pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary. If the operands are packed in a SIMD instruction, align to the packed element size (64- or 128-bit). Align data by providing padding inside structures and arrays. Programmers can reorganize structures and arrays to minimize the amount of memory wasted by padding.
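As an illustration of how reorganizing a structure reduces padding, the sketch below declares the same four members in two orders. The structure and member names are hypothetical, and the byte counts in the comments assume a typical x86-64 ABI; other targets may lay the members out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow members placed before wider ones: the compiler inserts padding
 * so that each wider member starts on its natural boundary. */
struct padded {
    uint8_t  flag;   /* 1 byte, typically followed by 7 bytes of padding */
    uint64_t count;  /* 8 bytes */
    uint8_t  kind;   /* 1 byte, typically followed by 3 bytes of padding */
    uint32_t id;     /* 4 bytes */
};                   /* typically 24 bytes in total */

/* Same members ordered from widest to narrowest: every member is still
 * naturally aligned, but only tail padding remains. */
struct reordered {
    uint64_t count;  /* 8 bytes */
    uint32_t id;     /* 4 bytes */
    uint8_t  flag;   /* 1 byte */
    uint8_t  kind;   /* 1 byte, typically followed by 2 bytes of tail padding */
};                   /* typically 16 bytes in total */

/* Compile-time check that the reordering never costs space. */
static_assert(sizeof(struct reordered) <= sizeof(struct padded),
              "reordering should not increase the structure size");
```

Printing sizeof and offsetof for each member on the target platform confirms where the compiler actually placed the padding.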

The __declspec(align(sizeInBytes)) declaration specifier, which is supported by both the Intel® and Microsoft compilers, directs the compiler to place variables at the specified alignment. For example, the following declaration ensures that the variable signMask is aligned on a 16-byte boundary suitable for vector instruction processing:

__declspec(align(16)) static const int signMask[4] = {-1,-1,-1,-1};
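For compilers that do not accept __declspec, the same declaration can be written with the C11 alignment specifier. The sketch below is one possible arrangement, not something the original text prescribes: the macro names SIMD_ALIGNMENT and ALIGN_SIMD and the function signMask_is_aligned are hypothetical, and the fallback branch assumes a C11-conforming compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIMD_ALIGNMENT 16  /* 16-byte boundary recommended for vector access */

/* Alignment values must be powers of two. */
static_assert((SIMD_ALIGNMENT & (SIMD_ALIGNMENT - 1)) == 0,
              "alignment must be a power of two");

#if defined(_MSC_VER)
#define ALIGN_SIMD __declspec(align(SIMD_ALIGNMENT))  /* Microsoft-style specifier */
#else
#define ALIGN_SIMD _Alignas(SIMD_ALIGNMENT)           /* assumption: C11 compiler */
#endif

ALIGN_SIMD static const int signMask[4] = {-1, -1, -1, -1};

/* Runtime confirmation that the variable landed on the requested boundary. */
static bool signMask_is_aligned(void)
{
    return ((uintptr_t)signMask % SIMD_ALIGNMENT) == 0;
}
```

Keeping the specifier behind a single macro means the alignment request is stated once and can be adjusted per compiler without touching each declaration.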

Employ data-structure layout optimization to make efficient use of the 64-byte cache line. Sometimes frequently used members of a data structure are more than 64 bytes apart, with less frequently used data in between. If these members are rearranged so that they sit close together, they are more likely to share a cache line, which can reduce both the number of cache misses and the memory footprint loaded into the data cache.
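The effect of such a rearrangement can be sketched with offsetof. In the example below, the structures, their members, and the LINE_OF macro are hypothetical; the line index is relative to the start of the structure, so it matches a real cache line only if each instance begins on a 64-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* cache-line size stated in the text */

/* Index of the 64-byte line, counted from the start of the structure,
 * in which a member begins. */
#define LINE_OF(type, member) (offsetof(type, member) / CACHE_LINE_SIZE)

/* Frequently used counters separated by a rarely used name buffer. */
struct scattered {
    uint32_t hits;      /* frequently used, begins in line 0 */
    char     name[96];  /* rarely used */
    uint32_t misses;    /* frequently used, begins at offset 100, in line 1 */
};

/* Same members with the frequently used counters placed together. */
struct grouped {
    uint32_t hits;      /* frequently used, begins in line 0 */
    uint32_t misses;    /* frequently used, begins in line 0 */
    char     name[96];  /* rarely used */
};

/* Compile-time check that the hot members of the grouped layout share a line. */
static_assert(LINE_OF(struct grouped, hits) == LINE_OF(struct grouped, misses),
              "hot members should start in the same cache line");
```

Updating both counters touches two cache lines in the scattered layout and one in the grouped layout, while the structure size is unchanged.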

Ensure proper data alignment to prevent data from being split across cache-line boundaries. A data element that spans a cache-line boundary requires two separate cache-line accesses to retrieve, which can be very costly. Using the Intel® VTune™ Performance Analyzer, you can identify where such splits occur and then realign the data or lay it out differently to prevent them.
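A split can also be detected in code by comparing the cache line that holds the first byte of an object with the line that holds its last byte. This is a minimal sketch assuming the 64-byte line size mentioned earlier; the function name spans_cache_line is hypothetical and the check does not replace a profiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption: 64-byte cache lines */

/* Hypothetical helper: returns true when the size bytes starting at p
 * occupy more than one cache line. */
static bool spans_cache_line(const void *p, size_t size)
{
    assert(p != NULL);
    if (size == 0) {
        return false;  /* an empty object touches no cache line */
    }
    uintptr_t first_line = (uintptr_t)p / CACHE_LINE_SIZE;
    uintptr_t last_line  = ((uintptr_t)p + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

For example, an 8-byte value that starts 60 bytes into a cache line occupies the last 4 bytes of that line and the first 4 bytes of the next, so the function reports a split.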