For VPENTA, the locality optimizer introduces spatial locality for every
reference in the inner loop by interchanging two of the surrounding loops.
Rather than iterating along the columns of the matrices, which results in
a miss on every iteration since the data are stored in row-major order,
the restructured code iterates along the rows. References therefore miss
only when they cross cache-line boundaries, which happens once every four
iterations in this case.

With this locality optimization alone, the performance improves
significantly. Even so, the selective prefetching scheme without this
optimization performs better, since it manages to eliminate almost
all memory stall cycles. Comparing the prefetching schemes before and after
the loop interchange, we see that the indiscriminate prefetching scheme
improves by only 11% while the selective prefetching scheme improves by
25%. The selective scheme improves more because it recognizes that
after loop interchange only one reference in four can miss, and hence it
has to issue only one fourth as many prefetches.
Consequently it is able to reduce its instruction overhead accordingly.
However, the indiscriminate scheme does not realize that many of its
prefetches are now unnecessary, and therefore continues to suffer from
large instruction overhead.

The best overall performance, by a substantial margin, comes only through
the combination of both locality optimization and selective prefetching.