This paper focuses on translating the concept of prefetching into real
performance. Software-controlled prefetching not only adds
instruction overhead, but can also increase the load on the memory
subsystem. It is therefore important to reduce this overhead by
eliminating prefetches of data that are already in the cache. We have
developed an algorithm that identifies the references likely to miss
in the cache, and issues prefetches only for them.
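The idea can be illustrated with a minimal sketch (not the paper's compiler output): assuming 64-byte cache lines holding 8 doubles and a hypothetical prefetch distance of 4 lines, only the reference that first touches each cache line receives a prefetch, since subsequent references to the same line will hit. The function name, the constants, and the use of GCC's `__builtin_prefetch` are illustrative assumptions.

```c
#include <stddef.h>

#define DOUBLES_PER_LINE 8  /* assumed: 64-byte cache line */
#define PREFETCH_LINES   4  /* assumed distance to hide miss latency */

/* Sum an array, prefetching selectively: one prefetch per cache line
   rather than one per reference. Indiscriminate prefetching would
   issue n prefetches; this issues roughly n / DOUBLES_PER_LINE. */
double sum_selective_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        size_t ahead = i + PREFETCH_LINES * DOUBLES_PER_LINE;
        /* Only the first access to each line is a likely miss; the
           bounds check avoids forming an out-of-range pointer. */
        if (i % DOUBLES_PER_LINE == 0 && ahead < n)
            __builtin_prefetch(&a[ahead], 0, 3);
        s += a[i];
    }
    return s;
}
```

A real compiler would implement the same effect by unrolling or splitting the loop by the line size, removing the per-iteration conditional.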

Our experiments show that our algorithm can greatly improve
performance, for some programs by as much as a factor of two. We
also demonstrate that our algorithm is significantly better than an
algorithm that prefetches indiscriminately: it reduces the total
number of prefetches issued without significantly reducing the
coverage of the cache misses. Finally, our experiments show that
software prefetching can complement blocking in achieving better
overall performance.
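That complementarity can be sketched as follows (an illustration, not the paper's code): blocking keeps a tile of one operand resident in the cache to eliminate capacity misses, while a prefetch hides the latency of the streaming references that blocking does not capture. The matrix size `N`, block size `B`, prefetch placement, and use of GCC's `__builtin_prefetch` are all assumptions for the example.

```c
#define N 64
#define B 16  /* assumed: a B x B tile of b fits in the cache */

/* c += a * b, blocked so each B x B tile of b is reused across all
   rows of a, with a prefetch for upcoming elements of the streamed
   operand a. */
void matmul_blocked_prefetch(const double a[N][N], const double b[N][N],
                             double c[N][N]) {
    for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < kk + B; k++) {
                    /* Prefetch the next tile's a element; the bounds
                       check keeps the address in range. */
                    if (k + B < N)
                        __builtin_prefetch(&a[i][k + B], 0, 0);
                    double r = a[i][k];
                    for (int j = jj; j < jj + B; j++)
                        c[i][j] += r * b[k][j];
                }
}
```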

Future microprocessors, with their even faster computation rates,
must provide support for memory hierarchy optimizations. We advocate
that the architecture provide lockup-free caches and prefetch
instructions. More complex hardware prefetching appears unnecessary.