Authors

Abstract

Modern graphics processing units (GPUs) include hardware-controlled caches to reduce bandwidth requirements and energy
consumption. However, current GPU cache hierarchies are inefficient for general-purpose GPU (GPGPU) computing. GPGPU
workloads tend to include data structures that would not fit in any reasonably sized cache, leading to very low cache
hit rates. This problem is exacerbated by the design of current GPUs, which share small caches between many threads.
Caching these streaming data structures needlessly burns power while evicting data that may otherwise fit into the cache.

We propose a GPU cache management technique to improve the efficiency of small GPU caches while further reducing their
power consumption. It adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being
evicted. This technique saves energy by avoiding needless insertions and evictions, and it improves performance by
reducing cache pollution. We show that, with a 16 KB L1 data cache, dynamic bypassing achieves performance similar to
that of an L1 cache twice the size while reducing energy consumption by 25% and power by 18%.

The technique is especially interesting for programs that do not use programmer-managed scratchpad memories. We give
a case study to demonstrate the inefficiency of current GPU caches compared to programmer-managed scratchpad memories
and show the extent to which cache bypassing can make up for the potential performance loss where programming
scratchpad memories is impractical.