This work addresses the growing performance gap between the microprocessor and the memory subsystem. Current L1 caches fabricated in deep submicron processes must either shrink to maintain timing or
suffer higher latencies, exacerbating the problem. We introduce a new classification of memory traffic, which we refer to as target behavior. Target behavior
falls into two categories: Uni-Targeted
Instructions (UTI) and Multi-Targeted Instructions (MTI).
On average, 30% of all dynamic memory LD/ST operations
come from execution of UTIs, yet only a few hundred static
instructions are actually UTIs. This concentration makes isolating
UTI targets a promising avenue for optimization. Adding a
small, fast cache structure that contains only UTI data
would ideally reduce MTI pollution of UTI information. By
intelligently selecting between larger, slower data caches
and our UTI cache, we mitigate the latency problem and
improve performance.
Our distinct contributions fall in three areas, with implications for many others: (1) we present a new characterization of memory traffic based on the number of targets of
LD/ST instructions; (2) we explore the underlying nature of
the target division and devise a simple mechanism, based on a UTI cache, for exploiting its regularity; (3) we explore
a variety of prediction mechanisms and processor configuration options to determine sensitivity and the performance
gains actually attainable under different modern processor
configurations. We attain IPC improvements of up to 42% on
SPEC2000, with a mean improvement of 8%. Our solution also reduces L2 accesses by up to 89% (average 29%),
while reducing load-load violation traps by up to 84% (average 13%), and store-load violation traps by up to 43%
(average 8%).
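
The UTI/MTI division described above can be illustrated with a minimal sketch. It assumes, since the abstract only states that the classification depends on the number of targets, that a static LD/ST instruction counts as a UTI when all of its dynamic executions reference a single address, and as an MTI otherwise. The function name classify_targets, the (pc, address) trace format, and the sample trace are hypothetical and serve only as illustration; this is not the classification or prediction hardware evaluated in this work.

```python
from collections import defaultdict


def classify_targets(trace):
    """Split static LD/ST instructions into UTI and MTI sets.

    trace: iterable of (pc, address) pairs, one per dynamic memory operation.
    Assumption: a UTI is a static instruction (identified by its pc) that
    touches exactly one distinct address over the whole trace.
    """
    targets = defaultdict(set)       # pc -> distinct addresses referenced
    dynamic_count = defaultdict(int)  # pc -> number of dynamic executions
    for pc, addr in trace:
        targets[pc].add(addr)
        dynamic_count[pc] += 1

    uti = {pc for pc, addrs in targets.items() if len(addrs) == 1}
    mti = set(targets) - uti

    # Fraction of dynamic memory operations issued by UTIs.
    total = sum(dynamic_count.values())
    uti_fraction = sum(dynamic_count[pc] for pc in uti) / total if total else 0.0
    return uti, mti, uti_fraction


# Hypothetical trace: three static instructions, seven dynamic operations.
trace = [
    (0x400, 0x1000), (0x400, 0x1000), (0x400, 0x1000),  # one target: UTI
    (0x404, 0x2000), (0x404, 0x2008), (0x404, 0x2010),  # strided: MTI
    (0x408, 0x3000),                                     # one target: UTI
]
uti, mti, uti_fraction = classify_targets(trace)
print(sorted(hex(pc) for pc in uti), sorted(hex(pc) for pc in mti), uti_fraction)
```

An offline pass over a complete trace, as sketched here, sees every execution before deciding. A hardware mechanism would instead have to predict the class of an instruction at run time, which is where the prediction mechanisms of contribution (3) come in.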