The large working sets of commercial and scientific workloads stress the L2 caches of Chip Multiprocessors (CMPs). Some CMPs use a shared L2 cache to maximize on-chip cache capacity and minimize misses. Others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals strive to balance latency and capacity, but use simple, static rules that are not robust to changes in workload behavior and system configuration. This paper studies alternative L2 cache designs for an 8-processor CMP system and shows that two previous selective-replication mechanisms actually degrade performance by up to 13% for some combinations of scientific and commercial workloads and system configurations. We propose the Adaptive Selective Replication (ASR) mechanism, which dynamically monitors workload behavior to control replication. ASR replicates read-only blocks only when it estimates that the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses). Full-system simulation results show that ASR provides robust performance, decreasing runtimes by as much as 31% versus shared caches, 33% versus private caches, and 27% versus CMP-NuRapid and Victim Replication. Furthermore, while ASR does not improve the performance of all workloads on all system configurations, it almost always performs as well as the best alternative.
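
The replication rule stated in the abstract (replicate a read-only block only when the estimated benefit exceeds the estimated cost) can be illustrated with a minimal sketch. This is not the paper's implementation: the class `ReplicationEstimate`, its fields, the cycle-based units, and the function `should_replicate` are all hypothetical names chosen for illustration, and the abstract does not say how ASR actually obtains its estimates.

```python
from dataclasses import dataclass


@dataclass
class ReplicationEstimate:
    """Hypothetical per-interval estimates; names and units are assumptions."""
    extra_local_hits: int    # remote L2 hits that replication would turn into local hits
    remote_hit_penalty: int  # assumed extra cycles of a remote hit over a local hit
    extra_misses: int        # additional L2 misses caused by capacity lost to replicas
    miss_penalty: int        # assumed extra cycles of an L2 miss over an L2 hit

    def benefit(self) -> int:
        # Cycles saved through lower L2 hit latency.
        return self.extra_local_hits * self.remote_hit_penalty

    def cost(self) -> int:
        # Cycles lost to the additional L2 misses.
        return self.extra_misses * self.miss_penalty


def should_replicate(block_is_read_only: bool, est: ReplicationEstimate) -> bool:
    """Replicate only read-only blocks, and only if benefit exceeds cost."""
    if not block_is_read_only:
        return False
    return est.benefit() > est.cost()
```

With these assumed numbers, `should_replicate(True, ReplicationEstimate(1000, 20, 10, 300))` compares a benefit of 20000 cycles against a cost of 3000 cycles and permits replication, while a workload whose extra misses dominate would be denied.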