Slide 3: Review of Cache Coherency
A KTEC Center of Excellence
- Snooping vs. directory-based protocols
- MSI, MESI, MOSI, MOESI
- Dirty data can often be transferred from cache to cache
- Clean data is more difficult because it does not have an "owner"
- Inclusion vs. exclusion
- Performance issues related to coherency
- Mutex example
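The protocol families listed above share the same basic shape. As a refresher, here is a minimal sketch of the plain MSI variant as a transition table; the state and event names are illustrative, not from the slides:

```python
# Minimal single-line MSI sketch (illustrative; real protocols also
# generate bus actions such as invalidates and write-backs).
MSI_TRANSITIONS = {
    # (state, event) -> next state
    ("I", "local_read"):  "S",  # read miss: fetch a copy, enter Shared
    ("I", "local_write"): "M",  # write miss: fetch exclusive, enter Modified
    ("S", "local_write"): "M",  # upgrade: other sharers are invalidated
    ("S", "snoop_write"): "I",  # another core writes: drop our clean copy
    ("M", "snoop_read"):  "S",  # another core reads: supply dirty data, downgrade
    ("M", "snoop_write"): "I",  # another core writes: supply data, invalidate
}

def next_state(state, event):
    """Apply one event; unlisted (state, event) pairs leave the state unchanged."""
    return MSI_TRANSITIONS.get((state, event), state)
```

Note that the only cache-to-cache supply path here is out of M (dirty data, which has a unique owner); a clean copy in S has no owner, which is exactly the difficulty the slide points out.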

Slide 4: Chip Multiprocessing (CMP)
- Multiple CPU cores on a single chip
- Different from hardware multithreading (MT): fine-grained, coarse-grained, SMT
- Becoming popular in industry: Intel Core 2 Duo, AMD X2, UltraSPARC T1, IBM Xenon, Cell
- A common memory architecture is an L1 cache per core and one L2 shared by all cores on chip
  – Each core can use the entire L2 cache
  – L2 cache contention can become a problem for memory-bound threads
- Another organization is a private L2 cache per core
  – Lower latency to the L2 cache and a simpler design

Slide 5: Goals of CMP Caching
- Must be scalable!!!
- Reduce off-chip transactions
  – Off-chip accesses are expensive and getting worse
- Reduce side effects between cores
  – Cores run different computations and should not severely affect their neighbors when one is memory-bound
- Reduce latency (the main goal of caching)
  – The latency of a shared on-chip cache becomes a problem at high clock speeds

Slide 6: Cooperative Caching
- Each core has its own private L2 cache, but additional logic in the cache-controller protocol allows the private L2 caches to act as an aggregate cache for the chip
- Goal: achieve both the low latency of private L2 caches and the low off-chip miss rate of a shared L2 cache
- Adapted from file-server and web caches, where remote operations are expensive

Slide 7: Methods of Cooperative Caching
- A private L2 cache per core is the baseline
- Reduce off-chip accesses
  – Victim data is not written off chip; it is placed in a neighbor's cache ("capacity stealing")
  – This did not apply to old SMP systems: talking to a neighbor processor was as expensive as talking to memory. That is no longer true for CMP.
- The amount of cooperation can be controlled dynamically

Slide 8: Reducing Off-chip Accesses — Cache-to-Cache Transfers of Clean Data
- Most cache-coherence protocols do not allow cache-to-cache transfers of clean data
  – Dirty data can be transferred cache to cache because it has a known owner
  – Clean data may be in more than one place, so assigning an owner to clean data complicates the coherence protocol
  – As a result, clean-data transfers for coherence often go through the next level down (SLOW!)
- The extra complexity is worth it in a CMP because going to the next level of the memory hierarchy is expensive
- The authors claim that sharing clean data can reduce off-chip misses by 10-50%
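The owner-for-clean-data idea can be sketched as directory bookkeeping: one sharer is designated the "clean owner", and a later read miss is forwarded to it instead of going off chip. The class and method names below are hypothetical, not the paper's hardware:

```python
# Sketch: directory that tracks clean copies and designates one sharer
# as the owner responsible for cache-to-cache forwarding (assumption:
# the first filler becomes the owner; the paper's CCE may choose differently).
class CleanOwnerDirectory:
    def __init__(self):
        self.sharers = {}  # block -> set of core ids holding a clean copy
        self.owner = {}    # block -> core id that forwards the data

    def record_fill(self, block, core):
        self.sharers.setdefault(block, set()).add(core)
        self.owner.setdefault(block, core)  # first sharer becomes clean owner

    def record_eviction(self, block, core):
        self.sharers.get(block, set()).discard(core)
        if self.owner.get(block) == core:
            remaining = self.sharers.get(block)
            # Hand ownership to any remaining sharer; if none remain,
            # the next miss must go off chip.
            self.owner[block] = next(iter(remaining)) if remaining else None

    def serve_miss(self, block):
        """Return the core that forwards the data, or None for an off-chip access."""
        return self.owner.get(block)
```

This also shows why the directory must be told about clean evictions: otherwise it could forward a miss to a core that has silently dropped its copy.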

Slide 9: Reducing Off-chip Accesses — Replication-aware Data Replacement
- The private-cache method results in multiple on-chip copies of the same data
- When selecting a victim, evicting a block that has a duplicate on chip (a "replicate") is better than evicting one that is unique on the chip (a "singlet")
  – If a singlet is needed again in the future, an off-chip access is required to get it back
- The coherence protocol must be complicated to keep track of replicates and singlets
  – Once again, the authors claim it is worth it for CMP
- If all potential victims are singlets, plain LRU is used
- Victims can spill to a neighbor's cache using a weighted random selection algorithm that favors the nearest neighbors
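The victim-selection rule above can be sketched in a few lines. One assumption beyond the slide: when several replicates are available, this sketch breaks the tie by LRU as well, which the slides only specify for the all-singlet case:

```python
# Sketch of replication-aware victim selection (illustrative, not the
# paper's hardware implementation).
def choose_victim(candidates, is_singlet, lru_order):
    """candidates: blocks in the cache set being replaced.
    is_singlet: block -> True if this is the only on-chip copy.
    lru_order: blocks from least- to most-recently used."""
    replicates = [b for b in candidates if not is_singlet[b]]
    # Prefer evicting a replicate (another copy survives on chip);
    # fall back to all candidates when everything is a singlet.
    pool = replicates if replicates else list(candidates)
    # Within the chosen pool, evict the least recently used block.
    return min(pool, key=lru_order.index)
```

The point of the ordering is visible in the first branch: evicting a replicate never forces a future off-chip access, because another core still holds the block.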

Slide 10: Reducing Off-chip Accesses — Global Replacement of Inactive Data
- Want something like LRU for the aggregate cache
- Difficult to implement because each cache is technically private, leading to synchronization problems
- They use N-chance forwarding for the global replacement policy (bottom of page 4)
  – Each block has a recirculation count
  – When a singlet block is selected as a victim, its recirculation count is set to N
  – Each time the block is evicted, its recirculation count is decremented
  – When the count reaches 0, the block goes to main memory
  – When the block is accessed, its recirculation count is reset
  – N = 1 for the CMP Cooperative Caching simulations
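The N-chance rules above can be written out directly. This is a sketch following the slide's description; the function names are illustrative, and I assume "reset on access" means clearing the count so it is re-armed the next time the block is victimized:

```python
N = 1  # number of chances; the paper uses N = 1 for its CMP simulations

class Block:
    def __init__(self):
        self.count = None  # recirculation count; unset until first victimized

def evict_singlet(block):
    """Decide an evicted singlet's fate under N-chance forwarding."""
    if block.count is None:
        block.count = N        # first time selected as a victim: arm the count
    if block.count == 0:
        return "memory"        # out of chances: write back off chip
    block.count -= 1
    return "spill"             # forward to a neighbor's private L2 instead

def on_access(block):
    """A hit resets the recirculation count (cleared here, per our assumption)."""
    block.count = None
```

With N = 1 each block ricochets through a neighbor's cache at most once before leaving the chip, bounding how long dead data can circulate.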

Slide 12: Hardware Implementation
- Extra bits are needed to track the state from the previous slides
  – A singlet bit
  – The recirculation count
- The spilling method can be push or pull
  – Push sends the victim data directly to the other cache
  – Pull sends a request to the other cache, which then performs a read operation
- Snooping requires too much overhead for monitoring the private caches
- They choose a centralized directory-based protocol similar to MOSI
  – It might have scaling issues
  – They speculate that clusters of directories could scale to hundreds of cores, but do not go any deeper

Slide 13: Central Coherence Engine (CCE)
- Holds the directory and other centralized coherence logic
- Every read miss is sent to the directory, which says which private cache has the data
- Must keep track of both L1 and L2 tags, because each core's L1 is not inclusive in its local L2
  – Inclusion holds between the L1s and the aggregate cache instead
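The tag-tracking requirement can be illustrated with a toy lookup. Because a block may live only in some core's L1 (non-inclusion with the local L2), a directory that mirrored only L2 tags would miss it; the `CCE` class below is a hypothetical sketch, not the paper's structure:

```python
# Toy model: the CCE mirrors every core's L1 and L2 tag arrays so any
# on-chip copy can be located on a read miss.
class CCE:
    def __init__(self, num_cores):
        self.l1_tags = [set() for _ in range(num_cores)]
        self.l2_tags = [set() for _ in range(num_cores)]

    def locate(self, block):
        """Return (core, level) of some on-chip copy, or None for an off-chip miss."""
        for core in range(len(self.l2_tags)):
            if block in self.l2_tags[core]:
                return (core, "L2")
        # Without the duplicated L1 tags, a block held only in an L1
        # would wrongly look like an off-chip miss here.
        for core in range(len(self.l1_tags)):
            if block in self.l1_tags[core]:
                return (core, "L1")
        return None
```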

Slide 14: CCE (Continued)
- Picks a clean owner for each block to handle cache-to-cache transfers of clean data
  – The CCE must be updated when a clean copy is evicted from a private L2
- Implements push-based spilling by working with the private caches
  – A write-back from one cache transfers the data to its L2; the CCE then picks a new host cache for the data and transfers it there

Slide 17: Conclusion
- Cooperative caching can reduce the runtime of simulated workloads by 4-38%, and in extreme cases performs at worst 2.2% slower than the better of private and shared caches
- Power utilization (by turning off private L2s) and performance isolation (reducing side effects between cores) are left as future work