Intel Ivy Bridge Cache Replacement Policy

By Henry, on January 25th, 2013

Caches are used to store a subset of a larger memory space in a smaller, faster memory, with the hope that future memory accesses will find their data in the cache rather than needing to access slower memory. A cache replacement policy decides which cache lines should be replaced when a new cache line needs to be stored. Ideally, data that will be accessed in the near future should be preserved, but real systems cannot know the future. Traditionally, caches have used (approximations of) the least-recently used (LRU) replacement policy, where the next cache line to be evicted is the one that has been least recently used.

Measuring Cache Sizes

The behaviour of LRU replacement policies with cyclic access patterns is useful for measuring cache sizes and latencies. The access pattern used to generate Figure 1 is a random cyclic permutation, where each cache line (64 bytes) in an array is accessed exactly once in a random order before the sequence repeats. Each access is data-dependent so this measures the full access latency (not bandwidth) of the cache. Using a cyclic pattern results in sharp changes in latency between cache levels.

Figure 1 clearly shows two levels of cache for the Yorkfield Core 2 (32 KB and 6 MB) and three levels for the other three. All of these transitions are fairly sharp, except for the L3-to-memory transition for Ivy Bridge (“3rd Generation Core i5”). There is new behaviour in Ivy Bridge’s L3 cache compared to the very similar Sandy Bridge. Curiosity strikes again.

Ivy Bridge vs. Sandy Bridge

Figure 2 shows a comparison of Sandy Bridge and Ivy Bridge, for varying stride. The stride parameter affects which bytes in the array are accessed by the random cyclic permutation. For example, a stride of 64 bytes will only access pointers spaced 64 bytes apart, accessing only the first 4 bytes of each cache line, and accessing each cache line exactly once in the cyclic sequence. A stride less than the cache line size results in accessing each cache line more than once in the random sequence, leading to some cache hits and transitions between cache levels that are not sharp. Figure 2 shows that Sandy Bridge and Ivy Bridge have the same behaviour except for strides at least as big as the cache line size for Ivy Bridge’s L3.

There are several hypotheses that can explain an improvement in L3 cache miss rates. Only a changed cache replacement policy agrees with observations:

Prefetching: An improved prefetcher capable of prefetching near-random accesses would benefit accesses of any stride. Figure 2 shows no improvement over Sandy Bridge for strides smaller than a cache line.

Changed hash function: This would show a curve with a strange shape, as some parts of the array see a smaller cache size while some other parts see a bigger size. This is not observed.

Changed replacement policy: Should show apparent cache size unchanged, but transitions between cache levels may not show the sharp transition seen with LRU policies. This agrees with observations.

Figure 3 shows two plots similar to Figure 2 for larger stride values (512 bytes to 128 MB). The curved shape of the plots for Ivy Bridge is clearly visible for many stride values.

Adaptive replacement policy?

An interesting paper from UT Austin and Intel from ISCA 2007 (Adaptive Insertion Policies for High Performance Caching) discussed improvements to the LRU replacement policy for cyclic access patterns that don’t fit in the cache. The adaptive policy tries to learn whether the access pattern reuses the cache lines before eviction and chooses an appropriate replacement policy (LRU vs. Bimodal Insertion Policy, BIP). BIP places new cache lines most of the time in the LRU position, the opposite behaviour of LRU.

Testing for an adaptive policy can be done by attempting to defeat it. The idea is to trick the cache into thinking that cached data is reused, by modifying the access pattern to reuse each cache line before eviction. Instead of a single pointer chase by repeating p = *(void**)p, the inner loop was changed to do two pointer chases p = *(void**)p; q = *(void**)q;, with one pointer lagging behind the other by some number of iterations, designed to touch each line fetched into the L3 cache exactly twice before eviction. Figure 4 plots the same parameters as Figure 3 but with the dual pointer chase access pattern. The Ivy Bridge plots closely resemble Sandy Bridge, showing that the replacement policy is adaptive, and has been mostly defeated.

Since Ivy Bridge uses an adaptive replacement policy, it is likely that the replacement policy is closely related to the one proposed in the paper. Before we probe for more detail, we need to look a little more closely at the L3 cache.

Ivy Bridge L3 Cache

The L3 cache (also known as LLC, last level cache) is organized the same way for both Sandy Bridge and Ivy Bridge. The cache line size is 64 bytes. The cache is organized as four slices on a ring bus. Each slice is either 2048 sets * 16-way (2 MB for Core i7) or 2048 sets * 12-way (1.5 MB for Core i5). Physical addresses are hashed and statically mapped to each slice. (See http://www.realworldtech.com/sandy-bridge/8/) Thus, the way size within each slice is 128 KB, and a stride of 128 KB should access the same set of all four slices using traditional cache hash functions.

Figure 3 reveals some information about the hash function used. Here are two observations:

Do bits [16:0] (128KB) of the physical address map directly to sets? I think so. If not, the transition at 6 MB would be spread out somewhat, with latency increasing over several steps, with the increase starting below 6 MB.

Is the cache slice chosen by exactly bits [18:17] of the physical address? No. Higher-order address bits are also used to select the cache slice. In Figure 3, the apparent cache size with 256 KB stride has doubled to 12 MB. It continues to double at 512, 1024, and 2048 KB strides. This behaviour resembles a 48-way cache. This can happen if the hash function equally distributes 256 through 2048 KB strides over all four slices. Thus, some physical address bits higher than bit 20 are used to select the slice, not just bits [20:17]. Paging with 2 MB pages limits my ability to test physical address strides greater than 2 MB.

Choosing a Policy using Set Dueling

Adaptive cache replacement policies choose between two policies depending on which one is appropriate at the moment. Set dueling proposes to dedicate a small portion of the cache sets to each policy to detect which policy is performing better (dedicated sets), and the remainder of the cache as follower sets that follow the better policy. A single saturating counter compares the number of cache misses occurring in the two dedicated sets.

This test attempts to show set-dueling behaviour by finding which sets in the cache are dedicated and which are follower sets. This test uses a 256 KB stride (128 KB works equally well), which maps all accesses onto one set in each of the four cache slices (due to the hash function). By default, this would mean all accesses only touch the first set in each slice if the low address bits [16:0] are zero. Thus, I introduce a new parameter, offset, which adds a constant to each address so that I can map all accesses to the second, third, etc. sets in each slice instead of always the first. This parameter is swept both up and down, and the cache replacement policy used is observed.

Since the adaptive policy chooses between two different replacement policies, I chose to distinguish between the two by observing the average latency at one specific array size, 4/3 of the L3 cache size (e.g., compare Sandy Bridge vs. Ivy Bridge at 8 MB in Figure 2). A high latency indicates the use of an LRU-like replacement policy, while a lower latency indicates the use of a thrashing-friendlier BIP/MRU type policy. Figure 5 plots the results.

It appears that most of the cache sets can use both cache replacement policies, except for two 4 KB regions at 32-36 and 48-52 KB (4 KB = 64 cache sets). These two regions always use LRU and BIP/MRU policies, respectively. The plot is periodic every 128 KB because there are 2048 sets per cache slice.

The global-counter learning behaviour is seen in Figure 5 by observing the different policies used while sweeping offset ascending vs. descending. Whenever the offset causes memory accesses to land on a dedicated set, it accumulates cache misses, causing the rest of the cache to flip to using the other policy. Cache misses on follower sets do not influence the policy, so the policy does not change until offset reaches the next dedicated set.

DIP or DRRIP?

The later paper from Intel (ISCA 2010) proposes a replacement policy that is improved over DIP-SD by also being resistant to scans. DIP and DRRIP are similar in that both use set dueling to choose between two policies (SRRIP vs. BRRIP in DRRIP, LRU vs. BIP in DIP).

Figure 6: Scan resistance of Sandy Bridge LRU, Ivy Bridge “LRU-like”, and Ivy Bridge “BIP-like”. Pointer chasing through one huge array, and one array whose size is swept (x-axis). Replacement policy is scan-resistant if it preferentially keeps the small array in the cache rather than split 50-50. [png][pdf]

One characteristic of RRIP is that it proposes four levels of re-reference prediction. True LRU has as many levels as the associativity, while NRU (not recently used) has two levels (recently used, or not). Intel’s presentation at Hot Chips 2012 hinted at using four-level RRIP (“Quad-Age LRU” on slide 46). Thus, I would like to measure whether Ivy Bridge uses DIP or DRRIP.

Scans are short bursts of cache accesses that are not reused. SRRIP (but not LRU) attempts to be scan tolerant by predicting new cache lines to have a “long” re-reference interval and allowing recently-used existing cache lines to age slowly before being evicted, preferentially evicting the newly-loaded but not reused cache lines. This behaviour provides an opportunity to detect scan resistance by pointer-chasing through two arrays, one of which is much larger than the cache (scanning), one of which fits in the cache (working set).

Figure 6 plots the average access time for accesses that alternate between a huge array (cache miss) and a small array (possible cache hit) whose size is on the x-axis. The size of the working set that will fit in the cache in the presence of scanning is related to scan resistance. Memory accesses are serialized using lfence. Replacement policy is selected by choosing a stride and offset that coincides with a dedicated set (See Figure 5).

With half of the accesses hitting each array, an LRU replacement policy splits the cache capacity evenly between the two arrays, causing L3 misses when the small array size exceeds ~half the cache. BIP (or BRRIP) is scan-resistant and keeps most of the cache for the frequently-reused small array. The LRU/SRRIP policy in Ivy Bridge behaves very similar to Sandy Bridge and does not seem to be scan-resistant, thus it is likely not SRRIP as proposed. It could, however, be a variant on SRRIP crafted to behave like LRU.

The four-level RRIP family of replacement policies are an extension of NRU, using two bits per cache line to encode how soon each cache line should be evicted (RRPV). On eviction, one cache line with RRPV=3 (LRU position) is chosen to be evicted. If no cache lines have RRPV=3, all RRPV values are incremented until a cache line can be evicted. Whenever a cache line is accessed (cache hit), its RRPV value is set to 0 (MRU position). The various insertion policies as proposed in the paper are as follows:

BRRIP: New cache lines are inserted with RRPV=3 with high probability, RRPV=2 otherwise.

SRRIP: New cache lines are inserted with RRPV=2

I made some modifications for my simulations:

BRRIP: Modified the low-probability case to insert with RRPV=0 instead of RRPV=2

LRU-like RRIP: New cache lines are inserted with RRPV=0. This is intended to approximate LRU.

Figure 7: Cache simulations of scan resistance of several replacement policies. Ivy Bridge (Figure 6) matches very well with LRU and BRRIP policies, but does not show the scan-resistance property of SRRIP. [png][pdf]

Figure 7 shows L3-only cache simulation results of running the same access pattern through simulations of several replacement policies. The simulated cache is configured to match the Core i5 (four 1.5 MB slices, 12-way). The two policies used by Ivy Bridge (Figure 6) match very closely to LRU and BRRIP. Ivy Bridge matches less closely to a modified four-level RRIP configured to behave like LRU, and also matches slightly less well to BIP (not shown) than BRRIP. The simulation of SRRIP verifies its scan-resistance properties, which are not observed in the experimental measurement of Ivy Bridge.

I am not able to measure whether Ivy Bridge has really changed from the pseudo-LRU that was used in Sandy Bridge to a four-level RRIP. Given Intel’s hint in the slide and that Ivy Bridge behaves slightly differently from Sandy Bridge, I am inclined to believe that Ivy Bridge uses RRIP, despite the experimental measurements matching more closely to LRU than LRU-like four-level RRIP. However, it is fairly clear that Ivy Bridge lacks the scan-resistance property proposed in the ISCA 2010 paper.

Conclusion

Although the cache organization between Sandy Bridge and Ivy Bridge are essentially identical, Ivy Bridge’s L3 cache has an improved replacement policy. The policy appears to be similar to a hybrid between “Dynamic Insertion Policy — Set Duel” (DIP-SD) and “Dynamic Re-Reference Interval Prediction” (DRRIP), using four-level re-reference predictions and set dueling, but without scan resistance. For each 2048-set cache slice, 64 sets are dedicated to each of the LRU-like and BRRIP policies, with the remaining 1920 cache sets being follower sets that follow the better policy.

The accesses should be data-dependent reads (with as few other interfering instructions as possible), and the access pattern should be fairly random.

The accesses should be reads: Writes are buffered so the true write latency isn’t easily observable, and you can’t make stores data-dependent on earlier stores.

Because we’re trying to measure memory latency, the load address must be data-dependent on the previous load, or the processor will attempt to issue multiple loads in parallel and hide some of the latency. In your example, the addresses are known very early (a function of i and j and not the previous load).

The access pattern needs randomness to defeat any hardware prefetching that might exist in the processor.

Other than loop overhead, the inner loop consists of a single load instruction that is data-dependent on the previous load. The trick is in initializing the array to contain pointers into the array itself, such that pointer chasing results in the desired access pattern. I used random cycles (variations of Sattolo’s algorithm) to defeat prefetching and allow repeatedly cycling through the array.

I am a little confused about this sentence in the part of Ivy Bridge L3 Cache, “If not, the transition at 6 MB would be spread out somewhat”. My question is, if there is hash bit in [16 : 0], it means that some physical addresses will be mapped to other slice. What leads to the transition you mentioned? The latency difference between different slices? Thank you very much!

I think you’re right that it’s possible for the slice hash number to include some bits within [16:6] in the hash. As long as the cache lines are distributed evenly, this might be hard (or impossible?) to detect.

My thinking was that the set number must be exactly [16:6], even if some of those bits are used to select the cache slice. If sets aren’t distributed evenly, there tends to be some conflict misses when the cache is nearly full (leading to some cache misses with arrays slightly smaller than the cache size), but because there are some conflicts, some parts of the cache don’t get entirely kicked out due to LRU (leading to some cache hits for arrays bigger than the cache size). I’m not sure if this is entirely true, and it seems plausible that you could come up with a set number that isn’t exactly [16:6] yet still looks like a sharp transition.

One other hint that the set number is exactly [16:6] are from the larger stride accesses. Even at 128KB stride, exactly 48 cache lines fit in the L3 (corresponding to 12-way * 4 slices with all accesses hitting the same set). If the set number changed at all with 128KB-aligned addresses (i.e., if the set number includes bits [~22:17]), I would expect to see more than 48 cache lines being able to fit in the cache when using 128KB-aligned accesses.

But I could be wrong. I haven’t done any testing specifically to find out the hash function(s).

I guess, on your first chart, the fourth step represents the latency of the main memory (correct me if I am wrong). How much would that be approximately for Sandy Bridge? Does 4200 MHz represents the clock speed of a certain Sandy Bridge processor ?

Yes, the last step is main memory (and some TLB miss latency in there too). As you can see from the graph, it’s about 230 cycles with a L1 TLB hit on my system (at 4.2 GHz, with my particular DRAM timings).

Hi Henry,
Great discussion. I am a hardware guy and OK C programmer, but have little experience with C++. I tried to compile lat.cc, lat2.cc, and lat3.cc on a MacPro using clang based on LLVM 3.5 and received the following warnings and errors. Any suggestions on where I need to start looking on fixing this? Thank you for any help!

The three warnings are expected. I’m deliberately passing in an argument to printf that isn’t used, just to avoid compiler optimization (It’s in the comment for that line).

The errors don’t make much sense to me. I wonder if it’s because you need to compile with clang++ (a C++ compiler) rather than clang (a C compiler)? Allocating and freeing arrays using new and delete is straightforward code, so I wouldn’t expect compilers to complain about them.

gcc vs. g++ also gives out some strange linker errors if you compile C++ with gcc.

The sharp bends are because the memory access pattern uses a cacheline stride (i.e., use -cyclelen 8 because 8 64-bit pointers fit in a 64-byte cacheline), and also from reducing TLB misses by using 2MB page size. Using 2M pages is rather OS-specific though, so I doubt that would work on OSX without some changes.

Hi,
Thanks for the post I found it very interesting. Is there any chance you could post the compiler flags and test flags that you used for each graph as I am struggling to reproduce some of the graphs.