Understanding CPU caching and performance

An introduction to the concepts of CPU caching and performance.

More associativity

Another, more popular way of organizing the cache is to use a "direct mapping." In a direct-mapped cache, each block frame can cache only a certain subset of the blocks in main memory.

In the above diagram, each of the red blocks (blocks 0, 8, and 16) can be cached only in the red block frame (frame 0). Likewise, blocks 1, 9, and 17 can be cached only in frame 1, blocks 2, 10, and 18 can be cached only in frame 2, and so on. Hopefully, the pattern here is apparent: each frame caches every eighth block of main memory. As a result, the potential number of locations for any one block is greatly narrowed, and therefore the number of tags that must be checked on each fetch is greatly reduced. So, for example, if the CPU needs a byte from blocks 0, 8, or 16, it knows that it only has to check frame 0 to determine if the desired block is in the cache and to retrieve it if it is. This is much faster and more efficient than checking every frame in the cache.
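To make the mapping concrete, here's a minimal sketch in C of a direct-mapped lookup for the eight-frame cache described above. The names (NUM_FRAMES, struct frame, lookup) are my own illustrative choices, not drawn from any real hardware; a real cache would also store the block's data alongside its tag.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_FRAMES 8  /* eight-frame cache, as in the example above */

/* One frame: a valid bit plus the tag identifying which block it holds. */
struct frame {
    bool valid;
    unsigned tag;
};

static struct frame cache[NUM_FRAMES];

/* Direct-mapped lookup: block N can live only in frame N % NUM_FRAMES,
 * so exactly one tag comparison decides hit or miss. */
bool lookup(unsigned block)
{
    unsigned frame = block % NUM_FRAMES;  /* blocks 0, 8, 16 -> frame 0 */
    unsigned tag   = block / NUM_FRAMES;  /* distinguishes 0 vs. 8 vs. 16 */

    if (cache[frame].valid && cache[frame].tag == tag)
        return true;                      /* hit */

    cache[frame].valid = true;            /* miss: evict whatever was there */
    cache[frame].tag   = tag;
    return false;
}

int main(void)
{
    printf("block 0: %s\n", lookup(0) ? "hit" : "miss");  /* miss (cold) */
    printf("block 8: %s\n", lookup(8) ? "hit" : "miss");  /* miss, evicts 0 */
    printf("block 0: %s\n", lookup(0) ? "hit" : "miss");  /* miss again */
    return 0;
}
```

Note that the lookup checks exactly one frame regardless of cache size, which is where the direct-mapped scheme gets its speed.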

There are some drawbacks to this scheme, though. For instance, what if blocks 0-3 and 8-11 combine to form an eight-block "working set" that the CPU wants to load into the cache and work on for a while? The cache is eight frames long, but since it's direct-mapped it can store only four of these particular blocks at a time. Remember, blocks 0 and 8 have to go in the same frame, as do blocks 1 and 9, 2 and 10, and 3 and 11. As a result, the CPU can load only four blocks of this eight-block set at a time, swapping them in and out as it works on the set. If the CPU wants to work on this eight-block set for a long time, that could mean a lot of swapping. Meanwhile, half of the cache is going completely unused! So while direct-mapped caches are almost always faster than fully associative caches due to the shortened amount of time it takes to locate a cached block, they can still be inefficient under some circumstances.

Note that the kind of situation described above, where the CPU would like to store multiple blocks but it can't because they all require the same frame, is called a collision. In the preceding example, blocks 0 and 8 are said to collide, since they both want to fit into frame 0 but can't. Misses that result from such collisions are called conflict misses, the second of the three types of cache miss that I mentioned earlier.
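To watch those conflict misses pile up, here's a self-contained C simulation of the scenario above. It's a sketch under simplifying assumptions: it tracks only the block number held by each frame rather than tags and valid bits, which is enough to count misses, and the four-pass access pattern is an arbitrary illustrative choice.

```c
#include <stdio.h>

#define NUM_FRAMES 8

int main(void)
{
    /* -1 means "frame empty"; otherwise the block number cached there. */
    int frames[NUM_FRAMES] = { -1, -1, -1, -1, -1, -1, -1, -1 };
    int misses = 0;

    /* The working set from the example: blocks 0-3 and 8-11, touched
     * repeatedly. Blocks n and n+8 collide in frame n, so every access
     * after the first evicts the block the next access will want. */
    int working_set[] = { 0, 1, 2, 3, 8, 9, 10, 11 };

    for (int pass = 0; pass < 4; pass++) {
        for (int i = 0; i < 8; i++) {
            int block = working_set[i];
            int frame = block % NUM_FRAMES;  /* direct mapping */
            if (frames[frame] != block) {
                frames[frame] = block;       /* evict the colliding block */
                misses++;
            }
        }
    }

    printf("misses: %d of %d accesses\n", misses, 4 * 8);
    return 0;
}
```

Every one of the 32 accesses misses, even though frames 4-7 never hold anything: exactly the half-empty thrashing described above.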

One way to get some of the benefits of direct-mapped caches while lessening the amount of wasted cache space due to collisions is to restrict the caching of main memory blocks to a subset of the available cache frames. To see what I mean, take a look at the diagram below, which represents a four-way associative cache.

[Diagram: an eight-frame, four-way associative cache divided into two sets. Red main memory blocks map to set 0; light yellow blocks map to set 1.]

In this diagram, any of the red blocks can go anywhere in the red set of frames (set 0) and any of the light yellow blocks can go anywhere in the light yellow set of frames (set 1). You can think of it like this: we took a fully associative cache and cut it in two, restricting half the main memory blocks to one side and half to the other. This way, the odds of a collision are greatly reduced versus the direct-mapped cache, but we still don't have to search all the tags on every fetch like we did with the fully associative cache. For any given fetch, we need to search only a single four-frame set to find the block we're looking for, which in this case amounts to half the cache.
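Here's a rough C sketch of that lookup. It assumes the conventional mapping in which a block's set is its block number modulo the number of sets (so, with two sets, even-numbered blocks go to set 0 and odd-numbered blocks to set 1); the names and the trivial fill-first-empty-way policy are illustrative, not taken from any particular design.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_SETS 2   /* the two-set, eight-frame cache pictured above */
#define NUM_WAYS 4   /* four frames per set: "four-way associative" */

struct frame {
    bool valid;
    unsigned tag;
};

static struct frame cache[NUM_SETS][NUM_WAYS];

/* Set-associative lookup: a block maps to exactly one set, but may sit
 * in any of that set's four frames, so at most NUM_WAYS tags are checked. */
bool lookup(unsigned block)
{
    unsigned set = block % NUM_SETS;  /* even blocks -> set 0, odd -> set 1 */
    unsigned tag = block / NUM_SETS;

    for (int way = 0; way < NUM_WAYS; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;              /* hit after at most four comparisons */

    /* Miss: fill the first empty way. A real cache would apply a
     * replacement policy (LRU, random, ...) once the set is full. */
    for (int way = 0; way < NUM_WAYS; way++) {
        if (!cache[set][way].valid) {
            cache[set][way].valid = true;
            cache[set][way].tag   = tag;
            break;
        }
    }
    return false;
}

int main(void)
{
    /* Blocks 0 and 8 both map to set 0 but no longer collide: the set
     * has four frames, so both can stay resident at once. */
    lookup(0); lookup(8);
    printf("block 0: %s\n", lookup(0) ? "hit" : "miss");  /* hit */
    printf("block 8: %s\n", lookup(8) ? "hit" : "miss");  /* hit */
    return 0;
}
```

Notice that blocks 0 and 8, which collided in the direct-mapped cache, now coexist in set 0, and that a lookup never checks more than NUM_WAYS tags no matter how large the cache grows.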

The cache pictured above is said to be "four-way associative" because it is divided into sets of four frames each. Since the above cache has only eight frames, it can accommodate only two sets. A larger cache could accommodate more sets, reducing the odds of a collision even more. Furthermore, since every set consists of exactly four frames, no matter how big the cache gets we'll only ever have to search through four frames (or one set) to find any given block. This means that as the cache gets larger and the number of sets it can accommodate increases, the tag searches become relatively more efficient. Think about it: in a cache with three sets, only 1/3 of the cache (or one set) needs to be searched for a given block. In a cache with four sets, only 1/4 of the cache is searched. In a cache with 100 four-frame sets, only 1/100 of the cache needs to be searched. So the relative search efficiency scales with the cache size.

Finally, if we increase the number of sets in the cache while keeping the amount of main memory constant, then we can decrease the odds of a collision even more. Notice that in the above diagram there are fewer red main memory blocks competing for space in set 0 than there were competing for frame 0 in the direct-mapped cache.
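The arithmetic behind that is easy to check. The sketch below uses an arbitrary, hypothetical 64-block main memory: doubling the number of sets halves the number of blocks contending for each set, while the per-lookup search stays fixed at four tags.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical figure: a 64-block main memory feeding a four-way cache. */
    const int memory_blocks = 64;

    for (int sets = 2; sets <= 16; sets *= 2)
        printf("%2d sets: %2d blocks compete per set, "
               "4 tags checked per lookup\n",
               sets, memory_blocks / sets);
    return 0;
}
```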