Understanding CPU caching and performance

An introduction to the concepts of CPU caching and performance.

Locality

Example: A Byte's Brief Journey Through the Memory Hierarchy

For the sake of example, let's say the CPU issues a LOAD instruction telling the memory subsystem to load a piece of data (in this case, a single byte) into one of its registers. The request first goes out to the L1 cache, which is checked to see whether it contains the requested data. If the L1 cache does not contain the data and therefore cannot fulfill the request (a situation called a cache miss), the request propagates down to the L2 cache. If the L2 cache does not contain the desired byte either, the request begins the relatively long trip out to main memory. If main memory doesn't contain the data, then we're in big trouble, because it then has to be paged in from the hard disk, an act that can take a relative eternity in CPU time.

Let's assume that the requested byte is found in main memory. Once located, the byte is copied from main memory into the L2 and L1 caches, along with a number of its neighboring bytes, in the form of a cache block or cache line. When the CPU requests this same byte again, it will find the byte waiting there in the L1 cache, a situation called a cache hit.
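The journey described above can be sketched in a few lines of code. This is a toy model, not real hardware: the 64-byte line size is a common real-world value, but the dictionary "caches", the 4 KB pretend memory, and the function names are all illustrative assumptions.

```python
# Toy two-level cache model: look in L1, then L2, then main memory,
# filling both caches with a whole line on a miss.
LINE_SIZE = 64                    # bytes per cache line (a common value)

l1, l2 = {}, {}                   # line address -> line data (stand-ins for SRAM)
memory = bytes(range(256)) * 16   # 4 KB of pretend main memory

def load_byte(addr):
    """Return (byte, where_found), filling the caches on a miss."""
    line_addr = addr - (addr % LINE_SIZE)     # align down to a line boundary
    if line_addr in l1:
        return l1[line_addr][addr % LINE_SIZE], "L1 hit"
    if line_addr in l2:
        l1[line_addr] = l2[line_addr]         # promote the line into L1
        return l1[line_addr][addr % LINE_SIZE], "L2 hit"
    # Miss in both caches: fetch the whole line from main memory and
    # install it in L2 and L1, so the neighboring bytes are cached too.
    line = memory[line_addr:line_addr + LINE_SIZE]
    l2[line_addr] = line
    l1[line_addr] = line
    return line[addr % LINE_SIZE], "miss (filled from memory)"

print(load_byte(100))  # first touch: a miss; bytes 64..127 are brought in
print(load_byte(100))  # the same byte again: an L1 hit
print(load_byte(101))  # a neighboring byte: also an L1 hit, thanks to the line fill
```

Note that the miss doesn't just fetch the one requested byte: the whole line comes in, which is exactly why the neighboring byte hits on the third access.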

Computer architects usually divide misses into three different types, depending on the situation that brought each one about. I'll introduce these three types of misses at appropriate points over the course of the article, but I can talk about the first one right now. A compulsory miss is a cache miss that occurs because the desired data was never in the cache and therefore must be fetched into it for the first time in a program's execution. It's called a "compulsory" miss because, barring the use of certain specialized tricks like data prefetching, it's the one type of miss that just can't be avoided. All cached data must be brought into the cache for the very first time at some point, and the occasion for doing so is normally a compulsory miss.
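One way to see why compulsory misses can't be avoided is to count first touches: a line address the cache has never held can only miss. The sketch below does just that; the 64-byte line size and the function name are illustrative assumptions, and it deliberately ignores evictions and cache capacity.

```python
# Counting compulsory misses with a simple first-touch check: a line
# address seen for the first time can do nothing but miss.
LINE_SIZE = 64

lines_ever_cached = set()   # every line address the cache has ever held
compulsory_misses = 0

def touch(addr):
    """Record an access; return True if it is a compulsory miss."""
    global compulsory_misses
    line_addr = addr // LINE_SIZE
    if line_addr not in lines_ever_cached:
        lines_ever_cached.add(line_addr)   # first time this line is fetched
        compulsory_misses += 1
        return True
    return False                           # the line has been cached before

for addr in (0, 1, 64, 0, 65):
    touch(addr)

print(compulsory_misses)  # 2: lines 0 and 1 each miss exactly once, on first touch
```

No matter how large or clever the cache, those two first-touch misses would still happen; only prefetching, which fetches a line before it's requested, can hide them.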

The other two types of misses result when the CPU requests data that was previously in the cache but has since been evicted for some reason or other. We'll discuss evictions later.

How different applications use the cache

There's one very simple principle that's basic to how caches work: locality of reference. We generally find it useful to talk about two types of locality of reference: spatial locality and temporal locality. Spatial locality is a fancy way of labeling the general rule that if the CPU needs an item from memory at any given moment, it's likely to need that item's neighbors next. Temporal locality is the name we give to the general rule that if an item in memory was accessed once, it's likely to be accessed again in the near future. Depending on the type of application, both code and data streams can exhibit spatial and temporal locality.
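The two kinds of locality can be contrasted with made-up access traces run through a tiny single-level cache. Everything here is an illustrative assumption (the 64-byte line, the unbounded cache with no evictions, the traces themselves); the point is only how the access pattern drives the hit rate.

```python
# A sketch contrasting spatial and temporal locality: the same simple
# line cache sees very different traffic from two access patterns.
LINE_SIZE = 64

def hit_rate(trace):
    """Fraction of byte accesses that hit in an unbounded line cache."""
    cached, hits = set(), 0
    for addr in trace:
        line = addr // LINE_SIZE
        if line in cached:
            hits += 1
        else:
            cached.add(line)   # line fill; no evictions in this sketch
    return hits / len(trace)

spatial = list(range(256))   # sequential bytes: each fill serves 63 neighbors
temporal = [4096] * 100      # the same byte over and over: one fill, then hits

print(hit_rate(spatial))     # 0.984375: only 4 line fills in 256 accesses
print(hit_rate(temporal))    # 0.99: one fill, then 99 hits
```

Spatial locality pays off because a miss drags in a whole line of neighbors; temporal locality pays off because a line, once fetched, stays put for the repeat visits.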

Spatial locality

The concept of locality of reference is probably easy enough to understand without much explanation, but just in case it's not immediately clear, let's take a moment to look a little closer. Spatial locality is the easiest type of locality to understand, because most of us have used media applications like MP3 players, DVD players, and other types of apps whose datasets consist of large, ordered files. Consider an MP3 file, which consists of a series of blocks of data that the processor consumes in sequence from the file's beginning to its end. If the CPU is running Winamp and has just requested second 1:23 of a five-minute MP3, then you can be reasonably certain that it's going to want seconds 1:24, 1:25, and so on next. The same is true of a DVD file, and of many other types of media files like images, AutoCAD drawings, and Quake levels. All of these applications operate on large arrays of sequentially ordered data that the CPU grinds through in sequence, again and again.

In the picture above, the red cells are related chunks of data in the memory array. This picture shows a program with fairly good spatial locality, since the red cells are clumped closely together. In an application with poor spatial locality, the red cells would be randomly distributed among the unrelated blue cells.