New cache design cuts processing time by 15 percent

Caching algorithms get smarter, use 25 percent less energy.

Transistors keep getting smaller and smaller, enabling chip designers to develop ever-faster processors. But no matter how fast the chip gets, moving data from one part of the machine to another still takes time.

To date, chip designers have addressed this problem by placing small caches of local memory on or near the processor itself. Caches store the most frequently used data for easy access. But the days of a cache serving a single processor (or core) are over, making cache management a nontrivial challenge. Additionally, cores typically have to share data, so the physical layout of the communication network connecting the cores needs to be considered, too.

Researchers at MIT and the University of Connecticut have now developed a set of new “rules” for cache management on multicore chips. Simulation results show that the rules significantly improve chip performance while reducing energy consumption. The researchers' first paper, presented at the IEEE International Symposium on Computer Architecture, reported an average 15 percent reduction in execution time and 25 percent savings in energy.

So how are these caches typically managed, and what is this group doing differently?

Caches on multicore chips are arranged in a hierarchy, of sorts. Each core gets its own private cache, which can be divided into several levels based on how quickly it can be accessed. However, there is also a shared cache, commonly referred to as the last-level cache (LLC), which all the cores are allowed to access.

Most chips use this level of organization and rely on what’s called the “spatiotemporal locality principle” to manage cache. Spatial locality means that if a piece of data is requested by a core, that same core will probably request other data stored near it in main memory. Temporal locality means that if a core requests a piece of data, it is likely to need that same data again soon. Processors use these two patterns to try to keep the caches filled with the data that’s most likely to be needed next.
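
As a rough, machine-dependent illustration (not code from the MIT/UConn work), summing a two-dimensional array row by row exploits spatial locality, while walking it column by column fights it:

/* Illustrative only: a toy demonstration of spatial and temporal locality,
 * not anything from the MIT/UConn design. Results depend on the machine. */
#include <stdio.h>
#include <time.h>

#define N 2048

static double grid[N][N];            /* ~32 MB, far larger than any cache */

int main(void)
{
    double sum = 0.0;
    clock_t t0, t1;

    /* Spatial locality: row-by-row traversal touches adjacent addresses,
     * so every cache line that gets fetched is fully used. */
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];
    t1 = clock();
    printf("row-major (cache friendly):   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Column-by-column traversal jumps N*sizeof(double) bytes per access,
     * wasting most of each fetched line and typically running much slower. */
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j];
    t1 = clock();
    printf("column-major (cache hostile): %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Temporal locality is the other half: sum, the loop counters, and the most
     * recently used lines of grid are reused immediately, which is exactly the
     * data a cache tries to keep close to the core. */
    return (int)sum;                 /* keeps the compiler from deleting the loops */
}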

However, this principle isn't flawless, and it comes up short when the data a core is working on exceeds the capacity of its private cache. In this case, the chip wastes a lot of time trying to swap data around the cache hierarchy. This is the problem George Kurian, a graduate student in MIT's Department of Electrical Engineering and Computer Science, is tackling.
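
To get a feel for the problem, the sketch below runs the same number of memory accesses against a small working set and a large one. The sizes are placeholders rather than figures from the paper, but once the working set outgrows the private caches, the same amount of work takes far longer:

/* Rough sketch of the problem described above: the same number of memory
 * accesses gets much slower once the working set outgrows the private cache.
 * Sizes below are placeholders; real L1/L2 sizes vary by processor. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double time_passes(char *buf, size_t working_set, size_t total_bytes)
{
    volatile char sink = 0;
    clock_t t0 = clock();
    for (size_t touched = 0; touched < total_bytes; touched += 64)
        sink += buf[touched % working_set];   /* touch one byte per 64-byte line */
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    const size_t total = 1UL << 30;          /* 1 GiB worth of accesses each run */
    char *buf = calloc(1, 64UL << 20);       /* 64 MiB buffer */
    if (!buf) return 1;

    /* 16 KiB fits comfortably in a typical private L1 cache... */
    printf("16 KiB working set: %.3f s\n", time_passes(buf, 16UL << 10, total));
    /* ...while 64 MiB spills out of the private caches entirely, so the chip
     * spends its time shuffling data up and down the hierarchy. */
    printf("64 MiB working set: %.3f s\n", time_passes(buf, 64UL << 20, total));

    free(buf);
    return 0;
}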

Kurian worked with his advisor, Srini Devadas, at MIT and Omer Khan at the University of Connecticut, and their paper presents a hardware design that mitigates problems associated with the spatiotemporal locality principle. When the data being stored exceeds the capacity of the core's private cache, the chip splits up the data between private cache and the LLC. This ensures that the data is stored where it can be accessed more quickly than if it were in the main memory.

Another case addressed by the new work occurs when two cores are working on the same data and are constantly synchronizing their cached version. Here, the technique eliminates the synchronization operation and simply stores the shared data at a single location in the LLC. Then the cores take turns accessing the data, rather than clogging the on-chip network with synchronization operations.
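
The sketch below illustrates the traffic pattern being described, not the researchers' hardware fix: two threads repeatedly updating the same location force its cache line to bounce between the cores' private caches, and every bounce is coherence traffic on the on-chip network.

/* A sketch of the traffic pattern described above, not the paper's hardware
 * fix: two threads hammering the same cached location force its cache line
 * to "ping-pong" between the two cores' private caches.
 * Build with: cc -O2 -pthread pingpong.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 50000000L

static atomic_long shared_counter;   /* both threads write this cache line */

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        /* Every increment needs exclusive ownership of the line, so the line
         * (plus invalidation messages) shuttles back and forth between cores. */
        atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", atomic_load(&shared_counter));
    return 0;
}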

The new paper examines another case, where two cores are working on the same data but not synchronizing it frequently. Typically, the LLC is treated as a single memory bank, and the data is distributed across the chip in discrete chunks. The team has developed a second circuit to treat these chunks as extensions of each core’s private cache. This allows each core to have its own copy of the data in the LLC, giving it much faster access.

The nice thing about this work is that it impacts a number of aspects of a processor’s function. By making the caching algorithms a bit smarter, the researchers were able to speed up code execution while cutting down on the number of commands that needed to be executed simply to manage memory. With fewer commands being executed, energy use necessarily dropped.

64 Reader Comments

Given that AMD and Intel use an inclusive cache, wouldn't that mean you end up with potentially 2 copies of each dataset for each core? That means if you have a 4 core CPU, you have 8 copies of the data in your overall cache...

I'm pretty sure AMD uses exclusive because one of the issues AMD has is lots of cache snooping when working on shared data across cores. AMD tends to do better than Intel when it comes to science data crunching because of a large effective cache and algorithms that share little data, but Intel tends to do better when many cores need to sync with each other and share data.

Inclusive cache gets the benefit of quicker sharing of data among cores, but at the cost of constant overhead to keep data in sync. Exclusive gains the benefit of less duplication of data and less syncing, but comes at the cost of slower data sharing among cores.
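
(A back-of-the-envelope version of the duplication argument, with made-up cache sizes rather than any specific AMD or Intel part:)

/* Back-of-the-envelope version of the duplication argument above.
 * The cache sizes are placeholders, not any specific AMD or Intel part. */
#include <stdio.h>

int main(void)
{
    const int cores = 4;
    const int private_kib = 256;      /* per-core private cache (assumed) */
    const int llc_kib = 8 * 1024;     /* shared last-level cache (assumed) */

    /* Inclusive: everything in a private cache is duplicated in the LLC,
     * so the unique data held on-chip is bounded by the LLC alone. */
    int inclusive_kib = llc_kib;

    /* Exclusive: the LLC only holds lines that are NOT in a private cache,
     * so the private capacity adds to the total. */
    int exclusive_kib = llc_kib + cores * private_kib;

    printf("inclusive effective capacity: %d KiB\n", inclusive_kib);
    printf("exclusive effective capacity: %d KiB\n", exclusive_kib);
    return 0;
}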

Other than that, there are a dozen cache papers each year that claim to get ~15% better performance while saving energy, so you might end up being very busy reporting on this kind of stuff.

Indeed, simulations are always fun.

It's also solving the least interesting problem today. The million dollar question is cache coherence protocols that scale to more than a few dozen cores.

I think the general consensus is to do away with cache-coherence past a certain core count or to group cores into cache coherency domains.

Intel had research into working cache coherency protocols for up to ~80 cores (or was it 64?), but yeah the fear seems to be that we really have to give up cache coherency or settle for something less. The hw manufacturers probably wouldn't mind too much if everyone had to (hey much easier design for them), but it makes programming much uglier. Worst case we end up with something like Cuda 1.0 where getting optimal performance out of the memory subsystem was something of a black art.

How does the old saying - Lamport I think - go? There are only two hard things in Computer Science: Naming things and cache invalidation.

AMD doesn't use exclusive cache exclusively though. The ultramobile Bobcat line (which the current Jaguar core is part of) features inclusive caches.

Well SGI is doing cache coherency up to 2048 cores with their UV 2000 system. The surprising thing is that the limiting factor with their architecture isn't socket count but rather the physical address limitations of the Sandy Bridge-EP chips at 64 TB of memory. The cache coherency protocols break down when there is more memory in a system than can physically be addressed by a single core. Going beyond 64 TB is still possible without coherency, and applications can use the full 64-bit virtual address space.
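
(The 64 TB figure presumably comes from the 46-bit physical address width generally reported for Sandy Bridge-EP; a quick sanity check:)

/* Quick sanity check on the 64 TB figure, assuming Sandy Bridge-EP's
 * 46-bit physical address width. */
#include <stdio.h>

int main(void)
{
    unsigned long long phys_bytes = 1ULL << 46;            /* 2^46 bytes */
    printf("2^46 bytes = %llu TiB\n", phys_bytes >> 40);   /* prints 64 */
    return 0;
}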

It is nice to know that people are working on breaking that 64TB of onboard RAM barrier. The ZFS file system needs all the RAM it can get. As this item from the manual says, large storage arrays require large RAM allocations. 64TB of RAM should cover most home users, but large storage arrays could run into problems.

20.2.1.1. Memory: At a bare minimum, the total system memory should be at least one gigabyte. The amount of recommended RAM depends upon the size of the pool and the ZFS features which are used. A general rule of thumb is 1GB of RAM for every 1TB of storage. If the deduplication feature is used, a general rule of thumb is 5GB of RAM per TB of storage to be deduplicated. While some users successfully use ZFS with less RAM, it is possible that when the system is under heavy load, it may panic due to memory exhaustion. Further tuning may be required for systems with less than the recommended RAM requirements.
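
(The manual's rules of thumb turned into arithmetic, with an arbitrary example pool size; the last line is where the "about 13 petabytes" figure mentioned below comes from:)

/* Just the manual's rules of thumb as arithmetic: 1 GB of RAM per TB of pool,
 * or 5 GB per TB when deduplication is enabled. Pool size is an arbitrary example. */
#include <stdio.h>

int main(void)
{
    double pool_tb = 100.0;                        /* example pool size */
    printf("%.0f TB pool, no dedup:  ~%.0f GB RAM\n", pool_tb, pool_tb * 1.0);
    printf("%.0f TB pool, dedup on:  ~%.0f GB RAM\n", pool_tb, pool_tb * 5.0);

    /* Flipping the dedup rule around: 64 TB of RAM covers roughly
     * 64 * 1024 / 5 = ~13,100 TB, i.e. about 13 PB of storage. */
    double ram_gb = 64.0 * 1024.0;
    printf("64 TB RAM covers ~%.1f PB of deduplicated storage\n",
           ram_gb / 5.0 / 1024.0);
    return 0;
}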

Well, it all sounds very interesting, but I don't see in the paper what kind of memory model the software running on the cores ends up seeing. That is rather important, because if this caching scheme ends up offering less ordering assurance than the ARM architecture guarantees, it's not going to be able to run ARM code in practice (even if the cores are replaced by ARM cores); and so such a scheme won't ship any time soon (and x86 has stricter requirements than ARM does).

Because to answer a question from one of my fellow commenters, the speed at which such research makes it to mainstream hardware (or software) is directly related to how well it supports existing software constraints.

The way Intel described it, cache coherency needs to be immediate, so you need to clock down the CPU to give enough time for the information to be propagated to EVERY core.

Your CPU frequency cannot be faster than the time it takes for the electrical signal to get propagated to every CPU core. Or at least with the x86 version of it. I'm sure SGI had a special version with different trade-offs.

Fritzr, are you sure you mean 64TB is enough RAM for most home users? Because that would safely handle about 13 petabytes of data (with dedupe on). And that's more than most people could ever fill in their lifetime (without just writing bloat files).

(I realize you are just continuing the error and it was not yours initially).

It was definitely a tongue in cheek observation on that item, but RAM size is a very real problem for ordinary consumer products when using zfs.

The thing to synchronize isn't necessarily the raw data but the knowledge that a cached copy of a memory location isn't necessarily the most recent version of the data. The nice thing about NUMA designs is that the memory controller in each socket can keep track of both local and remote loads. Caching algorithms can record not only whether data has been loaded but where it would be cached (of course this is dependent on the actual caching algorithm and the hardware implementation to support it). This is why execution locality is important for performance: the closer a core is to where the data resides, the less traffic needs to be passed around for coherency purposes (this is in addition to the general latency benefits of quicker loads/stores).

This is why transactional memory is a big deal on systems this large: it sets up a database-like structure for loads/stores from memory. It can lock a memory location so that only a particular thread may perform a store to that location.
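
(A toy sketch of that bookkeeping: a directory entry per cache line recording the owner and the sharers, so only the nodes that actually hold copies need to be messaged. The structure and field sizes are made up for illustration, not taken from any real implementation:)

/* Toy sketch of directory-style bookkeeping: per cache line, track which node
 * owns it and which nodes hold copies, so only interested nodes get messaged.
 * Not a real protocol implementation; names and field sizes are made up. */
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES 64

typedef struct {
    uint64_t sharers;    /* bitmask: which nodes hold a (read-only) copy */
    int8_t   owner;      /* node holding the line in modified state, -1 if none */
} dir_entry_t;

/* A node wants to write the line: invalidate all other copies and record the
 * new exclusive owner. Returns how many coherence messages we'd have to send. */
static int handle_write_request(dir_entry_t *e, int node)
{
    int msgs = 0;
    for (int n = 0; n < MAX_NODES; n++)
        if (n != node && (e->sharers & (1ULL << n)))
            msgs++;                        /* invalidate node n's copy */
    if (e->owner >= 0 && e->owner != node)
        msgs++;                            /* fetch/invalidate the old owner */
    e->sharers = 1ULL << node;
    e->owner = (int8_t)node;
    return msgs;
}

int main(void)
{
    dir_entry_t line = { .sharers = (1ULL << 3) | (1ULL << 17), .owner = -1 };
    printf("write from node 5 sends %d messages\n", handle_write_request(&line, 5));
    return 0;
}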

A thousand cache papers are published every year, many of which show better gains than those presented here. So why were these particular papers presented? Did the authors contact someone at Ars who isn’t familiar with the literature?

We have a bad enough time trying to keep threads from locking each other up, etc. You're going to have to add more complexity into the compiler to keep track of memory access and/or have a watchdog routine in the cache electronics to keep track of when two threads are synchronizing the same data, etc. I'm sure the work here is interesting but I think there are other problems that will need to be solved before this can go mainstream.

Have you ever had to deal with the cache directly when you coded? Probably not. This is probably working at the hardware level and transparent to the programmer. A caching scheme that requires the programmer's intervention is probably not all that useful.

Certain architectures do allow a bit of flexibility in what gets put into cache. The PowerPC cores used in the previous generation of consoles had instructions to prevent data from being cached when loaded. This was intended for data streaming where the loaded data would be quickly discarded. This improved performance by lowering the amount of thrashing in the caches and (indirectly) keeping the frequently used data there.

"Volatile" would take care of what you just described if the compiler is aware the caching capabilities of a processor.

The SGI UV 2000 system is an Intel Xeon MP system that can scale to 256 sockets and 64TB of NUMA memory. Since there are 8 core processors available for Xeon MP, this means 2048 cores and with hyperthreading, it means 4096 threads. SGI has been doing large, single-system-image computers for a VERY long time and they know what they are doing in that specific area. I was working on SGI Origin 2000 systems 15 years ago that could scale to 4096 processors/cores and have nearly linear performance increases all the way up for the operating system scheduler (obviously, the application in use matters but in those days, the biggest limitation for large processor systems was the operating system scheduler and this was a solved issue for SGI at the time).

In a modern CPU, the period of one clock cycle isn't even enough for the L1 cache logic to determine whether or not it holds the cache line that has been requested. A Haswell, which is a 4+ GHz design, needs 4 cycles to get data from a cache line that is held in the L1 cache, and about 30 or 40 cycles if it's in an L3 cache slice.
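
(A rough way to see those latency steps for yourself is a pointer-chasing loop, where every load depends on the previous one; the sketch below is machine dependent and nothing like a rigorous benchmark:)

/* Rough pointer-chasing sketch: each load depends on the previous one, so the
 * average time per hop reflects whichever cache level the working set fits in.
 * Machine dependent; not a rigorous benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t entries, long hops)
{
    size_t *next = malloc(entries * sizeof *next);
    if (!next) return -1.0;

    /* Sattolo's shuffle: one big random cycle, so a stride prefetcher
     * can't hide the latency of the dependent loads. */
    for (size_t i = 0; i < entries; i++)
        next[i] = i;
    for (size_t i = entries - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    size_t p = 0;
    clock_t t0 = clock();
    for (long i = 0; i < hops; i++)
        p = next[p];                       /* each load depends on the last */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    volatile size_t sink = p;              /* keep the chase from being removed */
    (void)sink;
    free(next);
    return secs;
}

int main(void)
{
    const long hops = 20 * 1000 * 1000;
    printf("32 KiB  (L1-sized):  %.3f s\n", chase((32UL  << 10) / sizeof(size_t), hops));
    printf("8 MiB   (LLC-sized): %.3f s\n", chase((8UL   << 20) / sizeof(size_t), hops));
    printf("256 MiB (DRAM):      %.3f s\n", chase((256UL << 20) / sizeof(size_t), hops));
    return 0;
}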

"Volatile" would take care of what you just described if the compiler is aware the caching capabilities of a processor.

Nop.

Volatile tells the compiler (err...) that it should not assume the variable has not been somehow (i.e., by another thread) modified between two uses. That means the compiler should generate a memory access instruction for each time the variable shows up in the code. This means the compiler will not remove "redundant" memory access instructions, as it would normally do.

It has no implication as to whether the CPU should cache that variable or not. Unless you're dealing with non-coherent caches/memory. And even if you are, most modern CPUs have system-level mechanisms to mark regions of memory as "not cacheable". So, "volatile" leads the compiler to generate normal memory access instructions.

What Kevin G mentioned is simply a performance hint to the CPU "don't bother to cache this access, because it's a waste of space". In theory, CPUs can even ignore those hints.
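
(A minimal illustration of that distinction; the spin-wait on a volatile flag is shown only to make the point about compiler behavior, and real code should use C11 atomics instead:)

/* Minimal illustration of the distinction being made above. "volatile" only
 * constrains the compiler (every use becomes a real load instruction); it says
 * nothing about the CPU's caches and provides no atomicity or ordering, so a
 * real program should use C11 atomics here, not volatile.
 * Build with: cc -O2 -pthread volatile_demo.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int done;   /* without volatile, the spin loop below could be
                             * compiled into a single load hoisted out of the
                             * loop, i.e. an infinite loop under optimization */

static void *setter(void *arg)
{
    (void)arg;
    sleep(1);
    done = 1;                /* plain store; still goes through the data cache */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, setter, NULL);

    while (!done)            /* volatile forces a fresh load each iteration */
        ;                    /* busy-wait (illustration only) */

    puts("flag observed");   /* the value arrived via ordinary cache coherency,
                              * not because volatile bypassed the cache */
    pthread_join(t, NULL);
    return 0;
}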

I knew it went to 2 cycles at some point, but I didn't know about 4 cycles. Well, L2 cache is guaranteed to be atomic, which is 12 cycles. So I guess they have 12 cycles to make sure cache-coherency has propagated to all L2 caches of all cores on the domain.

No, they have 12 cycles because that's the time a core needs to check whether it has a cache line in its L2 cache.

Cache coherency does not work as you think. Cache coherency is achieved by having each core know the state of each cache line it holds. The most basic system is based on 3 states: http://en.wikipedia.org/wiki/MSI_protocol

The details vary, but the cores communicate with each other to achieve this, and this communication can take up to hundreds of clocks in large-scale multi-socket systems.
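
(For reference, a toy version of those three states and the transitions between them, seen from one core's cache; real protocols such as MESI or MOESI add more states and all of the messaging and writeback machinery:)

/* Toy sketch of the three-state (MSI) protocol linked above, from the point
 * of view of a single cache line in one core's cache. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } msi_event;

static msi_state msi_next(msi_state s, msi_event e)
{
    switch (s) {
    case INVALID:
        if (e == LOCAL_READ)  return SHARED;    /* fetch a clean copy */
        if (e == LOCAL_WRITE) return MODIFIED;  /* fetch + invalidate others */
        return INVALID;                         /* remote traffic: nothing to do */
    case SHARED:
        if (e == LOCAL_WRITE)  return MODIFIED; /* upgrade: invalidate other sharers */
        if (e == REMOTE_WRITE) return INVALID;  /* another core took ownership */
        return SHARED;                          /* reads keep it shared */
    case MODIFIED:
        if (e == REMOTE_READ)  return SHARED;   /* supply data / write back */
        if (e == REMOTE_WRITE) return INVALID;  /* write back, then drop it */
        return MODIFIED;                        /* local hits stay modified */
    }
    return INVALID;
}

int main(void)
{
    msi_state s = INVALID;
    s = msi_next(s, LOCAL_READ);    /* I -> S */
    s = msi_next(s, LOCAL_WRITE);   /* S -> M */
    s = msi_next(s, REMOTE_READ);   /* M -> S, dirty data written back */
    printf("final state: %d (0=I, 1=S, 2=M)\n", s);
    return 0;
}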