New cache design cuts processing time by 15 percent

Caching algorithms get smarter, use 25 percent less energy.

Transistors keep getting smaller and smaller, enabling chip designers to develop ever-faster processors. But no matter how fast the chip gets, moving data from one part of the machine to another still takes time.

To date, chip designers have addressed this problem by placing small caches of local memory on or near the processor itself. Caches store the most frequently used data for easy access. But the days of a cache serving a single processor (or core) are over, making management of cache a nontrivial challenge. Additionally, cores typically have to share data, so the physical layout of the communication network connecting the cores needs to be considered, too.

Researchers at MIT and the University of Connecticut have now developed a set of new “rules” for cache management on multicore chips. Simulation results have shown that the rules significantly improve chip performance while simultaneously reducing energy consumption. The researchers' first paper, presented at the IEEE International Symposium on Computer Architecture, reported average improvements of 15 percent in execution time and 25 percent in energy consumption.

So how are these caches typically managed, and what is this group doing differently?

Caches on multicore chips are arranged in a hierarchy, of sorts. Each core gets its own private cache, which can be divided into several levels based on how quickly it can be accessed. However, there is also a shared cache, commonly referred to as the last-level cache (LLC), which all the cores are allowed to access.
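To make the levels concrete, here is a rough sketch in C of what such a hierarchy might look like; the sizes and latencies below are generic ballpark figures for chips of this era, not numbers taken from the paper:

    /* Illustrative cache hierarchy; all figures are rough assumptions,
     * not measurements from the researchers' design. */
    struct cache_level {
        const char *name;
        unsigned    size_kb;        /* capacity (per core unless shared) */
        unsigned    latency_cycles; /* approximate load-to-use latency   */
        int         shared;         /* 1 if shared by all cores (LLC)    */
    };

    static const struct cache_level hierarchy[] = {
        { "L1 (private)",         32,    4, 0 },
        { "L2 (private)",        256,   12, 0 },
        { "L3 / LLC (shared)",  8192,   35, 1 },
        { "main memory",     4194304,  200, 1 },
    };

The point of the hierarchy is simply that each step down is larger but slower, which is why where a piece of data ends up living matters so much.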

Most chips use this level of organization and rely on what’s called the “spatiotemporal locality principle” to manage cache. Spatial locality means that if a piece of data is requested by a core, that same core will probably request other data stored near it in main memory. Temporal locality means that if a core requests a piece of data, it is likely to need that same data again soon. Processors use these two patterns to try to keep the caches filled with the data that’s most likely to be needed next.
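As a toy illustration of those two patterns (not code from the paper), consider the difference between a sequential pass over an array and a small table that gets hit over and over:

    #include <stddef.h>

    /* Spatial locality: consecutive elements share cache lines, so a
     * sequential pass fetches each line once and reuses it for the
     * neighboring elements that came along with it. */
    double sum_sequential(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Temporal locality: the same 256 counters are touched repeatedly,
     * so they tend to stay resident in the private cache. */
    void byte_histogram(const unsigned char *bytes, size_t n,
                        unsigned counts[256])
    {
        for (size_t i = 0; i < n; i++)
            counts[bytes[i]]++;
    }

A cache built around the locality principle is betting that real access patterns will look roughly like these.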

However, this principle isn't flawless, and it comes up short when the data being stored exceeds the capacity of the core's private cache. In this case, the chip wastes a lot of time trying to swap data around the cache hierarchy. This is the problem George Kurian, a graduate student in MIT's Department of Electrical Engineering and Computer Science, is tackling.

Kurian worked with his advisor, Srini Devadas, at MIT and Omer Khan at the University of Connecticut, and their paper presents a hardware design that mitigates problems associated with the spatiotemporal locality principle. When the data being stored exceeds the capacity of the core's private cache, the chip splits up the data between private cache and the LLC. This ensures that the data is stored where it can be accessed more quickly than if it were in the main memory.
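The paper describes this in hardware terms, but the gist of the decision can be sketched in a few lines of C. The function name and the "hot line" test below are invented for illustration; they are not the researchers' actual mechanism:

    /* Hypothetical sketch of the placement idea: if the working set fits
     * in the private cache, keep everything local; otherwise keep the
     * hottest lines private and park the overflow in the shared LLC,
     * which is still far faster than going to main memory. */
    enum placement { PRIVATE_CACHE, SHARED_LLC };

    enum placement place_line(unsigned long working_set_bytes,
                              unsigned long private_capacity_bytes,
                              int line_is_hot)
    {
        if (working_set_bytes <= private_capacity_bytes)
            return PRIVATE_CACHE;              /* everything fits locally */
        return line_is_hot ? PRIVATE_CACHE     /* hot lines stay close    */
                           : SHARED_LLC;       /* overflow lives in LLC   */
    }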

Another case addressed by the new work occurs when two cores are working on the same data and are constantly synchronizing their cached version. Here, the technique eliminates the synchronization operation and simply stores the shared data at a single location in the LLC. Then the cores take turns accessing the data, rather than clogging the on-chip network with synchronization operations.

The new paper examines another case, where two cores are working on the same data but not synchronizing it frequently. Typically, the LLC is treated as a single memory bank, and the data is distributed across the chip in discrete chunks. The team has developed a second circuit that treats these chunks as extensions of each core’s private cache. This allows each core to have its own copy of the data in the LLC, allowing much faster access to the data.
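Putting the two shared-data cases together, the behavior described above can be caricatured as a simple policy choice. Again, the names and the synchronization-frequency threshold here are assumptions made for illustration, not the actual hardware logic:

    /* Hypothetical sketch: data shared by two cores is either pinned at a
     * single LLC location (when the cores synchronize it constantly) or
     * replicated into each core's nearby LLC slice (when they mostly just
     * read it), trading network traffic against cache capacity. */
    enum shared_policy { SINGLE_LLC_COPY, REPLICATE_PER_CORE };

    enum shared_policy choose_shared_policy(unsigned syncs_per_thousand_accesses)
    {
        /* Threshold is made up; the real design presumably tracks this
         * dynamically in hardware. */
        if (syncs_per_thousand_accesses > 50)
            return SINGLE_LLC_COPY;       /* cores take turns on one copy  */
        return REPLICATE_PER_CORE;        /* private-cache-like extension  */
    }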

The nice thing about this work is that it impacts a number of aspects of a processor’s function. By making the caching algorithms a bit smarter, the researchers could both speed up execution of code and cut down on the number of commands that needed to be executed simply to manage memory. With fewer commands being executed, energy use necessarily dropped.

If you're reporting on ISCA papers now, IMO you should talk about ones that actually end up influencing real technology. For example, Jaleel's 2010 ISCA paper on DRRIP apparently ended up getting implemented in Intel's Ivy Bridge core. I won't link the blog that proves it, but a search for "ivy bridge drrip" should yield the relevant post. Other than that, there are a dozen cache papers each year that claim to get ~15% better performance while saving energy, so you might end up being very busy reporting on this kind of stuff.

We have a bad enough time trying to keep threads from locking each other up, etc. You're going to have to add more complexity into the compiler to keep track of memory access and/or have a watchdog routine in the cache electronics to keep track of when two threads are synchronizing the same data, etc. I'm sure the work here is interesting but I think there are other problems that will need to be solved before this can go mainstream.

We have a bad enough time trying to keep threads from locking each other up, etc. You're going to have to add more complexity into the compiler to keep track of memory access and/or have a watchdog routine in the cache electronics to keep track of when two threads are synchronizing the same data, etc. I'm sure the work here is interesting but I think there are other problems that will need to be solved before this can go mainstream.

... given that this research is at the PhD level, I think that the researchers know this...

I say this because when I hear my brother describing the research for his PhD, it is one "road block" after another... kernel HDD write/read strategies; hard drive firmware read/write strategies; testing whether the real world is affected by your idea...

The new paper examines another case, where two cores are working on the same data but not synchronizing it frequently. Typically, the LLC is treated as a single memory bank and the data is distributed across the chip in discrete chunks. The team has developed a second circuit that treats these chunks as extensions of each core’s private cache. This allows each core to have its own copy of the data in the LLC, allowing much faster access to the data.

Given that AMD and Intel use an inclusive cache, wouldn't that mean you end up with potentially 2 copies of each dataset for each core? That means if you have a 4 core CPU, you have 8 copies of the data in your overall cache...

Quote:

Kurian worked with his advisor, Srini Devadas at MIT, and Omer Khan at the University of Connecticut, and their paper presents a hardware design that mitigates problems associated with the spatiotemporal locality principle. When the data being stored exceeds the capacity of the core's private cache, the chip splits up the data between private cache and the LLC. This ensures that the data is stored where it can be accessed more quickly than if it were in the main memory.

As above, surely this is an exclusive cache issue? If you're working with an inclusive cache, you're duplicating your local cache in the shared cache/LLC already so does this really do anything? The private cache is just for faster access to a subset of data already in the LLC.

Personally, I geek out almost as much at incremental improvements like this as I do at bigger things like silicon on insulator. It shows that progress is available in almost every part of the chips and that there's a lot more to the future of chip design than chasing Moore's law and building bigger GPUs. I hope they continue to help improve chips when they graduate (and I'm a little jealous as an EE myself).

The next step will be hyper-weaving - moving a thread between cores to chase whichever one has the best cache of the data you need. If a thread needs more than 1 core's worth of local cache and there's a core to spare, bouncing the execution context back and forth should be cheaper than juggling the whole working set.

The next step will be hyper-weaving - moving a thread between cores to chase whichever one has the best cache of the data you need. If a thread needs more than 1 core's worth of local cache and there's a core to spare, bouncing the execution context back and forth should be cheaper than juggling the whole working set.

Hardware generally doesn't understand enough about the TLB state to make that more efficient than just moving data between caches.

We have a bad enough time trying to keep threads from locking each other up, etc. You're going to have to add more complexity into the compiler to keep track of memory access and/or have a watchdog routine in the cache electronics to keep track of when two threads are synchronizing the same data, etc. I'm sure the work here is interesting but I think there are other problems that will need to be solved before this can go mainstream.

... given that this research is at the PhD level, I think that the researchers know this...

I say this because when I hear my brother describing the research for his PhD, it is one "road block" after another... kernel HDD write/read strategies; hard drive firmware read/write strategies; testing whether the real world is affected by your idea...

Oh, don't get me wrong. I did a PhD in mechanical engineering. Nobody is still (10 years+) doing what I worked on in my research - though there are a few projects that might try to incorporate that work soon.

My comment was aimed more at Ars than the research student (or the faculty member). Reading the title, I was expecting to read an article about work coming out of Intel or IBM, or, if it's from MIT, that it's production-ready. As others have commented, there are lots of papers every year that show certain performance gains during simulation. Unfortunately, simulations are only as good as their complexity and computer horsepower. Generally, the kind of simulations that can be run by a single PhD student aren't going to blow anyone's socks off.

Seriously, if you're going to post about academic papers, at least cite the paper and give us a DOI link or something.

As an academic, I can say that academic work can be very interesting, but articles like this bother me because why was this particular paper chosen?

Was it because it came from MIT? Tons of great work does not originate at MIT. Media seems to be overly obsessed with work from MIT. I'm not sure why.

Was it because of the impact it has had on actual processor designs? It's too recent for this to be the case.

The problem with this is that there is simply too much good work out there to present academic studies and give them all a fair shake. For this reason, I recommend lagging the research by a few years and posting work that is known to be highly influential.

Other than that, there are a dozen cache papers each year that claim to get ~15% better performance while saving energy, so you might end up being very busy reporting on this kind of stuff.

Indeed, simulations are always fun.

It's also solving the least interesting problem today. The million dollar question is cache coherence protocols that scale to more than a few dozen cores.

HW and SW are a push-pull relationship. These increases in HW will be nullified if SW doesn't take note of changes and optimizations. One lags, while the other leads, then vice-versa.

I still remember one of my CPU architecture classes from college where we had to optimize SW for certain cache hierarchies in single AND multi-core architectures. It was amazing to see how changing loop indexes would speed up (or slow down) compute time. We also looked at how parallelizing certain data structures was inefficient or just plain didn't work. It was a true testament to the fact that more cores didn't equal better performance.
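For anyone who hasn't seen it, the classic demonstration is just swapping the order of two nested loops over a 2-D array (illustrative C, not the actual class exercise):

    #define N 1024
    static double a[N][N];

    /* C stores a[][] row-major, so this walks memory sequentially and
     * gets near-perfect spatial locality. */
    double sum_row_major(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Same arithmetic, but each access strides a whole row ahead, so on
     * large arrays almost every load misses the cache. */
    double sum_column_major(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }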

So while I'd agree that the problem isn't the largest issue, addressing this is still a valid concern. It's almost one of those "your strongest point is your weakest link" kind of issues.

We have a bad enough time trying to keep threads from locking each other up, etc. You're going to have to add more complexity into the compiler to keep track of memory access and/or have a watchdog routine in the cache electronics to keep track of when two threads are synchronizing the same data, etc. I'm sure the work here is interesting but I think there are other problems that will need to be solved before this can go mainstream.

Have you ever had to deal with the cache directly when you coded? Probably not. This is probably working at the hardware level and transparent to the programmer. A caching scheme that requires the programmer's intervention is probably not all that useful.

The next step will be hyper-weaving - moving a thread between cores to chase whichever one has the best cache of the data you need. If a thread needs more than 1 core's worth of local cache and there's a core to spare, bouncing the execution context back and forth should be cheaper than juggling the whole working set.

We have a bad enough time trying to keep threads from locking each other up, etc. You're going to have to add more complexity into the compiler to keep track of memory access and/or have a watchdog routine in the cache electronics to keep track of when two threads are synchronizing the same data, etc. I'm sure the work here is interesting but I think there are other problems that will need to be solved before this can go mainstream.

Have you ever had to deal with the cache directly when you coded? Probably not. This is probably working at the hardware level and transparent to the programmer. A caching scheme that requires the programmer's intervention is probably not all that useful.

I did mention the compiler needs to be aware of the technology and handle it there (or possibly in the cache logic itself). There are lots and lots of things my compiler takes care of for me that I don't think about.

The new paper examines another case, where two cores are working on the same data but not synchronizing it frequently. Typically, the LLC is treated as a single memory bank and the data is distributed across the chip in discrete chunks. The team has developed a second circuit that treats these chunks as extensions of each core’s private cache. This allows each core to have its own copy of the data in the LLC, allowing much faster access to the data.

Given that AMD and Intel use an inclusive cache, wouldn't that mean you end up with potentially 2 copies of each dataset for each core? That means if you have a 4 core CPU, you have 8 copies of the data in your overall cache...

The bigger issue here is that data isn't synchronized as quickly as possible. This has the tendency to cause issues over time due to slight synchronization delays. Sure, removing this restriction in modern designs would provide a power savings and potential performance boost.

This may be exploiting a trick hardware designers use for the LLC: each CPU core tends to be tied closer to a particular slice of the LLC. Essentially the slice closest to a CPU is an L2.5 cache compared to the rest of the L3 cache. Latencies throughout the L3 cache are not equal in designs like Intel's Sandy Bridge-E or IBM's POWER7. In the case of a modern Intel chip, this would actually result in 12 copies of the same data at the same point on a quad core chip: L1, L2 and L3 for each core.

Also AMD uses both inclusive (Bobcat heritage) and exclusive (Bulldozer heritage) designs in their modern chips depending on architecture.

Quote:

Kurian worked with his advisor, Srini Devadas at MIT, and Omer Khan at the University of Connecticut, and their paper presents a hardware design that mitigates problems associated with the spatiotemporal locality principle. When the data being stored exceeds the capacity of the core's private cache, the chip splits up the data between private cache and the LLC. This ensures that the data is stored where it can be accessed more quickly than if it were in the main memory.

As above, surely this is an exclusive cache issue? If you're working with an inclusive cache, you're duplicating your local cache in the shared cache/LLC already so does this really do anything? The private cache is just for faster access to a subset of data already in the LLC.

It sounds like it is treating the LLC as a target for loads and not a victim cache. With an aggressive prefetch algorithm, this would be advantageous as data would migrate from LLC to L2 and then to L1 cache just before being needed for execution. An exclusive cache would have to retire the data in the LLC as it moves to the L2 etc., whereas an inclusive cache would not.

The problem with prefetchers is that they tend to consume a fair amount of power. This alone makes them rather rare in the ultra mobile market. They increase performance, but they may not provide enough to show a performance-per-watt increase.

If you're reporting on ISCA papers now, IMO you should talk about ones that actually end up influencing real technology. For example, Jaleel's 2010 ISCA paper on DRRIP apparently ended up getting implemented in Intel's Ivy Bridge core. I won't link the blog that proves it, but a search for "ivy bridge drrip" should yield the relevant post. Other than that, there are a dozen cache papers each year that claim to get ~15% better performance while saving energy, so you might end up being very busy reporting on this kind of stuff.

Agreed.

Plus, getting a manufacturer to change their cache is like getting the government to admit it was wrong. It can be done, but I wouldn't want to be the one doing it.

We have a bad enough time trying to keep threads from locking each other up, etc. You're going to have to add more complexity into the compiler to keep track of memory access and/or have a watchdog routine in the cache electronics to keep track of when two threads are synchronizing the same data, etc. I'm sure the work here is interesting but I think there are other problems that will need to be solved before this can go mainstream.

Have you ever had to deal with the cache directly when you coded? Probably not. This is probably working at the hardware level and transparent to the programmer. A caching scheme that requires the programmer's intervention is probably not all that useful.

I did mention the compiler needs to be aware of the technology and handle it there (or possibly in the cache logic itself). There are lots and lots of things my compiler takes care of for me that I don't think about.

Having the ability to give the compiler hints as to the best strategy would be handy.

Have you ever had to deal with the cache directly when you coded? Probably not. This is probably working at the hardware level and transparent to the programmer. A caching scheme that requires the programmer's intervention is probably not all that useful.

If you know you will want more performance in a program before you write it, it can make a lot of sense to design it such that it will mostly run from the cache where it can (server processors have much more cache, which accounts for most of their performance and cost increase over the consumer lines.)

Many times a program will have a run time determined almost entirely by memory access times, so multithreading will not help, but optimizing cache access will.
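A common way to do that, sketched below in C, is to block (tile) a computation so each chunk of the working set fits in cache before moving on. The matrix size and tile size here are arbitrary guesses that would need tuning per machine:

    /* Blocked (tiled) matrix multiply: work on B-by-B tiles so the data
     * being reused stays cache-resident. Assumes C is zeroed by the
     * caller and that N is a multiple of B. Illustrative, not tuned. */
    #define N 512
    #define B 64    /* tile size; assumed to fit comfortably in L1/L2 */

    void matmul_blocked(const double A[N][N], const double Bm[N][N],
                        double C[N][N])
    {
        for (int ii = 0; ii < N; ii += B)
            for (int kk = 0; kk < N; kk += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++)
                            for (int j = jj; j < jj + B; j++)
                                C[i][j] += A[i][k] * Bm[k][j];
    }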

There are assembly instructions which are used to load from memory into the cache (or remove something from the cache explicitly.)

Most C compilers allow you to do this with a compiler intrinsic; in the case of MSVC this is _mm_prefetch().
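For example (a rough sketch; the prefetch distance of 16 elements is just a guess that would need tuning):

    #include <stddef.h>
    #include <xmmintrin.h>   /* _mm_prefetch / _MM_HINT_T0 (MSVC, GCC, Clang) */

    /* Ask for data a few iterations ahead so it is (hopefully) already
     * in cache by the time the loop reaches it. */
    double sum_with_prefetch(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }

Whether it actually helps depends on the chip; a hardware prefetcher will often catch a simple sequential stream like this on its own.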

Not all programmers will write programs which require an awareness of how the hardware operates, but some do, and some tasks benefit from having that kind of control.

Given that AMD and Intel use an inclusive cache, wouldn't that mean you end up with potentially 2 copies of each dataset for each core? That means if you have a 4 core CPU, you have 8 copies of the data in your overall cache...

If you're reporting on ISCA papers now, IMO you should talk about ones that actually end up influencing real technology. For example, Jaleel's 2010 ISCA paper on DRRIP apparently ended up getting implemented in Intel's Ivy Bridge core. I won't link the blog that proves it, but a search for "ivy bridge drrip" should yield the relevant post. Other than that, there are a dozen cache papers each year that claim to get ~15% better performance while saving energy, so you might end up being very busy reporting on this kind of stuff.

Very much this. This is one of the first ISCA papers I've ever seen written up here on Ars, and while it's nice to see some coverage of micro-architecture research, this paper selection seems quite arbitrary. There are a ton of interesting papers to cover from ISCA and MICRO every year; most of them end up going nowhere in industry for various reasons. If you are going to report on micro-architecture papers, at the very least focus on the top-picks papers, and even then it's probably only really worth reporting on things that get used by industry.

Given that AMD and Intel use an inclusive cache, wouldn't that mean you end up with potentially 2 copies of each dataset for each core? That means if you have a 4 core CPU, you have 8 copies of the data in your overall cache...

As long as they are all just reading, that can be a good thing. If you look at the Intel Xeon E5 design, the L3 cache isn't some uniform monster block of cache memory. It is sectioned. Each pair of x86 cores also has an L3 cache component in that 'layer' of the assembled collection. So if they can manage to drop the "core local duplicate" into the local L3, they can avoid a trip on the internal bus to get to it.

In short, the interface presented of being just a single inclusive pile isn't as important as the real locality that is actually present. With a very large LLC and a high number of cores, all the cores can't be close to all the LLC subsections.

If the cores are all just reading, then creating duplicates effectively just gives the impact of having larger local caches. So if the L3 is the LLC, then it's just bigger L2s all around. That is going to help with performance. If done slickly, it doesn't have to be a uniform increase. If cores 1 and 2 just need 1MB more and cores 3 and 6 need 4MB more, then you can possibly carve up the L3 into segments that make most of them a bit more "happy".

Once they start writing, though, they can toss all but one of the duplicates and all just share the one copy. The bad part about all the duplicates would come when they start to do modification. If it is a communal, mutating piece of data, then keeping just one copy cuts way down on synchronization overhead.

Quote:

As above, surely this is an exclusive cache issue? If you're working with an inclusive cache, you're duplicating your local cache in the shared cache/LLC already so does this really do anything? The private cache is just for faster access to a subset of data already in the LLC.

Go back to the above example where the cores have different-sized working sets of this "wish I could double, triple, etc. my L2 cache" size. Some cores are going to be evicting the expanded set that some other cores want. Tossing some might be OK, but at some point you start to "rob Peter to pay Paul", when it would be better for overall system throughput if "Paul" just started going to memory. Purely shared data is not 100 percent of what is in the LLC. Some of it is private. What is and isn't changes dynamically as the workload varies.

To some extent this all looks like what a file system cache does with a fixed cache buffer: a bit of juggling of a common resource when there are both shared (multiple requests for the same file) and more unique requests. Map it onto a NUMA box where the file cache is non-uniform, and you get very similar issues of duplication versus a single copy (with much lower synch overhead).

Speaking of processors, is there any news, or any estimates out yet, on when graphene is coming? It's one of the few things I think of weekly and can't freaking wait for, especially now with the slow-as-a-snail performance progression. I don't want to become a gramps before it finally hits consumer-grade products. Just thinking of the potential performance increases makes me giddy as hell.

The next step will be hyper-weaving - moving a thread between cores to chase whichever one has the best cache of the data you need. If a thread needs more than 1 core's worth of local cache and there's a core to spare, bouncing the execution context back and forth should be cheaper than juggling the whole working set.

Something similar exists for data locality in a NUMA system: execute closest to where the data resides to avoid SMP link traffic. I don't think such a technique would be ideal at the cache level, at least on current architectures. For serial operations, one could easily pipeline these operations to a core where the data already resides in cache. For parallel tasks that require the same data, executing them on the same socket node would show the benefits of the LLC, as one thread would essentially preload it for another.

Trying to take advantage of warmed L1/L2 caches doesn't seem efficient given the context switch that would be invoked to move a thread to another core. At best a technique like SMT (Hyperthreading in Intel marketing speak) could show a benefit here, as the L1 data is available to two concurrently running threads. The catch with SMT here is that the processor core would optimally have enough execution resources to fully feed two threads. Without enough execution resources, it is not a clear-cut scenario that running the code on a single core with execution unit starvation + warm caches would still be more advantageous than two separate cores + a cold cache.

The next step will be hyper-weaving - moving a thread between cores to chase whichever one has the best cache of the data you need. If a thread needs more than 1 core's worth of local cache and there's a core to spare, bouncing the execution context back and forth should be cheaper than juggling the whole working set.

Why does the word "entangling" suddenly spring to mind?

And just wait until quantum computing goes mainstream - it will be a marketing department's dream: New Xeon Leap processors with Hyper Quantum Entangled Weaving!

We have a bad enough time trying to keep threads from locking each other up, etc. You're going to have to add more complexity into the compiler to keep track of memory access and/or have a watchdog routine in the cache electronics to keep track of when two threads are synchronizing the same data, etc. I'm sure the work here is interesting but I think there are other problems that will need to be solved before this can go mainstream.

Have you ever had to deal with the cache directly when you coded? Probably not. This is probably working at the hardware level and transparent to the programmer. A caching scheme that requires the programmer's intervention is probably not all that useful.

Certain architectures do allow a bit of flexibility in what gets put into cache. The PowerPC cores used in the previous generation of consoles had instructions to prevent data from being cached when loaded. This was intended for data streaming where the loaded data would be quickly discarded. This improved performance by lowering the amount of thrashing in the caches by (indirectly) keeping the frequently used data there.
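The rough x86 analogue (not the PowerPC instructions being described) is the non-temporal store hint. A minimal sketch, assuming 16-byte-aligned buffers whose size is a multiple of 16 bytes:

    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_load_si128 */

    /* Copy a buffer with non-temporal stores so the destination data
     * bypasses the caches instead of evicting the hot working set. */
    void stream_copy(void *dst, const void *src, size_t bytes)
    {
        __m128i       *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();   /* make the streamed stores globally visible */
    }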

.....There are assembly instructions which are used to load from memory into the cache (or remove something from the cache explicitly.)

Most C compilers allow you to do this with a compiler intrinsic; in the case of MSVC this is _mm_prefetch().

Not all programmers will write programs which require an awareness of how the hardware operates, but some do, and some tasks benefit from having that kind of control.

Unless the whole system is consumed to an extremely high percentage by a single application, that kind of "control" is an illusion in most real-world workloads. There are multiple programs running. Each program has its own idea of how the cache and memory hierarchy should bend to its will. The hardware's job is to juggle those sometimes conflicting demands. Compilers don't have knowledge of the dynamic runtime environment the code will execute in. They, like the programmers, can assume some things, but that doesn't necessarily mean it is true.

Assumptions like "the minimum L2 cache size is XXX, so only stuff 70% of XXX into some tight loop" will generally work OK. If there is a larger L2, or part of the L2 is occupied by one of the 100 other threads instantiated, then the actual fit is much more likely to be closer than planned.

Similarly, you can use hints that may impact L2/L3 to help keep an L1 fed, but as you get closer to the edge of the hierarchy you will lose that. An application can drag some data into cache only to lose its time slice and have the data evicted by another thread.

It is more the "get in and very quickly get out" drivers, kernel subsystems, and small app kernels that have a high use for those kinds of low-level directives. "Programs" in the general sense, not so much, unless they are being run pragmatically in a "one active app at a time" mode.

For tasks where you care about performance quite a bit or have a defined response time, you will also probably want to give the program near exclusive use of the CPU (or at least know what else is running.)

Even in multitasking environments you will usually be able to retain most of the cache for your own use if everything else is a background task. Usually only one CPU intensive program will be run at a time.

If a significant number of programs are competing for system resources it will affect performance for all of them. This is well known, and most people will run resource intensive programs one at a time (or schedule them serially.)

There is really no way around that, but this kind of optimization tends to focus mostly on keeping the amount and size of memory access down (and hiding latency), so it tends to help even if you have less cache available than expected.

It's also solving the least interesting problem today. The million dollar question is cache coherence protocols that scale to more than a few dozen cores.

There's quite a bit of money to be made on lower single-threaded latency still. "Getting the first result" is important, and for instance, Intel Turbo Boost capable chips can run a small number of cores at above the rated speed.

The idea is decent. My first thought would be that the reason the L1 is currently split is probably the complexity of the circuitry that would be necessary to share the L1.

Getting lower energy usage is a good sign, as that probably means their logic is not overly complex, or they removed enough synchronization logic to make it worth it anyway (although this is a bit out of my field.)

If it mostly ends up sharing the L1 that could be a decent improvement on its own due to a larger effective size, and less transfer to or from main memory as a consequence.