Memory Shifts Coming, Says Keynoter

SANTA CLARA, Calif. — New kinds of memory interfaces, memory chips, and processors are coming that will offer more performance and new capabilities for engineers who adopt them, Thomas Pawlowski said in a keynote at DesignCon. "Embrace the change. Don't resist it."

The chief technologist at Micron Technology sketched out the general trend toward abstracted memory interfaces and new kinds of memory chips. He gave a more detailed description of a promising processor architecture that Micron has in development. The Hybrid Memory Cube (shown below) "shows the shape of things to come" in abstracted memory interfaces. HMC is a dense stack of memory die in a package that breaks through memory-bandwidth limits with an interface that delivers 160 Gbyte/s.

The HMC interface "exposes nothing -- just a SerDes interface with a simple command set. All the details are not your problem." Such highly optimized interfaces built into DRAMs are the wave of the future. They will replace lowest-common-denominator standards hammered out between processor and memory vendors in committees that "left performance on the table for decades."

In an interview after the keynote, Pawlowski said the Jedec standards group has "nothing in the pipeline" after its DDR4 high-end DRAM interface. However, it is developing a family of low-power DDR interfaces, as well as Wide I/O, an interface for attaching a memory chip directly to a processor.

Micron's process technology experts have expressed "wild disagreement" about when a DRAM replacement will be needed. "The earliest points to 2015, and the latest points to far enough out you could call it never."

It seems that inside Micron there are those who want DRAM forever, those who want MRAM, those who want PCM, those who want RRAM, and those who want Flash...

In real-life systems architecture, I think every system deserves its own dedicated architecture.

As an engineer, I'd love to do it right from the bottom up. The reality is that drastic changes aren't possible. Look at how long it took us to get multi-threaded CPUs fully supported. First, the CPU guys had to implement it. It took a long time after that before the compiler, OS, and application folks figured out how to take advantage of it. This is one reason the transputer never really got out of academia: nobody knew how to program it. Maybe now, with GPGPU architectures being embraced by the HPC folks, the time of the transputer has come - provided that somebody takes the time to generate a robust library of commonly used functions.

But I would prefer to junk PCIe, which I frankly think is an abomination as an interconnect!

Junking PCIe has the same problem I cited above: it is everywhere, and people know how to use it. Having said that, I would love it if I didn't have to pay certain IP vendors a small fortune to use their PCIe cores.

Hmmm. Now you have me thinking about this with a new perspective. First of all, the FPGA based systems can definitely take advantage of this. I've designed a DDR interface for an FPGA and it is not only a pain in the butt, it also wastes the bandwidth capability of the DRAM. By using the HMC, very few pins are needed and the latency is not a problem. Fan-out to logic that can inhale the data at full bandwidth could be a problem but it is easily solved with wide internal buses. Then the memory can be shared amongst all of the hardware accelerators and embedded processors...

Hello Xilinx and Altera - can you please build me a big FPGA in a smaller package? With PCIe and HMC, I don't need all of those pins!

My other thought about an application of the HMC is an array of small, low-power, lower-frequency processors (remember the transputer?). When scaled out, this could provide a lot more compute power per square inch than the monster heater CPUs we use today.

OK - maybe I'm not as skeptical now. Even though it is still a bad fit for conventional CPUs, it might be a good fit for compute intensive workloads that can be parallelized.

I still think that a comm application with built-in packet inspection/routing/etc. would be a great place to start. The array of lightweight processors or FPGAs might even be the right infrastructure for this.

Even if the memory cube is directly attached to the CPU (which is a very bad idea from a manufacturing yield perspective), the latency will be higher. To access a DRAM, you need to provide the row and column addresses, and a few nanoseconds later a cache line is available. To use a serial interface, you need to create a command packet that says "read starting at this address and give me so many bytes." That command packet then needs to be serialized and sent to the memory cube controller, where it has to be de-serialized and interpreted. If the command is not for that memory cube, it has to be passed along the chain to another cube. If it IS for that memory cube, the DRAM has to be read (same row/column read cycle, but at a higher frequency). The data needs to be read into a buffer, then a response packet needs to be generated, serialized, and finally sent to the CPU. Whichever CPU thread was trying to do the read has had to twiddle its proverbial thumbs this whole time while waiting for the cache fill to complete. This takes a few nanoseconds with DDR and will take 10s or 100s of nanoseconds with a memory cube.

That should drag just about any high performance CPU to its knees. If the idea is good enough, the CPU makers might be willing to reinvent the whole multi-thread, cache, and memory management infrastructure, but I kind of doubt it :-).

Like I hinted in my earlier post, this may make a great main memory as long as there is a very large low latency RAM between it and the CPU (4th level cache) - and the cache hit rate of the 4th level cache is VERY high...

It seems that everyone is ignoring the fact that the memory cube will have significantly higher latency than DDR4. An RMW will stall the CPU for eons. This means that it cannot be used by a CPU as the main memory attached to the cache. It essentially brings a new tier into the memory hierarchy. It seems like a great idea that will bring much higher overall memory bandwidth, but the critical latency to the CPU is not solved.

Maybe the local DRAM will become a 4th level cache. Maybe someday the DRAM will be displaced by MRAM. In any case, I cannot see the DDR interface being simply replaced with a bunch of serial links.

It seems like the first niche for the memory cube would be in comm, where latency is not as big a deal and throughput is king... You could make an amazing switch with such a device.