Latency and HPC Workloads

Michael S (already5chosen.delete@this.yahoo.com) on October 8, 2012 5:12 pm wrote:
> > Sure, but I was talking about the higher end of the performance scale, not
> > commodity systems. Something like the big POWER systems,
>
> Big POWER systems use fully-buffered memories. So, latency wouldn't be as good as
> commodity boxen regardless of interface you use between buffer and memory device.
> Besides, I don't know where they do the scheduling: in the controller, i.e. in the
> Power chip, or in the buffer itself. If the former, then you face the same problem
> as with Intel/AMD.

The POWER solution to this problem is to put an enormous L3 cache on the processor: 80 MB for 8 cores on POWER7+. That is essentially where we are with low-latency memories. It is cost-prohibitive to build an entire system memory out of expensive low-latency memory, and as capacity goes up, latency goes up with it. The alternative is to pair a small low-latency memory, used as a cache, with a larger, slower memory that fills it.
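To put rough numbers on that trade-off, here is a minimal sketch of the usual average-access-time arithmetic for a small fast memory in front of a large slow one. The function name and both latencies are made up for illustration; they are not measurements of POWER7+ or any other machine.

```python
def average_access_time(hit_rate, fast_ns, slow_ns):
    """Average latency when a small fast memory fronts a large slow one."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Every access pays the fast lookup; misses additionally pay
    # the trip to the slow memory.
    return fast_ns + (1.0 - hit_rate) * slow_ns


# Hypothetical latencies, chosen only to show the shape of the curve.
FAST_NS = 10.0    # assumed latency of the small on-chip memory
SLOW_NS = 100.0   # assumed latency of the large buffered main memory

for hit_rate in (0.50, 0.90, 0.99):
    avg = average_access_time(hit_rate, FAST_NS, SLOW_NS)
    print(f"hit rate {hit_rate:.2f}: {avg:.1f} ns average")
```

The point is only that the average is dominated by the miss term, so under these assumptions a big cache pays off to the extent that the working set actually fits in it.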

IBM uses that small, fast memory as a hardware-controlled cache on the POWER7. So do the most recent generations of vector machines from Cray and NEC. IBM's Cell, though a little long in the tooth now, was cool in that it gave the programmer control over that small, fast memory.

This affects bandwidth too. As core counts go up, it's not possible to put enough pins on a chip to maintain the bandwidth per core that we're used to. The solution is to put more of that memory on package, or even on chip. I think we're going to see another level in the memory hierarchy, for bandwidth reasons if not for latency. Hopefully latency comes along for the ride.
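The bandwidth-per-core squeeze is simple division, sketched below. The off-chip bandwidth figure is hypothetical and held fixed to stand in for a pin-limited package; the function name is mine, not from any library.

```python
def bandwidth_per_core(total_gb_s, cores):
    """Off-chip bandwidth each core gets if it is shared evenly."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_gb_s / cores


# Hypothetical pin-limited off-chip bandwidth, assumed constant as cores are added.
OFF_CHIP_GB_S = 100.0

for cores in (4, 8, 16, 32):
    share = bandwidth_per_core(OFF_CHIP_GB_S, cores)
    print(f"{cores:2d} cores: {share:.2f} GB/s per core")
```

Under that assumption, every doubling of the core count halves each core's share, which is the gap an on-package or on-chip memory level would have to fill.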