On 01/11/2013 05:22 AM, Vincent Diepeveen wrote:
>> Bill - a 2 socket system doesn't deliver 512GB ram.
>On 01/11/2013 05:59 AM, Reuti wrote:
> Maybe I get it wrong, but I was checking these machines recently:
> IBM's x3550 M4 goes up to 768 GB with 2 CPUs http://public.dhe.ibm.com/common/ssi/ecm/en/xsd03131usen/XSD03131USEN.PDF
> IBM's x3950 X5 goes up to 3 TB with their MAX-5 extension using 4 CPUs, so I assume 1.5 TB with 2 CPUs could work too http://public.dhe.ibm.com/common/ssi/ecm/en/xsd03054usen/XSD03054USEN.PDF
There's plenty of others as well. Motherboards with 16 dimm slots that
support 32GB dimms are pretty common. Supermicro resellers often will
sell a configuration that supports 512GB ram in a dual socket system.
However, it's much cheaper (around half the price) to buy a quad
socket system with 512GB ram; those look like they start at around $8k.
I updated my memory latency benchmark, and the inner loop is:
    while (p != 0)
    {
        p = a[p];
        cnt++;
    }
My benchmark tests latency by:
1) allocating 400GB (1GB = 2^30 bytes) of 64 bit ints.
2) shuffling them with the Knuth shuffle, using drand48 for randomness.
3) visiting 1 int per cacheline (3.3B or so visits).
4) completing 3,355,443,200 reads in 363.08 seconds (108ns per hop).
The goal being to make it impossible for prefetch or caches to make the
main memory latency look lower than it actually is.
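
For anyone who wants to try the idea on a smaller box, here is a minimal sketch of a pointer-chasing test in the same spirit. It is not the actual benchmark source: the names build_chain, chase, ns_per_hop and STRIDE are made up for illustration, the way the chain is linked into a single cycle is an assumption, and the 128-byte stride is only inferred from 400GB / 3,355,443,200 reads.

```c
#define _XOPEN_SOURCE 600   /* for drand48() and clock_gettime() */
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Assumed stride: 16 x 8-byte ints = 128 bytes between visited slots. */
#define STRIDE 16

/* Build one random cycle through `slots` slots of a[], starting and
 * ending at index 0. Returns NULL on bad input or allocation failure. */
uint64_t *build_chain(size_t slots, long seed)
{
    if (slots == 0)
        return NULL;
    uint64_t *a = calloc(slots * STRIDE, sizeof *a);
    size_t *order = malloc(slots * sizeof *order);
    if (!a || !order) { free(a); free(order); return NULL; }

    for (size_t i = 0; i < slots; i++)
        order[i] = i;

    /* Knuth (Fisher-Yates) shuffle of slots 1..slots-1; slot 0 stays first. */
    srand48(seed);
    for (size_t i = slots - 1; i > 1; i--) {
        size_t j = 1 + (size_t)(drand48() * (double)i);   /* j in [1, i] */
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }

    /* Each visited slot stores the array index of the next slot to visit. */
    for (size_t i = 0; i + 1 < slots; i++)
        a[order[i] * STRIDE] = order[i + 1] * STRIDE;
    a[order[slots - 1] * STRIDE] = 0;   /* last hop returns to index 0 */

    free(order);
    return a;
}

/* Same inner loop as above: every load depends on the previous one. */
size_t chase(const uint64_t *a)
{
    size_t cnt = 1;          /* counts the first read of a[0] */
    uint64_t p = a[0];
    while (p != 0) {
        p = a[p];
        cnt++;
    }
    return cnt;
}

/* Time one full traversal and return the average nanoseconds per hop. */
double ns_per_hop(const uint64_t *a, size_t *hops)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *hops = chase(a);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)*hops;
}
```

Linking the shuffled slots into one cycle guarantees the loop touches every slot exactly once before p returns to 0, so the hop count equals the slot count. The array has to be much larger than the last-level cache (and ideally larger than the TLB reach) before the number means anything.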
About 40ns of that latency is due to the constant TLB misses involved in
randomly accessing 400GB. The throughput is pretty low because you are
leaving 15 of 16 memory channels idle at any time.
However if it's acceptable to split the 400GB into chunks so they can be
simultaneously read by multiple processes/threads you can do substantially
better. With 64 cores running flat out, doing the same job, the per
thread latency rises to 199ns. But since you are keeping all 16
channels busy you end up with a cache line lookup every 3.1 ns or so.
Try that with a RAID of SSDs ;-).
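
The arithmetic behind the numbers above can be checked with a few lines of C. This is only a sketch; the helper names are hypothetical, and the inputs are the figures quoted in this message.

```c
#include <stdint.h>

/* Average latency of a serial chase: total time divided by dependent reads. */
double serial_ns_per_hop(double seconds, double reads)
{
    return seconds * 1e9 / reads;
}

/* With `threads` independent chases in flight, one lookup completes
 * on average every (per-thread latency / threads) nanoseconds. */
double aggregate_ns_per_lookup(double per_thread_ns, int threads)
{
    return per_thread_ns / (double)threads;
}

/* Bytes of array per visited int, taking 1GB = 2^30 bytes. */
uint64_t bytes_per_visit(uint64_t gigabytes, uint64_t reads)
{
    return (gigabytes << 30) / reads;
}
```

Plugging in 363.08 seconds and 3,355,443,200 reads gives about 108.2ns per hop; 199ns across 64 threads gives about 3.1ns per lookup; and 400GB over 3,355,443,200 reads works out to 128 bytes per visited int.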