Posted
by
samzenpus
on Thursday September 13, 2012 @12:58AM
from the coming-soon dept.

Nerval's Lobster writes "The Texas Advanced Computing Center plans to go live on January 7 with "Stampede," a ten-petaflop supercomputer predicted to be the most powerful Intel supercomputer in the world once it launches. Stampede should also be among the top five supercomputers in the TOP500 list when it goes live, Jay Boisseau, TACC's director, said at the Intel Developer Forum Sept. 11. Stampede was announced a bit more than two years ago. Specs include 272 terabytes of total memory and 14 petabytes of disk storage. TACC said the compute nodes would include "several thousand" Dell Stallion servers, with each server boasting dual 8-core Intel E5-2680 processors and 32 gigabytes of memory. In addition, TACC will include a special pre-release version of the Intel MIC, or "Knights Bridge" architecture, which has been formally branded as Xeon Phi. Interestingly, the thousands of Xeon compute nodes should generate just 2 teraflops worth of performance, with the remaining 8 generated by the Xeon Phi chips, which provide highly parallelized computational power for specialized workloads."

I wonder why it's got such little memory? You can easily run 64GB per socket at full speed with the E5-2600 (16GB x 4 channels) without spending that much money. Heck for maybe 10% more you can run 128GB per socket (You need RDIMM's to run two 16GB modules per bank). They're apparently only running one 16GB DIMM per socket (any other configuration would be slower on the E5) which IMHO is crazy as you're going to have a hard time keeping 8 cores busy with such a small amount.

Ooops, scratch that miss-read the summary. There probably is n't a need for that much memory because the kind of problems they are most likely to be dealing with will have massive datasets that don't fit in memory anyway. The limiting factory will be CPU and node interconnect bandwidth so adding extra memory wont make much if any difference to performance.

Uhh..haven't looked at the new AMD monster racks [theregister.co.uk] designed to function as a single unit thanks to the tech they got from the SeaMicro purchase. We are talking 2048 cores and 16Tb of RAM in a single rack, with THAT much space if you got the cash you should be able to fit your large dataset into RAM and run it from there.

Man I don't want to even think about what the electric bill for something THAT powerful would be though.

You can easily run 64GB per socket at full speed with the E5-2600 (16GB x 4 channels) without spending that much money. Heck for maybe 10% more you can run 128GB per socket (You need RDIMM's to run two 16GB modules per bank).

I am guessing it might have something to do with budgetFrom the way I look at it, they are populating each memory slot with 4GB of el-cheapo DDR3 DRAM and that way they may be saving quite a bit of $$$ to buy more Dell servers

Reliability and power.. more memory is more chances for a node to have a failure, ram gets hot, and so it will need to be cooled. And my bet is that it is ECC-DDR3, as el-cheapo isn't remotely worth the price in these applications.

Ah, another esoteric, nearly impossible to program for architecture that flies for some problem sets but is nearly unapproachable for the non-CS science folks. I mean it's great that such things exist I guess, and in theory they can have great FLOPS/watt figures, but I wonder how much science will really get accomplished per dollar spent compared to something where standard code just runs?

Esoteric? Nearly impossible to program for? Methinks you haven't read through the actual docs for it. You can use all the standard Intel tools to program for it, which are also MIC-aware, just like you program for a standard multi-core CPU. That includes the threading and math kernel libraries, as well as OpenCL if you want to go that route.

According to the architecture each core does n't have it's own memory other than the L2 & L1 caches. How the memory is mapped per core is arbitary, there is nothing stopping you from having for exampe a shared data set using 4GB and using 64MB per core for a raytracer where the scene data is stored in the shared memory and each core works on part of scene. So no, you don't have to limit memory per thread to fully use the architecture properly.

On a card with 8GB your effective memory accessible per core is 8GB, lots of problems have large data sets that can be shared over cores such as the example I gave. In fact this is a major advantage of MIC versus GPUs.

There is nothing nighmarish of the above, it would appear just a shared memory area to the process.

You will be parallelizing, and each thread will only ever be able to use max_mem/N for its own processing.When you parallelize, you avoid sharing memory between threads. Your data set is split over the threads and synchronization is minimized. In a SMP/NUMA model, this is done transparently by simply avoiding to access memory that other threads are working on. In other models, you have to explicitly send the chunk of memory that each thread will be working on (through DMA, the network, an in-memory FIFO or whatever), but it doesn't change anything from a conceptual point of view.

If your parallel decomposition is much more efficient if your data per thread is larger than 1GB, then you cannot possibly run 64 threads set up like this on the MIC platform. There is often a minimum size required for a parallel primitive to be efficient, and if that minimum size is greater than max_mem/N then you have a problem. This is the limiting factor I'm talking about.128 MB, however, is IMO quite large enough.

In fact this is a major advantage of MIC versus GPUs.

The advantage of MIC lies in ease of programming thanks to compatibility with existing tools and the more flexible programming model.Memory on GPUs is global as well, so I have no idea what you're talking about. There is also so-called "shared" memory (CUDA terminology, OpenCL is different) which is per block, but that's just some local scratch memory shared by a group of threads.

There is nothing nighmarish of the above

Please stop deforming what I'm saying. What is nightmarish is finding the optimal work distribution and scheduling of a heterogeneous or irregular system.Platforms like GPUs are only fit for regular problems. Most HPC applications written using OpenMP or MPI are regular as well. Whether the MIC will be able to enable good scalability of irregular problems remains to be seen, but the first applications will definitely be regular ones.

You will be parallelizing, and each thread will only ever be able to use max_mem/N for its own processing.When you parallelize, you avoid sharing memory between threads. Your data set is split over the threads and synchronization is minimized. In a SMP/NUMA model, this is done transparently by simply avoiding to access memory that other threads are working on. In other models, you have to explicitly send the chunk of memory that each thread will be working on (through DMA, the network, an in-memory FIFO or whatever), but it doesn't change anything from a conceptual point of view.

If your parallel decomposition is much more efficient if your data per thread is larger than 1GB, then you cannot possibly run 64 threads set up like this on the MIC platform. There is often a minimum size required for a parallel primitive to be efficient, and if that minimum size is greater than max_mem/N then you have a problem. This is the limiting factor I'm talking about.128 MB, however, is IMO quite large enough.

For algorithms where you have a basically regular streaming data then yes, your working data set will be mem/n but as I mentioned there are a number of problems where you have a large mainly static dataset such as raytracing or financial modeling. In these scenarios being able to access a large shared pool of memory has big advantages.

In fact this is a major advantage of MIC versus GPUs.

The advantage of MIC lies in ease of programming thanks to compatibility with existing tools and the more flexible programming model.Memory on GPUs is global as well, so I have no idea what you're talking about. There is also so-called "shared" memory (CUDA terminology, OpenCL is different) which is per block, but that's just some local scratch memory shared by a group of threads.

Accessing global memory on GPUs is extremely slow and there is a strict memory heirarchy that you have to adhere to in order to get any kind of performance. This heirarchy i

Accessing global memory on GPUs is extremely slow and there is a strict memory heirarchy that you have to adhere to in order to get any kind of performance.

It could be seen as being the same as the CPU, except will automatically cache it to fast memory for you.

Any problem where you need random access over a large amount of data is just not feasible on GPUs.

What makes you think it would be faster with the MIC?

Granted it is the hardware doing the work, but you have four threads per core on Knights Corner with a cacheline miss causing a context switch masking the latencies. You also do n't have the extra overhead of your code setting up the copies between global and shared memory (which is limited to 48K on CUDA) everytime you want to access a data structure. Obviously you have many more cores on a GPU but how much performance do you think you will get once you have to jump through all the hoops and basically

Each core has 320KB of local cache, but the 8 all share 20MB of L3 cache, so for most efficient use you either want each core to have a working set (including code) of under 2.5MB, or for them to be accessing 256KB of independent data from a shared set of 20MB. That's not quite the whole story, because there is no cache coherency problem if two cores are reading the same data.

That's 2GB per core, a fine amount for supercomputer problems requiring compute density and bandwidth. No virtualization there and the compilers, middleware and programmers are probably sufficiently educated to know how to split the problem.

The Cynic in me says that you don't get to into the Top 5 by spending all of your budget on memory.:)

Practically speaking there are a lot of research codes out there that are using 1GB or less of memory per core. Our systems at MSI typically had somewhere between 2-3GB of memory per core and often were only using half of their memory or less. There's a good chance that TACC has looked at the kinds of computations that would happen on the machine and determined that they don't need more.

"Petaflops" is not representative of the power of modern supercomputers, many of which use massively parallel integer processing to perform their duties. Sure, you can say that simulating floating point operations with the integer units amounts to the same thing, but it actually doesn't. We have discovered that there are a great many real-world problems for which parallel integer math works just fine, or even better (more efficient) than floating point. And for those, flops is a completely meaningless metric.

Flops had always been a useless metric. If you want good metrics, look at the instruction reference with the speed in cycles of each instruction, its latency, its pipelining capabilities, the processor frequency, and cross it all with the number of cores and memory and cache interconnect specifications.

Flops are just a number that give a value for a single dumb computation in the ideal case ; real computations can be up to 100 times slower than that.

Well, as far as achievable computation, that's why Linkpack reports Rmax and Rpeak, however the one big area where Linpack is lacking as a measurement stick for many real workloads is its small communications overhead, it's much easier to achieve high utilization on Linpack then it is for many other workloads.

Most modern supercomputers get their "flop" count from SSE3/4 and/or GPUs which are not integer, but Floating point processing machines(at least 32-bit single precision fp, but also double precision albeit at a slower rate). These machines most certainly do NOT simulate floating point with their integer units (nor cheat by calling an integer op as an approximate fp op), and they have massive amounts of dedicated hardware SIMD FP processing units to do their he

"These machines most certainly do NOT simulate floating point with their integer units (nor cheat by calling an integer op as an approximate fp op), and they have massive amounts of dedicated hardware SIMD FP processing units to do their heavy lifting."

I did not say that they did. I said that you could consider integer units that emulated fp hardware to be doing flops. I did not state that this is the usual case.

"... they are rated by the number of IEEE FP operations..."

I KNOW that... my point is that it probably is not an appropriate rating these days. Not representative of many real-world problems.

"The integer OPs currently don't count in the current ratings and I don't see that changing any time soon."

I said that you could consider integer units that emulated fp hardware to be doing flops.

1. I don't see any supercomputers emulating fp hardware on integer units...2. Even if they did (which they don't), it would be so slow that it would be a rounding error in their flop rating.

As I (and others have posted), although there are some interesting integer problems, existing supercomputers issue those instructions to processing units that are essentially the same speed as FP units, so there's not much difference between a FLOP and the I

...sort of. And whoever rated the parent "insightful" apparently has little insight into HPC and supercomputing. Interesting might have been appropriate.

First off, any metric which yields a single number is bound to be misleading as it is easy to find two applications a and b where a runs faster than b on machine 1 and slower than b on machine 2. Bit since we want such a simple metric, we might just as well settle for the one we already have. Why flops? Because applications use them. I know that the calls o

"First off, any metric which yields a single number is bound to be misleading as it is easy to find two applications a and b where a runs faster than b on machine 1 and slower than b on machine 2."

Part of the point I was making.

Sure, but apparently people still want such a metric, however imperfect and misleading it may be.

"it's much easier to prove numerical stability with floating point numbers"

False.

How eloquent. Simple example: I have two numbers a and b, approximately of the same size, with their LSB being tainted by rounding errors. If I add them on a floating point machine the last two bits of the mantissa will be tainted, but because during normalization the mantissa will be cut we end up with again just the LSB beint tainted.

On a fixed point machine however adding both numbers will either result in

"On a fixed point machine however adding both numbers will either result in an overflow or in two bits being tainted. And so on. Care to disprove me? "

I would not attempt to try to "disprove" you on Slashdot. But I will argue with some of your assumptions.

First, you can do "floating point" math using scaled integers of a size to represent the number of decimal points you desire. But the integer math is not subject to either the speed limitations, or the bugs that have been not just known but fairly common in fp hardware. Sure, you still get rounding errors, but you get those anyway. But they aren't the only kind of errors that occur with floating-point

Floating-point arithmetic in processors is fraught with errors (such as rounding errors) and has quite often turned out to contain very significant bugs. Integer math simply does not have those problems. If you want "numerical stability", you need to stay away from floating-point in hardware.

Numerical stability is not the same as exactness, which you are referring to. Exactness is something we can never achieve (just think of irrational numbers: impossible to store all digits). So we have to clip numbers and resort to rounding, which introduces errors. When caring for numerical stability one usually tries to prove that the errors introduced by the imperfect representation of numbers in computers is less than an acceptable limit "foo". And these proofs are simpler for floating points arithmetics

I'm talking about fixed-point numbers, which are (almost) the same as integers in respect to the logic, but keep in mind the decimal point, which is is fixed (hence the name) to a certain place. Who in his right set of mind would propose to using pure integer arithmetic? You need a way to represent, say, 1.337.

It would be interesting to see how well flops and the Linpack score correlate across the members of the Top500. My guess would be that they correlate pretty well, just because flops is never used as a serious benchmark so nobody bothers gaming it. But I could certainly be wrong.

This reminds me of an old science fiction story. The designers, builders and programmers assemble. The Switch is flipped. The computer boots. The first question they ask is, "Is there a God?" The machine hums away for a few seconds, then arc welds the power switch open and responds, "There is now!"

I can just see on January 8th someone coming in and asking "So what have you invented yet?" So with 864,000 petaflops this is all that you have come up with?
IBM's latest supercomputer does 2 gigaflops per watt. I wonder if this one does better. I really think this is the future of computing since they are capable of doing computing at a far less electrical cost than the average home computer. So any problem that requires a series amount of computing would be sent to a supercomputer with the home comput

The chief science official of Texas has divined that the computational projects will be:1. Derive a proof that the universe is only 5000-6000 years old.2. Derive a proof that God is a silver haired, white man from Texas.