Good to see GPUs gaining traction outside of videogames, paving the way for their use as general-purpose devices that can benefit a wide variety of usage patterns beyond gaming :) Hopefully the profits from these will mean even better GPUs for us gamers down the line.

This is pretty awesome. I'm jealous you got to go. The comment about the thickness requirement of the cables for 480V compared to 208V in the first power delivery video is staggering. I'm surprised there's such a difference.

Some of the videos seem to be stopping early when I play them, and I have to skip ahead a bit to continue watching.

Voltage is 2.3 times higher, so current is 2.3 times lower for the same power. A wire 2.3x thinner (5.3x less cross-sectional area) will give the same power loss. Insulation thickness would be slightly higher because it's based on voltage, not current.

> The comment about the thickness requirement of the cables for 480V compared to 208V in the first power delivery video is staggering. I'm surprised there's such a difference.

V = Voltage, I = Current, R = Resistance, P = Power

P = V times I

So if you double the Voltage you halve the Current for the same amount of Power.

Power Loss (in the cables) is calculated as I squared times R. Since I is roughly halved at 480 Volts, the Power Loss is roughly 1/4 (1/2 squared) as much.

So they determined a fixed power loss in the cables and reduced the size (which increased the resistance) of the cables so that the thinner cables (at 480 volts) had the same loss as the thicker cables (at 208 volts).

A 480 Vrms circuit draws less than half the current of a 208 Vrms circuit at the same power level, so for the same I squared times R loss the resistance of the wire can be more than five times higher. Resistance of the wire is the resistivity of the copper material times the length divided by the cross-sectional area. This means the cross-sectional area can be more than five times smaller, and the diameter of the wire for 480 V can be less than half the diameter of the wire for 208 V.
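To make that arithmetic concrete, here's a minimal Python sketch of the wire-sizing argument. The delivered-power figure is an illustrative assumption, not a number from the article; only the 208 V / 480 V ratio matters for the result.

```python
# Wire sizing for 480 V vs 208 V at the same delivered power and the same I^2*R cable loss.
P = 50_000.0                  # delivered power per circuit in watts (assumed for illustration)
V_LOW, V_HIGH = 208.0, 480.0

i_low = P / V_LOW             # current at 208 V
i_high = P / V_HIGH           # current at 480 V, ~2.3x lower

# Loss = I^2 * R, so for equal loss the resistance may grow by (i_low / i_high)^2.
r_ratio = (i_low / i_high) ** 2          # ~5.3x higher resistance allowed
area_ratio = 1.0 / r_ratio               # R = rho * L / A, so area may shrink ~5.3x
diameter_ratio = area_ratio ** 0.5       # diameter scales with sqrt(area), ~2.3x smaller

print(f"current at 480 V:   {i_low / i_high:.2f}x lower")
print(f"allowed resistance:  {r_ratio:.2f}x higher")
print(f"cross-section:       {1 / area_ratio:.2f}x smaller")
print(f"diameter:            {1 / diameter_ratio:.2f}x smaller")
```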

This is an awesome article, Anand! I would love to see more super-computing coverage like this, and maybe some in-depth discussion of how super-computing works and differs from traditional computing architectures. Thanks for the great article!

I also registered just to say that this is a great article! One of the best I have seen on AnandTech; keep up the awesome work. Perhaps you can look into Adapteva's Parallella project next!

Yes, there is significant research going on. In our lab we had a pretty big group working on using FPGAs for HPC. The RC-based supercomputer is called Novo-G. It was the world's biggest publicly known RC supercomputer.

It is very small in physical size compared to some of the top conventional supercomputers, but for some specific compute requirements it comes close to beating top supercomputers. There was a major upgrade planned (around the time I was graduating) so it might be even better now. What exact types of computations? I don't remember very well (I didn't work on RC; I was mostly a s/w guy in the conventional HPC part of the lab), but you might be able to get some info by checking out a few posters or paper abstracts.

According to the paper, it takes 6 to 8 years for the #1 computer on the list to move to #500, and then another 8 to 10 years for that performance to be available in your average notebook computer. Not sure on notebook to smartphone, but it can't be very long.

Not saying it can't be 2688 CUDA cores, but you are using the high end of the range when the article clearly lists a range of 1.2-1.3 TFlops. I don't think you can just assume that it's 2688 without a confirmation given the range of values provided.

Great article. Fantastic way of showing us tiny PC users what really big stuff looks like. A data center is one thing, but my word this stuff is, is... well, that is Ultimate Computing Pr0n. For people who will never ever have a chance to visit one of the supercomputer centers it is quite something. Enjoyed that very much!

@Guspaz

If we get that kind of performance in phones then it is a really scary prospect. :D

We currently have 1-billion-transistor chips. We'd get from there to 128 trillion, or Titan-magnitude computers, after 17 iterations of Moore's Law, or about 25 years. If you go 25 years back, it's definitely enough of a gap that today's technology looks like flying cars to folks of olden times. So even if 128-trillion-transistor devices aren't exactly what happens, we'll have *something* plenty exciting on the other end.

*Something*, but that may or may not be huge computers. It may not be an easy exponential curve all the way. We'll almost certainly put some efficiency gains towards saving cost and energy rather than increasing power, as we already are now. And maybe something crazy like quantum computers, rather than big conventional computers, will be the coolest new thing.

I don't imagine those powerful computers, whatever they are, will all be doing simulations of physics and weather. One of the things that made some of today's everyday tech hard to imagine was that the inputs involved (social graphs, all the contents of the Web, phones' networks and sensors) just weren't available--would have been hard, before 1980, to imagine trivially having a metric of your connectedness to an acquaintance (like Facebook's 'mutual friends') or having ads matching your interest.

I'm gonna say that 25 years out the data, power, and algorithms will be available to everyone to make things that look like Strong AI to anyone today. Oh, and the video games will be friggin awesome. If we don't all blow each other up in the next couple-and-a-half decades, of course. Any other takers? Whoever predicts it best gets a beer (or soda) in 25 years, if practical.

I was wondering which model Opterons they threw in there. The Interlagos chips were barely faster and used more power than the Magny-Cours CPUs they were destined to replace, though I'm sure these are so heavily taxed that the Bulldozer architecture would shine through in the end.

Okay, I've checked - these are 6274s, which are Interlagos and clocked at 2.2GHz base with an ACP of 80W and a TDP of 115W apiece. This must be the CPU purchase mentioned prior to Bulldozer's launch.

In this environment, where stability is key, he was probably taught that having a bit more is safer than having a bit less. No doubt the data center was designed around airflow software to ensure that heating issues do not arise based on an 'average' application of thermal material.

I wonder how many Petaflops this beast would have achieved if it used Sandy Bridge EP class chips? Anandtech's review of the Opteron 6276 vs Sandy Bridge Xeon EP showed that Intel was far more performant.

In a world in which millions of morons are enthralled by Honey Boo Boo and her band of genetic regressionists, it is great that scientists are advancing our understanding of the Universe. Without those 1%, one can only imagine the state our planet would be in.

I ported some Brownian motion code from CPU to GPU for my thesis and got a considerable speedup (4000x over previously published data). Best thing was that the code scaled with GPUs. Having access to 20k GPUs with 2688 CUDA cores each would just be gravy, especially when simulating 10^12 or more independent particles.

4000x?! I don't think I've ever seen such a speedup. Was that simply from 1 CPU to 1 GPU? I ported a Monte Carlo risk simulation (which also uses Brownian motion, although I suspect for different purposes than yours) and saw about a 300-400x speedup; I thought that was at the top end of what you can get in terms of speed increases.

It helped that the previously published data was a few generations back, so I had some Moore's Law advantage. The type of simulation for that research was essentially dropped there and then because it was so slow, and no one had ever bothered to do it on newer hardware. I think a 2.2 GHz Nehalem single-core simulation of my code compared to a GTX 480 version of the code was a 350x jump or so. Make that 16 cores vs 1 GPU (for a dual-processor system) and it is more like 23x.
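For what it's worth, the per-core and per-system figures quoted above reconcile with simple division; here is the arithmetic as a quick Python sketch (the 350x and 16-core numbers are the ones from the comment, not independently verified):

```python
# Rough reconciliation of the quoted speedups.
single_core_speedup = 350          # GTX 480 vs one 2.2 GHz Nehalem core (figure quoted above)
cpu_cores_per_node = 16            # dual-processor node assumed in the comment

per_node_speedup = single_core_speedup / cpu_cores_per_node
print(f"GPU vs 16-core node: ~{per_node_speedup:.0f}x")   # ~22x, in line with the quoted 23x
```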

I've read articles on AnandTech for years, but I registered an account for the first time today to comment on how wonderful this article is. The scope of what is covered in the article is nothing short of fascinating, and the quality of the writing and attention to detail is superb. Thank you!

Very interesting article, loved the 30,000 foot explanation of the supernova modeling, really helped me to understand in more concrete detail what types of things scientists are using these supercomputers for.

One thing I'd love to see is a more in-depth discussion of the networking. As you pointed out, the networking connectivity is just as important as the data processing, but you really just glossed over it. At least something as simple as vendor, models, host bus adapters, etc.

You missed a huge data item in your article. By saying it's "just a bunch of SATA drives" you completely glossed over the WAY those SATA drives are organized (by DDN). DDN uses a wide/shallow bus topology to keep parallel writes to the drives organized and processed in a VERY optimal manner. Consequently, they're able to ingest at over 6GB/s per head... now, multiply that across the requirements from ORNL and you can see why this becomes important.

Did you get any information about the network (YARC-2, Gemini)? Cray's claim to fame has been their network architecture, which is supposed to be a key contributor to the actual performance of the supercomputer.

You missed the point in the article saying ECC memory was a -must- for a usage scenario like this. With nearly 20,000 GPUs, and all of that information being continuously communicated between the GPU memory and the GPU itself, without ECC, errors would pop up very quickly, and would make useful computation nigh impossible.

It's also about the specific software that works better with CUDA. GCN GPUs are no toys, but software support is nowhere near as prevalent in the professional GPGPU space compared to what NV has accomplished. This makes a lot of sense since NV essentially invented the GPGPU space starting with G80 in 2006. They spent a lot more money creating the CUDA ecosystem and making sure they were the pioneers in this space. Given the wider adoption of CUDA and NV's proven track record, larger companies are far more likely to go with Nvidia.

This is actually no different than what we saw in the Distributed Computing space. For more than half a decade, NV's GPUs were faster in many apps. But the DC community is more dynamic and adapts much more quickly to modern code and technologies, and in the last 3 years almost all of the new DC projects have been dominated by AMD GPUs.

On paper, the HD 7970 GE delivers 1.075 TFlops of DP, and a 1200 MHz 7970 has 1.23 TFlops. Without software support it doesn't mean much in the professional space for now, but the horsepower is already there.

Very nice article, and I love your last paragraph, Anand. It's a revelation. It is indeed incredible to think that when we wanted that 3D accelerator to play GLQuake, it actually turned the wheel for great things to come. Looking back, something as ordinary or insignificant as gaming actually paved the way to accelerate our knowledge today. This goes to show even ordinary things can morph into great things that one could never imagine. It really humbles you not to look down on anything, to be respectful in this intertwined world, the same way it humbles us as human beings as we learn more about the universe.

Although parallelism is very important for processing large models, there is one important feature Anand failed to discuss about Titan, choosing instead to obsess about transistor counts and CPUs and GPUs.

And that is how much memory is available per box. 96 GB? 256 GB? Of DDR3-1333 memory?

The problem for those large reactor or coupled neutron-gamma transport problems analyzed with Monte Carlo or advanced discrete ordinates is usually memory, not the number of processors. You need lots of memory for the geometry, depletable materials, and cross-section data.

And once the computing is done, how much space is available for storing the results? I have seen models so large that they run for 2 weeks with over 2000 processors only to fail because the file storage system ran out of space to store the output files.

You failed to read the entire article. Anand stated there was something like 32 GB of RAM per CPU and 6 GB per GPU (if I remember correctly, going off the top of my head) for a grand total of 710 TB of RAM, as well as 1 PB of HDD storage available. Check back through the pages to find exactly what he posted.

So Sandy Bridge does ~160 GFlops on the LINPACK benchmark, while Titan should do ~20 PFlops, making it 125K times faster. 125K ≈ 2^17, so with 17 doublings a PC will be as fast as Titan. If we assume 1.5 years per doubling, that gives us about 25 years. And just imagine the capabilities of a 2037 supercomputer....
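As a quick sanity check of that estimate, here is the same arithmetic in a few lines of Python (all figures are the ones quoted in the comment above, including the assumed 1.5-year doubling time):

```python
import math

# Back-of-the-envelope check of the "25 years" claim.
pc_flops = 160e9        # ~160 GFlops, Sandy Bridge LINPACK (quoted above)
titan_flops = 20e15     # ~20 PFlops, Titan (quoted above)

ratio = titan_flops / pc_flops          # ~125,000x
doublings = math.log2(ratio)            # ~16.9, i.e. about 17 doublings
years = doublings * 1.5                 # assuming one doubling every 1.5 years

print(f"{ratio:,.0f}x -> {doublings:.1f} doublings -> ~{years:.0f} years")
```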

For all the people speculating or suggesting that they should have used AMD GPUs or Intel CPUs, I think you need to think more like engineers, and less like "cowboys."

To get started, reread this:

"By adding support for ECC, enabling C++ and easier Visual Studio integration, NVIDIA believes that Fermi will open its Tesla business up to a group of clients that would previously not so much as speak to NVIDIA. ECC is the killer feature there."

Now, why on earth would ECC memory on a GPU (which, apparently, AMD wasn't offering) be important? The answer is simple: because a supercomputer that doesn't produce trustworthy results is worse than useless. Shaving some money off the power and cooling budget, or even a 50% boost to raw performance and/or price performance doesn't really matter if the results of calculations that take weeks or months to run can't be trusted.

Since this machine gets much of its compute performance from GPU operations, it is essential that it use GPUs that support ECC memory to allow both detection and recovery from memory corruption.

As to the CPUs, I'm not suggesting that Intel CPUs are significantly less computationally sound than AMD's, but Cray and ORNL already have extensive experience with AMD's CPUs and supporting hardware. Switching to Intel would almost certainly require additional validation work.

And don't underestimate the effort that goes into validating or optimizing these systems. Street price on the raw components alone has to be tens of millions of dollars. You can bet there is a lot of time and effort spent making sure things work right before things make it to full-scale production use.

I know a guy, a PhD in mathematics, who used to work for Cray. These days he's working for Boeing, where his full-time job, as best as I can understand it, is to make sure that some CFD code they run from NASA is used properly so the results can be trusted. When he worked at Cray, his job was much more technical: he hand-optimized the assembly code for critical portions of application code from Cray's clients so it ran optimally on their vector CPU architecture. When doing computation at this scale, things that are completely insignificant on individual consumer systems, or even enterprise servers, can be hugely important.

Anand, I LOVE this post. Breath of fresh air to get to see some of the real-world applications for all this awesome tech we love. The interviews with scientists are especially fascinating and eye-opening. Love the use of video to hear the insights, affect, and passion of the researchers and see them at work. Please, more of this sort of thing!!

I've been noting till I'm blue in the face that GK110 was Nvidia's backup plan, should the GCN/Kepler power ratio not have worked out as much to AMD's disadvantage as it did (presumably 'Big Fermi' was a similar contingency plan that got enacted).

For your second question, if it has the right software then any high-end consumer desktop PC could become self-aware. It would work rather sluggishly, compared to some sci-fi AIs like those in the Halo universe, but would potentially start learning and teaching itself.

Hethos, that is not by any stretch certain. Since "self-awareness" or "consciousness" has never been engineered or simulated, it is still quite uncertain what the specific requirements would be to produce it. Yet here you're not only postulating that all it would take is the right software but also how well it would perform. My guess is that Titan would be able to simulate a brain (and therefore be able to learn, think, dream, and do all the things that brains do) much sooner than it would /become/ "a brain". It took a 128-core computer a 10-hour run to render a few-minute simulation of a complete single-celled organism. Hard to say how much more compute power it would take to fully simulate a brain and be able to interact with it in real time. As for other methods of AI, it may take totally different kinds of hardware and networking altogether.

In addition to the bit about ECC, nVidia really made headway over AMD primarily because of CUDA. nVidia specifically targeted a whole bunch of developers of popular academic software and loaned out engineers for free. Experienced devs from nVidia would actually do most of the legwork to port MPI code to CUDA, while AMD did nothing of the sort. As a result, there is now a large body of well-optimized computational simulation software that supports CUDA (and not OpenCL). However, this is slowly changing and OpenCL is catching on.

I was actually surprised at how many actual times the word "actually" was actually used. Actually, the way it's actually used in this actual article it's actually meaningless and can actually be dropped, actually, most of the actual time.