At the Lawrence Livermore National Laboratory in California, a supercomputer named "Sequoia" puts nearly every other computer on the planet to shame. With 1.6 million processor cores (16 per CPU) across 96 racks, Sequoia can perform 16 thousand trillion calculations per second, or 16.32 petaflops.

Who would need such horsepower? The IBM Blue Gene/Q-based system was built for the Department of Energy for simulations designed to extend the lifespan of nuclear weapons. But for a limited time, the machine is being made available to outside researchers to perform all sorts of tests, a few hours at a time.

One of the first to take advantage of this opportunity was Stanford University's Center for Turbulence Research—and it wasn't hesitant about seeing what this machine is really capable of. For three hours on Tuesday of last week, researchers from the center remotely logged in to Sequoia to run a computational fluid dynamics (CFD) simulation on a million cores at once—1,048,576 cores, to be exact.

It's part of a project to test noise generated by supersonic jet engines and help design engines that are a bit quieter. The work is sponsored in part by the US Navy, which is concerned about "hearing loss that sailors on aircraft carrier decks encounter because of jet noise," Research Associate Joseph Nichols of the Center for Turbulence Research told Ars.

Using giant supercomputers to solve complex scientific problems is hardly unusual these days. Nor do larger numbers of cores necessarily translate to the fastest speeds, because of differences between processors and supercomputer designs. The million-core run is intriguing, but using all those cores at once without something going wrong poses extreme challenges.

Believe it or not, three hours with a million cores wasn't enough to make a real dent in the jet noise project. Despite preparation work aimed at cutting out bottlenecks, it was just enough time to make sure the code ran properly and to get a sense of the possibilities that million-core computers can offer.

"This is really to show what we can do in the future," Nichols said. "The simulations take some time to boot up and pass through initialization. We did tune the I/O for the Blue Gene architecture, but it is still slower than the blindingly fast computation and communication speeds. Depending on how much data gets written, the I/O can add an extra chunk of time to the overhead."

The problems are all over the place, including "How do you write into one file from a million processors that are all trying to step on each other?" Nichols said. "That was an interesting thing. It kind of depended on the interconnect, too. Only some of the processors are connected to the disk, so you have to rearrange data to get the right performance."
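We don't have the team's actual I/O code, but the standard way to let huge numbers of processes share one file on machines like this is collective parallel I/O, where ranks coordinate so that only a handful of aggregators actually touch the disk. Here is a minimal, hypothetical sketch in C using MPI-IO; the file name, sizes, and data are invented for illustration, and this is not the CharLES code:

```c
/* Hypothetical sketch: every MPI rank writes its own disjoint slice of
 * one shared file with a collective MPI-IO call, so the library can
 * funnel the requests through a small set of aggregator nodes instead
 * of a million ranks hammering the filesystem independently. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank owns a contiguous block of the (made-up) solution data. */
    const int local_n = 1024;
    double *local = malloc(local_n * sizeof *local);
    for (int i = 0; i < local_n; i++)
        local[i] = rank + i * 1e-6;   /* stand-in for real solution values */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "snapshot.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Disjoint offsets: rank r writes the r-th block of the file. */
    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);

    /* The "_all" variant is collective: all ranks call it together, and
     * the MPI library decides which of them actually perform disk I/O. */
    MPI_File_write_at_all(fh, offset, local, local_n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(local);
    MPI_Finalize();
    return 0;
}
```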

More cores, or better cores?

Sequoia was named the world's fastest supercomputer in June 2012. It later fell to second place behind a 17.59-petaflop system at Oak Ridge National Laboratory, but it is still the only system on the Top 500 supercomputers list with a million or more cores.

Of course, going for "more cores" isn't necessarily the best way to tackle a supercomputing problem. That Oak Ridge computer, named Titan, hit 17.59 petaflops using "only" 560,640 cores. The Titan system was built by the supercomputer manufacturer Cray, and it uses Nvidia graphics processing units in addition to traditional CPUs to gain dramatic increases in speed.

There are pros and cons to different approaches. Dave Turek, IBM vice president of high performance computing, told Ars last year that GPUs are more difficult to program for, and the GPU-less Sequoia is for "real science." Titan, it should be noted, does plenty of real science, tackling problems related to climate change, astrophysics, and more. With both CPUs and GPUs in a system, Titan's CPUs guide the simulations but hand off work to the GPUs, which can handle many more calculations at once despite using only slightly more electricity.

Throwing more cores at a problem doesn't necessarily result in performance gains. Code has to be carefully prepared to account for the bottlenecks that arise when information is passed from one core to another (unless the application is so parallel that each core can work on separate calculations without ever talking to each other).
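To see where those bottlenecks come from, consider what a domain-decomposed CFD code has to do before every single step: each core must trade boundary values with its neighbors. The hypothetical C/MPI sketch below shows the idea for a one-dimensional "ghost cell" exchange; the real simulation does this across far more neighbors and far more data, but the pattern is the same (this is an illustration, not the Stanford code):

```c
/* Hypothetical sketch of the per-step communication a domain-decomposed
 * CFD code cannot avoid: each rank swaps "ghost" boundary values with
 * its left and right neighbors before computing the next step. */
#include <mpi.h>

#define N 1000          /* interior cells owned by this rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* u[0] and u[N+1] are ghost cells filled from the neighbors. */
    double u[N + 2] = {0.0};
    int left  = (rank - 1 + nranks) % nranks;
    int right = (rank + 1) % nranks;

    for (int step = 0; step < 100; step++) {
        /* Exchange boundary values with both neighbors every step. */
        MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                     &u[0], 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  1,
                     &u[N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... local update of u[1..N] using the ghost values ... */
    }

    MPI_Finalize();
    return 0;
}
```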

Previously, Nichols' biggest calculation was performed for about 100 hours on 131,072 cores on a Blue Gene/P system. The same calculation could be done on a million Sequoia cores in about 8 to 12 hours, he said.

Besides speeding up lengthy calculations, more cores and faster supercomputers will let scientists tackle even more complicated problems.

"Having more cores, either you reduce the time to solution or you solve more complex problems, problems involving flow that involve, say, chemical reactions, combustion," said Parviz Moin, director of the Center for Turbulence Research. "These are grand challenge problems that involve many, many more equations and a lot of work that will be distributed among the processors."

Getting code ready for the million-core run

Nichols and his colleagues use a code named CharLES, which solves the Navier-Stokes equations, for the jet noise simulations (the LES stands for large eddy simulation). They're using the same code for separate projects to research scramjets (supersonic combustion ramjets), which could travel at 10 times the speed of sound.

We've written before about large supercomputing runs—for example, one using 50,000 cores on the Amazon Elastic Compute Cloud. That one was "embarrassingly parallel," in which the calculations are all independent of each other. That means the speed of the interconnect, the connections between each processing core, didn't really matter.

The million-core run was not only much bigger, it was more complicated. "It is parallel but there is communication [between processors] involved," Moin said. "Each core is not independent."

As we wrote in August 2011, Blue Gene/Q uses "transactional memory" to solve many of the problems that make highly scalable parallel programming so difficult. But prep work by humans is still required.

Nichols worked with the Lawrence Livermore folks to optimize the code for Sequoia, avoiding slowdowns in I/O performance and minimizing the communication needed in each step.

Each compute node is connected to 10 of its nearest neighbors. The network has five dimensions, with links running forward and backward along each one, so 5 x 2 gives the 10 connections over optical links.

You can communicate with processors that are farther away, but at higher latency. Latency for the nearest neighbors is 80 nanoseconds, which is totally incredible. The calculation used the interconnect in such a way that communication overhead was minimal even at a million processors.
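MPI, the message-passing library most HPC codes are built on, can expose exactly that kind of topology to an application. As an illustration only (this is not necessarily how CharLES is written), a minimal C sketch that lays ranks out on a periodic five-dimensional grid and asks for the two neighbors along each dimension looks like this:

```c
/* Hypothetical sketch: mapping MPI ranks onto a periodic 5-D grid and
 * asking for the neighbor in each direction. Two neighbors per
 * dimension, five dimensions: the same 10 nearest-neighbor links the
 * Blue Gene/Q torus provides in hardware. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int dims[5] = {0, 0, 0, 0, 0};       /* let MPI pick a factorization */
    int periods[5] = {1, 1, 1, 1, 1};    /* wrap around: a torus */
    MPI_Dims_create(nranks, 5, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 5, dims, periods, 1, &torus);

    int rank;
    MPI_Comm_rank(torus, &rank);

    for (int d = 0; d < 5; d++) {
        int back, fwd;
        MPI_Cart_shift(torus, d, 1, &back, &fwd);  /* -1 and +1 along dim d */
        if (rank == 0)
            printf("dim %d: neighbors %d and %d\n", d, back, fwd);
    }

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}
```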

Performance scaled nearly one-to-one with the increase in cores, at 83 percent efficiency. As Nichols explains, going from 131,072 cores to just over a million multiplies the number of cores by eight, so ideally one would want a speed-up of eight as well.

The real speed-up was 6.6, which is "83 percent of the ideal speed-up we would like to see. It means that as we're adding more cores the code is getting faster and faster, even at the million level. This is amazing for a CFD simulation, because in CFD simulations each of these subdomains has to communicate with its neighbors at every time step to share wave information."
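Spelled out with the figures quoted above (all numbers come from the article, not from any new measurement), the arithmetic looks like this:

```c
/* Strong-scaling arithmetic behind the quoted efficiency figure.
 * All inputs come from the article; nothing here is newly measured. */
#include <stdio.h>

int main(void)
{
    double base_cores = 131072.0;        /* earlier Blue Gene/P run     */
    double new_cores  = 1048576.0;       /* the Sequoia run             */
    double measured_speedup = 6.6;       /* reported by the researchers */

    double ideal_speedup = new_cores / base_cores;         /* = 8.0 */
    double efficiency = measured_speedup / ideal_speedup;  /* = 0.825, the
                                                              ~83 percent
                                                              in the text */
    printf("ideal %.1fx, measured %.1fx, efficiency %.1f%%\n",
           ideal_speedup, measured_speedup, 100.0 * efficiency);
    return 0;
}
```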

Modeling jet engines is complex, as Stanford notes in a description of the project: "These complex simulations allow scientists to peer inside and measure processes occurring within the harsh exhaust environment that is otherwise inaccessible to experimental equipment."

The simulations allow Stanford to test how changes to the engine nozzle and chevrons impact noise. With supercomputers these simulations can be done without building physical models or testing in wind tunnels.

"That's a really complicated problem because you have shock waves in engines which are very thin scales compared to the length of the combustor," Nichols said. "And then you have combustion as well as turbulence in everything. The idea is that we can predict the behavior. If given enough resolution we can predict what will happen with different types of designs."

The study of scramjets and the conditions under which they might fail involves similar complexity. "NASA is very much interested in such vehicles for access to space," Moin said. "These would be air-breathing vehicles as opposed to rockets. [Rockets] carry their own oxygen to orbit; they're heavier because of that. They have to carry liquid oxygen."

For now, the researchers will have to continue this work with paltry sub-million-core supercomputers. But perhaps it won't be long before supercomputers as powerful as Sequoia are the standard for such research.

As a high school student in 1994, Nichols attended a summer program at Lawrence Livermore and worked on the Cray Y-MP. At the time, it was one of the fastest machines in the world.

"Now Sequoia is 10 million times stronger than that machine," Nichols said. "This is giving us a glimpse into the future, that it's really possible to run on a million cores."

I'd be very curious to know if they actually used the transactional memory support on BlueGene; the TM support only applies within each compute node (a single bundle of 16 cores) and is handled separately from the computation that happens across nodes.

Regardless though, the amount of coordination to manage code that scales to a million cores is mind boggling.

The work is sponsored in part by the US Navy, which is concerned about "hearing loss that sailors on aircraft carrier decks encounter because of jet noise,"

Really? That stands at odds with my experience: even though I had worked a flight deck for years, I was told hearing loss wasn't a service-related disability. Granted, this was a number of years ago and policies might have changed, but it's interesting nonetheless. Even with double hearing protection it was sometimes painfully loud on deck.

Dave Turek, IBM vice president of high performance computing, told Ars last year that GPUs are more difficult to program for, and the GPU-less Sequoia is for "real science."

They're no more difficult to program for than CPUs, and GPUs are made for these types of parallelization problems. This sounds like pure marketing speak (especially the part about 'real science').

There's a good reason NERSC, LLNL, Sandia, and Argonne have built their latest large systems without using GPUs, and NCSA only has GPUs in a small fraction of Blue Waters - they're a pain in the ass to work with. Getting value out of them demands substantial application porting effort well beyond what it takes to tune existing CPU-based code to a new system, even with large changes in microarchitecture, interconnect, and storage. When GPUs work, they work astoundingly well. The best cases seem to be new applications tailored from the ground up to benefit from them, though.

Edit to answer Oskiee:

What makes GPUs so wonderful is the huge number of functional units they contain, and the obscene memory bandwidth they can sometimes offer. The challenge is that code needs to be written to very closely match the structure of the GPU hardware. To use all of the functional units, you need big blocks of parallel computation with no inter-dependence - a local calculation that is close to embarrassingly parallel. The memory bandwidth only really shines when your memory access pattern is very regular across the threads, such as contiguous threads accessing contiguous words of memory.
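A rough way to picture the difference without writing any GPU code at all (this is just a toy C sketch, nothing to do with the codes discussed here): the first loop below is the kind of big, independent block of work that maps onto thousands of GPU threads; the second is a recurrence that chains every iteration to the one before it.

```c
/* Toy illustration only. The first loop's iterations are all independent
 * and touch memory contiguously, so the work can be spread across
 * thousands of threads. The second is a recurrence: each iteration needs
 * the previous result, so it can't simply be split up the same way. */
#include <stddef.h>
#include <stdio.h>

void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];      /* independent, contiguous accesses */
}

float recurrence(size_t n, float a, const float *x)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc = a * acc + x[i];        /* depends on the previous iteration */
    return acc;
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    saxpy(4, 2.0f, x, y);
    printf("%g %g\n", y[3], recurrence(4, 0.5f, x));
    return 0;
}
```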

I know that GPUs are great for number crunching, which is why they are being used to build supercomputers, but what makes them better than a CPU? Aren't they essentially the same thing?

Mythbusters has a cool video (search for "paintball gun gpu mythbusters") - I believe the gist of it is highly parallelized short pipelines in a GPU vs. serial processing over a few cores in a CPU. Each excels at different tasks.

On the article here: absolutely awesome stuff, and I'm curious what the power draw is. It's important to note that this isn't just a bunch of processors in a rack; this type of thing is one of the reasons IBM is doing very well where they are now. It's easy to forget about them for a while, but damn do they come up with amazing stuff.

Dave Turek, IBM vice president of high performance computing, told Ars last year that GPUs are more difficult to program for, and the GPU-less Sequoia is for "real science."

They're no more difficult to program for than CPUs, and GPUs are made for these types of parallelization problems. This sounds like pure marketing speak (especially the part about 'real science').

There's a good reason NERSC, LLNL, Sandia, and Argonne have not built their latest large systems using GPUs, and NCSA only has GPUs in a small fraction of Blue Waters - they're a pain in the ass to work with. Getting value out of them demands substantial application porting effort well beyond what it takes to tune existing CPU-based code to a new system, even with large changes in microarchitecture, interconnect, and storage. When GPUs work, they work astoundingly well. The best cases seem to be new applications tailored from the ground up to benefit from them, though.

Actually, Titan, which is at Oak Ridge, is being used heavily by researchers at LLNL and Sandia.

Yes, the DOE labs share resources pretty heavily. There are plenty of codes run by people at all of the labs that are amenable to GPU parallelization. It just falls far short of all of them. Thus, there's still substantial value in large machines whose whole budget is spent on more general-purpose hardware.

This shouldn't be too surprising, really. Much of the growth and development in computing has come from increased specialization of resources, some of which eventually gets folded back into mainline components, and others that simply get outpaced.

I don't know jack about how components are sourced for supercomputers, but I'm going to make an educated guess that AMD didn't have much for net margins when they sold those processors for Titan. Yes, they should get name recognition, but they also need to turn a profit.

The CPUs in Titan provide only a small fraction - something like 10% - of its total computational capacity. Someone running code on Titan using only the CPUs would be wasting their time, compared to other available resources that they could use instead. What CPUs it happens to have driving it really isn't that relevant.

The more interesting remark to make here is that Cray's subsequent generation hardware, Cascade/Aries, which is shipping in the newer XC30 systems, works with Intel CPUs instead of AMD's. Without some serious new developments, AMD's days in the HPC space may be numbered.

Keep in mind that high-performance GPU computing is a relatively recent advance. Many academic software developers only recently began to see the advantages of GPU computing, so the software architecture to exploit it hasn't been built into many science codes.

DSF1942 wrote:

Quote:

Dave Turek, IBM vice president of high performance computing, told Ars last year that GPUs are more difficult to program for, and the GPU-less Sequoia is for "real science."

They're no more difficult to program for than CPUs, and GPUs are made for these types of parallelization problems. This sounds like pure marketing speak (especially the part about 'real science').

This actually is a bit tricky. Quantum chemistry applications require double-precision accuracy in all computations. If one is going to use GPUs with only single-precision support, then the code gets a little bit more complicated. However, now that double-precision GPUs are available, I am beginning to see more GPU-optimized codes for HPC in quantum chemistry.

I often worry about quantum effects in such a supercomputer. I've heard of cosmic rays flipping bits in existing computers, and when you have 1,000,000 cores, that's some serious area of exposure in a calculation. Meanwhile, your little 1"x1" quad core doesn't have much exposure at all.

Many interesting topics are only hinted at in this article. How is a project like this loaded? What is the initialization like? Is there a checkpointing procedure for stopping the simulation so it can be restarted at a later time? And, most importantly, how do you know the results of your simulation are computationally sound? In other words, how do you check your work? The possibilities for error seem to grow enormously.

Once again, an Ars deep look into a project involving these super computers would be fascinating for those of us who once did simulation (even though it was just circuit simulation).

I often worry about quantum effects in such a supercomputer. I've heard of cosmic rays flipping bits in existing computers, and when you have 1,000,000 cores, that's some serious area of exposure in a calculation. Meanwhile, your little 1"x1" quad core doesn't have much exposure at all.

Unsurprisingly, the people who put together supercomputers are aware of the possibility of externally induced errors and account for that possibility in their designs.

This reminded me of that cool article from a few days back about the Saturn V rocket that the geniuses at NASA are reassembling with the ultimate goal of getting super high res digital scans of the physical rocket to use in simulations on supercomputers like this. It's amazing what they can do with the masses of computational power available today.

I know that GPUs are great for number crunching, which is why they are being used to build supercomputers, but what makes them better than a CPU? Aren't they essentially the same thing?

GPUs are great at doing a small bit of math really fast. CPUs are great at changing the kind of operations they do from one moment to the next. Once you've discretized the PDEs of the Navier-Stokes equations, what you're really left with is a set of simultaneous algebraic equations. The solutions to these involve a BUNCH of multiplies and adds, and it's the same routine over, and over, and over... So if you can cast your problem into a form that can be fed to GPUs, and it's worthwhile, then you may get an overall speed boost relative to using the same amount of silicon in CPUs.

In the problem described here, the results from one numerical time step (for convergence purposes, and not necessarily having anything to do with real time) must propagate from every node (location) to at least its nearest neighbors in 3-space. So the I/O of keeping all the GPUs fed may actually be the bottleneck, rather than just asking a CPU to take an array in RAM and perform some operation on it.
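For the curious, the "same routine over and over" really is as plain as it sounds. Here's a toy C sketch of a single stencil sweep, a stand-in for the kind of nearest-neighbor update a discretized flow solver performs (not the actual CharLES kernel): every point is rebuilt from its neighbors with a handful of multiplies and adds, and the whole thing repeats every time step.

```c
/* Toy stencil sweep: each interior point is updated from its four
 * nearest neighbors with the same few multiplies and adds. A real
 * compressible-flow solver does far more work per point, but the
 * repeat-the-same-arithmetic-everywhere structure is the same. */
#include <stdio.h>

#define NX 64
#define NY 64

static double u[NX][NY], u_new[NX][NY];

void sweep(void)
{
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            u_new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                  u[i][j - 1] + u[i][j + 1]);
}

int main(void)
{
    u[NX / 2][NY / 2] = 1.0;          /* a single "disturbance" */
    for (int step = 0; step < 100; step++) {
        sweep();
        for (int i = 0; i < NX; i++)  /* copy back for the next step */
            for (int j = 0; j < NY; j++)
                u[i][j] = u_new[i][j];
    }
    printf("center value after 100 steps: %g\n", u[NX / 2][NY / 2]);
    return 0;
}
```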

Why is the Department of Energy concerned about the longevity of nuclear weapons? I realize that nuclear weapons can release large amounts of energy, but not in a way that should be interesting to the DoE.

Not to sound like a jerk, but doesn't "flops" mean floating-point operations per second? So a petaflop is technically 10^15 floating-point operations per second. It's just that the acronym isn't actually plural; it just happens to end in an 's'.

Start with 4 quarters, place them on your hypercube and then rotate the hypercube clockwise. You'll notice that you have to rotate it 4 times before the quarters return to their original location and orientation.