How to Make a Supercomputer?

Scientists have been trying to build the ultimate supercomputer for a while now, and it is no easy feat. There are currently three Department of Energy (DOE) Office of Science supercomputing user facilities: California’s National Energy Research Scientific Computing Center (NERSC), Tennessee’s Oak Ridge Leadership Computing Facility (OLCF), and Illinois’ Argonne Leadership Computing Facility (ALCF). The supercomputers at all three facilities took years of planning and a great deal of work to reach their current standard, but the effort has been worth it: they provide researchers with the computing power needed to tackle some of the nation’s biggest issues.

Supercomputers solve two main challenges: they can analyze large amounts of data, and they can model very complex systems. Some of the machines about to go online are capable of producing more than 1 terabyte of data per second, which in layman’s terms is enough to fill nearly 13,000 DVDs every minute. Supercomputers are also far faster than conventional computers: calculations that one could carry out in a single day would take a conventional computer 20 years.
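The figures in the paragraph above can be checked with a little arithmetic. The sketch below is only illustrative: it assumes decimal units (1 terabyte = 10^12 bytes), a single-layer DVD capacity of 4.7 gigabytes, and 365-day years, none of which the article states. The function names are hypothetical.

```python
# Back-of-the-envelope check of the data-rate and speed claims.
# Assumptions (not from the article): decimal units, 4.7 GB single-layer DVDs,
# and 365-day years.

TERABYTE = 10**12          # bytes
DVD_CAPACITY = 4.7e9       # bytes, single-layer DVD (assumed)


def dvds_per_minute(bytes_per_second, dvd_capacity=DVD_CAPACITY):
    """Number of DVDs filled in one minute at a given data rate."""
    return bytes_per_second * 60 / dvd_capacity


def speedup(conventional_years, supercomputer_days=1, days_per_year=365):
    """Ratio of conventional run time to supercomputer run time."""
    return conventional_years * days_per_year / supercomputer_days


if __name__ == "__main__":
    print(f"DVDs per minute at 1 TB/s: {dvds_per_minute(TERABYTE):,.0f}")
    print(f"Implied speedup (20 years vs. 1 day): {speedup(20):,.0f}x")
```

Under these assumptions, 1 terabyte per second works out to a little under 13,000 DVDs per minute, and "20 years versus one day" implies a speedup of several thousand times.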

As mentioned earlier, planning a new supercomputer takes years and often starts before the previous one has even finished being set up. Because technology moves so quickly, it works out cheaper to build a new machine than to redesign the existing one. At the ALCF, for example, staff began planning in 2008 for a supercomputer that did not launch until 2013. Planning involves deciding not only when and where the machine will be built and installed, but also what capabilities it should have to help with future research efforts.

When the OLCF began planning its current supercomputer, the project director, Buddy Bland, said, “It was not obvious how we were going to get the kind of performance increase our users said they needed using the standard way we had been doing it.” OLCF launched its supercomputer, Titan, in 2012, combining CPUs (central processing units) with GPUs (graphics processing units). Adding GPUs allows Titan to handle multiple instructions at once and run 10 times faster than OLCF’s previous supercomputer. It is also five times more energy-efficient.

Even getting the site ready to house the supercomputer takes time. When NERSC installed its supercomputer, Cori, staff had to lay new piping underneath the floor to connect the cooling system to the cabinets. Theta, Argonne’s latest supercomputer to go live, launched in July 2017.

Unfortunately, supercomputers come with many challenges too. One is that a supercomputer has thousands of processors, so programs have to break problems into smaller chunks and distribute them across the units. Another is designing programs that can manage failures. To help pave the way for future research, and to stress-test the computers, early users are granted special access in exchange for dealing with the issues of a new machine. They can also attend workshops and get hands-on help when needed.
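The two programming challenges described above, splitting a problem into chunks and coping with failures, can be sketched in a few lines. This is a toy illustration, not how any of these facilities run their codes: it uses Python threads on one machine, whereas real supercomputer programs typically rely on message-passing libraries such as MPI. The function names, the sum-of-squares task, and the simple retry policy are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks nearly equal contiguous pieces."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The work done on a single chunk (a stand-in for a real computation)."""
    return sum(x * x for x in chunk)


def run_with_retry(task, chunk, max_attempts=3):
    """Run task(chunk), retrying if it fails; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk)
        except Exception:
            if attempt == max_attempts:
                raise


def parallel_sum_of_squares(items, n_workers=4):
    """Distribute chunks across workers, then combine the partial results."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(
            pool.map(lambda c: run_with_retry(sum_of_squares, c), chunks)
        )
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

The pattern is the same at any scale: divide the work, compute partial results independently, recover from a failed piece without restarting everything, and combine the results at the end.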

“Dungeon sessions” were held at NERSC while preparing for Cori. These were effectively three-day workshops, often in windowless rooms, where engineers from Intel and Cray would come together to improve code. Some programs ran 10 times faster after these sessions. “What’s so valuable is the knowledge and strategies not only to fix the bottlenecks we discovered when we were there but other problems that we find as we transfer the program to Cori,” said Brian Friesen of NERSC.

Even when the supercomputer is delivered, it is still a long way from being ready to work. First, the team that receives it has to ensure that it meets all of the performance requirements. Then, to stress-test it fully, the team loads it with the most demanding, complex programs and lets it run for weeks on end. Susan Coghlan, ALCF’s project director, commented, “There’s a lot of things that can go wrong, from the very mundane to the very esoteric.” She knows this firsthand: when ALCF launched Mira, the team discovered that the water used to cool the computer wasn’t pure enough, and as a result bacteria and particles were causing issues with the pipes.

“Scaling up these applications is heroic. It’s horrific and heroic,” said Jeff Nichols, Oak Ridge National Laboratory’s associate director for computing and computational sciences. Luckily, the early users program gives exclusive access for several months before the machine eventually opens up to requests from the wider scientific community. Whatever scientists learn from these supercomputers will be used in the Office of Science’s next challenge: exascale computers, which will be at least 50 times faster than any computer around today. Even though exascale computers aren’t expected to be ready until 2021, the facilities are planning for them now, and managers are already busy imagining just what they can achieve with them.