How Will We Keep Supercomputing Super?

Cray’s Jaguar supercomputer is the fastest machine on the planet, according to the Top 500 list of supercomputers published today by four researchers in the computing industry. It marks the first time that Jaguar beat out IBM’s Roadrunner on performance, achieving 2.3 petaflops, or about 2.3 million billion calculations a second. However, a deeper look at the list shows that the trend in supercomputing is not only toward faster machines but also toward a steady erosion of how super supercomputing actually is. Dedicated vendors such as SiCortex have shut down, and venerable players like SGI have filed for bankruptcy and then been acquired.
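For reference, here is the prefix arithmetic behind the 2.3-petaflop figure. It treats one flop as one floating-point calculation, as the article does:

```latex
\begin{align*}
2.3~\text{petaflops} &= 2.3 \times 10^{15}~\text{flop/s} \\
                     &= 2.3 \times \underbrace{10^{6}}_{\text{million}} \times \underbrace{10^{9}}_{\text{billion}}~\text{flop/s}
\end{align*}
```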

Increasingly, many of the parts that make up a supercomputer, from the types of processors used to the networking cables, are the same as those used in everyday corporate computing. As the chart makes clear, the number of different processors used to build supercomputers has been shrinking. This is partly due to Moore’s Law, which enables the x86 architecture (the same type of chips inside your computer) to make steady performance gains, but it is also a function of how cheap mass-produced chips are. And because most supercomputers are built for the government, getting as many flops per dollar as possible is essential. Even on the networking side, Ethernet is making strides against more expensive, proprietary networking technologies such as InfiniBand.


However, as the search for faster computers continues (the goal is to build an exascale machine by 2018), cramming millions of x86 chips into a giant system isn’t going to cut it on either the power consumption or the real estate front. That’s why, when it comes to chips in supercomputers, you can expect to see more graphics processors. GPUs offer screaming performance for certain tasks at a relatively low price, thanks to the fact that they are found in most consumer computers. For example, the fifth-most powerful system on the Top 500 list is a Chinese supercomputer that uses GPUs from AMD.

As I explain in a GigaOM Pro piece about the quest for the exascale grail (sub. req’d.), building a supercomputer that can deliver a billion billion (or quintillion) calculations per second is going to force designers to change the way they think about putting these supercomputers together. GPUs are the first step in that process, although more esoteric technologies may emerge.
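To put the exascale goal in perspective, the short sketch below compares the exaflop target with the 2.3 petaflops quoted for Jaguar. It is back-of-the-envelope arithmetic with SI prefixes, not a model of any real machine, and the helper name `speedup_needed` is purely illustrative.

```python
# Rough scale-up arithmetic: how far Jaguar's quoted 2.3 petaflops is from an exaflop.
# SI prefixes for floating-point operations per second (flop/s).
PETA = 10**15
EXA = 10**18  # a billion billion (10**9 * 10**9), i.e. a quintillion


def speedup_needed(current_flops: float, target_flops: float) -> float:
    """Return the factor by which current_flops must grow to reach target_flops."""
    return target_flops / current_flops


jaguar_flops = 2.3 * PETA  # figure quoted in the article
exaflop = 1 * EXA

factor = speedup_needed(jaguar_flops, exaflop)
print(f"Exascale is about {factor:.0f}x Jaguar")
```

Even before power and floor space enter the picture, this shows that an exascale machine would need more than a 400-fold jump over the fastest system on the current list.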

But for now, as supercomputers and high-performance computers use more mainstream, commodity parts, it becomes that much harder to distinguish the specialty high-performance computing vendors from those offering corporate computing products. Rackable, for example, which purchased SGI and took the SGI name, builds its products with a corporate buyer in mind. And as the underlying hardware for high-performance computing becomes more like the hardware used in corporate data centers, firms like Microsoft are trying to take advantage of the familiar architectures (as well as the ever-increasing need for higher-performance computers at the corporate level). Today the Redmond giant released software that lets users run Microsoft Office Excel 2010 on a distributed HPC cluster, along with a version of Windows HPC Server 2008 designed to run on large clusters.

So as supercomputers have become less super, they have also become more accessible for corporate computing. Microsoft, Intel, SGI/Rackable and others are trying to take advantage of this. However, as the industry strives to build machines that can achieve exascale performance, it’s unclear if these commodity and common architectures can scale out linearly without consuming incredible amounts of power and taking up huge amounts of space. So we may see supercomputing become more super and less mainstream once again.
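The power concern can be framed with a deliberately simple model. If a commodity design scales out linearly, its energy efficiency (flops per watt) stays fixed, so power draw grows in direct proportion to performance. The efficiency value below is a hypothetical placeholder, not a measured figure for any machine on the list, and `power_watts` is an illustrative helper.

```python
# Minimal linear scale-out model (an assumption, not a measured result):
# at a fixed efficiency in flop/s per watt, power grows in proportion to performance.


def power_watts(target_flops: float, flops_per_watt: float) -> float:
    """Power needed to sustain target_flops at a constant efficiency."""
    if flops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return target_flops / flops_per_watt


EXA = 10**18
HYPOTHETICAL_EFFICIENCY = 1e9  # hypothetical: 1 gigaflop/s per watt

exa_power = power_watts(EXA, HYPOTHETICAL_EFFICIENCY)
print(f"{exa_power / 1e6:.0f} MW at the assumed efficiency")
```

Under this model, doubling performance at the same efficiency doubles the power draw. That is why simply adding more of the same chips runs into the power wall the article describes, and why raising flops per watt matters as much as raising flops.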

InfiniBand is not “proprietary”; it’s an open industry standard. In fact, it’s cheaper than 10GigE per port and per NIC. It has much better latency than 10GigE as well. The only advantage of 10GigE is backward compatibility.

Exactly. This article reflects a couple of knee-jerk misconceptions about Ethernet and InfiniBand. Since Nov. 2008, a year ago, Gigabit Ethernet has lost 5% share of the Top500 list in number of systems, about 8% measured in total processors interconnected, and even more measured in total Rmax performance delivered. InfiniBand has interconnected the most processing performance on the Top500 list for a few years now, and it is still growing. 10G Ethernet has *finally* been used to build exactly 1 machine on the Top500 list, at #486. That makes sense, since it delivers about 1/4 the price/performance of InfiniBand from any of the various vendors (~same price, 1/4 the bandwidth, and worse latency). Factor of 4 worse.
As we go to more cores, more power per machine, GPUs, and larger SMPs, trends are pointing towards more industry-standard InfiniBand, less Gigabit Ethernet, and little-or-no 10G Ethernet, except for people who don’t care about how much they pay for bandwidth, as long as it’s called Ethernet.
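The commenter’s “factor of 4” follows from comparing bandwidth per dollar. The sketch below normalizes both per-port prices to 1 (the commenter’s “~same price”) and assumes a 4X QDR InfiniBand link at its 40 Gb/s signaling rate against 10 Gb/s for 10G Ethernet. Both the pricing and the choice of InfiniBand generation are assumptions made for illustration, and the `Interconnect` class is hypothetical.

```python
# Bandwidth per unit price for two interconnects (illustrative assumptions only).
from dataclasses import dataclass


@dataclass
class Interconnect:
    name: str
    gbps: float        # link rate in Gb/s (assumed signaling rate)
    port_price: float  # normalized per-port price (assumed equal for both)

    def gbps_per_price(self) -> float:
        """Bandwidth delivered per unit of normalized price."""
        return self.gbps / self.port_price


ten_gig_e = Interconnect("10G Ethernet", gbps=10.0, port_price=1.0)
ib_qdr = Interconnect("InfiniBand 4X QDR", gbps=40.0, port_price=1.0)

ratio = ib_qdr.gbps_per_price() / ten_gig_e.gbps_per_price()
print(f"{ib_qdr.name} delivers {ratio:.0f}x the bandwidth per dollar")
```

Latency is left out of this model entirely. Using payload data rates instead of signaling rates (QDR’s 8b/10b encoding leaves 32 Gb/s of usable bandwidth) would shrink the ratio to about 3.2.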