Supercomputing 12: Intel, Nvidia, and AMD Face Off

At this week's Supercomputing 12 show, the biggest news has been the increasing competition in accelerators and co-processors. AMD, Intel, and Nvidia all unveiled new processors designed to improve the performance of parallel operations by including large numbers of specialized cores. As they have done for several generations now, Nvidia and AMD both introduced new versions of designs that originated as graphics processing units (GPUs), while Intel uses multiple small x86 cores in its long-promised many integrated core (MIC) architecture.

Many of these announcements were expected, but we heard some new details.

Intel Xeon Phi

Intel finally announced the formal shipment of its Xeon Phi processor, which it positions as accelerating the parts of software that can be parallelized—pretty much the same target as the Nvidia and AMD processors.

Intel announced two general versions, the Phi 5110P, which is available (in limited quantities) now, and the Xeon Phi 3100 series, which will launch in the first quarter of 2013. Both are manufactured on a 22nm process (and reportedly have around 5 billion transistors, though Intel did not confirm this or provide a die size) and are based on Intel's MIC or Many-Integrated Core architecture.

The 5110P has 60 cores, each with four threads, runs at 1.05GHz, has 30MB of L2 cache, and supports up to 8GB of GDDR5 memory with a peak memory bandwidth of 320GBps. That gives it peak double-precision floating point performance of 1.01 teraflops. Early customers, including Stampede at the Texas Advanced Computing Center (TACC) at the University of Texas at Austin, are using a custom version of the 5110P, the SE10, which has 61 cores running at a slightly faster clock speed, with a bit more L2 cache.

Intel hasn't disclosed the number of cores or the speed of the 3100 Series, though it seems likely it will have fewer cores running at higher clock speeds, as it is rated at 300 watts instead of the 225 watts of the 5110P (which is designed to be passively cooled, while the 3100 will be available in passive and active cooling models).

Nvidia Tesla K20

For its part, Nvidia formally announced its Tesla K20X and K20 processors, which were previously disclosed as part of the Titan supercomputer that now heads the TOP500 list. Both are based on the firm's 28nm Kepler architecture. The K20X, which is used in Titan, has 2,688 cores, runs at 732MHz, and is rated at a peak of 1.31 teraflops of double-precision floating point performance and 3.95 teraflops single-precision. Nvidia says this is twice the double-precision and three times the single-precision performance of the previous generation. The K20 has 2,496 cores, runs at 706MHz, and is rated at 1.17 teraflops double-precision and 3.52 teraflops single-precision. In any case, this is a massive, 7 billion-transistor chip.

The company highlighted a number of applications where it can offer huge performance improvements compared with CPU-only calculations. It also emphasized partnerships with almost all of the makers of supercomputing equipment.

AMD FirePro S10000

AMD announced its FirePro S10000 GPU accelerator, which pairs two GPUs, each based on the firm's 28nm Tahiti (Southern Islands) design. AMD had earlier announced the S9000, based on a single GPU. The S10000 has 3,584 total cores (1,792 per chip) running at 825MHz, with a total peak rating of 1.48 teraflops at double-precision and 5.91 teraflops at single-precision. The single-chip S9000 has 1,792 cores, but runs slightly faster at 900MHz.

The FirePro S10000 posts better theoretical peak numbers than the Nvidia K20X, though it uses two GPUs instead of one and requires more power: 375 watts TDP vs. 235 watts for the Nvidia product. Note also that although all of the companies cite performance on the LINPACK benchmark, real-world performance can vary dramatically depending on the application you're running and how the software was developed.

AMD also pushed its "Graphics Core Next" (GCN) architecture, which enables the S10000's two GPUs to carry out compute and graphics/visualization tasks at the same time on a single board, so it seems to be promoted more as a workstation solution (where Nvidia also offers its Quadro workstation boards). AMD has had more success in the workstation market than in the HPC accelerator market.

x86 Coprocessors vs. GPU Accelerators

Indeed, the next big battle in this space seems to be between x86 coprocessors and GPU accelerators. Intel tries to distinguish the two by noting that the Xeon Phi can run operating systems, but accelerators can't. That's true today—the Xeon Phi can run Red Hat Enterprise Linux 6.x or SuSE Linux 12+—but I'm not sure how relevant it is, as both run in systems that use other main processors, and often the point is to distribute the application, not the OS.

The more important differences are likely to be how well they perform, how much performance per watt they can generate, and how easy it is to get that performance. That's because in the real world, software often needs to be written to take advantage of massively parallel processing.

Intel often points out that because the Xeon Phi uses x86 cores, it can run all of the same languages, libraries, and tools that programmers are used to today. Intel also has a widely used parallel programming library, often used with multi-core Core and traditional Xeon chips.

Nvidia stresses its CUDA language extensions that work with C/C++ or Fortran, as well as its support for the OpenACC tools for directing compilers to produce parallel code, and says universities find it easy to teach CUDA because many students have laptops with CUDA-compatible graphics cards. (Xeon Phi is very rare at the moment, though x86 cores are nearly universal.) AMD mostly relies on OpenCL.

Over the next few years, it looks like we will see more experimenting with both co-processors and accelerators, and such heterogeneous systems will likely come to dominate the high-performance computing world.

Interconnects, Memory, and More

Other interesting things at the show included lots of discussion of interconnects, such as Mellanox's expanding line of 56Gb/s FDR InfiniBand products. Micron pushed its Hybrid Memory Cube, which stacks large amounts of server memory together using a technology called through-silicon vias, or TSVs, to provide dense, high-speed memory. The first commercial products are slated for 2014, with 4Gb and 8Gb cubes delivering 160GBps of throughput, about 10 times what you get out of an entire standard DDR3 memory module today.
