IBM Embraces Nvidia GPUs for Acceleration

IBM wants its Power Systems platforms to continue to thrive in the datacenter, and it knows it has to do everything it can to give the Power family of processors all of the advantages that an X86 chip has. Oddly enough, an important advantage that the X86 architecture has right now over Power is the ability to offload parallel work to a GPU coprocessor.

IBM is now committed to fixing this shortcoming, and is going to be working very closely with Nvidia, its newly designated supplier of choice for GPU accelerators. The plan calls for both companies to make changes to their chips and their software to make Power processors and Tesla coprocessors work seamlessly with Nvidia’s CUDA parallel programming environment and IBM’s various Rational development tools and compilers for the Power architecture.

Big Blue has also committed to using Tesla GPUs to accelerate its portfolio of database, middleware, and application software.

“We have been asked a lot since forming the OpenPower Consortium about what IBM and Nvidia are going to do, how we are going to partner,” Brad McCredie, vice president of Power Systems development within IBM’s Systems and Technology Group, tells EnterpriseTech. “We are going to use Nvidia GPUs, first and foremost, to accelerate enterprise workloads. There will be some HPC work, sure. But what we find really exciting is that we will be using GPUs to accelerate core enterprise applications.”

The expanded partnership, announced at the SC13 conference in Denver, is the first tangible fruit to come out of the OpenPower Consortium that IBM launched in August with Nvidia, Google, Mellanox Technologies, and Tyan, which respectively make GPUs, homegrown datacenters and servers, network cards and switches, and motherboards and systems. While OpenPower is about emulating the open, collaborative development that characterizes the ARM ecosystem established by ARM Holdings, this recent effort between IBM and Nvidia is a business deal that is all about solving business problems, specifically making enterprise applications run faster and more efficiently than they can on processors alone.

IBM knows a thing or two about accelerating workloads. The “Roadrunner” system that Big Blue built for Los Alamos National Laboratory pushed the limits of application acceleration back in 2008 when it broke through the petaflops performance barrier. Roadrunner ganged up banks of compute, each consisting of a pair of Opteron 2210 processors with four PowerXCell accelerators hanging off them; the accelerators were Power processors in their own right, each with eight vector math units.

The basis for the PowerXCell chips was the game console business, which Big Blue pretty much owned at the time, just like Nvidia’s Tesla GPU compute is based on the high-volume graphics card business that Nvidia shares with Intel and AMD. IBM stopped investing in the PowerXCell processors a few years back and has subsequently lost the game console processor and GPU business to AMD.

But IBM is not going to give up the datacenter without a fight, and it is allying itself with Nvidia in that fight. While IBM is certainly not going to tell customers who want to pair GPU coprocessors from Nvidia with its System x rack servers and NextScale hyperscale minimalist servers that they can’t have them, McCredie says that the development effort is really focused on the Power server platform.

Initially, IBM will be working with Nvidia to match up that company’s current “Kepler” and future “Maxwell” GPU accelerators with its twelve-core Power8 processors, which are due in the middle of next year.

The top-end Power8 processors running at 4 GHz will have eight threads per core (up from four threads per core for Power7 and Power7+). The chip will deliver about 2.5 times the performance of a Power7+ chip, socket for socket, across a wide variety of workloads, including Java, integer, floating point, memory streaming, and transaction processing. The Power8 chip will have 96 MB of L3 cache on the die and 128 MB of L4 cache implemented in the memory buffer controllers. It will offer 230 GB/sec of sustained memory bandwidth (2.3X that of the Power7+) and 48 GB/sec of peak I/O bandwidth (2.2X that of the Power7+) and is a performance beast by any measure.
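To put those feeds and speeds in context, the short Python sketch below works backward from the quoted figures. It assumes the twelve-core part with eight threads per core, and it treats the 2.3X and 2.2X multipliers as exact, so the Power7+ numbers it prints are baselines implied by the ratios quoted here, not official IBM specifications. The function names are purely illustrative.

```python
# Figures quoted in the article for the top-end Power8 chip.
POWER8_CORES = 12
POWER8_THREADS_PER_CORE = 8
POWER8_MEM_BW_GBS = 230.0   # sustained memory bandwidth, GB/sec
POWER8_IO_BW_GBS = 48.0     # peak I/O bandwidth, GB/sec
MEM_BW_RATIO = 2.3          # Power8 vs. Power7+, as quoted
IO_BW_RATIO = 2.2           # Power8 vs. Power7+, as quoted


def threads_per_socket(cores, threads_per_core):
    """Hardware threads available on one socket."""
    return cores * threads_per_core


def implied_baseline(new_value, ratio):
    """Older-generation figure implied by a 'ratio X' improvement claim."""
    return new_value / ratio


if __name__ == "__main__":
    threads = threads_per_socket(POWER8_CORES, POWER8_THREADS_PER_CORE)
    mem = implied_baseline(POWER8_MEM_BW_GBS, MEM_BW_RATIO)
    io = implied_baseline(POWER8_IO_BW_GBS, IO_BW_RATIO)
    print(f"Power8 hardware threads per socket: {threads}")
    print(f"Implied Power7+ memory bandwidth: {mem:.1f} GB/sec")
    print(f"Implied Power7+ I/O bandwidth: {io:.1f} GB/sec")
```

Because the multipliers are rounded in the source, the implied baselines should be read as approximations.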

This raises the question: Why does this Power8 chip need to be accelerated?

First, not every system using the Power8 chip will have all of these feeds and speeds. Many of these elements of the design are scaled back for entry and midrange systems. Second, even with all of that oomph, sometimes a GPU is going to offer much better bang for the buck and much better performance per watt, particularly on chunks of parallel code. That is just a fact of life in computing these days. IBM can ignore this fact at its peril, or it can embrace what it learned (and what it taught) with Roadrunner. And it must do what all HPC innovators should always do: Bring what it learned in the supercomputer lab into the enterprise datacenter.

At the moment, says McCredie, the plan calls for hooking Nvidia GPUs over PCI-Express 3.0 links to systems using the Power8 processors. In fact, says McCredie, IBM is in the process of tweaking the Power8 systems designs to better accommodate Tesla GPU cards. The GPU coprocessors will be woven into the complete Power Systems lineup, from small rack-mounted machines to big NUMA shared memory systems. McCredie was not at liberty to say which Power8 machines would come out first in the middle of next year. IBM tends to do a staggered rollout of the Power Systems platform with each new chip generation, usually starting in the middle and then working its way up and down the line as chip yields permit.
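A rough calculation shows why the width of that PCI-Express link matters. The Python sketch below computes the theoretical one-direction bandwidth of a PCI-Express 3.0 link from its 8 GT/sec per-lane signaling rate and 128b/130b encoding, then compares it with the 230 GB/sec of sustained memory bandwidth quoted for Power8. The x16 slot width is an assumption, since IBM has not said how the Tesla cards will be attached. Real transfers will also come in below this peak because of protocol overhead.

```python
PCIE3_GT_PER_SEC_PER_LANE = 8.0   # PCI-Express 3.0 raw signaling rate per lane
PCIE3_ENCODING = 128 / 130        # 128b/130b line encoding efficiency
POWER8_MEM_BW_GBS = 230.0         # sustained memory bandwidth quoted in the article


def pcie3_bandwidth_gbs(lanes):
    """Theoretical one-direction PCI-Express 3.0 bandwidth in GB/sec."""
    gigabits_per_sec = PCIE3_GT_PER_SEC_PER_LANE * PCIE3_ENCODING * lanes
    return gigabits_per_sec / 8.0  # bits to bytes


def transfer_seconds(gigabytes, bandwidth_gbs):
    """Best-case time to move a buffer of the given size over a link."""
    return gigabytes / bandwidth_gbs


if __name__ == "__main__":
    lanes = 16  # assumed slot width; not stated in the article
    link = pcie3_bandwidth_gbs(lanes)
    print(f"PCI-Express 3.0 x{lanes} peak: {link:.2f} GB/sec")
    print(f"Memory-to-link bandwidth ratio: {POWER8_MEM_BW_GBS / link:.1f}x")
    print(f"Best-case time to ship 4 GB to the GPU: {transfer_seconds(4.0, link):.3f} sec")
```

Under these assumptions the link is more than an order of magnitude narrower than the processor's path to its own memory, so minimizing and managing data movement between CPU and GPU is where much of the engineering effort goes.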

Over time, Nvidia and IBM will be working together to add features to their respective GPU and CPU chips to more tightly couple them. The Power8 chip has PCI-Express 3.0 peripheral controllers on the die, just like Intel’s Xeon E5 processors do, but IBM has done one better and created what it calls the Coherent Accelerator Processor Interface, or CAPI. This is an overlay that makes use of the PCI-Express transport to create a virtual memory space comprising the CPU main memory and any memory used by any kind of accelerator that plugs into the PCI-Express bus. By sharing the memory, the CAPI interface will work with the Tesla GPU accelerators and the virtual memory in the CUDA environment to manage the movement of data between main and frame buffer memory, transparently to the application. (Nvidia just announced unified memory between X86 CPUs and Tesla GPUs with CUDA 6 last week ahead of SC13.)

Hooking the Power chips and Tesla GPUs together more tightly is going to take some time. “They have to add stuff to their chip and we have to add stuff to our chip,” says McCredie. That means the CPU chip after Power8 – presumably to be called Power8+ if IBM is consistent – and a GPU after the current Kepler GPU – presumably the next-generation “Maxwell” GPU – will be more tightly coupled. IBM and Nvidia are being vague about the precise plan because it is early in the development cycle. What McCredie did say is that IBM is working to have clean-slate Power-Tesla hybrid systems, with that tight integration, around 2015 or so.

In the meantime, plenty of workloads can be accelerated using GPUs attached over PCI-Express buses without having a shared virtual memory space. IBM has set up an acceleration lab within its Software Group to identify what parts of the Software Group product catalog can be juiced with GPUs. “We’re just getting started in the process,” says McCredie.

IBM is very much interested in accelerating Java workloads, as EnterpriseTech has previously reported, but McCredie said that there are C and C++ applications in the IBM portfolio that can offload work to the GPUs. Database and streaming applications are low-hanging fruit perhaps, as are the obvious HPC applications in use at supercomputer centers around the world. Business intelligence, risk analysis, predictive analytics, and similar applications are also well suited to acceleration by GPUs.

Sumit Gupta, general manager of the Tesla Accelerated Computing business unit at Nvidia, says that the GPU maker will be porting its CUDA parallel application development tools to the Power architecture, and is working with IBM to figure out the best way to integrate the company’s Rational development tools and compilers for the Power chips with CUDA. McCredie says that the goal is for this to be a seamless programming experience with one unified tool for both CPU and GPU programming. IBM will itself need such tools to update its Software Group portfolio, and so will customers who write their own applications.

Power Systems machines support IBM’s own AIX variant of Unix, its IBM i proprietary platform, Red Hat Enterprise Linux, and SUSE Linux Enterprise Server. IBM has not said if all of these operating systems will be supported with GPU acceleration, but it is a given that Linux will be and a fair certainty that AIX will be. IBM i is always a question mark, mainly because the workloads on machines that use IBM i tend to be transaction processing and they are more I/O bound than compute bound.

Nvidia likes two things about this partnership with IBM. Nvidia gets a new set of workloads and customers to chase, and has a good, clean collaboration as well.

“This is a huge announcement for us,” says Gupta. “IBM’s future supercomputing roadmap is going to put Tesla GPUs front and center. This is a broader announcement than that, but given that it is SC13, we are casting this in a supercomputing frame. But this is the same hardware that will get shipped into enterprise datacenters. And the other thing is that our relationship with IBM can be really good because they do not have a competing accelerator product.”

The implication is that this is a much tighter collaboration than Nvidia can have with either Intel or AMD.