
IBM, Rapport’s Kilocore, and reconfigurable computing

IBM and start-up Rapport team up on a Cell-like design with 1,024 cores. Also …

One of the bigger announcements this past week, at least in terms of hype, was Tuesday's news that IBM will work with a small start-up called Rapport to provide a massively parallel, reconfigurable computing device capable of extremely high performance/watt numbers.

I haven't commented on this yet, because I was trying to get more technical details on the technology. Rapport didn't bother to return my call, so I'll go ahead and comment based on what little I can ascertain about it from the coverage and the press release. (If one of the Rapport tech people is reading, go yell at the PR department for me.) First, the facts that are out there.

Rapport has announced a processor dubbed "Kilocore" because it combines 1,024 specialized eight-bit processor cores and a general-purpose PowerPC core onto a single chip. The connections between the cores can be reconfigured arbitrarily on the fly, so that the chip can change configurations to suit a specific workload. Because it's a field-programmable gate array (FPGA), Kilocore's clock speed is quite low. I couldn't find an actual number, but Kilocore's predecessor is a 256-core design that runs at 125MHz and uses under a watt of power.

IBM and Rapport are making some incredible performance/watt claims about the device. From the NYT:

At a computing conference scheduled to begin in San Jose, Calif., on Tuesday, Rapport will demonstrate the chip processing a stream of video images. While a standard industry microprocessor chip, the ARM 7, can process 3.3 images a second while consuming half a watt of power, the new Rapport chip will convert 30 frames a second while consuming only 100 milliwatts, about one-fifth the power.

He said that was a power-efficiency ability roughly 50 times the current industry standard component. The power savings are obtained by radically lowering the energy used in conjunction with each separate computing element in the system.
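As a sanity check on that "roughly 50 times" figure, here is the arithmetic using only the numbers in the quote above. This is my own back-of-the-envelope sketch, not anything from Rapport or IBM; the variable and function names are made up for illustration, and "frames per joule" is simply frames per second divided by watts.

```python
# Figures taken from the NYT excerpt; everything else is illustrative.
ARM7_FPS, ARM7_MILLIWATTS = 3.3, 500          # 3.3 images/s at half a watt
KILOCORE_FPS, KILOCORE_MILLIWATTS = 30.0, 100  # 30 frames/s at 100 mW


def frames_per_joule(fps, milliwatts):
    """Frames processed per joule of energy: (frames/s) / (joules/s)."""
    return fps * 1000 / milliwatts


arm7_eff = frames_per_joule(ARM7_FPS, ARM7_MILLIWATTS)
kilocore_eff = frames_per_joule(KILOCORE_FPS, KILOCORE_MILLIWATTS)

throughput_gain = KILOCORE_FPS / ARM7_FPS             # about 9x more frames
power_reduction = ARM7_MILLIWATTS / KILOCORE_MILLIWATTS  # 5x less power
ratio = kilocore_eff / arm7_eff                        # product of the two

print(f"ARM 7:    {arm7_eff:.1f} frames per joule")
print(f"Kilocore: {kilocore_eff:.1f} frames per joule")
print(f"Advantage: about {ratio:.0f}x")
```

On these numbers the advantage comes out at about 45x (roughly nine times the frame rate at one-fifth the power), which is in line with the "roughly 50 times" claim, at least for this one video demo.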

Because Kilocore is clocked so low, it relies for performance on a combination of brute force and the ability to reconfigure itself for a custom fit to the software being run. This approach also enables it to do much more with much less power than a more general-purpose design.

Roadblocks ahead

Reconfigurable computing, where chips have interconnects between logic blocks that can be changed on the fly by software, has long been looked at as a sort of holy grail. After all, wouldn't it be cool if your GPU could reconfigure itself for, say, high-end audio processing during non-gaming periods when most of its hardware is going unused? The idea of semiconductor devices that can be dynamically reprogrammed in order to custom-fit a particular workload at a particular moment is incredibly attractive. Still, there are long-standing problems with the idea.

Aside from the fact that you just can't clock an FPGA as fast as a normal CMOS device, the whole notion of using software to reconfigure the underlying hardware so that it runs the software better is fraught with all kinds of inherent complexities. These complexities can range from annoying to impossible, depending on how you plan to use the device.

The classic fantasy of reconfigurable computing is something like self-modifying code on steroids and strapped to a fusion rocket: you write code that modifies the underlying hardware over the course of its own execution in response to changing inputs and constraints. If you thought multithreaded programming was a hard problem, how about multithreaded programming on a device where one part of your code is responsible for dynamically reconfiguring the hardware resources that the rest of the application is running on? Such code is a beast to write, and it's even harder to debug.

(Regarding debugging, imagine running into a bug that doesn't consistently repeat itself—a common problem in multithreaded programming. You'd have to ask, "is the problem in my algorithm, in my application code, in the hardware configuration I've constructed for the device, or in the code that's actually (re)configuring the hardware to run my application code? Or is it some combination of those factors? Or is the hardware broken?")

As near as I can tell, what Kilocore appears to be doing is something less complex, but still challenging. It looks like you'll use Kilocore by initializing it in a certain configuration, and then loading and running a program that's designed for that specific configuration. There's still an added layer of complexity here, and even more importantly, you also need coders who can do this kind of hardware design.
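To make that configure-then-run model concrete, here is a toy sketch in Python. To be clear, this is not Rapport's API or toolchain, about which I have no details; `ToyFabric`, `Configuration`, and `Program` are names I invented, and the only thing the sketch models is the coupling between a program and the hardware configuration it was written for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """A hypothetical named arrangement of cores and interconnects."""
    name: str
    cores_used: int


@dataclass(frozen=True)
class Program:
    """A hypothetical program built against one specific configuration."""
    name: str
    requires: Configuration


class ToyFabric:
    """Toy stand-in for a reconfigurable device (not a real Kilocore model)."""

    def __init__(self, total_cores=1024):
        self.total_cores = total_cores
        self.active = None  # no configuration loaded yet

    def configure(self, config):
        """Step 1: put the device into a given configuration."""
        if config.cores_used > self.total_cores:
            raise ValueError(f"{config.name} needs too many cores")
        self.active = config

    def run(self, program):
        """Step 2: run a program, but only on the configuration it expects."""
        if self.active != program.requires:
            raise RuntimeError(
                f"{program.name} was built for {program.requires.name}"
            )
        return f"{program.name} running on {self.active.name}"


video_cfg = Configuration("video-pipeline", cores_used=768)
audio_cfg = Configuration("audio-pipeline", cores_used=256)
decoder = Program("video-decoder", requires=video_cfg)
resampler = Program("audio-resampler", requires=audio_cfg)

fabric = ToyFabric()
fabric.configure(video_cfg)
print(fabric.run(decoder))
```

Calling `fabric.run(resampler)` at this point raises a `RuntimeError`, because the device is still set up for video. That mismatch is the extra layer in a nutshell: someone has to design both the configuration and the program that depends on it.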

Part of the attraction that IBM seems to have to Kilocore is that, by their own admission, it's a lot like Cell, i.e., there are a large number of small processing elements under the control of a larger, general-purpose CPU. Because of the Cell-like nature of the design, I don't think it's too far-fetched to assume that Kilocore is also meant to be used by applications that must deal with dynamic (probably OS-driven) changes in the number of cores they use. As with Cell applications, which can also use more or fewer PEs depending on resource availability, modularity and atomicity will be the keys to making this level of on-the-fly hardware reconfiguration work.
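Here is a rough illustration of what I mean by modularity and atomicity, again as a plain Python sketch rather than anything specific to Cell or Kilocore. The function names are hypothetical, and the "PEs" are just lists standing in for processing elements. The point is that if the work is cut into small units that don't depend on one another, the same job gives the same answer no matter how many PEs happen to be available.

```python
def split_into_units(data, unit_size):
    """Cut the input into small, self-contained work units."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def process_unit(unit):
    """Atomic piece of work: depends only on its own unit (sum of squares)."""
    return sum(x * x for x in unit)


def run_on_pes(units, num_pes):
    """Deal units out round-robin to however many PEs are available."""
    if num_pes < 1:
        raise ValueError("need at least one processing element")
    queues = [[] for _ in range(num_pes)]
    for index, unit in enumerate(units):
        queues[index % num_pes].append(unit)
    # Each PE works through its own queue; the controller combines the results.
    partials = [sum(process_unit(u) for u in queue) for queue in queues]
    return sum(partials)


units = split_into_units(list(range(100)), unit_size=8)
with_many = run_on_pes(units, num_pes=16)
with_few = run_on_pes(units, num_pes=4)
print(with_many == with_few)
```

A real scheduler would also have to cope with PEs being taken away in the middle of a job, which this sketch ignores entirely.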

Due to the complexities inherent in programming such devices, you have to develop the software for something like Kilocore specifically to take advantage of the chip's peculiar abilities. In other words, you don't just "port" a random application to Kilocore in the traditional sense. Rather, software development for a chip like Kilocore is more akin to something like firmware writing, because you're focused on implementing a specific set of functions on that specific device.

This being the case, Rapport will have to sell this device by moving from application to application, demonstrating how they made it work in each particular case. It'll be something along the lines of, "here's Kilocore decoding H.264 video, where it currently provides an X percent performance/watt advantage; here's Kilocore accelerating search queries, where we're getting a Y percent performance/watt advantage; here's Kilocore doing audio conversion, where we've managed to get it up to a Z percent performance/watt advantage; etc." You just don't brag to potential buyers about how many GFLOPS it does and at how many watts, and then expect that their programmers will take it from there.

Ultimately, Kilocore is a lot like Cell, but with an added layer of complexity. And as I said above, that new layer is a fairly large, non-trivial layer.

All of that having been said, I understand that great progress has been made in the past few years in the area of reconfigurable computing. That this device, or something like it, will succeed by moving from niche to niche is probably inevitable. In spite of its complexity, reconfigurable computing remains far too compelling to write off.