Thursday, September 07, 2006

Exactly How Many PCI-E Slots Does One Need?

Recently we saw Ageia's physics accelerator come to market as an add-on expansion card, astride your existing video accelerator and sound card. Next we saw the Killer NIC reviewed by IGN, a "network interface accelerator" which promises to offload the assembly of TCP and UDP packets to take load off the CPU. It's actually an embedded Linux instance, assembling your TCP stack instead of Windows. Windows just sends the Killer NIC raw datagrams, then the Killer NIC does all the work to disassemble the datagrams into bare signals over the wire.

So that eats at least six slots, but if you have double-wide video cards, more like eight. Oh yeah, and you'll spend nearly $1,500 on just the above hardware accelerators alone.

This is a cycle that repeats itself in the computer world, tho. Popular software algorithms become API's. API's become baked into hardware. The hardware becomes to inflexible and becomes firmware. The firmware isn't flexible so it goes into software. And on and on we go.

But this hasn't necessarily happened for OpenGL or graphics acceleration. Why? Because there you're not baking an API into silicon, you're using a different type of processing unit altogether to solve a different genre of problems. Multiple vector processing pipelines can streamline linear algorithms much more effectively that complex instruction set CPU's. The same analogy can be made for floating point units when they were introduced along side elder CPU's that could only handle integer math.

Not only that, we're looking at chip makers shy away from raw clock speeds and instead looking at cramming as many CPU cores as possible on a single die. This allows for every CPU stamped out of the factory to be a multiprocessor machine. Once API's become more threadsafe, this crazy specialized hardware will instead just be an API call sent to one of the four idle CPU's on someone's desktop.

I think the eventual result of this crazy hardwareization (feel free to use that one) is that people are going to want to fit their algorithms into one of three processing units: the CPU pool (by "pool" I mean a pool of processors upon which to dump your asymmetric threads), the VPU (vector processing unit) or the FPU. The CPU & FPU have already become as one but as any C/C++ programmer can attest to, the decision to use integer based math as opposed to floating point math still rests heavily on the programmer's mind.