You can build a $2,500 supercomputer – but what can you do with it?

Years ago, David Cheriton and others built a distributed OS – THOTH I think it was called, and the HARMONY extension to UNIX. Cheriton went off to build “The V System” in which there was a message-passing microkernel on each CPU and the processes, even the subroutines of the device drivers, were distributed. Essentially all (well, not quite all) subroutine calls were low-cost messages. The result of this was that the load was always balanced across all available nodes. The dining philosophers problem not only became trivial, but stayed trivial as more philosophers turned up and/or more tables and plates were added or subtracted.
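
To make the idea concrete, here is a minimal sketch in Python of a subroutine call expressed as a message. It is not the V System's interface: every name in it (node, call_by_message, run_demo) is a hypothetical stand-in, and threads play the role of CPUs. Because any idle node picks up the next message from a shared queue, the caller never decides where the work runs.

```python
import queue
import threading

def node(name, inbox, handled):
    """One 'CPU': receive a call message, run it, send the reply message."""
    while True:
        msg = inbox.get()
        if msg is None:                      # shutdown message
            break
        func, args, reply = msg
        handled[name] = handled.get(name, 0) + 1   # each node writes only its own key
        reply.put(func(*args))

def call_by_message(inbox, func, *args):
    """Reads like a subroutine call, but is really send + wait for the reply."""
    reply = queue.Queue(maxsize=1)
    inbox.put((func, args, reply))
    return reply.get()

def square(x):
    return x * x

def run_demo(node_count=3, calls=12):
    inbox = queue.Queue()                    # shared by all nodes
    handled = {}                             # how many calls each node served
    nodes = [threading.Thread(target=node, args=(f"node{i}", inbox, handled))
             for i in range(node_count)]
    for t in nodes:
        t.start()
    results = [call_by_message(inbox, square, n) for n in range(calls)]
    for _ in nodes:
        inbox.put(None)                      # one shutdown message per node
    for t in nodes:
        t.join()
    return results, handled

results, handled = run_demo()
```

Which node serves which call is decided at run time by whichever one is idle, so adding or removing nodes changes nothing in the calling code. That is the property I am after, though a real system would also have to move the data and state along with the call.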

We’ve now got to the point where we desperately need this technology. We’ve got two, four, sixteen or sixty-four processors on a chip, which is a real high-speed backplane! Stack a few of them with a high-speed switch like in this article ….

At the end of the article he says “another 10 years you’ll be able to have the equivalent of a 5,000 node Google cluster in your den.” Heck, using this technique of four boards in a mini-tower case with four CPUs on each board I can easily get a lot of parallel power on my desktop today.

But the point is that we don’t have the software that will spread the processing across it. We still have an architecture where one process lives on one machine and stays there.

Oh, I know about VMware, but that doesn’t do the micro-level migration that Cheriton could achieve. Right now, the Beowulf clusters are dedicated to specially written applications, like the chess-playing search tree.

Years ago (the 1980s) I wrote RPC-based applications using Sun’s RPC/XDR protocols. Since then I’ve seen applications with three or more tiers, like web front ends talking to database engines via TCP links. I’m now seeing RPC embedded in XML embedded in HTML for web sites. But it’s still about a complete process on a machine, with that process unable to migrate dynamically to an idle machine. Yes, I know about load balancers – that’s the same trap.
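
As a small illustration of the RPC-in-XML style, Python's standard xmlrpc.client module can marshal a call to text and back without any network. The method name add, its arguments and the handlers table are made up for this sketch.

```python
import xmlrpc.client

# Marshal a hypothetical remote call add(2, 3) into an XML-RPC request body.
request_xml = xmlrpc.client.dumps((2, 3), methodname="add")

# The receiving side unmarshals it back into arguments and a method name.
params, method = xmlrpc.client.loads(request_xml)

# A fixed dispatch table: whichever server process owns it does all the work.
handlers = {"add": lambda a, b: a + b}
result = handlers[method](*params)

# The reply is marshalled the same way and sent back to the caller.
response_xml = xmlrpc.client.dumps((result,), methodresponse=True)
```

Notice what the message carries: a method name and some arguments, nothing more. Where the callee runs is settled before the call is ever made, and nothing in the exchange lets it move.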

We need a new programming paradigm to deal with the new hardware.

Or perhaps we need new compilers that will break up the program into new modules. Of course some programmers will still use a style that fights the compiler.

Let’s see …. When the Macintosh first came out it had an overlay scheme borrowed from one of the not-quite-virtual-memory models of the IBM 360 range. The idea was that an application had modules and a dependency tree for them, so that not all the modules needed to be loaded at once. You could write:
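
Something along these lines, sketched here in Python rather than the compiled languages of the day. Only main() and do_Initialization() are named in this post; do_Processing() and do_Cleanup() are hypothetical stand-ins for the other modules:

```python
def do_Initialization():
    # Stand-in for the start-up module.
    return "initialized"

def do_Processing():
    # Hypothetical: the module that does the real work.
    return "processed"

def do_Cleanup():
    # Hypothetical: the shut-down module.
    return "cleaned up"

def main():
    # main() is tiny: it only calls each module in turn.
    steps = []
    steps.append(do_Initialization())
    steps.append(do_Processing())
    steps.append(do_Cleanup())
    return steps
```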

and compile that as one module. The “do_Initialization()” module, also compiled in parts, would load and then unload … and so on. So an 800k program might only need “main()” – at less than 1k – and the data and some other modules loaded, amounting to perhaps 250k. Great if your machine only had 256k!
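
The arithmetic behind that can be sketched with a toy loader. This is a Python model of the bookkeeping only, not the Macintosh mechanism itself; the OverlayLoader class, the module names and all the sizes are invented so that the total comes to 800k.

```python
class OverlayLoader:
    """Tracks which modules are resident and the peak memory they needed."""

    def __init__(self, sizes_kb):
        self.sizes_kb = sizes_kb
        self.resident = set()
        self.peak_kb = 0

    def resident_kb(self):
        return sum(self.sizes_kb[m] for m in self.resident)

    def load(self, name):
        self.resident.add(name)
        self.peak_kb = max(self.peak_kb, self.resident_kb())

    def unload(self, name):
        self.resident.discard(name)

    def run_phase(self, name):
        # An overlay is loaded, does its work, and is unloaded again.
        self.load(name)
        self.unload(name)

# Hypothetical module sizes in kilobytes.
sizes_kb = {"main": 1, "data": 99, "init": 150, "edit": 150,
            "render": 150, "print": 150, "cleanup": 100}
total_kb = sum(sizes_kb.values())

loader = OverlayLoader(sizes_kb)
loader.load("main")          # main() and the data stay resident throughout
loader.load("data")
for phase in ["init", "edit", "render", "print", "cleanup"]:
    loader.run_phase(phase)
```

With these made-up sizes total_kb is 800 but loader.peak_kb never exceeds 250, which squeezes under 256k. Compile everything as one module and the peak is the full 800.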

But LO!, some application developers (I recall Adobe being one of them!) didn’t Get It. They compiled the application into one big module. Perhaps this was deliberate so that you couldn’t run anything alongside it 🙂

Of course the advent of demand-paged virtual memory made all this moot. It had been a technique to allow for lower cost hardware – even back in the 360 days. The cost of the additional hardware for instruction interruption and restart was non-trivial back when. Now, it’s all just on the chip.

But the approach to distributed programming that Cheriton illustrated in his papers on the V System did require a new paradigm. In the same way that classical SQL (i.e. before cursors) turned the nested “for each” blocks inside out, so too did Cheriton’s approach turn subroutines inside out.
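
The SQL half of that analogy is easy to show. In this sketch (Python with the standard sqlite3 module; the customers and orders tables are invented for the example) the first version spells out the nested “for each” loops, while the second only states the result wanted and leaves the iteration to the engine.

```python
import sqlite3

customers = [(1, "Ada"), (2, "Bob")]
orders = [(10, 1, "disk"), (11, 2, "tape"), (12, 1, "ram")]

# Procedural style: the programmer writes the nested "for each" loops.
nested = []
for cust_id, name in customers:
    for order_id, order_cust, item in orders:
        if order_cust == cust_id:
            nested.append((name, item))

# Set-oriented style: declare the join and let the engine choose how to loop.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, item TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", customers)
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
declared = db.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id"
).fetchall()
db.close()
```

Both produce the same pairs, but only in the second is the order and placement of the work left open. That freedom is what a runtime would need in order to spread subroutine calls across nodes.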

Certainly this is going to be an area for research if massively multi-node computing is going to end up on the desktop.