
I’m still amazed at how much faster my generation is at programming than our students are. What gives? It can’t just be the 10,000-hour practice rule.

It occurs to me that there is one big difference in our experience (beyond the fact that we had to learn on machines that were pathetic by today’s standards): we could only input the code for a little bit of time. There was competition for the terminal or card punch, and you only got an hour or so. So you learned to think about your problem and its solution (i.e. your code) and kept a fair chunk of the program and algorithm in your head.

So it probably takes me just as long to write the program; it’s just that a lot of it is done offline.

One of the continually promising and seldom-delivering ideas in computer engineering is “grid computing”. The idea is to couple a bunch of machines together and use the slack computing time to form a truly massive computational resource. It has always been promising, but it never really works.

Distributed computing, where individual machines communicate in a safe internal environment with reliable communications, does work and is a routine tool in software development.

So what is the difference? And, more importantly, what can we do about it?

There are three basic differences between the grid environment and the distributed environment and they are probably responsible for the differences in usefulness.

Authentication/security. In a distributed environment the individual machines trust each other: either they are connected on a high-speed local network or they share a memory bus. There is a central authentication/accounting authority, and when a user logs into one part of the machine, he is authenticated for the whole system. In a grid, the machines have to have some way of transferring authentication, and this is fraught with potential security holes. For example, if one machine in the grid uses weak user authentication or an out-of-date certificate, it becomes a gateway for undesirable activity on the whole grid. Messages can also be packet-sniffed, revealing sensitive information, or faked to subvert the computation.

Computing Reliability. The grid is a heterogeneous mix of machines, and the grid user is typically the lowest priority guest on each machine. Since the grid user is a guest who may be preempted by any other user, it is difficult to assert a reliable time frame for the calculations. The heterogeneous nature of the grid means that the software has to be verified to work on several architectures and that silly issues like default word sizes do not interfere with the exchange of data. Additionally, individual machines will be turned off for maintenance (we’ll assume hardware failures are equally likely) on an arbitrary schedule that is not necessarily coordinated across the grid. What happens when a task is checkpointed, the control program spawns a new task (on some other resource), and then the task resumes and tries to re-integrate into the program?
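That re-integration race can be handled by making task results idempotent. Here is a minimal sketch (the `Coordinator` class and its names are hypothetical, not any real grid framework’s API): the control program accepts only the first result reported for each task id, so a checkpointed task that resumes after its replacement has already finished is simply ignored.

```python
# Hypothetical sketch: tolerate the "resumed task vs. respawned task" race
# by accepting only the first result per task id (first writer wins).

class Coordinator:
    def __init__(self):
        self.results = {}  # task_id -> result

    def submit_result(self, task_id, result):
        """Return True if accepted, False if another copy already finished."""
        if task_id in self.results:
            return False   # a respawned copy beat us to it; drop this one
        self.results[task_id] = result
        return True

coord = Coordinator()
first = coord.submit_result("task-7", 42)   # respawned copy reports first
second = coord.submit_result("task-7", 42)  # resumed original is discarded
```

Because both copies compute the same pure result, dropping the duplicate loses nothing.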

Communications Reliability. The worst case for communications in conventional distributed computing is a Beowulf cluster on consumer-grade Ethernet (say 100BASE-T nowadays). Other than some weird contention issues if you try to use a broadcast algorithm (switches don’t magically expand channel capacity; they just pick who gets to talk), messages between processes simply work. Replacing that network with an arbitrary internet connection, which probably should be encrypted and certainly should be internally checksummed, reduces both the speed and the reliability of communication. It is hard enough to avoid deadlocks without having to worry about the message never arriving.

So what can we do about this?

Algorithm changes. If you can’t have reliable communications or computing, then don’t depend on them. Many problems, most notably in simulations, are simply decomposable into smaller problems that can run independently and have their results merged. In many cases these multiple-run algorithms can produce better results than an individual run. Think “task parallel”. Ideas like granular learning machines, Monte Carlo algorithms, and Markov chain models suggest themselves.
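As an illustration of the task-parallel idea (a sketch in Python, not a grid framework): estimate pi by firing off independent Monte Carlo runs, each with its own random stream, and merging the counts at the end. Losing a run just means merging fewer results.

```python
# Task-parallel Monte Carlo sketch: independent runs, merged afterwards.
import random
from concurrent.futures import ThreadPoolExecutor

def run_trial(seed, n=100_000):
    rng = random.Random(seed)  # independent random stream per task
    hits = sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return hits, n

# Each call to run_trial is self-contained and could run on any machine.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_trial, range(8)))

hits = sum(h for h, _ in results)
total = sum(n for _, n in results)
pi_estimate = 4.0 * hits / total
```

The merge step here is a trivial sum; for learning machines it might be a vote or an average, but the structure is the same.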

Language changes. Functional languages that are designed for concurrent programming (did anyone mention Erlang?) and that are designed to have no side effects (i.e. stateless functions) are highly adaptable to a grid environment. The function evaluations can be fired off over a pool of servers and, if they don’t return in time, re-fired. MapReduce is “trivial” when written with the list-comprehension primitives of a functional language. The variable latencies in communication and computation don’t really matter with a highly granular approach to software.
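The “fire and re-fire” trick works precisely because the functions are pure. A sketch of the idea (in Python rather than Erlang; `fire_with_retry` and its parameters are illustrative, not a real grid API): submit the task, wait with a timeout, and resubmit if nothing comes back, which is safe only because retries have no side effects.

```python
# Sketch: re-fire a pure function if it does not return in time.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def pure_task(x):
    return x * x  # no side effects, so running it twice is harmless

def fire_with_retry(pool, fn, arg, timeout=5.0, attempts=3):
    for _ in range(attempts):
        future = pool.submit(fn, arg)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            future.cancel()  # give up on the slow copy and re-fire
    raise RuntimeError("all attempts timed out")

with ThreadPoolExecutor(max_workers=2) as pool:
    value = fire_with_retry(pool, pure_task, 7)
```

With stateful tasks this pattern would be dangerous (a slow copy might still commit its effects); statelessness is what makes the retry free.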

I just read one of the email newsletters that comes from the ACM, in which a scientific panel was lamenting the sorry state of simulation science in the USA. To paraphrase: the field is flat, there is no new algorithm development, and the competition can buy similar-performance machinery, so we’re losing our edge.

As a computer scientist who works in simulation, this comes as no surprise. In the application domains, the “scientists” are wedded to standard programs and cannot conceive that someone other than “the master” could write useful code or develop algorithms. I’ve personally seen a major simulation group publish their version of an algorithm as the “first example” when others had published close or identical versions before them. Similarly, the “high priests” of computer science have decided that programming isn’t a necessary skill, which is fine until you need an instantiation of a new algorithm that actually works. (Programming is not computer science in precisely the same way that spelling and grammar are not parts of English literature.) Matlab and similar programs have a similar effect: people learn incantations that work but do not understand the mathematics behind them. In fact, students who depend on these programs are singularly useless, because they haven’t put in the work to understand what they are doing.

Until there is real support for research in simulation algorithms, and not just running code, this sorry state of affairs won’t change.