Thursday, July 29, 2004

It's not as scary as it sounds.

Greg Black comments that he took a look at Joe Armstrong's thesis, which I linked to below. In case his description makes it sound intimidating: the discussion of the error-handling philosophy (let it crash) is one section of chapter 4 (i.e. about 3 pages). Of course, handling errors is only one step towards a reliable system.

In fact, the chapters of the thesis are largely approachable independently of each other. Chapters 2 and 4 (Architecture and Programming Principles) are particularly good in this regard.

In the meantime, for those who are feeling too lazy to read the actual PDF, an executive summary:

We don't know how to write bug-free programs

So every substantial program will have bugs

Even if we are lucky enough to miss our bugs, unexpected interactions with the outside world (including the hardware we are running on) will cause periodic failures in any long-running process

So make sure any faults that do occur can't interfere with the execution of your program

Faults are nasty, subtle, vicious creatures with thousands of non-deterministic side-effects to compensate for

So the only safe way to handle a bug is to terminate the buggy process

So don't program defensively: just let it crash, and make sure that:

Your runtime provides adequate logging/tracing/hot-upgrade support to detect/debug/repair the running system

You run multiple levels of supervisor/watchdog all the way from supervisor trees to automatic, hot-failover hardware clusters
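
To make the summary a little more concrete, here is a minimal sketch of the supervisor idea in Python. It is an illustration only, not code from the thesis: the names `supervise`, `make_flaky_worker` and `RestartLimitExceeded` are made up for this example, and the worker is a plain function rather than an isolated process, so it has none of the isolation Erlang gives you. The point is just the shape: the worker contains no defensive error handling, and the supervisor logs the crash, restarts the worker, and escalates once a restart limit is exceeded.

```python
import logging

logger = logging.getLogger("supervisor")


class RestartLimitExceeded(RuntimeError):
    """Raised when the worker keeps crashing; the next level up must decide."""


def supervise(worker, max_restarts=3):
    """Run worker() and restart it whenever it crashes.

    Returns (result, crashes) once a run completes normally.
    """
    crashes = 0
    while True:
        try:
            return worker(), crashes
        except Exception:
            crashes += 1
            # Record the crash with its traceback for later debugging.
            logger.exception("worker crashed (crash %d)", crashes)
            if crashes > max_restarts:
                # Escalate instead of retrying forever.
                raise RestartLimitExceeded(f"gave up after {crashes} crashes")


def make_flaky_worker(failures):
    """Hypothetical worker that crashes `failures` times, then succeeds.

    The shared counter stands in for a transient fault in the outside world.
    """
    remaining = [failures]

    def worker():
        # No try/except here: a fault simply crashes the worker.
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ValueError("simulated fault")
        return "done"

    return worker
```

With these definitions, `supervise(make_flaky_worker(2))` returns `("done", 2)`, while `supervise(make_flaky_worker(5), max_restarts=1)` raises `RestartLimitExceeded`, which is where the next supervisor up the tree would take over.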