... fault-tolerant computing. Not hardware fault tolerance, but software fault tolerance, which is a relatively new option.

The issue for software developers is that fault tolerance is structural; we need to modularize code differently:

"...compiler writers have assumed perfect hardware and contended that they can provide good fault isolation through static compile-time type checking. In contrast, operating systems designers have advocated run-time checking combined with the process as the unit of protection and failure.... if a process or its processor misbehaves, stop it. The process provides a clean unit of modularity, service, fault containment, and failure." p16

"Each independent activity should be performed in a completely isolated process. Such processes should share no data, and only communicate by message passing. This is to limit the consequences of a software error. As soon as two processes share any common resource, for example, memory or a pointer to memory, or a mutex etc. the possibility exists that a software error in one of the processes will corrupt the shared resource." p19

Granted, I haven't read all the links in the previous post, but I do have a couple of questions:

1. What would happen to the whole system if a process *does* fail? (Assuming system failure is not an option.) I'd imagine a) a redundant process would take over without perturbing the client, b) the remaining processes would somehow route around the failure, or c) a replacement process starts up, and the rest of the system explicitly retries and re-resolves references to that process. Transparent failover (option a) requires loose coupling between processes; fallback and retry (options b and c) might entail the sort of error-handling policies that make ordinary exception handling such a pain. All three seem more appropriate for programming larger subsystems than for functions and small objects.
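Option (c) can at least be sketched concretely. The following is a rough Java illustration, not anyone's actual framework: a hypothetical registry maps a logical name to a service, and the caller re-resolves the reference on each retry, so a replacement registered under the same name is picked up transparently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class RetryResolve {
    // Hypothetical registry mapping logical names to service instances.
    static final Map<String, Supplier<String>> registry = new ConcurrentHashMap<>();

    // Option (c): on failure, re-resolve the reference and retry.
    static String callWithRetry(String name, int maxAttempts) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            Supplier<String> svc = registry.get(name); // re-resolve each attempt
            try {
                return svc.get();
            } catch (RuntimeException e) {
                last = e; // the old reference may be dead; the loop re-resolves
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // The first "process" crashes, but (simulating a supervisor) a
        // replacement registers itself under the same logical name.
        registry.put("worker", () -> {
            registry.put("worker", () -> "ok");
            throw new IllegalStateException("process died");
        });
        System.out.println(callWithRetry("worker", 3));
    }
}
```

The caller never holds a durable reference to the failed instance; the logical name is the stable handle, which is essentially what makes the retry cheap.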

2. Is the Eiffel/Meyer/DbC method somehow inconsistent with the Erlang/CSP vision? Each "pipe" into a process might have its own preconditions for allowed messages, and a contract failure within the process would emit an error reply. Eiffel's exception mechanism allows for trapping a contract violation in a called routine and trying an alternative that meets the caller's contract ... which might be "return valid result or error". Again, this sounds like a technique for autonomous components in a large system, not a programming technique required for even the smallest objects/functions.
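The "precondition per pipe, error reply on contract failure" idea in question 2 can be sketched in a few lines. This is a toy Java example of my own, not Eiffel or Erlang machinery: the handler checks a precondition on the incoming message and turns a contract violation into an error reply rather than a crash.

```java
public class ContractPipe {
    // Hypothetical message handler: a precondition guards the "pipe",
    // and a contract failure becomes an error reply, not an exception.
    static String handle(int amount) {
        if (amount <= 0) {                 // precondition on the message
            return "error: amount must be > 0";
        }
        return "ok: processed " + amount;  // the caller's contract is met
    }

    public static void main(String[] args) {
        System.out.println(handle(5));
        System.out.println(handle(-1));
    }
}
```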

I'm posting this for Ulf Wiger - seems that the Artima registration process failed for him :-(

1) If system failure is not an option, you have to go with hardware redundancy, so a process might be allowed to crash the processor/OS it is running on. This is important, as it allows you to write "kernel processes" that have to be assumed correct for the node to be operational. In Erlang, you can build a system using multiple "Erlang nodes", where distribution aspects can be either transparent or explicit, depending on the role of your program. This is how redundancy is normally implemented, and it can be done in several ways, depending on requirements:

a) Hot standby: typically, a process on another computer would monitor the active process, and the two would employ some replication protocol to stay in sync. This implies quite explicit exception handling on the part of the standby process. However, the logic required can be packaged as a reusable framework, so that the process assuming the active role is notified through a simple callback function.

b) Cold standby: the Erlang nodes can be configured so that the applications running on one node will be restarted on another in case of failure. The applications can detect that they are starting due to "failover" from another node, or they can start as they normally do.

2) A process crash does not have to lead to a node crash. Erlang's "process linking" concept can be used in a variety of ways.

a) The default behaviour is that if a process dies, all processes linked to it will also die. This is called "cascading exit", and allows you to clean up a fairly large amount of work automatically.

b) A process that wants to take action when another dies can trap exits. Example: if process A wants to open a file, the file library spawns a process B that opens the file and acts as a middle man; B becomes A's file handle. If A dies, B, having linked itself to A and trapping exits, detects this (it receives an 'EXIT' message from A), closes the file, and then exits.

c) Supervisors are special processes built on the linking concept. If a supervised process dies, it is restarted with default values by its supervisor. If necessary, the supervisor can be configured to restart a group of processes, as this may simplify the re-synchronization. If the restart frequency exceeds a configured limit, the supervisor exits, and lets the next-level supervisor handle the situation (escalated restart.)

d) Re-acquiring a process handle may not be necessary. A process can register itself using a logical name, and other processes wanting to talk to it can use the logical name as the destination for message sending. After a crash, the new process registers under the same name, and other processes may never know the difference.
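For readers without Erlang at hand, the supervisor idea in (c) can be approximated in plain Java. This is a rough analogue I wrote for illustration, not the OTP supervisor itself: restart the child on failure, and escalate (rethrow) once a restart limit is exceeded.

```java
public class Supervisor {
    interface Task { void run() throws Exception; }

    // Rough Java analogue of an Erlang supervisor: restart the child
    // on failure, and escalate once the restart limit is exceeded.
    static void supervise(Task child, int maxRestarts) throws Exception {
        int restarts = 0;
        while (true) {
            try {
                child.run();
                return; // child finished normally
            } catch (Exception e) {
                restarts++;
                System.out.println("child died: " + e.getMessage());
                if (restarts > maxRestarts) {
                    // escalated restart: let the next-level supervisor decide
                    throw e;
                }
                System.out.println("restarting (attempt " + restarts + ")");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        supervise(() -> {
            if (attempts[0]++ < 2) throw new IllegalStateException("boom");
            System.out.println("child completed");
        }, 5);
    }
}
```

The real thing restarts the child in a known clean state rather than resuming it, which is the point: recovery means going back to a state you understand.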

3) Erlang doesn't really use Design by Contract, but relies rather heavily on pattern matching. For example, the function file:open(File, Mode) is defined so that it returns {ok, FileDescr} or {error, Reason}. A typical call to this function would be formulated:

{ok, Fd} = file:open("foo.txt", [read]).

This means that the caller asserts that the returned value is a 2-tuple whose first element is the constant 'ok', and whose second is some object that becomes bound to the free variable Fd. If the function instead returned, e.g., {error, enoent}, the caller would crash. This is called "programming for the correct case", and is widely used in Erlang. It works wonderfully for both large and small systems. Pattern matching can also be used on the inputs to a function. For example, the function hd(List), which extracts the first element of a linked list, could be written:

hd([Head|_]) -> Head.

Meaning that the function will only accept as input a list containing at least one element (_ is a "don't care" pattern, and in this case represents the tail of the list.) Any other input will cause a function_clause exception. This could also be written explicitly as:

hd(List) ->
    case List of
        [Head|_] -> Head;
        _ -> erlang:error(function_clause)
    end.

Over the years, I've discovered that the primary design criterion for fault-tolerant software is putting responsibility for system operation in the right place.

It should always be simple to return the system to a useful state. If it's not simple to recover, you're probably not recovering at the right place.

Granted, there are many systems, like old telephone switch code written in C, that have countless explicit fault-recovery handlers. But even without formal language support, it is possible to establish levels of responsibility that allow code at each point in the call graph to take straightforward cleanup actions as the error condition is passed up the stack. Languages with built-in exception handling let systems be designed for fault tolerance more readily.
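The "clean up locally, recover at the responsible level" pattern above looks like this in a language with exceptions. A minimal sketch with invented method names: each level releases only its own resources in a finally block, while a single level near the top owns the recovery decision.

```java
public class Cleanup {
    // Simulated low-level fault.
    static void lowLevel() {
        throw new IllegalStateException("device fault");
    }

    // The middle level cleans up its own resources as the exception
    // passes through, but makes no recovery decision.
    static void midLevel() {
        System.out.println("mid: acquire buffer");
        try {
            lowLevel();
        } finally {
            System.out.println("mid: release buffer"); // local cleanup only
        }
    }

    public static void main(String[] args) {
        try {
            midLevel();
        } catch (IllegalStateException e) {
            // The top level owns the policy: return to a useful state.
            System.out.println("top: recovered from " + e.getMessage());
        }
    }
}
```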

In Java, long-running threads should always catch all exceptions. In many cases, it is even desirable to catch all throwables. Early in the lifecycle of many Java programs, it is not uncommon to have the heap undersized, and/or to suffer a thread explosion that can result in OutOfMemoryError being thrown, terminating long-running threads that don't catch Error.

Thread explosions are a great DoS attack on JVMs. So it is almost always a good idea to catch Throwable at the top of long-running threads. Threads that create more threads, or that allocate large memory collections, need to have their life cycle carefully managed.
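Concretely, "catch Throwable at the top of the thread" looks like this. The fault here is simulated by throwing OutOfMemoryError by hand, just to show that the loop survives an Error, not only an Exception:

```java
public class RobustWorker {
    static void doWork(int i) {
        if (i == 1) throw new OutOfMemoryError("simulated");
        System.out.println("work " + i + " done");
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                try {
                    doWork(i);
                } catch (Throwable t) {
                    // Catch Throwable, not just Exception, so an Error
                    // does not silently terminate the long-running thread.
                    System.out.println("iteration " + i + " failed: " + t);
                }
            }
        });
        worker.start();
        worker.join();
        System.out.println("worker still completed its loop");
    }
}
```

Whether the thread can actually do anything useful after a real OutOfMemoryError is a separate question; the point is that it gets to decide, instead of vanishing.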

Humans are not faultless, and hardware will fail. The design paradigm that Bill Venners mentions, where preconditions are asserted and exceptions declared for them, causes callers to take note of particularly important failure modes.

If you don't document failures, expect unexpected failures: callers will be less prepared to deal with bugs in the implementation of what they are calling.

"the only safe way to execute multiple applications, written in the Java programming language, on the same computer is to use a separate JVM for each of them, and execute each JVM in a separate OS process."

Creating very robust, even fault-tolerant, systems gets very expensive very quickly. When you find there is a possibility of a fault, you go in and try to fix the bug or bugs as much as possible. One result is that the fault-tolerant code you then add to the system is less useful than originally planned, since you went ahead and fixed the bug, or some of them. Trying to build software that tolerates faults (or symptoms) the programmers never considered is considerably harder. Brute retry has its limits.

Compartmentalization at the application level is very useful. Just one bank account will be trashed, requiring recovery from backup tapes, or just one ATM transaction will be rudely rejected, requiring the customer to retry, or just one phone call will be dropped in the middle of the conversation, requiring the caller to redial. Or just one bomb goes awry and sails off into the distant desert sands...

Very few organizations have the resources to create fault tolerant super-reliable software. Many more organizations, however, are able to benefit from using such software.

The field of fault-tolerant software was born in the days of much less reliable computer hardware, as an attempt to deal with failing computers as well as buggy software. Fault tolerance added by a buggy organization, however, has its limits: the same organization that put the bugs in the software is now putting in more software, with fault tolerance, to survive the previous batch of bugs.

Retrying disk I/O is still standard for very ordinary computers, but the Sequent/Tandem technique of hot-swappable CPUs has gone out of style. It may be coming back with blades and Google-style applications.

"The approach I suggest for exception handling is more low profile than the approach that seems to have become popular these days. In most recent programming languages, exceptions are a normal part of life. For example, exceptions figure prominently in the specification of operations. The exception handling strategy that I've pushed for is more low profile in the sense that it views exceptions really as what happens when everything else has failed, and you don't have much of a clue as to what is going on except that something is seriously wrong. The best you can really do is to try to restart in a clean state."

But why not use exceptions as a normal programming construct? I mean, if they are the most efficient way to program something, why program it with some other artificial construct and not use exceptions? It's like saying that you can use a for loop only to iterate over a list and in all other circumstances you should use a while loop. It might be sensible advice, but it just feels a bit artificial.

> Java, long running threads, should always catch all exceptions...
>
> "the only safe way to execute multiple applications, written in the Java programming language, on the same computer is to use a separate JVM for each of them, and execute each JVM in a separate OS process."
>
> Multitasking without Compromise: a Virtual Machine Evolution
> http://research.sun.com/projects/barcelona/papers/oopsla01.pdf

There are in fact a number of ways to isolate multiple applications within the same VM and get the vast majority of what you need. The most prominent isolation feature in Java is custom class loading that keeps Class.forName() from having unlimited access to parent class loaders. Applications that are not meant to live in the same Java program should not appear as the same Java program. Thus, each can be allocated a ClassLoader whose parent is not a ClassLoader that loads classes for the other applications.
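A small demonstration of that visibility rule, using only the standard URLClassLoader: a loader with no URLs and the bootstrap loader (null) as parent can still resolve core classes, but cannot see application classes loaded elsewhere.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class IsolationDemo {
    public static void main(String[] args) throws Exception {
        // No URLs of its own, and the bootstrap loader (null) as parent:
        // core classes resolve, application classes do not.
        URLClassLoader isolated = new URLClassLoader(new URL[0], null);

        // Core classes still resolve through the bootstrap loader.
        System.out.println(isolated.loadClass("java.lang.String").getName());

        // This very class, visible to the default application loader,
        // is not visible through the isolated loader.
        try {
            isolated.loadClass("IsolationDemo");
            System.out.println("visible");
        } catch (ClassNotFoundException e) {
            System.out.println("IsolationDemo not visible in isolated loader");
        }
    }
}
```

Real isolation schemes hang a separate loader per application off a shared parent that holds only the common classes, but the parent-delegation rule shown here is what makes them work.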

If the JVM has bugs, then you will not be isolated from those. However, this would always be the case, whether you ran in a single JVM, or one per application.

As more sharing is implemented in HotSpot, we will see the cost of multiple JVMs on a single host decrease. But in the end, the heap and all of the issues with memory management are still the predominant factor in a JVM's memory use, and thus I don't hold out much faith that we'll ever get more than one JVM in a gigabyte of memory for any serious application.