Fault Management for Realtime Networks

Anindo Banerjea

Realtime networks provide guaranteed-performance communication to applications that require such guarantees. The Tenet Group, under Professor Domenico Ferrari, has developed a scheme to provide such Quality of Service (QoS) guarantees in a packet-switched internetwork. The scheme is based on the concept of the realtime channel, which is a network connection with associated traffic specifications and performance guarantees. Other schemes to provide realtime services exist, and share some fundamental similarities. However, none of these schemes addresses the problem of how to restore (or continue to provide) realtime service in the presence of network faults. This dissertation addresses the problem of dealing with faults in the context of realtime networks, using two classes of mechanisms: proactive and reactive.

Reactive mechanisms can be used to reroute realtime channels to surviving links in the network. This approach does not use any extra resources in the absence of faults, but involves a disruption of the service while the recovery action is being performed. In addition, some channels may not be successfully rerouted if the realtime load on the network is very high. Proactive schemes may be used to reserve redundant resources (such as on multiple paths in the network) so that there is no disruption (or only disruption that can be bounded a priori), as long as the fault scenario is one that is covered. The proactive scheme may be designed to cover fault scenarios such as single faults, double faults, and so on.
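The reactive approach can be illustrated with a minimal sketch. It is not the dissertation's algorithm: it assumes a hypothetical network model in which each undirected link carries a single spare-capacity number, and it uses a plain breadth-first search to look for a surviving path that can still admit the channel. The names `reroute`, `links`, `demand`, and `failed` are invented for this illustration.

```python
from collections import deque

def reroute(links, src, dst, demand, failed):
    """Find a path from src to dst that avoids failed links and has at
    least `demand` spare capacity on every hop. Returns a list of nodes,
    or None if no such path exists.

    links:  dict mapping an undirected link (u, v) to its spare capacity
            (hypothetical model, one number per link)
    failed: set of links that are currently down
    """
    # Keep only surviving links that can still admit the channel.
    adj = {}
    for (u, v), spare in links.items():
        if (u, v) in failed or (v, u) in failed or spare < demand:
            continue
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    # Breadth-first search yields a fewest-hops path over usable links.
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # rerouting failed: no surviving path with enough capacity

links = {("A", "B"): 10, ("B", "D"): 10, ("A", "C"): 10, ("C", "D"): 3}
light = reroute(links, "A", "D", demand=2, failed={("A", "B")})
heavy = reroute(links, "A", "D", demand=5, failed={("A", "B")})
```

In this toy topology the lightly loaded channel is rerouted through C, while the channel needing 5 units cannot be rerouted because the only surviving path lacks spare capacity, which mirrors the observation that rerouting may fail under high realtime load.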

The first part of the dissertation describes the rerouting schemes for fault recovery. We start with a very general fault recovery model, and systematically refine our design until we are left with a well-structured and limited problem domain, which we explore through simulation. The simulation experiments were used to identify the best scheme within the chosen solution space, and to show that its performance is reasonable on a number of performance metrics, such as speed, amount of traffic rerouted, and efficiency of resource usage, and that it remains so under large variations in external factors such as network load, fault scenario, traffic mix, and network topology. The dissertation also contains a high-level design of a protocol for fault recovery of Tenet realtime channels, the Real-Time Control Message Protocol, based on the results of the experiments.

The second part of the dissertation describes the use of dispersity routing and forward error correction to provide fault tolerance for realtime channels. These techniques are used to design a variety of schemes that deliver various levels of service in the presence of restricted network faults. Some schemes provide transparent tolerance of multiple faults in the network, at the cost of increased network resource requirements; others use no extra resources, as compared to a simple realtime channel, but the service degrades gracefully when a network failure occurs; still others suffer a total disruption in the event of a failure, but the duration of the disruption is limited to the time needed to notify the source of the fault. These mechanisms are end-to-end in nature, permitting implementation on top of a basic realtime service, as long as appropriate support for routing is provided by the network. The services are validated and the cost to the network is evaluated through simulation. The techniques are also compared to existing mechanisms that provide fault tolerance in realtime networks.
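As an illustration of how dispersity routing and forward error correction can combine, the sketch below splits a message into k sub-messages, each intended for a separate path, and adds one XOR parity sub-message for an additional path. This is a simple single-parity code chosen for brevity, not necessarily any of the codes evaluated in the dissertation; the function names `disperse` and `recover` are assumptions made for this example, and a lost path is modeled as a `None` entry.

```python
def disperse(message, k):
    """Split `message` (bytes) into k equal-size sub-messages plus one
    XOR parity sub-message, giving k + 1 pieces for k + 1 paths."""
    size = -(-len(message) // k)              # ceiling division
    padded = message.ljust(size * k, b"\0")   # zero-pad to a multiple of k
    pieces = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytearray(size)
    for piece in pieces:
        for i, byte in enumerate(piece):
            parity[i] ^= byte
    return pieces + [bytes(parity)]

def recover(received, length):
    """Rebuild the original message from k + 1 pieces, of which at most
    one may be None (the piece sent on a failed path)."""
    missing = [i for i, piece in enumerate(received) if piece is None]
    if len(missing) > 1:
        raise ValueError("single parity covers only one lost path")
    received = list(received)
    if missing:
        size = len(next(p for p in received if p is not None))
        rebuilt = bytearray(size)
        # XOR of all surviving pieces equals the missing piece.
        for piece in received:
            if piece is not None:
                for i, byte in enumerate(piece):
                    rebuilt[i] ^= byte
        received[missing[0]] = bytes(rebuilt)
    # Drop the parity piece and strip the padding.
    return b"".join(received[:-1])[:length]

msg = b"realtime channel data"
pieces = disperse(msg, 3)
damaged = list(pieces)
damaged[1] = None                 # simulate the failure of one path
restored = recover(damaged, len(msg))
```

With three data paths and one parity path, the receiver reconstructs the message after any single path failure without contacting the source, at the cost of one extra sub-message of bandwidth. Covering more simultaneous failures would require a stronger code than the single parity assumed here.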