Abstract

A distributed shared memory (DSM) system transforms an existing network of workstations into a powerful shared-memory parallel computer that can deliver superior price/performance. However, as more workstations are engaged in the system and execution times grow longer, the probability of faults increases, and a fault could render the system useless. Several checkpointing and logging schemes have been proposed to enable a DSM system to continue working after transient failures. With checkpoints, processes need only roll back to their latest checkpoint rather than restart from the beginning. Logging is introduced to further reduce the rollback propagation to other, related processes. Although logging makes rollback propagation unnecessary, it introduces overhead of its own: if every read/write operation had to be logged, the logging overhead would be prohibitive. Moreover, some previously proposed logging methods can result in incorrect recovery when processes synchronize using barriers. In this paper, we propose a novel logging scheme that greatly reduces the amount of logging by logging only the pages that are invalidated, rather than all the pages accessed. The performance of the proposed scheme is analyzed using extensive simulation. Compared with two previously proposed schemes, our new logging scheme shows superior performance in various cases.

Keywords

Data Item; Shared Memory; Read Counter; Address Space; Read Operation
