Solutions

The Extreme-scale Simulator (xSim) is a performance investigation toolkit that permits running native high-performance computing (HPC) applications or proxy applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. Using a lightweight parallel discrete event simulation (PDES), xSim executes a Message Passing Interface (MPI) application on a much smaller system in a highly oversubscribed fashion with a virtual wall clock time, such that performance data can be extracted based on a processor and a network model with an appropriate simulation scalability/accuracy trade-off. xSim is designed like a traditional performance tool, as an interposition library that sits between the MPI application and the MPI library, using the MPI profiling interface. It has been run with up to 134,217,728 (2^27) communicating MPI ranks on a 960-core Linux cluster. Ongoing work focuses on extending it to a resilience co-design toolkit with definitions, metrics, and methods to evaluate the cost/benefit trade-off of resilience solutions, identify hardware/software resilience properties, and coordinate interfaces/responsibilities of individual hardware/software components.
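The core PDES idea can be illustrated with a toy sketch (not xSim's actual implementation): many simulated ranks share one real process, each rank keeps its own virtual clock, and message arrival times come from a network model. All names and the fixed-latency network model below are illustrative assumptions.

```python
import heapq

LATENCY = 1e-6  # assumed per-message network latency in virtual seconds (toy network model)

class SimRank:
    """One simulated MPI rank; many of these share a single real process."""
    def __init__(self, rank):
        self.rank = rank
        self.vclock = 0.0   # this rank's virtual wall-clock time
        self.inbox = []     # min-heap of (arrival_vtime, payload)

def send(src, dst, payload, compute_time):
    """Advance the sender's virtual clock and schedule delivery at the receiver."""
    src.vclock += compute_time              # model local computation before the send
    arrival = src.vclock + LATENCY          # network model: fixed latency
    heapq.heappush(dst.inbox, (arrival, payload))

def recv(dst):
    """Deliver the earliest message; the receiver waits in virtual time if needed."""
    arrival, payload = heapq.heappop(dst.inbox)
    dst.vclock = max(dst.vclock, arrival)   # block (in virtual time) until arrival
    return payload

# Four simulated ranks oversubscribed onto this one process.
ranks = [SimRank(r) for r in range(4)]
send(ranks[0], ranks[1], "hello", compute_time=5e-6)
msg = recv(ranks[1])   # receiver's virtual clock advances to the arrival time
```

Because time is virtual, performance estimates depend only on the processor and network models, not on how heavily the host system is oversubscribed.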

RedMPI is a prototype that enables transparent redundant execution of Message Passing Interface (MPI) applications. It sits between the MPI library and the MPI application, utilizing the MPI profiling interface (PMPI) to intercept MPI calls from the application and to hide all redundancy-related mechanisms. A redundantly executed application runs with r*m MPI processes, where r is the number of MPI ranks visible to the application and m is the replication degree. RedMPI supports partial replication, e.g., a degree of 2.5 instead of 2 or 3, for tunable resilience. It also supports a variety of message-based replication protocols with different consistency guarantees. Results indicate that the most efficient consistency protocol can successfully protect HPC applications even from high silent data corruption (SDC) rates with runtime overheads between 0% and 30%, compared to unprotected applications without redundancy. RedMPI can also be used as a fault injection tool by disabling the online error correction and keeping replicas isolated from each other.
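The principle behind message-based SDC detection and correction can be sketched as a majority vote over the message contents produced by the replicas of a sending rank. This is a minimal illustration, not RedMPI's actual consistency protocol; the function name is hypothetical.

```python
from collections import Counter

def vote(messages):
    """Majority-vote over replica copies of one message.

    Returns the agreed-upon payload (online error correction) and the
    indices of replicas whose copy disagreed (detected SDC)."""
    tally = Counter(messages)
    winner, _ = tally.most_common(1)[0]
    corrupted = [i for i, m in enumerate(messages) if m != winner]
    return winner, corrupted

# Three replicas sent the same logical message; replica 2 suffered a bit flip.
good, bad = vote([b"\x01\x02", b"\x01\x02", b"\xff\x02"])
```

With a replication degree of 2, disagreement can only be detected, not corrected; a degree of 3 (or a tie-breaking scheme) is needed to vote a corrupted copy out, which is one reason tunable partial replication is useful.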

The proactive fault tolerance framework consists of a number of individual proof-of-concept prototypes, including process and virtual machine migration, scalable system monitoring, and online/offline system health analysis. The novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into a Message Passing Interface (MPI) execution environment to transparently sustain operation across node failures predicted from deteriorating health, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive fault tolerance by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The scalable health monitoring system utilizes a tree-based overlay network to classify and aggregate monitoring metrics based on individual needs. The MRNet-based prototype is able to significantly reduce the amount of gathered and stored monitoring data, e.g., by a factor of 56 in comparison to the Ganglia distributed monitoring system. The online/offline system health analysis uses statistical methods, such as clustering and temporal analysis, to identify pre-fault indicators in the collected health monitoring data and in traditional system logs.
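The "nearly half" figure can be reproduced with a back-of-the-envelope estimate using Young's checkpoint-interval approximation (an assumption here, not a formula stated by the framework): the optimal interval grows with the square root of the mean time between failures (MTBF), so handling a fraction p of faults proactively stretches the effective MTBF by 1/(1-p) and shrinks the checkpoint count by sqrt(1-p).

```python
import math

def checkpoint_ratio(p_proactive):
    """Relative number of checkpoints after proactively handling fraction p of faults.

    Young's approximation: optimal interval t = sqrt(2 * C * MTBF) for checkpoint
    cost C, so checkpoints per unit time scale as 1/sqrt(MTBF). Proactive handling
    raises the effective MTBF to MTBF/(1-p), giving a ratio of sqrt(1-p)."""
    return math.sqrt(1.0 - p_proactive)

ratio = checkpoint_ratio(0.70)   # ~0.55, i.e., nearly half the checkpoints
```

At p = 0.70 the ratio is sqrt(0.3) ≈ 0.55, consistent with the "nearly cutting the number of checkpoints in half" result above.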

This proof-of-concept prototype includes enhancements in support of scalable group communication for membership management, reuse of network connections, transparent coordinated checkpoint scheduling, a job pause feature, and full/incremental checkpointing. It is based on the Local Area Multicomputer MPI implementation (LAM/MPI) and the Berkeley Lab Checkpoint/Restart (BLCR) solution. The transparent job pause mechanism allows live nodes to remain active and roll back to the last checkpoint, while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. A minimal overhead of 5.6% is incurred if migration takes place, while the regular checkpoint overhead remains unchanged. The hybrid checkpointing technique alternates between full and incremental checkpoints: At incremental checkpoints, only data changed since the last checkpoint is captured. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for cost and savings, benefits due to incremental checkpoints are an order of magnitude larger than overheads on restarts.
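A hash-based sketch illustrates the full/incremental alternation (the actual BLCR-based prototype tracks modified memory at the kernel level; block hashing here is a stand-in, and all names and the tiny block size are illustrative):

```python
import hashlib

BLOCK = 4  # tiny block size for illustration only

def blocks(data):
    """Split the process image into fixed-size blocks."""
    return [bytes(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def full_checkpoint(data):
    """Hash every block; a real full checkpoint would also write all blocks out."""
    return {i: hashlib.sha256(b).digest() for i, b in enumerate(blocks(data))}

def incremental_checkpoint(data, prev_hashes):
    """Save only blocks whose hash changed since the last checkpoint.

    Updates prev_hashes in place so the next incremental checkpoint
    diffs against this one."""
    changed = {}
    for i, b in enumerate(blocks(data)):
        h = hashlib.sha256(b).digest()
        if prev_hashes.get(i) != h:
            changed[i] = b
            prev_hashes[i] = h
    return changed

mem = bytearray(b"AAAABBBBCCCC")        # stand-in for a process image
hashes = full_checkpoint(mem)
mem[4:8] = b"XXXX"                       # the application dirties one block
delta = incremental_checkpoint(mem, hashes)   # only the dirty block is captured
```

A restart must replay the last full checkpoint plus every subsequent incremental one, which is why restart overhead rises moderately while checkpoint sizes shrink sharply.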

Head and service nodes are single points of failure and control for an entire high-performance computing (HPC) system: a single node failure renders the system inaccessible and unmanageable until manual repair. This solution relies on virtual synchrony, i.e., state-machine replication, utilizing a process group communication system for service group membership management and reliable, totally ordered message delivery. This replication method may be implemented internally, i.e., by modifying the service to be replicated to support redundant instances, or externally, i.e., by wrapping around an unmodified service, replicating input to multiple instances, and unifying output from these instances. Internal replication usually offers higher performance, while external replication is typically easier to implement. This solution encompasses a fully functional symmetric active/active high availability prototype for an HPC job and resource management service that does not require modification of the service, and a fully functional symmetric active/active high availability prototype for an HPC parallel file system metadata service that offers high performance. Assuming a mean time to failure of 5,000 hours for a single head or service node, the presented solutions improve service availability from 99.285% (2 nines) of a single node to 99.995% (4 nines) in a two-node system, and to 99.99996% (6 nines) with three nodes.
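The quoted availability figures follow from the standard model availability = MTTF / (MTTF + MTTR), with the service down only when all replicas are down simultaneously. The source does not state the repair time, so the MTTR of 36 hours below is an assumption back-solved from the single-node figure of 99.285%.

```python
MTTF = 5000.0   # hours, mean time to failure of one node (from the text)
MTTR = 36.0     # hours, assumed repair time chosen to reproduce 99.285%

def availability(n_nodes):
    """Availability of n symmetric active/active replicas.

    The service is unavailable only when all n nodes are down at once,
    so unavailability drops geometrically with the replica count."""
    u = MTTR / (MTTF + MTTR)    # single-node unavailability (~0.00715)
    return 1.0 - u ** n_nodes

one, two, three = availability(1), availability(2), availability(3)
# one   ~ 0.99285    (2 nines)
# two   ~ 0.99995    (4 nines)
# three ~ 0.9999996  (6 nines)
```

Each added replica multiplies the downtime by the single-node unavailability, which is why two extra nodes buy roughly four additional nines.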