research paper

2009-09-29

The first XtreemOS summit was co-located with Euro-Par 2009 in Delft, The Netherlands.

The objective of this half-day summit was to present the XtreemOS technology through talks ranging from a general overview to selected topics including security, resource matching, and parallel I/O. After the talks, several demonstrations were run to show the benefits of the XtreemOS system.

The summit concluded with an interesting and fruitful discussion between the audience and the XtreemOS representatives.

2009-05-12

Efficient Management of Consistent Backups in a Distributed File System

Authors: Jan Stender

Abstract: Setting up backup infrastructures for large-scale data management systems that can be operated cheaply and accessed with low latency has emerged as a practical problem. As a solution, we present a highly scalable and cost-efficient architecture for backup management in a distributed file system. We describe techniques for the creation of consistent backups at runtime, as well as approaches to resource management in connection with an integrated backup architecture.

COMPSAC 2009: website

Seattle, Washington, July 20-24, 2009

The Doctoral Symposium at COMPSAC will provide an international forum for doctoral students to interact with other students and faculty mentors. Since 2006, COMPSAC has been designated as the IEEE Computer Society Signature Conference on Software Technology and Applications.

The Doctoral Symposium seeks to bring together PhD students working in computer software and applications and related fields. Selected students will have the opportunity to present and discuss their research goals, methodology, and preliminary results within a constructive and international atmosphere.

The Symposium organizers will strive to provide useful guidance for completion of the dissertation research and motivation for a research career. The Symposium is intended for students who have already settled on a specific research proposal and have produced limited preliminary results, but have enough time remaining before their final defense to benefit from the fruitful Symposium discussions. Due to the mentoring aspect of the event, the Symposium will be open only to the students and mentors participating directly in the event.

In coordination with the technical theme of COMPSAC 2009, topics pertaining to software engineering of critical infrastructure systems such as civil, telecommunications, and medical systems will be of particular interest. Related topics include, but are not limited to, requirements analysis, co-analysis and co-design, modeling, design, development, testing, measurement, verification and validation for performance, safety, security, and dependability constraints of such systems. As effective construction of critical infrastructure systems is not limited solely to the field of computer science and engineering and is truly a multidisciplinary effort, submissions addressing multidisciplinary research topics are particularly encouraged.

Abstract - The EU-funded XtreemOS project implements a grid operating system (OS) transparently exploiting distributed resources through the SAGA and POSIX interfaces. XtreemOS uses an integrated grid checkpointing service (XtreemGCP) for implementing migration and fault tolerance. Checkpointing and restarting applications in a grid requires saving and restoring applications in distributed heterogeneous environments. In this paper we present the architecture of the XtreemGCP service integrating existing system-specific checkpointer solutions. We propose to bridge the gap between grid semantics and system-specific checkpointers by introducing a common kernel checkpointer API that allows using different checkpointers in a uniform way. Our architecture is open to support different checkpointing strategies that can be adapted according to evolving failure situations or changing application requirements. We also present how to avoid resource conflicts during restart. Finally, we discuss measurements showing that the XtreemGCP architecture introduces only minimal overhead.
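The idea of a common checkpointer API can be illustrated with a minimal Python sketch. It is not the XtreemGCP interface: the class names, method names, and backend labels below are hypothetical, and the in-memory backend merely stands in for a real system-specific checkpointer. The sketch only shows how a grid-level service could drive different checkpointers through one uniform interface.

```python
from abc import ABC, abstractmethod


class KernelCheckpointer(ABC):
    """Hypothetical uniform interface over system-specific checkpointers."""

    @abstractmethod
    def checkpoint(self, job_id):
        """Save the state of job_id and return an image identifier."""

    @abstractmethod
    def restart(self, image):
        """Restore the job stored under image and return its job id."""


class InMemoryCheckpointer(KernelCheckpointer):
    """Stand-in backend; a real adapter would call a native checkpointer."""

    def __init__(self, name):
        self.name = name
        self._images = {}

    def checkpoint(self, job_id):
        image = f"{self.name}:{job_id}:{len(self._images)}"
        self._images[image] = job_id
        return image

    def restart(self, image):
        return self._images[image]


def checkpoint_job(units, job_id):
    """Checkpoint one job on every node, whatever backend each node uses."""
    return {node: cp.checkpoint(job_id) for node, cp in units.items()}
```

A caller would pass a mapping from node names to backend instances, for example `checkpoint_job({"node1": InMemoryCheckpointer("backend-a")}, "job-7")`, and keep the returned image identifiers for a later restart.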

Abstract - To execute MPI applications reliably, fault tolerance mechanisms are needed. Message logging is a well-known solution to provide fault tolerance for MPI applications. It has been proved that it can tolerate a higher failure rate than coordinated checkpointing. However, pessimistic and causal message logging can induce high overhead on failure-free execution. In this paper, we present O2P, a new optimistic message logging protocol, based on active optimistic message logging. Contrary to existing optimistic message logging protocols that save dependency information on reliable storage periodically, O2P logs dependency information as soon as possible to reduce the amount of data piggybacked on application messages. Thus, it reduces the overhead of the protocol on failure-free execution, makes it more scalable, and simplifies recovery. O2P is implemented as a module of the Open MPI library. Experiments show that active message logging can effectively improve the scalability and performance of optimistic message logging.

2009-02-12

Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems

Authors: Pierre Riteau, Adrien Lebre and Christine Morin

Abstract

Computer clusters are today the reference architecture for high-performance computing. The large number of nodes in these systems induces a high failure rate. This makes fault tolerance mechanisms, e.g. process checkpoint/restart, a required technology to effectively exploit clusters. Most of the process checkpoint/restart implementations only handle volatile states and do not take into account persistent states of applications, which can lead to incoherent application restarts. In this paper, we introduce an efficient persistent state checkpoint/restoration approach that can be interconnected with a large number of file systems. To avoid the performance issues of a stable support relying on synchronous replication mechanisms, we present a failure resilience scheme optimized for such persistent state checkpointing techniques in a distributed environment. First evaluations of our implementation in the kDFS distributed file system show the negligible performance impact of our proposal.

Autonomous Resource Selection for Decentralized Utility Computing

Abstract

Many large-scale utility computing infrastructures comprise heterogeneous hardware and software resources. This raises the need for scalable resource selection services, which identify resources that match application requirements, and can potentially be assigned to these applications. We present a fully decentralized resource selection algorithm by which resources autonomously select themselves when their attributes match a query. An application specifies what it expects from a resource by means of a conjunction of (attribute,value-range) pairs, which are matched against the attribute values of resources. We show that our solution scales in the number of resources as well as in the number of attributes, while being relatively insensitive to churn and other membership changes such as node failures.
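The query model described in this abstract, a conjunction of (attribute, value-range) pairs, can be sketched in a few lines of Python. This is only an illustration of the matching predicate that a resource might evaluate against its own attributes. The decentralized query routing is not shown, and all names and attribute keys are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed interval [low, high] for one attribute."""
    low: float
    high: float

    def contains(self, value):
        return self.low <= value <= self.high


def matches(attributes, query):
    """True if every queried attribute exists and lies within its range."""
    return all(
        name in attributes and rng.contains(attributes[name])
        for name, rng in query.items()
    )


def select(resources, query):
    """Names of resources that match; done here in one place for brevity."""
    return [name for name, attrs in resources.items() if matches(attrs, query)]
```

Because the query is a conjunction, a resource that lacks a queried attribute, or has one value outside its range, does not select itself.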

Furthermore, he also presented the paper "Checkpointing Process Groups in a Grid Environment" within the main track of the International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT) in Dunedin, New Zealand, December 2008.

Paper abstract: "The EU-funded XtreemOS project implements a Linux-based grid operating system (OS), exploiting resources of virtual organizations through the standard POSIX interface. The Object Sharing Service (OSS) of XtreemOS addresses the challenges of transparent data sharing for distributed applications running in grids. We focus on the problem of handling consistency of replicated data in wide area networks in the presence of failures. The software architecture we propose interweaves concepts from transactional memory and peer-to-peer systems. Speculative transactions relieve programmers from complicated lock management. Super-peer-based overlay networks improve scalability and distributed hash tables speed up data search. OSS replicates objects to improve reliability and performance. In case of severe faults, the XtreemOS grid checkpointing service will support OSS. In this paper we describe the software architecture of OSS, design decisions, and evaluation results of preliminary experiments with a multi-user 3D virtual world."