A Performance Monitor based on Virtual Global Time for Clusters of PCs
Michela Taufer
Dept. of CSE
University of California, San Diego
taufer@cs.ucsd.edu
Thomas Stricker
Dept. of Computer Science
ETH Zurich, Switzerland
tomstr@computer.org
ABSTRACT
Debugging the performance of parallel and distributed systems
remains a difficult task despite the widespread use of middleware
packages for automatic distribution, communication and tasking
in clusters. In this paper we present a performance monitoring
tool for clusters of PCs that is based on two simple ideas: accounting
for resource usage, and mapping all performance-related state held in
hardware performance counters and operating system variables back
to the application level. In
this way a monitoring tool can explain the most relevant performance
metrics at a higher level that is easily understood by the application
developer. The most important metric for distributed high-performance
applications remains the total execution time as a function of the
number of compute nodes involved, since it characterizes the scalability
of an application. As a detailed contribution of this paper,
we look closely at what is needed to map the low-level
performance counters at each node back through the middleware
layer responsible for parallelization and distribution. The specific
problems addressed are the creation of a flexible notion of global
time for timestamping and reassembling performance data, and the
design of a communication mechanism that minimizes the monitoring
intrusion caused by the monitor's additional network traffic.
We show how our tool can be used to
measure, explain and predict the performance and scalability of a
distributed OLAP application running on clusters of PCs.
Keywords: Parallel and distributed systems, real-time performance
analysis and evaluation, performance metrics, monitoring
traffic, notion of time in distributed systems, clusters of PCs.