Stampede: Middleware for Monitoring and Troubleshooting of Large-Scale Applications on National Cyberinfrastructure

Large-scale applications today use distributed resources to support their
computations and, as part of their execution, generate large amounts of log
information. Up to now, we have been using the NetLogger analysis tools to
perform offline log analysis. Stampede extends this offline workflow log
analysis capability into a comprehensive middleware solution that will allow
users of complex scientific applications to track the status of their jobs in
real time, to detect execution anomalies automatically, and to perform online
troubleshooting without logging in to remote nodes or searching through
thousands of log files.

We build on an important class of applications, scientific workflows, which are
used today in a number of scientific disciplines, including astronomy, biology,
ecology, earthquake science, gravitational-wave physics, and many others. These
workflows run on today's large-scale infrastructure, such as the OSG or the
TeraGrid. The solution will be modular, distributed, and reusable across a
broad class of applications and workflow systems.

The system will be able to capture application-level logs from jobs as they are
executing on the cyberinfrastructure. At the same time, it will also collect log
information from the underlying cyberinfrastructure services, such as resource
management and data transfer. These end-to-end logs will be combined and
brokered through a subscription interface. External components will use the
subscription interface to provide monitoring services.
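
The brokering idea described above can be illustrated with a minimal sketch.
This is not Stampede's actual interface: the LogBroker class, the event names
(job.end, transfer.end), and the event fields are all hypothetical, chosen only
to show how application-level and infrastructure-level log events might be
routed to external monitoring components through subscriptions.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


class LogBroker:
    """Toy in-memory broker (hypothetical, for illustration only).

    Routes each published log event to every callback whose subscription
    pattern matches the event's name.
    """

    def __init__(self):
        # Maps a shell-style pattern (e.g. "job.*") to its callbacks.
        self._subscribers = defaultdict(list)

    def subscribe(self, pattern, callback):
        self._subscribers[pattern].append(callback)

    def publish(self, event):
        """Deliver the event to matching subscribers; return delivery count."""
        delivered = 0
        for pattern, callbacks in self._subscribers.items():
            if fnmatchcase(event["event"], pattern):
                for callback in callbacks:
                    callback(event)
                    delivered += 1
        return delivered


broker = LogBroker()
failures = []   # filled by a hypothetical failure-monitoring component
transfers = []  # filled by a hypothetical data-transfer monitor


def record_failure(event):
    # Assumed convention: a nonzero status means the job failed.
    if event.get("status", 0) != 0:
        failures.append(event)


broker.subscribe("job.*", record_failure)
broker.subscribe("transfer.*", transfers.append)

# Application-level events (from jobs) and an infrastructure-level event
# (from a data transfer service) pass through the same broker.
broker.publish({"event": "job.end", "job_id": "j1", "status": 0})
broker.publish({"event": "job.end", "job_id": "j2", "status": 1})
broker.publish({"event": "transfer.end", "src": "a", "dst": "b", "bytes": 1024})
```

In this sketch, a monitoring component only sees the events it subscribed to,
so new services can be added without changing the log producers. A real
deployment would presumably use a networked message broker rather than
in-process callbacks.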

This work is supported by the NSF under grant OCI-0943705.

Contact: Deb Agarwal
Credits: This work is supported by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Additional sponsors include the National Science Foundation, the Department of Homeland Security, and Microsoft Corporation.