Overview

This page describes the new design for the PTP resource manager (RM) monitoring/control framework. A new framework is needed primarily because the existing RM infrastructure (both model and UI) will not scale and is not flexible enough to encompass all the machines that PTP aims to target.

The purpose of this framework is to:

Collect and display monitoring information relating to the operation of a target system in a scalable manner

Provide job submission, termination, and other job-related operations

Support debugger launch and attach

Enable the collection and display of stdout, and the transmission of stdin, for running jobs (where supported by the target system)

Monitoring information will comprise:

The status and position of the user's jobs in queues

Job attribute information

Target system status and health information for arbitrary configurations

The physical/logical location of jobs on the target system

Predictive information about job execution

Key attributes of the framework include:

Support for arbitrary system configurations

Support for all existing resource managers

The ability to scale to petascale system sizes and beyond

Support for both user-installable and system-installable modes of operation

Automated installation for user-installable operation

Simple addition of support for new resource managers

Elimination of the need for compiled proxy agents

Rationale

The existing RM design is documented in the PTP 2.x Design Document. The main issues with the existing RM design fall into the following areas:

Model scalability and flexibility

UI scalability

Complexity of adding new RM support

Model Scalability and Flexibility

PTP employs an MVC architecture for monitoring job and system status. The model represents the target system and the jobs running on it, and receives updates from the proxy agents running on the target system.

Currently, the model provides a fixed hierarchy in which machines are composed of nodes, and (resource manager) queues contain jobs. A job has one or more processes, each of which runs on a specific node.
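The fixed hierarchy described above can be pictured with a minimal sketch. The class and field names here (`Machine`, `Node`, `Queue`, `Job`, `Process`, `build_example`) are illustrative assumptions, not PTP's actual model classes; the point is only that every level, down to each individual process, is materialized as an object.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str


@dataclass
class Machine:
    name: str
    nodes: List[Node] = field(default_factory=list)


@dataclass
class Process:
    rank: int
    node: Node  # each process runs on one specific node


@dataclass
class Job:
    job_id: str
    processes: List[Process] = field(default_factory=list)


@dataclass
class Queue:
    name: str
    jobs: List[Job] = field(default_factory=list)


def build_example() -> Tuple[Machine, Queue]:
    """Build a tiny hypothetical system: 4 nodes, 1 queue, 1 job with 8 processes."""
    machine = Machine("cluster", [Node(f"node{i}") for i in range(4)])
    # Place processes on nodes round-robin (an arbitrary choice for this sketch).
    job = Job("job1", [Process(rank, machine.nodes[rank % 4]) for rank in range(8)])
    queue = Queue("batch", [job])
    return machine, queue
```

Because the shape is hard-coded as machine/node and queue/job/process, there is no natural place in such a structure for additional levels such as racks or partitions.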

One problem with this approach is that the model hierarchy is inflexible and cannot easily represent more complex architectures (e.g. BG). Although such an architecture can be mapped onto machines/nodes, the user may wish to see the actual physical layout of the machine. Machines also often have both physical and logical layouts, each of which should be visible to the user.

Another issue is that the model currently represents the entire system down to the individual process level. This will clearly not scale to node/core counts in the hundreds of thousands and process counts in the millions.

The model is really only required for visualizing the system and job status on a target system. A better approach is to have a model that is tailored for this visualization, and that only models the currently visible aspects of the system.
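One way to picture a model that "only models the currently visible aspects of the system" is sketched below. This is an assumption-laden illustration, not the framework's design: `VisibleNodeModel`, `set_viewport`, and `fake_fetch` are hypothetical names, and the status source is a stand-in for whatever the target system would actually report.

```python
from typing import Callable, Dict


class VisibleNodeModel:
    """Holds state only for the node indices currently on screen (hypothetical)."""

    def __init__(self, total_nodes: int,
                 fetch_status: Callable[[range], Dict[int, str]]) -> None:
        self.total_nodes = total_nodes
        self._fetch_status = fetch_status
        self._visible: Dict[int, str] = {}

    def set_viewport(self, first: int, count: int) -> None:
        # Clamp the requested window to the valid node range.
        first = max(0, min(first, self.total_nodes))
        last = min(first + count, self.total_nodes)
        # Replace the held state: nodes that scrolled out of view are dropped.
        self._visible = self._fetch_status(range(first, last))

    def status(self, index: int) -> str:
        return self._visible[index]

    def __len__(self) -> int:
        return len(self._visible)


def fake_fetch(indices: range) -> Dict[int, str]:
    """Stand-in for a query to the target system (made-up data)."""
    return {i: ("busy" if i % 2 else "idle") for i in indices}


model = VisibleNodeModel(total_nodes=500_000, fetch_status=fake_fetch)
model.set_viewport(first=1000, count=64)
```

In this sketch the memory held by the model depends on the size of the viewport rather than on the size of the machine, which is the property the paragraph above argues for.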

UI Scalability

UI scalability is also a significant concern if PTP is to support petascale (and beyond) systems. The current runtime views display the