Session 2 Schedule

Session 1 Abstract

Predicting Bounds on the Batch Queuing Delay
Experienced by Individual TeraGrid User Jobs in Real Time

Rich Wolski
University of California, Santa Barbara

In this talk, we present a new method for providing end-users with real-time predictions of the bounds on the queuing delay that individual jobs will experience when waiting to be scheduled to a machine partition. Predicting the delay users will experience while waiting for their jobs to be scheduled is a problem that has been studied by both the academic and commercial HPC communities for some time. Our approach, based on a new statistical methodology, predicts bounds on the waiting time (upper or lower) that individual jobs will experience, with quantified confidence measures. Thus the predictions made by this system constitute a statistical guarantee of best-case and worst-case waiting delay, where the confidence measure quantifies the quality of the guarantee.

We have implemented this new methodology as part of the Network Weather Service and deployed it on several large-scale systems (TeraGrid, Datastar at SDSC, Lonestar at TACC, etc.) where it currently provides real-time bounds predictions. In the talk we will report on the effectiveness of the system, which has been in operation as a prototype for approximately 8 months. We will discuss the methodology and its evaluation using batch-queue logs spanning 10 years at the NSF and open DOE supercomputer centers. We will also demonstrate the web interface to the system and make "live" predictions of delay bounds during the presentation from the web page located at

and we will detail the operation of a set of command-line tools that are portable among all national Extensible Terascale Facility (ETF) architectures.

Our results show that it is possible to predict delay bounds with specified confidence levels for individual jobs in different queues, and for jobs requesting different ranges of processor counts and different maximum execution delays. Using these predictions, users with roaming allocations or with allocations at multiple sites can choose the machine that is most likely to minimize turn-around time. Users can also determine the probability that a job will meet a specified deadline in a particular queue. Finally, the system is portable to all ETF architectures, making it possible for users to consider the use of heterogeneous resources and to predict which is most likely to impose the shortest waiting time for their jobs.
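To make the idea of a delay bound with a quantified confidence level concrete, here is a minimal sketch. The abstract does not describe the statistical methodology itself, so the sketch substitutes a textbook order-statistic (binomial) argument and assumes the historical waits are independent and identically distributed. The function names, the sample data, and the parameters q and confidence are all hypothetical and are not taken from the Network Weather Service tools.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_bound_on_quantile(waits, q=0.95, confidence=0.95):
    """Return an observed wait that is >= the q-quantile of the wait
    distribution with probability >= confidence, or None if the history
    is too short to support that claim.

    Assumption (not from the abstract): waits are i.i.d. samples. Then the
    k-th smallest observation bounds the q-quantile from above with
    probability P(Binomial(n, q) <= k - 1).
    """
    n = len(waits)
    ordered = sorted(waits)
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, q) >= confidence:
            return ordered[k - 1]
    return None


# Hypothetical history of queue waits in seconds (illustrative only).
history = [30, 45, 60, 90, 120, 300, 600, 900, 1800, 3600] * 6
print(upper_bound_on_quantile(history, q=0.95, confidence=0.95))
print(upper_bound_on_quantile(history[:10], q=0.95, confidence=0.95))
```

Under these assumptions, at least 59 observations are needed before any observed wait can serve as a 95%-confidence upper bound on the 0.95 quantile, which is why the second call on a 10-job history returns None.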

Session 2 Abstract

Computational Grids, global Internet computing, autonomic, and peer-to-peer systems promise new levels of computing performance and storage capacity by allowing users to harness globally distributed resources dynamically. To succeed, these systems require effective models of resource behavior that foster new algorithm designs, new simulation capabilities, and new dynamic scheduling techniques. In particular, modeling resource availability is of critical importance, both to the design and to the implementation of new global computing systems.

In this talk, we discuss the problems of modeling and predicting resource availability in federated and distributed computing environments (Grids, P2P systems, etc.) as well as a new approach we have developed for addressing these problems. Using data from an enterprise-wide compute setting, the Condor cycle-harvesting compute infrastructure, and an Internet host availability survey, we describe how accurately we can fit a member of the Weibull family of statistical distributions to the empirical behavior. We compare our results, in terms of modeling power, to the use of both exponential and Pareto distributions -- two of the currently prevalent methodologies -- and find that our fitting techniques provide significantly greater accuracy. We also compare the accuracy of these parametric approaches to non-parametric techniques for predicting bounds on availability. Finally, we discuss the ramifications for simulation, distributed algorithm design, and on-line predictions that these results may have.
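As a rough illustration of fitting a Weibull distribution to availability durations and comparing it with an exponential model, the sketch below uses plain maximum-likelihood estimation with a bisection search for the shape parameter. This is a generic textbook procedure, not necessarily the fitting technique used in the talk. The data is synthetic, the Pareto comparison is omitted, and all function names and parameter values are hypothetical.

```python
import math
import random


def fit_weibull(xs, lo=0.01, hi=20.0, iters=80):
    """Maximum-likelihood Weibull fit; returns (shape, scale).

    The shape k solves sum(x^k ln x)/sum(x^k) - 1/k - mean(ln x) = 0,
    found here by bisection on the assumed bracket [lo, hi].
    """
    xs = [x for x in xs if x > 0]
    mean_log = sum(math.log(x) for x in xs) / len(xs)

    def g(k):
        powers = [x**k for x in xs]
        weighted = sum(p * math.log(x) for p, x in zip(powers, xs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    k = (lo + hi) / 2.0
    scale = (sum(x**k for x in xs) / len(xs)) ** (1.0 / k)
    return k, scale


def weibull_loglik(xs, k, scale):
    """Log-likelihood of xs under Weibull(shape=k, scale)."""
    return sum(
        math.log(k) - math.log(scale)
        + (k - 1) * (math.log(x) - math.log(scale))
        - (x / scale) ** k
        for x in xs
    )


def exponential_loglik(xs):
    """Log-likelihood of xs under the MLE exponential (rate = 1/mean)."""
    mean = sum(xs) / len(xs)
    return -len(xs) * (math.log(mean) + 1.0)


# Synthetic "availability durations" (illustrative only, not real traces).
rng = random.Random(42)
durations = [rng.weibullvariate(100.0, 0.5) for _ in range(2000)]
shape, scale = fit_weibull(durations)
print("Weibull shape, scale:", shape, scale)
print("Weibull log-likelihood:", weibull_loglik(durations, shape, scale))
print("Exponential log-likelihood:", exponential_loglik(durations))
```

Because the exponential distribution is the special case of the Weibull with shape 1, the fitted Weibull log-likelihood can never be lower than the exponential one. A fair comparison of modeling power would therefore also use a goodness-of-fit test or held-out data.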