Using Application-Domain Knowledge in the Runtime Support of Multi-Experiment
Computational Studies
Candidate: Siu-Man Yau
Advisors: Vijay Karamcheti and Denis Zorin

Abstract

A Multi-Experiment Study (MES) is a type of computational study in which the
same simulation software is executed multiple times, and the results of all
executions need to be aggregated to obtain useful insight. As computational
simulation experiments become increasingly accepted as part of the scientific
process, the use of MESs is becoming more widespread among scientists and
engineers.

MESs place several challenging requirements on the computing system. First,
many MESs need constant user monitoring and feedback, requiring simultaneous
steering of multiple executions of the simulation code. Second, MESs can
comprise many executions of long-running simulations; the sheer volume of
computation can make them prohibitively long to run.

Parallel architectures offer an attractive computing platform for MESs.
Low-cost, small-scale desktops employing multi-core chips allow widespread
dedicated local access to parallel computation power, offering more research
groups an opportunity to achieve interactive MESs. Massively-parallel,
high-performance computing clusters can afford a level of parallelism never
seen before, and present an opportunity to address the problem of
computationally intensive MESs.

However, to fully leverage the benefits of parallel architectures, the
traditional parallel systems' view has to be augmented. Existing parallel
computing systems often treat each execution of the software as a black box,
and therefore cannot view an entire computational study as a single entity
to be optimized.

This dissertation investigates how a parallel system can view MESs as an
end-to-end system and leverage the application-specific properties of MESs to
address their requirements. In particular, the system can 1) adapt its
scheduling decisions to the overall goal of an MES to reduce the needed
computation, 2) simultaneously aggregate results from, and disseminate user
actions to, multiple executions of the software to enable simultaneous
steering, 3) store reusable information across executions of the simulation
software to reduce individual run-time, and 4) adapt its resource allocation
policies to the MES's properties to improve resource utilization.

Using a testbed system called SimX and four example MESs from
different disciplines, this dissertation shows that the application-aware,
MES-level approach can achieve improvements ranging from multi-fold to
multiple orders of magnitude over the traditional simulation-level approach.