Ensemble Kalman Filter

Clouds are rapidly joining high-performance Grids as viable
computational platforms for scientific exploration and discovery, and it
is clear that production computational infrastructures will integrate
both these paradigms in the near future. As a result, understanding
usage modes that are meaningful in such a hybrid infrastructure is
critical. For example, there are interesting application workflows that
can benefit from such hybrid usage modes to, perhaps, reduce time to
solution, reduce costs (in terms of currency or resource allocation),
or handle unexpected runtime situations (e.g., unexpected delays in
scheduling queues or unexpected failures). The primary goal of this
paper is to experimentally investigate, from an applications
perspective, how autonomics can enable interesting usage modes and
scenarios for integrating HPC Grids and Clouds. Specifically, we used a
reservoir characterization application workflow, based on Ensemble
Kalman Filters (EnKF) for history matching, and the CometCloud autonomic
Cloud engine on a hybrid platform consisting of the TeraGrid and Amazon
EC2, to investigate three usage modes (or autonomic objectives):
acceleration, conservation, and resilience.

The framework is composed of autonomic managers that coordinate using
Comet coordination spaces, which span, and can be transparently accessed
across, the integrated execution environment. The key components of the
management framework are described below.

Workflow Manager: The workflow manager is
responsible for coordinating the execution of the overall application
workflow, based on user-defined policies, using Comet spaces.

Estimators: The cost estimators are responsible for
translating hints about computational complexity provided by the
application into runtime and/or cost estimates on a specific resource.
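This translation can be sketched in a few lines. The following is a minimal, hypothetical estimator: the per-resource throughput rates and the EC2 hourly price are illustrative assumptions, not values measured in this work, and a real estimator would calibrate them from benchmark runs.

```python
# Hypothetical cost-estimator sketch. The profile numbers below are
# illustrative assumptions, not measured values from this work.
RESOURCE_PROFILES = {
    # complexity units processed per second, and assumed price per hour
    "teragrid_node": {"rate": 16.0, "price_per_hour": 0.0},   # allocation-based
    "ec2_small":     {"rate": 1.0,  "price_per_hour": 0.085},
}

def estimate(complexity_hint, resource):
    """Translate an application-supplied complexity hint into a
    (runtime_seconds, cost_dollars) estimate for the given resource."""
    profile = RESOURCE_PROFILES[resource]
    runtime = complexity_hint / profile["rate"]
    cost = (runtime / 3600.0) * profile["price_per_hour"]
    return runtime, cost
```

The autonomic scheduler can then compare these per-resource estimates when deciding where each ensemble member should run.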

Grid/Cloud Agents: The Grid/Cloud agents are
responsible for provisioning the resources on their specific platforms,
configuring workers as execution agents on these resources, and
appropriately assigning tasks to these workers.
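The worker model the agents set up is pull-based. The sketch below models the shared coordination space as a thread-safe queue purely for illustration; CometCloud's actual Comet space is a distributed tuple space with a different API.

```python
import queue
import threading

# Sketch of the pull-based worker model: tasks sit in a shared
# coordination space (modeled here as a thread-safe queue, standing in
# for a Comet space) and workers on each platform repeatedly pick up
# and execute tasks until the space is drained.

def worker(space, results, platform):
    while True:
        try:
            task = space.get_nowait()
        except queue.Empty:
            return
        results.append((platform, task))  # stand-in for running the simulation

space = queue.Ueue if False else queue.Queue()
for task_id in range(8):          # 8 illustrative ensemble-member tasks
    space.put(task_id)

results = []
threads = [threading.Thread(target=worker, args=(space, results, p))
           for p in ("teragrid", "ec2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because workers pull rather than being pushed to, faster platforms naturally take on more tasks, which is what makes the hybrid split self-balancing.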

Experiment and results

The goal of the experiments presented in this section is to
investigate how possible usage modes for a hybrid HPC Grid-Cloud
infrastructure can be supported by a simple policy-based autonomic
scheduler. Specifically, we experimentally investigate implementations
of three usage modes: acceleration, conservation, and resilience, which
correspond to the different objectives of the autonomic scheduler.
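The three objectives can be illustrated with a simplified assignment policy. The split rules below are assumptions made for illustration, not the scheduler's actual algorithm: acceleration fills available TG nodes first and offloads the rest to EC2, conservation pushes work to EC2 to preserve the TG allocation, and resilience falls back to EC2 when the TG is unavailable.

```python
# Illustrative sketch of a policy-based scheduler choosing platforms per
# objective. These rules are simplified assumptions, not the actual
# scheduling algorithm used in this work.

def assign(tasks, objective, tg_nodes, tg_available=True):
    """Return a dict mapping each task to 'tg' or 'ec2'."""
    if objective == "resilience" and not tg_available:
        # TG queue delayed or TG failed: fall back entirely to EC2
        return {t: "ec2" for t in tasks}
    if objective == "conservation":
        # Conserve the TG allocation: run everything on EC2
        return {t: "ec2" for t in tasks}
    # Acceleration (and resilience while TG is up): fill TG nodes first,
    # offload the remainder to EC2 to shorten time to completion
    return {t: ("tg" if i < tg_nodes else "ec2")
            for i, t in enumerate(tasks)}
```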

Our experiments use a single-stage EnKF workflow with 128 ensemble
members (tasks) with heterogeneous computational requirements. The
heterogeneity is illustrated in Figure 3, which shows histograms of the
runtimes of the 128 ensemble members within a stage on one node of a TG
compute system (Ranger) and on one EC2 core (a small VM instance: 1.7 GB
memory, 1 virtual core, 160 GB instance storage, 32-bit platform),
respectively. The distribution of task runtimes is approximately
Gaussian, with a few significant exceptions. These plots also illustrate
the relative computational capabilities of the two platforms. Note that
when a task is assigned to a TG compute node, it runs as a parallel
application across the node's 16 cores with linear scaling; on an EC2
node, it runs as a sequential simulation and therefore takes
considerably longer.
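Under this linear-scaling assumption, a member's one-node TG runtime follows directly from its sequential single-core runtime (the 16-core node width is from the text; the sample runtime in the test is illustrative):

```python
TG_CORES_PER_NODE = 16  # Ranger node width, per the text

def tg_node_runtime(sequential_runtime):
    # Linear scaling across the node's cores, as assumed above
    return sequential_runtime / TG_CORES_PER_NODE
```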

We use two key metrics in our experiments. Total Time to Completion
(TTC) is the wall-clock time for the entire (one-stage) EnKF workflow
(i.e., all 128 ensemble members) to complete and for the results to be
consumed by the KF stage; it may include both TG and EC2 execution.
Total Cost of Completion (TCC) is the total EC2 cost for the entire
EnKF workflow.
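The two metrics can be computed from per-task finish data along the following lines. The hourly price and the per-instance-hour billing model in this sketch are assumptions for illustration, not the actual charges incurred in the experiments.

```python
import math

# Sketch of the two metrics. The EC2 price and the assumption that
# instances are billed per (partial) hour are illustrative, not the
# actual accounting used in this work.

def ttc(tg_task_end_times, ec2_task_end_times):
    """Wall-clock time at which all ensemble members are done."""
    return max(tg_task_end_times + ec2_task_end_times)

def tcc(ec2_instance_busy_seconds, price_per_hour=0.085):
    """Total EC2 cost, rounding each instance's usage up to whole hours."""
    hours = sum(math.ceil(s / 3600.0) for s in ec2_instance_busy_seconds)
    return hours * price_per_hour
```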

Our experiments are based on the assumption that, for tasks that can
exploit 16-way parallelism, the TG is the platform of choice for the
application and gives the best performance, but it is also the more
restricted resource. Furthermore, users have a fixed allocation on this
expensive resource, which they may want to conserve for tasks that
require larger node counts. EC2, on the other hand, is more freely
available, but not as capable.

Note that the motivation of our experiments is to understand each of
the usage scenarios and their feasibility, behaviors, and benefits, not
to optimize the performance of any one scenario (or experiment). In
other words, we aim to establish a proof of concept rather than a
systematic performance analysis.