As a member of the Parallel Programming Lab (PPL),
I work on the Charm++ programming model and runtime system. Charm++ is an object-based parallel programming model, mainly used in the
construction of high performance computing (HPC) applications (usually involving scientific simulations). I'm involved with a variety
of efforts within PPL. However, my particular area of focus is enabling the execution of applications on heterogeneous systems through
various programming model, runtime system, and build process enhancements/modifications.

My research interests focus mainly on programming model and runtime system support for heterogeneous systems. In other words,
what role can (and should) programming languages and their associated runtime systems play in helping programmers create
applications that target heterogeneous systems? Interest in heterogeneous systems has grown in recent years, especially
for computation-intensive codes, since power limitations effectively ended the clock speed race. We keep the
term "heterogeneous" as open as possible. It ranges from systems with simple differences, such as the amount of RAM
per node in a cluster, to more complex systems that include multiple host core architectures along with multiple accelerator
technologies. Currently, the work focuses on supporting Cell and MIC; however, we are also looking into methods for extending
support to GPGPU hardware.

Please note, since this research is in the context of the Charm++ programming model, the remainder of this discussion assumes
that the reader is somewhat familiar with the Charm++ programming model. If this is not the case, please see the publications
section below. The paper "Towards a Framework for Abstracting Accelerators in Parallel Applications: Experience with Cell" has
a good introduction to the research along with a brief high-level introduction to the Charm++ programming model.

We are extending the Charm++ programming model and modifying the Charm++ runtime system to support accelerator technologies and
heterogeneous clusters in general. In short, we have introduced accelerated entry methods into the Charm++ programming
model. Entry methods, in general, can be thought of as tasks. Accelerated entry methods are entry methods that may or
may not execute on an accelerator. The underlying runtime system then takes care of automatically moving data as required to
the core (host or accelerator) that is tasked to execute the entry method. The execution of entry methods, including accelerated
entry methods, and all data movement occur asynchronously under the direction of the runtime system. Given the clear boundaries between
entry methods, we have further modified the runtime system to handle some of the mundane details of executing an application on a
heterogeneous system. For example, with knowledge of the data types, array lengths, and so on that make up the application's
data, the runtime system can modify the data to correct for architecture differences, such as endianness, as data passes between
cores. In addition to accelerated entry methods, we have also introduced accelerated blocks and a SIMD Instruction
Abstraction. For more details on our research, please see the publications section for
relevant papers.
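
As a rough illustration of the endianness correction described above, the following is a minimal sketch in plain C++. It is not the Charm++ API or the runtime system's actual implementation. The `FieldLayout` struct and the `swapEndianness` function are hypothetical names introduced only for this example; they stand in for the knowledge of data types and array lengths that the runtime system has about the application's data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one array inside a message buffer.
// This stands in for the type and length information available to a runtime.
struct FieldLayout {
    std::size_t offset;        // byte offset of the first element in the buffer
    std::size_t elementSize;   // bytes per element (for example 2, 4, or 8)
    std::size_t elementCount;  // number of elements in the array
};

// Reverse the byte order of every element of every described field, in place.
// Bytes not covered by any field are left untouched.
// Applying the function twice restores the original buffer.
inline void swapEndianness(std::vector<std::uint8_t>& buffer,
                           const std::vector<FieldLayout>& fields) {
    for (const FieldLayout& field : fields) {
        // Precondition: the described field must fit inside the buffer.
        assert(field.offset + field.elementSize * field.elementCount <= buffer.size());
        for (std::size_t i = 0; i < field.elementCount; ++i) {
            const std::size_t start = field.offset + i * field.elementSize;
            auto first = buffer.begin() + static_cast<std::ptrdiff_t>(start);
            std::reverse(first, first + static_cast<std::ptrdiff_t>(field.elementSize));
        }
    }
}
```

The point of the sketch is that the swap is driven entirely by the layout description, so application code never has to mention byte order. A real runtime would apply such a step only when the sending and receiving cores actually differ in endianness.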

When programming for accelerator technologies, it is quite common for programmers to have to include architecture-specific code
within their application code. This not only increases the burden placed on programmers, who have to structure their
application towards a specific type of core, but also decreases the portability of the code itself. Our modifications to the
Charm++ programming model and runtime system help to divorce the application code from the architecture-specific details. However,
it is clear that these architecture-specific details are important, especially when it comes to the performance of an application
running on the given architecture. Thus, a balance must be struck to make sure performance is good while still assisting the
programmer.

Perhaps more importantly, given a unified programming model and portable code, the runtime system can start doing some more
interesting things on the programmer's behalf. One such activity is automatic dynamic load balancing. Given a heterogeneous application
(that is, an application with multiple different calculations going on, with task variations within a given calculation), spreading
the application's workload across the available cores, host and accelerator alike, may not be straightforward for the programmer
to do (especially at compile time). The Charm++ load balancing framework already makes runtime measurements to load balance
applications executing on homogeneous clusters. This research intends to extend this work to load balancing on heterogeneous
systems by having the runtime system dynamically migrate work between the host cores and any available accelerators.
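
To make the idea of measurement-based load balancing across unequal cores more concrete, here is a minimal sketch in plain C++. It uses a simple greedy heuristic (largest task first, each placed on the core that would finish it earliest) and is offered only as an illustration. It is not the strategy used by the Charm++ load balancing framework, and the names `Core`, `relativeSpeed`, and `greedyAssign` are hypothetical. The sketch assumes that per-task loads and per-core speeds have already been measured at run time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model of one core, host or accelerator.
struct Core {
    double relativeSpeed;  // assumed measured throughput, in work units per second
};

// Greedy assignment: visit tasks from largest to smallest measured load and
// place each one on the core that would complete it earliest.
// Returns assignment[i] = index of the core chosen for task i.
inline std::vector<std::size_t> greedyAssign(const std::vector<double>& taskLoads,
                                             const std::vector<Core>& cores) {
    assert(!cores.empty());
    for (const Core& core : cores) {
        assert(core.relativeSpeed > 0.0);
    }

    // Task indices sorted by decreasing load (stable, so ties keep input order).
    std::vector<std::size_t> order(taskLoads.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return taskLoads[a] > taskLoads[b]; });

    std::vector<double> finishTime(cores.size(), 0.0);  // projected busy time per core
    std::vector<std::size_t> assignment(taskLoads.size(), 0);

    for (std::size_t task : order) {
        std::size_t best = 0;
        double bestFinish = finishTime[0] + taskLoads[task] / cores[0].relativeSpeed;
        for (std::size_t c = 1; c < cores.size(); ++c) {
            const double candidate = finishTime[c] + taskLoads[task] / cores[c].relativeSpeed;
            if (candidate < bestFinish) {
                best = c;
                bestFinish = candidate;
            }
        }
        finishTime[best] = bestFinish;
        assignment[task] = best;
    }
    return assignment;
}
```

For example, with one host core of speed 1.0 and one accelerator of speed 4.0, most of the work lands on the accelerator, but the host core still receives a task once the accelerator's queue grows long enough. A dynamic scheme would repeat this kind of decision periodically as measurements change, which a static, compile-time split cannot do.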

David M. Kunzman. “Runtime support for object-based message-driven parallel applications on heterogeneous clusters.”
Diss. Department of Computer Science, University of Illinois at Urbana-Champaign, June 2012.