Research interests

Fault tolerance

The current model for fault tolerance involves collecting periodic
collective checkpoints, with all processes restarting from the last
successful checkpoint on failure. We have worked on
virtualization-based approach to checkpointing in the context of
Infiniband clusters.

Rather than rolling back all the processes in the event of a failure,
we have developed strategies to restart only the failed processes
while other processes co-operate in deriving a consistent state
without discarding the work done by them since the last
checkpoint. This work is done in the context of the task-based
programming framework we have developed for dynamic load balancing.

Performance tools

Shared and distributed memory programming models

Dynamic load balancing

Irregular and dynamic parallel applications pose significant
challenges to achieving scalable performance on large-scale multicore
clusters. These applications often require ongoing, dynamic load
balancing in order to maintain efficiency. Scalable dynamic load
balancing on large clusters is a challenging problem. I have addressed
this issue in the context of a variety of architectural environments,
such as shared-memory systems, distributed-memory clusters, and
GPUs. In our model, the computation is organized as a collection of
tasks. Each task provides sufficient information to the runtime to
enable relocation of the task, possibly also handling its data
requirements. Solutions have been developed that are tailored to
application classes, ranging from hypergraph partitioning-based disk
I/O scheduling to distributed memory work stealing for a collection of
independent tasks.