Our research in load balancing focuses on two primary areas: Object Migration and Seed Balancing.

Object Migration

Periodic Load Balancing for bipartite object networks

Adaptive use of Workstation Clusters

Optimal Object Migration to handle background load variation

A major factor slowing the deployment of parallel programs is that efficient parallel programs are difficult to write. Parallel programming adds a second dimension to programming: not just when a particular operation will be executed, but where, i.e., on which processor. A vast number of parallelizable applications lack the regular structure needed for straightforward, efficient parallelization. Such applications require load balancing to perform efficiently in parallel, and because their load may also change over time, they require periodic rebalancing. The programmer is left with a choice: either distribute computation haphazardly, producing poorly performing programs, or spend additional development time writing load-balancing code into the application.

A typical object distribution on two processors in a Charm++ program.

The components and interactions in the load balancing framework.

In recent years, new types of "parallel" computers have appeared. Networks of commodity workstations are making parallel computation available to an expanding group of researchers, but they present new issues for the application programmer. In addition to application imbalance, a parallel program must now contend with background load from other simultaneous users. Parallel programs may run on clusters of workstations sitting on interactive users' desks, where the primary user permits parallel computation only when the machine is not being used interactively. Finally, computational clusters may expand over time, and with the rapid increase in computational power, newly added processors are likely to be faster than the older machines they supplement. To maximize throughput, load balancers in parallel applications must account for all of these factors.
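One way to see how a balancer can account for both heterogeneous processor speeds and background load is to treat each processor's usable capacity as its relative speed scaled by the fraction of CPU not consumed by other users. The sketch below illustrates this idea; the `Processor` struct and `workTargets` function are illustrative names invented for this example, not part of any actual system.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch: a processor's usable capacity is its relative speed
// times the fraction of CPU left over by background/interactive load.
// A balancer should hand out work in proportion to this capacity,
// not in equal shares.
struct Processor {
    double speed;      // relative throughput (1.0 = baseline node)
    double available;  // fraction not consumed by background load, in [0,1]
};

// Amount of work (measured in baseline-processor seconds) each processor
// should receive out of totalWork, proportional to speed * available.
std::vector<double> workTargets(const std::vector<Processor>& procs,
                                double totalWork) {
    double totalCapacity = 0.0;
    for (const Processor& p : procs)
        totalCapacity += p.speed * p.available;
    std::vector<double> targets;
    for (const Processor& p : procs)
        targets.push_back(totalWork * (p.speed * p.available) / totalCapacity);
    return targets;
}
```

Under this model, a new node that is twice as fast but half-occupied by an interactive user receives the same share as a fully idle older node.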

Work migration is a unified scheme for handling both application-specific and externally arising load imbalance. The difficulty with migrating work is that either the work is repartitioned in an application-specific way, placing the burden on the application programmer, or automatic migration is supported but with poor accuracy, owing to the lack of application-specific knowledge.

Object migration provides a way of performing accurate, fine-grained automatic load balancing. Objects usually operate on small, well-defined regions of memory, which keeps the cost of migration low. Using the Charm++ object model, the run-time system measures the work represented by particular objects rather than deriving execution times from application-specific heuristics. Furthermore, the run-time system records object-to-object communication patterns, so the load balancer can assess the communication impact of migrating particular objects.
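A common way to turn such measured per-object loads into a migration decision is a greedy heuristic: sort objects by measured load and repeatedly place the heaviest remaining object on the least-loaded processor. The sketch below shows this heuristic in plain C++; the names and interface are illustrative, not the actual Charm++ load-balancing strategy API, and it ignores communication costs for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy measurement-based assignment (illustrative sketch):
// objLoads[i] is the measured load of object i; returns the processor
// assigned to each object.
std::vector<int> greedyAssign(const std::vector<double>& objLoads,
                              int numProcs) {
    // Object indices sorted by descending measured load.
    std::vector<int> order(objLoads.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoads[a] > objLoads[b]; });

    // Min-heap of (current load, processor id): top is the least-loaded PE.
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (int p = 0; p < numProcs; ++p) heap.push({0.0, p});

    std::vector<int> assignment(objLoads.size());
    for (int obj : order) {
        Entry e = heap.top();  // least-loaded processor so far
        heap.pop();
        assignment[obj] = e.second;
        heap.push({e.first + objLoads[obj], e.second});
    }
    return assignment;
}
```

For example, objects with measured loads {4, 3, 3, 2} on two processors end up split 6/6 rather than whatever an arbitrary static partition would give.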

With the advent of massively parallel machines like Blue Gene/L and the Cray XT3, our recent work has focused on topology-aware migration of objects. We have developed strategies that take into account both the message sizes and the network hop lengths to minimize the total amount of communication.

Communication latencies are a significant factor in the performance of parallel applications on these large machines. The latencies are primarily due to network contention in the grid and torus networks typically used in such machines. Our load balancing strategies minimize the impact of topology by heuristically minimizing the number of hops traveled by each communicated byte. They are not network-specific and work for all classes of interconnection networks.
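The quantity being minimized here, bytes communicated weighted by hops traveled, is often called hop-bytes. The sketch below computes it for a one-dimensional torus; real machines extend the hop distance per dimension of the grid or torus. The struct and function names are invented for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Hop distance between nodes a and b on a 1-D torus of `dim` nodes:
// the wrap-around link may offer a shorter path than the direct one.
int torusHops(int a, int b, int dim) {
    int d = std::abs(a - b);
    return std::min(d, dim - d);
}

struct Message {
    int srcPE, dstPE;  // processors of the communicating objects
    long bytes;        // message size
};

// Total hop-bytes: for each message, bytes sent times the number of
// network hops between source and destination. A topology-aware mapping
// tries to keep this sum small.
long hopBytes(const std::vector<Message>& msgs, int dim) {
    long total = 0;
    for (const Message& m : msgs)
        total += m.bytes * torusHops(m.srcPE, m.dstPE, dim);
    return total;
}
```

Placing heavily communicating objects on nearby processors directly reduces this sum, and with it the contention each message contributes to shared links.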

Seed Load Balancing

Seed load balancing involves moving object creation messages, or "seeds", to balance work across a set of processors. Several variations of strategies are being analyzed. In particular, we distinguish between global strategies, which may involve communication among all processors to exchange load information, and neighborhood strategies, which typically impose a dense graph organization on the processors and restrict communication to neighbors only. Some strategies use averaging of loads to determine how seeds should be distributed, while others are receiver-initiated: a processor requests work from elsewhere when it is about to go idle. A strategy that places seeds randomly at creation and never moves them afterward serves as a baseline for comparison on numerous benchmarks.
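To make the neighborhood-averaging idea concrete, the sketch below runs one round of a simple averaging step on a ring of processors: each processor compares its seed count with one neighbor and transfers half the surplus toward the lighter side. This is an illustrative toy, not PPL's actual implementation, and a ring is only one possible neighbor graph.

```cpp
#include <cassert>
#include <vector>

// One round of neighborhood averaging on a ring (illustrative sketch):
// seeds[p] is the number of queued seeds on processor p. Each processor
// compares itself with its right neighbor and moves half the surplus to
// whichever side is lighter. Decisions use the pre-round counts, as they
// would with load information exchanged at the start of the round.
std::vector<int> averageStep(const std::vector<int>& seeds) {
    int n = (int)seeds.size();
    std::vector<int> next(seeds);
    for (int p = 0; p < n; ++p) {
        int right = (p + 1) % n;
        int diff = seeds[p] - seeds[right];
        if (diff > 1) {            // this processor is heavier: send surplus
            next[p] -= diff / 2;
            next[right] += diff / 2;
        } else if (diff < -1) {    // neighbor is heavier: receive surplus
            next[p] += (-diff) / 2;
            next[right] -= (-diff) / 2;
        }
    }
    return next;
}
```

Repeated rounds diffuse seeds outward from heavily loaded processors; because every transfer is between neighbors, no global exchange of load information is needed, which is precisely the appeal of neighborhood strategies over global ones.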