Summary

The quest for efficient and scalable parallel reservoir simulators has been evolving with the advancement of high-performance computing architectures. Among the various challenges of efficiency and scalability, load imbalance is a major obstacle that has not been fully addressed. The causes of load imbalance in parallel reservoir simulation are both static and dynamic. Robust graph-partitioning algorithms are capable of handling static load imbalance by decomposing the underlying reservoir geometry to distribute a roughly equal load to each processor. However, the loads determined by a static load balancer seldom remain unchanged as the simulation proceeds in time. This so-called dynamic imbalance can be exacerbated further in parallel compositional simulations. The flash calculations for equations of state (EOSs) in complex compositional simulations can not only consume more than half of the total execution time but are also difficult to balance with a static load balancer alone. The computational cost of flash calculations in each gridblock depends heavily on dynamic data such as pressure, temperature, and hydrocarbon composition. Thus, any static assignment of gridblocks may lead to dynamic load imbalance in unpredictable ways. A dynamic load balancer can often provide solutions for this difficulty. However, traditional techniques are inflexible and tedious to implement in legacy reservoir simulators. In this paper, we present a new approach to address dynamic load imbalance in parallel compositional simulation. It overdecomposes the reservoir model to assign each processor a bundle of subdomains. Processors treat these bundles of subdomains as virtual processes or user-level migratable threads that can be dynamically migrated across processors in the run-time system. This technique is shown to be capable of achieving better overlap between computation and communication for cache efficiency. We use this approach in a legacy reservoir simulator and demonstrate a reduction in the execution time of parallel compositional simulations while requiring minimal changes to the source code. Finally, it is shown that domain overdecomposition, together with a load balancer, can improve speedup from 29.27 to 62.38 on 64 physical processors for a realistic simulation problem.
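The idea of overdecomposition with migratable subdomains can be illustrated with a small sketch. It is not the run-time system or the load balancer used in this work: the greedy heuristic, the function names, and the per-subdomain costs are all assumptions chosen for illustration. The sketch splits a model into 16 subdomains on 4 processors, where the first four subdomains stand in for a region with expensive flash calculations, and compares a static block assignment with a greedy reassignment that uses the observed costs.

```python
from typing import List


def processor_loads(costs: List[float], owner: List[int], n_procs: int) -> List[float]:
    """Sum the costs of the subdomains held by each processor."""
    loads = [0.0] * n_procs
    for cost, proc in zip(costs, owner):
        loads[proc] += cost
    return loads


def imbalance(loads: List[float]) -> float:
    """Ratio of the busiest processor's load to the mean load (1.0 is ideal)."""
    return max(loads) / (sum(loads) / len(loads))


def block_assignment(n_subdomains: int, n_procs: int) -> List[int]:
    """Static assignment: each processor gets a contiguous bundle of subdomains."""
    per_proc = n_subdomains // n_procs
    return [min(i // per_proc, n_procs - 1) for i in range(n_subdomains)]


def greedy_rebalance(costs: List[float], n_procs: int) -> List[int]:
    """Hypothetical balancer: place the most expensive subdomains first,
    each on the currently least-loaded processor."""
    owner = [0] * len(costs)
    loads = [0.0] * n_procs
    for i in sorted(range(len(costs)), key=lambda k: costs[k], reverse=True):
        proc = loads.index(min(loads))
        owner[i] = proc
        loads[proc] += costs[i]
    return owner


N_PROCS = 4
# Made-up measured costs per subdomain (arbitrary units); the first four
# subdomains mimic gridblocks where flash calculations are expensive.
costs = [9, 7, 8, 6, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1]

static_owner = block_assignment(len(costs), N_PROCS)
static_loads = processor_loads(costs, static_owner, N_PROCS)

balanced_owner = greedy_rebalance(costs, N_PROCS)
balanced_loads = processor_loads(costs, balanced_owner, N_PROCS)

print(f"static   loads={static_loads} imbalance={imbalance(static_loads):.2f}")
print(f"balanced loads={balanced_loads} imbalance={imbalance(balanced_loads):.2f}")
```

With one subdomain per processor there would be nothing smaller than a whole processor's share to move. Having several subdomains per processor is what gives the balancer units of work to migrate. For these made-up costs, the static assignment leaves one processor with 30 of the 44 cost units, while the greedy reassignment gives each processor 11.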