We have developed a robust and scalable multi-GPU (Graphics Processing
Unit) version of the cellular-automaton-based MAGFLOW lava simulator.
The cellular automaton is partitioned into strips that are assigned to
different GPUs, with minimal overlap. For each GPU, a host thread is
launched to manage allocation, deallocation, data transfer and kernel
launches; the main host thread coordinates all of the GPUs to ensure
temporal coherence and data integrity. The overlapping borders and
the maximum time step must be exchanged among the GPUs at the
beginning of every evolution step of the cellular automaton; data transfers
are asynchronous with respect to the computations, to hide the overhead
they introduce. The GPUs need not have the same speed or capacity;
the system runs correctly on both homogeneous and heterogeneous hardware.
The speed-up differs from the ideal factor (#GPUs×) only by a
constant overhead loss of about 4E−2 · T · #GPUs, where T is the total
simulation time.
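
The strip partitioning and the per-step exchange described above can be illustrated with a minimal host-side sketch. It is not the MAGFLOW implementation: the names Strip, partition_rows and global_time_step are hypothetical, the one-row halo and the weighting of strip sizes by relative GPU speed are assumptions made for illustration, and taking the smallest of the per-GPU admissible time steps as the common step is likewise an assumption about how the exchanged values are combined.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Strip:
    """Rows owned by one device, plus the overlapping rows it must also hold."""
    device: int
    start: int       # first owned row (inclusive)
    stop: int        # end of owned rows (exclusive)
    halo_start: int  # first row kept in device memory (inclusive)
    halo_stop: int   # end of rows kept in device memory (exclusive)


def partition_rows(n_rows: int, weights: Sequence[float], halo: int = 1) -> List[Strip]:
    """Split n_rows into contiguous strips sized in proportion to weights.

    weights is a hypothetical per-device speed estimate; equal weights
    give equal strips. halo is the assumed overlap width in rows.
    """
    if n_rows < len(weights):
        raise ValueError("need at least one row per device")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    total = float(sum(weights))
    strips: List[Strip] = []
    start = 0
    acc = 0.0
    for device, w in enumerate(weights):
        acc += w
        remaining = len(weights) - device - 1
        if remaining == 0:
            stop = n_rows  # last device takes whatever is left
        else:
            stop = round(n_rows * acc / total)
            stop = max(stop, start + 1)          # own at least one row
            stop = min(stop, n_rows - remaining)  # leave a row for each later device
        strips.append(
            Strip(device, start, stop, max(0, start - halo), min(n_rows, stop + halo))
        )
        start = stop
    return strips


def global_time_step(local_steps: Sequence[float]) -> float:
    """Combine per-device admissible time steps into one common step.

    Assumption: all strips must advance together, so the smallest of the
    per-device maxima is used.
    """
    return min(local_steps)


if __name__ == "__main__":
    for strip in partition_rows(90, weights=[2.0, 1.0]):
        print(strip)
    print(global_time_step([0.5, 0.2, 0.9]))
```

In this sketch only the halo rows next to a strip boundary would need to be copied between neighbouring devices at each step, which is what keeps the exchanged data small compared with the strip itself.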