Publication

Callisto-RTS: Fine-Grain Parallel Loops (July 2015)

We introduce Callisto-RTS, a parallel runtime system
designed for multi-socket shared-memory machines. It
supports very fine-grained scheduling of parallel loops,
down to batches of work of around 1,000 cycles. Fine-grained
scheduling helps avoid load imbalance while
reducing the need for tuning workloads to particular
machines or inputs. We use per-core iteration counts
to distribute work initially, and a new asynchronous
request-combining technique to supply threads that need
more work. We present results using graph analytics algorithms
on a 2-socket Intel 64 machine (32 hardware contexts),
and on an 8-socket SPARC machine (1024 hardware
contexts). In addition to reducing the need for tuning, on
the SPARC machine we improve absolute performance
by up to 39% (compared with OpenMP). On both architectures
Callisto-RTS provides improved scaling and performance
compared with a state-of-the-art parallel runtime
system (Galois).
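
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of distributing a loop through per-thread iteration counters and claiming work in small batches. It is not the Callisto-RTS implementation: the names `parallel_for`, `claim`, and `count_visits` are hypothetical, Python locks stand in for whatever synchronization the runtime uses, and the asynchronous request-combining technique is not modelled at all (a thread that runs out of work here simply visits the other threads' counters in turn).

```python
import threading


def parallel_for(n_iters, n_threads, batch, body):
    """Run body(i) once for every i in range(n_iters) on n_threads workers."""
    # Split the iteration space evenly: slot t initially owns
    # the half-open range [next_iter[t], end_iter[t]).
    bounds = [n_iters * t // n_threads for t in range(n_threads + 1)]
    next_iter = bounds[:-1]   # per-thread "next unclaimed iteration" counters
    end_iter = bounds[1:]
    locks = [threading.Lock() for _ in range(n_threads)]

    def claim(slot):
        """Claim up to `batch` iterations from one slot, or None if it is empty."""
        with locks[slot]:
            lo = next_iter[slot]
            if lo >= end_iter[slot]:
                return None
            hi = min(lo + batch, end_iter[slot])
            next_iter[slot] = hi
            return lo, hi

    def worker(tid):
        # Drain the thread's own slot first, then help with the others.
        for k in range(n_threads):
            slot = (tid + k) % n_threads
            while True:
                rng = claim(slot)
                if rng is None:
                    break
                for i in range(rng[0], rng[1]):
                    body(i)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_visits(n_iters, n_threads, batch):
    """Illustrative check: return how many times each iteration was executed."""
    hits = [0] * n_iters

    def body(i):
        hits[i] += 1   # each index is claimed by exactly one thread

    parallel_for(n_iters, n_threads, batch, body)
    return hits
```

In this sketch a smaller `batch` gives finer-grained balancing at the cost of more frequent counter updates, which is the trade-off that a low-overhead runtime is meant to make cheap.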