Despite the success of instruction-level parallelism
(ILP) optimizations in increasing the performance of microprocessors,
certain codes remain elusive. In particular, codes containing
recursive data structure (RDS) traversal loops have been largely
immune to ILP optimizations, due to the fundamental serialization and
variable latency of the loop-carried dependence through a
pointer-chasing load. To address these and other situations, we
introduce decoupled software pipelining (DSWP), a technique that
statically splits a single-threaded sequential loop into multiple
non-speculative threads, each of which performs useful computation
essential for overall program correctness. The resulting threads
execute on thread-parallel architectures such as simultaneous
multithreaded (SMT) cores or chip multiprocessors (CMP), expose
additional instruction-level parallelism, and tolerate latency better
than the original single-threaded RDS loop. To reduce overhead, these
threads communicate using a synchronization array, a
dedicated hardware structure for pipelined inter-thread
communication. Used in conjunction with the synchronization array, DSWP
achieves speedups of 11% to 76% in the optimized functions on both
statically and dynamically scheduled processors.
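
As an illustration only, the following minimal sketch shows the kind of partition DSWP performs on a linked-list traversal loop. It is not the compiler transformation itself: the names (Node, build_list, sequential_sum, dswp_sum) and the loop body (summing squared values) are hypothetical, and a bounded software queue stands in for the hardware synchronization array. Because of Python's global interpreter lock, the sketch demonstrates the structure of the two threads rather than any speedup.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One element of a singly linked list (a simple RDS)."""
    value: int
    next: Optional["Node"] = None


def build_list(values):
    """Build a linked list holding `values` in order; returns the head."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def sequential_sum(head):
    """Original single-threaded loop: pointer chasing and work interleaved."""
    total = 0
    node = head
    while node is not None:
        total += node.value * node.value  # loop body (hypothetical work)
        node = node.next                  # loop-carried pointer-chasing load
    return total


_DONE = object()  # sentinel marking the end of the traversal


def dswp_sum(head, capacity=32):
    """Two-thread version: one thread traverses, the other computes.

    The bounded queue is an assumed software stand-in for the
    synchronization array; values flow in one direction only.
    """
    q = queue.Queue(maxsize=capacity)
    result = []

    def traverse():
        # Thread 1: only the pointer-chasing recurrence.
        node = head
        while node is not None:
            q.put(node)
            node = node.next
        q.put(_DONE)

    def compute():
        # Thread 2: only the per-node work, fed by thread 1.
        total = 0
        while True:
            node = q.get()
            if node is _DONE:
                break
            total += node.value * node.value
        result.append(total)

    threads = [threading.Thread(target=traverse),
               threading.Thread(target=compute)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

Both threads are non-speculative: every node the traversal thread enqueues is consumed by the computation thread, and the dependence between them is acyclic, so the traversal thread can run ahead by up to `capacity` nodes while the computation thread absorbs variable latency.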