Contents

Location-Based Memory Fences

Traditional memory fences are program-counter (PC) based. That is, a memory fence enforces a serialization point in the program instruction stream --- it ensures that all memory references before the fence in program order have taken effect before execution continues to the instructions after the fence. Such PC-based memory fences always cause the processor to stall, even when the synchronization is unnecessary during a particular execution. We propose the concept of location-based memory fences, which aim to reduce the synchronization cost that the latency of memory-fence execution imposes on parallel algorithms.

Unlike a PC-based memory fence, a location-based memory fence serializes the instruction stream of the executing thread only when a different thread attempts to read the memory location guarded by the fence. In this work, we describe a hardware mechanism for location-based memory fences, prove its correctness, and evaluate its potential performance benefit. Our experimental results are based on a software simulation of the proposed location-based memory fence, which incurs higher overhead than the proposed hardware mechanism would. Even though applications using the software prototype do not scale as well as those using traditional memory fences, due to the software overhead, our experiments show that applications can benefit from using location-based memory fences. These results suggest that hardware support for location-based memory fences is worth considering.

Deterministic Parallel Random-Number Generation

Existing concurrency platforms for dynamic multithreading, such as Cilk and TBB, do not provide repeatable parallel random-number generators. We propose that a mechanism called pedigrees be built into the runtime system to enable efficient deterministic parallel random-number generation, and in this work we design an efficient variant of the pedigree mechanism. Experiments with the MIT Cilk runtime system show that the overhead of this mechanism is minimal: on a suite of 10 benchmarks, the geometric-mean overhead of Cilk with pedigrees relative to the original Cilk is 2%. We also explore library implementations of several deterministic parallel random-number generators that use these runtime mechanisms, based on a generalization of linear congruential generators, XOR'ing entries in a random table, SHA-1, and Feistel networks. Although these deterministic parallel random-number generators are 3 to 18 times slower per function call than a nondeterministic parallel version of the popular Mersenne twister, in practical applications that use random numbers, the additional overhead of using an efficient, high-quality DPRNG is relatively small.

Deterministic Nonassociative Reducers

Cilk++ and Intel Cilk Plus support a type of parallel data structure, called a reducer (hyperobject), which allows workers to perform update operations on the data structure in parallel. To accomplish this task, each worker maintains its own "view" of the reducer on which it performs its update operations. After the workers complete their parallel tasks, the views are combined using a "reduce" operation. If this reduce operation is associative, the data structure that results from combining the views has serial semantics --- the reducer resulting from any parallel execution is equivalent to the reducer resulting from a serial execution of the Cilk program. While reducers are useful for parallelizing many programs, for some data types, such as floating-point numbers under additive updates, no associative reduce operation is apparent. We propose that a runtime mechanism called pedigrees be built into the runtime system to support deterministic (nonassociative) reducers, a variant of reducers that guarantees that every execution of the program yields a reducer data structure with serial semantics. In particular, in this work we design deterministic reducers using a particular pedigree mechanism, with the goal of developing the most efficient design possible. Such deterministic reducers allow programmers to write parallel programs that use data types such as floating-point numbers and still behave deterministically, regardless of the underlying nondeterminism of the Cilk scheduler.

A Consistency Architecture for Hierarchical Shared Caches

Hierarchical Cache Consistency (HCC) is a scalable cache-consistency architecture for chip multiprocessors in which caches are shared hierarchically. HCC's cache-consistency protocol is embedded in the message-routing network that interconnects the caches, providing a distributed and scalable alternative to bus-based and directory-based consistency mechanisms. The HCC consistency protocol is "progressive" in that every message makes monotonic progress without timeouts, retries, negative acknowledgments, or retreating in any way. The latency is at most proportional to the diameter of the network. For HCC with a binary fat-tree network, the protocol requires at most 13 bits of additional state per cache line, no matter how large the system. We prove that the HCC protocol is deadlock-free and provides sequential consistency.