Parallel shared-memory machines with hundreds or thousands of processor-memory
nodes have been built; in the future we will see machines with millions or
even billions of nodes. Such large systems bring with them a new set of
design challenges. An architecture must address many problems in order to be
successful; of these, we focus on three in particular.
First, a scalable memory system is required. Second, the network messaging
protocol must be fault-tolerant. Third, the overheads of thread creation,
thread management, and synchronization must be extremely low.
This thesis presents the complete system design for Hamal, a shared-memory
architecture that addresses these concerns and is directly scalable to one
million nodes. Virtual memory and distributed objects are implemented in a
manner that requires neither inter-node synchronization nor the storage of
globally coherent translations at each node. We develop a lightweight
fault-tolerant messaging protocol that guarantees message delivery and
idempotence across a discarding network. A number of hardware mechanisms
provide efficient support for massive multithreading and fine-grained
synchronization.
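
As a concrete illustration of how translation can remain purely local,
consider a global virtual address that embeds the ID of its home node; any
node can then locate a reference arithmetically, with no globally coherent
table to maintain. The field widths and names below are illustrative
assumptions, not Hamal's actual address format.

    #include <stdint.h>

    /* Hypothetical 64-bit global virtual address with the owning node's
     * ID embedded in the high bits; field widths are assumptions, not
     * Hamal's actual format. */
    #define OFFSET_BITS 44
    #define NODE_BITS   20                 /* 2^20 = ~1M nodes */

    static inline uint32_t home_node(uint64_t vaddr) {
        return (uint32_t)(vaddr >> OFFSET_BITS);
    }

    static inline uint64_t local_offset(uint64_t vaddr) {
        return vaddr & ((1ULL << OFFSET_BITS) - 1);
    }

In such a scheme each node translates only the addresses it owns and forwards
remote references to home_node(vaddr), so no inter-node synchronization is
needed to keep translations coherent.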
Experiments are conducted in simulation, using a trace-driven network
simulator to investigate the messaging protocol and a cycle-accurate simulator
to evaluate the Hamal architecture. We determine implementation parameters
for the messaging protocol that optimize performance. A discarding network
is easier to design and can be clocked at a higher rate; we find that, with
this protocol, its performance can approach that of a non-discarding network.
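
To make the delivery and idempotence guarantees concrete, the following
sketch shows one standard realization: the sender retransmits each message
until it is acknowledged, and the receiver filters duplicates with per-sender
sequence numbers. The structure, names, and stop-and-wait simplification are
assumptions for illustration, not the exact protocol developed in this thesis.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical message header for an idempotent stop-and-wait
     * protocol over a discarding network. */
    typedef struct {
        uint32_t src;   /* sending node */
        uint32_t seq;   /* per-sender sequence number */
        /* ... payload ... */
    } message_t;

    #define MAX_NODES 1024
    static uint32_t next_expected[MAX_NODES];   /* receiver-side state */

    /* Execute each message at most once, but acknowledge every copy:
     * a retransmitted duplicate is re-acked without being re-executed.
     * Returns true if the message should be passed to its handler. */
    bool on_receive(const message_t *m) {
        bool fresh = (m->seq == next_expected[m->src]);
        if (fresh)
            next_expected[m->src]++;          /* deliver exactly once */
        /* send_ack(m->src, m->seq); */       /* ack in either case */
        return fresh;
    }

The sender buffers each unacknowledged message and retransmits on timeout;
together with the duplicate filter above, this yields guaranteed delivery and
idempotence provided the network eventually gets one copy of each message
through.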
Our simulations of Hamal demonstrate the effectiveness of its thread
management and synchronization primitives. In particular, we find
register-based synchronization to be an extremely efficient mechanism that
can be used to implement a software barrier with a latency of only 523 cycles
on a 512-node machine.
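
To indicate how register-based synchronization can yield such low barrier
latency, the sketch below models a butterfly barrier in which each node
stalls on an empty register until a remote write fills it; over 512 nodes
this takes log2(512) = 9 rounds of single-register traffic. The primitives
sync_read and sync_write are hypothetical stand-ins for the hardware
mechanism, not Hamal's actual ISA.

    #include <stdint.h>

    /* Hypothetical register-based synchronization primitives: reading
     * an empty register stalls the thread until a remote write fills
     * it; reading also re-empties the register for the next barrier. */
    extern void     sync_write(int node, int reg, uint64_t v);
    extern uint64_t sync_read(int node, int reg);

    #define LEVELS 9    /* log2(512) rounds for 512 nodes */

    /* Butterfly barrier for nodes 0..511: at each level, a node signals
     * its partner's register and stalls on its own, so total latency is
     * roughly LEVELS times the per-hop write latency. */
    void barrier(int self) {
        for (int level = 0; level < LEVELS; level++) {
            int partner = self ^ (1 << level);
            sync_write(partner, level, 1);   /* fill partner's register */
            (void)sync_read(self, level);    /* stall until partner writes */
        }
    }

Because each round costs only one remote register write and one stalled read,
the barrier's latency scales with the logarithm of the machine size rather
than with the node count, which is consistent with a latency of a few hundred
cycles on 512 nodes.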