Technology scaling has reached a level of miniaturization at which multiple processor cores can be integrated onto the same die. During the last four decades, this scaling has been the primary driver of improved system performance, at the expense of higher temperatures and power densities. However, scaling down to deep submicron technologies gives rise to a new problem: unreliable silicon. Concerns about transistor reliability are increasing because the effects of process variation, transistor aging, electrical noise, and high temperatures grow stronger as transistor dimensions shrink. Consequently, industry projects that future chips will be exposed to large numbers of failures, and is researching fault-tolerant designs.
At the same time, the number of processor cores on a single chip is increasing steadily, and an efficient on-chip communication medium between them is necessary. Packet-switched on-chip networks have been gaining importance in this area because of their modularity and scalable bandwidth. However, due to extreme transistor scaling, these interconnection networks are expected to experience permanent defects and runtime failures in future technology generations. Moreover, a single failure in the network may cascade across several routers and ultimately interrupt network service. Hence, there is a growing need for resilient on-chip networks that can tolerate both permanent and runtime failures transparently to upper layers.
In this dissertation, we present a characterization study of network faults and a full-system solution to tackle them. Our characterization is conducted with an accurate circuit-level tool, which we developed to explore the impact of faults on the architecture. Specifically, we present a case study in which we pinpoint the common fault types in the network, their probabilities, and their architectural outcomes. In this way, we diagnose the vulnerable components of the interconnection network that need protection, and identify the fault types that resilient network architectures must address.
We then propose a resilient architecture that can tolerate both permanent and transient faults in the interconnection network. To address permanent network faults, which disable communication links and network routers, we propose a network architecture that can reconfigure at runtime and use its surviving network resources to enable continued chip operation. Our solution, Ariadne, explores the surviving topology upon each permanent failure and discovers resilient routes that connect the functional nodes. We also address transient network faults, which result in corrupted or lost coherence messages. We do so by developing a systematic methodology to incorporate resilience into the coherence protocol, so that, after a timeout, it resends lost or corrupted messages to replay the corresponding transaction.
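The idea of rediscovering routes over the surviving topology can be illustrated with a minimal sketch. It is not Ariadne's actual algorithm: it assumes a 2D mesh, uses a plain breadth-first search from one node, and ignores deadlock freedom and the distributed nature of a real reconfiguration. All names (`mesh_links`, `build_adjacency`, `next_hops`) are hypothetical and chosen only for illustration.

```python
from collections import deque


def mesh_links(width, height):
    """Return all bidirectional links of a width x height 2D mesh (assumed topology)."""
    links = set()
    for x in range(width):
        for y in range(height):
            if x + 1 < width:
                links.add(frozenset({(x, y), (x + 1, y)}))
            if y + 1 < height:
                links.add(frozenset({(x, y), (x, y + 1)}))
    return links


def build_adjacency(links, failed_links=(), failed_routers=()):
    """Keep only links that survive: not failed, and both endpoint routers functional."""
    failed_links = {frozenset(link) for link in failed_links}
    failed_routers = set(failed_routers)
    adjacency = {}
    for link in links:
        a, b = tuple(link)
        if link in failed_links or a in failed_routers or b in failed_routers:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def next_hops(source, adjacency):
    """Breadth-first search from source over the surviving topology.

    Returns a table mapping each reachable destination to the first hop
    on a shortest surviving path from source.
    """
    first_hop = {}
    visited = {source}
    queue = deque()
    for neighbor in sorted(adjacency.get(source, ())):
        visited.add(neighbor)
        first_hop[neighbor] = neighbor
        queue.append(neighbor)
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                first_hop[neighbor] = first_hop[node]
                queue.append(neighbor)
    return first_hop


# Example: a 3x3 mesh with one failed link and one failed router.
links = mesh_links(3, 3)
adjacency = build_adjacency(
    links,
    failed_links=[((0, 0), (1, 0))],
    failed_routers=[(1, 1)],
)
table = next_hops((0, 0), adjacency)
```

In this example, node (0, 0) can still reach node (1, 0), but only by routing around the failed link and the failed router; the failed router itself does not appear in the table. Rerunning the search after each new permanent failure yields an updated table for the remaining functional nodes.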
Overall, this dissertation argues that designing chips that never experience network failures will not be economically feasible in the future, because this would result in enormous performance degradation, as well as financial losses for chip vendors, since a large number of chips would not meet the required specifications during testing. Instead, we propose to continue exploiting transistor scaling to maintain the current rate of performance improvement, but tolerate failures, so that a chip can gracefully degrade its performance over time only after actual faults occur.