Abstract:
------------
A new breed of low-latency I/O devices, such as emerging remote-memory-access devices
and high-speed Ethernet NICs, is becoming ubiquitous in today's data centers. For
example, large data center operators such as Amazon, Facebook, Google, and Microsoft are
already migrating their networks to 100 Gbps Ethernet. However, with these faster I/O
devices, the overhead incurred by system software, such as memory management and the
protocol stack, becomes the dominant cost.

To address these system software overheads, this thesis analyses system services such as memory management and protocol stacks, and makes the following contributions.

First, the thesis proposes a lazy, asynchronous mechanism to address the system software overhead incurred by synchronous TLB shootdowns. The key idea of the lazy shootdown mechanism, called LATR, is to use lazy memory reclamation and lazy page table unmapping to perform an asynchronous TLB shootdown. By handling TLB shootdowns in a lazy fashion, LATR eliminates the performance overhead of the IPI mechanism as well as the time spent waiting for acknowledgments from remote cores.

Second, the thesis proposes an extensible protocol stack, Xps, to address the software overhead incurred in protocol stacks such as TCP and UDP. Xps allows an application to specify its latency-sensitive operations and executes them inside the kernel and user-space protocol stacks, providing higher throughput and lower tail latency by avoiding the socket interface. For all other operations, Xps retains the popular, well-understood socket interface. In addition, the Xps abstraction is flexible enough to execute latency-sensitive operations even on an off-the-shelf smart NIC.

Third, the thesis analyses the overhead incurred on the leader node by consensus algorithms such as Multi-Paxos/Viewstamped Replication (VR). It then partitions the VR algorithm into the parts that execute on the smart NIC and those that execute on the host processor. This partitioning eliminates the consensus and recovery overhead on the leader node, which in turn reduces the latency of the consensus algorithm.
memory management and protocol stacks, and makes the following contributions: First, the thesis proposes a lazy, asynchronous mechanism to address the system software overhead incurred due to a synchronous TLB shootdown. The key idea of the lazy shootdown mechanism, called LATR , is to use lazy memory reclamation and lazy page table unmap to perform an asynchronous TLB shootdown. By handling TLB shootdowns in a lazy fashion, LATR can eliminate the performance overheads associated with IPI mechanisms as well as the waiting time for acknowledgments from remote cores. Second, the thesis proposes an extensible protocol stack to address the software overhead incurred in protocol stacks such as TCP and UDP. Xps allows an application to specify its latency-sensitive operations and executes them inside the kernel and user space protocol stacks, providing higher throughput and lower tail latency by avoiding the socket interface. For all other operations, Xps retains the popular, well-understood socket interface. In addition, Xps abstraction is flexible enough to even embody the latency-sensitive operations in a off-the-shelf smart NIC. Third, the thesis analyses the overhead incurred on the leader node for consensus algorithms such as Multi-Paxos/Viewstamp Replication(VR). In addition, it classifies the parts of the VR algorithm that will be executed on the Smart NIC and the host processor. With such a classification, the consensus and recovery overhead on the leader node is eliminated, which in turn reduces the latency of the consensus algorithm.