As hardware evolves, so do the needs of applications. To increase the performance of an application,
there exist two well-known approaches. These are scaling up an application, using a larger multi-core
platform, or scaling out, by distributing work to multiple machines. Both approaches are typically
implemented using the shared-memory and message-passing programming models.
In both programming models, and for a wide range of workloads, there is contention for
access to data, and applications need synchronization. On modern
multi-core platforms with tens of cores and network interconnects with sub-millisecond latencies,
synchronization bottlenecks can significantly limit the scalability of an application.
This dissertation studies the scalability of synchronization in shared-memory
settings and focuses on the changes brought upon by modern hardware advances in processors and network
interconnects. As we show, current approaches to understanding the scalability of applications, as
well as prevalent synchronization primitives, are not designed for modern workloads and hardware
platforms. We then propose methods and techniques that take advantage of hardware and workload
knowledge to better serve application needs.
We first look into better understanding the scalability of applications, using low-level performance
metrics in both hardware and software. We introduce ESTIMA, an easy-to-use tool for
extrapolating the scalability of in-memory applications. The key idea underlying ESTIMA is the use of
stalled cycles in both hardware (e.g., cycles spent waiting for cache line fetches or busy locks), as
well as software (e.g., cycles spent on synchronization, or on failed Software Transactional Memory
transactions). ESTIMA measures stalled cycles on a few cores and extrapolates them to more cores,
estimating the amount of waiting in the system.
We then focus on locking, a common way of synchronizing data accesses in a shared-memory setting.
Typically, contention for data limits the scalability of an application. A poor choice of locking
algorithm can further impact scalability. We introduce GLS, a middleware that makes lock-based
programming simple and effective. In contrast to classic lock libraries, GLS does not require any
effort from the programmer for allocating and initializing locks, nor for selecting the appropriate
locking strategy. GLS relies on GLK, a generic lock algorithm that dynamically adapts to the
contention level on the lock object, delivering the best performance among simple spinlocks, scalable
queue-based locks, and blocking locks.
Finally, we study the performance of distributed systems that offer transactions, by exposing
shared-memory semantics. Recently, such systems have seen renewed interest due to advancements in
networking hardware for datacenters. We show how such systems require significant manual tuning to
achieve good performance in a popular category of workloads (OLTP), and argue that doing so is
error-prone, as well as time-consuming, and it needs to be repeated when the workload or the hardware
change. We then introduce SPADE, a physical design tuner for OLTP workloads on modern RDMA clusters.
SPADE automatically decides on data partitioning, index and storage parameters, and the right mix of
direct remote data accesses and function shipping to maximize performance. To achieve this, SPADE
uses low-level hardware and network performance characteristics gathered through micro-benchmarks.