Geoffrey Wyman Blake

About the Event

Scaling processor performance with future technology nodes is essential to enable future applications for devices ranging from smart-phones to servers. But the traditional methods of achieving that performance through frequency scaling and single-core architectural enhancements are no longer viable due to fundamental scaling limits. To continue scaling performance, parallel computers in the form of Chip Multi-processors (CMPs) are now prevalent, moving the challenge of parallel programming from a niche to the general domain.
One challenging area is scalable synchronization to shared data structures using traditional methods. It can take many years for expert programmers using traditional methods to craft a scalable and correct scheme to synchronize access to data-structures in a complex program. Researchers have been searching for methods to make synchronization more tractable. One proposal is to use “Transactional Programming” to abstract synchronization to shared data structures as transactions in a similar fashion as database operations. Trans-
actional programming can be efﬁciently supported by using a “Transactional Memory” (TM) system.
One main problem with TM systems is scalability bottlenecks. When transactional applications are written to emulate future average programmer practices, performance can be worse than a single processor on large CMPs. This should not happen on a system meant to make programming easier. This happens because transactions as represented in the TM system may be dependent on each other—accessing the same data and therefore
must serialize—without the programmer being knowledgeable about these dependencies due to the abstraction hiding system details.
This thesis develops a hardware/software approach to alleviate scalability bottlenecks in TM systems, while maintaining the level of abstraction presented in transactional programming. I ﬁrst introduce “Proactive Transaction Scheduling” (PTS), a technique that proﬁles parallel code at runtime to determine orders transactions should execute in to maintain acceptable forward progress. I then propose using PTS to automatically determine transactions causing large amounts of serialization. These transactions are then accelerated using
an asymmetric CMP to get better performance. I also show PTS can be used to partition resources in a Multi-threaded processor core for better overall performance over a fair partitioning of resources.