Are we beyond the trough of disillusionment?

Transactional memory has not been immune to the Gartner hype cycle. Nearly two years ago Ali-Reza Adl-Tabatabai from Intel reminded me of the hype cycle, and we both commiserated that we appeared to be at the “Peak of Inflated Expectations”. We had both just completed a week at the ACM’s PPoPP, and we were at what we viewed as the climax of the week, the Transact ’08 workshop. Every day that week we heard research about transactional memory, above and beyond the entire day the workshop dedicated to the technology. There was so much enthusiasm about its promise that it seemed like every university was dedicating significant research effort to it. It was an exciting time.

Yet this was alarming to both of us, since we had been working for commercial companies on implementations of transactional memory and both of us had already gone through our own personal hype cycles. I think I can honestly say that Ali was reminding me of the hype cycle because he was worried about the impact of the “trough of disillusionment”. The higher the enthusiasm, the harder the fall feels when you dip into the trough. It felt like we were flying pretty high at that conference.

Into the Trough of Disillusionment

Looking back, maybe we were already beyond that peak but didn’t know it. Later that year both ACM Queue and the Communications of the ACM published critical articles on software transactional memory (STM). One critic with some rather damaging evidence was Calin Cascaval, who was published in both venues, calling STM a research toy in CACM and in Queue. The harshest terms were left to Bryan Cantrill who, while polite in print, resorted to name-calling on his blog. (I tried not to take it personally since I have never met the man; I prefer civil discourse.)

The economy then helped accelerate the fall into the trough of disillusionment with what appears to be Sun’s decision to abandon the Rock processor. (The decision must have been made months before the day the Times reported it.) Lastly, looking at the pace of research today, we are not seeing the same number of papers published on STM as in the past.

My Personal Journey through the Hype Cycle

I can talk about my own “hype cycle”: how I expected transactional memory to be a pervasive solution, and then, as I gained more experience with it, learned that it is no silver bullet and only one piece of the puzzle.

Parallel programming is hard. Transactional memory is just a tool; by itself it doesn’t make parallel programming any easier, but in the hands of skilled programmers, and with tools such as Parallel Extensions or the Concurrency Runtime, it can appreciably enhance productivity. Parallel programming still requires the ability to decompose your work into logical pieces that can be run in parallel; you should still strive to reduce the amount of shared state; and the more performance you try to squeeze from hardware, the more you need to understand its memory model.
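To make the decomposition point concrete, here is a minimal sketch in Java (the tools I mention are .NET-oriented, but the idea is language-neutral; the word-count task and all names here are mine, purely for illustration). Each task touches only its own input, and the only shared step is the sequential combine at the end:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Decomposition sketch: count words across many documents with
// independent tasks. Each task reads only its own document, so there is
// no shared mutable state until the final, sequential combine.
class WordCount {
    static int countAll(List<String> docs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Integer>> parts = pool.invokeAll(
                    docs.stream()
                        .map(d -> (Callable<Integer>) () -> d.split("\\s+").length)
                        .collect(Collectors.toList()));
            int total = 0;
            for (Future<Integer> f : parts) total += f.get(); // combine results
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```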

I came to this conclusion after writing some horrible parallel code with transactional memory that performed abysmally. I went after a problem naively, trying to see if it was solvable. It was, although I found I still had to be careful with the STM’s memory model and synchronize on an object boundary. Even so, my application performed worse than serialized code, but it was correct. That is the inherent value of transactions: correctness. As anyone experienced with database technology or on-line transaction processing (OLTP) can attest, transactional actions can and often do impact performance.

Transactions underscore the inherent tension between performance and correctness.

My colleague Pat Helland, a major pioneer in the field of distributed transactions, denounced the two-phase commit protocol as an “anti-availability protocol”. While some may think he dislikes transactions, he was just dramatically describing the dangers inherent in this technology. Transactions create a dependency tree, and as your application develops more dependencies, it becomes more fragile. This is especially true if these dependencies are between components in different recovery domains. To put it another way, the reliability of a transactional system is the intersection of the availability of all its participants. If the availability of two participants rarely overlaps, the likelihood of that transaction succeeding is low. This is a hard problem that requires a lot of thought and compromise. Jim Johnson, others, and I worked on gluing various transaction management domains together, and now Windows has transactions available from managed code down to the kernel.
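A back-of-the-envelope way to see that intersection effect (my numbers, purely illustrative):

```latex
% If participant i is up with independent probability A_i, a transaction
% spanning all n participants can only complete when every one of them
% is up at once:
\[
A_{\mathrm{txn}} = \prod_{i=1}^{n} A_i
\qquad \text{e.g.} \qquad
0.99^{5} \approx 0.951 .
\]
% Five participants that are each 99% available yield a transaction that
% can succeed only about 95% of the time.
```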

When you apply this knowledge to transactional memory, the discussion changes only slightly. Availability is not the Achilles’ heel of transactional memory, since the typical TM system’s memory is shared across one or many CPUs within a single computer and is available so long as the computer is running. So availability in TM is replaced by contention. The transaction is broken not by a participant leaving or voting “no” but by participating memory being incompatibly accessed by two different transactions. The larger the set of memory involved in a transaction (the read-set), the greater the likelihood of contention between that transaction and other transactions. Again, the more work you put under transactions, the higher the performance impact you accept.
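A toy model of why read-set size matters (my simplification, not taken from any of the papers discussed here):

```latex
% Suppose each location in the read-set is concurrently written, during
% the transaction's window, with independent probability p. A read-set
% of n locations then commits without conflict with probability
\[
P(\text{no conflict}) = (1 - p)^{n} ,
\]
% which decays toward zero as n grows: with p = 0.01, n = 10 gives
% roughly 0.90, while n = 500 gives roughly 0.007.
```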

What is the real problem?

The optimist in me is not too worried about incompatible accesses by “well-written” applications. There are well-documented studies showing that locks are not always necessary. So what do well-written applications do? They make sure that their time under lock is small and that there is little or no data sharing. That mirrors the OLTP and database-transaction guidance to keep your transactions as small as possible, and the observation that a successful transactional system is one where all participants are reliably available; in STM’s case, one with little or no contention.

Those who come out of the compiler world tend to focus on a transaction’s inherent overhead: it must do at least two stores for every single store, and it likely does more than one read for every read. This pervasive overhead is what critics point to when they call transactional memory a “research toy” or worse, hopeless. But they are critics, and pessimistic about this technology; personally, I think they are focusing on the wrong problem.
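To see where that multiplier comes from, here is a deliberately simplistic sketch of STM read and write barriers (my own toy bookkeeping, not any real library’s API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy word-based STM barriers. Every transactional store becomes at
// least two stores (one into the log now, one copy-back at commit), and
// every transactional load pays for extra bookkeeping.
final class ToyTransaction {
    private final Map<Object, Object> writeSet = new HashMap<>(); // redo log
    private final Map<Object, Long> readSet = new HashMap<>();    // versions seen

    void write(Object location, Object value) {
        writeSet.put(location, value); // store #1: buffer the speculative value
        // ...commit later performs store #2, copying it to the real location.
    }

    Object read(Object location, Object currentValue, long version) {
        Object buffered = writeSet.get(location); // extra lookup on every read
        if (buffered != null) return buffered;    // read-your-own-writes
        readSet.put(location, version);           // bookkeeping write for validation
        return currentValue;
    }
}
```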

Databases have the same issues; from a naïve examination, they have to do two writes for every write, and yet they are the pervasive solution for multi-user access to durable data. How is it that transactions are “OK” for durable media and not “OK” for volatile memory? The expense of reads and writes on spinning media should make this inherent problem much worse for databases than for volatile memory.

Database developers have worked for many years to batch up their writes, reduce the number of seeks the disk head must take, and create innovative logging techniques, all for the stated purpose of improving performance. Compilers can do the same. They can determine what data is “local” and does not need to be instrumented, and perform static or dynamic analyses that detect that specific variables are only accessed in transactions or that specific transactions work on disjoint sets of memory. Basically, the tricks the database community has been developing for the past thirty-ish years probably have parallels that compilers can exploit. I alluded to this in our Ch9 interview last year.
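A hypothetical example of the kind of accesses such analyses could leave uninstrumented (the code and names are mine, for illustration only):

```java
class Account { int balance; }

class BarrierElision {
    // Imagine this method runs inside an atomic block. A TM compiler
    // could prove which of these accesses never need barriers.
    int withdraw(Account shared, int amount) {
        int fee = amount / 100;         // stack-local: no barrier needed
        int[] scratch = new int[8];     // allocated inside this transaction,
        scratch[0] = fee;               // not yet visible to other threads:
                                        // stores to it need no instrumentation
        shared.balance -= amount + fee; // escapes the transaction: full barrier
        return fee;
    }
}
```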

Further, the critics are focusing on the negative. If you create micro-benchmarks that only measure time under synchronization, the serial performance of transactional memory will be distressing. But the real measure is how fast the overall application is and how much it speeds up when more processors are available. In short: how well does it scale?

Users of transactional memory can take a page out of the database programming manual and keep their atomic regions as small as possible. This is interesting from a performance point of view. If the amount of time your application spends inside a transactional-memory block is small (so small, in fact, that the serial cost of executing that block does not impact overall application performance), then the serial overhead of transactional memory is less of a concern. When combined with real-world blocking events such as I/O, it might not even be a measurable cost.
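A sketch of what that advice looks like in code, using a hypothetical atomic(...) helper as a stand-in for an STM’s atomic block (not a real API):

```java
import java.util.ArrayList;
import java.util.List;

class SmallTransactions {
    // Stand-in for an STM atomic block; here it simply runs the action,
    // which is enough to show the shape of the two versions.
    static void atomic(Runnable action) { action.run(); }

    static String expensiveComputation(String input) { return input.toUpperCase(); }

    static final List<String> results = new ArrayList<>();

    static void wasteful(String input) {
        // The whole computation sits inside the transaction: a large
        // read-set and a long window in which conflicts can occur.
        atomic(() -> results.add(expensiveComputation(input)));
    }

    static void better(String input) {
        // Compute privately first, then transact only the tiny shared update.
        String r = expensiveComputation(input);
        atomic(() -> results.add(r));
    }
}
```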

The real advantage of transactional memory is then expressed in the form of programmer productivity. Instead of a grand replacement of locks with transactions, the goal is to enable application architectures that scale without needing complex locking mechanisms. Keep what little shared state you have managed by the transaction and free yourself from non-deterministic deadlocks.
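The classic illustration of that freedom is the two-account transfer (again with a hypothetical atomic(...) stand-in):

```java
class Account { int balance; }

class Transfer {
    // With locks, every caller must agree on a global lock order:
    // transferLocked(a, b) and transferLocked(b, a) running on two
    // threads can each take one lock and wait forever for the other.
    static void transferLocked(Account from, Account to, int amount) {
        synchronized (from) {
            synchronized (to) { // deadlock if another thread nests (to, from)
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    // With an atomic block there is no ordering to get wrong; a TM
    // detects the conflict and retries one transaction instead of
    // deadlocking.
    static void transferAtomic(Account from, Account to, int amount) {
        atomic(() -> {
            from.balance -= amount;
            to.balance += amount;
        });
    }

    static void atomic(Runnable action) { action.run(); } // stand-in only
}
```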

Demonstrated Productivity and Ease of Use

This observation was recently underscored by multiple research reports. The first, “Transactional Memory versus Locks – A Comparative Case Study”, was delivered at ICSE 2009, where a research team tested the usability and scalability of transactional memory. In what I consider a groundbreaking report on STM usability, researchers at the University of Karlsruhe took twelve students and had them create a parallel desktop search engine using Intel’s STM C compiler. Three teams worked with traditional locks and the other three teams used STM. Their results demonstrated that STM is usable, produced more maintainable code, and produced an application that scales: the best TM application was more than three times faster than the best application built with traditional synchronization mechanisms. The STM teams developed their solutions in less time, their code was judged to be more readable, and the majority of their time was spent in sequential code, focusing on the real problem they were solving rather than on the mechanisms of parallelism and shared state. To put this in clearer terms:

The value of STM is that it allows you to focus on designing applications that happen to scale, instead of on the mechanisms employed to scale those applications.

This is the promise of STM: not that it makes your application scale, but that it frees you from worrying about lock ordering, or even from building a system that imposes some lock hierarchy. Instead, you focus on your problem and make it scalable. STM is simply a synchronization tool, one that helps you get your work done without too much effort.

The second report is from WDDD 2009. I was unfamiliar with that venue; it’s the Eighth Annual Workshop on Duplicating, Deconstructing, and Debunking, held at ISCA (the International Symposium on Computer Architecture). Honestly, that sounds like a really hard audience, and I would love to have heard this talk. The UT Austin team asked, “Is Transactional Programming Actually Easier?” So what do you think they found?

Now this isn’t just a talk, it’s a full paper, and its results are impressive and very defensible. The team ran a “user-study in which 147 undergraduate students in an operating systems course implemented the same programs using coarse and fine-grain locks, monitors, and transactions.” They then made both quantitative and qualitative observations on the students’ work through code analysis and surveys. They did this over two years and documented all results.

The results are as I would have hoped. Some quotes from the paper (the bolding is mine):

· “Overwhelmingly, the number and types of programming errors the students made was much lower for transactions than for locks. On a similar programming problem, over 70% of students made errors with fine-grained locking, while less than 10% made errors with transactions.”

· “…we found that coarse locks and transactions required less time than fine-grain locks on the more complex two-lane assignments. This echoes the promise of transactions, removing the coding and debugging complexity of fine-grain locking and lock ordering when more than one lock is required.”

· “…transactional programming really is less error-prone than high-performance locking, even if newbie programmers have some trouble understanding transactions…. for similar programming tasks, transactions are considerably easier to get correct than locks.”

To be fair, the students thought coarse-grained locks were the easiest to use. But to judge that observation fairly, note that the Austin team was also using a more arcane STM implementation than the Karlsruhe study; they commented that the second year’s usability scores improved when they adopted a different STM library. The easy take-away is that STM delivers on its promise of safe parallelism. The more implicit result is that while both fine-grained locks and TM provide scalability, fine-grained locking was hard to do and error-prone. I infer the latter because they did not present scalability or performance benchmarks for the students’ work; but between these two studies, I think it is a viable conclusion.

Onward to Enlightenment

At the end of last year there were many detractors who, as Ali and I feared, significantly blunted the TM hype and plunged it into the “trough of disillusionment”. STM detractors argue that the serial cost of transactional memory is too high for it to ever be more than a research toy. As I discuss here, the serial cost is the least important part of the equation; it is building a scalable application that you need to focus on. If STM can free you from worrying about correct execution in the very few places you need to synchronize, then this productivity tool helps you scale your application. If you use it gratuitously, such that the serial performance costs of STM are noticeable, then it is not helping you. In that case, I will argue, you need to rethink how you are using it and likely change your code to make it scalable; the problem is not STM but that you are using it incorrectly.

Does STM help you create this scalable architecture? I don’t think so. It does not impose a “well-disciplined” pattern on your work; you have to gain that discipline through other means. What it provides instead is a tool to avoid deadlock and to easily reason about the correct behavior of the small amount of code that touches shared state.

What I am hoping is that research such as that presented by UT Austin and the University of Karlsruhe signals an exit from the trough, and that we as an industry can now focus on delivering this technology rather than on addressing hype. Their students used STM to create scalable, reliable applications. With less effort, these students produced more readable and maintainable code that scaled better than the code of students who were forced to use traditional synchronization methods.

I encourage you to read the paper from WDDD and review the slides from ICSE. Is there actionable guidance there as well? I think one area practitioners of STM should investigate is how to make its programming model more approachable, understandable, and maybe even prescriptive in its use. I wonder what that looks like.

What do you think? Are we coming up onto the “Slope of Enlightenment”?

There is one problem with STM that bothers me, and authors of STM papers usually do not address it: scalability.

The problem is not "serial cost"; the problem is inherent centralization and excessive synchronization, both of which are incompatible with scalability.

As you said, the first thing a developer must do is properly design the application to reduce synchronization to a minimum. If that is possible, then it actually does not matter which implementation technique (locks, STM) is used (note that 19 out of 20 uses of locks are not subject to deadlock). Yes, there are some corner cases where we have a low rate of accesses to shared state AND complicated interaction patterns (maybe an application shutdown process), and there STM has value.

In many situations it is just impossible to reduce synchronization. Consider a server application running on 16 cores, where each worker thread has to do some memory management, object lifetime management, logging, statistics, and reads/modifications of shared data structures (hash maps, queues, etc.); such activities are inherently "shared".

The problem here is that current STM implementations force modification operations on different objects to implicitly contend with each other; read operations that would otherwise be basically costless are turned into write operations (modifications of internal STM state).

On the first graph you can see that STM provides lower performance than a lock on 1 thread and then basically does not scale.

On the second graph you can see that STM on 16 threads does not even catch up with a lock on 1 thread.

Not to mention that a fair benchmark of such a thing must be against the abstract linear scaling of a single-threaded implementation without a lock; if such a graph were provided, STM would look totally ridiculous.

And from the names they use, I conclude that their implementation is based on TL2, which is quite state-of-the-art.

Even one atomic RMW on a global variable per transaction, as in advanced STMs (RingSTM), destroys scalability on today's x86 hardware.

Well, maybe you use some novel algorithms that get around the problem. I don't know; there is no material and no benchmarks here.

Taking into account all of the above, I do not see a bright future for STM. This does not relate to HTM and HyTM, though. But it is unclear when we will see HTM on commodity hardware. What will become of Sun's Rock is unclear now, and AMD's ASF is not directly intended to work in HyTM (it's just a kind of advanced and flexible "CAS").

Can you provide any comments on the above? I would like to be wrong, and for you to parry my arguments.

The question was initially meant for Dave Detlefs, but I did not have time to ask it then. It would be great if he would provide his opinion too.

First, for the Deuce STM references: with respect to the left graph, you say "STM provides lower performance than a lock on 1 thread and then basically does not scale." Well, it doesn't scale *linearly*, I guess, but the throughput does increase pretty dramatically between 20 threads and 150. (Though this confuses me; I'll be surprised if they had a machine with 150+ hardware threads to run it on, so it's not clear why it should scale at all at these numbers of threads…)

Let's look at the right-hand graph, which has numbers of threads in ranges I can handle. While it is indeed true that this shows the throughput of the various STMs not reaching the 1-thread lock-based throughput at 16 threads, I would claim that this is not because it doesn't scale well, but precisely because of the serial overhead. The slopes of the TM curves show pretty decent parallel scalability; TM doesn't catch up because the difference in 1-thread serial performance is so darn large.

Now, it may or may not be the case that, were the serial overhead smaller, it would scale in the same way. But it's pretty clear that locks do not scale; TM at least *might*.

You say

The problem here is that current STM implementations force modification operations on different objects to implicitly contend with each other; read operations that would otherwise be basically costless are turned into write operations (modifications of internal STM state).

I'd say that this is mistaken. It is indeed generally true that the STM implementation of a read involves some write operations, but these "modifications of internal STM state" are generally to *thread-local* data structures, and need not cause cache conflicts with reads of the same data by transactions in other threads. In fact, this is one of the main promises of TM: if distinct transactions don't conflict on the data they access, you can implement it in such a way that the underlying locking doesn't (or at least usually doesn't) involve contention at the cache-line level (which, as you correctly note, kills scalability).
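To make that concrete, here is a minimal sketch of a TL2-flavored read barrier (heavily simplified from the published algorithm; the names are mine). Note where the writes go: the read-set is thread-local, so instrumenting a read writes nothing that other threads' caches ever see.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Simplified TL2-style read barrier, for illustration only.
final class TL2Read {
    static final AtomicLong globalClock = new AtomicLong(); // the one shared word

    static final class VersionedBox {
        volatile long version;
        volatile Object value;
    }

    final long readVersion = globalClock.get();           // one shared read at begin
    final List<VersionedBox> readSet = new ArrayList<>(); // thread-local log

    Object read(VersionedBox box) {
        long v1 = box.version;
        Object val = box.value;
        long v2 = box.version;
        if (v1 != v2 || v1 > readVersion)   // a writer raced us: abort and retry
            throw new IllegalStateException("conflict, retry transaction");
        readSet.add(box);                   // a write, but to thread-local state
        return val;
    }
}
```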

A TM like TL2 does not quite achieve this ideal; there’s a shared global "timestamp" counter that writers write and readers read. For workloads with many short transactions, that can hurt scalability. I will note that the Sun TM people are doing interesting work on avoiding this scalability bottleneck, and would also point you towards work by several of our Microsoft Research colleagues (Harris, et al), showing that it can be done in some contexts with *no* conflicts at all:

In many situations it is just impossible to reduce synchronization. Consider a server application running on 16 cores, where each worker thread has to do some memory management, object lifetime management, logging, statistics, and reads/modifications of shared data structures (hash maps, queues, etc.); such activities are inherently "shared".

In a managed world, the application doesn't usually have to do any memory or object lifetime management, and the GC and TM can be made to play together nicely. There are ways to do logging and statistics that do most things thread-locally and only occasionally synchronize. Reads and modifications of shared data structures are of course shared, but whether they *scale* depends on the data structure. It's possible to make a hash table scale pretty nicely. A queue is more problematic, by the nature of its semantics. But in many circumstances you don't require "precise" queue semantics; an approximate queue that gives "roughly FIFO" ordering may be adequate, and allow better scaling. So yes, sometimes you have to share, but that doesn't mean you can't scale.
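A sketch of one way such an approximate queue might look (my own toy construction, not a published algorithm): spread items across several sub-queues so producers and consumers rarely touch the same cache line, at the cost of only roughly-FIFO ordering.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ThreadLocalRandom;

// "Roughly FIFO" queue: each lane is FIFO on its own, but ordering
// across lanes is only approximate. Contention is spread over `width`
// independent queues instead of one hot head/tail.
final class RoughFifoQueue<T> {
    private final Queue<T>[] lanes;

    @SuppressWarnings("unchecked")
    RoughFifoQueue(int width) {
        lanes = new Queue[width];
        for (int i = 0; i < width; i++) lanes[i] = new ConcurrentLinkedQueue<>();
    }

    void enqueue(T item) {
        lanes[ThreadLocalRandom.current().nextInt(lanes.length)].add(item);
    }

    T dequeue() {
        int start = ThreadLocalRandom.current().nextInt(lanes.length);
        for (int i = 0; i < lanes.length; i++) {
            T t = lanes[(start + i) % lanes.length].poll();
            if (t != null) return t;
        }
        return null; // empty, or racing consumers drained it
    }
}
```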

Does STM have a bright future? Dan Grossman at the University of Washington makes an analogy with garbage collection, and I think it's a good one: would you have predicted that GC would have a bright future in, say, 1985? Probably not, but it did. It turned out that there was cleverness to be had that people hadn't thought of yet. The same might still be true of software transactional memory techniques.

First, for the Deuce STM references: with respect to the left graph, you say "STM provides lower performance than a lock on 1 thread and then basically does not scale." Well, it doesn't scale *linearly*, I guess, but the throughput does increase pretty dramatically between 20 threads and 150. (Though this confuses me; I'll be surprised if they had a machine with 150+ hardware threads to run it on, so it's not clear why it should scale at all at these numbers of threads…)

————————————————————

They cheat with the axis scales; the STM achieves only a 4x speedup on 200 threads. If you imagine that line on a linear scale, it is basically horizontal! Well, for me that does not scale; at the least, I won't pay for a 200-processor beast to get a 4x speedup. I would expect that kind of speedup from a 4-6 core single-processor desktop.

Since it's Java, I think they used some big Azul machine, like the Azul Vega 3 3300:

Let's look at the right-hand graph, which has numbers of threads in ranges I can handle. While it is indeed true that this shows the throughput of the various STMs not reaching the 1-thread lock-based throughput at 16 threads, I would claim that this is not because it doesn't scale well, but precisely because of the serial overhead. The slopes of the TM curves show pretty decent parallel scalability; TM doesn't catch up because the difference in 1-thread serial performance is so darn large.

Now, it may or may not be the case that, were the serial overhead smaller, it would scale in the same way. But it's pretty clear that locks do not scale; TM at least *might*.

——————————————————————

Well, yes, I think that single-threaded overheads are dominating here, i.e. the scalability problems are just masked by single-threaded overheads. You know, it's possible to make any program scale perfectly by inserting empty loops between all statements in the program. The program will be damn slow, but it will scale linearly, because single-threaded overheads scale perfectly 🙂

You may notice that the implementation does not scale linearly; it achieves only a ~10x speedup on 16 threads. And you may notice the graph dropping off starting at 12 threads. If performance starts dropping off, it is usually a progressive process. So I am pretty sure that if you reduce the single-threaded overheads and/or move to a higher number of threads, the scalability problems will start dominating. So it may not be the case that STM catches up with the single-threaded implementation at any number of threads.

In such cases we end up saying "create the problems, solve the problems, live on the difference" 🙂

The whole point of concurrency is performance. First we tackled concurrency with raw multi-threading; it was efficient but hard. Then STM made it simple again, and slow again. So we are basically back where we started: simple and slow 🙂

I'd say that this is mistaken. It is indeed generally true that the STM implementation of a read involves some write operations, but these "modifications of internal STM state" are generally to *thread-local* data structures, and need not cause cache conflicts with reads of the same data by transactions in other threads. In fact, this is one of the main promises of TM: if distinct transactions don't conflict on the data they access, you can implement it in such a way that the underlying locking doesn't (or at least usually doesn't) involve contention at the cache-line level (which, as you correctly note, kills scalability).

A TM like TL2 does not quite achieve this ideal; there’s a shared global "timestamp" counter that writers write and readers read. For workloads with many short transactions, that can hurt scalability. I will note that the Sun TM people are doing interesting work on avoiding this scalability bottleneck, and would also point you towards work by several of our Microsoft Research colleagues (Harris, et al), showing that it can be done in some contexts with *no* conflicts at all:

——————————————————————

Well, one may choose between (1) read operations mutating shared state, or (2) a high retry rate (down to livelock).

I think TL2's reads load the global timestamp for good reason; otherwise, single-threaded overheads would be even higher for transactions with large read-sets.

Intel's STM dynamically chooses between speculative reads and pessimistic reads, but it's still (1) or (2).

Though I still have to read the paper you cited; probably I am missing something. Thank you for the link. In IT, one loses qualification even over a coffee break 🙂 And, yes, I see that the paper is dated 2006 🙂

For example, how can STM efficiently handle iteration over a lengthy linked list with concurrent mutations?

Yes, I can redesign the application or use something like AMD ASF's 'release' operation. But doesn't this defeat the whole point of simplicity and productivity? I guess that neither way will be fast and easy.

Btw, I proposed incorporating a 'release' statement (see http://developer.amd.com/assets/45432-ASF_Spec_2.1.pdf) into Intel's STM, but they rejected the proposal. In your manual I see that you mention 'open nesting' and 'abstract nested transactions' regarding the linked-list problem. Are you going to incorporate something that solves the linked-list problem? Btw, does open nesting solve the problem? I'm not sure how…

In a managed world, the application doesn't usually have to do any memory or object lifetime management, and the GC and TM can be made to play together nicely. There are ways to do logging and statistics that do most things thread-locally and only occasionally synchronize. Reads and modifications of shared data structures are of course shared, but whether they *scale* depends on the data structure. It's possible to make a hash table scale pretty nicely. A queue is more problematic, by the nature of its semantics. But in many circumstances you don't require "precise" queue semantics; an approximate queue that gives "roughly FIFO" ordering may be adequate, and allow better scaling. So yes, sometimes you have to share, but that doesn't mean you can't scale.

——————————————————————

But (1) I would not call that easier than careful fine-grained locking at all, and (2) I am not sure whether STM is a good tool for that; currently I use low-level atomics with fine-grained fences, plus a bit of OS-dependent stuff, plus a dash of hardware-dependent stuff.

Does STM have a bright future? Dan Grossman at the University of Washington makes an analogy with garbage collection, and I think it's a good one: would you have predicted that GC would have a bright future in, say, 1985? Probably not, but it did. It turned out that there was cleverness to be had that people hadn't thought of yet. The same might still be true of software transactional memory techniques.

———————————————————————-

I am not saying anything about the 10-25 year future; I do not care about that. Probably by then STM will be backed by HTM, so there will be no problem of instrumentation overheads and read scalability. Or maybe everyone will be using auto-parallelization, so that nobody will care about such things as mutexes, threads, and STM.

What really matters in the TM vs. locks vs. X debate is the programming interface. Atomic blocks don't specify how synchronization is implemented. This makes it a tough problem for the TM library and compiler, but at the same time it offers plenty of flexibility in how the TM and compiler actually implement it. This is an optimization problem that is largely decoupled from the composition of application code. It is a lot harder to optimize with locking (unless the application is trivial), especially if you do not know the workload a priori. It is hard to believe that every application can have built-in tuning for every possible synchronization workload. It is more reasonable to try this in an STM, because you build the TM and compiler once and then use them in several applications.

And I think this is what Nir Shavit was referring to. STM research doesn't need to care that much about expert programmers (e.g., people working on OS kernels such as Solaris'), as they can spend a lot of time on their code and are highly qualified. But locking will be hard for larger teams of average programmers who develop applications with nontrivial synchronization and who do not want to, or cannot, spend lots of development time on synchronization.

So TM is, IMO, mostly about enabling a certain style of programming (atomic blocks for synchronization) that decreases development costs for average programmers and ordinary applications. I don't see anything wrong with trying to accomplish this. Additionally, we actually do learn a lot about synchronization (except when we just recycle old DB ideas :), and we do learn more about how failure atomicity and concurrency relate to each other in non-database settings (@Dana: do you remember the discussions at Transact ’08?).

Some more detailed comments:

The shared counter in time-based TMs is not a scalability problem, because you can use hardware clocks instead (many current machines already have suitable hardware clocks). For details, see

One possible approach to solving the "linked-list problem" is multi-level concurrency control, which database researchers investigated long before the TM hype. Transactional boosting is essentially multi-level concurrency control applied to a TM scenario with fast base objects (custom concurrent algorithms for data structures).
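A minimal sketch of the boosting idea (my own simplification of Herlihy and Koskinen's construction; there is no deadlock handling on the abstract locks, and commit/abort are shown single-threaded):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;

// Boosted set: synchronize on abstract operations (per-key locks) rather
// than on memory words, and log inverse operations for rollback. Two
// transactions conflict only if they touch the same key, not the same
// internal hash-table memory.
final class BoostedSet<K> {
    private final ConcurrentHashMap<K, Boolean> set = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<K, Thread> abstractLocks = new ConcurrentHashMap<>();
    private final Deque<Runnable> undoLog = new ArrayDeque<>();

    private void lockKey(K key) {
        // Spin until this transaction owns the abstract lock for the key
        // (re-entrant for the owning thread).
        while (abstractLocks.putIfAbsent(key, Thread.currentThread()) != null
                && abstractLocks.get(key) != Thread.currentThread()) {
            Thread.onSpinWait(); // real boosting would add deadlock detection
        }
    }

    boolean add(K key) {
        lockKey(key);
        boolean added = set.putIfAbsent(key, Boolean.TRUE) == null;
        if (added) undoLog.push(() -> set.remove(key)); // log inverse operation
        return added;
    }

    void abort()  { while (!undoLog.isEmpty()) undoLog.pop().run(); release(); }
    void commit() { undoLog.clear(); release(); }

    private void release() {
        abstractLocks.entrySet().removeIf(e -> e.getValue() == Thread.currentThread());
    }
}
```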

@Torvald is right: the programming model is much more important than performance for most programmers. From my experience, most of them are not even trying to handle concurrency issues and will not try to use anything but coarse-grained locking.

Having said that, on the other hand, the performance can't be dramatically low, even for a single thread.

In Deuce we have started working on this single-thread overhead, and we expect to reduce it dramatically, in ways very similar to how GC overhead was gradually reduced.

BTW, we used an Azul Vega 2 (96 threads) and a Sun Maramba (128 threads) for the big tests.