Incomplete Mitigation

So in my committee paper about transactional memory I mentioned that we had never run into starvation (i.e. having a long transaction that can never commit due to small transactions that continually cause it to have conflicts). Well today I found the first instance where starvation was happening. It wasn’t the end of the world, in this case it just meant that we would burn through calculations for all the measurements in one of our experiment files1 only to have them restart and be done a second time. This probably wouldn’t have ever been noticed but someone2 had the brilliant idea to put a calculation counter in the status bar of the application so that users can see it count down to zero and then pop back up to 500 having to watch it count down to zero a second time (thankfully it stopped after the second go around).

This was a classic case of transactional memory starvation. I had a transaction that needed to touch all the measurements multiple times while those measurements had calculations running that were also modifying the measurements in ways that would conflict with the long transaction. And the calculations, while not quick, were still much faster than the transaction that needed to touch everything. So until all the calculations were done the long transaction could not commit (this is the first countdown of the calculation counter), and then when it did finally commit it would change all the measurements in a way that caused them to need their analysis results to be calculated again (the pop back up to 500 of the counter).

When I designed our transactional memory system I had anticipated this problem and put in a mechanism to deal with it: running locked. When starting a transaction one can set a limit on how many conflicts are acceptable and if the limit is reached the transaction will grab a commit lock at the outset so that no other transaction can commit until this one is done. By commit lock I actually mean a boost::upgrade_lock on a boost::upgrade_mutex. Our system has a somewhat brain-dead implementation, everything is protected by a single boost::upgrade_mutex. When reading from a transactional variable for the first time in a transaction we get a shared_lock (subsequent reads of the same variable will find the value in the transaction’s read cache). when committing we first get an upgrade_lock and hold it while validating the transaction’s reads. This lock allows shared locks to be obtained (and thus variables to be read) while we’re validating, but no other transaction will be able to get an upgrade_lock and start committing until we’re done. If the validation succeeds we upgrade our upgrade_lock to a unique_lock which will lock out reads (after all the existing shared locks are released) and then we can publish our changes so that other threads will see them. When a transaction runs locked it gets the upgrade_lock at the start and holds it for the duration of the transaction so that no other transaction can commit.

So I made the modifications to trigger running locked on the offending transaction and tested things out: the calculations still restarted. My unit tests of the running locked functionality said that is should work, but after spending some time in the debugger I figured out where my unit tests were coming up short. It turned out that if the transaction that is running locked ran a child transaction the commit process of that child transaction would release the commit lock that the parent transaction was holding thus converting it back to a normal transaction that was not running locked and the pesky calculation transactions could get in there and cause conflicts. This is the first time I’ve seen this code exercised outside of testing and this was just not a scenario that I thought of when writing the unit tests. Luckily the fix for this (and the unit test to go with it) was not difficult. And before you say code coverage analyzer, this was not a case where the code was not being exercised by the tests, it just wasn’t being exercised in the correct way3.

So, yeah, starvation can happen. In this case it may be the symptom of a performance issue; I’m still working that out4. So the real solution may be to restructure the code so that the starvation can’t happen. Though it may not be worth the trouble, we only need to run locked two or three times during a couple minutes worth of calculations. But for the moment it’s nice to know that my starvation mitigation actually works (after a fixing it).

We refer to our data files as experiments and the experiment is made up of measurements that are independent as far as the calculation system is concerned (lots of easy to exploit parallelism there). The measurements are further broken down into smaller units (more parallelism) but we don’t need to worry about that for our purposes here.↩