November 22, 2010

Misusing mutable state with F# Asynchronous Workflows

Or perhaps the title should really be “why it’s really a good idea to avoid misusing mutable state when using F# Asynchronous Workflows”. Ultimately I wanted to share a brief (or not, we’ll see), cautionary tale about the dangers of shared state when implementing concurrent tasks.

I’ve been using F# for some time, and have it fairly well drilled into my skull by this point that shared, mutable state is bad. And yet occasionally it somehow creeps in as an expedient way to solve certain issues: perhaps it’s just a lack of discipline on my part or the years of imperative conditioning (wow, that sounds like something from Dune :-). Whatever the reason, this time it actually did cause a problem for my application, and it was worth the effort of eradicating.

Here’s the scenario: in my BrowsePhotosynth application, I use F# Asynchronous Workflows to download a set of files defining a point cloud from Microsoft’s Photosynth servers. Doing so asynchronously means the code runs at around 8x the speed of a more traditional, synchronous approach. While the various downloads – and their associated processing – are being managed and performed by the Asynchronous Workflows mechanism in F#, the UI thread spins its wheels updating the progress meter and checking whether the user wants to cancel (at least that’s what’s meant to happen – it turns out I also had a focus issue which stopped AutoCAD from being responsive to Windows messages while the processing was happening: another thing I’ve fixed in the latest version).

Anyway, the UI thread needs to poll the F# “processor” to find out whether it’s finished. I was previously maintaining a count of the completed tasks: when that reached the total number of files to download, we knew we were done. The problem was this: I was executing the operation to increment the mutable variable when each of these various tasks completed – and not necessarily on the main UI thread. Mostly it would work fine, but occasionally the progress meter would just stop progressing. My latest guess at the reason for this is not that certain tasks had failed, but that the count of completed tasks somehow got out of sync with reality.

I now think – and I might still be wrong that this was the ultimate cause of my problem, as these issues are hard to debug – that updating a shared member variable (jobsCompleted) from the “a job is completed” event callback, which could run on a different thread, was a bad thing to do.

To understand why this is, let’s take a look at what happens at a low level when you increment an integer:

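1. The current value gets read from the variable’s memory location
2. The value gets incremented by one
3. The result gets stored in the memory location of the original variable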

Now my low-level digital systems/assembly language knowledge has decayed significantly over the last decade, but we can look at some generated IL code to see something similar happening:

L_0033: ldloc.2   // Load the local variable at index 2 onto the evaluation stack
L_0034: ldc.i4.1  // Push the integer value of 1 onto the eval stack
L_0035: add       // Add the top two values on the eval stack and push the result to the top
L_0036: stloc.2   // Pop the current value from the top of the eval stack and store it in the local variable at idx 2

Our main goal in adopting Asynchronous Workflows was to make sure we’re not waiting around to download one file before processing it. This actually has nothing much to do with concurrency: the fact that we may get some improvement from using more cores of our processor is very much a secondary benefit.

But ultimately concurrent execution may still be happening: we’re not specifying when or where the tasks should get executed. So it could be that the increment operation gets executed on two or more threads at exactly the same time. And if two threads read the variable at the same time, and then increment it before saving it back, we effectively lose one of our increment operations.
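To make the lost increment concrete, here is a small sketch that replays the load, add and store steps of two increments in a fixed order. It is written in Python rather than F#, purely as a language-neutral stand-in, and it is not the code from BrowsePhotosynth: the run function, the schedules and the jobs_completed key are all hypothetical names invented for the illustration. Replaying a fixed schedule (rather than starting real threads) keeps the outcome repeatable.

```python
# Deterministic replay of the race: each "thread" performs the three
# low-level steps of an increment (load, add, store) as separate actions,
# and a schedule decides the order in which those steps interleave.
# All names here are hypothetical and exist only for this illustration.

def run(schedule):
    """Replay a schedule of (thread_id, step) pairs against a shared counter."""
    shared = {"jobs_completed": 0}   # the shared, mutable variable
    registers = {}                   # each thread's private copy of the value

    for thread_id, step in schedule:
        if step == "load":
            registers[thread_id] = shared["jobs_completed"]
        elif step == "add":
            registers[thread_id] += 1
        elif step == "store":
            shared["jobs_completed"] = registers[thread_id]
        else:
            raise ValueError(f"unknown step: {step}")
    return shared["jobs_completed"]

# Thread A finishes its whole increment before thread B starts.
sequential = [("A", "load"), ("A", "add"), ("A", "store"),
              ("B", "load"), ("B", "add"), ("B", "store")]

# Both threads load the old value before either one stores its result.
interleaved = [("A", "load"), ("B", "load"),
               ("A", "add"), ("B", "add"),
               ("A", "store"), ("B", "store")]

sequential_result = run(sequential)
interleaved_result = run(interleaved)
print(sequential_result)    # both increments are counted
print(interleaved_result)   # one increment is overwritten and lost
```

In the interleaved schedule both threads start from the same old value, so the second store simply overwrites the first: two jobs have completed, but the counter only moves on by one.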

This is all fairly unlikely, but – if I’m reading this behaviour correctly – the larger the point clouds (such as what I would call “very large” point clouds, at least from a Photosynth perspective, with 200+ files each containing up to 5,000 points) the greater the risk of some kind of conflict occurring.

So what can we do? Well, we already raise an event on the UI thread to update AutoCAD’s progress meter safely, so we could increment a counter there. Or we could use message-passing to have an agent count our successfully completed tasks for us, just as we use one now to write to our text file. In the end I decided to set a Boolean flag in the “all jobs completed” event, which avoids us having to count the completed tasks at all.
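The message-passing option might look something like the sketch below. Again this is Python standing in for F#, with a queue and a thread playing roughly the role an agent would play, and it is not the implementation that went into the updated project. TOTAL_JOBS, WORKER_COUNT, counting_agent and worker are hypothetical names, and the download itself is left as a comment. The point is only that the counter lives on one thread, while the workers report completion by posting a message instead of touching shared state.

```python
import queue
import threading

TOTAL_JOBS = 200    # hypothetical: roughly the "200+ files" of a large point cloud
WORKER_COUNT = 8    # hypothetical number of concurrent workers

def counting_agent(inbox, all_done):
    """Sole owner of the counter: it is only ever touched on this thread."""
    completed = 0
    while completed < TOTAL_JOBS:
        inbox.get()       # block until some worker reports a finished job
        completed += 1
    all_done.set()        # the "all jobs completed" flag

def worker(job_ids, inbox):
    """Process a share of the jobs, reporting each one by message."""
    for job_id in job_ids:
        # ... download and process the file for job_id here ...
        inbox.put(job_id)  # report completion instead of incrementing

inbox = queue.Queue()
all_done = threading.Event()

agent = threading.Thread(target=counting_agent, args=(inbox, all_done))
agent.start()

# Split the job ids between the workers so each job is handled exactly once.
workers = [
    threading.Thread(target=worker, args=(range(i, TOTAL_JOBS, WORKER_COUNT), inbox))
    for i in range(WORKER_COUNT)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

# The polling side only ever reads the flag; it never sees the counter.
finished = all_done.wait(timeout=5)
agent.join(timeout=5)
print(finished)
```

A polling loop on the UI thread would check all_done.is_set() between progress updates rather than blocking in wait, which is much the same idea as the Boolean flag: the UI only needs to know whether everything is finished, not how many tasks have completed so far.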

I’m now going to update my AU Handout and the ADN Plugin of the Month posting with this updated implementation. For those of you who are interested, here’s the updated project, in the meantime.

Update

I’m still getting the occasional “lost task” error, which is frustrating, and ultimately means that my reasoning was at best partially correct (and that’s phrasing it very generously ;-). However, the advice remains sound – and the fundamentals worth sharing – so I’m going to leave the post as it is. But bear in mind that there seems to be a deeper issue in my async code that could still use fixing. The good news, of course, is that now the user is able to cancel effectively, whether it fails occasionally is somewhat moot: if the processing stalls, they can always hit escape and run the BROWSEPS command to attempt the download of the point cloud once again. We’ll see if it gnaws away at me enough to diagnose the issue – probably not before AU, at this stage, but we’ll see.