The case of the mysterious "thread blocked indefinitely in an MVar operation" exception

I recently tracked down the cause of the persistent “thread blocked indefinitely in an MVar operation” exceptions I was getting from the GHC runtime when I used my parallel-io library. There doesn’t seem to be much information about this exception on the web, so I’m documenting my experiences here to help those unfortunate souls that come across it.

The essence of the parallel-io library is a combinator like this:

parallel::[IOa]->IO[a]

The list of IO actions on the input are run (possibly in parallel) and the results returned as a list. The library ensures that we never go beyond a certain (user specified) degree of parallelism, but that’s by-the-by for this post.

There are a number of interesting design questions about the design of such a combinator, but the one that tripped me up is the decision about what to do with exceptions raised by the IO actions. What I used to do was have parallel run all the IO actions wrapped in a try, and wait for all of the actions to either throw an exception or complete normally. The combinator would then either return a list of the results, or, if at least one action threw an exception, the exception would be rethrown on the thread that had invoked parallel.

At first glance, this seems very reasonable. However, in this turns out to be a disastrous choice that all-but-guarantees your program will go down in flames with “thread blocked indefinitely in an MVar operation”.

This code is similar to what the parallel combinator does in that the main thread (thread 1) waits for thread 2 and thread 3 to both return a result or exception to res1 and res2 respectively. After it has both, it shows the exception or result.

The fly in the ointment is that thread 2 is waiting on thread 3, so the thread dependency graph looks something like this:

(You should read an arrow that points from thread A to thread B as thread B is waiting on thread A to unblock it).

Unfortunately, our programmer has written buggy code, and the code running in thread 3 throws an exception before it can wake up thread 2. Thread 3 writes to the res2 MVar just fine, but thread 1 is still blocked on the res1 MVar, which will never be written to. At this point, everything comes crashing down with “thread blocked indefinitely in an MVar operation”.

The most annoying part is that the exception that originally caused the error is squirrelled away in an MVar somewhere, and we never get to see it!

This goes to show that the proposed semantics for exceptions above is dangerous. In the presence of dependencies between the actions in the parallel call, it tends to hide exceptions that we really want to see.

The latest version of the parallel-io library solves this by the simple method of using the asynchronous exceptions feature of GHC to rethrow any exceptions coming from an action supplied to parallelas soon as they occur. This unblocks thread 1 and lets us deal with the exception in whatever way is appropriate for our situation.

Although this bug is simple to describe and obvious in retrospect, it took a hell of a time to diagnose. It never ceases to amaze me exactly how difficult it is to write reliable concurrent programs!

Well, the users of the parallel-io library can do whatever they want, including having their parallel tasks wait on each other. In fact, the client I really care about is my own build system (OpenShake) where dependencies between parallel tasks are vital to the operation of the system.