Details

Previously, unsafeInterleaveIO used noDuplicate to prevent the
computation from being run twice. noDuplicate needs to walk the
evaluation stack. This experimental implementation instead uses MVars to prevent duplication.
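The two-MVar approach might look something like the following. This is my reconstruction of the general shape, not the actual patch: one MVar acts as a claim token deciding which thread runs the action, and a second MVar lets every other thread block for the result.

```haskell
import Control.Concurrent.MVar
import GHC.IO.Unsafe (unsafeDupableInterleaveIO)

-- Hypothetical two-MVar sketch (names and details are guesses).
unsafeInterleaveIO' :: IO a -> IO a
unsafeInterleaveIO' m = do
  lock   <- newMVar ()    -- full until some thread claims the action
  result <- newEmptyMVar  -- filled once the action has run
  unsafeDupableInterleaveIO $ do
    claimed <- tryTakeMVar lock
    case claimed of
      -- We won the race: run the action and publish the result.
      Just () -> do
        a <- m
        putMVar result a
        pure a
      -- Someone else is running (or has run) the action; block on
      -- their result. readMVar leaves the value in place for any
      -- other waiters.
      Nothing -> readMVar result
```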

Using two MVars seems a little weird, but it's simpler than the alternatives I've been able to think of thus far. Another option is to use MVar (Maybe a), which starts out containing Nothing. That seems rather uglier:

```haskell
unsafeInterleaveIO :: IO a -> IO a
unsafeInterleaveIO m = do
  v <- newMVar Nothing
  unsafeDupableInterleaveIO $ do
    r <- tryTakeMVar v
    case r of
      -- Someone else has taken the MVar and is performing
      -- the action, so we will wait for their result.
      Nothing -> do
        res <- readMVar v
        case res of
          Nothing -> error "Impossible Nothing"
          Just a  -> pure a
      -- Someone else has performed the action, so we use
      -- their result and put it back in the MVar.
      Just j@(Just r) -> putMVar v j >> pure r
      -- We're the first ones to get the MVar, so we actually
      -- do the work.
      Just Nothing -> do
        res <- m
        putMVar v (Just res)
        pure res
```

@bgamari I can definitely reduce the number of MVars to one, but I don't see a way to reduce the number of MVar *operations* too much. That doesn't mean it's impossible; I just don't see it. I'll take the MVar (Maybe a) approach I gave above for a spin and see how that goes, and then maybe try the evil "null pointer" version.

@dfeuer what's this aiming at? Is there a problem with the old implementation, or are you trying to optimise it? If so, what's the benchmark?

I'm very wary of modifying this code, the bugs are really hard to find.

I'm trying to optimize it. You chose to use an MVar rather than noDuplicate# for fixIO to try to optimize it. unsafeInterleaveIO gets a lot more use, so we should worry at least as much about its performance. I can't guarantee that I can improve performance, but I think it's worth a shot. As for the benchmark, I imagined that we must have some benchmarks that use lazy IO in multi-threaded code. If that's wrong, I'll have to try to come up with something.
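For context, the MVar-based fixIO mentioned here works along these lines (a simplified sketch of my understanding; the version in base has extra handling for premature forcing):

```haskell
import Control.Concurrent.MVar
import GHC.IO.Unsafe (unsafeDupableInterleaveIO)

-- Simplified sketch of an MVar-based fixIO.
fixIO' :: (a -> IO a) -> IO a
fixIO' k = do
  m <- newEmptyMVar
  -- 'ans' is a lazy placeholder for the final result; forcing it
  -- blocks on the MVar until the result has been written.
  ans <- unsafeDupableInterleaveIO (readMVar m)
  result <- k ans
  putMVar m result
  pure result
```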

Hmm, I'm slightly dubious about all this. It seems like an unforced change in code that's been historically very difficult to get right. We don't have an open performance bug related to this code, and it doesn't affect the performance of GHC itself. Is it really worth the effort? You'd have to find a benchmark or synthesise one, and that's a lot of work, getting confidence in correctness will be hard, and at the end there might not be much to be gained here.

It turns out that in simple cases (without the deep thunk chains that would make this more likely to be useful), the simple two-MVar solution is the best of the MVar options, with the crazy null-pointer one a close second. When run with -N1, all are inferior (in this simple case) to the noDuplicate version. However, when run with -N2 (even without any actual parallel or concurrent code), the two-MVar version pulls ahead.

@simonmar, why do you say this does not affect GHC performance? GHC uses unsafeInterleaveIO directly in multiple places, although I certainly couldn't say if the differences are significant enough to be easily measurable in such a large program. You're right that finding/writing good benchmarks could be a bit of a challenge. I wonder if it would be possible to produce a test demonstrating quadratic behavior from repeated stack walking.
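A sketch of the kind of test I have in mind (hypothetical; the depth and structure are made up): each forced element re-enters the interleaved action from a deeper evaluation stack, so a noDuplicate-based implementation walks ever-longer stacks as the chain unwinds.

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)

-- Build a chain of n nested lazy-IO thunks. Forcing the outermost
-- value forces the whole chain from a progressively deeper stack;
-- with noDuplicate, each step rescans that stack.
chain :: Int -> IO Int
chain n = go n
  where
    go 0 = pure 0
    go k = unsafeInterleaveIO (succ <$> go (k - 1))
```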

I'm not personally too concerned about correctness in the two-MVar implementation. It's quite simple, and the failures of the past have left us lots of good tests to pound on any new implementation.

A final thought: whichever way we go on this, it seems quite peculiar to me to use one approach for fixIO and another for unsafeInterleaveIO.

Well ok, if you can demonstrate a perf improvement then I won't object. GHC does use unsafeInterleaveIO, but as far as I know we've never identified it as a significant factor for performance (that could be because our profiling tools aren't good enough, of course). So my concerns are really around whether this is the most productive area to focus effort, rather than whether it's a good idea in isolation.

It might be better to use an IORef for the exclusion part and an MVar for the blocking part. Something like
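The code that followed is missing here; this is a reconstruction of what an IORef-plus-MVar version might look like (names and details are my guesses): the IORef provides the mutual exclusion and the MVar provides the blocking.

```haskell
import Data.IORef
import Control.Concurrent.MVar
import GHC.IO.Unsafe (unsafeDupableInterleaveIO)

-- Hypothetical sketch: exactly one thread claims the action via the
-- IORef; everyone else blocks on the MVar for the result.
unsafeInterleaveIO' :: IO a -> IO a
unsafeInterleaveIO' m = do
  ref    <- newIORef False   -- False: the action is still unclaimed
  result <- newEmptyMVar
  unsafeDupableInterleaveIO $ do
    alreadyClaimed <- atomicModifyIORef' ref (\c -> (True, c))
    if alreadyClaimed
      -- Another thread claimed the action; block until it publishes.
      then readMVar result
      -- We claimed it: run the action and publish the result.
      else do
        a <- m
        putMVar result a
        pure a
```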

Are we assured that multiple threads evaluating an unsafeInterleaveIO like defined above will see the same ref and m? Doesn't the allocation of these objects need to be protected by noDuplicate? I'm happy to be completely wrong, but I would like to know why.

Ah, that makes sense (both the text and the code), except that the takeMVar should be readMVar. I did think about combining an IORef with an MVar, but my ideas in that direction were all obviously wrong. I'll give your approach a shot and see how it performs.

> Are we assured that multiple threads evaluating an unsafeInterleaveIO like defined above will see the same ref and m? Doesn't the allocation of these objects need to be protected by noDuplicate? I'm happy to be completely wrong, but I would like to know why.

The objects are allocated in the outer IO thread, so we are guaranteed that those allocations happen exactly once for this call to unsafeInterleaveIO. Inside the unsafeDupableInterleaveIO we lose that guarantee, so we have to implement the mutual exclusion ourselves.

I was able to improve performance a bit using casByteArray# instead of atomicModifyIORef. But I still don't have an absolute improvement. Further, performance in this realm seems fragile in the extreme, in sometimes surprising ways.
