Can you efficiently parallelize this? The parallel versions are much slower than the sequential version, and I'm not sure why. Does SetSharedVariable allow simultaneous reads from different kernels? It appears that it doesn't, even though the documentation says you should use CriticalSection to make separate reads and writes atomic and thread-safe.

SetSharedVariable forces that variable to be evaluated on the main kernel (not the parallel kernels). This will effectively make your code not run in parallel.
– Szabolcs, May 15 '13 at 20:55
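To illustrate the point about SetSharedVariable (a minimal sketch with made-up variable names, not code from the question):

```mathematica
(* A shared variable routes every read and write through the main
   kernel, so the subkernels spend their time waiting, not computing. *)
LaunchKernels[];
SetSharedVariable[acc];
acc = 0;
ParallelDo[acc += i, {i, 1000}];  (* each += is a main-kernel callback *)

(* The usual alternative: compute locally on the subkernels and
   combine the results at the end. *)
Total[ParallelTable[i, {i, 1000}]]
```

The second form avoids the per-iteration round trip to the main kernel entirely.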

So the slowdown for the second one is from copying definitions between kernels?
– Michael Hale, May 15 '13 at 20:56

Moving data (not definitions) and waiting for the main kernel to finish. There's also a bug in the parallel tools which temporarily unpacks packed arrays sent back to the main kernel, using up a lot of memory and causing a considerable slowdown. See here: mathematica.stackexchange.com/q/2886/12 Generally, parallelization is valuable if your calculation takes a long time, if it's not too fine-grained (minimize communication between kernels), and, unfortunately, if the data sent back is not too large.
– Szabolcs, May 15 '13 at 21:03
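One way to observe the packed-array behavior mentioned in the comment above (a hedged sketch; Developer`PackedArrayQ is the standard test, but whether unpacking actually occurs depends on the version):

```mathematica
(* RandomReal produces a packed array on the main kernel. *)
data = RandomReal[1, 10^6];
Developer`PackedArrayQ[data]

(* Check whether results computed on the subkernels arrive back
   on the main kernel still in packed form. *)
result = ParallelEvaluate[RandomReal[1, 10^5]];
Developer`PackedArrayQ /@ result
```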

No problem. I can stick with the sequential version for now. Thanks for the information.
– Michael Hale, May 15 '13 at 21:06

I think this slowness is a bug (see my update in the answer). I re-titled your question because the problem is not (only) caused by SetSharedVariable.
– Szabolcs, May 15 '13 at 22:33

On my machine, the Map version takes about 5 seconds, regardless of the size of n (at least for n between 100 and 10000). However, the ParallelMap version depends strongly on the value of n: for n=100 it's much faster than Map, but for n=2000 it already takes a bit longer. If I remove the Dispatch@, it will be faster.

ParallelEvaluate[dp]; is not slow, which suggests that it is not transferring dp to the parallel kernels that takes time.

Update: evaluating Rule on parallel kernels is very slow

Finally I managed to come up with a smaller and more enlightening test case for this slowdown. In a fresh kernel, evaluate:

This is now very fast. So the culprit is Rule. But why? Is Rule overloaded in the parallel kernels? ParallelEvaluate[Information[Rule]] reveals nothing special.
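The test-case code itself is not preserved in this excerpt. A hypothetical reconstruction of the kind of comparison being described (replacing the Rule head with List and timing both; the names and sizes are assumptions, not the author's exact code) might look like:

```mathematica
(* Illustrative only: a large list of rules, and the same data with
   the Rule head replaced by List. *)
rules = Table[RandomInteger[10^6] -> RandomReal[], {10^5}];
pairs = List @@@ rules;
DistributeDefinitions[rules, pairs];

AbsoluteTiming[ParallelMap[Last, rules];]  (* slow in affected versions *)
AbsoluteTiming[ParallelMap[Last, pairs];]  (* fast *)
```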

Does anyone have any ideas what might be going on here?

Update on 2014-03-04:

I received a reply about this from WRI support and they pointed out that the issue is mentioned in the documentation. It's the third entry under "Possible Issues" at DistributeDefinitions.

The example there describes the same problem with InterpolatingFunction objects. Quoting the relevant part of the documentation, without repeating the example code:

Certain objects with an internal state may not work efficiently when distributed. ... Alternatively, reevaluate the data on all subkernels.

For the example above the workaround is as simple as

ParallelEvaluate[data = data;]

This is very similar to @Michael's workaround in the other answer, but it's not even necessary to recreate the expressions from scratch. It's sufficient to just re-assign the variable to itself (and re-evaluate it in the process).
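In context, the workaround might look like this for the documentation's InterpolatingFunction example (an illustrative sketch, not the documentation's exact code):

```mathematica
(* InterpolatingFunction carries internal state that transfers poorly. *)
if = Interpolation[Table[{x, Sin[x]}, {x, 0., 10., 0.01}]];
DistributeDefinitions[if];
ParallelEvaluate[if = if;];  (* re-assignment rebuilds the local state *)
ParallelTable[if[x], {x, 0., 10., 0.1}]
```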

I'm wondering what support said about this.
– xslittlegrass, Sep 3 '13 at 1:35

@xslittlegrass They said they filed a report, I haven't heard anything since. It hasn't affected me directly so I have not pursued it. I wouldn't normally report problems that don't hinder me personally but in this case I spent so much time debugging the OP's problem (why do I do this?!) that I couldn't let it go ...
– Szabolcs, Sep 3 '13 at 3:55

Here is a way in which ParallelMap works three times faster (V9.0.1, Mac OS, Intel i7, quad core, 8 virtual cores). The trick is to evaluate dispatch = Dispatch@rules on each kernel.

I don't really have an explanation, other than the guess that the dispatch table resides wholly in each kernel's memory, and the observation that when the rules are distributed, they use half as much memory in the parallel kernels as in the main kernel. When the rules are evaluated in the parallel kernels, even by ParallelEvaluate[ByteCount[rules]], the memory size of the kernels swells to match the main kernel. I assume that the rest of the definition of rules is transferred. But when the calculation is over, the kernel memory usage goes back down to what it was. So while the definition of rules is registered with the parallel kernels, it appears that the data itself is not completely copied.
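A sketch of the trick described in this answer (names like rules and dispatch, and the data sizes, are illustrative):

```mathematica
rules = Table[i -> RandomReal[], {i, 10^5}];
DistributeDefinitions[rules];
(* Build the dispatch table separately on every subkernel instead of
   shipping one built on the main kernel. *)
ParallelEvaluate[dispatch = Dispatch[rules];];
ParallelMap[Replace[#, dispatch] &, RandomInteger[{1, 10^5}, 10^4]]
```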

Useful tip. My original motivation involved a larger dispatch table and a longer operation using the dispatch table. Paging issues might have been unavoidable when making multiple copies of a dispatch table as large as my original, but I'll keep this in mind for situations with large but not huge memory requirements and long evaluation times.
– Michael Hale, Nov 19 '13 at 19:04

In the parallel kernels, data and foo are Equal, satisfy SameQ, and have, as near as I can tell, the equivalent Language`ExtendedFullDefinition, but one can see from the traces that the evaluation paths are different.

Mathematica is a registered trademark of Wolfram Research, Inc. While the mark is used herein with the limited permission of Wolfram Research, Stack Exchange and this site disclaim all affiliation therewith.