</div><div><div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I&#39;ve been trying to figure this out for a couple days now, and I&#39;m curious if anyone has seen a similar problem.<div><br></div><div>My setup is</div><div><br></div><div> ipcontroller --profile=sge</div><div> ipcluster engines -n 100 --profile=sge</div>

<div><br></div><div>My script uses map_sync with a direct view. After running my script for a couple minutes, the load on the compute nodes grows excessively high and the scheduler starts suspending jobs, so some of the engines get suspended. This causes my script to terminate with an error like the one below</div>

<div><br></div><div> [Engine Exception]EngineError: Engine 1315 died while running task &#39;966abf73-3183-4db3-8cf2-96bd08c2312b&#39;</div><div><br></div><div>The engine is numbered 1315 because I sometimes restart the engines without restarting the controller.</div>

<div><br></div><div>Why would suspending an engine cause my script to terminate instead of simply forcing it to wait?</div><div><br></div><div>Why might the load be so high? Each node has 32 cores. At most twenty engines are running on each node. Yet, sometimes several hundred processes are vying for space on a given node (and I&#39;m the only one using the cluster). Could it be the queuing of messages or something?</div>

</blockquote><div> </div></div></div><div>This is a bit of a shot in the dark, but on our machines we need to set <span>MKL_NUM_THREADS</span>=1, otherwise some numpy functions (which I assume are calling MKL functions) try to use 16 threads. Is it possible some of your code, or some library you rely on, is multi-threaded?<br>

</div></div></blockquote><div><br></div></div></div><div>The only library *IPython* uses that is multithreaded is zeromq, but that&#39;s only one additional thread. If *you* are using numpy, then the MKL environment is relevant.</div>
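<div>If MKL threading does turn out to be the cause, a minimal sketch of capping it from Python might look like the following. Only MKL_NUM_THREADS comes from this thread. The helper name limit_threads and the two extra variables (OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, for numpy builds that do not use MKL) are assumptions added for illustration. The variables generally have to be set before numpy is first imported in the engine process, so exporting them in the job script that launches the engines is probably the more reliable route.</div>

```python
import os

# MKL_NUM_THREADS is the variable mentioned in this thread. The other two
# are assumptions, included for numpy builds linked against OpenMP/OpenBLAS.
THREAD_VARS = ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_threads(n=1, environ=None):
    """Set each thread-count variable to n in the given environment mapping.

    Hypothetical helper: defaults to os.environ, but accepts any dict-like
    mapping so the effect can be checked without touching the real process.
    """
    env = os.environ if environ is None else environ
    for name in THREAD_VARS:
        env[name] = str(n)
    return env


# Call this before the first `import numpy` in the process, since the
# threading library typically reads these variables when it is loaded.
limit_threads(1)
```

<div>For scale: if each engine spawned 16 threads, as on our machines, twenty engines per node would mean 320 threads competing for 32 cores, which would be consistent with the several hundred processes reported above.</div>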
<div class="im">