Hi Thiago,
Let me address your questions one by one.
On Wed, Aug 22, 2012 at 1:01 AM, Thiago Negri <evohunz at gmail.com> wrote:
> Hello everyone. I'm taking my first steps in Cloud Haskell and got
> some unexpected behaviors.
>
> I used the code from Raspberry Pi in a Haskell Cloud [1] as a first
> example. Did try to switch the code to use Template Haskell with no
> luck, stick with the verbose style.
I have pasted a version of your code that uses Template Haskell at
http://hpaste.org/73520. Where did you get stuck?
> I changed some of the code, from ProcessId-based messaging to typed
> channel to receive the Pong; using "startSlave" to start the worker
> nodes; and changed the master node to loop forever sending pings to
> the worker nodes.
>
> The unexpected behaviors:
> - Dropping a worker node while the master is running makes the master
> node to crash.
There are two things going on here:
1. A bug in the SimpleLocalnet backend meant that, if you dropped a
worker node, findSlaves might not return. I have fixed this and
uploaded the fix to Hackage as version 0.2.0.5.
2. But even with this fix, you will still need to take into account
that workers may disappear once they have been reported by findSlaves.
spawn will actually throw an exception if the specified node is
unreachable (it is debatable whether this is the right behaviour --
see below).
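
To make point 2 concrete, here is a minimal sketch of the "try every node, keep going if one is gone" pattern. It deliberately uses plain IO and only the base library so that it can be run on its own; attemptAll, tryAny and fakeSpawn are hypothetical names made up for this illustration and are not part of Cloud Haskell. In a real master you would wrap the call to spawn in the Process monad's own exception handling instead.

```haskell
import Control.Exception (ErrorCall (..), SomeException, throwIO, try)
import Data.Either (partitionEithers)

-- | 'try' specialised to catch any exception (hypothetical helper).
-- Note that this also catches asynchronous exceptions, which is
-- acceptable for a sketch but too broad for production code.
tryAny :: IO a -> IO (Either SomeException a)
tryAny = try

-- | Run an action against every node. Returns the nodes for which the
-- action threw an exception, and the results for the nodes that worked.
attemptAll :: (node -> IO a) -> [node] -> IO ([node], [a])
attemptAll act nodes = partitionEithers <$> mapM attempt nodes
  where
    attempt n = either (const (Left n)) Right <$> tryAny (act n)

-- | Stand-in for spawning on a node (assumption: not the real 'spawn').
-- It pretends that "worker-2" has disappeared.
fakeSpawn :: String -> IO Int
fakeSpawn "worker-2" = throwIO (ErrorCall "node unreachable")
fakeSpawn name       = return (length name)

main :: IO ()
main = do
  (unreachable, started) <-
    attemptAll fakeSpawn ["worker-1", "worker-2", "worker-3"]
  putStrLn ("unreachable: " ++ show unreachable)
  putStrLn ("started:     " ++ show started)
```

The point of returning the failed nodes separately is that the master can drop them from its list and carry on with the rest, rather than crashing on the first unreachable worker.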
> - Master node do not see worker nodes started after the master process.
Yes, startMaster is merely a convenience function. I have modified the
documentation to specify more clearly what startMaster does:
-- | 'startMaster' finds all slaves /currently/ available on the local network,
-- redirects all log messages to itself, and then calls the specified process,
-- passing the list of slave nodes.
--
-- Terminates when the specified process terminates. If you want to terminate
-- the slaves when the master terminates, you should manually call
-- 'terminateAllSlaves'.
--
-- If you start more slave nodes after having started the master node, you can
-- discover them with later calls to 'findSlaves', but be aware that you will
-- need to call 'redirectLogHere' to redirect their logs to the master node.
--
-- Note that you can use functionality of "SimpleLocalnet" directly (through
-- 'Backend'), instead of using 'startMaster'/'startSlave', if the master/slave
-- distinction does not suit your application.
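
If you do poll findSlaves repeatedly, the master needs to work out which nodes are new (so it can redirect their logs) and which have gone away. Below is a small sketch of just that bookkeeping, using only the base library so it runs on its own. diffNodes is a hypothetical helper, and the strings stand in for whatever node identifiers the discovery call returns; the actual calls to findSlaves and to the log redirection function are left out.

```haskell
import Data.List (nub, (\\))

-- | Compare the nodes we already knew about with the latest discovery
-- result. Returns (newly appeared, vanished, current) node lists.
-- Hypothetical helper for illustration only.
diffNodes :: Eq node => [node] -> [node] -> ([node], [node], [node])
diffNodes known found = (appeared, vanished, current)
  where
    current  = nub found          -- discovery may report duplicates
    appeared = current \\ known   -- need their logs redirected
    vanished = known \\ current   -- stop sending work to these

main :: IO ()
main = do
  let known = ["node-a", "node-b"]          -- result of an earlier poll
      found = ["node-b", "node-c", "node-c"] -- result of the latest poll
      (appeared, vanished, current) = diffNodes known found
  putStrLn ("appeared: " ++ show appeared)
  putStrLn ("vanished: " ++ show vanished)
  putStrLn ("current:  " ++ show current)
```

In a real master loop, the "appeared" list is where you would redirect logs and spawn new work, and the "current" list becomes the "known" list for the next poll.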
Note that with these modifications there is still something slightly
unfortunate: if you delete a worker, and then restart it *at the same
port*, the master will not see it. There is a very good reason for
this: Cloud Haskell guarantees reliable ordered message passing, and
we want a clear semantics for this (unlike, say, in Erlang, where you
might send messages M1, M2 and M3 from P to Q, and Q might receive M1,
M3 but not M2, under certain circumstances). We (the developers of
Cloud Haskell, Simon Peyton-Jones and some others) are still debating
what the best approach is here; in the meantime, if you restart a
worker node, just give it a different port number.
Let me know if you have any other questions, and feel free to open an
issue at https://github.com/haskell-distributed/distributed-process/issues?state=open
if you think you found a bug.
Edsko