<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"><base href="x-msg://54/">
<style type="text/css">body { font-family:'Helvetica'; font-size:12px}</style>
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>Hi,</div><div><br></div><div>OK, "tons" might be a bit dramatic. There are 3 or 4 receive statements in the dist_ac gen_server code without a timeout. These are the potential hang points. The root cause of all the problems is that dist_ac assumes all applications are started on all nodes in the same sequence. Now imagine a typical setup: the boot script starts app A then B, but unfortunately A has a restart timeout of 5000 ms and B has 3000 ms. If the node running these apps crashes, the rest of the cluster will attempt to start B then A. But if the crashed node is restarted by heart within 3 seconds, it will rejoin the cluster before the takeover and attempt to start A then B. Result: neither of the apps fails over anywhere, and the restarted node won't even finish its init sequence.</div><div><br></div><div>Regarding a patch: I wrote a fix for the first couple of bugs I discovered. The problem is that I did it on work time at a big company, where an entire security and legal department has been thinking hard ever since about whether it is OK to release code to the public...</div><div>To be honest, I'm not pushing them hard right now either, because my fixes don't cover the scenario described above. That would need a complete rewrite of the dist_ac code to allow multiple apps to start concurrently. I have some ideas on how to do it, but I won't have time to write a fix until January, I'm afraid (this time I'd do it from home).</div><div>And I don't think this feature is widely used, btw. The dist_ac module hasn't been modified for as long as the erlang/otp git repo has existed. Furthermore, I believe you are safe using it as long as you have only one distributed application. 
So I guess I'm the first one to run into these problems, using 5-6 distributed apps and 5 nodes with equal priorities.</div><div><br></div><div>BR,</div><div>Daniel</div><div><br></div><div>On Sun, 01 Dec 2013 22:45:31 -0000, Tony Rogvall &lt;tony@rogvall.se&gt; wrote:<br></div><br><blockquote style="margin: 0 0 0.80ex; border-left: #0000FF 2px solid; padding-left: 1ex">...<br><div><blockquote type="cite"><div style="font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div><br></div><div>PS: I would avoid using distributed applications in production. The dist_ac module in the kernel application, which takes care of deciding where to run which distributed application, is a terrible spaghetti of gen_server callbacks and ad-hoc message passing with tons of race conditions that can block your entire cluster from starting up any distributed apps. I ran into about 3-4 different bugs of this kind before abandoning the idea of using this feature.</div><div><br></div></div></blockquote><div><br></div><div>Is this really true? Tons of race conditions, meaning over 1000? 
3-4 different bugs?</div><div>This raises some serious questions, like: Did you try to correct this and send a patch, and if not, why not?</div><div>If distributed applications are not usable, does the OTP team know about this?</div><div>If so, why is this feature still there, where it could fool people into trying to use it?</div><div><br></div><div>/Tony</div><div><br></div><br><blockquote type="cite"><div style="font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div>On Sun, 01 Dec 2013 16:19:25 -0000, Tyron Zerafa &lt;<a href="mailto:tyron.zerafa@gmail.com">tyron.zerafa@gmail.com</a>&gt; wrote:<br></div><br><blockquote style="margin: 0px 0px 0.8ex; border-left-color: rgb(0, 0, 255); border-left-width: 2px; border-left-style: solid; padding-left: 1ex; "><div dir="ltr">Hi all,<div> </div><div> I am trying to understand how to implement takeover in Erlang by following the example presented <a href="http://learnyousomeerlang.com/distributed-otp-applications">here</a>. 
Basically, I am creating the application's supervisor as follows:</div><div> </div><div><div><div>start(normal, []) -></div><div><span class="" style="white-space: pre; "> </span>m8ball_sup:start_link();</div><div>start({takeover, _OtherNode}, []) -></div><div><span class="" style="white-space: pre; "> </span>m8ball_sup:start_link().</div></div><div><br></div><div><br></div><div><u>Supervisor init code:</u></div><div><div>start_link() -></div><div><span class="" style="white-space: pre; "> </span>supervisor:start_link({global,?MODULE}, ?MODULE, []).</div></div><div><br></div><div><u>Supervisor child Specification:</u></div><div><div>{</div><div><span class="" style="white-space: pre; "> </span>{one_for_one, 1, 10},</div><div><span class="" style="white-space: pre; "> </span>[</div><div><span class="" style="white-space: pre; "> </span>{m8ball,</div><div><span class="" style="white-space: pre; "> </span>{m8ball_server, start_link, []},</div><div><span class="" style="white-space: pre; "> </span>permanent,</div><div><span class="" style="white-space: pre; "> </span>5000,</div><div><span class="" style="white-space: pre; "> </span>worker,</div><div><span class="" style="white-space: pre; "> </span>[m8ball_server]</div><div><span class="" style="white-space: pre; "> </span>}]</div><div><span class="" style="white-space: pre; "> </span>}</div></div><div><br></div><div><u>Child (m8ball_server) Initialization</u></div><div><div>start_link() -></div><div><span class="" style="white-space: pre; "> </span>gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).</div></div><div><br></div><div><br></div><div>Consider the following scenario: an Erlang cluster is composed of two nodes, A and B, with application m8ball running on A. </div><div>Failover works perfectly: I can kill node A and see the application running on the next node, B. 
</div><div>However, when I try to bring node A back up (which has a higher priority than B) and init the app, I get the following error. I'm assuming that this occurs because node B already contains a supervisor globally registered with that name. </div><div><u>Log on Node A </u><br></div><div><div>{error,{{already_started,&lt;2832.61.0&gt;},</div><div> {m8ball,start,[{takeover,'b@Tyron-PC'},[]]}}}</div><div><br></div><div>=INFO REPORT==== 1-Dec-2013::16:17:32 ===</div><div> application: m8ball</div><div> exited: {{already_started,&lt;2832.61.0&gt;},</div><div> {m8ball,start,[{takeover,'b@Tyron-PC'},[]]}}</div></div><div><br></div><div><br></div><div><u>Log on Node B</u></div><div><div>=INFO REPORT==== 1-Dec-2013::16:24:55 ===</div><div> application: m8ball</div><div> exited: stopped</div><div> type: temporary</div></div><div><br></div><div>When I tried registering the supervisor locally, I got a similar error, this time failing to initialize the worker process. However, if I also register the worker locally, I would not be able to call it from any node using the app name (since it would not be globally registered).</div><div><br></div><div><u>Log on Node A </u><u>(Supervisor Registered Locally)</u><br></div><div>{error,</div><div> {{shutdown,</div><div> {failed_to_start_child,m8ball,</div><div> {already_started,&lt;2832.67.0&gt;}}},</div><div> {m8ball,start,[{takeover,'b@Tyron-PC'},[]]}}}</div><div><br></div><div> </div><div>Any pointers?</div><div><br></div>--<span class="Apple-converted-space"> </span><br>Best Regards,<div>Tyron Zerafa</div></div></div></blockquote><br><br><br>_______________________________________________<br>erlang-questions mailing list<br><a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br><a href="http://erlang.org/mailman/listinfo/erlang-questions">http://erlang.org/mailman/listinfo/erlang-questions</a><br></div></blockquote></div><br><div>
<span class="Apple-style-span" style="border-collapse: separate; border-spacing: 0px; "><div><span class="Apple-style-span" style="color: rgb(51, 51, 51); font-family: Geneva, Arial, Helvetica, sans-serif; font-size: 12px; ">"Installing applications can lead to corruption over time. </span><span class="Apple-style-span" style="color: rgb(51, 51, 51); font-family: Geneva, Arial, Helvetica, sans-serif; font-size: 12px; ">Applications gradually write over each other's libraries, partial upgrades occur, user and system errors happen, and minute changes may be unnoticeable and difficult to fix"</span></div><div><span class="Apple-style-span" style="color: rgb(51, 51, 51); font-family: Geneva, Arial, Helvetica, sans-serif; font-size: 12px; "><br></span></div></span><br class="Apple-interchange-newline">
</div>
<br></blockquote><br><br><br></body></html>