Just a thought: would it be an option (and would it help) to monitor each
child from birth?
/siri
2013/5/13 Siri Hansen <>
> Bryan and Tim, your analysis is very good, and the problem is complicated.
> I don't see a "watertight" solution right now, and I cannot spend too
> much time pondering without having a real priority for this case. I have
> written a ticket for it, and it will be prioritized along with all other
> backlog items. Any further thoughts and contributions will be very much
> appreciated :)
> Thanks again
> /siri
>>> 2013/4/30 Tim Watson <>
>>> Hi Bryan,
>>>> On 30 Apr 2013, at 18:34, Bryan Fink wrote:
>>>>>> But twiddling the timing there is just as racy, as you've noticed, right?
>>>>>> Correct. The length of the timeout is irrelevant. The EXIT signal is
>> not guaranteed to arrive within any specific amount of time.
>>>>>> Indeed. Almost a halting problem this, isn't it? :)
>>>>>> Isn't the point that the EXIT signal might /never/ come, if the child
>> un-links, or might come *after* the 'DOWN' if the race you've located
>> occurs? Surely you've got to be able to handle either case?
>>>>>> Yes, the point of the monitor is to handle the case where the EXIT
>> never comes (because the child unlinks). It is not the case, however,
>> that the EXIT always arrives after the DOWN in the race I'm seeing.
>> They might both be delayed.
>>>>>> Waiting without a timeout for the 'DOWN' is acceptable, because you've
>> got a guarantee (via the runtime) that it *will* arrive, no matter what
>> state the target process was in when you created the monitor. Waiting some
>> arbitrary time for the 'EXIT' is a real problem though, because you could
>> wait forever.
>>>> Handling either order is important, but the problem with this race is
>> that only the EXIT message contains the actual exit reason when this
>> happens. The 'noproc' in the DOWN is just saying that there was no
>> process to monitor.
>>>>>> Indeed. But it could equally be true that the 'EXIT' signal was never
>> dispatched, because the child process unlinked before it died. You can't
>> wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the
>> reason, so now you've got to choose how long to wait, but whatever timing
>> works for one particular case isn't going to solve the general problem.
>>>>>> We ran into something similar with our supervisor2 fork a while back,
>> whilst terminating (multiple) simple children:
>> http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is
>> somewhat different though, not only because it was terminating multiple
>> children (during shutdown) but also because it explicitly unlinks from the
>> child *after* creating the monitor, and /still/ allowed for an EXIT signal
>> to have made its way into the mailbox unexpectedly.
>>>>>> The monitor_child/1 function also unlinks from the child after
>> creating the monitor. That patch looks a little bit like the fixes I
>> was trying. Basically it's checking for an EXIT message after
>> receiving the DOWN, just in case one is in the mailbox, yes?
>>>>>> That's correct.
>>>> The problem is that it might still miss an EXIT, because it might still
>> not have arrived yet, even though it will later.
>>>>>> Yes, that's definitely true, and we were aware of that problem. However,
>> since we know we cannot wait for the 'EXIT' forever, and whatever arbitrary
>> timeout we choose is just someone else's race condition, we decided that if
>> the EXIT signal wasn't delivered promptly to the process's mailbox,
>> losing the real exit reason was something we could live with in the worst
>> case.
>>>> Since we've started merging the R15/R16 changes in though, that code has
>> disappeared so we're in the same boat as you guys. :)
>>>> Cheers,
>> Tim
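
For readers following the thread, the approach discussed above (wait for the
'DOWN' with no timeout, then look once for an 'EXIT' that may already be in
the mailbox) could look roughly like the sketch below. This is only an
illustration, not the OTP supervisor or supervisor2 code: the module name
exit_drain_sketch and the functions await_child_exit/1 and demo/0 are
hypothetical, and the caller is assumed to be linked to the child and to be
trapping exits.

```erlang
-module(exit_drain_sketch).
-export([await_child_exit/1, demo/0]).

%% Hypothetical helper, for illustration only.
%% Assumes the caller is linked to Pid and has trap_exit set to true.
await_child_exit(Pid) ->
    MRef = erlang:monitor(process, Pid),
    %% Unlink only after the monitor exists, as described in the thread.
    unlink(Pid),
    receive
        {'DOWN', MRef, process, Pid, noproc} ->
            %% The child was already gone when the monitor was created,
            %% so only an 'EXIT' message can carry the real reason.
            %% Look once, without blocking. Per the thread, this can
            %% still miss an 'EXIT' that has not reached the mailbox.
            receive
                {'EXIT', Pid, Reason} -> {exited, Reason}
            after 0 ->
                {exited, unknown}
            end;
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The monitor saw the real reason. Drop a stray 'EXIT'
            %% from the old link if one is already queued.
            receive
                {'EXIT', Pid, _} -> ok
            after 0 ->
                ok
            end,
            {exited, Reason}
    end.

%% Small usage example. Note that it turns on trap_exit in the caller.
demo() ->
    process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> exit(boom) end),
    await_child_exit(Pid).
```

The sketch accepts the trade-off Tim describes: it never waits an arbitrary
time for the 'EXIT', so in the worst case the real exit reason is reported as
unknown.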