We frequently hit "code -20" (SIGCHLD) crashes in the OSX testsuite. This is a fatal crash, causing frequent test failures and retriggers. It is often underrepresented in our Orange Factor statistics because the crashes are spread across many tests and individual bugs aren't filed for every instance.

(In reply to Justin Lebar [:jlebar] from comment #1)
> Can you please explain the impact to users here?
It deserves attention, which is why I've nominated it for my team to look at. We wouldn't actually track for a specific release, but we do want this fixed in the short to medium term.

It's obviously a race condition. I highly doubt 100% STR are possible. Can we just understand why we're crashing with a SIGCHLD?
I don't really know POSIX signals, but after reading through a bunch of code I have a theory:
* When a process starts, the default SIGCHLD handler is a no-op
* When we shut down an IPC child process we end up calling ProcessWatcher::EnsureProcessTerminated which can end up in MessagePumpLibevent::CatchSignal
* This calls signal_set and signal_add
* After the child actually quits (normally), this which triggers the first SIGCHLD handler which calls MessagePumpLibevent::SignalEvent::StopCatching, which calls event_del. I believe at this point that the SIGCHLD handling is left i
It doesn't appear that this happens in the case where the child process shuts down before the parent process GeckoChildProcessHost is destroyed because it hits the IsProcessDead check here: http://hg.mozilla.org/mozilla-central/annotate/d2a7cfa34154/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc#l198
I suspect that multiple SIGCHLD handling in libevent is just completely broken: that when the signal handler is removed it should end up in the default "ignore" state but instead ends up in an aborting default. This may be limited to the kqueue implementation which is used on mac but not Linux.
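The POSIX behavior this theory relies on -- that removing the handler should leave SIGCHLD in its default, effectively-ignored disposition -- can be sketched with a short Python script (used here for portability; the real code path is C++ in MessagePumpLibevent):

```python
import os
import signal

reaped = []

def reaper(signum, frame):
    reaped.append(signum)

# Install a SIGCHLD handler (as MessagePumpLibevent::CatchSignal would),
# then restore the default disposition (as StopCatching/event_del should).
signal.signal(signal.SIGCHLD, reaper)
signal.signal(signal.SIGCHLD, signal.SIG_DFL)

pid = os.fork()
if pid == 0:
    os._exit(0)            # child quits normally

# With SIG_DFL restored, SIGCHLD is ignored: the parent is not killed,
# no handler runs, and the exit status is still collectable via waitpid.
_, status = os.waitpid(pid, 0)
assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
assert reaped == []        # default disposition: handler never called
```

If libevent's kqueue path instead left the signal in a fatal disposition after event_del, the parent process here would die with the child's SIGCHLD, which matches the observed crashes.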

> NOTE: child process received `Goodbye', closing down
> ...
> TEST-UNEXPECTED-FAIL | Shutdown | Exited with code -20 during test run
> ...
Does the browser actually crash here, or has it merely received a SIGCHLD (which it may be ignoring)?
I don't see anything about minidumps in these logs (which you usually see when the browser has actually crashed in the middle of (say) a mochitest).
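For reference, a negative exit code in these logs follows the usual POSIX convention: the process was *terminated by* that signal (SIGCHLD is 20 on OS X, 17 on Linux), not that it merely received one. Python's subprocess module uses the same convention, as this small sketch shows:

```python
import os
import signal
import subprocess
import sys

# Start a child that would sleep for a minute, then terminate it.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
os.kill(proc.pid, signal.SIGTERM)
status = proc.wait()

# A negative code means the child was *terminated by* that signal, not
# that it merely received (and perhaps ignored) one. SIGTERM is 15, so:
assert status == -signal.SIGTERM
print(status)
```

So "Exited with code -20" says the browser process was killed by SIGCHLD, which should never terminate a process under its default disposition.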

(Following up comment #4)
> * After the child actually quits (normally), this [which] triggers the
> first SIGCHLD handler which calls
> MessagePumpLibevent::SignalEvent::StopCatching, which calls
> event_del. I believe at this point that the SIGCHLD handling is left i
Benjamin, you seem to have dropped part of your last sentence here.
And the first "which" in your first sentence seems to be superfluous ... or you've dropped something there, too.

(In reply to Steven Michaud from comment #13)
> I'll try to find time this week to look further into this.
>
> Without STR, though, it may not be possible to do anything about it.
Thanks Steven!
Can we catch the issue when it occurs on automation rather than finding STR? Lukas or I can set up a meeting with the A Team if you'd like to move forward with that.

I really do need STR.
For example, someone could find (or write) a test that *always* fails in this way. This should be possible, as someone's already done it at least once -- see bug 230675 comment #39.

We have fixed lots of bugs without being able to reproduce them on demand. Indeed, many (I'd venture most) intermittent oranges have been fixed by people who were unable to reproduce them locally.
It would be great if we could reliably reproduce this bug, and I don't want to suggest that you shouldn't try to get STR. But as a matter of historical fact, I do not think it's true that without STR, a bug like this is insoluble, and therefore I think digging into bsmedberg's train of thought from comment 4 might prove fruitful.

I stand by what I said in comment #15.
Without STR (without at least a test (or combination thereof) that always fails in this way), it may be impossible to fix such an obscure bug as this one. Or (which is also very important) to know that we've actually fixed it, instead of just hacking around it.
Of course we should follow up all the leads we have, and I've already started doing that.
But right now, the very most useful contribution anyone could make is to find/write a test that always fails in this way -- possibly by starting from bug 230675 comment #39.
Would you care to try? :-)

As often as this happens on our test infrastructure, you should also be able to retrigger yourself to glory on Try. It might be difficult to reproduce locally, but I'm very certain that we can find out in short order whether a proposed fix works or not on the test slaves.

> we can find out in short order whether a proposed fix works or not on the test
> slaves
Using this method, we won't be able to tell a hack from a fix.
We may have to settle for a hack in the end. But I'd like to spend a few days (at least) looking for a fix.

(Following up comment #19)
This didn't lead anywhere.
I did two tryserver builds -- one that shortened the unload timer (by about 35 times) to 5 seconds, and another that lengthened it (by 10 times) to 30 minutes. There were no test failures at all :-(

Just how uniform is the (OS X) test machines' setup? For example do they all have the same version of Python? And if they don't, is there any correlation between this bug and a particular Python version?
The reason I ask is that the "Exited with code -20 during test run" error message is displayed (by the parent process) from Python code here, after "status = proc.wait();":
https://hg.mozilla.org/mozilla-central/annotate/4ffb23062b3b/build/automation.py.in#l989
So I'm wondering if Python libraries might conceivably be involved in this bug.
In any case, please let me know the version(s) of Python that are installed on the test machines.

(In reply to comment #23)
> you could just retrigger them thousands of times
Not worth the trouble. I was looking for gross effects, big enough to show up in a single set of tryserver runs.
But I didn't see any, so I'm going to stop playing with the length of the plugin unload timer.

(In reply to Steven Michaud from comment #24)
> Just how uniform is the (OS X) test machines' setup? For example do they
> all have the same version of Python? And if they don't, is there any
> correlation between this bug and a particular Python version?
They're all the same.
> The reason I ask is that the "Exited with code -20 during test run" error
> message is displayed (by the parent process) from Python code here, after
> "status = proc.wait();":
>
> https://hg.mozilla.org/mozilla-central/annotate/4ffb23062b3b/build/
> automation.py.in#l989
>
> So I'm wondering if Python libraries might conceivably be involved in this
> bug.
>
> In any case, please let me know the version(s) of Python that are installed
> on the test machines.
[cltbld@talos-mtnlion-r5-026.test.releng.scl3.mozilla.com ~]$ python -V
Python 2.7.2

I've written an interpose library that logs lots of signal activity in all currently running programs.
It hooks and logs the usual suspects (like signal(), sigaction(), kill() and raise()), plus a bunch of exit() methods. And I've found that, on OS X, the kernel sends all signals (that have handlers) through an undocumented user-level _sigtramp() method (a "trampoline"), which my library also hooks. So there's some hope that it will log all relevant signal activity.
I'd like to get it onto one of our build machines.
I hope and assume that its logging will get interleaved with the "regular" test logging, and redirected along with it to the correct destination(s).
First we should test that this is in fact true.
Then I'd like to run tests on this build machine (the one with my interpose library), continuously, until we hit the bug.
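The library itself is macOS-specific C (DYLD_INSERT_LIBRARIES plus interposing). As a rough, hypothetical illustration of the hook-and-log idea, here is the same pattern in Python: wrap a registration function so every call is recorded with a timestamp before being forwarded:

```python
import signal
import time

# The real library hooks C entry points via DYLD interposing; this shows
# the same idea portably: wrap signal.signal so each registration is
# logged with a timestamp before being forwarded (log format is made up).
_real_signal = signal.signal
log = []

def logging_signal(signum, handler):
    log.append((time.time(), "signal", signum, handler))
    return _real_signal(signum, handler)

signal.signal = logging_signal
signal.signal(signal.SIGUSR1, signal.SIG_IGN)   # this call is now logged
signal.signal = _real_signal                    # unhook
_real_signal(signal.SIGUSR1, signal.SIG_DFL)    # restore default

assert log[0][1:3] == ("signal", signal.SIGUSR1)
```

The timestamps are what make the log useful: they let the hook's output be correlated with the regular test log even if the two aren't interleaved in one stream.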

(In reply to Steven Michaud from comment #29)
> Is this (what I talk about in comment #28) something you could help me with,
> Ben?
Is this something that you could run in a loop on a single machine to repro, provided you had access? This doesn't seem like something we should be deploying to our machines unless absolutely necessary.

> Is this something that you could run in a loop on a single machine
> to repro, provided you had access?
I assume so (though I'll probably need your help to figure out how to
do it).
The most important thing is that it should generate the same kind of
logs that the "real" test machines do, and that I should have access
to them (hopefully the same way as with the "real" machines).
Also, it needs to be set up *exactly* like the real test machines.
I think bsmedberg's hypothesis from comment #4 (that this bug is a
problem with the SIGCHLD handling in Mozilla's libevent) is highly
unlikely. I think it's much more likely that some other program is
messing with the signals -- possibly one that only runs on our test
machines.

(In reply to Steven Michaud from comment #31)
> > Is this something that you could run in a loop on a single machine
> > to repro, provided you had access?
>
> I assume so (though I'll probably need your help to figure out how to
> do it).
>
> The most important thing is that it should generate the same kind of
> logs that the "real" test machines do, and that I should have access
> to them (hopefully the same way as with the "real" machines).
>
> Also, it needs to be set up *exactly* like the real test machines.
>
> I think bsmedberg's hypothesis from comment #4 (that this bug is a
> problem with the SIGCHLD handling in Mozilla's libevent) is highly
> unlikely. I think it's much more likely that some other program is
> messing with the signals -- possibly one that only runs on our test
> machines.
OK, running out of band is a bad option then, because Buildbot won't be running.
Given that, we should think about deploying this to all of the machines temporarily. Would this library affect build results or performance negatively? How does it need to be deployed? Where does it log to?

Before we deploy my interpose library to all the build machines, we should try it out on one of them.
> Would this library affect build results or performance negatively?
That's hard to say. It'd slow things down, but I think only marginally. Hard to know for sure without a test, though.
> Where does it log to?
As I said in comment #28, I hope its logging will get interleaved with "regular" test logging. But this also needs to be tested.
> How does it need to be deployed?
The source tarball I've posted contains some minimal instructions. But I'll repeat (and elaborate on) them here:
Once you've built the interpose library ('make' should do the trick), you need to set the DYLD_INSERT_LIBRARIES environment variable so that it gets loaded into the appropriate executables (which should be *every* executable that runs as part of a build, including BuildBot). I don't know the build system that's used on the tryservers, so I can't give detailed instructions.
What is the single, lowest-level command to start a build? If it's something like 'BuildBot', and that's called from a shell script, I'd add the following line just before 'BuildBot':
export DYLD_INSERT_LIBRARIES=/full/path/to/interpose.dylib
(Here I assume that DYLD_INSERT_LIBRARIES isn't already being used for other purposes.)

Bug 230675 comment #39 *was* worth following up. V5 of that bug's patch *does* still reliably trigger this bug:
https://tbpl.mozilla.org/?tree=Try&rev=dcc27681f63f
Not surprisingly, I can't reproduce these failures locally.
Once we have my interpose library running (successfully) on at least one tryserver build machine, we need to do a suite of tests with this patch.

Comment on attachment 771087
Interpose library for logging signal activity, rev1
I'd like to get this onto a single tryserver machine, to ensure that it works as intended and doesn't interfere with test results (if need be by altering it). Then I'd like to get it on all the production test machines -- at least until the first occurrence of this bug.
Can you help me with this, Ben? Or should I ask someone else?

Steven, we're not the experts here -- the ateam usually does the testing of these sorts of libraries somehow, and we just do the final production deploy.
We think you can:
- put your zip file on people.mozilla.com
- modify your local copy of testing/talos/talos_from_code.py to refer to people.m.c
- submit your job to the tryserver
If that doesn't make sense, it's probably best to ask in #ateam (many of us are there as well) and we can figure out what is needed. :jmaher knows the magic incantation for sure.

Steven-
I am unclear on what exactly you are trying to pull off here.
1) do you need to modify talos in order to gain more information?
2) do you need to modify the browser and just run talos on try while collecting more information?
3) something else I don't know about?

I've said the following already several times, but I'll say it again as clearly as possible.
Here's what I need to do:
As best I can tell, this bug is caused by some part of the build infrastructure doing weird things with signals. But whether or not this is true, this bug involves weird things happening with signals. So I need to "hook" methods having to do with signals in all the programs that run when we're doing tests with our current build infrastructure -- this includes the programs that are part of the build infrastructure (including interpreters like Python), plus (of course) whatever's being tested.
For the methods I hook, I log their calling parameters and their results. For the output to be useful, it needs to be "interleaved" with all the logging that's produced when a test is run -- which includes (for example) the success or failure of individual tests.

I have concerns with this as our production talos numbers will most likely be affected. I assume we could set this up on a machine or two and use a standalone buildbot master to run the tests as we would normally.
releng, am I making sense here?

I need a machine that's set up *exactly* like the production machines that run mochitests and reftests and the like. Nothing else will do.
I suggest that you guys pull one of the production machines from its pool and give me remote access to that, via VPN.

hwine, can you ensure smichaud has buildvpn access and a loaner machine. I assume osx 10.7 will be sufficient.
:smichaud, after we get this setup on a loaner machine and attempt runs to reproduce, we can talk about multiple machines and buildbot masters if necessary. a loaner machine is a machine pulled directly out of production such that it won't be scheduling jobs, but setup the same as it would in production.

(In reply to Hal Wine [:hwine] from comment #51)
> :jmaher - thanks for helping to clarify
>
> :smichaud - loaner being processed via bug 892107
Hal confirmed in IRC that the specific Mac type does not matter; this issue has manifested in https://bugzilla.mozilla.org/show_bug.cgi?id=795256#c45 as well (on a Mountain Lion machine).
I'm going to loan that machine instead, since we can easily bring it back up after the loan. (If you do need a Snow Leopard or Lion tester we can do that; I just believe this is easier all around.)

I'm slowly making progress on this, but I've a ways to go yet.
Thanks to Callek and Kim Moir, I've been able to connect to my loaner, using both ssh and screen sharing, for about a week. And by my own digging around, I've found a fairly reliable way to "reproduce" this bug at least once in 100 runs of the mochitest-2 (aka plain2) tests, with a tryserver build altered to lengthen the plugin shutdown timer. The scare quotes are because this still definitely requires brute force -- 100 full runs of the plain2 tests take about 25 hours on my loaner. I'll say more about this in a later comment.
I've also got my interpose library working, though I haven't been able to get it to interleave its logging with that of the test output. But the test output logging uses timestamps, so I've largely gotten around the problem by making my interpose library also timestamp each of its log statements.
My interpose library tells me that it's python (and only python) that calls exit() with a "status" of '-20'. So I currently suspect this is a bug in one of our python scripts, or in python itself. But I need to keep digging to find out for sure.
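Since both logs carry timestamps, interleaving them after the fact is straightforward. A minimal sketch (with hypothetical log entries) merging two already-sorted streams by timestamp:

```python
import heapq

# Hypothetical entries: (timestamp, message) tuples, each list already
# sorted, as both logs are written in time order.
test_log = [(1.0, "TEST-START foo"), (3.0, "TEST-PASS foo")]
interpose_log = [(0.5, "signal(SIGCHLD, handler)"), (2.0, "kill(1234, 15)")]

# heapq.merge interleaves the two sorted streams by timestamp, recovering
# the global order of events across both logs.
merged = [msg for _, msg in heapq.merge(test_log, interpose_log)]
print(merged)
```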

Does anyone have any idea how often the "Exited with code -20" problem occurs? Or in other words, what percentage of test runs fail with this error?
I strongly suspect the percentage is *much* less than 1 in 100 test runs. I ran about 300 plain2 tests without seeing even a single case, before I ran them another 100 times with the plugin unload timeout lengthened, and saw two of them.
I'd guess the percentage is once in 500 or 1000 test runs. But do we have a more reliable way of finding out?

> The combined number of failures is 43 [last week]
Do we know how many Mac test runs we do per week (on all the slaves)?
This would give us a rough idea -- which is all I'm looking for.
And no, this isn't vital information. I'm just curious.
I'll keep working on this bug regardless of the results :-)

Here's something my tests show is an effective workaround for this bug.
It's a change to libevent, and follows a hint in libevent code here, written by you, Josh:
https://hg.mozilla.org/mozilla-central/annotate/89294cd501d9/ipc/chromium/src/third_party/libevent/kqueue.c#l441
Since it's your hint that I'm following, I've asked you to review this patch. If you'd rather someone else do the review, please pass it along.
I tested this patch by adding to it my patch from comment #54, doing a tryserver build, then running the mochitest-2 (aka plain2) tests 100 times in that build on my loaner. Normally I'd have seen 3-4 "Exited with code -20" failures, plus a lot of other failures of different kinds. This time I saw no failures of any kind.
So besides "fixing" the "Exited with code -20" failures, this patch is likely to reduce other kinds of intermittent failures.
I still haven't managed to find the actual cause of the "Exited with code -20" failures. But I'm quite sure it isn't in either libevent or Python. As best I can tell it's an OS bug -- possibly a kernel bug.
Fortunately I didn't hook any of the wait...() calls in early versions of my interpose library, and so was able to find a way to "reproduce" these failures. But then I found that when I hooked waitpid() (which is used by Python to wait for Firefox to finish a set of tests), I could no longer reproduce them.
I don't know whether doing this eliminates the "Exited with code -20" failures, or just makes them happen much less often. Either way, it probably makes it impossible to find this bug's true cause.
My current theory is that the "Exited with code -20" failures happen when a plugin child calls exit(0) while Python is waiting (in waitpid()) for Firefox to finish its tests and Firefox has a handler installed for SIGCHLD. This *should* make the kernel deliver a SIGCHLD signal to Firefox's handler, but the handler never gets called. Instead (I think) the kernel terminates Firefox and tells Python (via waitpid()) that this happened because Firefox received a SIGCHLD.
I have a couple of ideas for finding out more information, which I'll try over the next few days. But I'm not optimistic they'll tell me anything more than I already know.
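The POSIX-expected behavior in that theory -- handler installed, child exits normally, parent survives and reaps it -- can be checked with a short Python script; the bug is precisely that the test machines deviate from this:

```python
import os
import signal
import time

got = []
# Install a SIGCHLD handler, as ProcessWatcher::EnsureProcessTerminated
# ultimately does via loop->CatchSignal(SIGCHLD, ...).
signal.signal(signal.SIGCHLD, lambda signum, frame: got.append(signum))

pid = os.fork()
if pid == 0:
    os._exit(0)        # the "plugin child" quits normally

# Expected: the handler runs, the parent is *not* terminated, and
# waitpid reports a normal exit -- never death by SIGCHLD.
_, status = os.waitpid(pid, 0)
for _ in range(100):   # give the interpreter time to run the handler
    if got:
        break
    time.sleep(0.01)

assert os.WIFEXITED(status)
assert not os.WIFSIGNALED(status)
assert got == [signal.SIGCHLD]
```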

Comment on attachment 794102
Effective workaround
Review of attachment 794102:
-----------------------------------------------------------------
FYI, I didn't write any of this code, as your comment suggests (maybe I'm misinterpreting). I just updated our libevent in bug 842887. I wonder whether this existed before that update or is a regression. The code looks fine to me, but you might want to run it by someone like bsmedberg as well. Also, I am not a module owner or peer for this.
Either way, if this fixes a problem that is not specific to how Mozilla sends signals the patch should be upstreamed. It's a bit messy to just apply patches to third party libs. The other cleaner option is to put a patch into the tree that our build system can apply.
::: ipc/chromium/src/third_party/libevent/kqueue.c
@@ +432,5 @@
> if (kevent(kqop->kq, &kev, 1, NULL, 0, &timeout) == -1)
> return (-1);
>
> /* XXXX The manpage suggest we could use SIG_IGN instead of a
> * do-nothing handler */
You can remove this comment now.
@@ +433,5 @@
> return (-1);
>
> /* XXXX The manpage suggest we could use SIG_IGN instead of a
> * do-nothing handler */
> + if (_evsig_set_handler(base, nsignal, SIG_IGN) == -1)
Do you expect this change to actually do something or is SIG_IGN equivalent to the do-nothing function?

> FYI, I didn't write any of this code, as your comment suggests
> (maybe I'm misinterpreting). I just updated our libevent in bug
> 842887.
Oops, sorry. I just saw your name in hg blame and assumed.
> The code looks fine to me but you might want to run it by someone
> like bsmedberg as well. Also, I am not a module owner or peer for
> this.
I'll find an additional reviewer who's closer to the code I changed.
> Do you expect this change to actually do something or is SIG_IGN
> equivalent to the do-nothing function?
As I understand it, SIG_IGN (which tells the kernel to ignore the
signal) is *supposed* to be functionally equivalent to installing an
empty handler. So also would SIG_DFL be in this case (since the
default behavior for SIGCHLD is to ignore it).
Clearly we've now found a case where they *aren't* equivalent, at
least on OS X. But I think that's because of an OS bug (probably a
kernel bug).
> Either way, if this fixes a problem that is not specific to how
> Mozilla sends signals the patch should be upstreamed. It's a bit
> messy to just apply patches to third party libs. The other cleaner
> option is to put a patch into the tree that our build system can
> apply.
I'll look for an existing bug on libevent, or open one if I need to.
In it I'll refer to this bug.
I suspect the issue may take a long time to resolve, though. So I'll
also follow your other suggestion, and imitate what we do for patches
to cairo.

Apparently the first libevent release in which the changes from comment #67 appeared was 2.1.1-alpha. 2.0.21 (the endpoint of Josh's update from bug 842887) is still the current release on the 2.0.X "stable" branch -- which (of course) doesn't contain these changes.
I don't know how the libevent release train works. But whenever the patch from comment #67 gets onto the 2.0.X stable branch (or *some* stable branch), we can upgrade to that version and drop whatever's been done here (presuming such an upgrade doesn't raise other issues).
In the meantime we can land my patch, or something like it.

If people want I could change my patch to make it identical to the one from comment #67. I don't think it'd make any difference. But I don't know the libevent code at all well, so I might have missed something.

I don't fully understand the problem here but I can comment a little on libevent and the change I made to it:
It looks like Steven Michaud's patch unconditionally sets the signal handler to SIG_IGN, whereas mine uses SIG_DFL for SIGCHLD (and only SIGCHLD). This is because, on many Unixes (including at least everything with kqueue) if the handler for SIGCHLD is SIG_IGN, child exit statuses are discarded by the kernel. This means, not only is no signal handler called, kqueue doesn't report the signal either, and *none of the wait() functions will report that the child has exited.* SIG_DFL for SIGCHLD behaves the same way SIG_IGN does for other signals, i.e. no signal handler is called but the event can still be fielded by kqueue, and wait() continues to work.
For a generic event handling library like libevent, SIG_DFL is therefore the correct choice when setting up child exits to be fielded via kqueue. And it is *probably* also the correct choice for us - I assume we actually want to know when child processes exit! However, since this bug is about spurious termination of the main Firefox process with SIGCHLD, I have to wonder if setting SIG_DFL here *wouldn't* fix the problem. I strongly recommend that we try it and find out what happens.
It is indeed the case that libevent 2.1.x is not yet considered stable. My patch did not go onto the 2.0.x branch because when I wrote it, I thought it was pure optimization; however, if it can be demonstrated to fix this problem, I'm sure Nick M would be happy to put it into the next 2.0.x release. I have no real sense of how hard it will be to upgrade from 1.4.x to 2.0.x, but it ought to be done (in conjunction with Chromium upstream), because 1.4.x is EOLed.
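The SIG_IGN-discards-statuses behavior Zack describes is easy to confirm; a Python sketch (the same kernel semantics apply to the C code):

```python
import os
import signal

# With SIGCHLD set to SIG_IGN, the kernel auto-reaps children and
# discards their exit statuses, so waitpid fails with ECHILD.
signal.signal(signal.SIGCHLD, signal.SIG_IGN)
pid = os.fork()
if pid == 0:
    os._exit(0)
try:
    os.waitpid(pid, 0)
    ign_result = "status returned"
except ChildProcessError:       # errno ECHILD
    ign_result = "ECHILD"

# With SIG_DFL, no handler runs but the status is still collectable.
signal.signal(signal.SIGCHLD, signal.SIG_DFL)
pid = os.fork()
if pid == 0:
    os._exit(0)
_, status = os.waitpid(pid, 0)

print(ign_result)
assert os.WIFEXITED(status)
```

This is why SIG_DFL, not SIG_IGN, is the safe choice for SIGCHLD in a library that still wants wait() (or kqueue) to see child exits.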

> I have to wonder if setting SIG_DFL here *wouldn't* fix the problem.
> I strongly recommend that we try it and find out what happens.
I did check it, about a week ago. It *does* work.
In other words, SIG_DFL works around this bug just as well as SIG_IGN.
> My patch did not go onto the 2.0.x branch because when I wrote it, I
> thought it was pure optimization; however, if it can be demonstrated
> to fix this problem, I'm sure Nick M would be happy to put it into
> the next 2.0.x release.
Glad to hear it. Should I ask at
http://sourceforge.net/p/levent/bugs/244/?
> I have no real sense of how hard it will be to upgrade from 1.4.x to
> 2.0.x, but it ought to be done (in conjunction with Chromium
> upstream), because 1.4.x is EOLed.
Even though Chromium hasn't, we've already upgraded to 2.0.13, at bug
842887. I'm sure it wouldn't be hard for us to upgrade again to
whatever the next 2.0.X release will be.
In the meantime, though, we need to do something here, since this bug
is pretty urgent.

(In reply to Steven Michaud from comment #74)
> > I have to wonder if setting SIG_DFL here *wouldn't* fix the problem.
> > I strongly recommend that we try it and find out what happens.
>
> I did check it, about a week ago. It *does* work.
> In other words, SIG_DFL works around this bug just as well as SIG_IGN.
Good to know!
> Glad to hear it. Should I ask at
> http://sourceforge.net/p/levent/bugs/244/?
Yes, but please also send mail to libevent-users@freehaven.net. (I don't remember if you have to subscribe or not, sorry.)
(In reply to Steven Michaud from comment #75)
> But Zack, first let me ask you what you think of my patch's changes to
> _evsig_set_handler() in event.c? Good? Bad? Makes no difference?
I believe those are harmless but unnecessary. The kernel should ignore sa_flags and sa_mask when sa_handler is SIG_IGN or SIG_DFL.

Thanks, Ryan. You beat me to it. And you're quite welcome.
I expect this patch to make the "Exited with code -20" errors disappear completely, and to cause a substantial reduction in the "application timed out" errors (most of which I now suspect are spurious). Please comment again in a week with whatever results you can find. I'll set the needinfo flag to remind you.
Working on this bug has taught me a lot about how our build infrastructure works, at least on the Mac. That may make it easier for me to figure out other Mac build infrastructure bugs (like bug 898519). Or to be more realistic, it may move some of those bugs from "impossible" to "very difficult" :-)

Comment on attachment 794908
Effective workaround rev2 (backport Zack's patch)
[Approval Request Comment]
Bug caused by (feature/regressing bug #): Apple kernel bug
User impact if declined: none
Testing completed (on m-c, etc.): one day on m-c, lots of testing on loaner slave
Risk to taking this patch (and alternatives if risky): low to none
String or IDL/UUID changes made by this patch: none
The people who (like RyanVM) manage the build/test infrastructure would really like to get this patch on all branches as soon as possible. Its randomness and frequency make it a significant burden.
RyanVM tells me there have been no "Exited with code -20" errors since this patch landed on the trunk last night. That, by itself, is no guarantee. But I myself did extensive testing of various versions of my patch (at least 400 runs), without seeing even one "Exited with code -20" error (or any other failures).
This patch will have no positive user impact. "Ordinary" users (I assume) don't build Firefox or run it from a script. Likewise it's very unlikely to have any negative impact. But if it *is* going to cause problems, the best way to find them is to get more testers, which we can do by landing this patch on the branches.

Comment on attachment 794908
Effective workaround rev2 (backport Zack's patch)
Patch is low risk and avoids sheriff pain.
Steven, given this will need to go out sooner rather than later to get more testing: if regressions do exist in the wild, what kind of reports would we be looking at? Would they be easy enough to discover in a couple of beta cycles?

> if regressions do exist in the wild, what kind of reports would we
> be looking at
It's very difficult to say.
It's extremely unlikely that the original bug affects users. If it
did, though, I'd be looking for cases of Firefox spontaneously exiting
when lots of plugins are in use -- probably without any Breakpad
report.
So maybe something like that?
> Would they be easy enough to discover in a couple of beta cycle's
Anything major would be.

I'm giving up my attempt to find this bug's true cause. It's terribly
time consuming. And there's already strong evidence that this is some
kind of Apple bug -- presumably a kernel bug.
I mentioned in comment #60 that hooking waitpid() in my interpose
library stops the bug happening. I found a workaround for this
(instead of logging calls to waitpid() from my interpose library, I
now do it from Python code). But I keep running into the same kind of
problem, and it's a royal pain to have to wait several hours (or
overnight) for the results of each new test.
Here, for the record, is my current best guess as to what's going on:
1) Firefox calls ProcessWatcher::EnsureProcessTerminated() on a child
process, installs a SIGCHLD handler (in a call to
loop->CatchSignal(SIGCHLD, reaper, reaper)), and the child process
exits normally.
2) Before the kernel has had a chance to call the handler (or Firefox
has had a chance to remove it) another child process of the same
Firefox process exits normally.
At least some of the time this bug results.
Another weird thing I noticed, which may or may not be related (and
which is only observable when I hook waitpid() in my interpose
library):
Sometimes a Firefox child process will exit normally (will call
_exit(0)), but then when Firefox calls waitpid() on that process, it
says that the process terminated with a SIGHUP. I never see this when
Python calls waitpid() on a Firefox process.
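For anyone instrumenting this further, the normal-exit vs. killed-by-signal distinction that waitpid() reports can be decoded like so (a Python sketch; the child's SIGHUP disposition is reset to default in case the harness ignores it):

```python
import os
import signal

# Case 1: normal exit. waitpid reports WIFEXITED with the exit code.
pid = os.fork()
if pid == 0:
    os._exit(7)
_, status = os.waitpid(pid, 0)
assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 7

# Case 2: death by signal. waitpid reports WIFSIGNALED with the signal.
pid = os.fork()
if pid == 0:
    signal.signal(signal.SIGHUP, signal.SIG_DFL)  # ensure default (terminate)
    os.kill(os.getpid(), signal.SIGHUP)
    os._exit(0)                                   # not reached
_, status = os.waitpid(pid, 0)
assert os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGHUP
print("ok")
```

The weirdness above is that a child observed calling _exit(0) should land in case 1, yet waitpid sometimes reports case 2 (SIGHUP) -- the same shape as the SIGCHLD mystery this bug is about.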