At 11:13 PM 5/13/2002 -0400, Michael Bacon wrote:
>Sounds like what we're running into at the moment, which appears to be the
>master processes ending up with an incorrect count of available workers.
>The problem occurs when a worker process dies while in the "available"
>state, and doesn't notify the master. Jeremy Howard recently posted a
>patch which addresses this problem, by decrementing the "available
>workers" counter when receiving a SIGCLD, which strikes me as the right
>way to go. However, his patch is for 2.1.3, and like you, we're using
>2.0.16 (the bleeding edge is a bad place

This is extremely interesting. Michael, do you find this happens at
seemingly random times though? We can go a week or two with no problems,
and then bam, I get a 911. Of course, our volume is considerably lower than
yours. Another issue, and one that may differentiate our problem from
yours (hopefully not, as you at least have a work-around), is that
sometimes even restarting Cyrus doesn't help: clients can still connect,
but no connections are serviced. I've found that when this happens, Cyrus
will often appear to work for a VERY short while after the restart, and
then revert to the state where connections are accepted but no service
(pop3d) responds.

Shouldn't a restart completely fix the problem? If so, we may be fighting
something different. A reboot doesn't always clear up the problem either.
Again, Cyrus will come up, but then fail shortly thereafter.

What is really odd is that the problem just goes away after a few hours.
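
As a rough illustration of the counter fix Michael describes above, here is
a minimal sketch in C. It is not Jeremy's patch and not the real Cyrus
master code: the struct layout and every name in it (service_add_child,
service_set_available, service_child_exited, service_reap, ready_workers)
are hypothetical, chosen only to show the idea. The master remembers whether
each worker last reported itself as available, and when it reaps a dead
worker it decrements the "available workers" counter if that worker died
while idle.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

#define MAX_CHILDREN 64

/* Hypothetical per-worker bookkeeping kept by the master. */
enum child_state { SLOT_FREE = 0, CHILD_BUSY, CHILD_AVAILABLE };

struct child {
    pid_t pid;
    enum child_state state;
};

struct service {
    struct child children[MAX_CHILDREN];
    int ready_workers;  /* workers the master believes are idle */
    int nactive;        /* all live workers, idle or busy */
};

static struct child *find_child(struct service *s, pid_t pid)
{
    for (size_t i = 0; i < MAX_CHILDREN; i++) {
        if (s->children[i].state != SLOT_FREE && s->children[i].pid == pid)
            return &s->children[i];
    }
    return NULL;
}

/* Record a newly forked worker; it starts out busy. Returns -1 if full. */
int service_add_child(struct service *s, pid_t pid)
{
    for (size_t i = 0; i < MAX_CHILDREN; i++) {
        if (s->children[i].state == SLOT_FREE) {
            s->children[i].pid = pid;
            s->children[i].state = CHILD_BUSY;
            s->nactive++;
            return 0;
        }
    }
    return -1;
}

/* Called when a worker reports that it became idle (1) or busy (0). */
int service_set_available(struct service *s, pid_t pid, int available)
{
    struct child *c = find_child(s, pid);
    if (c == NULL)
        return -1;
    if (available && c->state == CHILD_BUSY) {
        c->state = CHILD_AVAILABLE;
        s->ready_workers++;
    } else if (!available && c->state == CHILD_AVAILABLE) {
        c->state = CHILD_BUSY;
        s->ready_workers--;
    }
    return 0;
}

/* Called for every reaped pid. Without the decrement below, a worker
 * that dies while idle leaves ready_workers one too high forever. */
void service_child_exited(struct service *s, pid_t pid)
{
    struct child *c = find_child(s, pid);
    if (c == NULL)
        return;
    if (c->state == CHILD_AVAILABLE)
        s->ready_workers--;
    s->nactive--;
    c->state = SLOT_FREE;
}

/* Run from the master's main loop after a SIGCHLD has been noted;
 * WNOHANG lets us collect every exited child without blocking. */
void service_reap(struct service *s)
{
    int status;
    pid_t pid;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
        service_child_exited(s, pid);
}
```

If the counter really is the culprit, a master that overcounts idle workers
would accept connections but never fork anyone to serve them, which would
match the "connects, but gets no service" symptom. Whether that is actually
what we are seeing here is only a guess.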