We've been running unicorn-3.6.2 on REE 1.8.7 2011.12 in production for quite some time, and we use monit to monitor each unicorn worker. Occasionally, I get a notification that a worker has timed out and been respawned. In all of these cases, when I look at the rails logs, I can see the last request that the worker handled, and each one appears to have completed successfully from the client's perspective (rails and nginx respond with 200), but the unicorn log shows that the worker was killed due to timeout. This has always been relatively rare, and I thought it was a non-problem.
Until today.
Today, for about seven minutes, our workers were repeatedly reported as timed out and killed by the master. After respawning, each worker would serve a handful of requests and then be killed again.
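To put numbers on a window like this, the kills can be tallied per minute from the unicorn log. This is only a rough sketch: the line shape in the comment is what I'd expect from the stdlib Logger formatter plus unicorn's timeout message, but treat it as an assumption and check it against your own log. The names TIMEOUT_LINE and timeouts_per_minute and the log path are made up for illustration.

```ruby
# Assumed shape of a unicorn timeout line (verify against your own log):
#   E, [2012-03-01T08:23:45.123456 #1234] ERROR -- : worker=3 PID:5678 timeout (121s > 120s), killing
TIMEOUT_LINE = /\[(\S+) #\d+\]\s+ERROR -- : worker=(\d+) PID:(\d+) timeout \((\d+)s > (\d+)s\)/

# Returns a hash mapping "YYYY-MM-DDTHH:MM" to the number of workers
# killed for timeout during that minute.
def timeouts_per_minute(lines)
  counts = Hash.new(0)
  lines.each do |line|
    m = TIMEOUT_LINE.match(line)
    next unless m
    minute = m[1][0, 16]  # truncate the timestamp to minute resolution
    counts[minute] += 1
  end
  counts
end

# Hypothetical usage; the path is an assumption:
#   counts = timeouts_per_minute(File.readlines('log/unicorn.stderr.log'))
#   counts.sort.each { |minute, n| puts "#{minute}  #{n}" }
```

A steady trickle of kills across the window would point in a different direction than one burst where every worker dies in the same minute.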
During this time, our servers (web, PG DB, and redis) were not under load and IO was normal. After the last monit notification at 8:30, everything went back to normal. I understand why unicorn workers would time out if they were waiting on IO for more than 120 seconds, but there aren't any orphaned requests in the rails log. For each request line, there's a corresponding completion line. There are no long-running queries to blame on PG, either.
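For reference, the request/completion pairing can be checked mechanically rather than by eye. This is a minimal sketch and not how the rails logger formats things by default: it assumes each log line is prefixed with the worker PID in square brackets and uses Rails 3-style "Started"/"Completed" lines. Both the prefix and the patterns are assumptions, so adjust them to whatever the log actually contains.

```ruby
# Hypothetical line formats (assumptions, adjust to the real log):
#   [1234] Started GET "/path" for 10.0.0.1 at ...
#   [1234] Completed 200 OK in 12ms
STARTED   = /\A\[(\d+)\] Started (\S+) "([^"]*)"/
COMPLETED = /\A\[(\d+)\] Completed (\d{3})/

# Returns [pid, "METHOD path"] pairs for requests that were started
# but never followed by a completion line from the same PID.
def orphaned_requests(lines)
  open_requests = {}  # pid => request currently in flight for that worker
  orphans = []
  lines.each do |line|
    if (m = STARTED.match(line))
      pid = m[1]
      # A new start while one is still open means the earlier one never completed.
      orphans << [pid, open_requests[pid]] if open_requests.key?(pid)
      open_requests[pid] = "#{m[2]} #{m[3]}"
    elsif (m = COMPLETED.match(line))
      open_requests.delete(m[1])
    end
  end
  # Anything still open at the end of the log is also orphaned.
  open_requests.each { |pid, req| orphans << [pid, req] }
  orphans
end
```

An empty result backs up the observation above: the workers were not killed in the middle of a request that rails knew about.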
I know we're probably due for an upgrade, but I'm hoping to get to the bottom of these unexplained timeouts.
Thanks for your help!
-Nick