People running Rails applications under FCGI have been seeing some
strange behavior. I have a hypothesis regarding the cause, but I'd
like some feedback as to whether it is reasonable, along with any
solutions/workarounds that people might have.

Sometimes (and some apps experience this more frequently than others)
an FCGI process that is not currently handling a request will fail to
respond to a signal (specifically USR1 or HUP) until a request is
received. This is problematic when updating an application, because
you typically want to gracefully terminate all existing FCGI
processes and start up some new ones pointing at your updated code.
But some (or many) of the processes don't respond until a request is
received, meaning the user can get anything from a stale version of
your app to a 500 error, depending on how well-behaved (or
ill-behaved) the FCGI process is.

Currently, Rails uses a "nudge" approach (EXTREMELY hacky) to handle
this. When an application is restarted, you send _n_ requests to the
application with the assumption that those requests will be
sufficient to trigger the sleeping processes and let them gracefully
terminate. The problem is, it doesn't work very well, especially in
the case of Apache- or Lighttpd-managed FCGI processes. And even
independently-managed FCGI processes will sometimes croak with this
approach.
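
To make the nudge idea concrete, here is a rough sketch of what it
amounts to. This is not the actual Rails code: the `nudge` method, its
`requester:` parameter, and the URL are all made up for illustration.
It just fires a fixed number of throwaway requests and hopes each one
lands on a process that is still waiting.

```ruby
require "net/http"
require "uri"

# Hypothetical helper (not part of Rails): send `count` throwaway
# requests to `url` in the hope that each one wakes an idle FCGI
# process so it can notice the pending signal and exit.
# `requester` is an assumption added so the HTTP call can be swapped out.
def nudge(url, count, requester: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI.parse(url)
  count.times.map do
    begin
      requester.call(uri).code   # e.g. "200" or "500"
    rescue StandardError => e
      e.class.name               # a dying process may drop the connection
    end
  end
end
```

Calling something like `nudge("http://localhost/", 5)` returns one
status (or error class name) per request. Nothing guarantees that the
web server hands each request to a different process, which is
probably part of why this works so poorly.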

My hypothesis regarding the cause of the unresponsiveness is this
(and please feel free to gently debunk it; I'm not ashamed to admit
that I'm somewhat in over my head here): the processes in question
are blocked in some IO call (like accepting a connection on a
socket), and Ruby does not get control back until that call returns.
This prevents Ruby from invoking the signal handler callback until
the IO finishes. Does that sound reasonable? If not, any other ideas
what might be causing it?
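
A small experiment along these lines might help confirm or refute the
hypothesis. It is only a sketch: the method name and the `delay:`
parameter are made up, and it blocks in a plain Ruby `TCPServer#accept`
instead of the FCGI library's accept loop, so the real FCGI processes
may well behave differently. It records whether the USR1 handler runs
while the process is still waiting in accept, or only after a
connection arrives.

```ruby
require "socket"

# Hypothetical experiment: does a trap handler run while the main
# thread is blocked in accept, or only once a "request" comes in?
def signal_seen_before_request?(delay: 0.2)
  events = []
  previous = Signal.trap("USR1") { events << :signal }
  server = TCPServer.new("127.0.0.1", 0)    # port 0: let the OS pick one
  port = server.addr[1]

  helper = Thread.new do
    sleep delay
    Process.kill("USR1", Process.pid)       # signal while accept is blocking
    sleep delay
    TCPSocket.new("127.0.0.1", port).close  # the request that unblocks accept
  end

  client = server.accept                    # main thread blocks here
  events << :request
  client.close
  helper.join
  events.first == :signal                   # true if the handler ran first
ensure
  server&.close
  Signal.trap("USR1", previous || "DEFAULT") # restore the earlier handler
end
```

If `signal_seen_before_request?` returns false, the handler was held
up until the connection arrived, which is the behavior described
above. If it returns true, the blocking would have to be happening
somewhere other than an ordinary Ruby socket call.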

And even more importantly, is there a sane way to work around (or
better yet, _fix_) this problem? It's a rather nasty stumbling block
to automated application deployment.

Thanks for any help,
Jamis