The problem complained of in bug #3843 was something I'd noticed a few
days ago and meant to fix. ISTM the recent change to have the archiver
outlive the postmaster was incompletely thought out, and we really need
to take two steps back and reconsider, if we want to fix it so it works.

As of CVS HEAD, the behavior after the postmaster receives a shutdown
request and has seen its last regular-backend child die is:

1. Issue SIGUSR2 to the bgwriter to make it start a shutdown checkpoint.

2. Immediately SIGQUIT the archiver.

3. Back at the main loop, restart the archiver, if it exits before the
   bgwriter finishes the checkpoint (as is highly likely).

4. After postmaster exits, archiver eventually notices it's gone,
   but that takes a good while since we are guaranteed to be just
   starting the delay loop inside the fresh archiver process.

This is just plain dumb.  Aside from the uselessness of killing a
process only to immediately re-fork it, we should not be SIGQUIT'ing
the archiver during normal operation --- that might abort an archive
copy partway through, and it's anybody's guess whether the
archive_command script is smart enough to deal with that situation.

ISTM the postmaster should leave the archiver alone at the
PM_WAIT_BACKENDS -> PM_SHUTDOWN transition, and instead send it
a WAKEN signal (SIGUSR1) when it sees normal exit of the bgwriter.
That will afford an opportunity to archive anything that was pushed
out during the shutdown checkpoint.  A possibly better alternative,
since the archiver isn't using SIGUSR2, is to send it SIGUSR2, which
would be defined as "archive what you can and then quit".  (In that
case, the !PostmasterIsAlive exit would be taken only in the event
of a true postmaster crash, which is improbable.)
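
For concreteness, a rough sketch of how the SIGUSR2 variant might look on
the archiver side.  This is not taken from the actual archiver code: the
flag and function names are made up for illustration, and the queue of
segments waiting to be archived is reduced to a plain counter.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>

/* Flags set by the signal handlers (hypothetical names). */
static volatile sig_atomic_t wakened = 0;         /* SIGUSR1: work may be ready */
static volatile sig_atomic_t drain_and_quit = 0;  /* SIGUSR2: archive, then exit */

static void handle_sigusr1(int signo) { (void) signo; wakened = 1; }
static void handle_sigusr2(int signo) { (void) signo; drain_and_quit = 1; }

/* Stand-in for the set of WAL segments waiting to be archived. */
static int pending_segments = 0;
static int archived_segments = 0;

/* Archive one pending segment; returns false when nothing is left. */
static bool archive_one_segment(void)
{
    if (pending_segments == 0)
        return false;
    pending_segments--;
    archived_segments++;
    return true;
}

static void install_archiver_handlers(void)
{
    signal(SIGUSR1, handle_sigusr1);
    signal(SIGUSR2, handle_sigusr2);
}

/*
 * One pass of the archiver's main loop.  Returns false once the archiver
 * should exit, that is, after SIGUSR2 was seen and the queue was drained.
 */
static bool archiver_loop_once(void)
{
    bool quitting = (drain_and_quit != 0);

    if (wakened || quitting)
    {
        wakened = 0;
        while (archive_one_segment())
            ;                   /* keep going until nothing is pending */
    }
    return !quitting;
}
```

The point of the separate flag is that SIGUSR2 never interrupts a copy; it
only changes what the loop does after the pending work has been handled.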

Another case that seems not to have been thought about very much is
whether the archiver should behave differently in a "mode fast" shutdown
as opposed to "mode smart". I would argue that it should not, since
both cases are supposed to be equally safe for your data. I notice
though that the postmaster suppresses forwarding of WAKEN signals
after entering FastShutdown mode; that doesn't seem like a good idea.

Another case that needs some revisiting is the archiver's response
to SIGTERM, which is currently SIG_IGN. Since the postmaster will never
send it SIGTERM, we should assume that receipt of SIGTERM means that
init is telling us we have N seconds left before system shutdown.
Is it a good idea to continue archiving in that situation? I doubt it
--- it seems like we are just asking to get SIGKILL'd partway through a
copy step. I suggest that the response to SIGTERM ought to be to finish
out the current copy operation (if possible) but then quit without
initiating any new ones.
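
Roughly what I have in mind for SIGTERM, as a sketch only.  The names are
hypothetical, and the copy step is a function pointer standing in for
running archive_command; demo_copy exists only to simulate SIGTERM
arriving in the middle of a copy.

```c
#include <signal.h>
#include <stdbool.h>

/* Set by the SIGTERM handler; only examined between copy operations. */
static volatile sig_atomic_t got_sigterm = 0;

static void handle_sigterm(int signo) { (void) signo; got_sigterm = 1; }

/* Hypothetical stand-in for "run archive_command on one segment". */
typedef bool (*copy_fn) (int segno);

/*
 * Archive up to nsegments segments.  A SIGTERM that arrives during a copy
 * lets that copy finish, but no further copy is started afterwards.
 * Returns the number of segments successfully copied.
 */
static int archive_until_sigterm(int nsegments, copy_fn copy_segment)
{
    int done = 0;

    for (int seg = 0; seg < nsegments; seg++)
    {
        if (got_sigterm)
            break;              /* do not initiate a new copy */
        if (!copy_segment(seg))
            break;              /* copy failed; give up for now */
        done++;
    }
    return done;
}

/* Demo copy step: pretends SIGTERM arrives while segment 1 is copied. */
static bool demo_copy(int segno)
{
    if (segno == 1)
        raise(SIGTERM);
    return true;
}
```

With the handler installed, archive_until_sigterm(5, demo_copy) completes
segments 0 and 1 and then stops without touching segment 2.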

And while I'm griping: I see that the pgstats process is SIGQUIT'ed at
the entry to PM_SHUTDOWN state, same as the archiver. This likewise
seems out of step with current reality, since the bgwriter now sends
messages to the stats collector. This step needs to be moved to after
bgwriter termination, too.
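
Put together, the ordering I'm proposing looks about like this toy model.
PM_WAIT_BACKENDS and PM_SHUTDOWN are the real state names; PM_CHILDREN_TOLD
and everything else here is invented for the sketch, and the signals are
reduced to boolean flags so that only the ordering is shown.

```c
#include <stdbool.h>

typedef enum
{
    PM_WAIT_BACKENDS,           /* waiting for regular backends to exit */
    PM_SHUTDOWN,                /* bgwriter is doing the shutdown checkpoint */
    PM_CHILDREN_TOLD            /* hypothetical: archiver and pgstat notified */
} PMState;

typedef struct
{
    PMState state;
    bool    bgwriter_signaled;  /* SIGUSR2 sent: start shutdown checkpoint */
    bool    archiver_signaled;  /* told to archive what it can, then quit */
    bool    pgstat_signaled;    /* told to quit */
} PostmasterModel;

/* Last regular backend is gone: only the bgwriter is signaled here. */
static void last_backend_exited(PostmasterModel *pm)
{
    if (pm->state != PM_WAIT_BACKENDS)
        return;
    pm->bgwriter_signaled = true;
    pm->state = PM_SHUTDOWN;
}

/* Bgwriter exited normally: now it is safe to notify the other two. */
static void bgwriter_exited(PostmasterModel *pm)
{
    if (pm->state != PM_SHUTDOWN)
        return;
    pm->archiver_signaled = true;
    pm->pgstat_signaled = true;
    pm->state = PM_CHILDREN_TOLD;
}
```

The archiver and the stats collector are left alone until the checkpoint
is done, so neither gets killed while the bgwriter may still be feeding it.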

Comments?  Anyone see any other bugs here?

			regards, tom lane