There is an open item for synchronous replication and smart shutdown,
which links to this message:
http://archives.postgresql.org/pgsql-hackers/2011-03/msg01391.php
The issue is not straightforward, however, so I want to get some
broader input before proceeding. In short, the problem is that if
synchronous replication is in use, no standbys are connected, and a
smart shutdown is requested, any future commits will wait for a
wake-up that will never come, because by that point the postmaster is
no longer accepting connections, so no standby can reconnect to
release the waiters. Or, if there is a standby connected when the smart
shutdown is requested, but it subsequently gets disconnected, it won't
be able to reconnect, and again all waiters will get stuck.
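
To make the failure mode concrete, here is a toy model in Python. It is
only a sketch: the class, its methods, and the scenario functions are
invented for illustration and do not correspond to anything in the
PostgreSQL source. It assumes that a reconnecting standby releases all
waiters at once and that a smart shutdown refuses every new connection.

```python
class ToyPrimary:
    """Hypothetical toy model of a primary using synchronous replication."""

    def __init__(self, sync_standbys=0):
        self.sync_standbys = sync_standbys  # connected sync standbys
        self.smart_shutdown = False         # has smart shutdown been requested?
        self.waiters = 0                    # commits waiting for a standby ack

    def request_smart_shutdown(self):
        self.smart_shutdown = True

    def standby_connect(self):
        """Return True if the standby was admitted."""
        if self.smart_shutdown:
            return False  # assumption: no new connections are accepted
        self.sync_standbys += 1
        self.waiters = 0  # assumption: the standby releases every waiter
        return True

    def standby_disconnect(self):
        self.sync_standbys = max(0, self.sync_standbys - 1)

    def commit(self):
        """Return True if the commit is acknowledged, False if it waits."""
        if self.sync_standbys > 0:
            return True
        self.waiters += 1
        return False

    def is_stuck(self):
        # Waiters exist, nobody can release them, nobody can reconnect.
        return (self.smart_shutdown and self.waiters > 0
                and self.sync_standbys == 0)


def scenario_no_standby():
    """Smart shutdown requested while no standby is connected."""
    p = ToyPrimary(sync_standbys=0)
    p.request_smart_shutdown()
    p.commit()           # an existing session commits and starts waiting
    p.standby_connect()  # the standby tries to come back and is refused
    return p.is_stuck()


def scenario_standby_drops():
    """Standby is connected at shutdown time but disconnects afterwards."""
    p = ToyPrimary(sync_standbys=1)
    p.request_smart_shutdown()
    first_commit_ok = p.commit()  # still acknowledged by the standby
    p.standby_disconnect()
    p.standby_connect()           # refused: shutdown already in progress
    p.commit()                    # this one waits
    return first_commit_ok, p.is_stuck()
```

In both scenarios the model ends up with waiters that nothing can
release, which is the situation described above.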
There are a few plausible ways to proceed here:
1. Do nothing. If this happens to you, you will need to request fast
or immediate shutdown to get the system unstuck. Since it's already
pretty easy for a smart shutdown to hang (all you need is one
connection sitting open doing nothing), most people probably already
have provisions for this and likely wouldn't be terribly inconvenienced
by one more corner case. On the flip side, I would rather that we were
moving in the direction of making it more likely for smart shutdown to
actually shut down the system, rather than less likely.
2. When a smart shutdown is initiated, shut off synchronous
replication. This definitely makes sure you won't get stuck waiting
for sync rep, but on the other hand you probably configured sync rep
because you wanted, uh, sync rep. Or alternatively, continue to allow
sync rep for as long as there is a sync standby connected, but if the
last sync standby drops off then shut it off.
3. Accept new replication connections even when the system is
undergoing a smart shutdown. This is the approach that the
above-linked patch tries to take, and it seems superficially sensible,
but it doesn't really work. Currently, once a shutdown has been
initiated and any on-line backup has been stopped, we stop creating
regular backends; we instead only create dead-end backends that just
return an error message and exit. Once no regular backends remain, we
then stop accepting connections AT ALL and wait for the dead-end
backends to drain out. What this patch proposes to do (though it
isn't really clear from the way it's written) is continue creating
regular backends but boot out all but superuser and replication
connections as soon as possible. However, that misses the reason why
the current code works the way that it does: to make sure that even in
the face of a continuing stream of connection requests, we actually
eventually manage to stop talking and shut down. Basically, this
patch would fix the smart-shutdown-sync-rep interaction at the expense
of making smart shutdown considerably more fragile in other cases,
which does not seem like a good trade-off. AFAICT, this whole
approach is doomed to failure.
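
The concern with option 3 can be illustrated with a small simulation.
This is a sketch under loudly stated assumptions, not a model of the
real postmaster: time advances in ticks, exactly one connection request
arrives per tick for as long as connections are accepted, a dead-end
backend exits within one tick, and under the patch a newly created
backend survives for one tick before being booted. The function name
and the two policy labels are hypothetical.

```python
def ticks_until_shutdown(policy, initial_backends, session_ticks,
                         max_ticks=1000):
    """Simulate a smart shutdown under a steady stream of connection requests.

    policy: "dead_end" (current behaviour as described above) or
            "regular_then_boot" (a rough reading of the proposed patch).
    Returns the tick at which shutdown completes, or None if it has not
    completed within max_ticks.
    """
    # Remaining lifetime, in ticks, of each regular backend.
    regular = [session_ticks] * initial_backends
    dead_end = 0      # dead-end backends that have not exited yet
    accepting = True  # is the postmaster still accepting connections?

    for tick in range(1, max_ticks + 1):
        # Existing backends make progress; those that reach zero exit.
        regular = [t - 1 for t in regular if t - 1 > 0]
        dead_end = 0  # assumption: dead-end backends exit within one tick

        # Current behaviour: once no regular backends remain, stop
        # accepting connections altogether.
        if policy == "dead_end" and not regular:
            accepting = False

        # One new connection request arrives on every tick.
        if accepting:
            if policy == "dead_end":
                dead_end += 1      # error message, then exit
            else:
                regular.append(1)  # regular backend, booted one tick later

        # Shutdown can complete only when nothing is left to wait for.
        if not regular and dead_end == 0:
            return tick
    return None
```

With two existing sessions that each last three ticks, the "dead_end"
policy finishes as soon as those sessions end, while in this model the
"regular_then_boot" policy never sees the count of regular backends
reach zero, because every tick brings in a fresh one.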
Anyone else have an idea or opinion?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company