Shared Storage

From PostgreSQL wiki

Contents

Cold standby only

PostgreSQL (as opposed to various vendor forks of PostgreSQL) does not support multi-master or master-and-read-replica shared storage operation. It can only be used in cold standby failover configurations for shared-storage, and even then it is only safe with robust fencing/STONITH.

Attempts to use PostgreSQL in multi-master shared storage configurations will result in extremely severe data corruption. PostgreSQL's usual protections against multiple postmasters (postmaster.pid, shmem locks, etc) won't work in shared storage configurations, so errors in fencing could leave you with major data problems.

If in doubt, you should use replication instead. For "must not lose a transaction" deployments, synchronous replication is almost certainly more appropriate.

It's straightforward to build a shared storage implementation using PostgreSQL and either one of the proprietary clustering systems or using Linux-HA.

The main things you have to figure out are a) which of the clustering software solutions you're going to use, ranging from the free Linux on to the multiple proprietary options, and b) how you're going to enforce a node that's been demoted from being the master from reconnecting again (what's called fencing). There's some overlap between those two decisions; not every cluster program supports every node fencing method.

One hard part is getting a heartbeat between the two servers sharing the SAN that is both sensitive enough to catch failures while not being so paranoid that it fails over needlessly (say, when load spikes on the primary and it slows down). Make sure you test that part out very carefully with any vendor you evaluate.

As far as the PostgreSQL specifics go, you need a solid way to ensure you've disconnected the now defunct master from the SAN (the classic STONITH or "shoot the other node in the head" problem). All you *should* have to do is start the database again on the backup after doing that. That will come up as a standard crash, run through WAL replay crash recovery, and the result should be no different than had you restarted after a crash on the original node. The thing you cannot let happen is allowing the original master to continue writing to the shared SAN volume once that transition has happened.

The hard part of shared storage failover is always solving the "shoot the other node in the head problem", to keep a down node from coming back once it's no longer the active one. In order to do that well, you really need to lock the now unavailable node from accessing the storage at the hardware level--"fencing"--with disabling its storage port being one way to handle that. Figure out how you're going to do that reliably in a way that's integrated into a proper cluster manager, and there's no reason you can't do this with PostgreSQL.

There's a description of the fencing options for Linux-HA at http://www.clusterlabs.org/doc/crm_fencing.html The cheap way to solve this problem is to have a UPS that disables the power going to the shot node. Once that's done, you can then safely failover the shared storage to another system. At that point, you can probably even turn back on the power, presuming that the now rebooted system will be able to regain access to the storage during a fresh system start.

Typical Setup

Here's one user (John Pierce) writing about this style of deployment:

In most of the HA clusters I've seen, the RAID controllers are in the SAN, not in the hosts, and they have their own failover, with shared write cache, also extensive use of ECC so things like double-bit memory errors are detected and treated as a failure. The sorts of high end SANs used in these kinds of systems have 5-9's reliability, through extensive use of redundancy, dual port disks, fully redundant everything, mirrored caches, etc.

Ditto, the servers used in these sorts of clusters have ECC memory, so memory failure should be detected rather than passed on blindly in the form of corrupted data. Server grade CPUs, especially the RISC ones, have extensive ECC internally on their caches, data busses, etc, so any failure there is detected rather than allowed to corrupt data. Failure modes can include things like failing fans (which will be detected, resulting in a server shutdown if too many fail), power supply failure (redundant PSUs, but I've seen the power combining circuitry fail). Any of these sorts of failures will result in a failover without corrupting the data.

And of course, intentional planned failovers to do OS maintenance... you patch the standby system, fail over to it and verify its good, then patch the other system.

We had a large HA system at an overseas site fail over once due to flooding in the primary computer room caused by a sprinkler system failure upstairs. The SAN was mirrored to a SAN in the 2nd DC (fiber inteconnected) and the backup server was also in the second DC across campus, so it all failed over gracefully. This particular system was large Sun hardware and big EMC storage, and it was running Oracle rather than Postgres. We've had several big UPS failures at various sites, too, ditto HVAC, over a 15 year period.

In my dev-lab tests of some clusters, I used the QLogic 5600 FC switch that connects my motly collection of servers... I used RHCS for one test, it supported the qlogic via telnet... I created two zone sets in the qlogic, one for each state, with the standby host blocked from accessing the LUN, and the cluster manager used telnet to talk to the switch. I ran heartbeats over two separate ethernets (one was the lab LAN segment, the other was a private switch i have all the servers connected to for various tests, and such). The qlogic switch also had another zoneset for all sorts of other servers and storage which wasn't affected by these clustering tests.

I don't like power cycling servers, so I'd prefer not to use power switch based fencing, although I believe my blade box's management unit is supported as a power fencing device.

Credits

And several pieces of good advice there have not yet been absorbed into this discussion.

Initial text here came from those discussions on the pgsql-general and pgsql-admin lists, primarily from Greg Smith and John Pierce. Revised by Craig Ringer with more up to date information and a clearer statement about the limitations of shared storage.