I have spent several days now puzzling over the corrupted WAL logfile
that Scott Parish was kind enough to send me from a 7.1beta4 crash.
It looks a lot like two different series of transactions were getting
written into the same logfile. I'd been digging like mad in the WAL
code to try to explain this as a buffer-management logic error, but
after a fresh exchange of info it turns out that I was barking up the
wrong tree. There *were* two different series of transactions.
Specifically, here's what happened:
1. Scott (or actually his associate) shut down and restarted the
postmaster using the /etc/rc.d/init.d/pgsql script that ships with
our RPMs. That script shuts down the old postmaster with
    killproc postmaster
It turns out that, at least on Scott's machine (RedHat 6.1), the killproc
function sends signal 9 by default, i.e. kill -9. (This is clearly a bad
bug in the init script, but I digress.)
2. So, the old postmaster was killed with kill -9, but its child
backends were still running; SIGKILL can't be caught, so the postmaster
never got a chance to shut them down. The new postmaster starts up
successfully because it thinks the old postmaster crashed, and
so it goes through the usual recovery procedure.
3. Now we have two sets of backends running in different shmem blocks
(7.0 might have choked on that part, but 7.1 doesn't care) and running
different sets of transactions. But they're writing to the same WAL
log. Result: guaranteed corruption of the log.
It actually took two iterations of this to expose the bug: the third
attempted postmaster start went looking for the checkpoint record last
written by the second one, which meanwhile had been overwritten by
activity of the first set of backends.
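The key fact in step 2, that kill -9 on a parent leaves its children
running, can be reproduced with ordinary processes. What follows is only
an illustrative sketch and not PostgreSQL code: a throwaway shell stands in
for the postmaster, a sleep stands in for a backend, and the function name
demo_orphaned_child is made up for this example.

```sh
# Illustration only: a throwaway "parent" shell stands in for the postmaster
# and a sleep process stands in for a backend. All names are hypothetical.
demo_orphaned_child() {
    pidfile=$(mktemp) || return 2

    # The stand-in parent starts one long-lived child, records the child's
    # PID in the temp file, and then waits for it.
    sh -c 'sleep 30 >/dev/null 2>&1 & echo $! > "$0"; wait' "$pidfile" \
        >/dev/null 2>&1 &
    parent=$!

    # Give the parent up to 5 seconds to record the child's PID.
    tries=5
    while [ ! -s "$pidfile" ] && [ "$tries" -gt 0 ]; do
        sleep 1
        tries=$((tries - 1))
    done
    child=$(cat "$pidfile")
    rm -f "$pidfile"
    [ -n "$child" ] || return 2

    # SIGKILL cannot be caught, so the parent gets no chance to stop its child.
    kill -KILL "$parent"
    wait "$parent" 2>/dev/null || true

    # kill -0 delivers no signal; it only tests whether the process exists.
    if kill -0 "$child" 2>/dev/null; then
        echo "child survived"
        kill -TERM "$child" 2>/dev/null || true   # clean up the orphan
    else
        echo "child gone"
    fi
}
```

Calling demo_orphaned_child reports whether the child outlived its
SIGKILLed parent, which is the state the old backends were in when the
new postmaster started.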
Now, killing the postmaster with kill -9 and not cleaning up the backends has
always been a good way to shoot yourself in the foot, but up to now the
worst thing that was likely to happen to you was isolated corruption in
specific tables. In the brave new world of WAL the stakes are higher,
because the system will refuse to start up if it finds a corrupted
checkpoint record. Clueless admins who resort to kill -9 as a routine
admin tool *will* lose their databases. Moreover, the init scripts
that are running around now are dangerous weapons if used with 7.1.
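For comparison, a stop step does not have to reach for SIGKILL at all.
The following is a minimal sketch under stated assumptions, not what the
RPM init script does: the function name stop_gracefully, its arguments,
and the 60-second default are all hypothetical, and the caller is assumed
to already know the PID of the process to stop.

```sh
# Hypothetical helper, not part of any shipped init script.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
# Returns 0 once the process is gone, 1 if it is still running at timeout.
stop_gracefully() {
    pid=$1
    timeout=${2:-60}

    # Ask the process to shut down. SIGTERM can be caught and handled;
    # if the signal cannot be delivered, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    while [ "$timeout" -gt 0 ]; do
        # kill -0 delivers no signal; it only tests whether the process exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done

    # Still running: report failure instead of escalating to kill -9.
    return 1
}
```

A caller could then refuse to start a replacement when the old process
has not exited, for example
stop_gracefully "$pid" 30 || echo "old process still running, not restarting".
The point of the design is that a timeout becomes a visible failure rather
than a silent kill -9.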
I think we need a stronger interlock to prevent this scenario, but I'm
unsure what it should be. Ideas?
regards, tom lane