On Mon, Jan 31, 2011 at 10:01 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jan 28, 2011 at 3:39 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> What happens if we (a) keep the current rule after reaching
>> consistency and (b) apply any such updates *unconditionally* - that
>> is, without reference to the LSN - prior to reaching consistency?
>> Under that rule, if we encounter an FPI before reaching consistency,
>> we're OK. So let's suppose we don't. No matter how many times we
>> replay any initial prefix of any such updates between the redo pointer
>> and the point at which we reach consistency, the state of the page
>> when we finally reach consistency will be identical. But we could get
>> hosed if replay progressed *past* the minimum recovery point and then
>> started over at the previous redo pointer. If we forced an immediate
>> restartpoint on reaching consistency, that seems like it might prevent
>> that scenario.
>
> Actually, I'm wrong, and this doesn't work at all. At the time of the
> crash, there could already be pages on disk with LSNs greater than the
> minimum recovery point. Duh.
>
> It was such a good idea in my head...

I should mention that most of this idea was Heikki's, originally.
Except for the crappy parts that don't work - those are all me. But
I'm back to thinking this can work. Heikki pointed out to me on IM
today that in crash recovery, we always replay to end-of-WAL before
opening for connections, and for Hot Standby every block we write
advances the minimum recovery point to its LSN. This implies that if
we're accepting connections (either regular or Hot Standby) or at a
valid stopping point for PITR, there are no unreplayed WAL records
whose changes are reflected in blocks on disk.

So I'm back to proposing that we just apply FPI-free WAL records
unconditionally, without regard to the LSN. This could potentially
corrupt the page, of course. Consider the sequence delete (no FPI),
vacuum (with FPI), crash, where the crash leaves the vacuumed page
only half written to disk. Now the replay of
the delete is probably going to do the wrong thing, because the page
is torn. But it doesn't matter, because the vacuum's FPI will
overwrite the page anyway, and whatever stupid thing the delete replay
did will become irrelevant - BEFORE we can begin processing any
queries. On the other hand, if the delete record *isn't* followed by
an FPI, but only by, say, a bunch more deletes, then it should all
Just Work (TM). As long as the page header (excluding LSN and TLI,
which we're ignoring by stipulation) and item pointer list are intact,
we can redo those deletes and clean things up. And if they're not
intact, then we must've done something that emits an FPI, and so any
temporary page corruption will get overwritten when we get to that
point in the WAL stream...

That is a bit ugly, though, because it means the XLOG replay of
FPI-free records would have to be prepared to just punt if they
encounter any sort of corruption, in the sure hope that any such
corruption must imply the presence of a future FPI that will be
replayed - since if there is no such future FPI, it should be
impossible for the page to be corrupted in the first place. But that
might reduce our chances of being able to detect real corruption.
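
To make the proposed rule concrete, here is a rough sketch in C of how
an FPI-free redo routine might behave. Everything in it is made up for
illustration: ToyPage, toy_redo_delete and the RedoResult values are
hypothetical stand-ins, not PostgreSQL's real structures or functions.
It only shows the three cases: skip on LSN after consistency, apply
unconditionally before consistency, and punt if the header looks bad.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ITEMS 8

typedef uint64_t Lsn;

/* Hypothetical, heavily simplified page; not the real page layout. */
typedef struct ToyPage {
    Lsn      lsn;                  /* LSN of the last change to the page */
    uint16_t nitems;               /* number of item pointers in use */
    bool     dead[TOY_MAX_ITEMS];  /* toy "tuple deleted" flag per item */
} ToyPage;

typedef enum { REDO_APPLIED, REDO_SKIPPED, REDO_PUNTED } RedoResult;

/* Replay an FPI-free "delete item" record against a page. */
static RedoResult
toy_redo_delete(ToyPage *page, Lsn record_lsn, uint16_t item,
                bool reached_consistency)
{
    assert(page != NULL);

    /* After consistency: keep the current rule and trust the page LSN. */
    if (reached_consistency && page->lsn >= record_lsn)
        return REDO_SKIPPED;

    /*
     * Before consistency the LSN is ignored.  If the header or item
     * list looks damaged, punt and rely on a later FPI to fix the page.
     */
    if (page->nitems > TOY_MAX_ITEMS || item >= page->nitems)
        return REDO_PUNTED;

    /* The change itself is idempotent, so replaying it twice is safe. */
    page->dead[item] = true;

    /* Sketch choice (an assumption): never move the page LSN backward. */
    if (page->lsn < record_lsn)
        page->lsn = record_lsn;

    return REDO_APPLIED;
}
```

Note that the punt case is exactly the ugly part: the routine cannot
tell a torn page that a later FPI will repair from real corruption.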

Heikki also came up with another idea that might be worth exploring:
at the point when we currently emit FPIs, emit an image of just the
part of the page that precedes pd_lower - the page header and item
IDs. To make this work, we'd have to make a rule that redo isn't
allowed to rely for correctness on any bits following the pd_lower
boundary - it can write those bits, but it can't read them. But most
of the XLOG_HEAP records follow that rule already - we look at the
item pointers to figure out where we're putting a new tuple or to
locate an existing tuple and unconditionally overwrite some of its
bits. The obvious exception is XLOG_HEAP2_CLEAN, emitted by VACUUM,
which would probably need to just log the entire page. Also, we'd
again need to apply records unconditionally, without reference to the
page LSN, until we reached the minimum recovery point.
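
As a rough illustration of the partial-image idea, the sketch below
copies only the bytes before pd_lower and later restores just that
prefix. The names (ToyPageHeader, toy_make_partial_image,
toy_restore_partial_image, TOY_PAGE_SIZE) are hypothetical, and the
header is cut down to three fields; the real page header has more.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 8192

/* Hypothetical, cut-down page header used only for this sketch. */
typedef struct ToyPageHeader {
    uint64_t lsn;
    uint16_t pd_lower;   /* offset just past the item ID array */
    uint16_t pd_upper;   /* offset of the start of tuple data */
} ToyPageHeader;

/*
 * Copy the header and item IDs, i.e. bytes [0, pd_lower), into buf.
 * Returns the image length, or 0 if pd_lower is implausible.
 */
static size_t
toy_make_partial_image(const unsigned char *page, unsigned char *buf)
{
    ToyPageHeader hdr;

    memcpy(&hdr, page, sizeof(hdr));
    if (hdr.pd_lower < sizeof(hdr) || hdr.pd_lower > TOY_PAGE_SIZE)
        return 0;

    memcpy(buf, page, hdr.pd_lower);
    return hdr.pd_lower;
}

/*
 * Restore the image during redo.  Only the prefix is overwritten; the
 * bytes at and beyond pd_lower are left alone, which is why redo could
 * write them but must not read them.
 */
static void
toy_restore_partial_image(unsigned char *page, const unsigned char *image,
                          size_t len)
{
    assert(len <= TOY_PAGE_SIZE);
    memcpy(page, image, len);
}
```

After a restore, the header and item IDs are trustworthy but the tuple
area may still be torn, so a record like XLOG_HEAP2_CLEAN that needs to
read tuple data would not be safe with only this image.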
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company