Greg Burek from Heroku (CCed) reported a weird issue on IM, that was weird enough to be interesting. What he'd observed was that he promoted some PITR standby, and early clones of that node work, but later clones did not, failing to read some segment.

The problem turns out to be the following: When a node is promoted at a segment boundary, just after an XLOG_SWITCH record, we'll have EndOfLog = EndRecPtr; pointing to the *beginning* of the next segment, as XLOG_SWITCH records are treated as using the whole segment. After creating the END_OF_RECOVERY record (or checkpoint), we'll do:

	if (ArchiveRecoveryRequested)
	{
		/*
		 * We switched to a new timeline. Clean up segments on the old
		 * timeline.
		 *
		 * If there are any higher-numbered segments on the old timeline,
		 * remove them. They might contain valid WAL, but they might also be
		 * pre-allocated files containing garbage. In any case, they are not
		 * part of the new timeline's history so we don't need them.
		 */
		RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);

Note that this uses EndOfLog, pointing to ab/cd000000 (i.e. the beginning of a record). RemoveNonParentXlogFiles() in turn uses RemoveXlogFile() to remove superfluous files. That's where the fun begins.

So what happens here is that we're calling InstallXLogFileSegment() to remove superfluous xlog files (e.g. because they're before the recovery target, because restore command ran before the trigger file was detected or because walsender received them). But because endptr = ab/cd000000, the use of XLByteToPrevSeg() means InstallXLogFileSegment() will be called with the *previous* segment's segment number.
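To make the off-by-one concrete, here's a small standalone sketch. The two macros are written the way I remember them for a fixed 16MB segment size, so treat the exact definitions as an assumption rather than a quote from the tree. seg_of() and prev_seg_of() are hypothetical helpers that exist only for this illustration.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint64_t XLogSegNo;

/* Assumption: default 16MB WAL segments. */
#define XLogSegSize ((uint64_t) 16 * 1024 * 1024)

/* Segment containing the byte at xlrp. */
#define XLByteToSeg(xlrp, logSegNo) \
	logSegNo = (xlrp) / XLogSegSize

/* Segment containing the byte just before xlrp. */
#define XLByteToPrevSeg(xlrp, logSegNo) \
	logSegNo = ((xlrp) - 1) / XLogSegSize

/* Hypothetical helper: wraps XLByteToSeg() as a function. */
static inline XLogSegNo
seg_of(XLogRecPtr ptr)
{
	XLogSegNo	segno;

	XLByteToSeg(ptr, segno);
	return segno;
}

/* Hypothetical helper: wraps XLByteToPrevSeg() as a function. */
static inline XLogSegNo
prev_seg_of(XLogRecPtr ptr)
{
	XLogSegNo	segno;

	XLByteToPrevSeg(ptr, segno);
	return segno;
}
```

With endptr = ab/cd000000, i.e. 0xABCD000000, seg_of() yields segment 0xABCD while prev_seg_of() yields 0xABCC. For any pointer that is not exactly on a segment boundary the two agree, which would explain why this normally goes unnoticed.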

That in turn will lead to InstallXLogFileSegment() installing the to-be-removed segment into the current timeline, but into a segment from one *before* the creation of new timeline, for the purpose of recycling the segment. I'll call this the "phantom" segment, which has no meaningful content and lives on a timeline which does not yet exist.

As there's no .ready file created for that segment, and we'll never actually write to it, it'll initially just sit around. Not visible for archiving, and normally unused by wal streaming. But that changes at later checkpoints, because, via RemoveOldXlogFiles()'s XLogArchiveCheckDone() checks we:

/*
 * XLogArchiveCheckDone
 *...
 * If <XLOG>.done exists, then return true; else if <XLOG>.ready exists,
 * then return false; else create <XLOG>.ready and return false.
 *
 * The reason we do things this way is so that if the original attempt to
 * create <XLOG>.ready fails, we'll retry during subsequent checkpoints.

So we'll at some later point create a .ready for the above created phantom segment. Which then will get archived.

At that point we're in trouble. If any standbys of that promoted node catch up after that fact (or new ones are created from older base backups), after the phantom segment has been archived, and restore_command is set, recovery will fail. The reason for that is that one commonly will have recovery_target_timeline = latest (or the new timeline) set. And XLogFileReadAnyTLI() is pretty simplistic. When restoring a segment it'll simply probe all timelines, starting from the newest. Which means that, once archived, our phantom segment will "hide" the actual segment from the source timeline. Because it's not parseable (it's at a different segment, thus parsing decides it's unusable), recovery will hang at that point.
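The "hiding" effect can be modelled in a few lines. This is a toy model of newest-first probing, not the actual XLogFileReadAnyTLI() code: ArchivedSegment and probe_newest_first() are hypothetical names, and the archive is just an in-memory array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogSegNo;

/* One file in a pretend archive (hypothetical, not a PostgreSQL struct). */
typedef struct ArchivedSegment
{
	TimeLineID	tli;		/* timeline in the file name */
	XLogSegNo	segno;		/* segment number in the file name */
	bool		parseable;	/* does it contain usable WAL? */
} ArchivedSegment;

/*
 * Return the first archived file matching segno, trying the timelines in
 * the order given.  tlis[] must be sorted newest first.  Returns NULL if
 * no timeline has the segment.
 */
static const ArchivedSegment *
probe_newest_first(const ArchivedSegment *archive, size_t narchive,
				   const TimeLineID *tlis, size_t ntlis,
				   XLogSegNo segno)
{
	for (size_t t = 0; t < ntlis; t++)
	{
		for (size_t i = 0; i < narchive; i++)
		{
			if (archive[i].tli == tlis[t] && archive[i].segno == segno)
				return &archive[i];	/* first hit wins, usable or not */
		}
	}
	return NULL;
}
```

If the archive holds a real segment 0xABCC on timeline 1 and a phantom 0xABCC on timeline 2, probing {2, 1} returns the phantom, and the good copy on timeline 1 is never looked at.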

Which means quick standbys catch up, slow ones are "dead". It's "fixable" by creating a restore_command which filters that phantom segment, or deleting the segment from the archive.

The minimal fix here is presumably not to use XLByteToPrevSeg() in RemoveXlogFile(), but XLByteToSeg(). I don't quite see what purpose it serves here - I don't think it's ever needed. Normally it's harmless because InstallXLogFileSegment() checks where it could install the file to, but that doesn't work around timeline bumps, triggering the problem at hand. This seems to be very longstanding behaviour, I'm not sure where it's originating from (hard to track due to code movement).

There seems to be a larger question here though: Why does XLogFileReadAnyTLI() probe all timelines even if they weren't a parent at that period? That seems like a bad idea, especially in more complicated scenarios where some precursor timeline might live for longer than it was a parent? ISTM XLogFileReadAnyTLI() should check which timeline a segment ought to come from, based on the history?
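A rough sketch of what a history-based choice could look like, purely to illustrate the idea. HistoryEntry and tli_for_segment() are hypothetical, 16MB segments are assumed, and it deliberately ignores the case where a timeline switch happens in the middle of a segment.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;
typedef uint64_t XLogSegNo;

/* Assumption: default 16MB WAL segments. */
#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/* One line of a timeline history (hypothetical layout). */
typedef struct HistoryEntry
{
	TimeLineID	tli;
	XLogRecPtr	begin;		/* first byte that belongs to this timeline */
	XLogRecPtr	end;		/* first byte past it, 0 = still open */
} HistoryEntry;

/*
 * Return the one timeline whose range covers the start of segno, or 0 if
 * the history has no such timeline.  Only that timeline's file would then
 * be requested from the archive.
 */
static TimeLineID
tli_for_segment(const HistoryEntry *history, size_t nhistory,
				XLogSegNo segno)
{
	XLogRecPtr	segstart = segno * WAL_SEG_SIZE;

	for (size_t i = 0; i < nhistory; i++)
	{
		if (history[i].begin <= segstart &&
			(history[i].end == 0 || segstart < history[i].end))
			return history[i].tli;
	}
	return 0;
}
```

With timeline 1 ending at ab/cd000000 and timeline 2 starting there, segment 0xABCC maps to timeline 1 and segment 0xABCD to timeline 2, so a phantom 0xABCC on timeline 2 would never be requested.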