There are situations in which vacuuming a big btree index causes WAL replay on hot standby servers to get stuck for quite a long time. I’ve described the problem in more detail in this thread [0]. Further down in that thread Kevin Grittner proposed a good way of improving btree scans so that the btree vacuuming logic could be seriously simplified. Since I don’t know when that may happen, I’ve written a patch that makes some improvement right now. If Kevin or someone else expands [1] to handle all types of btree scans, I suppose my patch could be thrown away and the vacuuming logic heavily rewritten.

The idea of the patch is to avoid reading pages from disk just to make sure they are unpinned (which is what btree_xlog_vacuum does right now). This is done by creating a new ReadBufferMode which returns a locked buffer without setting the BM_VALID flag on it. I don’t know if that is the right way of doing it, but it seems to work well.

Testing in my environment shows a good win [2]: green is the unpatched replica, blue is the replica with the patch, and the two spikes are the results of manually vacuuming a big table. Since the picture may become unavailable, I’ll note the results here:

1. The replication delay (replay_location lag) dropped from ~1250 MB to ~200 MB.
2. The patched replica caught up with the master in less than a minute, versus 12 minutes for the unpatched replica.
3. Physical I/O load on the patched replica did not change compared with the normal workload, while the unpatched replica did lots of reads from PGDATA during the spikes.

There is still a stall in WAL replay, but it is much smaller than it is now. Also, this change does not seem to affect any other scenarios.