I'd like to add some new flag bits to XLogRecord (xlog.h).
Where? xl_prev.
xl_prev is an XLogRecPtr which points backwards to the immediately
preceding WAL record. All of its bits are currently used, but I have
some observations and a proposal to change that.
We currently compare the whole xl_prev value against the whole
XLogRecPtr of the last WAL record.
When we are reading back WAL, the xlogid portion of a valid record's
xl_prev seldom differs by more than 1 from the xlogid of the current
record's pointer, since that would imply an xlog record of more than 4GB.
If xl_prev is incorrect, it will either be garbage or, occasionally, a
previously valid value from two checkpoints back, before this file was
reused.
So we probably don't need to compare the whole of xl_prev against the
whole of the last WAL record pointer. We can probably avoid comparing
some of the high bits, since the range of valid values is so limited.
How many bits?
checkpoint_segments is limited to INT_MAX, which means the xlogid can
increase by at most INT_MAX/255 over a single checkpoint cycle. That
means that the xl_prev value cannot differ by more than 2 * INT_MAX/255
across two checkpoints. (I make that 134 Petabytes). Alternatively,
checkpoint_timeout is limited to one hour, so we're OK until systems can
write WAL at 67 Petabytes/hour.
Which means that if
* we never get WAL records of more than 67 Petabytes in size *and*
* the lowest 25 bits of xl_prev.xlogid (together with xrecoff) do not
match the position of the last WAL record
then the XLogRecord is invalid, no matter what the value of the highest
7 bits of xl_prev.xlogid.
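A rough check of the bit counts above, as a C sketch. The helper names
max_xlogid_delta and bits_needed are made up for illustration and do not
exist in xlog.c, and the figure of 255 segments per xlogid assumes the
default 16MB XLOG_SEG_SIZE.

```c
#include <limits.h>
#include <stdint.h>

/* Assumption: 255 usable 16MB segments per xlogid (default XLOG_SEG_SIZE). */
#define SEGS_PER_XLOGID 255

/*
 * Hypothetical helper: largest xlogid advance across two checkpoints,
 * given the upper limit on checkpoint_segments. Uses 64-bit arithmetic
 * so that 2 * INT_MAX cannot overflow.
 */
static uint32_t
max_xlogid_delta(int64_t max_checkpoint_segments)
{
    return (uint32_t) (2 * (max_checkpoint_segments / SEGS_PER_XLOGID));
}

/* Hypothetical helper: number of low-order bits needed to hold value. */
static int
bits_needed(uint32_t value)
{
    int bits = 0;

    while (value != 0)
    {
        bits++;
        value >>= 1;
    }
    return bits;
}
```

With these definitions, max_xlogid_delta(INT_MAX) is 16843008, which
needs 25 bits, leaving 32 - 25 = 7 high bits of xlogid that carry no
information for the comparison.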
So I would like to propose that we ignore the top 4 bits in
xl_prev.xlogid when comparing values, rather than using all 32 bits for
comparison. That then frees up 4 new flag bits on XLogRecords. Changing
xl_prev handling is only required in 3 places, all in xlog.c, plus some
log outputs.
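To make the proposal concrete, here is a minimal sketch of what the
masked comparison might look like. It is not the actual xlog.c code: the
struct is a stand-in for XLogRecPtr with its two 32-bit fields, and
XL_PREV_FLAG_BITS, XL_PREV_XLOGID_MASK, xl_prev_matches and
xl_prev_flags are hypothetical names invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the real XLogRecPtr: two 32-bit halves. */
typedef struct XLogRecPtr
{
    uint32_t xlogid;    /* log file number */
    uint32_t xrecoff;   /* byte offset within the log file */
} XLogRecPtr;

/* Hypothetical: number of high xlogid bits given over to flags. */
#define XL_PREV_FLAG_BITS    4
/* Hypothetical: mask selecting the xlogid bits that are still compared. */
#define XL_PREV_XLOGID_MASK  (0xFFFFFFFFu >> XL_PREV_FLAG_BITS)

/*
 * Compare xl_prev against the pointer of the last record read,
 * ignoring the top XL_PREV_FLAG_BITS bits of xlogid.
 */
static bool
xl_prev_matches(XLogRecPtr xl_prev, XLogRecPtr last_rec)
{
    return ((xl_prev.xlogid ^ last_rec.xlogid) & XL_PREV_XLOGID_MASK) == 0
        && xl_prev.xrecoff == last_rec.xrecoff;
}

/* Extract the flag bits stored in the top of xl_prev.xlogid. */
static uint32_t
xl_prev_flags(XLogRecPtr xl_prev)
{
    return xl_prev.xlogid >> (32 - XL_PREV_FLAG_BITS);
}
```

In this sketch a record whose xl_prev.xlogid is 0xA0000012 would still
match a previous record at xlogid 0x00000012 with the same xrecoff, and
would carry the flag value 0xA.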
I would simply document the limitation on WAL record sizes. Adding code
to enforce it would be pointless, since testing it would take years on
current systems. (We wouldn't need dtrace to measure the WALInsertLock
hold time, we could use tree rings. :-)
These values would vary if we allow XLOG_SEG_SIZE higher than 16MB, but
we should probably limit checkpoint_segments according to the setting of
XLOG_SEG_SIZE anyhow.
Thoughts?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support