On Fri, Nov 1, 2013 at 9:49 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The bigger picture here is that such an approach amounts to deciding that
> no one will ever be allowed to fix hash indexes. I'm not for that, even
> if I'm not volunteering to be the fixer myself.
Yeah.
I have thought about doing some work on this, although I don't know
when I'll ever have the time. As far as I can see, the basic problem
is that we need a better way of splitting buckets. Right now,
splitting a bucket means locking it, scanning the entire bucket chain
and moving ~1/2 the tuples to the new bucket, and then unlocking it.
Since this is a long operation, we have to use heavyweight locks to
protect the buckets, which is bad for concurrency. Since it involves
a modification to an arbitrarily large number of pages that has to be
atomic, it's not possible to WAL-log it sensibly -- and in fact, I
think this is really the major obstacle to being able to implementing
WAL-logging for hash indexes.
I think we could work around this problem as follows. Suppose we
have buckets 1..N and the split will create bucket N+1. We need a
flag that can be set and cleared on the metapage, call it
split-in-progress, and a flag that can optionally be set on particular
index tuples, call it moved-by-split. We will allow scans of all
buckets and insertions into all buckets while the split is in
progress, but (as now) we will not allow more than one split to be in
progress at the same time. We start the split by updating the
metapage to incrementing the number of buckets and set the
split-in-progress flag. While the split-in-progress flag is set, any
scans of bucket N+1 first scan that bucket, ignoring any tuples
flagged moved-by-split, and then ALSO scan bucket (N+1)/2. The
moved-by-split flag never has any effect except when scanning the
highest-numbered bucket that existed at the start of that particular
scan, and then only if the split-in-progress flag was also set at that
time.
Once the split operation has set the split-in-progress flag, it will
begin scanning bucket (N+1)/2. Every time it finds a tuple that
properly belongs in bucket N+1, it will insert the tuple into bucket
N+1 with the moved-by-split flag set. Tuples inserted by anything
other than a split operation will leave this flag clear, and tuples
inserted while the split is in progress will target the same bucket
that they would hit if the split were already complete. Thus, bucket
N+1 will end up with a mix of moved-by-split tuples, coming from
bucket (N+1)/2, and unflagged tuples coming from parallel insertion
activity. When the scan of bucket (N+1)/2 is complete, we know that
bucket N+1 now contains all the tuples that are supposed to be there,
so we clear the split-in-progress flag. Future scans of both buckets
can proceed normally.
Of course, there's a bit of a problem here: we haven't actually
removed any tuples from bucket (N+1)/2. That's not a correctness
problem; we'll check for an exact hash-value match before returning
any tuples from this bucket from the scan, so tuples that aren't
present in the bucket but should be will just ignored anyway. But
it's clearly something that needs to be fixed before we can really
declare victory. We need to wait lfor the completion of any scans
that began before we finished populating bucket N+1, because otherwise
we might remove tuples that they're still expecting to find in bucket
(N+1)/2. I'm not sure if there's some clever trick we can use to
accomplish this; e.g. if the scan will always maintain a pin, we could
take a buffer cleanup lock on all the buffers in buckets N+1 and
(N+1)/2 in the same sequence a scan would visit them. Of course, even
if that works, it might be a while if somebody's left a cursor lying
around. And even if scans can't be counted on to maintain a pin, a
point about which I'm unsure off-hand, then things get even trickier.
But maybe there's some solution.
Anyway, once we out-wait concurrent scans, we can go back and nuke
everything from (N+1)/2 that no longer maps onto that bucket. It
might be possible to make this work opportunistically, like HOT
cleanup: if we're scanning a buffer, and RecentGlobalXmin has
progressed enough that we know there can't be any pre-split scans in
progress (or if we can recognize in some other inexpensive way that
old scans must be gone), and we see a tuple there that no longer maps
onto that bucket, we scan the page and remove all such tuples.
This is full of hand-waving, but I'd be curious to know whether the
approach is perceived to have any merit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company