TestAcidGuarantee broken on trunk


Description

TestAcidGuarantee includes a test that reads a number of columns from a row, and every so often the first of the N columns differs from the rest, when it should be the same. This is a bug deep inside the scanner: the first peek() of a row is done at time T, and the rest of the read is done at T+1 after a flush, so the memstoreTS data is lost and previously 'uncommitted' data becomes committed and flushed to disk.

One possible solution is to introduce the memstoreTS (or a similarly equivalent value) into the HFile, allowing us to preserve read consistency across flushes. Another is to fix the scanners so that peek() is not destructive (and thus might, alas, return different things at different times).

ryan rawson
added a comment - 16/Aug/10 22:01

This issue is pernicious and difficult. I have attempted many times to nail it, but there continue to be issues.

Right now there are at least 2 issues:

When a memstore is flushed to an HFile, the 'memstoreTS' generation stamp is dropped. With the way the scanners work right now, this causes problems. A row is really composed of a sequence of KeyValues with different memstoreTS values, so right before the flush the scanner looks sort of like so:

scanner points here
(new gen here) |
K K K K K K K K K K K K K K K K K K K K K (keys for 1 row)
(new and old gen)

The new-gen keys have newer memstoreTS values and were ignored during next(), so the scanner is pointing to a Key part of the way through the row. When the memstore is pushed to an HFile, the skipped new-gen keys lose their memstoreTS and are no longer ignored, and we end up with the FIRST KeyValue from the older gen and the subsequent columns from the newer gen: a mixed-generation result that looks like a non-atomic put.

One solution to this problem is to take the top Key during a scanner reset (done after a memstore -> hfile switch) and transform it with KeyValue.createFirstOnRow(). There has to be exception code for partial row fetches, where the scanner returns partial row results each time (because the row is too big).

The second issue is scanner updates with multiple column families. Right now the code looks sort of like this:

for each store (aka: family) in region:
    switch memstore -> hfile

For each Store/family the memstore -> hfile switch is done atomically with respect to scanners; that is, a scanner accesses either the old snapshot and old files or the new null snapshot and the new files. But this atomic switch is done a Store at a time, meaning concurrent scanners might see some data in one store from the memstore (complete with generation timestamps) and some data in another store entirely from hfiles (with no timestamps), thus again ending up with mixed-generation row data that violates row atomicity.

One possible solution to this problem is to move scanner updates up to the HRegion/RegionScanner level. This is not as easy as it sounds from a code-structure point of view. Furthermore, by using larger update blocks we may introduce performance issues around scanner updates.

-----------

The ultimate problem here is that these two are just the most recently identified bugs in a long list of bugs. We can fix them in the ways described above, but what about the next series of bugs? We are plugging leaks in a dam. A better approach would be to build a new, better dam, with the knowledge we now have about what previously went wrong.

------------

The proposal is thus:

Add a new generation timestamp (or perhaps call it a transaction id) to every KeyValue. This generation timestamp is also serialized and stored inside the HFile.

The generation timestamp starts at 1 for brand-new tables. Every write operation increments it by 1, much like ReadWriteConsistencyControl.

KeyValues with ts=0 belong to no generation and are always included in every read operation.

KeyValues with ts>0 belong to that specific generation, and a reader should include them only if its own read point is larger. The reader obtains the read point at scanner-open time, and may update it between rows if the implementation can do so (as per the current RWCC algorithms).

Bulk-imported hfiles have all their ts=0 (the default value).

Modifications set their ts as per the current RWCC algorithms.

During hfile flush, the largest ts is written to the HFile metadata.

During HRegion/Store init, the largest ts is read from HFile metadata, and the RWCC is set to this value.

Some predicted questions:

What about the extra space? 8 bytes is a lot.

With vint encoding, and given that values start at 1, we should be able to shave off a lot of space. We will also have to store a length field costing 1 extra byte: 2 bytes minimum for '0'.

What about vint decode speed?

It may not be an issue; we need to profile. If it is, we can trade RAM space in KeyValue to cache the decoded value.

What about file migration? I don't want to take a long time to upgrade my cluster!

By creating a new metadata key, call it KEYVALUE_VERSION, the Store layer on top of HFile can parse the older KeyValues from HFiles (which won't have this metadata), substituting '0' for the missing timestamp. Newer HFiles will use the newer KeyValue parse code to obtain the generation timestamp.

This is a lot of cost just to fix something it seems we should be able to know in memory.

Yes, it isn't cheap, but this problem has proven difficult to solve. By addressing the issue directly we can have confidence in our design, and the existing test frameworks which are failing should validate it for us.

What about splits?

Each split region will 'take off' from the maximal generation timestamp and evolve independently. If they were later merged together, that would be OK: each region would close, halting all writes, then the merged region would take the largest generation timestamp and continue evolving from there. The timestamp exists to differentiate committed and non-committed data, and during a close, by definition, all values in the HFiles are committed.

OK, so I'm paying at least 2 bytes per KeyValue; can we get a better return on investment?

Yes! By using the generation timestamp we can detect the situation where a previously inserted Delete is masking newer Puts. We can switch between the existing behaviour and the newer behaviour at runtime and ignore the rogue Delete marker.

There may be other benefits as well. Providing a partial order over all KeyValue inserts, grouped by transaction, would provide useful debugging info for both HBase and client bugs.

What about read isolation?

With this we can definitely implement at-scanner-open read isolation. Due to the nature of the scanner design, advancing the generation-timestamp read point mid-scan is more difficult, but it may be possible.
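
The ts=0 / ts>0 visibility rule in the proposal can be sketched as follows. This is hypothetical code, not the actual HBase implementation, and it assumes the read point is the largest committed write number, so a KeyValue stamped exactly at the read point counts as committed and visible:

```java
// Sketch of the proposed generation-timestamp visibility rule.
// A KeyValue with generation timestamp 0 belongs to no generation and
// is always visible; one with ts > 0 is visible only to readers whose
// read point has reached it (assumption: read point = largest
// committed write number).
public class GenerationVisibility {
    /**
     * @param kvGenTs   generation timestamp stored with the KeyValue (0 = no generation)
     * @param readPoint read point the scanner obtained at open time
     */
    public static boolean isVisible(long kvGenTs, long readPoint) {
        return kvGenTs == 0 || kvGenTs <= readPoint;
    }
}
```

A scanner opened at read point 5 would see bulk-loaded cells (ts=0) and any cell committed at or before 5, but not an in-flight write stamped 7.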

ryan rawson
added a comment - 17/Aug/10 19:40

One addendum to this proposal: the Hadoop vint format already encodes the length in an "easy to read" way. For small values (less than about 120) the vint is encoded in 1 byte; otherwise it may take up to 9 bytes.
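
For reference, the length rule of the Hadoop vint/vlong format works out as below. This is a self-contained reimplementation of the sizing logic (mirroring the behaviour of org.apache.hadoop.io.WritableUtils.getVIntSize), not code taken from HBase:

```java
// Hadoop-style vlong sizing: values in [-112, 127] fit in a single
// byte; anything else takes one length byte plus up to eight data bytes.
public class VLongSize {
    public static int getVIntSize(long i) {
        if (i >= -112 && i <= 127) {
            return 1;                      // single-byte encoding
        }
        if (i < 0) {
            i = ~i;                        // fold negatives into the positive range
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(i);
        return (dataBits + 7) / 8 + 1;     // data bytes + one length byte
    }
}
```

So a generation timestamp that starts at 1 costs a single byte until it grows past 127, and only the very largest values reach the full 9 bytes.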

Pranav Khaitan
added a comment - 20/Aug/10 00:27

I guess we could fix this by not updating the scanners after a flush. Currently, after every flush we notify the scanners (as observers) so that they update their heap. If we do not notify them about the flush, the scanner wouldn't encounter any inconsistencies. This should solve the specific problem you discussed above where flushing results in inconsistency. This seems like an easy change and maintains correctness. The only drawback is that we hold some memstore keys a little longer, which doesn't seem like too big a problem.

ryan rawson
added a comment - 20/Aug/10 02:04

That sounds possible... the extra memory held could be up to 64mb * block-size * # of families, i.e. a few hundred megs or even gigs.

stack
added a comment - 20/Aug/10 23:48

@Ryan Thanks for writing this up. Useful. My thinking is that the extra ts/version is too much to do just now as we're coming up fast on a 0.90.x release. It's a 0.92.x project, I'm thinking, especially as I think it'll take a bit of work to implement so that migration is done at runtime. Also, following on from Pranav's suggestion, for now can we not just let the scanner exhaust the current row against whatever was in the memstore, rather than letting the scanner run to the end of the region, and let go of the memstore reference when it reaches the end of the row? In other words, update its memstore/hfile scanner set only at row junctures?

Pranav Khaitan
added a comment - 20/Aug/10 23:55

@Stack, in my understanding what you are describing is what happens now. The scanner keeps hold of the snapshot while it is reading the current row and only lets go of it before it starts reading the next row.

stack
added a comment - 21/Aug/10 00:09

@Pranav: Now I seem to recall Ryan's remarks about the row boundary being unknowable; the only way you learn of a row boundary is AFTER you've read the first KeyValue of the next row. At that stage you have already read into the next row, so you can't make the memstore/hfile switch.

ryan rawson
added a comment - 21/Aug/10 06:15

You guys have it right: you can't know the next row until you are already past its first key.

The problem is that we would have to hold on to the snapshot and kvset until the scanner is closed. This will up our chance of a preventable OOME, though it would depend on the scanner use patterns. It seems risky to me to take this approach.

I'm not sure we want to change the entire nature of how scanner updates work; it would require redoing the heap a bit to bubble scanner-update events down. I'm not sure how complex that patch would be; there is no way to easily know until the work is done. But without either of those two approaches, this JIRA may have to be a 0.92 issue.

stack
added a comment - 21/Aug/10 18:47

So, rather than return inconsistent data around a storefile switch, since we get rows a row at a time (we do, right?), what if we aborted the current row when we have to swap the memstore for the new file? What if we threw an exception, and let the client fix up the scanner and come back in on the row we were about to serve? (We do something like this on an NSRE anyway.)

stack
added a comment - 06/Dec/10 23:25

Ryan, Kannan, and myself chatted on this issue a while:

+ We'll add a new insertion sequence id into the KeyValue Key. It will be inserted after the current Timestamp and before the KeyValue Type. This new insertion sequence id will be used to resolve 'order' when all else matches in two KeyValue keys.
+ We'd use the last two bits of the KeyValue Type for a 'version'. We'll set bit 128 to denote KeyValue version '2' (what we currently have is KeyValue version '1').
+ We'd adjust our comparators so they ignore these upper bits in Type, so we can compare Type 1 and Type 2 KeyValues.
+ A total cluster restart will be needed. We'll up the RPC version to be sure.
+ Post-restart, writes will be version 2 KeyValues.
+ Reads may be a mix of version 2 and version 1 KeyValues while regions are made of a mix of old- and new-style HFiles. This should be fine as long as comparators work properly with a mix of version 1 and version 2 KVs.
+ We're thinking the new insertion sequence id will be a fixed-length long. A vint/vlong would make for an awkward parse.
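
The version-bits idea can be sketched like so. The constants here are hypothetical (the real KeyValue Type values and mask layout would need to match the actual implementation); the point is that bit 128 marks a version-2 KV and comparators strip the version bits before comparing types:

```java
// Sketch of reserving the top bits of the KeyValue Type byte for a KV
// version, per the plan above. Hypothetical constants, not HBase code.
public class KvTypeBits {
    static final int VERSION_MASK = 0xC0;  // top two bits reserved for version
    static final int VERSION_2    = 0x80;  // bit 128 set => version '2' KeyValue

    /** Stamp a type byte as a version-2 KeyValue. */
    public static byte tagVersion2(byte type) {
        return (byte) (type | VERSION_2);
    }

    /** Type with version bits stripped, as a comparator would see it. */
    public static int bareType(byte type) {
        return type & ~VERSION_MASK & 0xFF;
    }

    public static boolean isVersion2(byte type) {
        return (type & VERSION_2) != 0;
    }
}
```

With comparators using bareType(), a version-1 and a version-2 KeyValue carrying the same underlying Type compare as equal on that component, which is what lets mixed old/new HFiles coexist in a region.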

ryan rawson
added a comment - 07/Dec/10 00:22

I think we can use the HLog sequence ID as the "insertion sequence ID". This is how it will work:

HLog append happens, resulting in sequence ID "X".
Start the memstore insert, but instead of using the id RWCC provides, use X.
Do the usual, stamping KVs with X.
Commit, and push the read point forward to X when done.

Successive "X" values will be increasing in value. One thing we will need to do is initialize the "RWCC" read point to the largest HLog sequence ID on Region open (before the region can take any edits).
ryan rawson
added a comment - 07/Dec/10 00:32

The question was: would we overflow a long?

At 2^63, we have about 9.22 x 10^18 values available. At 50,000 operations/second, we would take about 5,849,424 years to roll over.

Stack points out: what about bulk commits? This would wrap up every bulk commit as one memstore/RWCC transaction, which might be the logically wrong thing to do; also, the extra time it takes to insert that much stuff into the memstore could be an issue for RWCC. It is very important that RWCC transactions stay short or else we pile up too many handlers.
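
The rollover arithmetic checks out; a quick back-of-the-envelope calculation, using Long.MAX_VALUE for the value count and ignoring leap years:

```java
// Sanity check of the rollover estimate: a signed 64-bit counter at
// 50,000 increments per second lasts millions of years.
public class RolloverMath {
    public static long yearsToRollOver(long opsPerSecond) {
        long seconds = Long.MAX_VALUE / opsPerSecond;  // 2^63 - 1 total values
        return seconds / (365L * 24 * 60 * 60);        // 365-day years
    }
}
```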

stack
added a comment - 10/Dec/10 00:07

This patch adds a sequence number to KeyValue. It also adds handling of a version (the version of a KV is an internal detail, not let out of KV). All current TestKeyValue tests pass, but everything else is broken currently.

The new KV particle is called the sequence number. It's looking like this sequence number could be the HRS sequence id; I've purposely not made a connection between the two. We'll start out with HRS sequence id == KV sequence number, but this may change at some later time.

Will next make RWCC use the KV sequence number instead of the KV data member it used to keep up.

ryan rawson
added a comment - 14/Dec/10 08:32

I'd prefer bit-bashing to be in hex, not decimal, e.g.:

static final byte VERSION_BITS = (byte)0x40;

instead of:

static final byte VERSION_BITS = (byte)64;

It makes it clearer, I think. Do you really need to have a constant for the bit complement of a different constant? The compiler will do the optimization properly and inline constants if you write ~(some other constant).

ryan rawson
added a comment - 10/Jan/11 22:13

One thought I had: if we are doing replication, do we reuse the seqid or do we generate a new one locally at each cluster?

If we allow multi-master, we will probably have to generate a new one at each cluster replication target. Otherwise we might end up in a situation where new incoming edits are not visible yet, due to being < the current seqid.

stack
added a comment - 13/Jan/11 00:02

Latest. Starts moving readPoint up into HRegionScanner and out of MemStore. Adds tests. Adds setting the sequence number into KVs whether they go via the WAL or not, etc. Not done by a long shot and probably not in a state for review; just adding work to date.

Just had good conversation with Ryan. We conclude that using the HLog sequence number is NOT a good idea, mostly for performance reasons. Too many updates will be stuck waiting on the completion of edits that may have started before our update but that have yet to complete (we do not want to return to the client until all transaction started before ours – but that are slower than ours to run – have completed else there is the danger of not being able to see what you have written). Instead, we need to keep a running sequence number that is per HRegion rather than per HRegionServer as HLog sequence number is. This new HRegion sequence number is very much like HLog sequence number in that on open of HRegion we read in the largest and then increment from there.

Let me try and explain how we arrived at this notion.

We do ACID - - prevent readers reading part of an update – by only letting clients (scanners and gets) read stuff that has been fully committed. Currently we do this by moving forward a monotonically increasing 'read point'. Each update is given a write point. The read point is moved forward to encompass all completed write points or 'transactions'. Transactions complete willy-nilly but the read point will not move beyond the incomplete.

Up to this, the way we did 'ACID' was around memstore only. The readpoint is kept up inside in an instance of RWCC. A RWCC instance is Region scoped (one is created on creation of a HRegion). A new writepoint is created when we go to write the memstore in step (4) above and then the readpoint is moved forward to match the writepoint just before we do step (7) in the above. Currently our RWCC transaction spans step (4) to (7) roughly.

"Wait to be visible" in the above means wait until all transactions that have an id that is less than mine complete before I proceed to update the read point and return to the client. A transaction that started before us may not complete until after ours because of thread scheduling, hiccups, etc. We do not want to move the read point forward until all updates previous to ours have completed else we'll be letting clients read the incomplete earlier transactions.

Of note in the above, how long the WAL takes is not part of a RWCC transaction.

IF we move to using HLog sequence numbers, now the transaction starts at step (1) when we go to the WAL. We'll need to update in RWCC the writepoint at step (1). The HLog sequence number is for all of the region server, its not just HRegion scoped. The 'wait for our edit to be visible' will be dependent now on the completion on edits against unrelated HRegions whose character may be completely different (e.g. the schema on HRegion A may be for increments whereas the schema on HRegion B may be for fat batches of cells. If both are on the same regionserver, the 'wait for our edit to be visible' may have the increments waiting on the completion of a fat batch of updates).

So, the thought is instead to have a per region sequence number with the write point updated only after we emerge from the WAL append. We keep the current 'transaction' scope where scope is between steps (4) and (7) in the above.

I'm going to go implement the per region edit number unless an alternative suggested.

stack
added a comment - 13/Jan/11 00:46 Just had good conversation with Ryan. We conclude that using the HLog sequence number is NOT a good idea, mostly for performance reasons. Too many updates will be stuck waiting on the completion of edits that may have started before our update but that have yet to complete (we do not want to return to the client until all transaction started before ours – but that are slower than ours to run – have completed else there is the danger of not being able to see what you have written). Instead, we need to keep a running sequence number that is per HRegion rather than per HRegionServer as HLog sequence number is. This new HRegion sequence number is very much like HLog sequence number in that on open of HRegion we read in the largest and then increment from there.
Let me try and explain how we arrived at this notion.
We do ACID – prevent readers reading part of an update – by only letting clients (scanners and gets) read stuff that has been fully committed. Currently we do this by moving forward a monotonically increasing 'read point'. Each update is given a write point. The read point is moved forward to encompass all completed write points or 'transactions'. Transactions complete willy-nilly but the read point will not move beyond the incomplete.
Here are the coarse steps involved in a 'transaction':
(0) row lock (Put, Increment, etc.)
(1) Go to WAL
(2) get new sequence id
(3) actually write WAL
(4) update memstore
(5) wait for our edit to be visible
(6) commit/move forward the read point
(7) undo rowlock
Up to now, the way we did 'ACID' was around memstore only. The readpoint is kept up inside an instance of RWCC. A RWCC instance is Region scoped (one is created on creation of a HRegion). A new writepoint is created when we go to write the memstore in step (4) above and then the readpoint is moved forward to match the writepoint just before we do step (7) in the above. Currently our RWCC transaction spans step (4) to (7) roughly.
"Wait to be visible" in the above means wait until all transactions that have an id that is less than mine complete before I proceed to update the read point and return to the client. A transaction that started before us may not complete until after ours because of thread scheduling, hiccups, etc. We do not want to move the read point forward until all updates previous to ours have completed else we'll be letting clients read the incomplete earlier transactions.
Of note in the above, how long the WAL takes is not part of a RWCC transaction.
If we move to using HLog sequence numbers, the transaction now starts at step (1) when we go to the WAL, and we'll need to update the writepoint in RWCC at step (1). The HLog sequence number is for the whole region server; it's not just HRegion-scoped. The 'wait for our edit to be visible' will now be dependent on the completion of edits against unrelated HRegions whose character may be completely different (e.g. the schema on HRegion A may be for increments whereas the schema on HRegion B may be for fat batches of cells; if both are on the same regionserver, the 'wait for our edit to be visible' may have the increments waiting on the completion of a fat batch of updates).
So, the thought is instead to have a per region sequence number with the write point updated only after we emerge from the WAL append. We keep the current 'transaction' scope where scope is between steps (4) and (7) in the above.
I'm going to go implement the per region edit number unless an alternative is suggested.
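The read-point mechanics sketched in the steps above can be pictured with a toy model (a simplified sketch only, not HBase's actual ReadWriteConsistencyControl; class and method names here are made up for illustration):

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Simplified model of the per-region read/write point described above:
// writes get monotonically increasing numbers, may complete out of order,
// and the read point only advances past a contiguous prefix of completed
// writes, so readers never see a half-committed transaction.
class SimpleRwcc {
    private long nextWriteNumber = 1;
    private long readPoint = 0;
    private final SortedSet<Long> inFlight = new TreeSet<>();

    // Step (4): begin a memstore update, stamping it with a write number.
    synchronized long beginWrite() {
        long writeNumber = nextWriteNumber++;
        inFlight.add(writeNumber);
        return writeNumber;
    }

    // Steps (5)/(6): mark the write complete; the read point moves forward
    // only over the contiguous run of completed writes (the real code makes
    // the handler wait here until its own edit becomes visible).
    synchronized void completeWrite(long writeNumber) {
        inFlight.remove(writeNumber);
        long limit = inFlight.isEmpty() ? nextWriteNumber - 1 : inFlight.first() - 1;
        if (limit > readPoint) {
            readPoint = limit;
        }
    }

    // Readers only see KVs whose memstoreTS is <= this value.
    synchronized long getReadPoint() {
        return readPoint;
    }
}
```

Note how a slow early transaction pins the read point: if write 2 completes before write 1, the read point stays put until write 1 also completes, which is exactly the "wait to be visible" cost discussed in this thread.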

Nicolas Spiegelberg
added a comment - 13/Jan/11 01:02 @stack: we briefly talked about this issue internally the other week. I think you want a per-CF sequence ID. We could split flushes to happen on a per-CF basis and a lot of our filtering needs to be done on a per-file basis (and consequently, per-CF).

ryan rawson
added a comment - 13/Jan/11 01:12 if you use a per-CF sequence ID you can only have atomic properties on
at most a column family. We use the transaction id to know which ones
were committed and which ones were not, and if it wasn't across all
the CFs, we would not have atomicity across all CFs which is what we
DO want.
even if we split flushes to happen on per-Store/CF basis, it would not
affect the acid guarantees we are trying to achieve with this patch.

Nicolas Spiegelberg
added a comment - 13/Jan/11 01:47 @ryan. sorry, I looked at this description a little too quickly. We were talking about slightly different scenario with HMaster log split pruning. You are correct, the region level seems to be the correct location.

stack
added a comment - 13/Jan/11 05:12 Thinking on the way home and after chatting more with Ryan, I'm thinking that we do not want two sets of sequence numbers – one at regionserver level (HLog sequence id) and then another HRegion-scoped one. All edits are bottlenecked via the WAL – give or take the skip WAL write flag and deferred flush – and so arbitrarily breaking the transaction into two phases, the WAL write then the commit to memstore, seems like it gains us little; edits will stack up behind the WAL chicane anyways.
Paranoia regarding performance issues comes from long waits in the 'wait to be visible' phase when the transaction spanned the WAL.
I'm thinking now that for this issue, we do it first 'correct' – e.g. use HLog sequence number throughout and start the 'transaction' pre-WAL – and then in a new issue figure the smarts that will get us over the slowness we've seen before taking this tack.

ryan rawson
added a comment - 13/Jan/11 06:02 Unfortunately the paranoia re: performance is borne out by direct experience. It will be an issue, it will be a blocker, and we should deal with it right now, since fixing it might require architectural-level changes in how we manage these things internally, up to and including using a different ID stream for atomic consistency.
Backing up a bit, the basic issue is that a handler thread cannot
complete and return to the client until the row-transaction it was
working on is visible to other clients. To do otherwise risks data
loss for ICV and inconsistent read-your-own-write scenarios for
clients. But while waiting we are tying up a handler thread, and have
to wait on the longest pole HLog append (which can take seconds at
their worst!). You end up with a RS level stall which is pretty ugly.
I don't want 2 sets of sequence numbers, but I am concerned that we might need them. Perhaps we can find a more elegant mechanism for cheaply keeping track of which seqids are 'committed' and visible and which are not. Right now we use a simple 'read point' which acts like a line in the sand. Previous proposals called for a bitmask of the last N numbers. The problem with this is that deferred flushing combined with non-deferred flushing would cause major problems, as the last N we need to keep track of keeps on expanding.
Perhaps a reverse bitmask where we keep track of the PREVIOUS N tx
that are NOT committed might make more sense. Implementing it
efficiently is another question.

stack
added a comment - 13/Jan/11 06:29 How do the arrays of outstanding txs work? How do you correlate a particular bit to a particular transaction?
I don't get how this would be different than read point since you can't return to client till all transactions before yours have completed, right? Or does the bit array somehow expand our vocabulary beyond binary read point?

ryan rawson
added a comment - 13/Jan/11 06:34 Yes precisely, the bit array allows us to have more than just a binary read point threshold, hence we don't have to 'wait' anymore.
At the cost of space and complexity though; how to do it in a manageable way is outstanding.
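One way to picture the non-binary read point Ryan describes (a hypothetical sketch only, not a committed design; all names here are illustrative): instead of a single threshold, a reader snapshots the set of still-uncommitted write numbers, and a KV is visible if its number is at or below the high-water mark and not in that set.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a reader snapshots the set of still-uncommitted
// write numbers at scan-open time, so visibility is no longer a single
// binary threshold and a fast writer need not wait for slower earlier ones.
class SnapshotVisibility {
    private final long highWaterMark;    // highest write number handed out
    private final Set<Long> uncommitted; // writes still in flight at snapshot time

    SnapshotVisibility(long highWaterMark, Set<Long> uncommitted) {
        this.highWaterMark = highWaterMark;
        this.uncommitted = new TreeSet<>(uncommitted);
    }

    // A KV is visible if it was assigned before the snapshot was taken
    // and was not still uncommitted at that moment.
    boolean isVisible(long memstoreTS) {
        return memstoreTS <= highWaterMark && !uncommitted.contains(memstoreTS);
    }
}
```

The space/complexity cost Ryan mentions shows up here as the `uncommitted` set, which must be maintained and snapshotted per reader.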

Nicolas Spiegelberg
added a comment - 14/Jan/11 00:27 @ryan : I'm trying to understand why you think we might need 2 seqids. If you move the seqid from the RS level to the region level, I see a lot of benefits beyond this jira. Log recovery, for example, wouldn't need to balance between the dead RS's seqid numbering and the new RS's numbering.

stack
added a comment - 01/Feb/11 20:13 Chatting w/ Benoît last night, he had a good suggestion. He suggested we NOT add an extra long to the KV, and that instead we just use the already existing in-memory-only memstoreTS that is in each KV. He suggested that on open of a storefile, we just pick up its oldest sequence number, or assign a sequence number on opening, and keep that too in memory associated with the storefile. Then, whenever we read in a KV from a store file, we could insert the storefile's sequence number into the memstoreTS field; all edits in a storefile would have the same sequence number. When we flush, and we bring online the new storefile, it would have the flush sequence number. We would still need to move to using a sequence number instead of the incrementing number we have in RWCC, and we'd need to have the transaction, for correctness, span the WAL where it doesn't now, but this suggestion would save us growing the Key part of the KV when persisted.
Trying the above on Ryan.
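Benoît's suggestion amounts to something like the following sketch (illustrative names only; this is not the actual StoreFile/KeyValue API): every KV read back from a given storefile is stamped, in memory, with that file's single sequence number.

```java
import java.util.List;

// Illustrative sketch of the idea above (names are made up): a storefile
// carries a single sequence number, and every KV read back from it gets that
// number as its in-memory memstoreTS, so nothing extra is persisted per KV.
class StoreFileStamp {
    static class Kv {
        final String key;
        long memstoreTS; // in-memory only; today this is lost on flush
        Kv(String key, long memstoreTS) {
            this.key = key;
            this.memstoreTS = memstoreTS;
        }
    }

    // On read from a storefile, restore a uniform memstoreTS from the
    // file's sequence number (assigned at flush/open time).
    static void stampFromFile(List<Kv> kvsFromFile, long fileSequenceId) {
        for (Kv kv : kvsFromFile) {
            kv.memstoreTS = fileSequenceId;
        }
    }
}
```

The trade-off, as the thread notes, is that all edits in one file become indistinguishable by memstoreTS, which is why intra-row consistency across flushes (HBASE-2673) cannot be guaranteed under this scheme.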

ryan rawson
added a comment - 01/Feb/11 21:46 Don't we already have this? The comparator uses the max_seq_id to break ties between KVs...
The primary issue is that we need to know which KVs are 'committed' and which are still in progress. Right now we have a problem whereby the scanner stack gets a little wonky about how it handles partial next()s. By moving the memstoreTS pruning up to the HRegion scanner level, and working on entire rows at a time, we might mitigate most of the problem actually. This might get ugly with family-only flushes, since in theory you might end up with a row that is not completely written but is in memstore & hfile at the same time. Given that the scope of a RWCC "transaction" is only the memstore insert, I'm not sure how that would happen. It's possible we could prevent it from becoming a problem with judicious use of the updateLock in HRegion though.
For example, by grabbing the updateLock.writeLock().lock() during the switch over, or the flush, we could ensure that all the pending writes are now complete, then do the switch out, then we'd never have a situation where a half committed write is in memstore & hfile at the same time.

stack
added a comment - 02/Feb/11 01:19 Chatting with Ryan:
+ To be solved is how read/write rwcc points are respected on hfile flush; how do we not pull from the hfile items that are in the future as far as the current rwcc read point is concerned (especially when cf7 of a ten-cf row flushes mid-read)?
+ Soln is that we'll move the read point forward up in the region scanner on each next invocation; i.e. on entrance into a new row. We'll also only swap in new hfiles on next up in the region scanner (rather than down in the store scanner as is currently done). So, given a row of 100 cfs and 1M columns, as we're reading, we'll hold the rwcc read point. cf 48 and cf 59 might flush but we'll not swap in their new store files until we get to the end of the row (we'll be holding on to the snapshots for a little longer than we do now).
+ On next up in the region scanner, we also need to reseek each row even though this could be a perf killer. Our current notion of an end-of-row marker is the kv whose row does not match that of the row we are currently in. Let's call this next-row kv kvnext. We park here in between next invocations. Well, what if in between region next invocations there is a big pause and a bunch of puts come in, only the puts have the same row as kvnext AND they happen to sort before the kvnext at which we are currently parked? We have to reseek. (It could be worse: a new row could have been inserted between next invocations, in between kvbefore and kvnext... if parked at kvnext we're not going to see it, not unless we do hbase-3498.)
+ We do not think we need to add a sequence number to KV, one that is persisted out to HFile.
+ It looks like we do not need to use the hlog's sequence number all over; we can keep RWCC's little incrementing value. We can also keep its memstore scope – as opposed to what was being discussed above, where we were going to broaden the scope to cover WAL writing.
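The "swap in new store files only at row boundaries" idea from the plan above can be pictured like so (a toy sketch; the names are illustrative and this is not HRegion's actual scanner code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of deferring reader updates to row boundaries: a flush queues a
// pending reader swap, and the scanner applies it only between rows, so a
// single row is never served from a mix of old and new store files.
class RowBoundaryScanner {
    private List<String> currentReaders;
    private List<String> pendingReaders; // set by a flush; applied lazily

    RowBoundaryScanner(List<String> readers) {
        this.currentReaders = new ArrayList<>(readers);
    }

    // Called by the flush path: never swap mid-row, just record the request.
    synchronized void updateReaders(List<String> newReaders) {
        this.pendingReaders = new ArrayList<>(newReaders);
    }

    // Called by the region-level scanner on entrance into a new row.
    synchronized void nextRow() {
        if (pendingReaders != null) {
            currentReaders = pendingReaders;
            pendingReaders = null;
        }
    }

    synchronized List<String> readers() {
        return currentReaders;
    }
}
```

Holding the old snapshot until `nextRow()` is what "holding on to the snapshots for a little longer than we do now" means in the comment above.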

stack
added a comment - 03/Feb/11 00:43 Regards HBASE-2673 , which is about consistent view when intra-row scanning, if we do NOT add the transactionid/sequenceid to the hfile, then we must punt on it; i.e. intra-row scanning, you will not get a consistent view.

stack
added a comment - 03/Feb/11 19:46 The other dimension we need to consider is bulk load – especially the new multifamily bulk load. The bulk addition needs to come in at a row boundary so we keep a consistent row view.

ryan rawson
added a comment - 03/Feb/11 19:50 I don't think bulk load is an issue: once we move the update-readers call to the HRegion.Scanner level, we only call update readers between rows, so it will be an atomic 'appearance' of data. Does that sound right?

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/2224/
-----------------------------------------------------------

1. Discussed with Amitanand; he is planning to move this to the end of the KV (to play nice with delta encoding).
2. Also planning to store this in a variable-length format.
3. Also, if (kv.memstoreTS < current read point across all scanners) then we can just write a 0. This would be the case for most of the KVs except the last few written.
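The optimization in point 3 can be sketched like this (a hypothetical helper using a hand-rolled varint, not Hadoop's actual WritableUtils; names are illustrative):

```java
import java.io.ByteArrayOutputStream;

// Sketch of point (3): when a KV's memstoreTS is already below the oldest
// read point across all scanners, its exact value no longer matters to any
// reader, so we can persist a 0 instead - which a varint encodes in 1 byte.
class MemstoreTsEncoder {
    // Minimal little-endian base-128 varint, 7 bits per byte.
    static void writeVarLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Encode a KV's memstoreTS for persistence, zeroing it when it is
    // older than every live scanner's read point.
    static byte[] encodeMemstoreTs(long memstoreTS, long oldestReadPoint) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long toWrite = (memstoreTS < oldestReadPoint) ? 0 : memstoreTS;
        writeVarLong(out, toWrite);
        return out.toByteArray();
    }
}
```

Since most flushed KVs are older than the oldest read point, most of them cost a single zero byte on disk under this scheme.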

1) Unfortunate that we have to create these two objects (ByteArrayInputStream & DataInputStream) for every KV. I suppose if this becomes an issue we could add overloads in WritableUtils.readVLong that directly work on byte arrays instead of requiring a DataInput stream. [Probably not a big deal – but just something to remember.]

2) Could you combine lines 582 & 583 by using this overload of the constructor:

jiraposter@reviews.apache.org
added a comment - 15/Oct/11 04:10
https://reviews.apache.org/r/2224/
(Updated 2011-10-15 04:08:41.977544)
Review request for Ted Yu, Michael Stack, Kannan Muthukkaruppan, and Karthik Ranganathan.
Changes
-------
Fix 2 more issues that could potentially cause ACID violations:
(a) We used to write only maxVersions KVs to disk during flush. Not all KVs should be counted in this calculation: we now ignore all KVs newer than the oldest read point, so that the oldest scanner can also get enough versions.
(b) Move the logic for ignoring newer KVs into the StoreFileScanner, so that it only returns KVs that are guaranteed to be included in the search.
There was a condition where two KVs written to the same file were identical except for their memstoreTS; we would then skip the "duplicate". It was possible that the first one would be ignored because it had a newer memstoreTS, and we would never get to see the second one, which might be the KV we want.
Summary
-------
address the 2856 issues by writing the memstoreTS to the disk.
version v11 of the patch.
uploading it here for easier review process.
This addresses bug HBASE-2856 .
https://issues.apache.org/jira/browse/HBASE-2856
Diffs (updated)
/pom.xml 1183581
/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java 1183581
/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java 1183581
/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java 1183581
/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/ColumnCount.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/ColumnTracker.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/ExplicitColumnTracker.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/ReadWriteConsistencyControl.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/ScanWildcardColumnTracker.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 1183581
/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java 1183581
/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java 1183581
/src/test/java/org/apache/hadoop/hbase/TestAcidGuarantees.java 1183581
/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java 1183581
/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileWriterV2.java 1183581
/src/test/java/org/apache/hadoop/hbase/regionserver/TestExplicitColumnTracker.java 1183581
/src/test/java/org/apache/hadoop/hbase/regionserver/TestFSErrorsExposed.java 1183581
/src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWildcardColumnTracker.java 1183581
/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 1183581
Diff: https://reviews.apache.org/r/2224/diff
Testing
-------
mvn test
Thanks,
Amitanand
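The duplicate-skip hazard described in change (b) above can be illustrated with a toy model (hypothetical code, not the real StoreFileScanner): when two entries share a key and differ only in memstoreTS, filtering too-new entries inside the file scanner, before any duplicate-skipping logic runs, guarantees the visible copy is still returned instead of being skipped as a "duplicate" of an ignored newer one.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of issue (b): two entries share the same key and differ
// only in memstoreTS. Filtering too-new entries at the file-scanner level
// leaves only KVs guaranteed to be included in the search, so the older,
// visible copy of a key cannot be mistaken for a duplicate and dropped.
class FilteredScan {
    static class Entry {
        final String key;
        final long memstoreTS;
        Entry(String key, long memstoreTS) {
            this.key = key;
            this.memstoreTS = memstoreTS;
        }
    }

    // Return only entries visible at the given read point, preserving order.
    static List<Entry> visible(List<Entry> fileOrder, long readPoint) {
        List<Entry> out = new ArrayList<>();
        for (Entry e : fileOrder) {
            if (e.memstoreTS <= readPoint) {
                out.add(e);
            }
        }
        return out;
    }
}
```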

Ted Yu
added a comment - 17/Oct/11 04:41 For patch v7, boolean ignoreCount is added to checkColumn(). I think javadoc for this new parameter should be added to ColumnTracker.java
Javadoc for long readPointToUse of ScanQueryMatcher ctor should be added.
Javadoc for boolean useRWCC of StoreFileScanner ctor and getScannersForStoreFiles() should be added.
There is duplicate code in StoreFileScanner.next(): lines 164 to 172.

Ted Yu
added a comment - 20/Oct/11 21:18 Maybe https://reviews.apache.org/r/2483 belongs to this JIRA.
Two new parameters are added to ScanQueryMatcher ctor: readPointToUse and ignoreDuplicates.
HBASE-4536 is going to be integrated soon where we have:
+ public ScanQueryMatcher(Scan scan, Store.ScanInfo scanInfo,
+     NavigableSet<byte[]> columns, StoreScanner.ScanType scanType,
+     long earliestPutTs) {
I think any new parameters should be reviewed before deciding where they belong.

jiraposter@reviews.apache.org
added a comment - 27/Oct/11 05:43
On 2011-10-17 06:06:23, Kannan Muthukkaruppan wrote:
> Amit: Did you rebase before uploading the new patch? That, unfortunately, is making it hard to isolate the changes between r6 and r7. Will review tomorrow morning.
>
> But I did read your description about the issues you mentioned.
>
> Regarding (b)-- we had already discussed in person. That makes sense.
>
> And really nice catch on (a) too!! That is indeed subtle and tricky. Super!!!
>
Amitanand Aiyer wrote:
Looks like a lot has changed since the original revision that I based my first patch off.
Please disregard v7.
Let me submit these modifications as a separate diff. I have a sub-jira created for each part.
Seems like we are all basically +1 on this patch up to some point - could we commit till there? We can add the other changes as separate tasks under the same umbrella task.
The other diffs are based on this diff... and they are separate issues from what this is addressing (persisting memstoreTS). It's getting confusing if we keep adding to this diff.
Karthik
https://reviews.apache.org/r/2224/#review2614

stack
added a comment - 27/Oct/11 06:46 Seems like we are all basically +1 on this patch up to some point - could we commit till there?
Sounds good to me. Can you do this Amit? (i.e. cut out the bits that have the +1s?) Good stuff.

Summary:
We ran an incremental scrape with HFileOutputFormat and encountered major compaction storms. This is caused by the bug in HBASE-3404. The permanent fix is a little tricky without HBASE-2856. We realized that a quicker solution for avoiding these compaction storms is to simply exclude bulk import files from minor compactions and let them only be handled by time-based major compactions. Add this functionality along with a config option to enable it.
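The summary above proposes filtering bulk-load output out of minor compaction selection, gated by a config flag. A minimal sketch of that selection rule, using illustrative names (`StoreFileInfo`, `excludeBulkFromMinor` are stand-ins, not the actual HBase API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed behavior: minor compactions skip files that came
// from bulk import, leaving them to time-based major compactions.
class CompactionSelectionSketch {
    static class StoreFileInfo {
        final String path;
        final boolean bulkLoadResult; // true if this file was bulk-imported

        StoreFileInfo(String path, boolean bulkLoadResult) {
            this.path = path;
            this.bulkLoadResult = bulkLoadResult;
        }
    }

    private final boolean excludeBulkFromMinor; // the proposed config option

    CompactionSelectionSketch(boolean excludeBulkFromMinor) {
        this.excludeBulkFromMinor = excludeBulkFromMinor;
    }

    // Candidate files for a minor compaction, minus bulk-load files when
    // the exclusion option is enabled.
    List<StoreFileInfo> selectForMinorCompaction(List<StoreFileInfo> candidates) {
        List<StoreFileInfo> selected = new ArrayList<>();
        for (StoreFileInfo f : candidates) {
            if (excludeBulkFromMinor && f.bulkLoadResult) {
                continue; // handled only by major compactions
            }
            selected.add(f);
        }
        return selected;
    }
}
```

With the option off, selection degenerates to the old behavior, which is why a config flag makes this a safe incremental change.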

Amitanand Aiyer
added a comment - 18/Nov/11 02:00 2856-v9-all-inclusive includes all the 4 patches for the sub-tasks:
(i) persist memstoreTS to disk
(ii) 4485
(iii) track versions correctly
(iv) rename RWCC to MVCC.
This is rebased to the latest trunk HEAD: trunk@1203428.
Ran the unit tests on mr. and they seem to pass except TestShell (which seems to fail even without these patches).
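The sub-tasks above revolve around one idea: each put is stamped with a memstoreTS write number, and a scanner only sees KeyValues whose stamp is at or below its pinned read point. A hedged sketch of that mechanism (class and field names here are illustrative, not the actual `ReadWriteConsistencyControl` code; write completion is simplified to in-order):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the MVCC idea: writes get increasing memstoreTS stamps, and a
// scan filters out any KeyValue newer than the read point it captured. The
// bug discussed in this issue arises when a flush drops the memstoreTS, so
// this filter can no longer be applied after memstore -> HFile switches.
class MvccSketch {
    static class KeyValueStub {
        final String column;
        final long memstoreTS; // write number assigned when the put began

        KeyValueStub(String column, long memstoreTS) {
            this.column = column;
            this.memstoreTS = memstoreTS;
        }
    }

    private final AtomicLong writePoint = new AtomicLong(0);
    private volatile long readPoint = 0;

    long beginWrite() {
        return writePoint.incrementAndGet();
    }

    // Simplified: assumes writes complete in the order they began.
    void completeWrite(long ts) {
        readPoint = ts;
    }

    // Visibility filter: only KeyValues committed at or before the pinned
    // read point are returned; in-flight (newer) generations are skipped.
    List<KeyValueStub> scan(List<KeyValueStub> row) {
        long rp = readPoint;
        List<KeyValueStub> visible = new ArrayList<>();
        for (KeyValueStub kv : row) {
            if (kv.memstoreTS <= rp) {
                visible.add(kv);
            }
        }
        return visible;
    }
}
```

Persisting the memstoreTS into the HFile (sub-task (i)) is what lets this filter keep working after a flush, which is exactly the read-consistency hole TestAcidGuarantees was tripping on.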

Jonathan Hsieh
added a comment - 19/Nov/11 00:50 I've been looping TestAcidGuarantees for about 6 hours now and it is still chugging along and has not failed. I'm going to let it go overnight. (I believe it used to fail within an hour)
What are thoughts on backporting this onto the 0.92 branch? (as a separate issue..)

stack
added a comment - 19/Nov/11 05:02 If someone did it in next day or so, I'd be up for having it committed to 0.92 in time for first RC – there's one issue outstanding at mo (That's good news this patch is not yet failing – it's pretty amazing really). If it's not done in next day or so, it'll miss the first RC. Could commit it to the RC that follows (Somehow I don't think the first RC is going to make it out).

stack
added a comment - 19/Nov/11 05:38 I took a look. It seems a little tricky getting this in. There is more than just this patch involved. I see RWCC's begin/ends that are in trunk but not in 0.92. So we'd need more than just this patch.

Lars Hofhansl
added a comment - 20/Nov/11 06:57 Here's a patch against 0.92. I pulled in the necessary API changes from HBASE-4219 (but not the rest of the functionality).
I could use some help verifying the patch and testing this!
TestAcidGuarantees passed (but only ran it once).

Jonathan Hsieh
added a comment - 20/Nov/11 12:39 On trunk, TestAcidGuarantees ran for a solid day and a half (33+ hours) without failing.
larsh@ I'll loop the 0.92 version and let it run through today and report how it fared around midday monday.

Lars Hofhansl
added a comment - 20/Nov/11 19:19 - edited @Nicolas and @Amit, could you review the 0.92 patch? It turned out to be much more manual than I had wished or expected, so it is very possible that I missed something.
(I tried to upload the 0.92 patch to review board for easier verification, but apparently that does not work for branches other than trunk.)

Jonathan Hsieh
added a comment - 21/Nov/11 00:48 @larsh I posted it for you here. https://reviews.apache.org/r/2893/
I applied the patch, committed it and generated a git-patch via 'git format-patch HEAD^' which has enough info to find the right branch.

stack
added a comment - 21/Nov/11 04:40 You fellas want this in 0.92? I want to cut a 0.92 RC. I have 0.92 tests passing up on jenkins a few times in a row now and all criticals and blockers are in. Should we wait? Or should we cut the RC and get this into the second RC (I'm sure there'll be one).

Lars Hofhansl
added a comment - 21/Nov/11 05:59 Re: 0.92, I was going by your comment above
If someone did it in next day or so, I'd be up for having it committed to 0.92 in time for first RC.
It's not entirely clean, yet:
Results :
Failed tests: testClosing(org.apache.hadoop.hbase.client.TestHCM)
testFilterAcrossMultipleRegions(org.apache.hadoop.hbase.client.TestFromClientSide): expected:<17576> but was:<28064>
testForceSplit(org.apache.hadoop.hbase.client.TestAdmin): Scanned more than expected (6000)
testForceSplitMultiFamily(org.apache.hadoop.hbase.client.TestAdmin): Scanned more than expected (6000)
testSplitWhileBulkLoadPhase(org.apache.hadoop.hbase.mapreduce.TestLoadIncrementalHFilesSplitRecovery)
testGroupOrSplitPresplit(org.apache.hadoop.hbase.mapreduce.TestLoadIncrementalHFilesSplitRecovery)
testWholesomeSplit(org.apache.hadoop.hbase.regionserver.TestSplitTransaction)
testRollback(org.apache.hadoop.hbase.regionserver.TestSplitTransaction)
testBasicSplit(org.apache.hadoop.hbase.regionserver.TestHRegion)
Tests in error:
testShutdownFixupWhenDaughterHasSplit(org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster): test timed out after 300000 milliseconds
Tests run: 1065, Failures: 9, Errors: 1, Skipped: 7
I have no time to look at these tonight, though. But that probably points to another RC.
Would sure be nice if the acid guarantees that HBase claims would be met in 0.92

Nicolas Spiegelberg
added a comment - 21/Nov/11 15:51 Something to keep in mind: we have a version of this for our prod branch running on some smaller test clusters, but not yet on our actual prod clusters (since we committed it at the same time you did). Also, note that between HFileV2 & this, there is no easy downgrade strategy after moving from 90 to 92. I think that putting this in a 92 RC definitely means an extra testing effort. However, it's been the last massive outstanding caveat for ACID semantics so it makes sense for 92 inclusion. I'm sure that other companies consider this a critical issue for their customers, so they would be up for accelerating this testing effort ahead of our schedule.

Lars Hofhansl
added a comment - 21/Nov/11 19:32 This one looks bad:
testFilterAcrossMultipleRegions(org.apache.hadoop.hbase.client.TestFromClientSide) Time elapsed: 12.233 sec <<< FAILURE!
java.lang.AssertionError: expected:<17576> but was:<28064>
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.failNotEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:128)
at org.junit.Assert.assertEquals(Assert.java:472)
at org.junit.Assert.assertEquals(Assert.java:456)
at org.apache.hadoop.hbase.client.TestFromClientSide.assertRowCount(TestFromClientSide.java:528)
at org.apache.hadoop.hbase.client.TestFromClientSide.testFilterAcrossMultipleRegions(TestFromClientSide.java:436)
Happens only with the 0.92 patch applied. It seems the scanner now finds too many cells.

Jonathan Hsieh
added a comment - 21/Nov/11 20:24 On the bulkload operation, the error has something to do with the split point – in the test I force a split and the resulting error has something to do with the point where the second daughter starts.
@Lars – since the original issue is resolved, and since this seems non-trivial, maybe this should get moved into a new issue?