Description

couch_common.h declares the field DocInfo.rev_seq as type uint64_t. However, the database only persists it as a 32-bit value (see assemble_seq_index_value, couch_save.c:19). This means that clients trying to set values larger than 2^32-1 will find the upper 32 bits truncated when they read the document.

The field should be changed to type uint32_t. This could cause warnings in client code that assigns a 64-bit value to it, but I don't believe any of our code currently uses this field at all(?)
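A minimal sketch of the truncation described above, assuming the persistence path simply narrows the 64-bit API field to 32 bits (as assemble_seq_index_value does); the function name here is illustrative, not the actual couchstore code:

```c
#include <stdint.h>

/* DocInfo.rev_seq is declared uint64_t in couch_common.h, but only
 * 32 bits survive the trip to disk. The cast below models that. */
static uint32_t persist_rev_seq(uint64_t rev_seq) {
    return (uint32_t)rev_seq;  /* upper 32 bits silently dropped */
}
```

So a client that stores 2^32 + 7 will read back 7 after persistence, which is exactly the truncation the description reports.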

Jens Alfke
added a comment - 17/Oct/12 12:19 PM - edited Fixing this will break binary compatibility of the libcouchstore shared library because it will change field offsets in the DocInfo struct.
It may also cause compile or runtime errors in language bindings — the only one we depend on is Python, and if our importer code doesn't set rev_seq it should be OK.

Junyi Xie (Inactive)
added a comment - 19/Oct/12 1:22 AM XDCR uses a 64-bit seq number; on the source, XDCR gets this info from CouchDB, while on the destination, we get it from ep_engine.
Damien, can you please clarify how the issue is going to impact XDCR? Thanks.

Jens Alfke
added a comment - 24/Oct/12 4:26 PM If we had to store all 64 bits of the sequence number, it would require an incompatible change in the CouchStore file format. We'd need to make a corresponding change to our CouchDB code so it can read the same databases, and probably come up with some migration tool for people who have databases created with the beta.

Junyi Xie (Inactive)
added a comment - 29/Oct/12 3:17 PM XDCR will get impacted if the number of changes is over 2^32 ≈ 4B.
Damien and Aaron, I assumed the storage layer should be able to store a 64-bit seq number; could you please clarify?

Aaron Miller (Inactive)
added a comment - 29/Oct/12 3:26 PM Which sequence number are we talking about?
The rev_seq field is the revision counter for the item and is local to the key. It is incremented when a key is updated. It is 32 bits in the file format.
The db_seq field is the identifier in the sequence index; it is the one that identifies an item within a couchstore file, and it is 48 bits in the file format.

Jens Alfke
added a comment - 29/Oct/12 3:44 PM Aaron: this is about rev_seq, not db_seq.
Farshid: The 32-bit limit is per document. Assuming ep-engine increments rev_seq every time it updates a document, you'll be able to update any one document 4 billion times before this rolls over. I don't know how XDCR works, but this doesn't _seem_ like it should be an issue for it.

damien
added a comment - 29/Oct/12 3:50 PM - edited Yes, we need to fix this. I think raising the stored size to 48 bits should be plenty. We'll need to increment the header version #, so that old files persisting the smaller size are not opened; instead, we give an error on opening.
Dipti, when we fix this, old files created during beta will not be openable by the GA code. Is this acceptable? Fixing it so new code can open old files would require large changes in couchstore that I don't feel comfortable with, but if we need to do it we should know soon so we can scope out the work.

Steve Yen
added a comment - 29/Oct/12 3:57 PM I talked with Aaron about this, too. It is indeed a per-doc thing. The single-datacenter situation (non-XDCR) should see no effect, since nobody except XDCR reads the fields. In the XDCR case, one scenario is a "window of remote-datacenter downtime" during bi-directional replication, where an older doc might win in the (very low probability) event of value wrap-around. Not a blocker, imo, after learning all this.

Jens Alfke
added a comment - 29/Oct/12 4:00 PM Damien: I'm not sure I understand why this is so important. The issue I see is that if a document is updated 4 billion times between replications, its rev_seq will roll over and may cause it not to be transferred. Is that something that will occur in real usage? (I'm asking sincerely, not rhetorically; I don't know the use cases of CouchStore as well as CouchDB.)
Note that any fix has to be made to both CouchStore and CouchDB simultaneously.

damien
added a comment - 29/Oct/12 4:09 PM Yes, because when a document is created, it's assigned the high water rev of the last deletion in the whole vbucket. This way, when we create new docs that we previously deleted, we don't need to look on disk for the meta-data to ensure it now has a higher rev_seq than the deletion (if we don't, the new creation will "lose" to the remote deletion in an XDCR situation). So it's not that a single doc will be updated 4 billion times, it's that _all_ docs combined might be updated a cumulative 4 billion times. It's still unlikely, but possible, so we need more than a 32-bit number for storing the rev_seq.

Junyi Xie (Inactive)
added a comment - 29/Oct/12 4:14 PM Probably I did not explain clearly. We bumped up the key seq number used by XDCR from 32-bit to 64-bit because of some special handling of deletion in XDCR, which is pretty dirty and dark and you may not want to know. :)
In short, we need to maintain a "high watermark" of key sequence numbers across all keys in the vbucket. That is, this field should be big enough to hold the max key seqno in the vbucket. Having talked to Damien, we will make the key seq number and global seq number both 48 bits.
No change is needed on the ep_engine and XDCR side. We still use a 64-bit seq number, although only 48 bits are really effective.

Dipti Borkar
added a comment - 29/Oct/12 4:33 PM I don't understand why we use a 64-bit seq number in memory but only 48 bits are effective and persisted. This is likely a bug we will hit in the future. You guys are the experts and understand this better than I do, but shouldn't this be consistent across the board?
Breaking compatibility from beta to GA is really not advised, but I think we need to ensure that XDCR and the way sequences are handled are foolproof. Can someone please write up the design for this and paste it in the bug to make sure everyone is on the same page?

damien
added a comment - 29/Oct/12 4:48 PM Dipti, it's an optimization, so we don't store unneeded bits on disk. Keeping the on-disk representation as small as possible makes reads, writes, and compaction fast.
And we aren't in any danger of overflowing 48 bits. If we continuously did 100,000 updates per sec per vbucket, it would take 60+ years before we could possibly overflow 48 bits.
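The arithmetic behind the "60+ years" figure above can be checked directly; this is just back-of-envelope math, not anything from the couchstore code:

```c
/* How many years until a 48-bit counter wraps at a given sustained
 * update rate? 2^48 / rate gives seconds; divide by seconds/year. */
static double years_to_wrap_48bit(double updates_per_sec) {
    const double max48 = 281474976710656.0;        /* 2^48 */
    const double secs_per_year = 365.0 * 24 * 3600;
    return max48 / updates_per_sec / secs_per_year;
}
```

At 100,000 updates/sec per vbucket this comes out to roughly 89 years, which comfortably supports the "60+ years" claim.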

Aaron Miller (Inactive)
added a comment - 29/Oct/12 4:48 PM This value is not actually kept around in memory. It's only used in the interface to the persistence layer.
I was previously unaware of the high water rev_seq behavior that was added above couchstore to accommodate XDCR. I grok that it works like such:
* We keep a per-vbucket "high water" deleted rev_seq.
* When an item is deleted, if its rev_seq is higher than this, we set this value to that item's rev_seq.
* Brand-new items, when created, have their rev_seq set to this high water mark, rather than to 0.
This is so that deleting and recreating an item will not set its rev_seq lower than before, thereby preventing it from ever winning in a XDCR conflict situation.
This behavior essentially makes rev_seq global across the data file, instead of local to the document as we expected when we designed the file format, and as such it is not unreasonable for it to get large enough to wrap around.
It appears this was done for efficiency, since metadata for deleted items is not available in RAM and we didn't want to have to ask Couchstore whether an item had previously been deleted every time we do a create. However, I think we can eliminate this behavior (and once again make rev_seq local to the doc) by adding an option to couchstore to do the work of finding the rev_seq of the last deleted item at insert time; since it will have to pass over the item's indexed "tombstone" to create it anyway, it should not actually be any less efficient.
Looking into this now.
Another option would be to increase the size of this field in the file format, but that would break file format compatibility with all the previous releases.
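The high-water behavior Aaron describes above can be sketched as a toy model; the type and function names here are illustrative, not the actual ep-engine/couchstore API:

```c
#include <stdint.h>

/* Per-vbucket "high water" deleted rev_seq, as described above. */
typedef struct { uint64_t high_deleted_rev_seq; } vbucket_t;

/* On delete: remember the highest rev_seq ever deleted in the vbucket. */
static void note_deletion(vbucket_t *vb, uint64_t rev_seq) {
    if (rev_seq > vb->high_deleted_rev_seq)
        vb->high_deleted_rev_seq = rev_seq;
}

/* On create of a brand-new item: start at the high water mark instead
 * of 0, so a recreate never sorts below the deletion it replaces in an
 * XDCR conflict check. */
static uint64_t initial_rev_seq(const vbucket_t *vb) {
    return vb->high_deleted_rev_seq;
}
```

This makes concrete why rev_seq is effectively global across the data file: every new item inherits the largest rev_seq ever deleted anywhere in the vbucket.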

Jens Alfke
added a comment - 29/Oct/12 5:43 PM Dipti: "I don't understand why we use a 64 bit seq number in memory but only 48 bits are effective and persisted."
The C language and the CPU don't support 48-bit integers. The next size up from 32 bits is 64. So the sequence number is fundamentally a 48-bit number, but is represented in memory as a 64-bit value.
We should be more clear in the API docs about what the limits of the different values are, though. And range-check them in API calls.
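A range check of the kind Jens suggests might look like the following; the macro name is hypothetical, chosen only for this sketch:

```c
#include <stdint.h>

/* In memory the sequence is a uint64_t, but only 48 bits survive
 * persistence, so API calls could validate against this limit. */
#define COUCHSTORE_MAX_SEQ ((1ULL << 48) - 1)  /* hypothetical name */

static int seq_fits_on_disk(uint64_t seq) {
    return seq <= COUCHSTORE_MAX_SEQ;
}
```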

Dipti Borkar
added a comment - 29/Oct/12 8:38 PM Thank you for all the details. I would defer to your judgement (Damien, Aaron, Jens) to fix this. It seems like given these are global per vBucket, there is a chance of wrapping particularly in use cases like session stores that frequently delete and insert the document. If we need to make any storage changes, we should have a quick discussion with QE and the leads to understand the impact. We will need to document / release note it.
The main question is: will backup and restore still work from a beta build to GA? (Upgrade is not supported.)
QE will need to retest these functional tests.

damien
added a comment - 30/Oct/12 4:14 PM Aaron is currently reworking the way we keep rev seq and the high deleted rev, as it simplifies the code and can require less space on disk and in-memory. Once he has a patch ready, we'll evaluate the risk of merging the patch this late.

Steve Yen
added a comment - 30/Oct/12 4:44 PM Thanks Damien.
Thought I'd add some notes from talking with Damien & Aaron, where solution #1 is the currently favored approach...
solution #1: Aaron is enhancing couchstore API so that couchstore can choose the right rev-seq, even for recreated items.
+++ Pros: no changes to the current 32-bit field in the couchstore file format. This would also move us from a per-vbucket-level "highwatermark" design to per-document-level rev-seq tracking.
--- Cons: changes to couchstore and to ep-engine.
solution #2: change from 32-bit field to 48-bit field in the couchstore file format.
+++ Pros: easy to understand this kind of change, so likely lower risk in implementing it.
--- Cons: file-format incompatibility with 2.0-beta, so can't support in-place, offline upgrades from 2.0-beta to 2.0-GA. Also, changes to couchstore and couchdb.
solution #3: the math tells us this is unlikely to happen soon, as the seq-num is only updated at persistence time, not at item mutation time. That is, 400K ops/sec will not mean 400K seq-num increments/sec. Instead, the rev-seq # is incremented only when it gets through couchstore and is persisted, so the speed of persistence limits the growth of this #.
+++ Pros: easiest to understand, as it's the do-nothing approach.
--- Cons: this only delays the inevitable, as we still have to fix it, such as by implementing solution #1 in some ".next" release.
And, Aaron & Damien discussed doing both #1 and #2.
Finally, looking at the backup/restore code, it transfers the full 64-bits of rev-seq #, so it should be ok already.

Steve Yen
added a comment - 30/Oct/12 6:59 PM Notes from another sync-up mtg that Damien called (with Aaron, Farshid, Steve)...
The agreed-upon, favored solution is now #2, increasing the 32-bit field to a 48-bit field. A new toy-build will be forthcoming soon. Additional unit test(s) are forthcoming. The change will start from Jens's patch that cleans up some of the magic numbers currently spread amongst the couchstore code. Damien will handle the changes on the couchdb side.
On the other solutions, solution idea #1 won't work due to intra-cluster (TAP) replication. Solution idea #3 will run out of bits too soon (e.g., 100 days).
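For reference, persisting a 48-bit field typically means packing the low six bytes of the in-memory uint64_t big-endian; the functions below are a generic sketch of that idea, not the actual couchstore encoder:

```c
#include <stdint.h>

/* Write the low 48 bits of seq into 6 bytes, most significant first. */
static void put_seq48(uint8_t out[6], uint64_t seq) {
    for (int i = 0; i < 6; i++)
        out[i] = (uint8_t)(seq >> (8 * (5 - i)));
}

/* Read a 48-bit big-endian value back into a uint64_t. */
static uint64_t get_seq48(const uint8_t in[6]) {
    uint64_t v = 0;
    for (int i = 0; i < 6; i++)
        v = (v << 8) | in[i];
    return v;
}
```

A round trip through these preserves any value up to 2^48 - 1, including values above the old 2^32 - 1 limit that motivated this bug.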

Junyi Xie (Inactive)
added a comment - 02/Nov/12 2:57 PM Ketaki, do we understand why the replication rate is so slow in your test yesterday? This change-set is a fundamental change in storage, and it may break all XDCR functionality, so we need to be very careful before merging this one.

kzeller
added a comment - 09/Nov/12 6:30 PM RN: Document revision numbers had been stored as 32-bit values but are now stored as 48-bit values. This is to support the functioning of XDCR and a larger number of document revisions.