Junyi Xie (Inactive)
added a comment - 29/Jan/13 6:46 PM Hi Aaron,
When XDCR fetches a deletion mutation from storage, it sees some garbage in the document body, as Marty described, and the mutation is marked as NON-JSON. This seems like a waste of bandwidth because the destination side (either a regular XDCR receiver or an ES cluster) needs only the key and metadata; it does not need anything encoded in base64.
Please feel free to assign this to the right person if necessary. Thanks.

Marty Schoch
added a comment - 30/Jan/13 12:58 AM This does not cause any problem for Elasticsearch: when processing a delete, we do not look at the binary flag or the body of the base64 data. As Junyi mentioned, it may waste bandwidth transferring bytes that are not used for anything.

Aaron Miller (Inactive)
added a comment - 30/Jan/13 1:21 AM Couchstore considers it "valid" for deleted items to have bodies, and will use whatever content the upstream user provided. It should be fine for XDCR to consider deleted items to be "empty" and send no body, though, as from the perspective of the rest of the components in the system, a deleted item is effectively empty.
There may be some code in CouchDB/XDCR misinterpreting the empty body and erroneously reading some garbage from the beginning of the file, though (Location 0 is considered to be "empty"), as I believe ep-engine correctly sets the body to empty when doing deletes. I will investigate. Do we know if this happens with every deleted item, or just some?

Dipti Borkar
added a comment - 30/Jan/13 1:32 AM BTW, I forget: do we propagate expired documents? If yes, then for use cases in which we have TTL, this may be a huge overhead.
I don't know if this is something we want to fix in 2.0.1 given the timing; perhaps 2.0.2?

Marty Schoch
added a comment - 30/Jan/13 1:55 AM @aaron After we discussed the issue and theorized what was happening, the first doc I put through these steps exhibited this behavior. I can't say it always happens, but it has been very easy to reproduce.
@dipti Regarding "do we propagate expired documents?": a good question, the more I think about it. While the remote cluster should expire the item on its own, when the expiration hits the disk on the source cluster, I would expect that docid/revision to go across in a _revs_diff request. In the case of the elasticsearch plugin, this would seem to result in the doc/rev being considered missing (we already deleted it); since it's missing, the _bulk_docs would then send the delete across the wire (introducing the overhead in your question). But I'm not sure that's actually how it works in practice.
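The _revs_diff/_bulk_docs flow described above can be sketched roughly as follows. This is an illustrative Python sketch of the payload shapes, modeled on the general CouchDB replication protocol; the document id, revision, and concrete field values are made up for illustration:

```python
# Rough sketch of the two-step flow for propagating a deletion:
# 1) the source asks the destination which revisions it is missing
#    (_revs_diff); 2) for each missing rev, the source pushes the
#    change via _bulk_docs. The ids/revs below are hypothetical.
import json

def revs_diff_payload(doc_id, rev):
    # "Do you already have this revision of this doc?"
    return {doc_id: [rev]}

def bulk_docs_delete_payload(doc_id, rev):
    # Push the deletion itself: only id, rev, and the tombstone flag
    # are needed; any base64 body attached here is wasted bandwidth.
    return {"docs": [{"_id": doc_id, "_rev": rev, "_deleted": True}],
            "new_edits": False}

print(json.dumps(bulk_docs_delete_payload("user::1001", "2-abc123")))
```

The point relevant to this ticket is that the deletion entry carries only the id, rev, and tombstone flag; a garbage body riding along in the _bulk_docs request is pure overhead.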

Junyi Xie (Inactive)
added a comment - 30/Jan/13 5:52 PM @Dipti, XDCR does replicate an expired item as a deletion. But since the item can expire on its own on the remote cluster, there is a race condition: there is no guarantee the deletion will be replicated to the destination, depending on which cluster the expiration happens on first. For ES, this does not affect correctness, but it may cause some network bandwidth overhead.

Dipti Borkar
added a comment - 30/Jan/13 6:34 PM Bandwidth for sure, but also latency. We need to optimize for latency.
How hard is the fix?
KV + XDCR (particularly for the caching use case) is very important.

Marty Schoch
added a comment - 30/Jan/13 6:40 PM @Dipti I'm not sure I understand the part about latency. If you've configured TTL on the Elasticsearch cluster, the expired document will be removed from the index within the "indices.ttl.interval" (default 60s). The only other thing is that at some later time the document expiration will also come across the wire via XDCR. This will involve both a _revs_diff and a _bulk_docs request. The net effect of these is nothing (the document has already been removed from the index), so what latency would be affected?

Jin Lim
added a comment - 03/Feb/13 11:47 PM Based on the above communication thread, there are two obvious things we can discuss here:
overhead of replicating expired items (especially with the race condition mentioned above): yes, this wastes bandwidth, but can it also affect maintaining consistent XDCR performance, including latency, when the number of expired items is huge?
Aaron's suggestion: XDCR should consider a deleted/expired item to be empty and thus send no body over the wire.
We will discuss the above two points and decide on feasible next steps. Thanks.

Junyi Xie (Inactive)
added a comment - 04/Feb/13 5:00 PM Talked to Damien and Jin briefly.
XDCR just reads all mutations from storage and tries to send them to the remote side. XDCR itself does not create or parse the document body. All the info XDCR needs is the key and metadata, not the body of the document. The content of the doc body is totally beyond XDCR's control.
It looks like the "garbage in doc body" is persisted in storage, because in Marty's testcase the garbage is something created in memory. So even without XDCR, we may be wasting potentially a lot of storage space by persisting a bunch of garbage for deleted items. XDCR just exposes this issue.
On the XDCR side, of course we could parse the doc body before sending it to avoid the garbage, but IMHO that is NOT a good fix, because it just "hides" the issue instead of fixing it. Parsing the doc body is not supposed to happen in the XDCR layer.
Per Aaron's comment, "ep-engine correctly sets the body to empty when doing deletes", so I would suggest the CouchDB/Couchstore team look at this issue first: why do we read garbage for a deleted item?

Junyi Xie (Inactive)
added a comment - 04/Feb/13 5:59 PM Per comments from Damien, it is also possible there is a bug in ep_engine that messed up the doc_body pointer of a deleted item. Adding ep_engine to the component list.

Junyi Xie (Inactive)
added a comment - 11/Feb/13 11:46 AM Dipti,
My feeling is that if we can find the culprit, it may be a small fix. But we need to figure out where the problem is. As I mentioned above, we incorrectly store some garbage for deleted items; XDCR just exposed that.
IMHO, ep_engine should first look at this issue to see if there is any dangling doc_body pointer. If not, couchdb and couchstore will need to investigate. Thanks!

Jin Lim
added a comment - 11/Feb/13 12:21 PM EP Engine (Jin) will take a look at this issue this week. As Junyi stated earlier, if this turns out to be an ep engine issue, it should be a small fix. Thanks.

Farshid Ghods (Inactive)
added a comment - 11/Feb/13 12:27 PM Jin/Dipti/Yaseen,
Given that this is not a regression from 2.0, does not impact XDCR functionality, and does not cause a huge bandwidth issue, I recommend moving this to 2.0.2.

Dipti Borkar
added a comment - 11/Feb/13 12:41 PM I think it does cause a huge bandwidth issue for large documents and expirations. This is particularly the case for ad targeting. Also, many people have complained about unusual disk growth with TTL; I suspect this may be the issue. In addition to storing tombstones, we are also storing data in couchstore. Questions that will help us decide are:
do we store extra data for every deleted key?
is this also the case for expired items?
is it a fixed amount of additional data that gets stored? how much per document?
I'm not so concerned about deletes, since typically a very small percentage of items (<1%) are deleted, but I am more concerned about expired items.

Jin Lim
added a comment - 11/Feb/13 1:11 PM I agree, and we will still continue to look into this if there is a small fix. We just didn't have enough engineering cycles for debugging this, but will do so this week.
Please see my answers to Dipti's questions above:
do we store extra data for every deleted key? Yes on the source node but not the destination node (Junyi, please correct me if I'm wrong).
is this also the case for expired items? Yes, since ep engine handles both explicitly deleted and expired items with the same disk write operation (del).
is it a fixed amount of additional data that gets stored? It appears to be not much, but random, according to Junyi's latest debugging.
Again, we will do a bit more engineering debugging and make a final call. Thanks for the practical suggestion from QE, though.

Junyi Xie (Inactive)
added a comment - 11/Feb/13 1:27 PM Jin,
Thanks a lot for looking into this.
Dipti,
Here is my take on your questions
do we store extra data for every deleted key?
It is fairly easy to reproduce, but I am not sure if it happens for "every" deleted key.
is this also the case for expired items?
An expired item will end up as a del op from ep_engine; if the code path is the same, it will probably be a victim of this bug too.
is it a fixed amount of additional data that gets stored? how much per document?
The garbage seems to come from a random segment of memory (possibly due to a dangling pointer); if that is the case, we store a random chunk of data in doc_body, and its size can vary.

Jin Lim
added a comment - 13/Feb/13 4:47 PM - edited I finally did some debugging on this issue and found that ep engine is handling deleted items correctly: it never passes a doc body, only NULL, to couchstore when deleting docs.
However, with Aaron's help, it appears that open_doc_int() in couch_db.erl still tries to read the doc body for a deleted item instead of treating it as an empty doc. Assigning to the couchdb team (Filipe) for further debugging and a fix. Thanks.
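A minimal sketch of the behavior described above, assuming a simplified model of the storage file and doc metadata. The real code is Erlang in couch_db.erl; all names, structures, and file contents here are hypothetical:

```python
# Illustrative demonstration (hypothetical names, not the real Erlang
# code) of the bug shape: an open_doc that unconditionally follows the
# body pointer reads bytes from the start of the file for tombstones
# (whose body pointer is 0/NULL), while the fixed version
# short-circuits deleted docs to an empty body.

FILE = b"couchstore-header...real-body-bytes"  # stand-in for the db file

def read_at(pos, length):
    """Read raw bytes at a file offset (simulated)."""
    return FILE[pos:pos + length]

def open_doc_buggy(info):
    # Always dereferences body_pos, even for deleted docs; for a
    # tombstone with body_pos 0 this returns file-header bytes.
    return read_at(info["body_pos"], info["body_len"])

def open_doc_fixed(info):
    # Proposed fix: treat a deleted item as an empty doc.
    if info["deleted"]:
        return b""
    return read_at(info["body_pos"], info["body_len"])

tombstone = {"deleted": True, "body_pos": 0, "body_len": 10}
print(open_doc_buggy(tombstone))  # garbage from the file header
print(open_doc_fixed(tombstone))  # b""
```

Fixing this at the open-doc layer (rather than in the replicator) would cover every caller that opens deleted documents, which is the crux of the discussion below.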

Filipe Manana (Inactive)
added a comment - 13/Feb/13 4:55 PM Is it that hard to change the replicator code so that it doesn't read bodies for deleted items, or to make couch_db:open_doc/2 not read bodies for deleted items?

Junyi Xie (Inactive)
added a comment - 13/Feb/13 5:20 PM Filipe,
It is not hard, but the replicator is not supposed to do that, and more importantly, it does not solve the problem. You would still store a bunch of garbage on disk, which causes the more serious disk usage issue Dipti mentioned.

Filipe Manana (Inactive)
added a comment - 13/Feb/13 5:27 PM From what I understood from Jin's comments, ep-engine/couchstore side are doing the right thing, that is, not persisting garbage to disk for deleted documents.
It's only a problem of couchdb not knowing that it shouldn't read bodies for deleted documents (unlike in Apache CouchDB).
Issues like this could well be solved by people who work on the replicator imho.

Jin Lim
added a comment - 13/Feb/13 5:36 PM Aaron/Filipe/Junyi, thanks for your input and quick help here. It looks like we can address this issue fairly simply, whether from the replicator side or the couchdb side. If I may, I will leave that up to your decision. However, since it is a fairly simple, good fix, please let me know whether we should consider pushing it into 2.0.1 or not. Thanks.

Junyi Xie (Inactive)
added a comment - 13/Feb/13 5:38 PM It is good news that we do not actually store garbage. If couchdb reads some garbage (as Aaron pointed out), then couchdb should fix it.
The replicator is just one consumer of this API and relies on couchdb to fetch all mutations; IMHO it should not look into the doc body. Putting some code in the replicator does not solve the problem, since other callers of couchdb:open_doc() will still read garbage.