Description

I tried the following scenario on the latest build:
1. Create a cluster of 2 nodes and load 1M items, 1K each, with cbworkloadgen: ./cbworkloadgen -n localhost:8091 -i10000000 -t6 -j -s1000
2. Create 1 design doc with 4 simple views, of which 2 had a _count reduce function so I could count all documents in the view.
3. Build the index and confirm that the _count views return 1M items, i.e. the index was built successfully.
4. Add 2 more nodes to the cluster and rebalance.
5. During the rebalance, refresh the _count views.
Expected behavior (with consistent views enabled by default): I get the same 1M items during the rebalance.
Observed behavior: the number kept changing, decreasing over time to 900K, 800K, ...
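The view query used in step 5 can be sketched as follows. The bucket, design-document, and view names here are assumptions (the actual names were not given in the report); the host is the cluster address mentioned later in this ticket, and views are served on port 8092.

```shell
# Hypothetical names: bucket "default", dev ddoc "dev_d1", _count view "c1".
# full_set=true makes a development view run over the whole data set.
# A _count reduce view normally returns a single row whose value is the
# total document count.
url='http://184.169.209.178:8092/default/_design/dev_d1/_view/c1?full_set=true'
echo "$url"
# curl -s "$url"   # run against the live cluster to observe the count
```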

Iryna, can you collect the necessary logs from this cluster and post them on this bug?
Cluster can be found at: http://184.169.209.178:8091/ (usual credentials)
Here are all the IP addresses:
ns_1@10.176.29.176 10.176.29.176 184.169.209.178 ns_1@10.176.9.41
ns_1@10.176.9.41 10.176.9.41 50.18.23.114 ns_1@10.176.9.41
ns_1@10.168.103.76 10.168.103.76 204.236.154.91 ns_1@10.176.9.41
ns_1@10.176.145.104 10.176.145.104 54.241.117.117 ns_1@10.176.9.41

I did a few things on this cluster, like creating and removing indexes/nodes, so look at the logs at the end.

Sharon Barr (Inactive)
added a comment - 18/Oct/12 7:52 PM More info:
At the end of the rebalance, there were 630K items in this view, and it seems that the index was then rebuilt until it reached the 1M number.
It seems that the index is NOT being built during the rebalance.

Sharon Barr (Inactive)
added a comment - 18/Oct/12 11:59 PM

Attached are logs from the new nodes that entered the cluster. Logs from the original nodes are too large to attach.

The accurate observation is that at least the reduce views are not being built (or are only partially built). The count drops during the rebalance and stays low after it's done. Once I trigger the full view, it is built up again.

Filipe Manana (Inactive)
added a comment - 19/Oct/12 4:59 AM

What I see in the logs is that the initial index build is being done during rebalance.

During rebalance, vbucket state changes happen all the time, which causes the updater to be stopped and restarted after each state transition. Remember that the initial index build is not resumable, so each restart means it starts from scratch.

I don't think there's anything that can be done here. The only way I see is if it were possible for ns_server to know the final state of all vbuckets on each node, and then make a single bulk state-transition request to the indexes.

This has been possible ever since the faster initial index build method was added months ago.
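The restart problem can be illustrated with a toy calculation (all numbers below are made up, not measured on this cluster): if each vbucket state transition restarts a non-resumable initial build, the build never accumulates more progress than the gap between transitions.

```shell
# Toy illustration only, not real view-engine code: an initial build
# needing build_time seconds never finishes if a state transition
# restarts it every transition_gap seconds, because the build is not
# resumable and all prior work is discarded on each restart.
build_time=60        # hypothetical seconds needed for one full build
transition_gap=2     # hypothetical seconds between state transitions
progress=0
for transition in 1 2 3 4 5 6 7 8 9 10; do
  progress=$transition_gap   # restart from scratch: prior work is lost
done
echo "progress after 10 transitions: ${progress}s of ${build_time}s"
```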

After rebalance & indexing finished, the full 500K items are returned:
root@ubuntu1104-64:~# curl -X GET 'http://10.3.3.95:8092/default/_design/dev_d1/_view/v1?full_set=true&connection_timeout=60000&limit=10&skip=0'
{"rows":[

Filipe Manana (Inactive)
added a comment - 19/Oct/12 5:22 AM

Actually, it's not doing the initial build here, but what I said before is still valid.

For node .117 (node 1), what I see is that the index update is being stopped and restarted every 1 or 2 seconds due to vbucket state-transition requests from ns_server:

https://friendpaste.com/4qH8PqvWO4tpGANlSxp0n3

In each transition, only one vbucket state changes. I don't know whether things could be batched better in ns_server.

Farshid Ghods (Inactive)
added a comment - 19/Oct/12 9:37 AM

Deep, Iryna and I had a conversation about this scenario. We will discuss with Filipe and update the ticket to understand consistent views and index updating better.

Aleksey Kondratenko (Inactive)
added a comment - 19/Oct/12 1:57 PM

Every 1 or 2 seconds should be impossible, Filipe, at least for the main index. We are supposed to wait for index building completion for each vbucket movement.

Incoming vbucket transfers are not synchronized with respect to each other, and it's possible to have some config changes due to that: i.e. we start moving in vbucket 0, and 1 second later we can start moving in vbucket 1, which AFAIK will cause all indexing progress to be thrown out if it's the initial index build.

But there's an inherent limit on concurrent incoming vbucket movements. Constant changes should be impossible.

The replica index is somewhat different currently: my code doesn't wait for replica indexing completion, which I'll be happy to fix. Does couch_set_view:monitor_partition_update/3 work for replica indexes as well?

I understand it might not be possible for ns_server to batch things better, but this doesn't help.

I can revert some optimizations done several months ago for query performance, which will increase the incremental indexing checkpoint frequency. While that will likely help for such small datasets (while decreasing query performance in high-concurrency scenarios), for large datasets, or when there's a lot of load on the system (many ddocs, XDCR, bucket compaction, etc.), we'll very likely run into the same scenario as here.

Aleksey Kondratenko (Inactive)
added a comment - 20/Oct/12 11:45 AM

When I said impossible, I was referring to initial index building, in which case waiting for a single vbucket will obviously wait for all of them.

For some next release we'll definitely seek better ways to interact with indexes. The problem, as I pointed out earlier, is that from our high-level perspective we're not very aware of the performance implications of what we do.

Aleksey Kondratenko (Inactive)
added a comment - 22/Oct/12 12:32 PM - edited

By the way, an incorrect _count doesn't imply the index is not being built. I recommend actually checking whether the results you expect are there. We have that somewhat controversial "excluding" of reduction values during non-steady state, and I don't know if we tested this path heavily.

Farshid Ghods (Inactive)
added a comment - 22/Oct/12 4:35 PM

The user in this case is running a reduce query, which should return the count as expected.

When you say we have controversial excluding of reduction values during non-steady state, is that expected behavior in the view engine? Do we expect users to see the reduction return inconsistent results while the view results are always consistent?

Aleksey Kondratenko (Inactive)
added a comment - 22/Oct/12 4:40 PM

We expect it to be consistent. When I said "somewhat controversial" I was pointing out that it actually has to do quite a bit of work to return the reduction value in non-steady state compared to steady state. And I was trying to say I have no idea whether we used to test this path a lot.

Farshid Ghods (Inactive)
added a comment - 23/Oct/12 6:23 PM

If this happens for dev views, we should note in the release notes that users might get inconsistent results during rebalancing with development views.

Filipe Manana (Inactive)
added a comment - 24/Oct/12 10:44 AM

From the view engine's perspective, there is no distinction between (or concept of) development and production views; they're all treated the same way. ns_server, on the other hand, treats them differently, for example by not triggering index updates for development views.

Aleksey Kondratenko (Inactive)
added a comment - 25/Oct/12 1:52 PM

Farshid, I could not completely understand your question.

Pure dev views are not affected at all during rebalance; after all, they only cover a single vbucket. Even if that vbucket is moved, I believe it'll safely use another one.

If we're speaking about dev views with the full_set option set to true, they are no different from production views.
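The distinction drawn above is visible in the query URLs alone. The host, ddoc, and view names below are hypothetical:

```shell
base='http://10.3.3.95:8092/default/_design'
# Dev view over its small dev subset only (a single vbucket):
echo "${base}/dev_d1/_view/v1"
# Same dev view forced over the full data set -- behaves like production:
echo "${base}/dev_d1/_view/v1?full_set=true"
# Published production view:
echo "${base}/d1/_view/v1"
```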

Farshid Ghods (Inactive)
added a comment - 25/Oct/12 2:03 PM

According to the bug description, we are getting inconsistent results during rebalancing, but this does not happen for production views.

The tester also waits until the index is built before running the rebalance, and full_set is used for all view queries. If there is no difference between dev views and production views during rebalancing, then we should not be seeing different behavior.

Aleksey Kondratenko (Inactive)
added a comment - 25/Oct/12 2:08 PM

Well, there is no difference between dev views with full_set and production views. Thus it appears that this bug is genuine and needs to be fixed.

Aleksey Kondratenko (Inactive)
added a comment - 26/Oct/12 8:41 PM

Ok, I was somewhat wrong. Here's the full explanation.

As per Frank's original idea, development design docs are supposed to be used on a subset of data. People are supposed to play with map/reduce functions without throwing the entire cluster's power at building the full index. When that development process is done, people can try it on the full set a few times, which builds the full index. We expect that once this is done, people are satisfied and will publish the dev_ design document to production. Our (and stock) indexes have a nice feature where identical ddocs are represented by the same index file, so the work spent building the full index is not lost: the production design doc will be 100% identical to the dev_ design doc just indexed in full-set mode.

The key point is that the system explicitly avoids automagically building or updating that full index underneath dev_ design docs, because that would waste the cluster's resources without an explicit human request.

Thus, as part of rebalance, we do trigger and wait for index updates for all non-dev design docs, explicitly skipping development ddocs. Even then, a user should still be able to see consistent results with stale=false queries, which force an index update before returning results.

So indeed not a bug, and perhaps worth documenting.
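A sketch of the stale=false query described above (host, ddoc, and view names are hypothetical): stale=false forces the index to be brought up to date before rows are returned, which is what restores consistency for the otherwise-skipped dev ddocs.

```shell
# stale=false blocks the query until the index is up to date.
url='http://10.3.3.95:8092/default/_design/d1/_view/v1?stale=false'
echo "$url"
# curl -s "$url"   # against a live node, returns up-to-date rows
```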

Steve Yen
added a comment - 27/Oct/12 5:09 PM

Thanks, Aliaksey.

Based on Aliaksey's explanation, marking this a non-blocker and assigning to MC to document this non-obvious but "working as designed" behavior for dev views.

Farshid Ghods (Inactive)
added a comment - 29/Oct/12 2:55 PM

Please note that this behavior is only expected for development views (full_set = true and partial set). It does not apply to production views.

kzeller
added a comment - 06/Dec/12 4:48 PM

Added to RN as:

"If you are using development views, be aware that you may see inconsistent results if you query a development view during rebalance. For production views, you are able to query during rebalance and get results consistent with those you would have received if no rebalance were occurring."