[longevity] rebalance fails with error "Got error while trying to send close confirmation: {error,enotconn} during rebalance" when adding a failed over node back to the cluster and failing over another node

Here are the messages from this process. You can see that for some reason there is no open checkpoint on the source node (the current master) and, most strangely, we see backfill for completely unrelated vbuckets. That may be why the destination closes the socket.

Aleksey Kondratenko (Inactive) added a comment - 16/Oct/12 6:21 PM

Chiyoung, I think this probably happened because for some reason we actually passed an empty vbucket filter to the TAP start command. If that implies that ep-engine is going to send us all vbuckets rather than none, please assign this to me so that we can fix it on the ns_server side.

Chiyoung Seo added a comment - 16/Oct/12 7:00 PM

Yes, if the vbucket filter is empty, it means all the vbuckets. As you probably know, this behavior has existed since 1.6.x.

I'd like to change this definition in ep-engine, but it might require some changes in the TAP client applications that our customers have developed for their own internal purposes.

If it doesn't require much work in ns-server, please go ahead and fix it on the ns-server side. Otherwise, please reassign it to me and I will work with Steve's team to see what the impact would be on our existing customers.
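As a minimal sketch of the ns_server-side fix being discussed, the guard below refuses to start a replication stream when the vbucket list is empty, so ep-engine's "empty filter means all vbuckets" behavior can never be triggered by accident. The module and function names are invented for illustration; this is not the actual ns_server code.

%% Hypothetical guard (names invented for illustration, not actual ns_server
%% code): refuse to start a replication stream with an empty vbucket filter,
%% since ep-engine treats an empty filter as "stream all vbuckets".
-module(tap_filter_guard).
-export([maybe_start_replication/2]).

maybe_start_replication(_DstNode, []) ->
    %% An empty filter would silently replicate every vbucket; fail fast instead.
    {error, empty_vbucket_filter};
maybe_start_replication(DstNode, VBuckets) when is_list(VBuckets) ->
    {ok, {DstNode, lists:usort(VBuckets)}}.

With such a guard, maybe_start_replication('ns_1@10.3.121.16', []) returns {error, empty_vbucket_filter} instead of quietly building replicas for unrelated vbuckets, which is the symptom described above.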

[ns_server:error,2012-08-30T13:47:33.525,ns_1@10.3.121.25:'capi_set_view_manager-saslbucket':capi_set_view_manager:do_apply_vbucket_states:133]Failed to apply index states for the following ddocs:
[{<<"_design/d11">>,
{'EXIT',
{{badmatch,{not_found,no_db_file,
[{couch_db_set,'-handle_call/3-fun-1-',3},
{lists,foldl,3}

This was caused by the empty vbucket filter for the TAP producer "replication_building_30_'ns_1@10.3.121.16'". If ns-server passes an empty vbucket filter to a given TAP producer, the TAP producer assumes all vbuckets and iterates over them one by one.

The expected vbucket filter for the above producer should contain vbucket 30 only.

Chiyoung Seo added a comment - 18/Oct/12 2:18 PM

{<<"vb_30:persisted_checkpoint_id">>,<<"0">>},
{<<"vb_30:checkpoint_extension">>,<<"false">>},
{<<"vb_30:num_items_for_persistence">>,<<"0">>},
{<<"vb_30:num_checkpoints">>,<<"1">>},
{<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
{<<"vb_30:num_checkpoint_items">>,<<"1">>},
{<<"vb_30:num_tap_cursors">>,<<"0">>},
{<<"vb_30:last_closed_checkpoint_id">>,<<"0">>},
{<<"vb_30:open_checkpoint_id">>,<<"0">>},
{<<"vb_30:state">>,<<"active">>}

These checkpoint stats are interesting. We don't expect an active vbucket to have 0 as its open checkpoint id. We reset the open checkpoint id to 0 for a replica vbucket right before it receives backfill items from the producer. After backfill is completed, the replica vbucket receives the positive open checkpoint id from the producer. Then we change the vbucket state from replica to active iff this is a vbucket takeover.
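To make that sequence easier to follow, here is a toy Erlang model of the transitions described above. It is a sketch under stated assumptions (the module, map keys, and function names are invented), not ep-engine code; under it, an active vbucket should never end up with an open checkpoint id of 0, which is why the stats above look anomalous.

%% Toy model (hypothetical, not ep-engine source) of the checkpoint-id and
%% state transitions described in the comment above, for a replica vbucket
%% represented as a map.
-module(vb_checkpoint_model).
-export([start_backfill/1, complete_backfill/3]).

%% Right before the replica receives backfill items from the producer,
%% its open checkpoint id is reset to 0.
start_backfill(VB) ->
    VB#{open_checkpoint_id => 0, state => replica}.

%% After backfill completes, the replica takes the producer's positive open
%% checkpoint id and becomes active only if the stream was a takeover.
complete_backfill(VB, ProducerCkptId, Takeover) when ProducerCkptId > 0 ->
    VB1 = VB#{open_checkpoint_id => ProducerCkptId},
    case Takeover of
        true  -> VB1#{state => active};
        false -> VB1
    end.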

4. Janitor_agent does "undoing temp vbucket state"; note that vbucket 30 was in the active partition as shown below, but it was moved to the cleanup partition, which caused the persisted_checkpoint_id change. See the log message in the comment above.

Liang Guo (Inactive) added a comment - 19/Oct/12 4:48 PM

To sum it up:

1. There was replication building for vbucket 30 from node_25 to node_13, node_16, and node_22.
2. Backfill of vbucket 30 to node_13 completed.
3. Rebalance was somehow stopped before it even reached the takeover phase.
4. At this point, node_25's diag showed vbucket 30 with persisted_checkpoint_id = "98" and open_checkpoint_id = "99".
5. Some couch_set_view call set the partition state with cleanup_partitions = {30}.
6. Now node_25's diag showed vbucket 30's persisted_checkpoint_id had become "0".
7. Node_13 left the cluster, which resulted in a config change that caused replication building from node_25 to node_16 and node_22 again.
8. Vbucket 30 was not ready on node_25 (checkpoint id = "0") to replicate.

Talked to Chiyoung. For a vbucket checkpoint id to become 0, there must be either 1) a vbucket deletion or 2) an initial vbucket stream for backfill. However, an initial vbucket stream to node_25 does not make sense because it is the current master. We need to understand whether there was a vbucket deletion by couch_set_view.
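Since the timeline above ends with vbucket 30 "not ready to replicate" (an active copy reporting an open checkpoint id of 0), a small helper along the following lines could scan a node's checkpoint stats for that condition. It is only a sketch against the stat-key format shown in the 18/Oct comment; the module and function names are invented and it is not part of ns_server or ep-engine.

%% Hypothetical helper: scan checkpoint stats, in the proplist form shown in
%% the 18/Oct comment, for vbuckets that are "active" yet report an open
%% checkpoint id of 0.
-module(ckpt_anomaly).
-export([suspicious_vbuckets/1]).

suspicious_vbuckets(Stats) ->
    Active   = [vb_of(K, <<":state">>)
                || {K, <<"active">>} <- Stats, has_suffix(K, <<":state">>)],
    ZeroCkpt = [vb_of(K, <<":open_checkpoint_id">>)
                || {K, <<"0">>} <- Stats, has_suffix(K, <<":open_checkpoint_id">>)],
    [VB || VB <- Active, lists:member(VB, ZeroCkpt)].

%% Strip a known suffix from a stat key, leaving the "vb_N" prefix.
vb_of(Key, Suffix) ->
    binary:part(Key, 0, byte_size(Key) - byte_size(Suffix)).

has_suffix(Key, Suffix) ->
    Size = byte_size(Suffix),
    byte_size(Key) >= Size andalso
        binary:part(Key, byte_size(Key) - Size, Size) =:= Suffix.

For the stats listed earlier, suspicious_vbuckets([{<<"vb_30:state">>,<<"active">>}, {<<"vb_30:open_checkpoint_id">>,<<"0">>}]) returns [<<"vb_30">>].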

Liang Guo (Inactive) added a comment - 19/Oct/12 4:55 PM

Hi Filipe,

I tried to understand what couch_set_view does for cleanup_partitions = {30}, but the Erlang code there completely puzzles me. So I want to ask you: what exactly does couch_set_view_group do with a vbucket (30) in cleanup_partitions[]? Is there any vbucket deletion? Please see my comments above. Feel free to assign it back to me.

Thanks,
Liang

Liang Guo (Inactive) added a comment - 23/Oct/12 4:31 PM

I have narrowed it down to the tiny window in which vbucket 30's open checkpoint id went from 99 to 0. There are no additional traces in the ns_server log for me to dig into, and no memcached or ns_server log shows any vbucket state change on node_25. I don't know what else I can do with this.

Farshid Ghods (Inactive) added a comment - 26/Oct/12 3:58 PM

Aliaksey/Liang,
Given that we did not see this rebalance failure in system testing, do you think this was a one-time issue that is very rare to occur?

Aleksey Kondratenko (Inactive) added a comment - 26/Oct/12 4:11 PM

It is possible that one of the recent ep-engine fixes closed this. Hard to say. We would still like to understand it. Rareness is, IMHO, not a good argument.

Here are my observations. Node 23 is the master node of the cluster. At some point during rebalance it decides to build replicas of vbucket 30 on nodes 13, 16, and 22. The current owner of this vbucket is node 25:

[rebalance:debug,2012-08-29T19:11:20.428,ns_1@10.3.121.23:<0.31805.466>:ns_single_vbucket_mover:mover_inner_old_style:190]child replicas builder for vbucket 30 is <0.31941.466>
[ns_server:debug,2012-08-29T19:11:20.436,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 30 into 'ns_1@10.3.121.13' is <15197.23780.3>
[ns_server:debug,2012-08-29T19:11:20.447,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 30 into 'ns_1@10.3.121.16' is <15202.29015.2>
[ns_server:debug,2012-08-29T19:11:20.453,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 30 into 'ns_1@10.3.121.22' is <15205.9270.2>

The destination nodes start ebucketmigrators for replica building. That means that at this particular moment in time node 25's vbucket 30 still has a non-zero checkpoint id:

As part of the code that handles rebalance completion (for any reason), we immediately log all tap and checkpoint stats on all the nodes. In these stats on node 25 we can see that vbucket 30 already has a checkpoint id of zero:

I could not find anything abnormal from the ns_server perspective during that period of time, so it is probably something wrong in ep-engine. More logging from ep-engine would help here; in particular, it would be good if vbucket resets were logged.

Farshid Ghods (Inactive) added a comment - 29/Oct/12 3:28 PM

Chiyoung,
Do you require more information from QE to proceed? In that case, please assign the bug back to Tony.
Also, this issue is not easily reproduced in system testing, so I'm not sure whether Tony can reproduce it and collect the new logs; in that case, would you be okay with deferring this bug for now?

Chiyoung Seo added a comment - 29/Oct/12 4:21 PM

This issue looks too strange to me. When the rebalance is stopped by the user, the current master should still hold the active vbucket 30 with a non-zero open checkpoint id. The open checkpoint id is reset to zero iff the vbucket is reset by receiving the backfill stream from the master, or when it is deleted and recreated with a non-active state.

I will add more logs to ep-engine, but I don't see anything suspicious in ep-engine at this time. We will see if this issue happens again in the 2.0 full regressions. I'm fine with moving this to post-2.0 given that it's not easily reproducible.

Chiyoung Seo added a comment - 20/Nov/12 1:29 PM

We spent a lot of time debugging this issue but were not able to reproduce it. We have added more logging to ep-engine. Let's open a new bug if we hit this issue again.