Farshid Ghods (Inactive)
added a comment - 20/Nov/12 10:18 AM so this failure occurs when the other node is unavailable?
Iryna, when you reproduce this on Windows manually, does the rebalance succeed the second time?
do you see any data loss? does rebalance fail immediately?

Iryna Mironava
added a comment - 20/Nov/12 11:22 AM it does not succeed when retried within some time window; after about 15 minutes it succeeds
there is no data loss,
rebalance fails in the first 1-5 minutes of rebalancing

Aleksey Kondratenko (Inactive)
added a comment - 20/Nov/12 9:13 PM I cannot see any clear evidence of what caused this, but it looks bad enough. The 60-second timeout to query vbucket states on both nodes was hit here.
I also see that the janitor pass immediately preceding rebalance also failed, but in a different phase: its wait for vbucket change requests to complete also failed on both nodes. That timeout is 30 seconds.
I'd like a diag to be grabbed _immediately_ after rebalance fails so that I can see what the state of janitor_agent and ns_memcached is on each node, i.e. without doing any cleanup please. May I have that?

Farshid Ghods (Inactive)
added a comment - 22/Jan/13 11:25 AM Aliaksey,
I know this was discussed before, but we want to confirm what the test does against your conclusion:
we put the node behind the firewall, then wait until the node is marked as unhealthy by ns_server,
then fail over this node and click rebalance to eject it ( ejectedNodes = 10.1.3.118 )
we want to know why, when the node was already failed over from the cluster, we wait for it to be ready as part of rebalance. ( or did the timeouts happen on existing nodes ? )

Aleksey Kondratenko (Inactive)
added a comment - 22/Jan/13 1:58 PM This is quite subtle to explain fully, but the main problem is that we have to wait until failover actually completes internally, and that is subject to timeouts in certain areas.
So if you keep trying for 2-3 minutes it should eventually work.

Farshid Ghods (Inactive)
added a comment - 22/Jan/13 2:09 PM >>but main problem is we have to wait until failover actually completes internally and that is subject to timeouts in certain areas.
so when the failover REST API returns, it does not mean that the failover process has completed internally.
is failover a synchronous call or asynchronous? in case it's asynchronous, is there a way for a test to check before initiating a rebalance, so that we have a test that is more deterministic?

Aleksey Kondratenko (Inactive)
added a comment - 22/Jan/13 2:36 PM It is best-effort sync: if it fails to complete synchronously within a reasonably short timeout, it will silently become async.
There's no way to detect that right now.
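Since there is no way to detect whether failover is still completing in the background, the only client-side option is the wait-and-retry approach discussed in this thread. A minimal sketch of that workaround (the helper and its parameters are hypothetical; a real test would pass a callable that hits the rebalance REST endpoint instead of a stub):

```python
import time

def retry_until_success(operation, attempts=6, delay_secs=10):
    """Call `operation` until it succeeds or attempts are exhausted.

    Models the recommended workaround: after failing over a node,
    keep retrying rebalance for a couple of minutes, because the
    failover may still be completing asynchronously on the cluster.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except RuntimeError as err:  # e.g. {not_all_nodes_are_ready_yet}
            last_error = err
            time.sleep(delay_secs)
    raise RuntimeError(f"gave up after {attempts} attempts: {last_error}")
```

With `attempts=6` and `delay_secs=10` this covers roughly the 1-minute window the thread later settles on; both numbers are illustrative, not values taken from ns_server.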

Jin Lim
added a comment - 24/Jan/13 1:40 PM A few things came out of the engineering talk regarding this issue:
1) This is a good catch in that it confirms we need a better way of handling the API, since it cannot 100% guarantee the completion of failover.
2) However, it is not critical or a blocker, since the symptom is more obvious (highly probable) while running an automated test case.
3) The only feasible approach (per the ns_server team) for now is to wait and retry.
Based on this, and the fact that the fix would require changes across components (ep-engine, etc.), we may want to consider putting this into a future enhancement.
Assigning this to Yaseen for his input here. Please assign it back to Jin or Dipti afterwards. Thanks.

Farshid Ghods (Inactive)
added a comment - 24/Jan/13 2:13 PM Thanks, Jin, for confirming this. I think this expected behavior can be included in the release notes as well, so that users and the support team are aware of the issue and the suggested workaround.
Andrei,
can you then modify the test accordingly?

Farshid Ghods (Inactive)
added a comment - 28/Jan/13 2:03 PM per bug scrub:
please revise the test accordingly, and after running the test a few times can you propose how long a customer should wait before kicking off the rebalance again?

Farshid Ghods (Inactive)
added a comment - 11/Feb/13 12:24 PM Karen,
could you please add to the release notes that the user needs to wait 30 seconds before attempting a rebalance operation after failing over a node?
comments on this bug should explain under what conditions this 30-second delay is needed

Jin Lim
added a comment - 11/Mar/13 4:20 PM - edited The failover REST API is a sync operation with a timeout. When it fails to complete the failover process within the timeout period, it internally switches to an async operation (continuing the failover to completion) and immediately returns. A subsequent rebalance in this case will fail because the failover process is still running. The user can wait between 30 seconds and a minute and reattempt the rebalance.
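The semantics described above — synchronous with a timeout, silently falling back to asynchronous completion — can be sketched generically. This is an illustration of the pattern only, not ns_server code; the function name and return shape are made up for the example:

```python
import threading

def run_sync_with_timeout(task, timeout_secs):
    """Run `task` in a worker thread and wait up to `timeout_secs`.

    Returns ("completed", result_holder) if the task finished in time,
    or ("continuing_async", thread) if it is still running. In the
    latter case the task keeps executing in the background, just as
    the failover REST call keeps failing over after it has returned.
    """
    result = {}

    def worker():
        result["value"] = task()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_secs)
    if thread.is_alive():
        return "continuing_async", thread
    return "completed", result
```

The caller of such an API cannot distinguish the two outcomes unless the return value exposes them, which is exactly why the REST client has no way to tell that failover is still in progress.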

kzeller
added a comment - 11/Mar/13 5:34 PM Added to RN as :
<para>
A cluster rebalance may exit and produce the error
{not_all_nodes_are_ready_yet} if you perform the rebalance right
after failing over a node in the cluster. You may need to
wait 30 seconds after the node failover before you
attempt the cluster rebalance.
</para>
<para>This is because the failover REST API is a synchronous operation with a timeout. If it fails to
complete the failover process by the timeout, the operation internally switches into an
asynchronous operation: it immediately returns and continues the failover in the background, which will cause
rebalance to fail since the failover operation is still running.</para>

Jin Lim
added a comment - 10/Apr/13 3:55 PM Thanks for the update, Andrei. Before we update the RN, let's first figure out how long a user should wait before retrying the rebalance. Can we upgrade the wait period to 1 minute and see if that is long enough? Thanks.

Andrei Baranouski
added a comment - 11/Apr/13 5:44 AM set the timeout to 1 min: http://review.couchbase.org/#/c/25587/
before that we had a timeout of 60 sec; the tests do not fail
http://www.couchbase.com/issues/browse/MB-7168?focusedCommentId=49128&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-49128

kzeller
added a comment - 08/May/13 5:06 PM Ok, changed to : A cluster rebalance may exit and produce the error
{not_all_nodes_are_ready_yet} if you perform the rebalance right
after failing over a node in the cluster. You may need to
wait 60 seconds after the node failover before you
attempt the cluster rebalance.
in RN 2.0.2

Anil Kumar
added a comment - 10/May/13 1:11 PM discussed with ALK; this bug will be fixed in 2.1. talked about having a UI alert, but since this happens only when a node dies completely, for now a Release Note as a Known Issue should be fine. thanks

Aleksey Kondratenko (Inactive)
added a comment - 21/Aug/13 5:04 AM We've discussed this as part of scrum planning.
The thinking is that the upr changes have the best chance of addressing this. Otherwise it is hard, and too late for 2.2.0.

Aleksey Kondratenko (Inactive)
added a comment - 26/Aug/13 1:52 PM Removing this from our backlog. Given that this was moved out of 2.2.0 and some upr work is planned for 3.0 (which will change a lot in this area), there's nothing we can do at this time; we'll wait until the upr work gets clearer.

kzeller
added a comment - 26/Aug/13 5:46 PM Added as Known issue for 2.2 in RN.
* A cluster rebalance may exit and produce the error {not_all_nodes_are_ready_yet} if you perform the rebalance right after failing over a node in the cluster. You may need to wait 60 seconds after the node failover before you attempt the cluster rebalance.
This is because the failover REST API is a synchronous operation with a timeout. If it fails to complete the failover process by the timeout, the operation internally switches into an asynchronous operation: it immediately returns and continues the failover in the background, which will cause rebalance to fail since the failover operation is still running.
*Issues* : [ MB-7168 ]( http://www.couchbase.com/issues/browse/MB-7168 )