Aleksey Kondratenko (Inactive)
added a comment - 23/May/12 12:24 PM Good find. We now have evidence that one of the previously firewalled and failed-over nodes is being (seemingly gradually) un-firewalled in the middle of rebalance, after the rest of the cluster has already ejected it. But this node only discovers it was ejected 1 minute after its replicator was able to push-replicate to one of the existing nodes. We're really starting to hit the limits of our naive cluster orchestration approach.

Aleksey Kondratenko (Inactive)
added a comment - 23/May/12 12:25 PM - edited I think a reasonably simple treatment (still partial and naive) is to never restart replication automatically and instead wait until the janitor restarts it. It has some potential data safety implications, though: the janitor is really conservative in some cases, so it will not restart replications that previously were restarted automagically. So I'm not sure.
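
A rough sketch of that idea, for illustration only. This is Python rather than the actual cluster manager code, and every name in it (`ReplicationGate`, `on_replicator_exit`, `janitor_pass`, the node names) is hypothetical. It assumes replication is tracked per (source, destination) pair and that only a janitor pass may start it. The return value shows the data safety concern: pairs the janitor does not approve stay without replication.

```python
from typing import Iterable, Set, Tuple

# (source node, destination node); hypothetical naming for this sketch.
Pair = Tuple[str, str]


class ReplicationGate:
    """Hypothetical gate: a stopped replicator stays stopped until the janitor restarts it."""

    def __init__(self) -> None:
        self.running: Set[Pair] = set()
        self.stopped: Set[Pair] = set()

    def on_replicator_exit(self, pair: Pair) -> None:
        # Deliberately no automatic restart: only record that it stopped.
        self.running.discard(pair)
        self.stopped.add(pair)

    def janitor_pass(self, approved: Iterable[Pair]) -> Set[Pair]:
        # Only the janitor may (re)start replication, and only for approved pairs.
        for pair in approved:
            self.stopped.discard(pair)
            self.running.add(pair)
        # Whatever is left here remains unreplicated: the data safety risk.
        return set(self.stopped)


gate = ReplicationGate()
gate.janitor_pass([("n1", "n2"), ("n1", "n3")])

# Both replicators die, e.g. because a node was firewalled.
gate.on_replicator_exit(("n1", "n2"))
gate.on_replicator_exit(("n1", "n3"))

# A conservative janitor approves only one of them.
left_stopped = gate.janitor_pass([("n1", "n2")])
```

With automatic restart, both pairs would have come back on their own. Here `("n1", "n3")` stays stopped until some later janitor pass approves it.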

Aleksey Kondratenko (Inactive)
added a comment - 29/May/12 8:35 PM My understanding is that the probability of hitting this in practice approaches zero. We have had this issue since 1.6.0, yet nobody has reported this problem.

Farshid Ghods (Inactive)
added a comment - 29/May/12 8:37 PM This is not about abusing the firewall. It's about a node coming back up or re-appearing after it has been failed over.
If this is purely due to the firewall, then it's OK to defer this.

Farshid Ghods (Inactive)
added a comment - 29/May/12 8:56 PM The firewall is our way of simulating a node disappearing and re-appearing. We can also simulate that by shutting down the network interface or pulling the network cable, if it helps.

Aleksey Kondratenko (Inactive)
added a comment - 29/May/12 9:03 PM I have evidence that you're lifting the firewall in a very specific way. In particular, memcached traffic is re-enabled first, and then minutes later you re-enable Erlang traffic.