Description

In a WAN configuration, ZooKeeper endlessly elects, terminates, and re-elects a leader. The configuration involves two groups: a central DC group of ZK servers with voting weight 1, and a group of servers in remote pods with voting weight 0.

What we expect to see is leaders elected only in the DC, with the pods containing only followers. What we are seeing instead is a continuous cycling of leaders. We have seen this consistently with 3.2.0, with 3.2.0 plus the recommended patches (473, 479, 481, 491), and now with release 3.2.1.
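
For reference, a minimal sketch of the kind of hierarchical quorum configuration being described (hostnames, ports, and server ids here are hypothetical; the group/weight syntax is the one documented in the Administrator guide):

server.1=dc-host1:2888:3888
server.2=dc-host2:2888:3888
server.3=dc-host3:2888:3888
server.4=pod-host1:2888:3888
server.5=pod-host2:2888:3888
# central DC group: full voting weight
group.1=1:2:3
weight.1=1
weight.2=1
weight.3=1
# remote pod group: zero weight, so these servers should never be elected leader
group.2=4:5
weight.4=0
weight.5=0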

Benjamin Reed
added a comment - 10/Aug/09 22:48 +1, looks good. When setting the stop flags you should really do an interrupt to wake up the wait, but that will cause a message to be printed to stdout. I'll open another JIRA to fix that.
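
For illustration, a minimal sketch of the stop-flag plus interrupt pattern being discussed (this is not the actual FLE code; the class and member names are assumptions):

class Worker extends Thread {
    private volatile boolean stop = false;

    public void run() {
        while (!stop) {
            synchronized (this) {
                try {
                    wait(); // blocks until notified or interrupted
                } catch (InterruptedException e) {
                    // expected during shutdown; the loop re-checks the stop flag
                }
            }
        }
    }

    public void shutdown() {
        stop = true;
        interrupt(); // wakes the wait() so the thread notices the flag promptly
    }
}

Without the interrupt, a thread parked in wait() only notices the flag on its next wakeup; with it, shutdown is prompt, at the cost of the exception noise mentioned above.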

Patrick Hunt
added a comment - 10/Aug/09 20:11 Looks good to me. I tested a number of quorum configs (2/3/4/5/7/9) with various weights/groups (granted, not exhaustive), and the quorum always formed. I was able to connect a client, and I also tried stopping/starting servers to ensure they rejoin the quorum.
Also, the logging seems better: no more errors for things that are not errors.

Patrick Hunt
added a comment - 07/Aug/09 21:09 Todd, the basic process from our end is that you should enter a JIRA and attach a patch. If you are contributing outside core, then you probably want to add your stuff to src/contrib:
http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute

Todd Greenwood-Geer
added a comment - 07/Aug/09 20:57 Patrick,
Thanks for the update. I'm closely following the dev alias and appreciate the effort the ZK team is putting in. For the time being, I'll stick with 3.1.1 and solve our WAN issues with an ensemble synchronizer. I'm in the middle of writing that bit right now.
BTW, should I succeed in convincing my company to allow me to open source various components that I've written (on top of ZooKeeper), what is the process for that?
-Todd

Flavio Junqueira
added a comment - 07/Aug/09 13:29 This patch includes documentation and a reimplementation of HierarchicalQuorumTest.
The new implementation of HierarchicalQuorumTest is based on QuorumBase, the main differences being that HQT uses hierarchical quorums and FLE for leader election. It uses testHammerBasic of ClientTest to verify that, upon the election of a leader, the ensemble works as expected.
When I initially implemented the test, it was failing to terminate because FLE was failing to shut down properly. I made some modifications to FLE to make sure that it shuts down correctly.

Mahadev konar
added a comment - 06/Aug/09 18:51 Flavio, can you include the example in the forrest docs? It would be good for folks using it. It gets quite confusing when using flexible quorums; examples/docs should help.
thanks

Flavio Junqueira
added a comment - 06/Aug/09 03:22 I have generated a patch for this issue. I verified that I didn't do the correct checks in ZOOKEEPER-491, so I have tried to fix them in this patch. I have also modified the test to fix the problem with the fail assertion, and I have inspected the logs to see whether it is behaving as expected. I can see no problem at this time with this patch.
If someone else is interested in checking it out, please do.

Flavio Junqueira
added a comment - 06/Aug/09 03:18 Pat, we have a description of how to configure this in the "Cluster options" section of the Administrator guide. We are missing an example, which is in the source code as you point out.

Patrick Hunt
added a comment - 05/Aug/09 22:33 There are docs in the source code that provide good low-level detail on the flex quorum implementation.
HOWEVER, there are NO docs in the Ops guide detailing user-level flex quorum operation.
We need to add docs (as part of this fix) to forrest detailing how to operate/troubleshoot/debug flex quorums.

Patrick Hunt
added a comment - 05/Aug/09 22:24 Todd, I did see an issue with your config. It's not:
group.1:1:2:3
rather it's:
group.1=1:2:3
(it should be =, not :)
Regardless, even after I fix this it's still not forming a cluster properly; we're still looking.

Patrick Hunt
added a comment - 05/Aug/09 22:21 Please fix the following as well: incorrect logging levels are being used in the quorum code. Example:
2009-08-05 15:17:02,733 - ERROR [WorkerSender Thread:QuorumCnxManager@341] - There is a connection for server 1
2009-08-05 15:17:02,753 - ERROR [WorkerSender Thread:QuorumCnxManager@341] - There is a connection for server 2
This is INFO, not ERROR.
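
A sketch of the kind of one-line fix being asked for (log4j-style; the exact call site inside QuorumCnxManager is an assumption):

// before: normal operation reported as an error
LOG.error("There is a connection for server " + sid);
// after: the same message at the severity it warrants
LOG.info("There is a connection for server " + sid);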

Patrick Hunt
added a comment - 05/Aug/09 17:03 I attached zk498-test.tar.gz; this is a 5-server config (2 zero-weight) that fails to achieve quorum.
Run start.sh/stop.sh and check the individual logs for details.

Patrick Hunt
added a comment - 05/Aug/09 16:59 Looks to me like 0 weight is still busted; fle0weighttest is actually failing on my machine, but it's reported as a success:
------------- Standard Error -----------------
Exception in thread "Thread-108" junit.framework.AssertionFailedError: Elected zero-weight server
at junit.framework.Assert.fail(Assert.java:47)
at org.apache.zookeeper.test.FLEZeroWeightTest$LEThread.run(FLEZeroWeightTest.java:138)
------------- ---------------- ---------------
This is probably because the test is calling assert in a thread other than the main test thread, which junit will not track/know about.
One problem I see with these tests (the 0weight test is the one I looked at): they don't have a client attempt to connect to the various servers as part of declaring success. Really, we should only consider the test successful (i.e. assert that) if a client can connect to each server in the cluster and make/see changes. As part of fixing this we really need to do a sanity check by testing the various command lines and checking that a client can connect.
I'm not even sure FLEnewepochtest/fletest/etc. are passing either; new epoch seems to just thrash...
Also, I tried 3- and 5-server quorums by hand from the command line with 0 weight, and they show issues similar to what Todd is seeing.
this is happening for me on both the trunk and 3.2 branch source.
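
For illustration, one common way to surface worker-thread assertion failures to JUnit (a sketch only; leaderWeight and the surrounding test structure are hypothetical): record the failure and re-raise it from the main test thread after join().

import java.util.concurrent.atomic.AtomicReference;
import junit.framework.Assert;

// inside a JUnit TestCase subclass:
public void testZeroWeightNotElected() throws Exception {
    final AtomicReference<Throwable> failure = new AtomicReference<Throwable>();
    final long leaderWeight = 1; // stand-in for the weight of the elected leader
    Thread worker = new Thread(new Runnable() {
        public void run() {
            try {
                Assert.assertTrue("Elected zero-weight server", leaderWeight > 0);
            } catch (Throwable t) {
                failure.set(t); // a throw here never reaches the JUnit runner
            }
        }
    });
    worker.start();
    worker.join();
    if (failure.get() != null) {
        Assert.fail("worker thread failed: " + failure.get());
    }
}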