cancelElection fails on uninitialized ElectionContext

Description

I had a Solr collection that was effectively out of memory (no exception, just continuous 80-90 second full GCs). This is of course not a good state, but while in it, every time you come out of a GC your ZooKeeper session has expired, causing all kinds of havoc. Anyway, I found a bug in the condition where, if you get a ZooKeeper error during LeaderElector.setup(), LeaderElector.context gets set to a context that is not fully initialized (i.e. hasn't called joinElection..

Anyway, once this happens the node can no longer attempt to join elections, because every time the LeaderElector attempts to call cancelElection() on the previous ElectionContext..

Some logs below, and I've attached a patch that does the following:

Move the setting of LeaderElector.context in the setup call to the end of the call, so it is only set if the setup completes.

Add a check to see if leaderSeqPath is null in ElectionContext.cancelElection.

Make leaderSeqPath volatile, as it is being directly accessed by multiple threads.

Set LeaderElector.context = null when joinElection fails.

There may be other issues; the patch is focused on breaking the failure loop that occurs when initialization of the ElectionContext fails.
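To make the intent of those changes concrete, here is a minimal, self-contained sketch of the pattern. It is not the attached patch and does not use the real Solr or ZooKeeper classes: SketchElectionContext, SketchLeaderElector, tryJoin, the zkFails flag and the deleted list are all hypothetical stand-ins, and the control flow (cancel the previous context, then set up the new one) is an assumption based on the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ElectionContext; NOT the real Solr class.
class SketchElectionContext {
    // volatile: written by the thread that joins, read by whichever thread cancels.
    volatile String leaderSeqPath;

    // Records "deleted" paths; stands in for removing the election node in ZooKeeper.
    final List<String> deleted = new ArrayList<>();

    void joinElection(String seqPath) {
        leaderSeqPath = seqPath;
    }

    void cancelElection() {
        String path = leaderSeqPath; // read the volatile field once
        if (path == null) {
            return; // never joined an election, so there is nothing to clean up
        }
        deleted.add(path);
        leaderSeqPath = null;
    }
}

// Hypothetical stand-in for LeaderElector; NOT the real Solr class.
class SketchLeaderElector {
    volatile SketchElectionContext context;

    // zkFails simulates a ZooKeeper error during setup (assumption for illustration).
    void setup(SketchElectionContext newContext, boolean zkFails) {
        if (zkFails) {
            throw new IllegalStateException("simulated ZooKeeper error");
        }
        // Assign last, so the field only ever holds a context whose setup completed.
        context = newContext;
    }

    void tryJoin(SketchElectionContext newContext, String seqPath, boolean zkFails) {
        SketchElectionContext previous = context;
        if (previous != null) {
            previous.cancelElection();
        }
        try {
            setup(newContext, zkFails);
            newContext.joinElection(seqPath);
        } catch (RuntimeException e) {
            context = null; // do not keep a half-initialized context around
            throw e;
        }
    }
}

class ElectionSketchDemo {
    public static void main(String[] args) {
        SketchLeaderElector elector = new SketchLeaderElector();
        try {
            elector.tryJoin(new SketchElectionContext(), "/election/n_0", true);
        } catch (IllegalStateException e) {
            System.out.println("first attempt failed: " + e.getMessage());
        }
        // The retry is not blocked by the failed attempt.
        elector.tryJoin(new SketchElectionContext(), "/election/n_1", false);
        System.out.println("joined at " + elector.context.leaderSeqPath);
    }
}
```

In this model, a failed setup leaves the elector with no context at all, and cancelElection() on a context that never joined is a no-op, so a later attempt can proceed instead of failing on the stale context each time.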

Shalin Shekhar Mangar added a comment - 30/Jun/14 19:25

Added a new test called TestLeaderElectionZkExpiry which expires zk sessions in a tight loop. This test is able to reproduce the problem reported in this issue 6 out of 10 times on my machine. The patch given by Steven fixes the problems.