Details

Description

ASyncHammerTest is failing intermittently on hudson trunk. There is no clear reason why this is happening, but
it seems from the logs that a session connection to a follower is failing during session establishment - the
failure seems to be a problem either on the follower or leader. The server gets the session create request, but
it stalls in the request processor pipeline. (we see it go in, but we do not see it com eout)

unfortunately all efforts to reproduce this on non-hudson trunk have failed. Even trying to reproduce by
running on hudson host itself (manually) has failed.

We need to instrument the client session creation code in the test to dump the thread stack if the
session creation fails.

It sounds like the createSession request goes as far as ToBeAppliedProcessor, but it doesn't make it to FinalRequestProcessor. If my observation is correct, I think it is getting lost between the two. Is that possible?

Flavio Junqueira
added a comment - 01/Dec/09 15:07 I have also been able to verify that createSession operations that do not complete are exiting FinalRequestProcessor.processRequest here (around line 138):
if (request.cnxn == null) {
return;
}
which is executed before the switch/case block that would finalize the operation.

according to the latest log the commit processor thread is exiting. I notice that we are not logging exceptions from that thread. We should include logging the exception as part of this fix. Really we need to add to the ThreadGroup – handle uncaught exceptions – log them at error level

Patrick Hunt
added a comment - 01/Dec/09 19:05 according to the latest log the commit processor thread is exiting. I notice that we are not logging exceptions from that thread. We should include logging the exception as part of this fix. Really we need to add to the ThreadGroup – handle uncaught exceptions – log them at error level

It is ok to have cnxn = null in FinalRequestProcessor. For example, if a follower is forwarding a request, cnxn will be null for the leader. The problem, as Pat points out, seems to be that CommitProcessor is exiting at the follower that was supposed to finalize it.

Here is the stack trace from a faulty run:

java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:350)
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1065)
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:360)
at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

Flavio Junqueira
added a comment - 01/Dec/09 19:13 It is ok to have cnxn = null in FinalRequestProcessor. For example, if a follower is forwarding a request, cnxn will be null for the leader. The problem, as Pat points out, seems to be that CommitProcessor is exiting at the follower that was supposed to finalize it.
Here is the stack trace from a faulty run:
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:350)
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1065)
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:360)
at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

Mahadev konar
added a comment - 04/Dec/09 03:27 looks like the build is normal now.
http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/571/
ill let the hudson trunk build run a few more times and then close this jira.

Patrick Hunt
added a comment - 04/Dec/09 18:08 this is still happening, but in another area:
http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/577/testReport/org.apache.zookeeper.test/AsyncHammerTest/testObserversHammer/
It is good to have the handler here!
2009-12-04 13:46:13,493 - ERROR [CommitProcessor:0:CommitProcessor@146] - Unexpected exception causing CommitProcessor to exit
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:64)
at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:378)
at org.apache.zookeeper.server.NIOServerCnxn.sendCloseSession(NIOServerCnxn.java:349)
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:356)
at org.apache.zookeeper.server.quorum.Leader$ToBeAppliedRequestProcessor.processRequest(Leader.java:532)
at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

Mahadev konar
added a comment - 04/Dec/09 23:38 this patch fixes the 2 latest failuires we saw:
http://bugs.sun.com/view_bug.do?bug_id=6427854 , a bug in the jvm which is fixed by a selector.open.close() call in a static block in NIOServerCnxn (thanks ben for the idea)
the other cancelled key exception is fixed by wrapping sendbuffer with try catch for exceptions.
i looked for any other calls that can cause commitprocessor to fail in this way and could not find anything else. hopefully this is the last of such problems.