Details

Description

Let's say we have 100 clients (group A) already connected to a three-node ZK ensemble with a session timeout of 15 seconds, and 1000 clients (group B) already connected to the same ensemble, all watching several nodes (also with a 15-second session timeout).

Consider a case in which all clients in group B suddenly hang or deadlock (e.g. a JVM OOME) at the same time. 15 seconds later, all sessions in group B expire, creating a session-closing stampede. Depending on the number of clients in group B, every request/response the ZK ensemble processes can be delayed by up to 8 seconds (with the 1000 clients we tested).

This delay causes some clients in group A to have their sessions expired because heartbeat responses arrive late, which in turn makes healthy servers drop out of their clusters. This is a serious problem in our installation, since some of our services run batch or CI servers that recreate the scenario above almost every day.

I am attaching a graph showing ping response time delay.

I don't think the ordering of session creation/closing relative to ping exchanges matters (for the quorum state machine). At the very least, ping requests/responses should be handled independently (on a different queue and a different thread) to keep pings real-time.
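A minimal sketch of the separation being suggested (illustrative names only, not ZooKeeper's actual classes): pings drain from their own queue on their own thread, so a backlog of session-close work cannot delay heartbeats.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a dedicated queue/thread for pings so that
// expensive session-close work cannot delay heartbeat responses.
public class PingDispatcher {
    private final BlockingQueue<Runnable> pingQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<Runnable> sessionQueue = new LinkedBlockingQueue<>();

    public PingDispatcher() {
        // One thread drains only pings; a separate thread drains
        // create/close-session work. A stampede of closes backs up
        // sessionQueue without touching pingQueue.
        startDrainer(pingQueue, "ping-handler");
        startDrainer(sessionQueue, "session-handler");
    }

    private void startDrainer(BlockingQueue<Runnable> q, String name) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    q.take().run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setDaemon(true);
        t.start();
    }

    public void submitPing(Runnable r)    { pingQueue.add(r); }
    public void submitSession(Runnable r) { sessionQueue.add(r); }
}
```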

As a workaround, we are raising the session timeout to 50 seconds.
But this significantly increases the cluster's maximum failover time, so the QoS we originally promised can no longer be met.

We have monitored the ZK ensemble servers. There is no significant I/O activity; this workload is CPU-bound. We see 0-3% iowait (normal system I/O from log flushes and the page-cache flush daemon), while user CPU rises to 30-40% during the closing stampede.


Chang Song
added a comment - 16/Apr/11 06:28
1) are you running in a virtualized environment?
NO
2) are you co-locating other services on the same host(s) that make up
the ZK serving cluster?
NO. Dedicated ZK ensemble
3) have you followed the admin guide's "things to avoid"?
http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
In particular ensuring that you are not swapping or going into gc
pause (both on the server and the client)
a) try turning on GC logging and ensure that you are not going into GC
pause, see the troubleshooting guide, this is the most common cause of
high latency for the clients
No full GC (jstat output)
b) ensure that you are not swapping
No swapping. No significant IO (0-3% iowait utilization)
c) ensure that other processes are not causing log writing
(transactional logging) to be slow.
No other processes are running on these hosts.


Chang Song
added a comment - 16/Apr/11 06:35
1) what is the connectivity like btw the servers?
What is the ping time btw them?
ping time is under 1ms (inter-IDC)
Is the system perhaps loading down the network during this test,
causing network latency to increase? Are all the nic cards (server and
client) configured correctly? I've seen a number of cases where
clients and/or server had incorrectly configured nics (ethtool
reported 10 MB/sec half duplex for what should be 1gigeth)
They all are configured Full-Duplex 1Gbps NIC.
Switches are all operational.
2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
issue is happening, what's the %util of the disk? what's the iowait
look like?
0-3% at the time of testing. It varies. There is no significant IO
from Zookeeper process. Just normal page cache flush activity.

Chang Song
added a comment - 16/Apr/11 06:41
How to reproduce:
1. Connect 50 clients to the ZK ensemble, each watching a node with a 15-second session timeout (call this group A)
2. Let 1000 clients in another group (group B) connect to the ZK ensemble, then hang their JVMs (or intentionally trigger an OOM), again with a 15-second session timeout
3. Watch the clients in group A drop out of the cluster due to ping delay (session expired)

Chang Song
added a comment - 16/Apr/11 06:45
What about using multiple ZK clusters for this, then?
Sorry, we cannot do that due to service deployment policy
Have you tested a cluster where the machines are set up correctly with separate snapshot and log disks?
We have two test/staging ZK ensemble setup, and one production. They reproduce the same result.
Since IO wasn't an issue, there is no point in separating snapshot and log.
Are your ZK machines doing any other tasks?
This is a dedicated ZK ensemble

Camille Fournier
added a comment - 18/Apr/11 18:38
This might be related to this email chain from February:
http://mail-archives.apache.org/mod_mbox/zookeeper-user/201102.mbox/%3C6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com%3E
I theorized at the time that the issue might be due to synchronization on the session table, but never got enough information to finish the investigation.

locked java.lang.Object@3186c91
at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:201)
at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:97)

locked java.lang.Object@7fc600
at sun.nio.ch.SocketAdaptor.close(SocketAdaptor.java:352)
at org.apache.zookeeper.server.NIOServerCnxn.closeSock(NIOServerCnxn.java:1463)
at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:1412)

locked java.util.HashSet@4b94a39
at org.apache.zookeeper.server.NIOServerCnxn$Factory.closeSessionWithoutWakeup(NIOServerCnxn.java:343)
at org.apache.zookeeper.server.NIOServerCnxn$Factory.closeSession(NIOServerCnxn.java:330)

{
    // calling this if we have the cnxn results in the client's
    // close session response being lost - we've already closed
    // the session/socket here before we can send the closeSession
    // in the switch block below
    scxn.closeSession(request.sessionId);
    return;
}


Mahadev konar
added a comment - 19/Apr/11 14:17
Chang,
I am quite confused; I'm not sure how setting the TCP linger time to 0 is helping you. What OS are you using? I find it surprising that setting the linger time to 0 doesn't delay packets to other sockets.


Mahadev konar
added a comment - 19/Apr/11 18:14
Ben and I looked at the man page of socket and here is what it says:
SO_LINGER
Sets or gets the SO_LINGER option. The argument is a linger structure.

    struct linger {
        int l_onoff;    /* linger active */
        int l_linger;   /* how many seconds to linger for */
    };

When enabled, a close(2) or shutdown(2) will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise, the call returns immediately and the closing is done in the background. When the socket is closed as part of exit(2), it always lingers in the background.
So, since we have a single thread that calls close() and does IO on sockets, it is possible that this can cause a huge delay in IO for other sockets. Anyone else who has more information on this, please feel free to update!

Ted Dunning
added a comment - 19/Apr/11 20:46
Is it possible that somebody on the server side has a lock while trying to close a socket to a failed machine?
Or is it possible that there is a limited number of threads available for this closing activity?


Chang Song
added a comment - 19/Apr/11 22:17
OK.
The linger option with a 2-second timeout works on the Apache web server because it uses a connection-per-process/thread model: blocking in one process does not block other request processing.
If you look at the code below, I don't see why it shouldn't block.
One 2-second delay pushes some other closing activity out to 4, 6, or 8 seconds, and so forth.
NIOServerCnxn.Factory.closeSession():

    synchronized void closeSession(long sessionId) {
        selector.wakeup();
        closeSessionWithoutWakeup(sessionId);
    }

    @SuppressWarnings("unchecked")
    private void closeSessionWithoutWakeup(long sessionId) {
        HashSet<NIOServerCnxn> cnxns;
        synchronized (this.cnxns) {
            cnxns = (HashSet<NIOServerCnxn>) this.cnxns.clone();
        }
        for (NIOServerCnxn cnxn : cnxns) {
            if (cnxn.sessionId == sessionId) {
                try {
                    cnxn.close();
                } catch (Exception e) {
                    LOG.warn("exception during session close", e);
                }
                break;
            }
        }
    }
One correction: my team didn't set linger to 0. They disabled the linger option, so close() returns immediately (the lingering is done by the TCP stack).
This is different from setting linger to 0, which simply flushes all the buffers and sends a TCP RST.
With the linger option disabled, the system holds sockets to dead clients in the FIN_WAIT_1 state; note that we have no timeout for FIN_WAIT_1 sockets.
We'll have to experiment with setting the linger option to 0 to move sockets from ESTABLISHED to CLOSED immediately.
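To see why serialized lingering closes compound as described above, here is a toy simulation (no sockets involved; the 200 ms sleep is a stand-in for a blocking close()): N hung clients closed serially on one thread stall that thread for roughly N times the linger interval, and any ping queued behind them waits the full amount.

```java
public class CloseStampede {
    // Stand-in for a close() that lingers, blocking the calling thread.
    static void lingeringClose(long lingerMillis) throws InterruptedException {
        Thread.sleep(lingerMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        // Three expired sessions closed serially, each lingering 200 ms:
        // the thread is unavailable for ~600 ms, so a ping queued behind
        // them is answered that much late.
        for (int i = 0; i < 3; i++) {
            lingeringClose(200);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("pings stalled for ~" + elapsedMs + " ms");
    }
}
```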


Chang Song
added a comment - 21/Apr/11 09:10
I don't see any problem with disabling the linger option, because net.ipv4.tcp_orphan_retries works out to about 200ms on Linux: a socket to a dead server, closed without the linger option into FIN_WAIT_1, times out almost immediately (after ~200ms) and moves to CLOSED.
So our final fix is:
setSoLinger(false, -1)
Please consider this fix for the next release.
Thank you.
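In Java socket terms the fix above looks like the following (a sketch only; where the patch applies it inside NIOServerCnxn is not shown here). With SO_LINGER disabled, close() returns immediately and the TCP stack completes the shutdown in the background; the linger argument is ignored when the option is turned off:

```java
import java.net.Socket;

public class DisableLinger {
    public static void main(String[] args) throws Exception {
        Socket sock = new Socket();   // unconnected; options may be set before connect
        sock.setSoLinger(false, -1);  // the proposed fix: never block in close()
        // getSoLinger() reports -1 when SO_LINGER is disabled.
        System.out.println(sock.getSoLinger());
        sock.close();
    }
}
```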


Hadoop QA
added a comment - 28/Apr/11 17:13
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12477238/ZOOKEEPER-1049.patch
against trunk revision 1091841.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/235//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/235//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/235//console
This message is automatically generated.