SplitLog Rescan BusyWaits upon Zk.CONNECTIONLOSS

Details

Type: Bug

Status: Resolved

Priority: Minor

Resolution: Later

Affects Version/s: None

Fix Version/s: None

Component/s: None

Labels: None

Description

We ran into a production issue yesterday where the SplitLogManager tried to create a Rescan node in ZK. The createAsync() call generated a KeeperException.CONNECTIONLOSS that was immediately sent to processResult(), which called createRescan node again with --retry_count. The result was a CPU busy-wait that also clogged up the logs. We should handle this better.

Hadoop QA
added a comment - 08/Jan/15 10:12
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12524903/HBASE-5890.patch
against master branch at commit 645fbd7d87450b31b67b1e535cdb9c1cf50ffd16.
ATTACHMENT ID: 12524903
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/12359//console
This message is automatically generated.

Prakash Khemani
added a comment - 27/Apr/12 23:29
Most likely, it isn't a good idea to sleep in the ZooKeeper callback thread. (Isn't the ZK client single-threaded?)
Can these be queued in a DelayedQueue(socket-timeout) and retried from SplitLogManager.TimeoutMonitor.chore()?
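
The queue-and-retry idea above could be sketched roughly as follows. This is only an illustration, not the attached patch: it assumes the comment refers to java.util.concurrent.DelayQueue, and the names DelayedRescan, RescanRetryQueue, onConnectionLoss and drainDue are hypothetical and do not exist in HBase. The callback side only enqueues and never sleeps; a periodic chore drains whatever has become due.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical holder for one failed rescan-node creation (not an HBase class).
final class DelayedRescan implements Delayed {
    final String path;
    final int retriesLeft;
    private final long readyAtNanos;

    DelayedRescan(String path, int retriesLeft, long delayMs) {
        this.path = path;
        this.retriesLeft = retriesLeft;
        this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        // Remaining time until this entry may be retried; <= 0 means due.
        return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                            other.getDelay(TimeUnit.NANOSECONDS));
    }
}

// Hypothetical retry queue shared by the ZK callback and a periodic chore.
final class RescanRetryQueue {
    private final DelayQueue<DelayedRescan> pending = new DelayQueue<>();
    private final long delayMs;

    RescanRetryQueue(long delayMs) {
        this.delayMs = delayMs;
    }

    // Intended for the ZK callback thread: enqueue and return, never sleep.
    void onConnectionLoss(String path, int retriesLeft) {
        if (retriesLeft > 0) {
            pending.add(new DelayedRescan(path, retriesLeft - 1, delayMs));
        }
    }

    // Intended for a periodic chore: retry only entries whose delay has expired.
    int drainDue(Consumer<DelayedRescan> retry) {
        int drained = 0;
        DelayedRescan due;
        while ((due = pending.poll()) != null) { // poll() returns null if nothing is due
            retry.accept(due);
            drained++;
        }
        return drained;
    }

    int size() {
        return pending.size();
    }
}
```

In this sketch the delay would be set to something like the socket timeout, and the chore would call drainDue with a callback that reissues the asynchronous create.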

Hadoop QA
added a comment - 27/Apr/12 20:56
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12524903/HBASE-5890.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 hadoop23. The patch compiles against the hadoop 0.23.x profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1670//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1670//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1670//console
This message is automatically generated.

Nicolas Spiegelberg
added a comment - 27/Apr/12 19:44
The original idea is to have a timeout when we encounter this error. Since we have a recoverable ZK, it seems okay to retry after connection loss; but we should have some sort of dampening so that this isn't a CPU & log hog.
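
One possible form of the dampening mentioned above is a capped exponential backoff between retries. The sketch below is an assumption for illustration, not what the patch does; the class name RetryBackoff and its parameters are hypothetical. It only computes how long to wait before a given attempt, so the caller can schedule the retry without blocking a callback thread.

```java
// Hypothetical helper (not an HBase class): capped exponential backoff.
final class RetryBackoff {
    private final long baseMs;
    private final long maxMs;

    RetryBackoff(long baseMs, long maxMs) {
        if (baseMs <= 0 || maxMs < baseMs || maxMs > Long.MAX_VALUE / 2) {
            throw new IllegalArgumentException("need 0 < baseMs <= maxMs <= Long.MAX_VALUE / 2");
        }
        this.baseMs = baseMs;
        this.maxMs = maxMs;
    }

    // Delay before the given retry attempt (0-based): baseMs doubled per attempt, capped at maxMs.
    long delayMs(int attempt) {
        long delay = baseMs;
        // Stop doubling once the cap is reached, which also rules out overflow.
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }
}
```

With a base of 100 ms and a cap of 5000 ms, successive retries would wait 100, 200, 400, 800 ms and so on, never more than 5 seconds, which bounds both the CPU use and the number of log lines per minute.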