Description

If a single process creates two distinct FileSystem instances to the same NN using FileSystem.newInstance(), and one of them issues a close(), the lease checker thread is interrupted. This interrupt races with the RPC namenode.renew() and can cause a ClosedByInterruptException. This closes the underlying channel, and the other FileSystem, which shares the connection, will get errors.

Aaron T. Myers
added a comment - 05/Dec/12 01:39 The latest patch looks good to me. +1 pending Jenkins.
Jenkins appears to be down right now, and I think Jenkins flakiness caused test-patch not to notice this one. May need to manually kick it once Jenkins comes back up.

Todd Lipcon
added a comment - 03/Dec/12 23:02 Also figured I'd write up a short summary of this, since the above discussion is long and somewhat hard to follow after 2.5 years and ~15 attachments
The issue at hand is what happens when an IPC caller thread (i.e. the user thread making an IPC call, for example to the NN) is interrupted while in the process of writing the call to the wire. Java NIO's semantics are that a ClosedByInterruptException is thrown on the blocked thread, and also that the underlying channel is closed. In the context of IPC, this meant that the caller thread would receive a ClosedByInterruptException, and that any other threads which were sharing the same IPC socket would then receive ClosedChannelExceptions, even though those other threads were never meant to be interrupted.
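
The NIO behavior described here can be reproduced in a few lines with any InterruptibleChannel — a FileChannel below stands in for the IPC SocketChannel. This is an illustrative sketch, not Hadoop code: per the InterruptibleChannel contract, invoking blocking I/O with the thread's interrupt status set throws ClosedByInterruptException and closes the channel for every user of it.

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptClosesChannel {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("nio-demo", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3});
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            // Set this thread's interrupt status, then attempt blocking I/O.
            Thread.currentThread().interrupt();
            try {
                ch.read(ByteBuffer.allocate(8));
                System.out.println("read succeeded (unexpected)");
            } catch (ClosedByInterruptException e) {
                // Per InterruptibleChannel's contract the channel is now closed
                // for EVERY thread sharing it, not just the interrupted one.
                System.out.println("got ClosedByInterruptException, channel open: " + ch.isOpen());
            } finally {
                Thread.interrupted(); // clear the interrupt status
            }
        }
    }
}
```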
The solution is to change the call-sending code such that the actual write() call happens on a new thread, created by the SEND_PARAMS_EXECUTOR in the patch. Since the user code has no reference to this thread, it won't ever get interrupted, even if someone interrupts the user thread making the call. So, the user thread will receive an InterruptedException, but any other threads using the same socket continue to run unaffected.
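
The hand-off mechanism can be sketched in miniature. This is not the actual patch — the executor setup, the simulated slow write, and all names below are illustrative — but the mechanism is the same: the write runs on an executor thread the user code holds no reference to, so interrupting the caller cannot interrupt the write.

```java
import java.io.ByteArrayOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SendViaExecutor {
    // Analogous to the patch's SEND_PARAMS_EXECUTOR: the write happens on a
    // daemon thread that user code cannot reach, so it is never interrupted.
    static final ExecutorService SENDER = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "sender");
        t.setDaemon(true);
        return t;
    });

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Future<?> sendFuture = SENDER.submit(() -> {
            try {
                Thread.sleep(200);          // simulate a slow write to the wire
                out.write(new byte[] {42}); // the "call" going out
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        Thread.currentThread().interrupt(); // the caller gets interrupted...
        try {
            sendFuture.get();
        } catch (InterruptedException e) {
            System.out.println("caller interrupted while waiting");
        }
        // ...but the sender thread was untouched, so the write still completes.
        SENDER.shutdown();
        SENDER.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("bytes written: " + out.size());
    }
}
```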

Todd Lipcon
added a comment - 03/Dec/12 22:41 Here's an updated patch against trunk.
I ran all of the unit tests in the ipc package locally and they passed. I also tried the new unit tests without the patch, and they failed as expected.
Given that there was a deadlock found in an early rev of this patch, I also ran all of the IPC unit tests under jcarder to look for lock inversions and it found none.
I ran the RPCCallBenchmark for 30 seconds with and without the patch, with the following results:
With patch:
====== Results ======
Options:
rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
serverThreads=30
serverReaderThreads=4
clientThreads=30
host=0.0.0.0
port=12345
secondsToRun=30
msgSize=1024
Total calls per second: 24668.0
CPU time per call on client: 58639 ns
CPU time per call on server: 64893 ns
Without patch:
====== Results ======
Options:
rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
serverThreads=30
serverReaderThreads=4
clientThreads=30
host=0.0.0.0
port=12345
secondsToRun=30
msgSize=1024
Total calls per second: 27881.0
CPU time per call on client: 68079 ns
CPU time per call on server: 62582 ns
As expected, the CPU time on the client was increased and the throughput went down by about 13%, since the RPC calls are now being shuttled between threads on the client side. That's unfortunate, but given that this fixes an important bug, and given that client side RPC throughput is rarely a bottleneck in common usage scenarios, I think it is acceptable.
This patch is also nearly identical to a patch that we've shipped in CDH since June 2010, so I'm fairly confident that the approach is correct.

Harsh J
added a comment - 23/Sep/12 12:33 Cancelling patch as it fails several relevant tests. Needs to be re-investigated.
Todd - Would you have a chance to look into this soon? Is this still required?

Todd Lipcon
added a comment - 27/Aug/11 00:06 Test failed since one of the new tests was leaving the main thread in an interrupted state, so a Thread.sleep crashed in the next test case. I added a "Thread.interrupted()" in the tests to clear state between test cases.
The -1 findbugs issue is due to HADOOP-7587 (not this patch)

Todd Lipcon
added a comment - 26/Aug/11 20:40 Rebased on trunk. A few changes since the earlier revs:
added a new "simple" test case with a single thread
moved the serialization of the call response into the calling thread, outside of the lock acquisition, to get better CPU parallelism
used guava's ThreadFactoryBuilder to make the thread factory
cleaned up some of the warning messages to include the thread name

sam rash
added a comment - 04/Mar/11 23:52 I haven't noticed this. In the rev we're running, inside sendParams(), markClosed() is called outside the synchronized(Connection.this.out), so that lock isn't held...?
(I'm using the original patch I uploaded before we changed it to a Future--it uses a latch)

The issue is that in the patch, we have the following inverted lock orders:
sendParam's senderFuture: Connection.out -> Connection (in markClosed)
sendPing: Connection -> Connection.out (explicit sync)
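
The fix for such an inversion is a single global acquisition order. The following is a toy sketch (not the patch's code) with stand-ins for the Connection monitor and its out stream: with both paths taking the monitors in the same order, the deadlock cannot occur, and two hammering threads run to completion.

```java
public class LockOrdering {
    // Stand-ins for the two monitors in the report: the Connection object and
    // its Connection.out stream. Inverted orders (out -> connection on one
    // path, connection -> out on the other) can deadlock; here both paths use
    // the same global order: connection first, then out.
    static final Object connection = new Object();
    static final Object out = new Object();

    static void sendParam() { synchronized (connection) { synchronized (out) { } } }
    static void sendPing()  { synchronized (connection) { synchronized (out) { } } }

    public static void main(String[] args) throws Exception {
        Thread a = new Thread(() -> { for (int i = 0; i < 100_000; i++) sendParam(); });
        Thread b = new Thread(() -> { for (int i = 0; i < 100_000; i++) sendPing(); });
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("no deadlock with a consistent lock order");
    }
}
```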

Hadoop QA
added a comment - 19/Feb/11 22:09 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12459934/hadoop-6762-10.txt
against trunk revision 1071364.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/272//console
This message is automatically generated.

Hadoop QA
added a comment - 05/Nov/10 06:17 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12447484/hadoop-6762-9.txt
against trunk revision 1031422.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/35//console
This message is automatically generated.

sam rash
added a comment - 25/Aug/10 23:39 I think IOException makes sense for 2 reasons
1. the reason you stated--we can pass the actual cause
2. my read is that only if you are the thread doing the IO should you throw InterruptedIOException. In this case, the thread truly was interrupted--the IO happens in the other thread.
http://download-llnw.oracle.com/javase/6/docs/api/java/io/InterruptedIOException.html
it also has a field for the partial bytes transfered. In fact, this patch is supposed to keep the IO from actually being interrupted
how can we make it so the test will pass? does the test have a faulty assumption that I should fix?

Todd Lipcon
added a comment - 25/Aug/10 19:24 actually, I might take it back - it looks like you can't pass cause information into an InterruptedIOException, so it's better to have callers check e.getCause() instanceof InterruptedException instead? What do you think? The gridmix case above already checks for ClosedByInterruptException, so it can easily check for InterruptedException too.
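
The trade-off can be shown concretely. In this hedged sketch, call() is a hypothetical stand-in for an RPC whose caller was interrupted, not Hadoop code: InterruptedIOException indeed has no constructor taking a cause (though initCause() still works, since the cause is never pre-initialized), while a plain IOException can wrap the cause directly and let callers inspect getCause().

```java
import java.io.IOException;
import java.io.InterruptedIOException;

public class CauseWrapping {
    // Hypothetical stand-in for an RPC call whose caller thread is
    // interrupted while waiting on the sender future.
    static void call() throws IOException {
        try {
            throw new InterruptedException("interrupted waiting to send params");
        } catch (InterruptedException ie) {
            // InterruptedIOException has no (String, Throwable) constructor,
            // so wrapping in a plain IOException preserves the cause...
            throw new IOException(ie);
        }
    }

    public static void main(String[] args) {
        try {
            call();
        } catch (IOException e) {
            // ...and callers distinguish interruption by inspecting the cause,
            // much as the gridmix code checks for ClosedByInterruptException.
            System.out.println("interrupted: " + (e.getCause() instanceof InterruptedException));
        }
        // An InterruptedIOException can still carry a cause via initCause():
        InterruptedIOException iioe = new InterruptedIOException("timed out");
        iioe.initCause(new InterruptedException());
        System.out.println("initCause works: " + (iioe.getCause() instanceof InterruptedException));
    }
}
```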

Todd Lipcon
added a comment - 25/Aug/10 19:19 Hi Sam,
With this patch, I see occasional failures of TestGridmixSubmission when the JobMonitor gets interrupted:
10/08/25 11:05:20 WARN ipc.Client: interrupted waiting to send params to server
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1215)
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:754)
at org.apache.hadoop.ipc.Client.call(Client.java:1001)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
at org.apache.hadoop.mapred.$Proxy10.getJobStatus(Unknown Source)
at org.apache.hadoop.mapred.JobClient$NetworkedJob.updateStatus(JobClient.java:250)
at org.apache.hadoop.mapred.JobClient$NetworkedJob.isSuccessful(JobClient.java:339)
at org.apache.hadoop.mapreduce.Job.isSuccessful(Job.java:332)
at org.apache.hadoop.mapred.gridmix.JobMonitor$MonitorThread.process(JobMonitor.java:134)
at org.apache.hadoop.mapred.gridmix.JobMonitor$MonitorThread.run(JobMonitor.java:175)
10/08/25 11:05:20 WARN gridmix.JobMonitor: Lost job GRIDMIX00000
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.ipc.Client.call(Client.java:1007)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
at org.apache.hadoop.mapred.$Proxy10.getJobStatus(Unknown Source)
at org.apache.hadoop.mapred.JobClient$NetworkedJob.updateStatus(JobClient.java:250)
at org.apache.hadoop.mapred.JobClient$NetworkedJob.isSuccessful(JobClient.java:339)
at org.apache.hadoop.mapreduce.Job.isSuccessful(Job.java:332)
at org.apache.hadoop.mapred.gridmix.JobMonitor$MonitorThread.process(JobMonitor.java:134)
at org.apache.hadoop.mapred.gridmix.JobMonitor$MonitorThread.run(JobMonitor.java:175)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1215)
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:754)
at org.apache.hadoop.ipc.Client.call(Client.java:1001)
... 7 more
I think the patch is doing the right thing here for the most part, but it should throw InterruptedIOException instead of just a normal IOException. Then the caller can at least catch it distinctly from a normal IOException.

the javadoc warnings do not appear to be from my patch
(and I see 6, not 1)

[exec] [javadoc] Constructing Javadoc information...
[exec] [javadoc] /grid/0/hudson/hudson-slave/workspace/Hadoop-Patch-h4.grid.sp2.yahoo.net/trunk/src/java/org/apache/hadoop/security/KerberosName.java:31: warning: sun.security.krb5.Config is Sun proprietary API and may be removed in a future release
[exec] [javadoc] import sun.security.krb5.Config;
[exec] [javadoc] ^
[exec] [javadoc] /grid/0/hudson/hudson-slave/workspace/Hadoop-Patch-h4.grid.sp2.yahoo.net/trunk/src/java/org/apache/hadoop/security/KerberosName.java:32: warning: sun.security.krb5.KrbException is Sun proprietary API and may be removed in a future release
[exec] [javadoc] import sun.security.krb5.KrbException;
[exec] [javadoc] ^
[exec] [javadoc] /grid/0/hudson/hudson-slave/workspace/Hadoop-Patch-h4.grid.sp2.yahoo.net/trunk/src/java/org/apache/hadoop/security/KerberosName.java:81: warning: sun.security.krb5.Config is Sun proprietary API and may be removed in a future release
[exec] [javadoc] private static Config kerbConf;
[exec] [javadoc] ^
[exec] [javadoc] /grid/0/hudson/hudson-slave/workspace/Hadoop-Patch-h4.grid.sp2.yahoo.net/trunk/src/java/org/apache/hadoop/security/SecurityUtil.java:33: warning: sun.security.jgss.krb5.Krb5Util is Sun proprietary API and may be removed in a future release
[exec] [javadoc] import sun.security.jgss.krb5.Krb5Util;
[exec] [javadoc] ^
[exec] [javadoc] /grid/0/hudson/hudson-slave/workspace/Hadoop-Patch-h4.grid.sp2.yahoo.net/trunk/src/java/org/apache/hadoop/security/SecurityUtil.java:34: warning: sun.security.krb5.Credentials is Sun proprietary API and may be removed in a future release
[exec] [javadoc] import sun.security.krb5.Credentials;
[exec] [javadoc] ^
[exec] [javadoc] /grid/0/hudson/hudson-slave/workspace/Hadoop-Patch-h4.grid.sp2.yahoo.net/trunk/src/java/org/apache/hadoop/security/SecurityUtil.java:35: warning: sun.security.krb5.PrincipalName is Sun proprietary API and may be removed in a future release
[exec] [javadoc] import sun.security.krb5.PrincipalName;
[exec] [javadoc] ^
[exec] [javadoc] ExcludePrivateAnnotationsStandardDoclet
[exec] [javadoc] Standard Doclet version 1.6.0_11
[exec] [javadoc] Building tree for all the packages and classes...
[exec] [javadoc] Building index for all the packages and classes...
[exec] [javadoc] Building index for all classes...
[exec] [javadoc] Generating /grid/0/hudson/hudson-slave/workspace/Hadoop-Patch-h4.grid.sp2.yahoo.net/trunk/build/docs/api/stylesheet.css...
[exec] [javadoc] 6 warnings

Todd Lipcon
added a comment - 18/Jun/10 03:56 Hey Sam, one small thing in the newest patch: there are some unspecified generics - e.g. it should be new AtomicReference&lt;Throwable&gt;, not just new AtomicReference. Otherwise we get some warnings compiling the tests.
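
For reference, the difference the warning is about (an illustrative snippet, not the patch's test code):

```java
import java.util.concurrent.atomic.AtomicReference;

public class GenericsExample {
    public static void main(String[] args) {
        // Raw type: compiles only with an "unchecked" warning, and get()
        // returns Object, forcing a cast at every use site:
        //   AtomicReference raw = new AtomicReference();

        // Parameterized type: no warning, and get() is statically typed.
        AtomicReference<Throwable> err = new AtomicReference<Throwable>();
        err.set(new IllegalStateException("boom"));
        Throwable t = err.get(); // no cast needed
        System.out.println(t.getMessage());
    }
}
```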

sam rash
added a comment - 10/Jun/10 18:00 yes, I did
so the two patches do differ, probably in one line, Thread.sleep(1000);
As I said, I ran ant test-core in a loop 100x and did not see the failure. If someone does, though, I can do something more sophisticated in the test (like using another latch to make sure the leader has started before interrupting it and causing the BrokenBarrierException)
let me know

Todd Lipcon
added a comment - 10/Jun/10 17:47 Note that one of the path segments differs between the two URLs. I think it just pulls an attachment ID from that path segment and downloads that, regardless of the name at the end of the path (your two wget urls have the same path segment, I guess you manually edited the last bit?)

sam rash
added a comment - 10/Jun/10 02:26 this is downright odd--if I click and view
https://issues.apache.org/jira/secure/attachment/12446356/hadoop-6762-7.txt
https://issues.apache.org/jira/secure/attachment/12446530/hadoop-6762-8.txt
they differ (the latter has a Thread.sleep() in the unit test)
but grabbing with wget
wget https://issues.apache.org/jira/secure/attachment/12446530/hadoop-6762-7.txt --no-check-certificate
wget https://issues.apache.org/jira/secure/attachment/12446530/hadoop-6762-8.txt --no-check-certificate
they both have the Thread.sleep()
is something up with JIRA? or what did I do?

sam rash
added a comment - 08/Jun/10 06:02 -8 is the latest. I ran it 100 times and did not get any failures. Do you still see a failure with rev 8? If so, I can do some more work with latches to tighten it up (right now I have a simple sleep that should allow the threads to get warmed up, and a latch to let them gracefully exit).
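
The latch-based tightening mentioned here might look like the following sketch (illustrative test scaffolding under assumed names, not the patch's actual test): a start latch guarantees every worker is running before the test proceeds, and a done latch guarantees all of them finished before any assertions, replacing timing-dependent sleeps.

```java
import java.util.concurrent.CountDownLatch;

public class LatchCoordination {
    public static void main(String[] args) throws Exception {
        final int workers = 4;
        // started: no worker proceeds until all are running.
        // done: the "test" does not assert until every worker has finished.
        CountDownLatch started = new CountDownLatch(workers);
        CountDownLatch done = new CountDownLatch(workers);
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                started.countDown();
                try {
                    started.await();   // all workers released together
                } catch (InterruptedException ignored) {
                }
                done.countDown();
            }).start();
        }
        done.await();                  // deterministic, no sleeps needed
        System.out.println("all " + workers + " workers finished");
    }
}
```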

Uma Mahesh
added a comment - 08/Jun/10 05:29 testRPCInterrupted3:
I also feel this may be a problem in the test.
In this test, the run method sometimes throws a java.util.concurrent.BrokenBarrierException.
I added a wait for all of the threads to complete before the assertion.
This happens randomly; most of the time the test passes.
Please post the latest test.

sam rash
added a comment - 07/Jun/10 22:40 Turns out this was just a really bad case of logging: the exception was a BrokenBarrierException and I wasn't logging the real exception. I changed that and fixed the test so this doesn't occur.

sam rash
added a comment - 07/Jun/10 18:29 Actually, this might be an artifact of the test. The threads in the executor's threadpool will be interrupted on shutdown. the test doesn't wait for all of the threads to terminate gracefully before doing so. This means there is a race in the test for rpc threads to finish before getting interrupted. I have an updated test that uses a latch to wait for all threads to terminate before letting the test complete. I haven't been able to repro, but can post the patch this afternoon and let you try.
please do still post what exception and any relevant looking log parts here. it will help me ascertain if this is the case

sam rash
added a comment - 07/Jun/10 17:57 can you elaborate on what you ran and what error messages + exceptions you saw in the test logs? the thread that does the IO cannot receive an interrupt so if you are seeing an error, i don't think it can be the same one.

Todd Lipcon
added a comment - 06/Jun/10 21:24 btw, is it appropriate to delete the very old patches as they aren't really relevant? or just leave them?
Usually just leave them - makes it easier to follow the conversation and see how the patch evolved.
(which there might be a timeout on the actual out.write call already--I need to double-check)
Looks like there is one - a timeout is set when NetUtils.getOutputStream is called on the socket in setupIOStreams(). So I don't think we need a separate timeout in awaiting the send param.
I think the patch is good. I ran it with my test case since Friday night and didn't see any RPC hangs. The test case eventually failed with "Too many open files" but I think it's some other bug/socket leak in DFS code, not IPC. Mark Patch Available to swing it through Hudson?

sam rash
added a comment - 06/Jun/10 18:04 todd: actually I want to open the timeout discussion back up. I think there might be some confusion. The 20s timeout that was there was not on the RPC, but only on sending the RPC. Basically this is the time to write to the socket (which there might be a timeout on the actual out.write call already--I need to double-check).
The RPC itself still would have no timeout.
let me know what you think.

note: I used a dedicated lock var around the submit + wait for a sendParams call. I didn't want to worry about reasoning through lock contention/deadlock with synchronized methods (which is why I didn't synchronize on "this").

sam rash
added a comment - 04/Jun/10 18:32 I almost have an updated patch. I loved the idea of syncing on Connection.this.out around submit + wait, but this causes deadlock since the out.write() call inside is a synchronized method. I instead sync'd on the connection object itself (seems safe and gives the same result): 1:1 Connection/socket => bounds the threads by the # of connections.
(really, a very clean/clever idea to optimize thread use)
one catch with the static executor instance: nothing ever shuts it down. While this isn't an issue for actual execution, it won't cause unit tests to hang, will it? (I made a thread factory that makes the threads daemons for this purpose)
Patch coming after I run a larger set of unit tests.

Todd Lipcon
added a comment - 04/Jun/10 03:39 Either way seems cool. I've never seen call param serialization be a bottleneck where parallelization would really help. Whichever one makes the code cleaner

sam rash
added a comment - 04/Jun/10 03:32 hmm, actually with a Future, if there is a runtime exception, it'll show up as an ExecutionException on future.get(), so there isn't a need to move the buffer construction outside (requires another try/catch clause due to IOException). We can use getCause() on the ExecutionException to find the underlying exception and use markClosed() if it's an IOException, and re-throw if it's a runtime exception (same behavior as now)
or we can move it into the caller thread. Using the future requires handling the ExecutionException anyway, so I sort of lean this way as it kills two birds. I don't know how much more parallelism we gain moving the buffer construction outside the sync block.
what do you think?
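
The ExecutionException-unwrapping approach described above can be sketched as follows (illustrative, not the patch itself): a checked failure inside the sender task surfaces to the waiting caller as the cause of an ExecutionException on get(), where it can be handled as an IOException, while runtime exceptions can be re-thrown to preserve the old behavior.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class UnwrapExecutionException {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // A sender task whose write fails with a checked IOException.
        Callable<Void> send = () -> { throw new IOException("write failed"); };
        Future<Void> f = pool.submit(send);
        try {
            f.get();
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof IOException) {
                // Here the real client would mark the connection closed.
                System.out.println("IO failure: " + cause.getMessage());
            } else if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause; // re-throw, same behavior as before
            }
        } finally {
            pool.shutdown();
        }
    }
}
```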

Todd Lipcon
added a comment - 04/Jun/10 03:25 re: timeout, so if a server disappeared, the ping would fail and the RPC would fail that way? if that's the case, then I think removing the timeout on the Future.get() is fine.
Yep, that should be the case. Of course a server can stay up but be unresponsive (e.g. deadlocked). In those cases, while it's annoying that clients get blocked forever, I don't know that changing the behavior to be timeout-based is a change we could really make at this point without worrying that it would break lots and lots of downstream users
We have seen one case of distributed deadlock here on the IPC workers in the DN, so this isn't 100% theory
Yep, I've seen internode deadlocks several times as well. Not pretty! However, I can't think of a situation where this could happen here – the only thing that can block one of these sendParam calls is TCP backpressure on the socket, and that only happens when the network is stalled. I don't see a case where allowing other threads to start sending would have unstalled a prior sender.
We could actually enforce the max one thread per connection thing by synchronizing on Connection.this.out outside the submission of the runnable. That way we know there's only one sending going on at a time, and we're just using the thread exactly for avoiding interruption and nothing else.
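
The "synchronize around the submission" idea can be sketched as follows. This is a toy model using a plain lock object rather than the real Connection.this.out (which, as noted later in the thread, can deadlock because out.write() is itself synchronized): holding the connection's lock across both the submit and the get means at most one send per connection is ever in flight, so a cached pool never grows beyond one thread per connection.

```java
import java.io.ByteArrayOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedSender {
    static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Holding connectionLock across submit + get serializes sends on this
    // connection: the thread exists only to shield the write from interrupts.
    static void send(Object connectionLock, ByteArrayOutputStream out, byte[] data)
            throws Exception {
        synchronized (connectionLock) {
            Future<?> f = POOL.submit(() -> out.write(data, 0, data.length));
            f.get(); // the lock is held until this send finishes
        }
    }

    public static void main(String[] args) throws Exception {
        Object connectionLock = new Object();    // stands in for the Connection
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        send(connectionLock, out, new byte[] {1, 2});
        send(connectionLock, out, new byte[] {3});
        POOL.shutdown();
        System.out.println("bytes sent: " + out.size());
    }
}
```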

sam rash
added a comment - 04/Jun/10 03:18 re: timeout, so if a server disappeared, the ping would fail and the RPC would fail that way? if that's the case, then I think removing the timeout on the Future.get() is fine.

sam rash
added a comment - 04/Jun/10 03:15 forgot to comment on static cachedThreadPool:
This concerns me: if we don't bound the size of this, we could get a massive # of threads; if we do bound it, then it seems to me we might have some form of deadlock possible where one RPC depends on another (indirectly) and, due to limited threads, it can't complete. Basically we would want at least one thread per Connection, but no more (which is what we have now)
We have seen one case of distributed deadlock here on the IPC workers in the DN, so this isn't 100% theory
While it is an extra resource, for simplicity and safety, I do prefer one thread per Connection.
What do you think?

Todd Lipcon
added a comment - 04/Jun/10 03:11 re timeout: I'm a little nervous about such a change in the semantics of IPC at this point. The ping system ensures that the other side isn't completely dead, so some people use IPCs that are supposed to take a really long time, and rely on ping to know it's at least still connected. Maybe if you find it useful you could introduce a new parameter for the IPC timeout, and have it default to 0 (no timeout?)
I could also see a situation where we wait for the ping time, and then print a LOG.warn("IPC call Protocol.callName to <IP> still waiting after 60000ms") once every ping interval. This would help debugging without changing behavior. (I too have often wished for such a thing)

sam rash
added a comment - 04/Jun/10 03:07 thanks for reviewing the patch so quickly.
1. latch : I agree. I'll change it to use a future & get (with timeout, see #2).
2. actually the timeout helped me debug a switch issue today. I would see a flurry of timeouts when I saturated a switch that wasn't performing to spec. Getting the timeouts was far preferable to hanging indefinitely. I agree it changes the behavior, but it's the same interval we do pings at, I think. Also, playing a game of adversary: if somehow the connection thread in the executor did die, the latch would hang indefinitely.
perhaps the timeout value should be something else? I chose the timeout value that was used to connect the socket (pingInterval) as it seemed appropriate.
3. ah, i misread your comment above--that's a great idea. Only the actual push out the socket needs to be in critical section (in theory could improve perf a tiny bit).
I'll get to the changes and post another patch.

Todd Lipcon
added a comment - 04/Jun/10 02:53 Looking at the code:
Instead of using the CountdownLatch, can you change the Runnable to a Callable<Void>() and then get back a Future<Void>? Seems a little cleaner.
The behavior is different since we've added a timeout waiting to sendParam. Do you think this change is necessary? (under what case would we block forever waiting to write?)
Regarding the issue I raised above with a Writable param that throws NPE, can we move the actual buffer construction back into the calling thread? Then if it throws an RTE, the user will see it (rather than it getting lost somewhere). There's still an issue that this will leave the call on the connection queue, but that's probably worth a separate jira.
Regarding the resource usage issue: can we just use a static cached thread pool that's shared across all of RPC for sending params? In the common case it would only have 0 or 1 threads, but it could grow as necessary and then shrink back when idle.
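A small illustration of why `Callable<Void>`/`Future<Void>` reads cleaner than a `CountDownLatch` here (names like `FutureVsLatch` and the simulated failure are invented for this sketch): a `Future.get()` propagates the sender thread's exception back to the caller as an `ExecutionException`, whereas a latch only signals completion and the exception would be lost.

```java
import java.io.IOException;
import java.util.concurrent.*;

public class FutureVsLatch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        Future<Void> f = pool.submit(new Callable<Void>() {
            public Void call() throws IOException {
                // Simulate the write to the socket failing on the sender thread.
                throw new IOException("simulated write failure");
            }
        });
        try {
            f.get();
        } catch (ExecutionException e) {
            // The caller sees the real cause, not just a timed-out latch.
            System.out.println(e.getCause().getMessage());
        }
        pool.shutdown();
    }
}
```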

sam rash
added a comment - 04/Jun/10 02:38 This patch should largely work against 0.20. I originally patched our 0.20 and then ported to trunk (which was mostly re-writing the test case)
if this doesn't apply, let me know and i'll back-port this (including test case) to 0.20
also, I do agree on one point about the extra thread: while I don't see it as inherently complex, I do see it as an additional resource. if we could solve it w/o the thread and similar in code complexity, that would be the superior solution imo

Todd Lipcon
added a comment - 04/Jun/10 02:29 Hey Sam, that's a fair argument. Let me look over your patch more closely. Do you have this against 0.20 as well? The test case I have that caused me to notice this is an 0.20 based cluster test.

sam rash
added a comment - 04/Jun/10 02:27 Hmm, how would pending calls complete? They already have a Connection object with a socket channel that is in bad shape. basically there would have to be a check inside a sync block that the channel is valid before sending. If it's not, it would have to create a new socket (or whole Connection, again all in sync block). Does this make sense? A bunch of threads get the Connection object and pile up on the synchronized(this.out) and if one of them is interrupted, the whole pile will get errors. I think having the test & fix code is more complicated than using another thread actually, but I may be biased (having already done it the other way)
FWIW, we're already using this on our 0.20 branch in production where we have up to 200+ threads using the same RPC instance.
also, i don't actually think the code is complex--it's using an executor so the thread management is as simple as it can get.
We can even get rid of the latch--it's not necessary, but I wanted the change to function exactly as it does now, so I put it in.

Todd Lipcon
added a comment - 04/Jun/10 02:17 Hey Sam. Sorry, I misunderstood your point earlier. You're definitely right that interrupting one thread shouldn't take down the RPC connection.
Adding yet another thread to IPC seems a bit complicated, though. What about if we added a flag to Connection saying "no more sends on this connection", so that interrupting the sender did kill the connection but lets currently pending calls complete? Then when the queue has been quiesced the connection shuts down like it does today?
The issue I'm thinking of is that we solve the interrupt problem, but don't solve the general case of exceptions during sendParam. The user can still have a Writable which e.g. throws an NPE, and we're back to the same problem of lack of isolation between writers.

sam rash
added a comment - 03/Jun/10 18:29 maybe my test case is not sufficiently deterministic so it works differently for you. I replaced "Thread.currentThread().interrupt()" with "this.interrupt()" in the code I pasted above.
The test still fails
(and regardless, I don't think that can affect the underlying problem that if something interrupts a thread that is doing RPC and that thread is blocked on the channel, it gets closed automatically)

sam rash
added a comment - 03/Jun/10 18:09 maybe you can elaborate on where you are adding the interrupt() call as trying what I understand you to have suggested didn't make the test pass.
Is this what you mean for sendParam:
public void sendParam(Call call) {
  if (shouldCloseConnection.get()) {
    return;
  }

  DataOutputBuffer d = null;
  try {
    synchronized (this.out) {
      if (LOG.isDebugEnabled())
        LOG.debug(getName() + " sending #" + call.id);

      // for serializing the data to be written
      d = new DataOutputBuffer();
      d.writeInt(call.id);
      call.param.write(d);
      byte[] data = d.getData();
      int dataLength = d.getLength();
      out.writeInt(dataLength);        // first put the data length
      out.write(data, 0, dataLength);  // write the data
      out.flush();
    }
  } catch (IOException e) {
    // add this -- should have no effect on ClosedByInterruptException,
    // as it sets the current thread's interrupt status already
    Thread.currentThread().interrupt();
    markClosed(e);
  } finally {
    // the buffer is just an in-memory buffer, but it is still polite to
    // close early
    IOUtils.closeStream(d);
  }
}

sam rash
added a comment - 03/Jun/10 16:08 so you mean add Thread.currentThread().interrupt() above markClosed(e)? I don't think this fixes the underlying problem.
The problem is if a channel is doing a wait for IO and an interrupt comes in, you get a ClosedByInterrupt exception
http://java.sun.com/javase/6/docs/api/java/nio/channels/ClosedByInterruptException.html
from the doc, this closes the channel and sets the interrupt status (i.e. interrupt() shouldn't have an effect--which means my test doesn't repro the same thing I saw).
What I then saw was that other RPC instances using the same channel would get ClosedChannelException. The only way to avoid this in the FileSystem case was to move the thread that uses the channel to its own so the lease checker won't interrupt it.
I'll play with the test case and see why your change makes it pass when it seems like it can't fix the underlying problem
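The close-on-interrupt behavior sam is pointing at can be demonstrated in a few lines (this is a standalone illustration using a `java.nio.channels.Pipe` rather than a socket; the class name `InterruptDemo` is invented): interrupting a thread blocked on an `InterruptibleChannel` throws `ClosedByInterruptException` on that thread and closes the channel for every other user of it.

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;

public class InterruptDemo {
    public static void main(String[] args) throws Exception {
        final Pipe pipe = Pipe.open();
        Thread reader = new Thread() {
            public void run() {
                try {
                    // Blocks forever: nothing is ever written to the pipe.
                    pipe.source().read(ByteBuffer.allocate(1));
                } catch (ClosedByInterruptException e) {
                    System.out.println("ClosedByInterruptException");
                } catch (Exception e) {
                    System.out.println(e.getClass().getSimpleName());
                }
            }
        };
        reader.start();
        Thread.sleep(200);   // let the reader block in read()
        reader.interrupt();  // per NIO semantics, this closes the channel too
        reader.join();
        System.out.println("source open: " + pipe.source().isOpen());
    }
}
```

The second output line is the crux of the bug: the channel is closed as a side effect, so any other thread sharing it subsequently gets ClosedChannelException.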

Todd Lipcon
added a comment - 03/Jun/10 07:39 Hey Sam. Is it not sufficient to simply add a "this.interrupt()" in the catch clause of sendParams?
I was running into an occasional issue with 1 minute pauses during DFSClient pipeline recovery, and I think it's due to this issue. I added this.interrupt() and it seems to pass the test case; I'll see if it also fixes my 1 minute pauses.

sam rash
added a comment - 14/May/10 00:34 1. previous patch had a bug where a race condition could cause indefinite hang of the client
2. cleaned up case of close by shutting down executor (speeds up shutdown as thread is interrupted)

sam rash
added a comment - 12/May/10 17:28 the general problem is that 'client' threads hold the socket and do writes to it to send RPCs. If a client thread receives an interrupt, it will leave the socket in an unusable state.
i have a test for this general case and a patch which moves the actual writing to the socket to a thread owned by the Client object. This means a client can be interrupted and not ruin the socket for other clients.
note: other socket errors may occur that make the socket unusable. The patch doesn't handle this (only intended to help with interrupted cases since that is common with filesystem.close).
we might also want to consider finding a way to fail fast when RPC goes bad. Near as I can tell from watching this happen, until the filesystem is closed, the underlying RPC is in a bad state. It seems like we could fail on one operation, detect the bad socket and perhaps recreate the socket or the whole RPC object. not sure where this retry logic goes
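The fix sam describes can be sketched as follows (an illustration only, again using a `Pipe` in place of the real socket; `OwnedSenderDemo` and the buffer sizes are invented): the blocking write runs on a thread the user code has no reference to, so interrupting the caller cancels its wait on the `Future` but leaves the channel open for other users of the same connection.

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.util.concurrent.*;

public class OwnedSenderDemo {
    public static void main(String[] args) throws Exception {
        final Pipe pipe = Pipe.open();
        ExecutorService sender = Executors.newSingleThreadExecutor();
        final Future<Void> f = sender.submit(new Callable<Void>() {
            public Void call() throws Exception {
                // Fill the pipe until the write blocks (the pipe buffer is finite),
                // simulating a sendParam stalled on TCP backpressure.
                ByteBuffer b = ByteBuffer.allocate(64 * 1024);
                while (true) { pipe.sink().write(b); b.rewind(); }
            }
        });
        Thread caller = new Thread() {
            public void run() {
                try { f.get(); }
                catch (InterruptedException e) { System.out.println("caller interrupted"); }
                catch (ExecutionException e) { System.out.println("write failed"); }
            }
        };
        caller.start();
        Thread.sleep(300);
        caller.interrupt();   // hits the waiting caller, not the writing thread
        caller.join();
        System.out.println("sink open: " + pipe.sink().isOpen());
        sender.shutdownNow(); // demo cleanup; this one *does* interrupt the writer
    }
}
```

Contrast with the previous demo: here the interrupt lands on a thread that is merely waiting, not one blocked inside the channel, so the channel survives.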