XML-RPC server deadlocks under heavy load

Description

When an XML-RPC server is run under heavy load, it eventually deadlocks inside the thread pool that manages the 'worker' threads handling the individual XML-RPC requests - the classes involved are org.apache.xmlrpc.util.ThreadPool and org.apache.xmlrpc.util.ThreadPool$Poolable. jstack on the hung process shows:

----------
Found one Java-level deadlock:
=============================
"XML-RPC-13":
waiting to lock monitor 0x08d10bec (object 0xbb6959c0, a org.apache.xmlrpc.util.ThreadPool),
which is held by "XML-RPC Weblistener"
"XML-RPC Weblistener":
waiting to lock monitor 0x08d1186c (object 0xbd2ed340, a org.apache.xmlrpc.util.ThreadPool$Poolable$1),
which is held by "XML-RPC-2"
"XML-RPC-2":
waiting to lock monitor 0x08d112f4 (object 0xbd2ed570, a org.apache.xmlrpc.util.ThreadPool$Poolable),
which is held by "XML-RPC Weblistener"

And another slight variant on the above, from a debug build of revision 769436.

----------
Found one Java-level deadlock:
=============================
"XML-RPC-6":
waiting to lock monitor 0x0870d8ec (object 0xbac020f8, a org.apache.xmlrpc.util.ThreadPool),
which is held by "XML-RPC Weblistener"
"XML-RPC Weblistener":
waiting to lock monitor 0x0814de4c (object 0xbad73820, a org.apache.xmlrpc.util.ThreadPool$Poolable$1),
which is held by "XML-RPC-5"
"XML-RPC-5":
waiting to lock monitor 0x0814eacc (object 0xbad73b48, a org.apache.xmlrpc.util.ThreadPool$Poolable),
which is held by "XML-RPC Weblistener"

Jochen Wiedmann
added a comment - 28/Apr/09 15:54

The line numbers in the stack information don't match my version of the sources. Any idea why that's the case? Any possibility to supply a matching version?

Alan Burlison
added a comment - 28/Apr/09 16:14

I've pulled from the repo, so what I have is basically the same as head (rev 769427), except that I've added a few debugging bits and pieces. I can roll back and update the stack traces.

However, I think the cause of the problem is that the implementation is fatally flawed - there's a non-static member class (ThreadPool.Poolable) that synchronizes on the enclosing class (ThreadPool), and the enclosing class in turn synchronizes on the enclosed class. No surprise, therefore, that it deadlocks.

I suggest we replace the custom ThreadPool with the java.util.concurrent one in JDK 1.5.
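A minimal sketch of what that replacement might look like. Nothing here is the actual xmlrpc ThreadPool API - the class name, the `handle` helper, and the pool size of 4 are all invented for illustration; the point is that ThreadPoolExecutor (via Executors) supplies the worker-thread management and shutdown bookkeeping that the custom pool hand-rolls.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ExecutorSketch {

    /** Runs the given number of dummy "requests" through a fixed pool; returns how many completed. */
    public static int handle(int requests) throws InterruptedException {
        // A fixed-size pool, comparable to the custom ThreadPool's worker threads.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger handled = new AtomicInteger();

        // Each incoming XML-RPC request would be submitted as a task;
        // the body here is a stand-in for the real request handler.
        for (int i = 0; i < requests; i++) {
            pool.execute(handled::incrementAndGet);
        }

        // Orderly shutdown replaces the hand-rolled shuttingDown bookkeeping.
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return handled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("handled " + handle(100) + " requests");
    }
}
```

All locking is internal to the executor, so the application code takes no monitors of its own and the mixed lock orderings described below cannot arise.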

Jochen Wiedmann
added a comment - 28/Apr/09 20:35

Please try this updated version. Btw, I do not believe that the concept of inner classes has any relationship with synchronization. Although the non-static inner class has a reference to the outer class, they are nevertheless completely separate objects.

Alan Burlison
added a comment - 28/Apr/09 20:47

I agree the problem is not the class hierarchy; it is that outer class A is locking inner class B, and in turn B is trying to lock A. You could get exactly the same brokenness with non-nested classes.

----------
Found one Java-level deadlock:
=============================
"XML-RPC-4":
waiting to lock monitor 0x0814dbdc (object 0xf46b3d50, a org.apache.xmlrpc.util.ThreadPool$Poolable),
which is held by "XML-RPC Weblistener"
"XML-RPC Weblistener":
waiting to lock monitor 0x0814ee38 (object 0xf46b3d68, a org.apache.xmlrpc.util.ThreadPool$Poolable$1),
which is held by "XML-RPC-4"
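The point that nesting is irrelevant can be demonstrated with two plain objects locked in opposite orders by two threads, which is exactly the (A, B) versus (B, A) pattern in the traces. This sketch is invented for illustration (the thread names merely echo the traces above); ThreadMXBean.findDeadlockedThreads() is the programmatic equivalent of what jstack reports.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;

public class LockOrderDemo {
    // Two plain objects stand in for the Poolable and its anonymous Thread ($1);
    // the deadlock does not depend on them being nested classes.
    static final Object a = new Object();
    static final Object b = new Object();

    /** Returns the number of deadlocked threads the JVM detects (expected: 2). */
    public static int demo() throws InterruptedException {
        CountDownLatch firstLocksHeld = new CountDownLatch(2);

        Thread listener = daemon("Weblistener", () -> {
            synchronized (a) {                 // lock order: (a, b)
                firstLocksHeld.countDown();
                await(firstLocksHeld);
                synchronized (b) { }
            }
        });
        Thread worker = daemon("Worker", () -> {
            synchronized (b) {                 // lock order: (b, a)
                firstLocksHeld.countDown();
                await(firstLocksHeld);
                synchronized (a) { }
            }
        });
        listener.start();
        worker.start();

        Thread.sleep(500);                     // let both threads block on the second monitor
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    static Thread daemon(String name, Runnable r) {
        Thread t = new Thread(r, name);
        t.setDaemon(true);                     // so the deadlocked threads don't pin the JVM
        return t;
    }

    static void await(CountDownLatch l) {
        try { l.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + demo());
    }
}
```

The latch forces each thread to hold its first monitor before either attempts the second, making the deadlock deterministic rather than load-dependent.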

Alan Burlison
added a comment - 28/Apr/09 22:28

Here are the test files I'm using. To test, I'm running the server, then running multiple instances of the client (up to 10). I'm running this on a 4 x 3GHz machine, and it deadlocks in a matter of seconds.

Jochen Wiedmann
added a comment - 29/Apr/09 06:52

Ok, I was able to reproduce your problem. (The new ThreadPool seems to be better, btw, though it is still locking.) Question: How can I get those neat informational traces you provided above?

Alan Burlison
added a comment - 29/Apr/09 09:27

The traces come from jstack. When the server has deadlocked, use jps to get the pid of its process, then run 'jstack <pid>'.

In my testing the new code isn't any different. The place where it deadlocks hasn't been touched by your changes, so that's not too surprising. The problem is the ordering of the locks. If you look in the trace above, the weblistener lock ordering is (ThreadPool$Poolable, ThreadPool$Poolable$1) and the worker lock ordering is (ThreadPool$Poolable$1, ThreadPool$Poolable). The $1 object is the anonymous Thread object created inside the Poolable constructor.

The only way of fixing this is to change things so that both the weblistener and the worker thread acquire those locks in the same order. As the locks are in part taken out by synchronized methods (e.g. isShuttingDown), the only way of doing that is to change the order in which the calls are made, and that basically means a major rewrite.

The nub of the problem is that the weblistener thread is locking 'down' into the Poolable and the Poolable thread is locking 'up' into the weblistener, and if those two things happen at the same time, you'll get a deadlock. Taking out locks via synchronized methods means that the lock ordering is determined by the call ordering, and if object A locks and then calls synchronized methods on object B whilst object B simultaneously locks and calls synchronized methods on object A, the resultant lock ordering is (A, B) and (B, A) - i.e. deadlock. That's an architectural issue, and I suspect the only way of fixing it is to rewrite most of the code.

That's why I suggested using the java.util.concurrent thread pool - that code will have been tested, and will scale. There's a trade-off between reinventing the wheel and losing JDK 1.4 compatibility, as the j.u.concurrent stuff didn't come in until JDK 1.5. However, there is a backport of the j.u.concurrent stuff - see http://backport-jsr166.sourceforge.net/
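For contrast with the broken pattern, here is a sketch of the "same order" rule the comment describes: if both threads take the monitors as (a, b), neither can hold the second lock's prerequisite while blocking the other, so no cycle forms. The class and names are invented for illustration, not taken from the xmlrpc sources.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;

public class ConsistentOrderDemo {
    static final Object a = new Object();
    static final Object b = new Object();

    /** Both threads acquire (a, b) in the same order; returns detected deadlocks (expected: 0). */
    public static int demo() throws InterruptedException {
        CountDownLatch done = new CountDownLatch(2);
        Runnable task = () -> {
            synchronized (a) {        // every thread locks a first...
                synchronized (b) {    // ...then b, so orderings can never cross
                    done.countDown();
                }
            }
        };
        new Thread(task, "Weblistener").start();
        new Thread(task, "Worker").start();
        done.await();                 // both threads got through both monitors

        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + demo());
    }
}
```

The difficulty Alan identifies is that synchronized methods hide this ordering inside the call graph, so imposing it on the existing code means restructuring the calls themselves.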

Jochen Wiedmann
added a comment - 29/Apr/09 20:59

Thanks for the hint to jps and jstack. It showed me that the patched version of the ThreadPool did at least decouple the ThreadPool and the Poolable class. With that knowledge it was easy to deduce a fix: by declaring the variables Poolable.shuttingDown and Poolable.task as volatile, I could remove the synchronized keyword from several Poolable methods.

With the new version of ThreadPool and a slightly more interesting version of the Client, I was able to perform 1000000 calls to the server from 10 concurrent clients. No client ever stopped working.
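The field names Poolable.shuttingDown and Poolable.task come from the comment above; everything else in this sketch is invented to show the shape of the fix, not the actual code. volatile gives cross-thread visibility without taking the monitor, so these accessors no longer participate in any lock ordering - with the caveat that volatile only covers independent reads and writes, and a compound check-then-act sequence would still need a lock.

```java
public class PoolableSketch {
    // volatile: writes by one thread are visible to all readers
    // without acquiring the object's monitor.
    private volatile boolean shuttingDown;
    private volatile Runnable task;

    // Formerly synchronized methods, now lock-free field accesses.
    public boolean isShuttingDown()   { return shuttingDown; }
    public void shutdown()            { shuttingDown = true; }
    public Runnable getTask()         { return task; }
    public void setTask(Runnable t)   { task = t; }

    public static void main(String[] args) {
        PoolableSketch p = new PoolableSketch();
        p.shutdown();
        System.out.println(p.isShuttingDown()); // true
    }
}
```

Because isShuttingDown() no longer takes the Poolable monitor, the weblistener can poll it while a worker holds its own lock, which breaks the (A, B)/(B, A) cycle described earlier.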

Alan Burlison
added a comment - 29/Apr/09 23:27

Wow, that was quick - thanks a bunch, much appreciated. I'll pull your fix tomorrow and give it a good hammering. The two variables you identified (Poolable.shuttingDown and Poolable.task) were the ones I'd noted too, and the volatile declarations are a neat way of resolving the lock ordering issue that the various synchronized methods & blocks were causing, so I'm pretty confident you will have nailed it - I'll let you know the results of my testing. Thanks again!