I think I finally understand the cause of this issue - it is extremely subtle. The good news is that it was already (accidentally) fixed as a side effect of https://hg.python.org/jython/rev/17e40de9a541 (I had wondered why it never occurred in recent trunk versions).
The bad news is that some things silently go badly wrong and I don't see a proper way to detect this on the Jython side.
So, the whole story of this bug... here we go.
guava uses reentrant locks to secure all sorts of access to data structures. It uses finally blocks to unlock such locks. Note that - as locks are reentrant - it can occur that finally-enclosed calls are nested.
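To illustrate the nesting, here is a minimal sketch of the lock/finally pattern using a plain java.util.concurrent.locks.ReentrantLock. This is not guava's actual code; the class NestedLockDemo and the methods outer() and inner() are made up for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

public class NestedLockDemo {
    static final ReentrantLock lock = new ReentrantLock();

    // Hypothetical inner operation: acquires the lock its caller already holds.
    static int inner() {
        lock.lock();
        try {
            return lock.getHoldCount(); // 2 when called from outer()
        } finally {
            lock.unlock(); // inner unlock: hold count drops back to 1
        }
    }

    // Hypothetical outer operation: holds the lock while calling inner().
    static int outer() {
        lock.lock();
        try {
            return inner();
        } finally {
            lock.unlock(); // outer unlock: hold count drops back to 0
        }
    }

    public static void main(String[] args) {
        System.out.println("hold count inside inner(): " + outer());
        System.out.println("hold count afterwards: " + lock.getHoldCount());
    }
}
```

The lock is only released for other threads once every lock() has been matched by an unlock(), so a single skipped finally block leaves it held.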
This bug originates in finally-blocks not being executed; thus some locks fail to unlock and later cause deadlocks. While this can happen in various scenarios, it typically involves guava and the bypassed unlock typically happens in test_json (i.e. in org.python.modules.sre.SRE_STATE$CACHE, which is backed by com.google.common.cache.LocalCache).
But how can a finally-block be bypassed?
https://docs.oracle.com/javase/tutorial/essential/exceptions/finally.html
tells us it can be caused by thread interruption or System.exit, but I made sure that neither of these is the case here. The document fails to mention that OutOfMemoryError and StackOverflowError can terminate the JVM without executing an enclosing finally-block:
http://stackoverflow.com/questions/22969000/finally-block-will-be-executed-in-case-of-outofmemoryerror
http://stackoverflow.com/questions/7753443/behaviour-of-jvm-during-out-of-memory-error-list-s-new-arrayliststring
http://stackoverflow.com/questions/17836120/stack-overflow-error-handling-in-finally-block
Normally that doesn't matter much since everything blows up anyway. However, recall that in guava, calls enclosed by finally blocks are sometimes nested. In that case an outer level can recover from the error (and is often able to, because memory might get freed while the error propagates up the stack) and then resume normal execution - with the slight detail that the inner finally block was bypassed and the lock count is now wrong. The nasty thing here is that no error is displayed - things go silently wrong. Also, it is somewhat random how much of the stack is skipped before normal execution resumes - sometimes even two unlocks are dropped at a time.
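A rough sketch that tries to reproduce this effect in isolation follows. It is not taken from guava or Jython; the class LostUnlockDemo and its methods are hypothetical. The result is non-deterministic: depending on JVM, stack size and JIT state it may report zero, one or several lost unlocks, so no particular output should be expected.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LostUnlockDemo {
    static final ReentrantLock lock = new ReentrantLock();

    // Recurses until the stack is exhausted; every level locks and
    // unlocks in a finally block, like the nested calls described above.
    static void recurse() {
        lock.lock();
        try {
            recurse();
        } finally {
            // Close to the stack limit this call may itself fail with a
            // StackOverflowError before the hold count is decremented.
            lock.unlock();
        }
    }

    // Swallows the error and returns the hold count that was left behind.
    static int run() {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // Normal execution resumes here, no error is reported.
        }
        return lock.getHoldCount();
    }

    public static void main(String[] args) {
        int leaked = run();
        System.out.println("unlocks lost: " + leaked);
        // Clean up so the lock is usable again.
        while (lock.isHeldByCurrentThread()) {
            lock.unlock();
        }
    }
}
```

If the reported count is non-zero, any other thread that later calls lock() on the same lock blocks forever, which matches the deadlocks seen here.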
I observed that test_json just happens to yield very deep call stacks, probably because it is a nested test suite. Increasing the thread stack size via -Xss (https://hg.python.org/jython/rev/17e40de9a541) mitigates it; however, the issue only shifts to deeper stacks. My concern is that OutOfMemoryError or StackOverflowError are not displayed properly; just a (seemingly) unexplainable deadlock occurs later on (and I can tell you, it takes a lot of time and thought to discover the actual cause).
I suppose this phenomenon causes all of the recently observed deadlocks (e.g. in #2565 an unlock is lost on PyType$LazyClassToTypeHolder.classToType, which is backed by com.google.common.collect.MapMakerInternalMap) and I don't see a way to detect it properly in Jython. Increasing -Xss is the best thing we can do on the Jython side so far.
guava could maybe detect bypassed finally-blocks by explicitly catching OutOfMemoryError/StackOverflowError and then issuing a warning or enforcing a hard exit (or by double-checking lock consistency once proper execution has eventually resumed).
Actually, the JVM itself should not resume execution once an OutOfMemoryError or StackOverflowError has been thrown. These errors should IMHO explicitly propagate through finally blocks in the entire stack rather than bypassing only some of them in a non-deterministic fashion.
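The lock-consistency double-check could look roughly like the sketch below. It assumes a plain ReentrantLock and uses a hypothetical helper leakedHolds(); guava offers nothing like this as far as I know. The bypassed finally block is only simulated here by acquiring the lock and throwing without releasing it.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockConsistencyCheck {
    // Runs the task and returns how many holds on the lock it failed to
    // release. A non-zero result means some unlock was skipped.
    static int leakedHolds(ReentrantLock lock, Runnable task) {
        int before = lock.getHoldCount();
        try {
            task.run();
        } catch (StackOverflowError | OutOfMemoryError e) {
            // Execution resumes here; whether all inner finally blocks
            // completed is unknown, hence the check below.
        }
        return lock.getHoldCount() - before;
    }

    public static void main(String[] args) {
        ReentrantLock lock = new ReentrantLock();
        // Simulated bypass: lock is acquired, then the error is thrown
        // without the matching unlock ever running.
        int leaked = leakedHolds(lock, () -> {
            lock.lock();
            throw new StackOverflowError("simulated");
        });
        System.out.println("leaked holds: " + leaked); // prints "leaked holds: 1"
    }
}
```

A real implementation would have to decide what to do with a non-zero result, e.g. log a warning or exit hard, as suggested above.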
So I am likely going to file issues regarding this scenario with guava and with Oracle/JVM.
Opinions?