Description

When the following two queries are run from multiple threads against tables that contain the necessary fields, Derby deadlocks and neither query ever returns. Derby apparently does not detect a deadlock condition, either.

SELECT t0.* FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE t1.parentidhash IN AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid) AND t0.jobid=?
SELECT t0.* FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE t1.parentidhash IN AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.newField=?) AND t0.jobid=?

This code comes from Apache ManifoldCF; the hang occurs when five or more threads try to execute these two queries at the same time. We originally found this on 10.5.3.0. It was hoped that 10.7.1.1 would fix the problem, but it hasn't.

Has the application set derby.locks.waitTimeout? The Object.wait() call in the stack trace should have a timeout, unless derby.locks.waitTimeout is negative, so I don't understand how it could be hanging there forever.

Are there any threads updating the database in parallel with the threads that perform these queries, or are the SELECT threads the only threads working on the database?

Knut Anders Hatlen
added a comment - 28/Feb/11 15:17
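For reference, the two lock-related timeouts discussed in this thread are ordinary Derby system properties. A derby.properties fragment might look like the following (the values shown are illustrative, not the reporter's actual settings):

```properties
# Seconds a transaction waits for a lock before timing out;
# a negative value (e.g. -1) means wait forever.
derby.locks.waitTimeout=60

# Seconds a transaction waits before deadlock checking kicks in.
derby.locks.deadlockTimeout=20

# Log deadlocks and lock timeouts to derby.log to help diagnosis.
derby.locks.monitor=true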

(1) Yes, the wait timeout is set to infinity.
(2) Database updating is happening at the same time by other threads, but I could not find any thread that was blocking in an update when this block occurred. However, I should note that these queries are likely to be taking place within a transaction, and it is possible that writes occurred in the same transaction already, but I cannot confirm that from the thread dump.

Karl Wright
added a comment - 28/Feb/11 15:29

Is derby.locks.deadlockTimeout also set to infinity? If so, that would explain why the deadlock isn't reported, if a deadlock is the underlying problem here.

Note that a thread doesn't need to be blocked in an update in order to hold an exclusive lock. It's enough that it has performed an update without committing or aborting the transaction afterwards. You can connect to the database and execute the statement "select * from syscs_diag.lock_table" to find out if some transaction is holding an exclusive lock.

Knut Anders Hatlen
added a comment - 28/Feb/11 17:02

Karl Wright
added a comment - 28/Feb/11 17:08 The deadlock timeout is NOT set to infinity, but to 60 seconds. It would be perfectly OK if Derby returned the appropriate SQL exception indicating deadlock under this circumstance.
The suggestion to connect to the database is great but unfortunately this is using the embedded JDBC driver, so trying to connect while it's hung probably won't work.

Also, just to make it very clear, I am already certain that a transaction IS holding an exclusive lock. The issue is that deadlock is never detected by Derby, and it must be detected. If it is not, there is no possibility of recovery from this situation, other than relying on the lock timeout, which then limits the longest running transaction I can have.

Karl Wright
added a comment - 28/Feb/11 17:16

Note that Derby will only detect a deadlock if the thread holding the exclusive lock is also waiting for a lock itself. That is, there has to be a cycle in the wait graph. It's not enough that a transaction is simply holding an exclusive lock for a very long time without committing. I'm not sure if that's what's happening here, but that situation is supposed to be handled by the lock timeout.

Knut Anders Hatlen
added a comment - 28/Feb/11 22:14
One way to connect to the embedded database while the system is hung, is to set the system property derby.drda.startNetworkServer=true - http://db.apache.org/derby/docs/10.7/adminguide/radminconfigstartnetworkserver.html - and connect to the database with ij using the client driver.
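The "cycle in the wait graph" requirement can be illustrated with a minimal sketch (a hypothetical model, not Derby's actual implementation): each transaction waits for at most one other transaction, and a deadlock exists only if following those wait-for edges revisits a node.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wait-for-graph model: an edge "A" -> "B" means
// transaction A is waiting for a lock held by transaction B.
// This sketch assumes each transaction waits for at most one other.
public class WaitGraph {
    // A deadlock exists iff following the wait-for edges from some
    // start node revisits a node already on the current path.
    public static boolean hasCycle(Map<String, String> waitsFor) {
        for (String start : waitsFor.keySet()) {
            Set<String> onPath = new HashSet<>();
            String cur = start;
            while (cur != null && onPath.add(cur)) {
                cur = waitsFor.get(cur);
            }
            if (cur != null) {
                return true; // revisited a node on the path => cycle
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Three transactions in a cycle: T1 -> T2 -> T3 -> T1
        Map<String, String> cycle = Map.of("T1", "T2", "T2", "T3", "T3", "T1");
        // A plain chain: T3 holds a lock a long time but waits for nothing
        Map<String, String> chain = Map.of("T1", "T2", "T2", "T3");
        System.out.println(hasCycle(cycle)); // true: deadlock
        System.out.println(hasCycle(chain)); // false: only the lock timeout applies
    }
}
```

In the chain case no victim should ever be picked, which is exactly the "holding an exclusive lock for a very long time without committing" situation the lock timeout is meant to handle.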

Karl Wright
added a comment - 28/Feb/11 23:27 Note that Derby will only detect a deadlock if the thread holding the exclusive lock is also waiting for a lock itself.
Of course. I see no reason why that should not be the case here. There are no threads with a transaction open that are waiting on anything else.

Is there anything else that would be helpful to diagnose why Derby does not detect a deadlock here? I am not certain how to read this output, nor am I sure it will help because it doesn't tell us what lock the code is TRYING to grab.

Karl Wright
added a comment - 02/Mar/11 19:14

It might be useful to include the STATE column from the diagnostic table, so that we see which locks are granted and which are being waited for. Then we may be able to reconstruct the wait graph and see if it contains any cycles that the deadlock detection algorithm has missed.

Knut Anders Hatlen
added a comment - 02/Mar/11 21:46 Hi Karl,

Karl Wright
added a comment - 02/Mar/11 22:27 I'm sorry - which diagnostic table has a STATE column you wish to see? I've printed all columns from the Locks diagnostic table, as you can see, and there is no STATE column.

It looks to me like this is getting truncated. I'm confirming; the guy with this problem whom I am in communication with may have inadvertently done something funny with cut-and-paste. When I do the same thing here, I get the columns you describe. Stay tuned.

Karl Wright
added a comment - 03/Mar/11 11:07

I think I'm able to reproduce the problem. At least, I've been able to construct a deadlock that looks similar to the one in the lock dump Karl provided. See the attached Java program (Derby5073.java). There's a cycle involving three transactions, as well as some other transactions that wait on the same locks without actually being part of the cycle themselves.

When disabling waitTimeout, the program runs forever, whereas it should have been terminated with a deadlock error. Even when it's stuck, it keeps one CPU fully occupied. Perhaps an infinite loop in the deadlock detection?

Knut Anders Hatlen
added a comment - 09/Mar/11 17:15

I took a look at this in the debugger. The deadlock is indeed detected, and a victim is chosen and woken up. The victim rechecks that there still is a deadlock, so that it doesn't have to give up if it's no longer part of one. When rechecking, it finds that there is a deadlock, but picks another victim and then goes back to sleep waiting for the lock again itself. The new victim does the same, and picks yet another victim. And so it goes on.

There is code in place to prevent a victim from picking another victim when rechecking the deadlock, but for some reason that code doesn't work in this scenario.

Knut Anders Hatlen
added a comment - 10/Mar/11 12:08


Knut Anders Hatlen
added a comment - 10/Mar/11 16:07 The code is in Deadlock.handle():

    // See if the checker is in the deadlock and we
    // already picked as a victim
    if ((checker.equals(space)) && (deadlockWake == Constants.WAITING_LOCK_DEADLOCK))
    {
        victim = checker;
        break;
    }

It never kicks in, and instead it goes further down in the method and wakes another victim:

    ActiveLock victimLock = (ActiveLock) waiters.get(victim);
    victimLock.wakeUp(Constants.WAITING_LOCK_DEADLOCK);

The new victim wakes up from its waiting state in ActiveLock.waitForGrant()/ConcurrentLockSet.lockObject(), calls checkDeadlock() and ends up in Deadlock.handle() again.
I think the problem may be caused by the following piece of code in Deadlock.look():

    } else
    {
        // simply waiting on another waiter
        space = waitingLock.getCompatabilitySpace();
    }

As far as I can see, this code doesn't make any sense. space will already have the same value as waitingLock.getCompatabilitySpace(), so the operation is actually a no-op. (waitingLock is obtained by calling waiters.get(space), and the waiters Map is built up from (waitingLock.getCompatabilitySpace(), waitingLock) key-value pairs; see LockControl.addWaiters().) Furthermore, this leads to "space" being considered twice in a row by the deadlock detection, so that it thinks that the transaction owning that compatibility space is waiting for one of its own locks. It therefore detects the deadlock prematurely, before it has seen all transactions involved in it, and incorrectly concludes that the original victim wasn't involved.
By changing that last piece of code from a no-op to actually moving one step ahead in the wait graph, the repro does fail with a deadlock error. That is, change the assignment to:

    space = ((ActiveLock) waitOn).getCompatabilitySpace();

I tried running the regression tests with that change, and they all passed. I do find the deadlock detection code a bit hard to follow, so I'm not totally convinced this is the right change.

I have stared at this code before and also find it hard to understand. There are hardly any tests in the code for deadlocks, so passing the regression tests does not mean much.
A long time ago I looked at DERBY-3980 and did not get very far. I wonder if it is related; your comments there and here seem similar.

Mike Matrigali
added a comment - 10/Mar/11 23:38

When running the repro attached here, I do get a deadlock with the patch. However, typically three or four of the six threads get a deadlock exception, whereas it should be enough with one victim. Not sure if that's something caused by my change, or if it's a general problem with our algorithm that it just picks one random transaction that happens to be waiting for a lock in the cycle, even if the transaction isn't necessarily part of the cycle itself.

Knut Anders Hatlen
added a comment - 11/Mar/11 08:56

No one is waiting for tx 182, so aborting it doesn't really resolve the deadlock, and therefore other transactions will have to be aborted later. Also note that the data for row (1,7) has been split in two in the output.

It looks to me as if the problem is that the deadlock detection gets confused by the threads that wait for shared locks on the same rows that the transactions involved in the deadlock are waiting for.

Those transactions are not really involved in the deadlock. But because they started waiting for the locks earlier than the threads actually involved in the deadlock, they are first in the queue of waiters. The deadlock detection algorithm incorrectly assumes that these transactions are blocking the transactions involved in the deadlock.

The assumption is incorrect because they are all waiting for shared locks, so once the first waiter in the queue gets its lock, the second waiter will also get it. This means that the second waiter isn't waiting for the first waiter here; they're both waiting for some other lock that is blocking them both.

The attached modified patch (1b) changes the algorithm so that it skips past chains of waiters that have compatible lock requests. That made the repro fail with exactly one deadlock exception.

Knut Anders Hatlen
added a comment - 11/Mar/11 10:37
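The idea of skipping compatible waiters can be sketched as follows (a hypothetical model with illustrative names, not the actual patch 1b code): a waiter whose requested lock mode is compatible with an earlier waiter's mode is not really blocked by that waiter; both are blocked by the same incompatible request further ahead.

```java
import java.util.List;

// Hypothetical sketch of skipping compatible waiters in a lock queue.
// Mode and firstBlocker are illustrative names, not Derby's API.
public class WaiterQueue {
    enum Mode { SHARED, EXCLUSIVE }

    // Two shared requests are compatible with each other; any pairing
    // that involves an exclusive request is not.
    static boolean compatible(Mode a, Mode b) {
        return a == Mode.SHARED && b == Mode.SHARED;
    }

    // Return the index of the closest earlier waiter that actually
    // blocks waiter i, or -1 if all earlier waiters are compatible
    // with it (i.e. it is blocked by something else, not by them).
    static int firstBlocker(List<Mode> waiters, int i) {
        for (int j = i - 1; j >= 0; j--) {
            if (!compatible(waiters.get(j), waiters.get(i))) {
                return j;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Queue: [S, S, X, S]. The second S is not blocked by the
        // first S; the last S is blocked by the X at index 2.
        List<Mode> q = List.of(Mode.SHARED, Mode.SHARED,
                               Mode.EXCLUSIVE, Mode.SHARED);
        System.out.println(firstBlocker(q, 1)); // -1
        System.out.println(firstBlocker(q, 3)); // 2
    }
}
```

With this kind of skipping, a wait-for edge is only drawn to a waiter that genuinely blocks, so purely shared chains no longer look like part of a cycle.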

Bryan Pendleton
added a comment - 11/Mar/11 14:58 Your analysis makes a lot of sense to me. Thank you for taking the time to walk us through the code and your thought process; it has been very illuminating.
I think you have identified a real problem and your approach is good.

Thank you Knut for working on this issue. I stared at the deadlock code for a very long time in the context of DERBY-3980 before giving up, and so am thrilled to see you are able to get your head around it. I want to ask a favor while you are in there. If there is any insight in the form of comments that you can add to the deadlock code, I think that would be extremely educational and helpful to anyone who has to visit the deadlock code again.

Kathey Marsden
added a comment - 11/Mar/11 17:07

Thanks for the feedback, Bryan and Kathey. I'll polish the patch and incorporate the repros attached to this issue and DERBY-3980 into a JUnit test that can run as part of the regression test suite.

Since we don't know for sure that the repro I attached reproduces the exact same problem as originally reported, we should probably check in the fix under DERBY-3980 and mark that issue as resolved. When Karl gets a chance to verify that this also fixes the problem he saw, we can close this issue as a duplicate of DERBY-3980.

And, yes, I'll annotate the deadlock detection code with comments reflecting my understanding of what it does and how it should work. Good idea. Thanks!

Knut Anders Hatlen
added a comment - 11/Mar/11 18:19

Knut Anders Hatlen
added a comment - 14/Mar/11 16:49 Hi Karl,
I've posted a patch over at DERBY-3980 (derby-3980-1a.diff) and also committed it to the subversion repository. It would be great if you could test if the fix also solves the problem you're seeing.
Thanks.

I built Derby trunk and checked it in, and notified the field that a fix may possibly be had for the problem they've been seeing. I expect they will respond in about 12 hours, and then I will let you know.

Karl Wright
added a comment - 14/Mar/11 19:21

Hmm. I merged the fix into 10.7.1.1, and ran a test, and got a deadlock hang again. Strange.
I'm going to have to try the same case with a straight trunk build just to be sure I didn't mess anything up.

Karl Wright
added a comment - 16/Mar/11 22:07

A trunk build does the same thing on my test case. I got a thread dump when I thought it had locked itself up. There were a whole pile of threads all waiting for grants, and then there was this thread:

"Thread-24341" daemon prio=6 tid=0x05934400 nid=0x1cd4 runnable [0x0816e000]
java.lang.Thread.State: RUNNABLE
at java.util.Hashtable.get(Hashtable.java:333)
locked <0x29280528> (a java.util.Hashtable)
at org.apache.derby.shared.common.sanity.SanityManager.DEBUG_ON(SanityManager.java:229)
at org.apache.derby.impl.services.locks.AbstractPool.lockObject(AbstractPool.java:101)
at org.apache.derby.impl.store.raw.xact.RowLocking2.lockRecordForRead(RowLocking2.java:165)
at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:520)
at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:638)
at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3.lockRowOnPage(B2IRowLocking3.java:280)
at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3._lockScanRow(B2IRowLocking3.java:599)
at org.apache.derby.impl.store.access.btree.index.B2IRowLockingRR.lockScanRow(B2IRowLockingRR.java:105)
at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:305)
at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNext(BTreeScan.java:1599)
at org.apache.derby.impl.sql.execute.TableScanResultSet.getNextRowCore(TableScanResultSet.java:577)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:261)
at org.apache.derby.impl.sql.execute.JoinResultSet.reopenCore(JoinResultSet.java:181)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218)
at org.apache.derby.impl.sql.execute.AnyResultSet.reopenCore(AnyResultSet.java:139)
at org.apache.derby.impl.sql.execute.AnyResultSet.openCore(AnyResultSet.java:103)
at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.g0(Unknown Source)
at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.e1(Unknown Source)
at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:141)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:273)
at org.apache.derby.impl.sql.execute.NormalizeResultSet.getNextRowCore(NormalizeResultSet.java:185)
at org.apache.derby.impl.sql.execute.DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127)
at org.apache.derby.impl.sql.execute.UpdateResultSet.collectAffectedRows(UpdateResultSet.java:440)
at org.apache.derby.impl.sql.execute.UpdateResultSet.open(UpdateResultSet.java:264)
at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436)
at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317)
at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1242)

locked <0x29b139f0> (a org.apache.derby.impl.jdbc.EmbedConnection40)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1686)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:308)
at org.apache.manifoldcf.core.database.Database.execute(Database.java:606)
at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:421)

A thread dump a couple of minutes later was identical, except for that one thread, which now looked like this:

"Thread-24341" daemon prio=6 tid=0x05934400 nid=0x1cd4 runnable [0x0816e000]
java.lang.Thread.State: RUNNABLE
at java.lang.Object.hashCode(Native Method)
at org.apache.derby.impl.services.locks.Lock.hashCode(Lock.java:145)
at java.util.HashMap.removeEntryForKey(HashMap.java:548)
at java.util.HashMap.remove(HashMap.java:538)
at org.apache.derby.impl.services.locks.ConcurrentLockSet.unlockReference(ConcurrentLockSet.java:799)
at org.apache.derby.impl.services.locks.LockSpace.unlockReference(LockSpace.java:276)
locked <0x29b14878> (a org.apache.derby.impl.services.locks.LockSpace)
at org.apache.derby.impl.services.locks.AbstractPool.unlock(AbstractPool.java:181)
at org.apache.derby.impl.store.raw.xact.RowLocking2.unlockRecordAfterRead(RowLocking2.java:185)
at org.apache.derby.impl.store.access.heap.HeapController.unlockRowAfterRead(HeapController.java:668)
at org.apache.derby.impl.store.access.btree.index.B2IRowLocking2.unlockScanRecordAfterRead(B2IRowLocking2.java:96)
at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:221)
at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNext(BTreeScan.java:1599)
at org.apache.derby.impl.sql.execute.TableScanResultSet.getNextRowCore(TableScanResultSet.java:577)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:261)
at org.apache.derby.impl.sql.execute.JoinResultSet.reopenCore(JoinResultSet.java:181)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.reopenCore(ProjectRestrictResultSet.java:218)
at org.apache.derby.impl.sql.execute.AnyResultSet.reopenCore(AnyResultSet.java:139)
at org.apache.derby.impl.sql.execute.AnyResultSet.openCore(AnyResultSet.java:103)
at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.g0(Unknown Source)
at org.apache.derby.exe.ac74f89535x012exc0d8x72ccx000000d1a4e011.e1(Unknown Source)
at org.apache.derby.impl.services.reflect.DirectCall.invoke(ReflectGeneratedClass.java:141)
at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(ProjectRestrictResultSet.java:273)
at org.apache.derby.impl.sql.execute.NormalizeResultSet.getNextRowCore(NormalizeResultSet.java:185)
at org.apache.derby.impl.sql.execute.DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127)
at org.apache.derby.impl.sql.execute.UpdateResultSet.collectAffectedRows(UpdateResultSet.java:440)
at org.apache.derby.impl.sql.execute.UpdateResultSet.open(UpdateResultSet.java:264)
at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436)
at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317)
at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1242)

locked <0x29b139f0> (a org.apache.derby.impl.jdbc.EmbedConnection40)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1686)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:308)
at org.apache.manifoldcf.core.database.Database.execute(Database.java:606)
at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:421)

This could just be a very-long-running query, blocking lots of other stuff. I'll have to run with the standard tricks to see what the locks wind up looking like. Stay tuned.

Karl Wright
added a comment - 16/Mar/11 23:00
at org.apache.derby.impl.sql.execute.DMLWriteResultSet.getNextRowCore(DMLWriteResultSet.java:127)
at org.apache.derby.impl.sql.execute.UpdateResultSet.collectAffectedRows(UpdateResultSet.java:440)
at org.apache.derby.impl.sql.execute.UpdateResultSet.open(UpdateResultSet.java:264)
at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436)
at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317)
at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1242)
locked <0x29b139f0> (a org.apache.derby.impl.jdbc.EmbedConnection40)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(EmbedPreparedStatement.java:1686)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(EmbedPreparedStatement.java:308)
at org.apache.manifoldcf.core.database.Database.execute(Database.java:606)
at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:421)
This could just be a very-long-running query, blocking lots of other stuff. I'll have to run with the standard tricks to see what the locks wind up looking like. Stay tuned.

More research confirms absolutely no progress of any kind over tens of minutes. ManifoldCF reports queries that run longer than 60 seconds, and there have been no such reports, which would have been expected if the queries were merely getting progressively slower. The number of rows in the involved tables is not that high either (about 64,000 in the biggest, 12,000 in the second biggest).
The data is hard to explain WITHOUT deadlock because of the persistence of threads waiting on the database; there is no churn here. And yet when I read the lock dump there is no obvious deadlock, although I am not sure I'm reading it right. Does anybody see anything that might be of use?

Karl Wright
added a comment - 17/Mar/11 00:00
By the way, this is a trunk build now.
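Since derby.locks.waitTimeout is set to infinity here, a stall like this never times out. As a diagnostic aid, the lock settings can be bounded before the embedded engine loads; a minimal sketch (the property names are Derby's documented derby.locks.* settings, the specific values are assumptions):

```java
// Sketch: run before the Derby embedded driver is loaded, so the
// engine picks these up as system-wide properties.
public class DerbyLockDiagnostics {
    public static void main(String[] args) {
        System.setProperty("derby.locks.waitTimeout", "60");      // seconds; -1 means wait forever
        System.setProperty("derby.locks.deadlockTimeout", "20");  // start deadlock search after 20s
        System.setProperty("derby.locks.monitor", "true");        // log timed-out lock requests
        System.setProperty("derby.locks.deadlockTrace", "true");  // dump the lock table on timeout
        System.out.println(System.getProperty("derby.locks.waitTimeout"));
    }
}
```

With these set, a stuck lock request fails after 60 seconds instead of hanging forever, and with derby.locks.deadlockTrace enabled the error includes a dump of the lock table; the lock table can also be inspected live via SELECT * FROM SYSCS_DIAG.LOCK_TABLE.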

Hey, it looks like it unwedged itself after some 30 minutes. ManifoldCF then complained about a lot of long-running queries. Some were long running because they were blocked, obviously, but at least one of them is the cause.

So this is cleared up; it's not a deadlock apparently, just a bad plan.

Found a query that took more than a minute (1533032 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash))]
Parameter 0: 'D'
Parameter 1: '-1'
Parameter 2: '1300315252437'
Parameter 3: 'F1C0F5CC2ED7A5D0B4AB58A8F202F025AA3BD133'
Found a query that took more than a minute (1521069 ms): [UPDATE intrinsiclink SET isnew=? WHERE jobid=? AND linktype=? AND parentidhash=? AND childidhash=?]
Parameter 0: 'E'
Parameter 1: '1300315252437'
Parameter 2: 'link'
Parameter 3: '5AD2BBC595BA81D08B9C0FBF7C435692E255154B'
Parameter 4: 'F75CCB0A5B55E281A506712D331CDD44F0D10F04'
Found a query that took more than a minute (1531593 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Parameter 0: 'F13204EB47B99ACE783AC62341763D4DEA64DAD3'
Parameter 1: 'B'
Parameter 2: '1300315252437'
Found a query that took more than a minute (1535056 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Parameter 0: 'C4FA1B2511EB16B47891A614290D7EFDC542EFB3'
Parameter 1: 'B'
Parameter 2: '1300315252437'
Found a query that took more than a minute (1529958 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Parameter 0: '01A734A295128E67F823F4371F084312EF9DBDFD'
Parameter 1: 'B'
Parameter 2: '1300315252437'
Found a query that took more than a minute (1529928 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Parameter 0: 'F27D1DB1C2E2927F7DD5BD7B8B613008CF0FEA52'
Parameter 1: 'B'
Parameter 2: '1300315252437'
Found a query that took more than a minute (1530203 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Parameter 0: 'F20EB42B4BDDBD5FE94C20873B5F06DC285F04D0'
Parameter 1: 'B'
Parameter 2: '1300315252437'
Found a query that took more than a minute (1514744 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE t1.parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid) AND t0.jobid=?]
Parameter 0: 'A36704A5DEB0591D5044DA8C0AF92997B8DCDA12'
Parameter 1: '1300315252437'
Found a query that took more than a minute (1526832 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Found a query that took more than a minute (1524210 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Parameter 0: 'F8848399A2AE8C7C613E55AAC80054ED6D4C4996'
Parameter 1: 'B'
Parameter 2: '1300315252437'
Parameter 0: '037C78E60C534C68924C36E80D6A469B7371987C'
Parameter 1: 'B'
Parameter 2: '1300315252437'
Found a query that took more than a minute (1627441 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
Found a query that took more than a minute (1627444 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash))]
Parameter 0: 'D'
Parameter 0: 'D'
Parameter 1: '-1'
Parameter 1: '-1'
Parameter 2: '1300315252437'
Parameter 2: '1300315252437'
Parameter 3: '01A734A295128E67F823F4371F084312EF9DBDFD'
Parameter 3: 'A36704A5DEB0591D5044DA8C0AF92997B8DCDA12'
Parameter 4: 'B'
Found a query that took more than a minute (1627445 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
Parameter 0: 'D'
Parameter 1: '-1'
Parameter 2: '1300315252437'
Parameter 3: 'F8848399A2AE8C7C613E55AAC80054ED6D4C4996'
Parameter 4: 'B'
Found a query that took more than a minute (1627427 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
Parameter 0: 'D'
Parameter 1: '-1'
Parameter 2: '1300315252437'
Parameter 3: 'F27D1DB1C2E2927F7DD5BD7B8B613008CF0FEA52'
Parameter 4: 'B'
Found a query that took more than a minute (1627444 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
Parameter 0: 'D'
Parameter 1: '-1'
Parameter 2: '1300315252437'
Parameter 3: 'F13204EB47B99ACE783AC62341763D4DEA64DAD3'
Parameter 4: 'B'
Found a query that took more than a minute (1627445 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
Parameter 0: 'D'
Parameter 1: '-1'
Parameter 2: '1300315252437'
Parameter 3: 'C4FA1B2511EB16B47891A614290D7EFDC542EFB3'
Parameter 4: 'B'
Found a query that took more than a minute (1627429 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id IN(SELECT ownerid FROM hopdeletedeps t0 WHERE ((t0.jobid=? AND t0.childidhash=?)) AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE t1.linktype=t0.linktype AND t1.jobid=t0.jobid AND t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND t1.isnew=?))]
Parameter 0: 'D'
Parameter 1: '-1'
Parameter 2: '1300315252437'
Parameter 3: '037C78E60C534C68924C36E80D6A469B7371987C'
Parameter 4: 'B'
Found a query that took more than a minute (3095263 ms): [SELECT jobid,CAST(COUNT(dochash) AS BIGINT) AS doccount FROM jobqueue t1 GROUP BY jobid]
Found a query that took more than a minute (1622113 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Parameter 0: '035F0D4F860F38F039F3C0D4D642D0CA2F83D349'
Parameter 1: 'B'
Parameter 2: '1300315252437'
Found a query that took more than a minute (1627657 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Parameter 0: 'F75CCB0A5B55E281A506712D331CDD44F0D10F04'
Parameter 1: 'B'
Parameter 2: '1300315252437'
Found a query that took more than a minute (1622790 ms): [SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN (?) AND t1.childidhash=t0.dochash AND t0.jobid=t1.jobid AND t1.isnew=?) AND t0.jobid=?]
Parameter 0: '01595CFF50BEE46E068F5AEBB539483F73C672AC'
Parameter 1: 'B'
Parameter 2: '1300315252437'

Karl Wright
added a comment - 17/Mar/11 00:28

Turning on query plan output had some interesting results. The following queries are doing full table scans. In all cases, I cannot see why the planner does not use an existing index that was created specifically to support the query. The other interesting thing to note is that NONE of these queries appeared in my list of "long running queries" above, so they were not involved in the 30-minute stall recorded there. But maybe the underlying cause is similar?

SELECT parentidhash FROM intrinsiclink WHERE (jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?) OR(jobid=? AND linktype=? AND parentidhash=? AND childidhash=?)
(identical index exists: intrinsiclink: (jobid,linktype,parentidhash,childidhash) )

SELECT id,distance,linktype FROM hopcount WHERE (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?)
(identical index exists: hopcount: (jobid,linktype,parentidhash) )

SELECT id FROM jobqueue WHERE (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?) OR (jobid=? AND status=?)
(identical index exists: jobqueue: (jobid,status) )

SELECT parentidhash,linktype,distance FROM hopcount WHERE (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?) OR (jobid=? AND linktype=? AND parentidhash=?)
(identical index exists: hopcount: (jobid,linktype,parentidhash) )

FWIW, all of these queries use the corresponding index when run under PostgreSQL.
I'm happy to create a new ticket for this problem, if appropriate. Comments?

Karl Wright
added a comment - 17/Mar/11 07:51

Thanks for verifying that the infinite waiting has been fixed, Karl. I think it would be better to file a separate issue for investigating the bad plan. I'm not sure we currently have any good way of using indexes for the above query, though, except by doing a range scan of the index between the smallest and largest value of (jobid,linktype,parentidhash) in the supplied list of predicates, but that may very well degenerate to a full scan in many cases, depending on the actual values. We do have something called multi-probe scans that might be useful for such queries, but my understanding is that they are only used for single-column keys.
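What a multi-column multi-probe scan would buy here can be sketched with a sorted map standing in for the B-tree index on hopcount(jobid,linktype,parentidhash): one probe per OR-branch instead of a full scan. The keys and row ids below are made up for illustration:

```java
import java.util.*;

public class MultiProbeSketch {
    // Composite key for the hypothetical index on (jobid, linktype, parentidhash).
    static String key(String jobid, String linktype, String parentidhash) {
        return jobid + "|" + linktype + "|" + parentidhash;
    }

    // Multi-probe: one O(log n) lookup per OR-branch instead of scanning the index.
    static List<Long> probe(NavigableMap<String, Long> index, List<String> probes) {
        List<Long> hits = new ArrayList<>();
        for (String p : probes) {
            Long rowId = index.get(p);
            if (rowId != null) hits.add(rowId);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for the index; values are illustrative row ids.
        NavigableMap<String, Long> index = new TreeMap<>();
        index.put(key("1300315252437", "link", "AAAA"), 1L);
        index.put(key("1300315252437", "link", "BBBB"), 2L);
        index.put(key("1300315252437", "link", "CCCC"), 3L);

        // The OR list from the query, as probe keys.
        List<String> probes = List.of(
                key("1300315252437", "link", "AAAA"),
                key("1300315252437", "link", "CCCC"));
        System.out.println(probe(index, probes)); // prints [1, 3]
    }
}
```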

Knut Anders Hatlen
added a comment - 17/Mar/11 09:23
One quick check that might indicate whether the problem is in the optimizer, or whether we're just lacking the proper access methods to use the indexes efficiently, would be to add an optimizer override to force the use of the index - http://db.apache.org/derby/docs/10.7/tuning/ctundepthoptover.html .
You may also want to make sure that the optimizer has up to date statistics for the indexes by calling SYSCS_UTIL.SYSCS_UPDATE_STATISTICS on the tables involved in the query - http://db.apache.org/derby/docs/10.7/ref/rrefupdatestatsproc.html .
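The two suggestions above can be sketched as follows. The index and schema names (I_JOBQUEUE_JOBID_STATUS, MCF) are assumptions; only the --DERBY-PROPERTIES syntax and the SYSCS_UPDATE_STATISTICS signature come from the Derby docs:

```java
// Sketch: building the overridden statement text. Note the override is a
// SQL line comment, so it runs to end of line and the WHERE clause must
// start on the next line.
public class OverrideSketch {
    static final String FORCED_INDEX_QUERY =
        "SELECT id FROM jobqueue --DERBY-PROPERTIES index=I_JOBQUEUE_JOBID_STATUS\n"
      + "WHERE (jobid=? AND status=?) OR (jobid=? AND status=?)";

    // Refresh cardinality statistics for one table (null = all of its indexes).
    static final String UPDATE_STATS =
        "CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS('MCF', 'JOBQUEUE', null)";

    public static void main(String[] args) {
        System.out.println(FORCED_INDEX_QUERY);
        System.out.println(UPDATE_STATS);
    }
}
```

If the forced plan is fast while the default plan table-scans, the problem is in the optimizer's costing rather than a missing access method.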

I cannot get the update-statistics procedure to run. For every schema I give it except SYS, it tells me the schema is not found, even though I have explicitly created the schema. It doesn't accept APP either. For SYS, it complains that ALTER TABLE is not allowed.

In ij, "SHOW SCHEMAS" does not list all the schemas I've created either:

Karl Wright
added a comment - 17/Mar/11 16:02
TABLE_SCHEM
------------------------------
APP
NULLID
SQLJ
SYS
SYSCAT
SYSCS_DIAG
SYSCS_UTIL
SYSFUN
SYSIBM
SYSPROC
SYSSTAT
11 rows selected
... so I am not sure what is going on with this.

Karl Wright
added a comment - 17/Mar/11 16:19 More research shows that the problem is that you cannot calculate statistics for tables that are part of the APP (default) schema. Tables in any other schema are fine.

Knut Anders Hatlen
added a comment - 23/Mar/11 12:21 I've added some more comments to the Deadlock class (see the patch comments.diff):
- a high-level description of the algorithm in the class javadoc
- more javadoc for the methods
- expanded comments inside the actual detection algorithm in Deadlock.look()
Committed revision 1084561.
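The algorithm those comments describe can be sketched as a cycle search over the wait-for graph. A hedged, self-contained illustration (the transaction names and the map-based graph representation are made up, not Derby's actual data structures):

```java
import java.util.*;

public class WaitForGraphSketch {
    // waitsFor maps a transaction to the set of transactions holding the lock
    // it is waiting for. A path from the starter back to itself is a deadlock.
    static boolean hasDeadlock(Map<String, Set<String>> waitsFor, String start) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String tx = stack.pop();
            for (String holder : waitsFor.getOrDefault(tx, Set.of())) {
                if (holder.equals(start)) return true;   // cycle back to the starter
                if (visited.add(holder)) stack.push(holder);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = new HashMap<>();
        g.put("T1", Set.of("T2"));
        g.put("T2", Set.of("T3"));
        g.put("T3", Set.of("T1"));   // T1 -> T2 -> T3 -> T1
        System.out.println(hasDeadlock(g, "T1")); // prints true
        g.put("T3", Set.of());       // break the cycle
        System.out.println(hasDeadlock(g, "T1")); // prints false
    }
}
```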

Thanks for all the help with tracking this down, Karl, and for verifying the fix! I'm resolving the issue as a duplicate of DERBY-3980, since the two issues turned out to be caused by the same underlying bug(s) in the deadlock detection algorithm.

Knut Anders Hatlen
added a comment - 23/Mar/11 12:27