Windows XP, same problem occurred using Apache Derby 10.5.3.0 and still with the newly release 10.6.1.0 version.
For completeness, also noting using c3p0 0.9.1-pre10 and hibernate 3.0.5. Attempted trying later versions of both (c3p0 and hibernate) to see if this resulted in a workaround, but had no success.

Description

Running into a NullPointerException error in the Apache Derby database over multiple versions of the derby jars. From testing, this issue intermittently occurs during moderate load test scenarios, but has never occurred in production. This is using Derby as embedded and always occurs on the same statement as shown below and in the attachment. Following the error, hibernate throws an exception which results in the code attempting to rollback the transaction. The rollback fails as the NullPointerException appears to kill the connection.

When I look directly under C:\p4clients\svn10.5\client1\10.5\systest\out142\derbyall\derbynetclientmats\DerbyNetClient\derbynetmats\derbynetmats, I don't see NPE in the files created for dblook_test_net

I have attached the top level derbyall suite report files for reference.

I ran several hours of load against the 952726 nightly build jars with no errors. Previously, I was hitting this issue within 5-10 minutes on every run, so this appears to be a good fix. Looks good. Thanks.

Seth Katzman
added a comment - 09/Jun/10 16:59 I ran several hours of load against the 952726 nightly build jars with no errors. Previously, I was hitting this issue within 5-10 minutes on every run, so this appears to be a good fix. Looks good. Thanks.

Merged fix to 10.6 and committed revision 952585.
Marking the issue as resolved.

Seth, if you want to verify the fix in your environment, you can either check out the 10.6 branch and build your own jars, or you can wait until the fix has been picked up by the nightly build of the 10.6 branch at http://dbtg.foundry.sun.com/derby/bits/10.6/ (download one with SVN revision number at or above 952585 when available). Please let us know how the testing goes.

Knut Anders Hatlen
added a comment - 08/Jun/10 10:39 Merged fix to 10.6 and committed revision 952585.
Marking the issue as resolved.
Seth, if you want to verify the fix in your environment, you can either check out the 10.6 branch and build your own jars, or you can wait until the fix has been picked up by the nightly build of the 10.6 branch at http://dbtg.foundry.sun.com/derby/bits/10.6/ (download one with SVN revision number at or above 952585 when available). Please let us know how the testing goes.

Knut Anders Hatlen
added a comment - 02/Jun/10 09:38 Uploading a patch for this issue (derby-4676-1a.diff) with the following changes:
M java/engine/org/apache/derby/impl/store/access/conglomerate/OpenConglomerate.java
Updated javadoc for latchPage(), lockPositionForRead() and lockPositionForWrite() with info about the expected state upon return if the row has been removed.
M java/engine/org/apache/derby/impl/store/access/conglomerate/GenericConglomerateController.java
Added a check to see if the row had been removed after calling lockPositionForRead/Write. If so, return false from fetch(), as specified by the ConglomerateController#fetch javadoc.
M java/testing/org/apache/derbyTesting/functionTests/tests/store/_Suite.java
A java/testing/org/apache/derbyTesting/functionTests/tests/store/Derby4676Test.java
Added a regression test that fails (intermittently) if the bug is present.
Derbyall and suites.All ran cleanly with the patch. I've also run the new regression test 500 times without seeing any errors.

Knut Anders Hatlen
added a comment - 01/Jun/10 12:19 I'm attaching a JUnit regression test case for the bug. It doesn't always fail, since the bug is highly dependent on the timing, but it triggers the NPE in about every other run in my environment.

Thanks Bryan. I'll go ahead with adding the null check and updating the comment, as well as adding a store test that exposes the problem. I'll wait for DERBY-4678 to go in first, so that we only have one fetch() method to fix in GenericConglomerateController.

Knut Anders Hatlen
added a comment - 28/May/10 08:35 Thanks Bryan. I'll go ahead with adding the null check and updating the comment, as well as adding a store test that exposes the problem. I'll wait for DERBY-4678 to go in first, so that we only have one fetch() method to fix in GenericConglomerateController.

Bryan Pendleton
added a comment - 28/May/10 02:24 Given a choice between changing the code, or changing the comments,
I usually choose to change the comments.
I think you have isolated and explained the problem clearly; either of
your proposed fixes seems reasonable to me.

fetch() first latches the heap page on which the indexed row is supposed to be found. Then it attempts to lock the row.

If the row cannot be locked immediately, the latch is released before the thread is suspended while it waits for the lock to be obtained. Once it wakes up again, it will re-obtain the latch on the heap page. OpenConglomerate.latchPage() additionally checks if the row is still present on the page, and if it isn't, it will release the latch again and return false. OpenConglomerate.lockPositionForRead/Write however doesn't check the return value from latchPage() and doesn't distinguish between the case where the latch was successfully re-obtained and the case where it was not.

Another simpler fix than what I suggested above, would be to check if pos.current_page is null after the attempt to lock the row, and assume that this means the row (or the entire page) is gone. In that case we should update the javadoc comment for lockPositionForRead() to state that this is how it behaves, and remove the sentence that incorrectly states that it raises an exception.

Knut Anders Hatlen
added a comment - 27/May/10 14:02 Hi Bryan,
'Yes' is the answer to all you questions above.
fetch() first latches the heap page on which the indexed row is supposed to be found. Then it attempts to lock the row.
If the row cannot be locked immediately, the latch is released before the thread is suspended while it waits for the lock to be obtained. Once it wakes up again, it will re-obtain the latch on the heap page. OpenConglomerate.latchPage() additionally checks if the row is still present on the page, and if it isn't, it will release the latch again and return false. OpenConglomerate.lockPositionForRead/Write however doesn't check the return value from latchPage() and doesn't distinguish between the case where the latch was successfully re-obtained and the case where it was not.
Another simpler fix than what I suggested above, would be to check if pos.current_page is null after the attempt to lock the row, and assume that this means the row (or the entire page) is gone. In that case we should update the javadoc comment for lockPositionForRead() to state that this is how it behaves, and remove the sentence that incorrectly states that it raises an exception.

Doesn't GenericConglomerateController.fetch latch the page just before
it calls lockPositionForRead? Is the patch latch being given up inside the
lock call, and then re-acquired? Is that why this only shows up under very
high concurrency and lock contention scenarios?

Bryan Pendleton
added a comment - 27/May/10 01:50 Doesn't GenericConglomerateController.fetch latch the page just before
it calls lockPositionForRead? Is the patch latch being given up inside the
lock call, and then re-acquired? Is that why this only shows up under very
high concurrency and lock contention scenarios?

It is set to null by OpenConglomerate.latchPage() (called from fetch() via OpenConglomerate.lockPositionForRead()) when it discovers that the record was deleted by another transaction before the row was locked.

One thing to note is that lockPositionForRead()'s javadoc says:

@param moveForwardIfRowDisappears

If true, then this routine must handle the case where

the row id we are waiting on disappears when the latch

is released. If false an exception will be thrown if

the row disappears.

However, there is no exception thrown by this method when the row has disappeared, even though fetch() calls it with moveForwardIfRowDisappears=false.

I think one possible fix is to make lockPositionForRead() raise an exception in this situation (can be detected by checking the return value from latchPage(), which is currently not checked). The javadoc for fetch() says that it should return false if the row has disappeared, so it could catch that exception and return false.

fetch() is the only method that calls lockPositionForRead() with moveForwardIfRowDisappears=false, so such a change should only affect the case where we currently see a NullPointerException.

Knut Anders Hatlen
added a comment - 26/May/10 14:28 I looked at this in a debugger, and it seems like it's pos.current_page that's null in this code in GenericConglomerateController.fetch():
boolean ret_val =
(pos.current_page.fetchFromSlot(
pos.current_rh, pos.current_slot,
row,
new FetchDescriptor(
row.length, validColumns, (Qualifier[][]) null),
false) != null);
It is set to null by OpenConglomerate.latchPage() (called from fetch() via OpenConglomerate.lockPositionForRead()) when it discovers that the record was deleted by another transaction before the row was locked.
One thing to note is that lockPositionForRead()'s javadoc says:
@param moveForwardIfRowDisappears
If true, then this routine must handle the case where
the row id we are waiting on disappears when the latch
is released. If false an exception will be thrown if
the row disappears.
However, there is no exception thrown by this method when the row has disappeared, even though fetch() calls it with moveForwardIfRowDisappears=false.
I think one possible fix is to make lockPositionForRead() raise an exception in this situation (can be detected by checking the return value from latchPage(), which is currently not checked). The javadoc for fetch() says that it should return false if the row has disappeared, so it could catch that exception and return false.
fetch() is the only method that calls lockPositionForRead() with moveForwardIfRowDisappears=false, so such a change should only affect the case where we currently see a NullPointerException.

I am able to reproduce the bug with the attached java class. The repro has two threads running. One of them repeatedly inserts and deletes a record in a table. The other thread repeatedly reads rows from the same table, using an index on one of the columns.

Here's the exception I see on trunk:

$ java D4676
java.sql.SQLException: Java exception: ': java.lang.NullPointerException'.
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:95)
at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:142)
at org.apache.derby.impl.jdbc.Util.javaException(Util.java:299)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(TransactionResourceImpl.java:403)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:346)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2269)
at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:81)
at org.apache.derby.impl.jdbc.EmbedResultSet.closeOnTransactionError(EmbedResultSet.java:4321)
at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:469)
at org.apache.derby.impl.jdbc.EmbedResultSet.next(EmbedResultSet.java:373)
at D4676$Reader.run(D4676.java:53)
Caused by: java.sql.SQLException: Java exception: ': java.lang.NullPointerException'.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(SQLExceptionFactory40.java:119)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:70)
... 10 more
Caused by: java.lang.NullPointerException
at org.apache.derby.impl.store.access.conglomerate.GenericConglomerateController.fetch(GenericConglomerateController.java:291)
at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.getNextRowCore(IndexRowToBaseRowResultSet.java:389)
at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:477)
at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:429)
... 2 more

Knut Anders Hatlen
added a comment - 25/May/10 18:44 Thanks, Seth.
I am able to reproduce the bug with the attached java class. The repro has two threads running. One of them repeatedly inserts and deletes a record in a table. The other thread repeatedly reads rows from the same table, using an index on one of the columns.
Here's the exception I see on trunk:
$ java D4676
java.sql.SQLException: Java exception: ': java.lang.NullPointerException'.
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:95)
at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Util.java:142)
at org.apache.derby.impl.jdbc.Util.javaException(Util.java:299)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(TransactionResourceImpl.java:403)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:346)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2269)
at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:81)
at org.apache.derby.impl.jdbc.EmbedResultSet.closeOnTransactionError(EmbedResultSet.java:4321)
at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:469)
at org.apache.derby.impl.jdbc.EmbedResultSet.next(EmbedResultSet.java:373)
at D4676$Reader.run(D4676.java:53)
Caused by: java.sql.SQLException: Java exception: ': java.lang.NullPointerException'.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(SQLExceptionFactory40.java:119)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:70)
... 10 more
Caused by: java.lang.NullPointerException
at org.apache.derby.impl.store.access.conglomerate.GenericConglomerateController.fetch(GenericConglomerateController.java:291)
at org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.getNextRowCore(IndexRowToBaseRowResultSet.java:389)
at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:477)
at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:429)
... 2 more

I have attached the new logs (derby and application) using the 10.6.1.0 debug jars. The derby log file shows the initial NullPointerException with source line numbers. The application log file shows the debug dump of thread states with an assert failure.

I also double checked the transaction isolation level of the connection being used. The transaction isolation level is always returning 2 which looks to be TRANSACTION_READ_COMMITTED.

Seth Katzman
added a comment - 25/May/10 15:37 I have attached the new logs (derby and application) using the 10.6.1.0 debug jars. The derby log file shows the initial NullPointerException with source line numbers. The application log file shows the debug dump of thread states with an assert failure.
I also double checked the transaction isolation level of the connection being used. The transaction isolation level is always returning 2 which looks to be TRANSACTION_READ_COMMITTED.

Thanks for the quick reply. I'll grab the debug jars and attempt to recreate the exception condition. I'll add further comments/enclosures as I find out more (connection transaction isolation and debug logs).

If you have the chance, it would be great if you could try to run with a debug version of Derby (available from the download page, look for the lib-debug tarball) so that we get line numbers in the stack trace and perhaps an assert failure with more information.

Another thing that came to mind: What's the transaction isolation level for the connection? I was wondering if this could be caused by another thread deleting a row right after it was found in the index and before it was read from the heap, which I think may happen if the isolation level is read uncommitted.

Knut Anders Hatlen
added a comment - 24/May/10 23:11 Hi Seth,
If you have the chance, it would be great if you could try to run with a debug version of Derby (available from the download page, look for the lib-debug tarball) so that we get line numbers in the stack trace and perhaps an assert failure with more information.
Another thing that came to mind: What's the transaction isolation level for the connection? I was wondering if this could be caused by another thread deleting a row right after it was found in the index and before it was read from the heap, which I think may happen if the isolation level is read uncommitted.