Details

Discovered on Ubuntu 12.04 with Oracle JDK 1.7.0_25 running Clojure 1.4. Test case produced on Windows 8 with Oracle JDK 1.7.0_09 running Clojure 1.4 and Clojure 1.5.1. Analysis seems to indicate that OS and Java version are not critical. Clojure 1.6 pre-release code has not been tested, but since clojure.lang.LockingTransaction has not changed since Clojure 1.4, it seems likely the defect is still present.

Discovered on Ubuntu 12.04 with Oracle JDK 1.7.0_25 running Clojure 1.4. Test case produced on Windows 8 with Oracle JDK 1.7.0_09 running Clojure 1.4 and Clojure 1.5.1. Analysis seems to indicate that OS and Java version are not critical. Clojure 1.6 pre-release code has not been tested, but since clojure.lang.LockingTransaction has not changed since Clojure 1.4, it seems likely the defect is still present.

Patch:

Code

Approval:

Vetted

Description

Summary

Using the Clojure 1.4 library strictly from Java code, a simple transaction dispatches an action to an Agent. When called from a simple driver, such as a unit test, where there is no interaction with the Clojure library/runtime (specifically, clojure.lang.RT), a ConcurrentModificationException is thrown from inside LockingTransaction.run() while it is iterating through the actions list, dispatching each action to its Agent after committing the transaction.

While the circumstances under which this occurs are probably fairly rare and a simple workaround exists (see final paragraph), thus the "Minor" priority, it seems like it would not be very complicated to fix LockingTransaction to handle the actions list more safely.

Analysis

Based on some debugging, here's what seems to be happening:

Transaction A is run, dispatching action Z, which gets added, via LockingTransaction.enqueue(), to the actions list, which is a java.util.ArrayList<Agent.Action>.

Transaction A completes and is successfully committed.

LockingTransaction.run() does post-commit cleanup, freeing locks and putting a stop() to transaction A, which nulls the transaction's Info object reference.

Notifications are sent and we start iterating the list of actions to be dispatched.

The run() method calls Agent.dispatchAction(). Because the thread's transaction object is no longer considered to be "running" (due to the Info object being null) and no action is being processed on the thread (so its nested vector is null), the action is enqueue()-ed with the Agent.

As part of the enqueue() process, the action is cons()-ed onto the Agent's ActionQueue. Here's where the unique circumstances come into play.

At this point, we haven't really interacted with the Clojure runtime, specifically the clojure.lang.RT class, so its initiation machinery kicks in.

Down in the depths, it executes transaction B to add a library to its list of loaded libraries.

The still-existing-but-not-running thread-local transaction object fires up, runs and commits, with its existing, intact actions list, still containing action Z, enqueued during transaction A, which has not yet finished its post-commit process.

The post-commit process for transaction B runs, including a nested attempt to dispatch action Z, again, which succeeds.

The actions list is cleared before exiting the run() method.

Upon returning way back up the stack to our not-quite-finished-post-processing transaction A, we continue iterating the now-cleared actions list, which promptly throws the ConcurrentModificationException.

A quick perusal of the LockingTransaction code shows that the only interaction with the actions list is adding an item to it in the LockingTransaction.enqueue() method, iterating it in the post-processing section of run() and clearing it in the finally clause of that section, so it's easy to see how a transaction started by any of the action-dispatching machinery would cause problems. Any such activity in the actions themselves would not be an issue, since they'd occur on other threads, but the dispatch stuff all runs on the same thread. The few moving parts that occur in this code seem fairly safe, as long as the runtime, clojure.lang.RT, is already initialized, but if that occurs during this phase, all bets appear to be off.

Test Case

The attached Java class can be compiled and run with just the Clojure 1.4 JAR on the class path. With the change described near the end of the file (comment one line and uncomment another), the Clojure 1.5.1 JAR can be used, instead, producing the same result.

A single Agent named count is created, holding an Integer value of 1. A transaction is run which dispatches an action (referred to as Z in the above description) that will increment the value of count to 2. Following this, another action is dispatched to count to enable monitoring the completion of the incrementing action. Lastly, the final value of count is printed before the application exits.

Running the class with no command-line arguments produces the above-mentioned exception and prints an incorrect final result, due to action Z being run a second time as described in step 6.4. Running with any command-line argument triggers a simple workaround that just references a static value from the clojure.lang.RT class, which invokes the class initialization before anything else happens, such that the exception is not thrown and the correct result is produced.

The additional detail provided in the last comment is correct and matches what I observed. However, protecting the "actions" list against enqueuing isn't really the issue, since this problem would occur even if the "nested" transaction didn't involve any actions. The problem is that committing a transaction clears the "actions" list.

I think the suggestion to trigger initialization of RT in the static initializer of LockingTransaction would address this issue, though in a somewhat roundabout way. A more direct approach might be to make a temporary copy of the contents of the "actions" list prior to iterating it. Obviously, the latter would make extra work on every transaction commit, so it may not be desirable. I don't know what impact the former approach would have, other than the long-term effect of it being non-obvious to a later observer why that code is needed.

I haven't thought this through completely or done any tests, so my reasoning may be wrong, but one more argument for the approach of making a copy of the "actions" list is that a Ref notification/watch could run a transaction, which, being on the same thread, would also clear the "actions" list. This case wouldn't cause the exception, since the iteration of the list would not have started, yet, but it would cause actions enqueued by the "outer" transaction to not be dispatched.

Brandon Ibach
added a comment - 18/Oct/13 2:02 PM The additional detail provided in the last comment is correct and matches what I observed. However, protecting the "actions" list against enqueuing isn't really the issue, since this problem would occur even if the "nested" transaction didn't involve any actions. The problem is that committing a transaction clears the "actions" list.
I think the suggestion to trigger initialization of RT in the static initializer of LockingTransaction would address this issue, though in a somewhat roundabout way. A more direct approach might be to make a temporary copy of the contents of the "actions" list prior to iterating it. Obviously, the latter would make extra work on every transaction commit, so it may not be desirable. I don't know what impact the former approach would have, other than the long-term effect of it being non-obvious to a later observer why that code is needed.
I haven't thought this through completely or done any tests, so my reasoning may be wrong, but one more argument for the approach of making a copy of the "actions" list is that a Ref notification/watch could run a transaction, which, being on the same thread, would also clear the "actions" list. This case wouldn't cause the exception, since the iteration of the list would not have started, yet, but it would cause actions enqueued by the "outer" transaction to not be dispatched.

Guillermo Winkler
added a comment - 18/Oct/13 2:26 PM - edited You're right, problem is not on enqueue are inner transactions clearing the list.
Anyway the protection can be a smarter way of iteration.
Wouldn't a possible solution be treating the action list as a Queue? Pop'in each action would also clear only that action.

Andy Fingerhut
added a comment - 19/Oct/13 5:32 PM Patch clj-1260-fixws.diff is identical to Guillermo's clj-1260.diff, except it eliminates warnings when applying the patch that were due to trailing white space.

1. User's Java code calls LockingTransaction.runInTransaction.
2. In the transaction's Callable, user calls Agent.dispatch.
3. LockingTransaction adds the agent action to its actions list.
4. The transaction commits.
5. LockingTransaction begins iterating over actions and dispatching them.
6. Agent.dispatchAction checks to see if there is a running transaction, which there is not.
7. Agent.dispatchAction line 255 calls Agent.enqueue.
8. Agent.enqueue line 264 calls PersistentQueue.cons.
9. PersistentQueue.cons line 127 calls RT.list, triggering initization of RT.
10. RT's static initializer calls doInit.
11. doInit loads core.clj.
12. core.clj loads other files which contain ns declarations.
13. ns expands to code which commutes loaded-libs in a transaction.
14. The new transaction gets the same ThreadLocal LockingTransaction.
15. The new transaction executes and clears actions.
16. Back up the stack in the original transaction, LockingTransaction tries to continue iterating over actions and throws ConcurrentModificationException.

In short, this situation can only occur when the mere act ofenqueueing agent actions, not executing them, can create another
transaction on the same thread. This can happen when clojure.lang.RT
has not been initialized.

This would never occur in a Clojure program because RT is already
initialized when the Clojure program starts, and none of the Java code
involved in enqueueing agent actions will create new transactions.

The workaround for Java is trivial: just make sure clojure.lang.RT is
initialized before using any other Clojure classes.

However, the patch clj-1260-fixws.diff is a small and safe change. Its
author, Guillermo Winkler, has a CA.

The approach is to treat the actions list in LockingTransaction
as a queue, popping actions off one at a time, and never to
explicitly clear the queue.

The more important question is if/when RT should be initialized
automatically.

Stuart Sierra
added a comment - 17/Jan/14 1:27 PM - edited Here's my understanding of what happens:
1. User's Java code calls LockingTransaction.runInTransaction.
2. In the transaction's Callable, user calls Agent.dispatch.
3. LockingTransaction adds the agent action to its actions list.
4. The transaction commits.
5. LockingTransaction begins iterating over actions and dispatching them.
6. Agent.dispatchAction checks to see if there is a running transaction, which there is not.
7. Agent.dispatchAction line 255 calls Agent.enqueue.
8. Agent.enqueue line 264 calls PersistentQueue.cons.
9. PersistentQueue.cons line 127 calls RT.list, triggering initization of RT.
10. RT's static initializer calls doInit.
11. doInit loads core.clj.
12. core.clj loads other files which contain ns declarations.
13. ns expands to code which commutes loaded-libs in a transaction.
14. The new transaction gets the same ThreadLocal LockingTransaction.
15. The new transaction executes and clears actions.
16. Back up the stack in the original transaction, LockingTransaction tries to continue iterating over actions and throws ConcurrentModificationException.
In short, this situation can only occur when the mere act of
enqueueing agent actions, not executing them, can create another
transaction on the same thread. This can happen when clojure.lang.RT
has not been initialized.
This would never occur in a Clojure program because RT is already
initialized when the Clojure program starts, and none of the Java code
involved in enqueueing agent actions will create new transactions.
The workaround for Java is trivial: just make sure clojure.lang.RT is
initialized before using any other Clojure classes.
However, the patch clj-1260-fixws.diff is a small and safe change. Its
author, Guillermo Winkler, has a CA.
The approach is to treat the actions list in LockingTransaction
as a queue, popping actions off one at a time, and never to
explicitly clear the queue.
The more important question is if/when RT should be initialized
automatically.

Guillermo Winkler
added a comment - 17/Jan/14 1:46 PM Solving this problem by initializing RT feels to me like complecting two things not necessarily related, if PersistentQueue stops using RT, relationship disappears and you may still have the bug.
Anyway RT could be initialized safely so anyone that may depend on it finds it already there, avoiding potential clashes on initialization.

Alex Miller
added a comment - 17/Jan/14 1:49 PM And following up on that, Clojure 1.6 has a new public Java API in clojure.java.api.Clojure. Loading that class or using any method on the class will serve to load clojure.lang.RT as a side effect.

When you are using Clojure from Java, it is expected that you should properly initialize the Clojure runtime by (at least) loading the RT class (for Clojure <1.6) or the Clojure class (for Clojure >= 1.6).

Alex Miller
added a comment - 21/Jan/14 9:32 AM When you are using Clojure from Java, it is expected that you should properly initialize the Clojure runtime by (at least) loading the RT class (for Clojure <1.6) or the Clojure class (for Clojure >= 1.6).