Description

When used within a servlet container (Jetty, Tomcat, JBossAS, Immutant, etc.), the thread locals Var.dvals (used to store dynamic bindings) and LockingTransaction.transaction (used to store the currently active transaction) prevent all of the classes loaded by an application's Clojure runtime from being garbage collected, resulting in a memory leak.

Cause: The issue comes from threads living beyond the lifetime of a deployment: servlet containers use thread pools that are shared across all applications within the container. Currently, the dvals and transaction thread locals are not discarded when they are no longer needed, so their contents keep a strong reference to the application's classloader, which in turn causes all of the classes loaded under that classloader to be retained until the thread exits (generally at JVM shutdown).
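The mechanism described above can be reproduced outside a container. The following is a minimal sketch, assuming a single-threaded executor as a stand-in for the container's shared thread pool; PooledThreadLeakDemo, SLOT, and valueSurvives are hypothetical names and do not appear in Clojure or in any container.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class: a one-thread pool plays the role of the container's worker pool.
class PooledThreadLeakDemo {
    // Stands in for a thread local such as Var.dvals.
    static final ThreadLocal<Object> SLOT = new ThreadLocal<>();

    // Returns true if a value set by one task is still visible to a later task
    // that runs on the same pooled thread.
    static boolean valueSurvives(boolean removeWhenDone) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // First task: the "application" stores a value on the pooled thread.
            pool.submit(() -> {
                SLOT.set(new Object()); // imagine an object loaded by the app's classloader
                if (removeWhenDone) {
                    SLOT.remove();
                }
            }).get();
            // Second task: same worker thread, after the first task has finished.
            return pool.submit(() -> SLOT.get() != null).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling valueSurvives(false) returns true, because the value stays reachable from the worker thread after the task ends; valueSurvives(true) returns false. In a container the retained value would pin the application's classloader in the same way.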

Solution: I've attached a patch that does the following:

Var.dvals is initialized to a canonical TOP Frame

Var.dvals is now removed when the thread bindings are popped to the TOP

The outer transaction in LockingTransaction.transaction now removes the thread local when it is finished
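The first two changes listed above might look roughly like the following. This is a sketch of the idea only, not the code in clojure.lang.Var: BindingStackSketch, Frame, FRAMES, push, pop, and depth are hypothetical names, and the frame contents are reduced to a single Object.

```java
// Hypothetical names throughout; this illustrates the idea and is not the code in clojure.lang.Var.
class BindingStackSketch {
    static final class Frame {
        final Object bindings; // whatever the frame carries
        final Frame prev;      // null only for TOP

        Frame(Object bindings, Frame prev) {
            this.bindings = bindings;
            this.prev = prev;
        }
    }

    // Canonical sentinel meaning "this thread has no bindings".
    static final Frame TOP = new Frame(null, null);

    static final ThreadLocal<Frame> FRAMES = new ThreadLocal<Frame>() {
        @Override
        protected Frame initialValue() {
            return TOP;
        }
    };

    static void push(Object bindings) {
        FRAMES.set(new Frame(bindings, FRAMES.get()));
    }

    static void pop() {
        Frame prev = FRAMES.get().prev;
        if (prev == null) {
            FRAMES.remove(); // get() above created an entry holding TOP; drop it again
            throw new IllegalStateException("pop without matching push");
        } else if (prev == TOP) {
            FRAMES.remove(); // back at the top: delete the thread's entry entirely
        } else {
            FRAMES.set(prev);
        }
    }

    // Number of frames pushed on the current thread.
    static int depth() {
        int n = 0;
        for (Frame f = FRAMES.get(); f != TOP; f = f.prev) {
            n++;
        }
        if (n == 0) {
            FRAMES.remove(); // do not leave an entry behind just because we looked
        }
        return n;
    }
}
```

The point of the sentinel is that "no bindings" has one recognisable value, so pop can tell when the stack is empty again and remove the entry instead of storing a frame the thread will keep forever.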

There is still the opportunity for memory leaks if agents or futures are used and the executors used for them are not shut down when the app is undeployed. That's a solvable problem, but should probably be solved by the containers themselves (and/or the war generation tools) instead of in Clojure itself.

This patch has a small performance impact: its use of a try/finally around running transactions to remove the outer transaction adds 4-6 microseconds to each transaction call on my hardware.

Providing an automated test for this patch is difficult. I've tested it locally with repeated deployments to a container while monitoring GC and permgen. All of Clojure's tests pass with it applied.

Stuart Halloway
added a comment - 24/May/13 10:04 AM

The code that calls transaction.remove() seems unnecessarily subtle. There are two exits from the method, and only one is protected by the finally block.

If the "outer" case was a top-level if, the logic would be clearer, and only the "outer" case would need try/finally, which might reduce the performance penalty in the case of deeply nested dosyncs.

Did your transaction overhead of 4-6 microseconds test only one level of dosync, or many?

Stuart Halloway
added a comment - 24/May/13 10:13 AM

Because the unwind code calls remove at the top (as opposed to set(null)), the code should now be safe for use with Clojure-defined ThreadLocal subclasses.

Therefore, Var's use of an initialValue should be irrelevant to this patch, and it should be possible to fix this bug with a patch half the size of the current patch, touching only LockingTransaction.runInTransaction and Var.popThreadBindings.
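The difference between remove() and set(null) that this comment relies on is standard java.lang.ThreadLocal behaviour, and a small sketch makes it concrete. RemoveVsSetNull and its method names are hypothetical; the "initial" value simply stands in for whatever an initialValue override returns.

```java
// Hypothetical demo class contrasting ThreadLocal.set(null) with ThreadLocal.remove().
class RemoveVsSetNull {
    static final ThreadLocal<String> LOCAL = new ThreadLocal<String>() {
        @Override
        protected String initialValue() {
            return "initial";
        }
    };

    // set(null) keeps the thread's entry for LOCAL and stores null in it.
    static String afterSetNull() {
        LOCAL.set("bound");
        LOCAL.set(null);
        String seen = LOCAL.get(); // the entry still exists, so initialValue() is not consulted
        LOCAL.remove();            // clean up so the two demos do not affect each other
        return seen;
    }

    // remove() deletes the thread's entry for LOCAL altogether.
    static String afterRemove() {
        LOCAL.set("bound");
        LOCAL.remove();
        String seen = LOCAL.get(); // no entry, so initialValue() runs again
        LOCAL.remove();
        return seen;
    }
}
```

afterSetNull() returns null and afterRemove() returns "initial". After set(null) the thread still holds an entry keyed by the ThreadLocal instance, which is why an entry keyed by a ThreadLocal subclass loaded from the application can keep that class reachable; after remove() the entry is gone, and a later get() only recreates the initial value.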

re: Tomcat ThreadLocal warnings

With Clojure 1.5.1 using my test app (linked below), I see:

Jun 14, 2013 6:35:22 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/leak] created a ThreadLocal with key of type [clojure.lang.Var$1] (value [clojure.lang.Var$1@4902919]) and a value of type [clojure.lang.Var.Frame] (value [clojure.lang.Var$Frame@147a2aa6]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
Jun 14, 2013 6:35:22 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/leak] created a ThreadLocal with key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@608602ca]) and a value of type [clojure.lang.LockingTransaction] (value [clojure.lang.LockingTransaction@7e214d47]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.

With the original patch (threadlocal-removal-tcrawley-2012-12-11.diff) and the one attached today (threadlocal-removal-tcrawley-2013-06-14.diff), I no longer see these warnings.

re: the LockingTransaction.runInTransaction changes

In today's patch (threadlocal-removal-tcrawley-2013-06-14.diff), I modified runInTransaction to have one exit point, and to wrap the call to run in a try/finally only in the outer transaction case. This does introduce two places where run can be called, in order to preserve the case where an inner transaction has null info.

However, this will likely not reduce the speed penalty I observed in my testing, as I was only using a single level of dosync when capturing timing data.
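The restructuring described above (one exit point, try/finally only around the outer transaction, and two call sites for run) can be sketched as follows. This is an illustration of the control flow under simplifying assumptions, not the contents of the attached diff: TxSketch and CURRENT are hypothetical stand-ins for LockingTransaction and its transaction thread local, run does no retry or locking work, and Supplier is used instead of Callable to keep the example free of checked exceptions.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for LockingTransaction; only the control flow is modelled.
class TxSketch {
    static final ThreadLocal<TxSketch> CURRENT = new ThreadLocal<>();

    Object info; // non-null while this transaction is running

    <T> T run(Supplier<T> fn) {
        info = new Object();
        try {
            return fn.get();
        } finally {
            info = null;
        }
    }

    static <T> T runInTransaction(Supplier<T> fn) {
        TxSketch t = CURRENT.get();
        T ret;
        if (t == null) {
            // Outer transaction: the only branch that needs try/finally.
            t = new TxSketch();
            CURRENT.set(t);
            try {
                ret = t.run(fn);
            } finally {
                CURRENT.remove(); // drop the thread local once the outer transaction ends
            }
        } else if (t.info != null) {
            // Nested call inside a running transaction: join it.
            ret = fn.get();
        } else {
            // Existing transaction object with null info: second call site for run.
            ret = t.run(fn);
        }
        return ret;
    }
}
```

With this shape a nested call never pays for the try/finally, and the thread local is cleared whether the outer body returns normally or throws.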

re: removing initialValue from dvals

My original solution kept initialValue, but I then apparently discovered cases where the leak still occurred (see the mailing list thread).

Unfortunately, I can neither recreate that case, nor find in my notes, test code, or the clojure code a reason why keeping initialValue would allow the ThreadLocals to leak when popThreadBindings is patched (assuming one doesn't call Var.getThreadBindings from Java without calling Var.popThreadBindings).

Therefore, I've attached a simpler patch (threadlocal-removal-tcrawley-2013-06-14.diff) that just patches LockingTransaction.runInTransaction and Var.popThreadBindings.

I've also created a project that demonstrates the leak with 1.5.1, and that the leak does not appear with this patch applied to 1.6.0-master. See its README for usage details.

The patched version of 1.6.0-master is available as [org.clojars.tcrawley/clojure "1.6.0-clearthreadlocals"] if anyone wants to give it a try in their own projects. Note that since its group isn't 'org.clojure', you may need to add exclusions to your project to prevent another version of clojure being included.

Andy Fingerhut
added a comment - 14/Jun/13 10:56 AM

Presumptuously changing ticket approval from Incomplete back to its former Vetted state, since Toby's comments and new patch seem to address the comments that led Stu to change it to Incomplete.

Alex Miller
added a comment - 23/Aug/13 11:08 AM

I looked at the updated patch and it seems good to me. In the LockingTransaction.runInTransaction code, the cases are driven by whether t == null and whether t.info == null. Of those 4 cases, I believe the same call is being made in all but the case of t == null (where a new LockingTransaction is created) and t.info != null. However, since a new transaction is created there and its info should start as null, I believe this case does not actually exist in practice.

Greatly appreciate Chas's experience feedback and all of Toby and Stu's work to make this change solid!

Marking screened.

Toby Crawley
added a comment - 24/Nov/13 5:22 PM

I just attached a new patch (threadlocal-removal-tcrawley-2013-11-24.diff) that achieves the same ThreadLocal removal as the previous patch, but addresses the issues with binding conveyance reported in CLJ-1299.