Tuesday, September 22, 2009

With the Oracle acquisition looming on the horizon, and the uncertainty that comes with it, it feels like many organizations within Sun are bringing projects in a state in which they can be transitioned in a clean way. Transitioned meaning adopted or abandoned.

I also find myself doing this: for instance I'm trying to put the finishing touches on JMSJCA, and trying to put Hulp in a state where other people can use it, etc.

Maybe it was in that light that a few days ago, I got a certificate from the SEED organization within Sun. SEED is a mentoring program within Sun in which I had the good fortune to participate. I also was a SEED mentor. Or perhaps I should let the speak for itself.

Monday, September 21, 2009

Two years ago I wrote a blog entry about Nested Diagnostics Contexts in GlassFish. The approach was based on custom Logger objects. Now, two years later when running GlassFish on AIX, an issue with that turned up: a deadlock. Digging into it, there are two lessons I'd like to share: one about deadlocks, and the other one about workarounds.

Deadlock in the JDK

At the time of my writing the NDC facility, calling addLogger() and getLogger() from different threads may cause a deadlock in the JDK classes. Why is this?

Now let's consider another thread Thread 2 tries to register a custom logger using addLogger(), e.g.

LogManager.getLogManager().addLogger(new Logger() {
...
});

There are two locks that come into the picture: one is the lock on Logger.class, and the other one on LogManager.getLogManager(). Thread 1 simply calls Logger.getLogger() which first locks Logger.class, and while having this lock, will call LogManager.addLogger(). This call will get a lock on LogManager.getLogManager(). This will try to lock Logger.class again, which is no problem because it already had the lock on that. Now consider Thread 2: it will first lock LogManager.getLogManager() through its call to addLogger(), and while having this lock, it will try to lock Logger.class through its call to Logger.getLogger().

So Thread 1 will lock first A and then B, while Thread 2 will first lock B and then A. A classical deadlock situation.

This is clearly a bug in the JDK. With the knowledge of this bug, as the developer of the custom logger, we can use a simple workaround for this problem. Before calling addLogger(), first lock Logger.class. By doing that, we can guarantee that the locks are called in the same order: first Logger.class and then LogManager.getLogManager().

This bug was actually reported by a customer, see bug report, and was fixed in 6u11 and 5.0u18. The LogManager now no longer calls Logger.getLogger(), so there's no locking of Logger.class anymore.

Deadlock in GlassFish

GlassFish installs its own LogManager, and there's some locking going on there too. It has an internal lock which it locks in addLogger():

As a developer of the custom logger, we can still use the same fix: first lock Logger.class before calling addLogger() so that the locking sequence becomes: Logger.class, transient lock on LogManager.getLogManager() followed by a lock on _unInitializedLoggers, lock on Logger.class().

A fix in GlassFish would simply make addLogger() synchronized. An alternative fix is to make the synchronized block smaller: it should not extend over doInitializeLogger().

Deadlock with the IBM JDK

All was well until this was run on the IBM JDK. As it turns out, Logger.getLogger() doesn't lock Logger.class, it locks LogManager.getLogManager(). Now there is a new deadlock: Thread 1 calls Logger.getLogger() and by doing so, locks LogManager.getLogManager(), and then tries to lock _unInitializedLoggers and later tries to lock LogManager.getLogManager() again. Thread 2 calls addLogger() and by doing so locks first _unInitializedLoggers which later tries to lock LogManager.getLogManager(). The problem is back: the sequence of locking is reversed.

Lessons learned

The first lesson is that of the deadlock itself: one should be extremely careful when locking an object, and then calling into another object, especially if that object is meant to be a baseclass in a public API. Both Logger and LogManager may be overridden, and are clearly very public.

The second lesson is that instead of building workarounds, try to get a fix the original bug. Or at least try to have the original issue addressed while adding a workaround. In my case, I never filed a ticket against the JDK or against GlassFish. I should have known better.

... or not?

At this very moment, GlassFish v2.1.1 is about to be shipped, and we need GlassFish v2.1.1 on AIX for the upcoming CAPS release. A bug fix in GlassFish v2.1.1 is probably not going to get in. So I'm making yet another workaround: don't use a custom logger at all as to avoid the whole addLogger() problem. Instead, use standard Logger objects, but provide a custom filter that manipulates the log record. This will work on both unpatched versions of GlassFish, and also on unpatched versions of the JDK. Still, I hope the GlassFish problem will be fixed.