A Java Concurrency Bug, and How We Solved It

Everyone agrees that good debugging is critical. But it is far from trivial when it comes to multi-threaded applications. So here’s the story of how I tracked down a nasty bug in the Java 5 ReentrantReadWriteLock.

Update 15/4: To be a bit clearer, the Java bug in question 1) is in Java 5, and 2) occurs only when using fair locks. Non-fair mode seems to work, but that was never an option for us, if nothing else due to this other Java bug… Thanks to Artur Biesiadowski for the heads-up.

Our product Firebase is designed to offload threading and scalability concerns from game developers onto a generic platform. However, one of our customers experienced mysterious “freezes” when running a fairly small system load in their staging environment.

First step on the way: getting stack dumps when the system is frozen. In other words, ask the client to do a “kill -3” on the server process when it hangs, as this dumps all threads and their current stack traces to standard output. This we did, but were only confused by the result: all threads seemed dormant and there was no visible deadlock in the stack traces.
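As an aside, the same information can also be captured from inside the process: since Java 5, Thread.getAllStackTraces() returns a snapshot of every live thread and its current stack. A minimal sketch (class and method names are mine, not part of any real product):

```java
import java.util.Map;

// In-process alternative to "kill -3": snapshot all live threads and
// their stacks, e.g. from a watchdog thread or a JMX operation.
public class ThreadDumper {

    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            sb.append(e.getKey()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Handy when you cannot ask someone to run “kill -3” for you, though it shows the same picture: which threads are parked, not who holds which lock.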

However, several threads were all mysteriously waiting on a read lock deep down in the server, and seemingly not getting any further, despite the fact that no one was holding the write lock. This wasn’t conclusive, though, as there was one thread waiting to take the write lock, and this would block the readers. But given that the server was actually frozen, it looked suspicious. So my first analysis concluded:

As far as I can see, the only abnormal thing that could have caused this stack is if a thread has taken the lock (read or write) and then crashed without releasing it; however, there doesn’t seem to be any place in the code not safeguarded by try/finally (where the lock is released in the finally clause).

Implied in that conclusion is, of course, that this might either be normal behavior and we’re looking in the wrong direction, or that we have a more serious Java error on our hands.

There’s a lot of information to be had from a ReentrantReadWriteLock, including the threads waiting for either read or write privileges and the thread holding the write lock (if any), but not (strangely enough) the threads actually holding a read lock. And as a reader thread can effectively block the entire lock by not unlocking while a writer is waiting, this is information you really need.

So the next step was to get hold of the reader threads. I did this by sub-classing the ReentrantReadWriteLock to return my own version of the read lock, which, simplified, did this:

Set readers = Collections.synchronizedSet(new HashSet());

public Set getReaders() {
    return readers;
}

public void lock() {
    super.lock();
    readers.add(Thread.currentThread());
}

public void unlock() {
    super.unlock();
    readers.remove(Thread.currentThread());
}

Given this code, we now have a set containing a snapshot of the threads holding the read lock. I then added a JMX command for printing the following information to standard out for the given read/write lock:

The threads waiting for a read lock, including their stack traces.

The threads waiting for the write lock, including their stack traces.

Any threads holding a read lock, including their stack traces.

The thread holding the write lock, including its stack trace, if any.
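Put together, the lock subclass and the dump might look like the following sketch. The class and method names are mine, not the actual Firebase code; getQueuedReaderThreads(), getQueuedWriterThreads() and getOwner() are the protected introspection methods ReentrantReadWriteLock already provides, which a subclass can expose:

```java
import java.io.PrintStream;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of a "debuggable" read/write lock.
public class DebugReadWriteLock extends ReentrantReadWriteLock {

    // snapshot of threads currently holding the read lock, maintained by
    // a read-lock wrapper like the one shown earlier
    private final Set<Thread> readers =
            Collections.synchronizedSet(new HashSet<Thread>());

    public DebugReadWriteLock(boolean fair) {
        super(fair);
    }

    // hooks for the read-lock wrapper's lock()/unlock()
    public void addReader(Thread t) { readers.add(t); }
    public void removeReader(Thread t) { readers.remove(t); }

    // prints the four categories of threads, with stack traces
    public void dump(PrintStream out) {
        printAll(out, "Waiting reader", getQueuedReaderThreads());
        printAll(out, "Waiting writer", getQueuedWriterThreads());
        printAll(out, "Holding reader", readers);
        Thread w = getOwner();
        if (w != null) {
            printAll(out, "Holding writer", Collections.singleton(w));
        }
    }

    private void printAll(PrintStream out, String label,
                          Collection<Thread> threads) {
        for (Thread t : threads) {
            out.println(label + ":");
            out.println("  " + t);
            for (StackTraceElement e : t.getStackTrace()) {
                out.println("    " + e);
            }
        }
    }
}
```

In the real server, dump() would be wired up as a JMX operation so it can be triggered remotely on a frozen process.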

I shipped this patched code to the client and asked them to freeze the server with the patch, print the new debug information, and then send me the output. Which they did, and the relevant output looked like this (very much shortened):

Holding reader:
Thread[DQueueHandoffExecutorPoolThread{GAME}-1,5,main]
  sun.misc.Unsafe.park(Native Method)
  java.util.concurrent.locks.LockSupport.park(LockSupport.java:118)
  […]

Waiting writer:
Thread[Incoming,dqueue,127.0.0.1:7801,5,Thread Pools]
  sun.misc.Unsafe.park(Native Method)
  […]

Waiting reader:
Thread[DQueueHandoffExecutorPoolThread{GAME}-1,5,main]
  sun.misc.Unsafe.park(Native Method)
  […]

See anything strange here? It appears that the same thread is both holding and waiting for a read lock at the same time. But this is supposed to be a reentrant lock. In which case…

So, the freeze was caused by this: there’s a bug in Java 5 (but not in Java 6) where a fair ReentrantReadWriteLock stops a reader from re-entering the lock if there’s a writer waiting. It is of course easy to write a test case for, which you can find here.
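A minimal version of such a test might look like this sketch (the class name is mine; it uses only the public ReentrantReadWriteLock API). On a Java 5 runtime the timed tryLock would time out and the method would return false, while on Java 6 and later the re-entry succeeds:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReentrantReadLockTest {

    public static boolean canReenterRead() throws InterruptedException {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair
        lock.readLock().lock();                       // 1. take the read lock
        Thread writer = new Thread(new Runnable() {
            public void run() {
                lock.writeLock().lock();              // 2. blocks behind the reader
                lock.writeLock().unlock();
            }
        });
        writer.start();
        while (lock.getQueueLength() == 0) {          // wait until the writer is queued
            Thread.sleep(10);
        }
        // 3. re-enter the read lock; use a timed tryLock so a buggy runtime
        //    fails the test instead of hanging it forever
        boolean reentered = lock.readLock().tryLock(2, TimeUnit.SECONDS);
        if (reentered) {
            lock.readLock().unlock();
        }
        lock.readLock().unlock();                     // release; the writer can now proceed
        writer.join();
        return reentered;
    }
}
```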

This has now been submitted as a bug to Sun, but I have yet to receive a confirmation and bug number.

As for Firebase, it is now patched and released to use manually tracked re-entrance for read/write locks throughout the server when running under Java 5. Again very much simplified, it looks like this:

private ThreadLocal<AtomicInteger> count = new ThreadLocal<AtomicInteger>();

public void lock() {
    if (get() == 0) {
        // don't lock if we already have a count, in other words, only
        // really lock the first time we enter the lock
        super.lock();
    }
    // increment counter
    increment();
}

public void unlock() {
    if (get() == 0) {
        // we don't have the lock, this is an illegal state
        throw new IllegalMonitorStateException();
    } else if (get() == 1) {
        // we have the lock, and this is the "first" one, so unlock
        super.unlock();
        remove();
    } else {
        // this is not the "first" lock, so just count down
        decrement();
    }
}

// --- HELPER METHODS --- //

private void remove() {
    count.remove();
}

private int get() {
    AtomicInteger i = count.get();
    if (i == null) {
        return 0;
    } else {
        return i.intValue();
    }
}

private void increment() {
    AtomicInteger i = count.get();
    if (i == null) {
        i = new AtomicInteger(0);
        count.set(i);
    }
    i.incrementAndGet();
}

private void decrement() {
    AtomicInteger i = count.get();
    if (i == null) {
        // we should never get here...
        throw new IllegalStateException();
    }
    i.decrementAndGet();
}

The above code isn’t trivial, but shouldn’t be too hard to decode: we’re simply keeping a per-thread “count” of how many times a thread has taken the read lock, but only actually locking the first time it enters; on subsequent entries we just increment the counter. On unlock, we decrement the counter, and if it is the “last” unlock, i.e. the counter equals 1, we do the real unlock and remove the counter.

There are two things to learn: 1) in any sufficiently complex server, the above debug information is nice to have from the start on any important lock; and 2) it really is time to migrate to Java 6.

Firebase makes several guarantees regarding the ordering of events within the system. In this particular instance the lock guards a routing mechanism sitting at the heart of our distribution mechanism. Suffice it to say, without fairness in this particular lock we wouldn’t be able to promise strict ordering of events as they are sent from the clients to the games.