Monday, 21 March 2016

I've just run into this issue this week, and it's very cute, so this here is a summary. Akka has their own MPSC linked queue implementation, and this week in was suggested they'd swap to using JCTools. The context in which the suggestion was made was a recently closed bug with the mystery title: AbstractNodeQueue suffers from nepotism
An Akka user claimed to not be suffering from the bug because they were using JCTools, but as it turned out the issue was indeed present so I had to fix it etc... But what the hell is GC nepotism? You won't find it in your GC textbooks, but it's a catchy term for a very real problem.

Your Dead Olds Relatives Can Promote You!

In the open JDK a generational GCs, there are distinct generations, Old and New. The 2 generations get collected separately, but are not independent from each other. In particular, references from NewGen to OldGen will keep OldGen references alive through an OldGen collection cycle, and similarly references from OldGen to NewGen will keep objects alive through a NewGen collection. This is all quite straight forward, but has a surprising implication:

Dead OldGen objects will keep dead NewGen objects they reference alive until collected by an OldGen.

The reverse is also true, but since the whole notion of generational GC is based on the assumption that new gen is collected far more often then the old gen the above use case is the real issue. How does this effect you? Is this even a real problem?
Turns out it's real enough to have convinced Doug Lea to fix a particular manifestation of it, but also subtle enough to have slipped past him (and me, and the Akka guys, and Martin Thompson, and others no doubt) in the first place.

The nodes are created and linked together on offer, and dereferenced as we poll. The poll method clears MUST clear the value because the queue retains the last node. If we were to let the last node keep the value we will have effectively leaked this value reference.

Ignoring the values, imagine the queue is part of the live set and we have enqueued a few elements, the arrow denotes what next references:head -> N3 -> nulltail -> N0 -> N1 -> N2 -> N3If we now dequeue the elements we'll get back to empty, brackets denote live referenced from dead:head -> N3 -> nulltail -> N3 -> nullDEAD: N0 -> N1 -> N2 (-> N3)

Dead objects are traditionally considered the GC problem, and cleaning up about to be collected objects is considered an anti-pattern. After all, what is the point of cleaning up garbage? you are just duplicating work. Indeed in a single generation collector this is a non-issue.

But consider for a moment the possibility that one of the above nodes was promoted into the OldGen. This is something which may not happen in short lived tests, or even 'lucky' long lived tests, but is obviously quite possible. We will have:

head -> N3 -> nulltail -> N3 -> nullZombie NEW: N1 -> N2 (-> N3)

DEAD OLD: N0 (-> N1 -> N2 (-> N3))

The young are alive, as far as the old are concerned. The old are dead, but their promotion gives life to their 'children' until they get collected. To demonstrate this issue I've written the following program:

If you look at the GC logs, or plug in Visual VM with the Visual GC plugin you will notice the young GC is entirely ineffective. Everything gets promoted to old, until an old GC clears the lot. After the old GC the problem disappears (in the test program, but is likely to repeat in real programs). This is 'pathological' because ANY promoted node will result in the promotion of ALL following nodes until a GC resolves the issue. At this point you might be shaking your head sadly and saying: "Ye thick old fart, surely this is a silly issue which NEVER happens in the real world". I've had my share of heads being shook at me and am not offended, but consider that this issue is real enough for other people:

This war story from Tony Printezis referring to this as a recurring pattern in Java services (I wholeheartedly recommend you watch Tony's presentation from beginning to end)

An Exercise To the Reader

My plan was to walk through the way this issue has been fixed in JCTools vs Akka vs Agrona. Each solution reflecting different considerations, but sadly I'm quite short on time for a tidy summary and will give my observations which you can choose to read up on and argue with.All three implementations follow the algorithm outlined by D. Vyukov. Note that Agrona is ignoring the Queue interface requirement for poll(return null iff queue is empty).

In Akka the next reference is nulled on poll. This solves the issue, but at the cost of further weakening the accuracy of the size() method.

In JCTools the next reference is set to point to the discarded node. This helps in telling the difference between a removed node and the head. This piece of information is used in the size method to skip gaps in the size counted chain when racing with the consumer thread. The result is a more accurate size estimate. The change does come at a cost to the size method however as we are no longer able to rely on the chain remaining whole throughout the size operation.

In Agrona, the next reference is nulled as in Akka. A further effort is made to reuse a special empty node to avoid repeating promotion of transient nodes in empty queues. The effort is arguably worth while on systems with large numbers of active but mostly empty queues.

Open questions/discussion:

How common is this issue for 'classic' data structures? it seems reasonable that a host of queues/stacks/tree data structures are similarly vulnerable. A scan of JDK sources suggests this is an known pitfall to them, but I would bet many third party frameworks writers didn't get the memo.

How common is this issue for idiomatic 'user' Java code? given the tendency towards composition rather than inheritance and the drive towards smaller objects, we may argue that user object graphs often approximate trees. If we have short-lived 'trees' a parent node can certainly drag a useless chunk of state to the old gen. If a reference bridges 2 trees we may have something approximating the pathological behaviour above. I'm not aware of any studies trying to determine the impact of nepotism on user code, part of the problem would be the definition of 'typical' user code.

Where does the term 'nepotism' come from? I made an attempt at tracing the origins, but got nowhere. The problem seems to be documented at least 8 years ago, but the term nepotism is not used by most writing. [UPDATE: it seems the term can be tracked down to a GC paper published in 1991:An Adaptive Tenuring Policy for Generation]