Wednesday, 29 October 2014

In Java, not all value stores are created equal; in particular, storing object references is different from storing primitive values. This makes perfect sense when we consider that the JVM is a magical place where object allocation, relocation and deletion are somebody else's problem. So while in theory writing a reference field is the same as writing a same-sized primitive (an int on 32-bit JVMs or with compressed oops enabled, a long otherwise), in practice some accounting takes place to support GC. In this post we'll have a look at one such accounting overhead: the write barrier.

What's an OOP?

An OOP (Ordinary Object Pointer) is the way the JVM views Java object references. They are pointer representations rather than actual pointers (though they may be usable pointers). Since objects live in managed memory, OOP reads/writes may require a memory barrier of the memory management kind (as opposed to the JMM ordering barrier kind):

"A barrier is a block on reading from or writing to certain memory locations by certain threads or processes.

Barriers can be implemented in either software or hardware. Software barriers involve additional instructions around load or store operations, which would typically be added by a cooperative compiler. Hardware barriers don’t require compiler support, and may be implemented on common operating systems by using memory protection."

Card Marking

All modern JVMs support a generational GC process, which works under the assumption that allocated objects mostly live short and careless lives. This assumption leads to GC algorithms where different generations are treated differently, and where cross-generational references pose a challenge. Now imagine the time to collect the young generation is upon our JVM: what do we need to do to determine which young objects are still alive (ignoring the Phantom/Weak/Soft reference debate and finalizers)?

An object is alive if it is referenced by a live object.

An object is alive if a static reference to it exists (part of the root set).

An object is alive if a stack reference to it exists (part of the root set).

The GC process therefore:

"Tracing garbage collectors, such as copying, mark-sweep, and mark-compact, all start scanning from the root set, traversing references between objects, until all live objects have been visited.

A generational tracing collector starts from the root set, but does not traverse references that lead to objects in the older generation, which reduces the size of the object graph to be traced. But this creates a problem -- what if an object in the older generation references a younger object, which is not reachable through any other chain of references from a root?" - Brian Goetz, GC in the HotSpot JVM

Illustration by Alexey Ragozin

It is worth reading the whole article to get more context on the cross generational reference problem, but the solution is card marking:

"...the heap is divided into a set of cards, each of which is usually smaller than a memory page. The JVM maintains a card map, with one bit (or byte, in some implementations) corresponding to each card in the heap. Each time a pointer field in an object in the heap is modified, the corresponding bit in the card map for that card is set."

A good explanation of card marking is also given here by Alexey Ragozin; I have taken the liberty of including his great illustration of the process.

So there you have it: every time an object reference is updated, the compiler has to inject some accounting logic towards card marking. How does this affect the code generated for your methods?

Default Card Marking

OpenJDK/Oracle 1.6/1.7/1.8 JVMs default to card marking logic which, in the assembly generated for a setter such as setFoo(Object bar), adds a few instructions on top of the reference store. These boil down to:

CARD_TABLE [this address >> 9] = 0;
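To make the accounting concrete, here is a minimal Java simulation of that logic (all names here are my own; this is a sketch of the idea, not the JIT-emitted code):

```java
// Simulates the default (unconditional) card marking write barrier.
// The heap is divided into 512-byte cards; writing a reference field
// dirties the card covering the written object's address.
public class CardMark {
    static final int CARD_SHIFT = 9;                    // 2^9 = 512 bytes per card
    static final byte[] CARD_TABLE = new byte[1 << 20]; // card map for a toy heap

    // Unconditional barrier: always store 0 ("dirty") to the card.
    static void referenceWriteBarrier(long objectAddress) {
        CARD_TABLE[(int) (objectAddress >> CARD_SHIFT)] = 0;
    }

    public static void main(String[] args) {
        java.util.Arrays.fill(CARD_TABLE, (byte) 1); // 1 = clean
        referenceWriteBarrier(1500);                 // address 1500 lies in card 2
        System.out.println(CARD_TABLE[2]);           // 0 -> card is dirty
        System.out.println(CARD_TABLE[3]);           // 1 -> untouched
    }
}
```

Note the store is unconditional: every reference write dirties the card covering the written object, regardless of the card's current state.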

This is significant overhead compared to writing primitive fields, but is considered a necessary tax for memory management. The tradeoff here is between the benefit of card marking (limiting the scope of old generation scanning required on a young generation collection) and the fixed overhead imposed on every reference write. The associated write to memory for card marking can sometimes cause performance issues for highly concurrent code. This is why OpenJDK 7 introduced a new option, -XX:+UseCondCardMark.
[UPDATE: as JP points out in the comments, the (>> 9) converts the address to the relevant card index; cards are 512 bytes in size, so the shift is in fact address/512.]

Conditional Card Marking

With -XX:+UseCondCardMark the barrier first checks the card and only writes to it if it is not already dirty. This is a bit more work, but it avoids the potentially concurrent writes to the card table, thus sidestepping some potential false sharing by minimising recurring writes. I have been unable to make JDK 8 generate similar code with the same flag, regardless of which GC algorithm I run with (I can see the code in the OpenJDK codebase... not sure what the issue is; feedback/suggestions/corrections welcome).
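A simulation of the conditional variant (again with names of my own choosing, a sketch rather than the real barrier code):

```java
// Simulates -XX:+UseCondCardMark: check the card before dirtying it,
// so repeated reference writes covered by the same card do not keep
// hitting the same card table cache line from multiple threads.
public class CondCardMark {
    static final int CARD_SHIFT = 9;                    // 512-byte cards
    static final byte[] CARD_TABLE = new byte[1 << 20];
    static long cardWrites = 0;                         // count actual stores

    static void conditionalBarrier(long objectAddress) {
        int card = (int) (objectAddress >> CARD_SHIFT);
        if (CARD_TABLE[card] != 0) { // extra load + branch...
            CARD_TABLE[card] = 0;    // ...but the store happens at most once
            cardWrites++;
        }
    }

    public static void main(String[] args) {
        java.util.Arrays.fill(CARD_TABLE, (byte) 1); // 1 = clean
        for (int i = 0; i < 100; i++) {
            conditionalBarrier(1500); // 100 reference writes, same card
        }
        System.out.println(cardWrites); // 1 -> only the first write hit memory
    }
}
```

The extra load and branch buy us fewer stores: on a hot card only the first write actually touches the card table, which is what takes the sting out of false sharing between threads writing references into neighbouring objects.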

Card Marking G1GC style?

Is complicated... The generated assembly is considerably more involved, and to figure out exactly what it was about I had to go for a read in the HotSpot codebase.
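A rough translation into Java-flavoured code (a sketch under my own naming; the real barrier is emitted assembly with further checks, e.g. against already-dirty cards):

```java
// Rough rendering of the G1 reference-write barrier: a pre-barrier that
// logs the old value while concurrent marking runs, and a post-barrier
// that takes a slow path only for cross-region, non-null stores.
public class G1BarrierSketch {
    static final int LOG_REGION_SIZE = 20; // 1MB regions -> the (>> 20) check
    static boolean markingInProgress = false;
    static int runtimeCalls = 0;           // counts slow-path entries

    // fieldAddr is the address of the written field; newVal is the address
    // of the object being stored (0 stands in for null).
    static void writeRef(long fieldAddr, long oldVal, long newVal) {
        // Pre-barrier (SATB): log the old value while marking is running.
        if (markingInProgress && oldVal != 0) {
            runtimeCalls++; // stands in for the pre-barrier runtime call
        }
        // ... the actual reference store happens here ...
        // Post-barrier: only cross-region, non-null stores need tracking.
        if (((fieldAddr ^ newVal) >> LOG_REGION_SIZE) != 0 && newVal != 0) {
            runtimeCalls++; // stands in for the post-barrier runtime call
        }
    }

    public static void main(String[] args) {
        writeRef(0x100L, 0, 0x200L);      // same region, no marking -> fast path
        System.out.println(runtimeCalls); // 0
        writeRef(0x100L, 0, 0x200000L);   // cross-region store -> slow path
        System.out.println(runtimeCalls); // 1
    }
}
```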

The runtime calls are an extra overhead whenever we are unlucky enough to either:

Write a reference while concurrent marking is in progress (and the old value was not null)

Target object is 'older' than the new value (and the new value is not null) [UPDATE: (SRC^TGT>>20 != 0) is a cross-region rather than a cross-generational check. Thanks Gleb!]

The interesting point to me is that the generated assembly ends up being somewhat fatter (nothing like your mamma) and has a significantly worse 'cold' case (cold as in less likely to happen), so in theory mixing up the generations will be painful.

Summary

Writing references incurs some overhead not present for primitive values. The overhead is on the order of a few instructions, which is significant compared to a primitive store but minor if we assume most applications read more than they write and have a healthy data/object ratio. Estimating the card marking impact is non-trivial and I will be looking to benchmark it in a later post. For now I hope the above helps you recognise card marking logic in your PrintAssembly output and sheds some light on what the write barrier and card marking are about.

Tuesday, 28 October 2014

2 years ago I started this blog with a short and relatively unexciting post about intrinsics. I was not happy with that first post, but you have to start somewhere I guess ;-). I set myself a target of writing 2 posts a month and have pretty much kept to it (43 posts and 1 page). Some posts took huge investment, some less, but I learnt something new while writing every one of them.
I spent last week at Joker Conf and Gee Con; I don't think I'd have been invited to speak at either were it not for my blog. I'm also pretty sure I owe my current job (and other job offers) to the blog. I'm still surprised to meet people who read it. Most seem happy. It proved to be a lot of work, but just the sort of excuse I needed to dig deeper into corners of Java and concurrency I find exciting. Some of the effort that went into the blog became the groundwork for JCTools. I guess what I'm trying to say is that it worked out very well for me, both in driving my learning process and in gaining me some recognition that led to rewarding experiences and opportunities. Also, some other people seem to enjoy it :-)
The name of the blog proved puzzling for many (not a big surprise really), so in case you're still wondering where it came from, here's the relevant strip from Calvin & Hobbes:

I am a huge Bill Watterson fan; you should buy yourself the full C&H set, it will prove more lasting reading material than any performance/Java/programming book you own. Also, I've seen many performance related discussions go a similar way to the above exchange...
A huge thank you to the readers, commenters and reviewers, urging me this way and steering me that way. Let's see if I can keep it up another 2 years :-)