Cliff Click on Azul's Pauseless GC, Zing, JVM Languages

Bio: With more than twenty-five years' experience developing compilers, Cliff serves as Azul Systems' Chief JVM Architect. Cliff joined Azul in 2002 from Sun Microsystems, where he was the architect and lead developer of the HotSpot Server Compiler, a technology that has delivered dramatic improvements in Java performance since its inception.

The target audience for GOTO conferences is software developers, IT architects and project managers. GOTO Aarhus is an annual event in Denmark. The idea for GOTO (formerly known as JAOO) came about because the management at Trifork was dissatisfied with the conferences that existed and wanted to create a forum that would give development staff inspiration, energy and the desire to learn, coupled with the chance to network with the IT community.

I came in shortly after Anamorphic was acquired by Sun. The original team was at a company called Anamorphic; they came in with a technology targeted at Smalltalk, re-targeted it for Java, and hired me shortly afterwards to write a new JIT for their virtual machine.

I got hired to write a JIT. As part of my PhD research my advisor signed off on a piece of math (that's what my PhD is officially for), but the very large appendix at the end of the thesis is all about how to write a fast optimizing compiler, and that's what Sun hired me to do.

Azul Systems makes two different products. What we've been selling for the last several years is custom hardware to run large server Java applications. The hardware is targeted at highly threaded Java programs, with nearly a thousand cores in a big box. The major selling point is that we do GC better than anyone else on the planet: we have customers in production with heaps of hundreds of gigabytes and allocation rates of tens of gigabytes a second, with max pause times on the order of only ten to twenty milliseconds.

The second thing we have is a really nice VM internal introspection tool; basically we have some of the best low-level VM profiling tools around. So there's that one product. The next thing, which we've announced but haven't taken to full GA yet, is the same virtual machine running on plain x86.

Each chip has fifty-four cores, and we can stack up to sixteen chips in a box, so that's sixteen times fifty-four: eight hundred and sixty-four cores, something like that. Each individual core is somewhat slower than an equivalent x86 core, but you get a whole lot of them. The box has supercomputer-level bandwidth, way more than a stack of x86s, and it's actually very low power compared to x86s. Then there are custom instructions for running Java better, and a custom micro-kernel-style OS that does multi-threading well. It's a better way to run large-thread-count Java programs.

On the custom hardware? The custom hardware has a couple of things that are special about it. There's the large core count; each core is a full-blown CPU, a 64-bit RISC with standard three-address instructions, thirty-two GPRs, a 16K L1 data cache and a 16K instruction cache. A bunch of cores share a two-megabyte L2, and there are a number of those clusters of cores and L2s on a die. So there is a lot of CPU power there, and actually a lot of cache. Then the individual cores have instructions, for instance, to do Java range checks better. A major thing we do is just-in-time zeroing of the heap directly into the L1 cache, which lets us avoid reading memory we are about to zero over.

And that in turn cuts our bandwidth requirements by about a third, which is a pretty substantial reduction in bandwidth usage. The box already has a lot of bandwidth, and coupled with a very large reduction in bandwidth needs, we don't run out of bandwidth. If you load up all the cores and make them do something busy, they keep working at a steady rate; they don't degrade because you ran out of bandwidth. That's different from the standard x86 situation, where everything works well as long as it runs out of cache, but as soon as you start going outside the cache and off to main memory you quickly run out of bandwidth and everything just slows down.

Yes, OK, there are a couple of things going on there: we have read and write barriers in hardware. The read barrier is an instruction which tests the value of a loaded pointer to confirm whether or not it passes whatever GC invariants need to hold. If it passes, it's a one-clock-cycle kind of operation; if it fails, you take a fast user-mode trap to a fixed address, where the GC kicks in, inspects the pointer and does the right thing, whatever that happens to be. Using a read barrier lets us change the nature of the GC algorithm.
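
As a rough illustration of the test-and-trap pattern he describes, here is a minimal Java sketch. It is hypothetical, not Azul's actual implementation: pointers are simulated as long values carrying a single metadata bit, and the "trap" is just a method call.

```java
// Hypothetical simulation of a loaded-value read barrier: pointers are
// modeled as long values carrying a "not-marked-through" (NMT) bit.
// Illustrative only; real barriers operate on hardware registers.
public class ReadBarrierSketch {
    static final long NMT_BIT = 1L << 62;      // metadata bit in the pointer
    static long expectedNmt = 0;               // GC flips the expected value each cycle

    // Fast path: one compare. Slow path: the "trap" fixes the pointer.
    static long loadBarrier(long ptr) {
        if ((ptr & NMT_BIT) == expectedNmt) {
            return ptr;                        // invariant holds: cheap fast path
        }
        return slowPathTrap(ptr);              // rare: fix up and continue
    }

    static long slowPathTrap(long ptr) {
        // In a real collector this would mark through the target object
        // and heal the memory location; here we just flip the bit.
        return (ptr & ~NMT_BIT) | expectedNmt;
    }
}
```

The key property is that the fast path is a single compare, and a pointer that fails the check is repaired before the program ever uses it.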

Most of the GCs that are out there today, the classic canonical systems, are what are called snapshot-at-the-beginning style algorithms; we are instead using what is called an eager-update style. The downside is that you have to have a read barrier. The upside is that the algorithm is basically simple, like the theoretical algorithm is simple, and because the theoretical algorithm is simple you can make it robust in practice, you can make it scalable, high-performance, concurrent, buzzword-enabled in all the right ways, without the huge bug tail.

So we had our GC working years before CMS was as stable as it is now, and we are still way more stable than CMS is. We're well ahead of G1, the next upcoming GC algorithm out there. We are looking at G1 numbers now and our algorithm is like a hundred times better. It's a huge step forward, and that's only possible because we switched to an eager-update style algorithm and had a read barrier.

Card marks are part of the write barrier side of things, not the read barrier side. On the read barrier side, the eager-update algorithm means that when you load a pointer out of memory, if it doesn't match the GC invariants in place at the moment, you eagerly update it on the spot. The snapshot-at-the-beginning algorithm instead says that at the beginning of some mystical point in time in a GC cycle, everything that is alive then is going to remain alive to the end. Then you don't have to track anything when you load a value: since it was alive at the beginning, you are going to declare it alive at the end of the GC cycle. You just have to not lose any pointers, so when a store wipes an old pointer out, you have to hang on to a record of it somewhere so you can finish marking through and tracking it down.

So part of those algorithms is that they have some way to track pointer stores. Pointer stores are a lot rarer than pointer loads, so it's a lower-cost tracking mechanism for the standard execution of a running thread; the penalty is that the algorithms become very complicated in order to efficiently track the pointer stores, and all the other operations that go on there. I've been looking at the bug tail in CMS and it seems to go on forever. It's a harder invariant to maintain. Eager update says your Java program will never witness a stale pointer of any kind, one that doesn't meet the current GC invariants, whereas with the snapshot-at-the-beginning ones you can pull out a pointer that is not colored correctly, or not marked right, or hasn't had some other property applied to it.
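
The pointer-store tracking he mentions can be sketched as follows. This is an illustrative toy, not any real VM's barrier: heap slots are modeled as an Object array, and the overwritten value is pushed onto a log so the collector can still mark through it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy snapshot-at-the-beginning (SATB) write barrier: before a pointer
// store overwrites an old value during a marking cycle, the old value is
// logged so no pointer alive at the start of the cycle is ever lost.
public class SatbBarrierSketch {
    static final Deque<Object> satbLog = new ArrayDeque<>();
    static boolean markingActive = true;       // true while a GC cycle is marking

    static void writeBarrier(Object[] heap, int slot, Object newRef) {
        Object old = heap[slot];
        if (markingActive && old != null) {
            satbLog.push(old);                 // preserve the snapshot invariant
        }
        heap[slot] = newRef;
    }
}
```

The eager-update style avoids this store-side bookkeeping entirely by fixing pointers up at load time instead.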

Then you can propagate it around and make copies of it throughout the heap, and you have to have some way to find those other copies and update them later. That doesn't happen for us: if you load a pointer and we've moved the base object it points to, you update the pointer so it now points to the new location, and then you propagate the new pointer around, not the stale one. So you don't have the issue of pushing around stale pointers and making copies of them.

Yes, the basic algorithm starts with a read barrier on pointer loads to clean up any missing GC invariants, and after that the actual algorithm is a fairly straightforward rendition of a mark-compact style moving collector. The interesting trick is that when we decide to move objects, we protect the page; people who load pointers to that page take a GC trap, help move the object to its new location, get a pointer to the new object, and carry on from there. That means we have TLB page protections in our custom hardware that are there for GC: you fail the check as a normal Java thread, but you are allowed to touch the page as a GC thread, so normal Java threads just declare themselves GC threads, help move the objects they want, and then go back to being normal Java threads.
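
The trap-and-heal relocation he describes can be sketched in Java. Everything here is a stand-in: the "protected region" set plays the role of page protection, and the forwarding table plays the role of forwarded object headers.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of trap-driven relocation: a thread that loads a
// reference into a "protected" region helps copy the object, then heals
// the slot with the new address so stale pointers never propagate.
public class RelocationSketch {
    static final Map<Object, Object> forwarding = new HashMap<>();
    static final Set<Object> protectedRegion = new HashSet<>();

    static Object loadRef(Object[] heap, int slot) {
        Object ref = heap[slot];
        if (protectedRegion.contains(ref)) {   // stands in for a page-protection trap
            Object moved = forwarding.computeIfAbsent(ref, RelocationSketch::copyObject);
            heap[slot] = moved;                // heal the slot in place
            return moved;
        }
        return ref;                            // common case: no trap, no work
    }

    static Object copyObject(Object old) {
        return old.toString() + "'";           // stand-in for copying to a new page
    }
}
```

After the first access heals a slot, later loads of the same slot take the fast path, which is the self-healing property that keeps trap costs bounded.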

The marking algorithm is actually Sun's HotSpot parallel marking algorithm. The only extra trick we have is that there is a mark bit kept per pointer, so when you load a pointer it is known whether or not you've marked through that pointer. One of the traps fires if it hasn't been marked yet: you go ahead and mark it on the spot, add it to the collector's marking stack, and then you're done with the GC trap and go back to whatever you were doing. There is a paper in VEE 2005 on how the algorithm works. If you are interested, go read the paper; it will be much better than having me try to explain it here on the spot.

Yes, we tried for a long time to get Intel to put in hardware support, and they were interested, but they are not moving very fast, and we had to give up waiting on Intel to put in some sort of custom support. So we gave up on some stuff, and we are doing other stuff the hard way. The stuff we gave up on includes hardware transactional memory and the just-in-time zeroing into the cache. That means you don't get the bandwidth benefits of the just-in-time zeroing; since x86s are typically also a little tight on bandwidth, that's a hard one to give up, but everyone is in the same boat there.

The other thing we do is the read barrier in software, and we worked a long time to come up with a sequence of instructions to do the read barrier as cheaply as possible. There is an interesting performance hit: single-threaded performance is down. It's not as bad as Metronome, but it's on the order of fifteen to twenty percent to do the read barrier in software. The upside is that we are seeing really excellent pause times for very large heaps on x86, excellent meaning two to three milliseconds max pause times. We believe we can eventually get it down to under a millisecond. This would be the max pause, not the typical pause.

So on the custom hardware side, we are using the TLB as part of the read barrier. On the x86 side of things we are doing a lot of page-remapping games, but they are not for the read barrier. For the read barrier we have an instruction sequence and a clever mapping of some pages to do a sort of read-mask-read game, and we'll fault if the pointer fails to meet the right invariants. There are some multi-mapped page games being played there. However, we've always done very aggressive page remappings because of the way the algorithm works: it frees up physical memory before it frees up virtual memory.

So the VM itself is basically managing the mapping between physical and virtual memory, and it changes those mappings at a very high throughput rate. On our custom OS that was not an issue; on standard Linux those kinds of calls are typically very slow, much too slow for what we need them for, so we have custom kernel hacks on otherwise standard Red Hat Linux to do very fast, very bulk page-table remappings.

There is no custom hardware here at all; it's custom software. The only custom thing going on is that we've got kernel mods to let us do bulk page-remapping games, and we remap at a very high throughput rate.

Yes, OK, versus Zing: isn't marketing wonderful? We are basing ourselves now on OpenJDK, and to do that we have to release our stuff as open source as well, because it's substantially a derived work of OpenJDK. So the managed runtime is essentially a source drop from a couple of months ago, and we are due to put out another source drop before long. Essentially it is what I am building and running on my desk every day. Zing is the productized version of that, which we intend to sell for money. There are a few features we don't have to put out in the public domain, and those are going to be extras, probably the high-resolution profiling tools.

But the good garbage collector is going to be out there in the public domain. What is out there now is there; it's just that, as of several months ago, it had all the bugs and the status of a not-quite-beta-ready product. As we progress over time, the MRI stuff will become more stable and functional for people to just download and run. We will probably always have a version of Zing which is low-cost to no-cost for developers. But if you want to put it in production and you want support for the VM, that will be an extra charge. It will be similar to the VMware/Red Hat model: there is going to be a free version, and then there is going to be support for people who want to run in production.

Right. We'll probably do a little of both for a while; the custom hardware is currently selling quite well, thank you. Like I said before, it's one of the best GCs around, and there is a certain class of problems where we just can't be beat. The economy is recovering, people are buying, and it's the go-to solution for a certain class of problems. On the x86 side of things, we will probably have a blessed version of x86 with the right set of devices and hardware that we're willing to support, because we are going to do kernel mods, and if the kernel breaks, Red Hat is going to look at us funny and say, "Well, you've got funny kernel mods, let's go talk to the Azul folks."

So we will have canned versions that we support directly; we'll probably have a version for one of the VMware, Xen or KVM style hardware platforms. And then there will be a free version which will not be supported in any way, sort of potluck as to how that works, but it will otherwise be pretty standard: you get Linux running on your box, you apply these kernel patches, these kernel mods, and then in theory it all works.

Well, you just give up on some total throughput. The hundreds of cores really mean that you are allowed to scale up, but you don't have to; because the cores are slower, you have to in order to get parity performance. On x86 you will need fewer cores to get the same level of performance, so it's a little easier to slice things thinner. Other than that, you just have more cores. We know how to use cores; they are faster so we don't need the same number of them, but we've long had the compiler system use a self-sizing thread pool, so during application start-up there is lots of room for the JIT to get busy: the thread pool kicks up in size and you get lots of threads JITing.

The GC is a concurrent GC; it needs cores to run in the background, but the number of cores it needs is pretty much related to the size of your heap and the rate at which you are producing garbage. So some fraction of your total cores, as you are running your Java program, is going to be reserved for running the GC in the background. Typically it's not an issue: you've got an eight-core machine, you have six cores producing work, and you'll have two cores cleaning up garbage behind them, something like that.

Those are sort of two different questions. The answer to the first is that we are based on OpenJDK, and so we are following behind OpenJDK. As the stuff comes into OpenJDK, we'll pick it up the next time we integrate. So we are going to support invokedynamic eventually, but I don't know exactly when. Do I have any ideas on how to make it better? I am picking and choosing where I want to put my engineering effort to speed stuff up. At the moment I am happy to let the OpenJDK folks make invokedynamic work.

If it turns out that it's the wave of the future and everyone is writing in funny new languages (Clojure, Scala, whatever) that all do invokedynamic, and we can add some real value there, maybe I will put some engineering effort into optimizing it above and beyond what happens with OpenJDK. At the moment I am going to live with OpenJDK's solution for that piece.

The obvious one they have is a mismatch between the language semantics and what the VM supports around arithmetic. A bunch of these languages have either infinite-precision arithmetic, or they fail over to BigDecimal or BigInteger; I think JavaScript is using capital-D Doubles for its math, and there are issues there. The languages also expect to treat integers as first-class citizens: methods can be invoked directly on an integer. In order to do that they end up using Integer instead of the primitive int, so all integers are now capital-I Integers, and they suddenly have a very high allocation rate of Integers.

And so those languages are dominated by the speed at which you can allocate Integers. One of the issues there is that the one interesting field in an Integer is a final field, and final-field semantics demand a memory fence after the allocation. Suddenly, every time you did a plus 1 or i++, that ++ was math that made a new Integer, which needed a memory fence as well as an allocation, and the combination was very difficult to accelerate. A bunch of work went into recent versions of OpenJDK doing autoboxing optimizations, where the JIT tries to decide that you are making an Integer and then throwing it away (it's dead on the next instruction), and asks: can it avoid the actual allocation and just keep it as a lowercase int internally?
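
A toy illustration of the boxing cost he describes: the boxed loop below allocates a fresh Integer on most iterations (outside the small-value cache), while the primitive loop allocates nothing. The method names are hypothetical; the two loops compute the same sum.

```java
// Toy illustration of autoboxing overhead: each ++ and + on a capital-I
// Integer unboxes, does the math, and boxes a new object; the primitive
// version stays allocation-free. Same result, very different cost.
public class BoxingDemo {
    static long sumBoxed(int n) {
        Integer total = 0;                 // capital-I Integer
        for (Integer i = 0; i < n; i++) {  // each ++ boxes a new Integer
            total = total + i;             // unbox, add, box again
        }
        return total;
    }

    static long sumPrimitive(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {      // no allocation at all
            total += i;
        }
        return total;
    }
}
```

The JIT's autoboxing optimizations try to prove the intermediate Integers are dead and keep the values as primitive ints internally, which is exactly the escape he describes.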

And that optimization only goes so far. So I think those languages still suffer at the moment from a very high allocation rate, because of things like Integer being the default for what is otherwise a primitive int.

Value types: I don't get why people care all that much about avoiding primitives in the language. I guess I have been an old-school hacker since the dawn of time, and primitive ints were just what you did everything with. When people try to tell me, "No, no, no, we don't want to have any primitives, or at least we want to hide them as much as we can", well, there is a real cost to hiding them. There is a reason they came first and have been around forever: they are very efficient, for the obvious reason that they map directly to hardware. So I would propose that we add primitives to the language, but I get that people don't want to do that.

So if you don't want to do that, then one penalty you pay is that things run slower, and the option now is to wait for the VM vendors to improve their VMs to optimize these things automatically; but they will never get as good as a primitive int. They might get closer than where we are, so the VMs can be improved. OpenJDK did a bunch of work, and they can do more; whether there is more to do here depends on how popular these languages become, and whether we feel we can add significant value. Things could happen there; personally, I still think it's sort of a misguided effort.

Not invokedynamic; hiding primitives from the language. Invokedynamic makes sense: it's an interesting way to enable people to do essentially custom-tailored dispatch rules. It's not invokevirtual, it's not the standard vtable lookup; it's some new, different thing that you're defining on the fly. I think that's a pretty spiffy thing to attempt. I am all for invokedynamic. But like I said, I am going to let OpenJDK go there first, and we'll follow when we next integrate with OpenJDK.

Tail recursion is another one that I think the VM can do a good job on, and we should just go after it at the VM level; we don't need any funny language support for it or new changes to the VM spec. It's a more straightforward engineering job. Unfortunately, it touches not just the JITs but also all the runtime infrastructure, and it seems harder to do because it crosses boundaries: at the OpenJDK level the VM team, the JDK team and the runtime team are all separate groups, and this one crosses a bunch of those boundaries, so it might be harder politically to get it to go. At Azul Systems I could probably burn it out in a week or two; if it becomes the killer new feature to add, it will probably happen.
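
The transformation a VM would apply is the one you can do by hand today. Below is a hedged sketch using Euclid's algorithm as the example (my choice, not his): the recursive call is in tail position, so its stack frame can be reused, which is exactly the rewrite into a loop.

```java
// Sketch of tail-call elimination: the tail-recursive form would normally
// grow the stack one frame per call; after the rewrite, the parameters
// become loop variables and one frame is reused.
public class TailCallDemo {
    // Tail-recursive: the recursive call is the last action taken.
    static long gcdRecursive(long a, long b) {
        if (b == 0) return a;
        return gcdRecursive(b, a % b);     // tail position: frame could be reused
    }

    // Same function after tail-call elimination.
    static long gcdLoop(long a, long b) {
        while (b != 0) {
            long t = a % b;                // rebind the "parameters"
            a = b;
            b = t;
        }
        return a;
    }
}
```

A VM doing this automatically would apply the same rewrite to any call it can prove is in tail position, without the programmer restructuring the code.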

I think so. I think the goal of that was basically to distance the JVM from Java and say there is a future here outside of Java. We would like the VM to be the platform of choice for people going forward with new languages; so yes, Clojure, Scala, JRuby, Jython, Groovy, there is a whole giant list of them. They are using the JVM for obvious reasons: it comes with the JIT, it comes with a compiler and code generator, it comes with garbage collection, with multi-threading support, with the Java memory model, and with a huge collection of pre-built libraries. So it's an obviously interesting platform to target a new language at, and we'd like to be part of that going forward.

Are they powerful enough? There will be plenty of people to tell you "never". For a very small hardware cost we could have the read barrier done in hardware again and get back that fifteen to twenty percent we threw away. A fifteen-to-twenty-percent single-threaded speedup for a tiny amount of hardware seems like a worthwhile tradeoff to me. Whether Intel goes there, I don't know; certainly, if some alternative to Intel came along and said, "we have a fast x86 and it has a read barrier", we would check it out. That might make a very compelling solution for people, picking up somebody who's not necessarily Intel.

The other piece we thought was very interesting was the just-in-time zeroing. It's a giant bandwidth win, especially for managed languages, where garbage collection typically does bump-pointer allocation: each new bump of the pointer for a new object touches a new cache line which has obviously not been in your caches since the last GC cycle, so it's a guaranteed cache miss. Turning that into no memory bandwidth at all is an interesting thing. So those two pieces, I feel, are compelling reasons to get hardware support; whether or not the hardware vendors will choose to go there, I can't say.
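
For readers unfamiliar with bump-pointer allocation, here is a toy sketch (purely illustrative, not a real allocator): the cursor only moves forward, so consecutive allocations land in consecutive, previously untouched memory, which is why each one is a guaranteed cache miss without tricks like zeroing into the cache.

```java
// Toy bump-pointer allocator: allocation is a bounds check and a cursor
// bump, and each new object occupies fresh, never-revisited memory.
public class BumpAllocator {
    final byte[] space;
    int cursor = 0;

    BumpAllocator(int size) { space = new byte[size]; }

    // Returns the offset of a freshly "allocated" object, or -1 if full
    // (a real collector would trigger a GC cycle at that point).
    int allocate(int bytes) {
        if (cursor + bytes > space.length) return -1;
        int obj = cursor;
        cursor += bytes;   // bump: the next allocation hits the next fresh line
        return obj;
    }
}
```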

The laws of physics have come down on raising the clock rate, and the hardware designers have run out of ideas for getting more performance out of a single core by adding more transistors to it. Doubling the size of your cache is not going to double your performance; it's not even going to give you five percent these days, it's very minor. They've done all the out-of-order execution, register renaming, branch prediction, all the magic tricks in hardware; those cores aren't going to get any faster even if you add more transistors. But we are getting more transistors, so what do you do with them?

You either make a smaller die, which is cheaper (that's exactly what is happening in the cell-phone and embedded spaces: you are getting more potent computers in ever-smaller form factors), or, for server apps, where power and cooling are taken care of, you get more cores. And that trend is going to continue as long as Moore's law holds up. If the manufacturing people can keep making larger dies or smaller transistors, you are going to get more transistors, and that means more cores. Can you do something with them? That's the question. One of the bottlenecks that has been holding back the Java server market has been garbage collection.

You can't have a heap that's more than four gigs or your GC pauses become intolerable, and four gigs can be comfortably handled by a handful of CPUs. If the cores keep coming, you can't make those CPUs do any work: you need a bigger heap, but you can't handle a bigger heap because you can't handle the GC pauses. So people were stuck, and they said, "even now, within a single die, I am running multiple JVMs in a cluster configuration because I simply can't handle the one large heap." If we fix the one-large-heap issue, I think people will once again look at having VMs that span the whole server, eat all the memory on the box, and usefully use the ten, twenty, forty CPUs floating around available within the server.

You still have to solve the parallel programming problem, and that's a hard one, but of the issues around scaling we can get rid of the GC side of things; we can fix that piece.

That is part of the parallel programming piece: can you make a program whose different tasks run concurrently reasonably well? The whole robustness side is still probably going to be done with HA (high-availability) clustering: arranging your apps so they spread across multiple server boxes, so that if one goes down, others take over. I don't see HA-style solutions going away any time soon. We'll just allow you to have a single larger VM, so you'll need fewer nodes in the cluster.

Yes, OK, so that was "Under the Hood on the x86". That is a long discussion of the magic tricks x86 does in order to make a single thread of execution run faster while not actually having any faster transistors; those were the micro-architectural speedups you can do. You know what caches are, why they're effective, how far they can go, what you can do with them, and whether you can do other work while you take a cache miss. A couple of takeaways out of that are that things are really complicated already, and more complication doesn't appear to help anymore. Memory is sort of the new disk, if you like: if you are going to main memory you are going slow, so to the extent that you can run out of cache you are a lot better off.

The CPU's design goal has changed from running instructions fast to running as fast as you can to the next cache miss, and seeing how many cache misses you can have outstanding concurrently. What is another major theme out of that one? I think those are probably the big-picture items. Another one is that the memory model at the hardware level is substantially weaker than the Java memory model; it's different from the Java memory model. And that means you can observe values out of order in very strange ways. Any one particular memory location (a read of a field or a particular item) has, in actual hardware, multiple places where that value is kept. They are not necessarily kept in sync; they are eventually consistent, but not consistent at every moment, and so you can have very funny stale-read issues.

If you skip the volatile or synchronized keywords and it happens to run at the moment, that doesn't mean it's going to run correctly later; it probably fails to be correct only under very heavy load, which is when your servers are under the worst possible strain, and that's when the bugs from failing to synchronize properly start to come out. The reason it's going that route is that, in order to make systems run faster, the hardware designers have the CPUs talking core to core as little as possible; they delay communication with each other, and that means that if this fellow does a write and that fellow does a read of the same location, whether he sees the written value depends on the luck of the draw and the timing of things.
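
The canonical small example of this visibility point is a stop flag. This is a sketch, not from the interview: with volatile, the worker is guaranteed to observe the main thread's write and stop promptly; without it, the JIT is free to hoist the flag read out of the loop and the worker may spin forever (the broken variant is omitted precisely because its failure is timing- and JIT-dependent).

```java
// Minimal visibility demo: 'running' is volatile, so the write from the
// main thread is guaranteed to become visible to the spinning worker.
public class VolatileFlagDemo {
    static volatile boolean running = true;   // drop 'volatile' and this may hang
    static long iterations = 0;

    static long runWorker() {
        Thread worker = new Thread(() -> {
            while (running) {                 // must re-read shared state each pass
                iterations++;
            }
        });
        worker.start();
        try {
            Thread.sleep(10);                 // let the worker spin a little
            running = false;                  // this write must become visible
            worker.join(1000);                // join gives us happens-before on 'iterations'
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return iterations;
    }
}
```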

And that situation is probably not going to get better; it's going to get worse. As the die gets bigger and the transistors get smaller, the distance from core to core grows, and the time it takes (in terms of clock cycles and work accomplished) for communication to go from one core to the next, even on the same socket, is going to grow. Certainly on x86, if you have multiple sockets, the ones that are multiple hops away might be hundreds of cycles away, which at a four-wide issue rate is heading towards a thousand instructions between when somebody wrote a value and when somebody else witnesses the write.

So if you have issues about forcing the order of your writes and your reads, some other thread on some other CPU somewhere else will see them in a different order, get a stale read value, and that will lead you into subtle race-condition bugs.

Yes, that's a different question than the one you just asked. I have certainly written large, complicated, parallel Java programs. Most of my day-to-day coding is in C and C++; I do concurrent algorithms in them all the time, in a very specialized subset of the C++ language. I sort of know (the compiler is a white box, I know exactly what it does) and I have a very strong handle on what the hardware does, so I get it right by dint of being an expert and by writing a subset of all possible concurrent algorithms: very straightforward, very simple ones, point solutions for use inside the VM.

What I do for fun with concurrent algorithms is typically look for different parallel programming paradigms. I am currently staring at state-machine-style paradigms because they have essentially infinite scalability (my non-blocking hash map is done this way): there is no limit to the number of CPUs that can get involved in a non-blocking hash map in a state-machine-style operation. And there aren't any good tools. The hardware guys do state machines all the time, and they have a lot of useful tools to help them manage state machines: manage their descriptions, do source-code control on changes to them, verify interesting properties about them, and so on.

That kind of tooling is not available on the software side of state machines, but if you get a state machine done right, it's incredibly fast and incredibly scalable: much, much faster than any locking variation, and much more scalable. And like I said, once you get it done right, it has no data races; or, another way to put it, it's only one giant data race, some combination of those two notions taken at the same time. So I am playing with state machines.
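
A tiny sketch of the style he describes: every transition is a single compare-and-swap on the current state, so any number of threads can race to advance the machine and exactly one wins each step. This is a hypothetical example, far simpler than the per-slot state machines in his non-blocking hash map.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal CAS-driven state machine: no locks, no waiting; a losing thread
// simply observes the new state and decides what to do next.
public class CasStateMachine {
    enum State { NEW, RUNNING, DONE }

    final AtomicReference<State> state = new AtomicReference<>(State.NEW);

    // Attempt one legal transition; returns true iff this thread won the CAS.
    boolean transition(State from, State to) {
        return state.compareAndSet(from, to);
    }
}
```

The scalability comes from the fact that a failed CAS costs one retry decision rather than blocking: threads never hold a lock that others must wait on.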