Shenandoah GC

Shenandoah is the low pause time garbage collector that reduces GC pause times by performing more garbage collection work concurrently with the running Java program. Shenandoah does the bulk of GC work concurrently, including the concurrent compaction, which means its pause times are no longer directly proportional to the size of the heap. Garbage collecting a 200 GB heap or a 2 GB heap should have the same predictable pause behavior.

Shenandoah is in upstream OpenJDK since JDK 12, under JEP 189. Backports to JDK 8u and JDK 11u are available as well, see the "Build and Run" section for more details.

Reporting Bugs

There are several ways to report bugs. Here is the checklist:

Try the latest nightly builds, see "Build, Run, Test" section below.

Try narrowing down the bug using techniques from "Functional Diagnostics" and "Performance Diagnostics" sections below.

Try to come up with the minimal test case for the issue. Larger projects are fine too, as long as developers can build and run it themselves.

Submit the bug report, using either of these three options:

All JDK versions: Send the message shenandoah-dev mailing list. That mailing list is moderated, so either subscribe first, or wait for moderator to accept your message.

JDK 8, JDK 11: If you have the Red Hat subscription, and running the binaries from Fedora or RHEL, report the bug through the Red Hat support

JDK 12 and later: Submit the OpenJDK bug report clearly mentioning which build you are running. This would require Oracle folks to accept the bug first, before handing it over to Shenandoah devs.

Be prepared to answer a few follow-up questions

If you have time, be prepared to test the patches and/or nightly binaries

Overview

Shenandoah is the regionalized collector, it maintains the heap as the collection of regions.

Init Mark initiates the concurrent marking. It prepares the heap and application threads for concurrent mark, and then scans the root set. This is the first pause in the cycle, and the most dominant consumer is the root set scan. Therefore, its duration is dependent on the root set size.

Concurrent Marking walks over the heap, and traces reachable objects. This phase runs alongside the application, and its duration is dependent on the number of live objects and the structure of object graph in the heap. Since the application is free to allocate new data during this phase, the heap occupancy goes up during concurrent marking.

Final Mark finishes the concurrent marking by draining all pending marking/update queues and re-scanning the root set. It also initializes evacuation by figuring out the regions to be evacuated (collection set), pre-evacuating some roots, and generally prepares runtime for the next phase. Part of this work can be done concurrently during Concurrent precleaning phase. This is the second pause in the cycle, and the most dominant time consumers here are draining the queues and scanning the root set.

Concurrent Cleanup reclaims immediate garbage regions – that is, the regions where no live objects are present, as detected after the concurrent mark.

Concurrent Evacuation copies the objects out of collection set to other regions. This is the major difference against other OpenJDK GCs. This phase is again running along with application, and so application is free to allocate. Its duration is dependent on the size of chosen collection set for the cycle.

Init Update Refs initializes the update references phase. It does almost nothing except making sure all GC and applications threads have finished evacuation, and then preparing GC for next phase. This is the third pause in the cycle, the shortest of them all.

Concurrent Update References walks over the heap, and updates the references to objects that were moved during concurrent evacuation. This is the major difference against other OpenJDK GCs. Its duration is dependent on number of objects in heap, but not the object graph structure, because it scans the heap linearly. This phase runs concurrently with the application.

Final Update Refs finishes the update references phase by re-updating the existing root set. It also recycles the regions from the collection set, because now heap does not have references to (stale) objects to them. This is the last pause in the cycle, and its duration is dependent on the size of root set.

Concurrent Cleanup reclaims the collection set regions, which now have no references to.

Performance Guidelines and Diagnostics

This section describes the approaches to test and diagnose performance behaviors with Shenandoah.

Basic configuration and command line options:

Shenandoah performance, like the performance of almost all other GCs, depends on heap size. It should perform better in cases when there is enough heap space to accommodate allocations while concurrent phases are running (see "Failure Modes" section below). The time for concurrent phases correlates with the live data set size (LDS) -- the space taken by live data. Therefore, the reasonable heap size is dependent on LDS and allocation pressure in the workload: for a given allocation rate, larger LDS-es require proportionally larger heap sizes; for a given LDS, larger allocation rates require larger heap sizes. For some workloads with minuscule live data sets and moderate allocation pressure, 1...2 GB heaps performs well. We routinely test on 4...128 GB heaps on various workloads with up to 80% LDS size. Don't be shy to try different heap sizes to see what fits your workload.

-Xlog:gc+stats (since JDK 9) or -verbose:gc (up to JDK 8) would print the summary table on Shenandoah internal timings at the end of the run.

It is almost always a good idea to run with logging enabled. This summary table conveys important information about GC performance, and we would almost inevitably ask for one in a performance bug report. Heuristics logs are useful to figure out GC outliers.

-XX:+UseTransparentHugePages: this greatly improves performance on large heaps. It is recommended to set /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag to "madvise" on Linux. When running with AlwaysPreTouch, then init/shutdown would be faster, because it will pretouch with larger pages. It will also pay the defrag costs upfront at startup

-XX:+UseNUMA: while Shenandoah does not support NUMA explicitly yet, it is a good idea to enable this to enable NUMA interleaving on multi-socket hosts. Coupled with AlwaysPreTouch, it provides better performance than the default out-of-the-box configuration

-XX:-UseBiasedLocking: there is a tradeoff between uncontended (biased) locking throughput, and the safepoints JVM does to enable and disable them as needed. For latency-oriented workloads, it makes sense to turn biased locking off.

-XX:+DisableExplicitGC: invoking System.gc() from user code forces Shenandoah to perform STW Full GC, which is detrimental to GC pauses; it makes sense to disable this to protect from libraries that do this. There is an alternative -XX:+ExplicitGCInvokesConcurrent, that can force concurrent cycle instead of Full GC on System.gc(), you are advised to use that in cases System.gc() calls do matter.

Heuristics

Heuristics tell when Shenandoah starts the GC cycle, and regions it deems for evacuation. Heuristics can be selected with -XX:ShenandoahGCHeuristics=<name>. Some heuristics accept configuration parameters, which might help to tailor the GC operation to your use case better. Available heuristics include:

adaptive (default). This heuristics observes the previous GC cycles, and tries to start the next GC cycle so that it could complete before heap is exhausted

-XX:ShenandoahInitFreeThreshold=#: Initial threshold at which to trigger "learning" collections

-XX:ShenandoahAllocSpikeFactor=#: How much heap to reserve for absorbing allocation spikes

-XX:ShenandoahGarbageThreshold=#: Sets the percentage of garbage a region need to contain before it can be marked for collection.

static (previously and ironically known as dynamic). This heuristics decide to start GC cycle based on heap occupancy and allocation pressure. Useful tuning knobs for this heuristics are:

-XX:ShenandoahFreeThreshold=#: Set the percentage of free heap at which a GC cycle is started

-XX:ShenandoahAllocationThreshold=#: Set percentage of memory allocated since last GC cycle before a new GC cycle is started

-XX:ShenandoahGarbageThreshold=#: Sets the percentage of garbage a region need to contain before it can be marked for collection

compact (previously erroneously known as continuous). This heuristics runs GC cycles continuously, starting the next cycle as soon as previous cycle finishes, as long as allocations happen. This heuristics would normally incur throughput overheads, but shall provide the most prompt space reclamation. Useful tuning knobs are:

-XX:ConcGCThreads=#: Trim down the number of concurrent GC threads to make more room for application to run

-XX:ShenandoahAllocationThreshold=#: Set percentage of memory allocated since last GC cycle before starting another one

passive (diagnostic). This heuristics tells GC to be completely passive. Once available memory runs out, Full Stop-The-World GC would be triggered. This heuristics is used for functional testing, but sometimes it is useful for bisecting performance anomalies with GC barriers (see below), or figuring out the actual live data size in the application.

aggressive (diagnostic). This heuristics tells GC to be completely active. It will start the new GC cycle as soon as the previous one finishes (like "compact"), and it will evacuate all live objects. This heuristics is useful for functional testing of the collector itself. It incurs heavy performance penalty.

In some cycles, Update References phase is merged with Concurrent Marking phase, at heuristics discretion. You can forcefully enable/disable Update References with -XX:ShenandoahUpdateRefsEarly=[on|off].

Failure Modes

Concurrent GC like Shenandoah implicitly relies on collecting faster than application allocates. If allocation pressure is high, and there is not enough space to absorb the allocations while GC is running, Allocation Failure would eventually happen. Shenandoah has a graceful degradation ladder that helps to survive the cases like these. The ladder consists of:

Pacing (-XX:+ShenandoahPacing, enabled by default). When GC is running, it has the idea how much GC work is needed to be done, and how much free space is available for application. Pacer would try to stall allocating threads when the GC progress is not fast enough. When GC collects faster than application allocates, pacer naturally does not stall. Note that pacing introduces the local per-thread latency that is not visible in usual profiling tools. This is why the stalls are not indefinite, and they are bounded by -XX:ShenandoahPacingMaxDelay=#ms. After max delay expires, the allocation would happen anyway. Most of the time, mild allocation spikes are absorbed by the pacer. When allocation pressure is very high, pacer would not be able to cope, and the degradation moves to the next step.Usual latency induced: <10 ms

Degenerated GC (-XX:+ShenandoahDegeneratedGC, enabled by default). If application runs into allocation failure, then Shenandoah would dive into stop-the-world pause, stop the entire application, and continue the cycle under the pause. The Degenerated cycle might be faster than the concurrent one, because it does not contend with the application over the resources, and it uses -XX:ParallelGCThreads, not -XX:ConcCGThreads for thread pool sizing. In many cases, Degenerated cycles happen after a significant amount of GC work is already done, and a small part of GC work needs to be completed. This is why the STW pause is not usually large. It would be reported as GC pause in the GC log. Degenerated GC may happen if GC cycle started too late, or if a very significant allocation spike had occurred.Usual latency induced: <100 ms, but can be more, depending on the degeneration point

Full GC. If nothing helped, for example, when Degenerated GC had not freed up enough memory, Full GC cycle would happen, and compact the heap to the max. Certain scenarios, like the unusually fragmented heap coupled with implementation performance bugs and overlooks, would be fixed only by Full GC. This last-ditch GC guarantees that application would not fail with OOM, if there is at least some memory is available.Usual latency induced: >100 ms, but can be more, especially on a very occupied heap

In addition to usual GC log that would print individual Degenerated GC and Full GC events, -Xlog:gc+stats would show something like this at the end of the run:

From this, there are a few things to try if application runs into either of these degradation steps:

Give the application more heap. This would allow to absorb more allocations when GC is running.

Trim down the amount of live data in the heap. This would allow GC cycle to run faster, and cope with allocations better.

Trim down the allocation pressure. For example, reduce the number of allocating threads, or fix the major allocation hogs in the application.

Tune up heuristics to start the GC cycle as soon as possible. It might not help if GC logs already say that GC is running back-to-back cycles.

Crank up the pacing delay. This would stall allocating threads more, instead of elevating to Degenerated and Full GCs – note this still introduce latency for those allocating threads!

Performance Analysis

Approaches to performance analysis:

Some odd performance behaviors – like the allocation failure GCs, or long final marks – can be explained by heuristics issues. -Xlog:gc+ergo is your friend there. If you have a long-running workload, running under Shenandoah Visualizer would let you understand the high-level GC behavior. Sometimes the odd behavior is clearly visible there.

Some performance difference can be explained by larger allocation pressure under Shenandoah, since it includes the forwarding pointer with each object. Look at the allocation rates to see if that might be a problem, and do experiments that could confirm this (for example, beefing up objects should diminish the performance difference against another collector). There are some cases when larger footprint means dropping out of CPU cache, look for L1/L2/LLC miss differences.

Many throughput differences can be explained by GC barriers overhead. When running with -XX:ShenandoahGCHeuristics=passive, and that heuristics only, barriers are not required for correctness, and so heuristics disables them. It is then possible to enable the barriers selectively back, and see what barriers are affecting throughput performance. The list of barriers that "passive" heuristics is disabling is listed in GC output, like this:

Navigate the report, and see where are suspiciously hot methods/paths are. Pressing "a" on the method usually gives a more detailed disassembly for it

Profiling the barriers code requires PrintAssembly-enabled build. We recommend to use JMH -prof perfasm to create the isolated scenarios and looking at the generated code under Shenandoah.

It is important to understand that GC pauses might not be the only significant contributor to response times in regular applications. Having large GC pause spells the problem with response time with a very high probability, but the absence of long GC pauses does not always mean decent response time. Queueing delays, network latencies, other services latencies, OS scheduler jitter, etc. could be the contributing cost. Running Shenandoah with response time measurement is recommended to get the full picture of what is going on in the system, which can then be used to correlate with GC pause time statistics.

For example, this is a sample report with jHiccup on one of the workloads:

Functional Diagnostics

This section describes the ways one can diagnose and/or debug Shenandoah.

These are the steps you can do to narrow the problem area:

Run with -XX:+ShenandoahVerify. This is the first line of defense against GC bugs. It is available in both release and fastdebug builds. If Verifier identifies a problem, then it is most likely the GC bug. A simple reproducer would be convenient to have in order to diagnose this better. In many cases, what happened before in the GC matters, e.g. the last actions taken by GC. That history is usually recorded in associated hs_err_pidXXXX.log, be sure to include it in full while reporting the bug.

Run with fastdebug build. In many cases, this will produce the meaningful assert messages, pointing at the earliest moment when GC detected the functional anomaly, and Shenandoah asserts a lot. These builds can be produced by adding --enable-debug to configure, and re-building. As usual, hs_err_pidXXXX.log conveniently records the environment and historical data that helps to investigate the assertion failure.

Run with -XX:ShenandoahGCHeuristics=passive, which will do the stop-the-world GCs only, and avoid doing most of concurrent work. If the issue goes away in passive mode, it must be the bug in concurrent phases and/or barriers.

Run with different compilers: -Xint (interpreter only), -XX:TieredStopAtLevel=1 (C1 only), -XX:-TieredCompilation (interpreter and C2 only) – to dissect which modes fail, and which are not. This would highlight if the issue is in barriers handling or optimizations in interpreter, C1, or C2. This is usually helpful to combine with fastdebug builds, because compilers assert things too.

Run with -XX:ShenandoahGCHeuristics=aggressive. This heuristics runs back-to-back GCs, and evacuates all non-empty regions. Since Shenandoah does most GC heavy-lifting concurrently, this does not block application from executing, although GC would consume much more cycles in this mode and slow the application down. Note that enabling Verifier in this mode would probably degrade performance to unpractical levels.

Put logging statements in the code around the failing assert to understand the issue better. With enough logging, you can retrace everything that happened in the collector.

Add more asserts around the suspicious parts in the code. Look at macro definitions in shenandoahAsserts.hpp to see what rich asserts are available

Attach a native debugger, for example, gdb, by asking VM to do the external action on failure with -XX:OnError="gdb - %p" (%p would get substituted with process PID)

Create a simple reproducer and hand it over to Shenandoah developers. :)

Build, Download, Install, and Run

Shenandoah is under development in mainline JDK since 12. Apart from mainline builds, there are downstream builds available for current JDKs. Note that Oracle JDK and Oracle-provided OpenJDK builds do not include Shenandoah. The changes flow between the development repos and builds as described on a simplified diagram below. If you are an early adopter, trying a bleeding edge should be more profitable performance-wise, but may risk exposure to not-yet-discovered bugs. If you are looking at running Shenandoah in the actual deployments, using the most stable builds is preferred.

More detailed, there are different ways to get Shenandoah:

Binary Development Builds

There are (nightly/weekly) development builds available at these locations:

Note: OpenJDK is normally compiled with all warnings treated as errors. Newer compilers may have more warnings that codebase had not yet caught up with. You can pass --disable-warnings-as-errors to ./configure in those cases.

In all cases for building from source it is optional, but advisable to run the tests. This is especially important on platforms beyond what Shenandoah currently targets, and/or building with too new or too old toolchains. You will need jtreg to run the tests, and it makes sense to run test against fastdebug build first: