Menu

Tag Archives: sgen

Now and then we get bug reports of test cases crashing on a specific configuration in an indeterminate manner or frequency, both from our own build bots as well as from our users. One of those I looked at recently involved a runtime test case on Linux/AMD64. The test would only crash infrequently, but luckily I could reproduce it in a VM.

What the program does is run two threads, one of which causes null reference exceptions in a loop over and over again:

Mono generates null reference exceptions by just letting the code trigger a segmentation violation signal, which we catch in a signal handler that then proceeds to throw the NRE. This approach, as opposed to doing a null check whenever we access an object, is that there is no overhead at all in the case the reference is legit, and the code we generate is shorter.

Garbage collection, on the other hand, also involves signals: stopping the world for a collection on Linux is done by sending a signal to each managed thread. Each thread catches its signal, which tells it that it has to suspend until the GC has done its work.[1]

When we look for the cause of a bug like this the first thing we try to do is catch the crash in a debugger, in this case GDB. Unfortunately, the crash was caused by a segmentation violation that our signal handler cannot handle (because it doesn’t occur in managed code), but a regular run of the program also triggers segmentation violations like crazy, so just breaking at each SIGSEGV didn’t get us anywhere.

My first attempt at getting to the crashing SIGSEGV was to write an Expect script that drove GDB to the first SIGSEGV, presumably caused by the null reference in the managed code loop, remember the offending address and then keep continuing until it hit a SIGSEGV at a different address. Sadly (and quite typical of cases like this), running the code that way in GDB never succeeded in triggering the bug, so we had to think of other options.

At first we thought the crash was caused by the interaction of the two signals (the segmentation violation from the null reference and the stop-the-world signal), but we looked into other hypotheses. We tried two different ways of getting rid of the segmentation violation: using explicit null checks to trigger NREs and just throwing an exception manually without accessing a null reference.

Explicit null checks (enabled with the environment variable MONO_DEBUG=explicit-null-checks) makes the JIT emit less efficient and more elaborate null checks whenever an object is accessed, so an NRE can be thrown without going through a SIGSEGV.

The other way to get rid of the SIGSEGV while still retaining most of the code’s behavior is to just throw an exception, like so:

We figured it probably didn’t matter which type the exception was, and indeed it did not: both with explicit null checks as well as with the manual exception throw the crash still occurred. We also looked into whether invoking the garbage collector was critical to the crash, and it turned out it was: just making the other thread sleep instead of calling the GC made the crash go away. Also, despite what the bug report claims, the crash was not dependent on SGen—it occurred with the Boehm collector, as well.

Armed with this new data we set out to finally get a close look at the crash in the debugger. Since the crash still only occurred very infrequently I brought my Expect script to bear again, and after a while we did get the crash in GDB. It did, regrettably, give us less information that we had hoped: instead of crashing somewhere in unmanaged code while dereferencing an invalid pointer, the crash resulted from the program counter being set to an invalid address, namely zero.

There are three main ways of setting the program counter on AMD64: CALLing a function, RETurning from a function and JMPing to an address. Spending some time looking at the stack dump it seemed that the most likely cause was a jump, which did not make our job easier, since it leaves no trace of where it came from.

After some more tinkering we decided to investigate whether the point at which the exception throwing thread was interrupted by the GC played any role. We got lucky—it so happened that every time the test crashed, the last time the thread was stopped was always in the exact same piece of code; even more, at the exact same instruction. This is the code in which the thread suspension occurred, with the instruction where it happened labeled:

This piece of code is executed as the last step of throwing an exception. It jumps to the catch handler, which involves first restoring a bunch of registers, including the stack pointer, and then doing the actual jump. We guessed that this jump to register r11 was the one that landed us at address zero. Observing in a few more crashes that r11 was indeed always zero corroborated that hypothesis.

The code that calls this restore_context() function looks like this:

void
mono_amd64_throw_exception (MonoObject *exception, ...)
{
MonoContext ctx;
... figure out where on the stack the matching catch clause is ...
... set ctx to the values required to end up in that catch block ...
restore_context (&ctx);
}

Note that the context—the data structure that contains all the values for the registers, including the program counter—is a local variable of this function, i.e., it resides in this function’s stack frame. Said stack frame is on the same stack as the catch handler that it eventually dispatches to, but further down, of course. We pass the address of the context to the above restore_context() function.

It seemed obvious now that for whatever reason the value for the program counter was set to zero in the context, so we added an assertion to check for that directly before the call to restore_context(). The assertion never failed, but the program still crashed.

Looking at the assembly code again Rodrigo suddenly proclaimed that he knew exactly what caused the bug: after setting the stack pointer to the catch handler’s frame we did another load from the context, which we weren’t allowed to do. By moving the stack pointer upward the context ended up lying not above the stack pointer, where it used to be, but below:

Most ABIs specify that programs must assume that data below the stack pointer (plus the red zone) is subject to arbitrary change because it’s used by the operating system for invoking signal handlers, for example—exactly what happened in our case with the thread suspend signal.

The correct way to do the context restore, then, is to load everything from the context before restoring the stack pointer. In our case we use the scratch register r8 to hold the value that will end up in rsp:

Looking at our code for x86, after fixing AMD64, we found that we had the same bug there, but nobody had reported it yet, probably because it was harder to trigger.

Reality is a tiny bit more complicated: Sending a signal to a thread is asynchronous, meaning that the GC has no direct way of knowing that a thread has already gotten the signal and reacted to it. Instead, the suspend signal handlers message back to the GC that the point of suspension has been reached, upon which they actually suspend, waiting for yet another signal from the GC, telling them to resume. Note that if the resume signal arrived between the point where the GC is notified of the suspension and the point of initiating the wait for said resume signal, the signal would be lost and the thread would never resume. ↩

Introduction

In my last blog post about SGen, Mono’s garbage collector, I wrote about concurrent mark-and-sweep, which makes most of the garbage collection work happen concurrently with the running program.

One thing we cannot do in a concurrent collector, as opposed to a stop-the-world collector, is move objects around in memory[1]. This is typically not a problem in a mark-and-sweep collector, but SGen goes beyond basic mark-and-sweep and prevents fragmentation by evacuating objects when the need arises.

Fragmentation is no less of an issue in a concurrent collector, so being unable to do evacuation is problematic. Our best solution at this point is to stop the world when it happens[2]. The threshold we use to decide when to do evacuation with concurrent mark-and-sweep is higher than on our non-concurrent collector, but there is still a chance that a full collection pause will hit the program when interactive speeds are desired.

The API

We’re introducing a new API in the Mono.Runtime class. The application can use it to ask the GC not to do full major collection pauses:

One possible use case would be a game, where you want the lowest latency possible while the player is in a level, but between levels, or during loads, a longer collection pause is not much of an issue. Note that there are still no guarantees about the lengths of the pauses—this is a best-effort approach, not a real-time garbage collector.

This API will be available in the next Mono release, but only via reflection. It will be a public API on mobile soon.

This is technically possible, but it involves using a read barrier, which would not only be forbiddingly complicated to implement in Mono, but would also probably incur a significant performance hit as well as break our embedding API. ↩

An approach that would keep marking mostly concurrent would be to decide which blocks to evacuate before starting concurrent mark, and then during marking to keep tabs on the references from already marked objects to objects within those blocks. In the finishing collection pause we could evacuate the marked objects in those blocks, but we’d have to update the references to them. Depending on the workload that might still require a significant pause. ↩

Introduction

Like most garbage collectors, SGen is of the stop-the-world variety, which means that when the time has come to perform its work, it stops all managed threads in their tracks and only resumes them once it has finished.

Nursery collections are usually relatively short affairs, and can be sped up even further by using a smaller nursery size. Major collections, on the other hand, can take a long time because they must scan every single live object in the system. Long garbage collection pauses are clearly a serious problem for many applications, be they interactive or on the server, and garbage collector engineers make considerable efforts to shorten pause times.

One approach to this is to do parts of the major collection concurrently with the running program, aka the mutator. SGen’s major collector, mark-and-sweep, consists of two main phases, unsurprisingly called mark and sweep.

Thanks to Zoltan’s recent work—lazy sweep—SGen already does most of sweep’s work outside of major collection pauses. I have been working on the other part of the story for the past few weeks: concurrent mark.

How it works

A major garbage collection starts with an initial marking pause, during which all roots are scanned.[1] Then it restarts the world and in the background a worker thread performs the marking and scanning of major heap objects, in the usual manner. When the concurrent marking phase is done, the collector stops the world again to finish up the collection in the final marking pause.

Dealing with concurrent mutation

Consider the following code, which manipulates a linked list, swapping the order between the two nodes following a:

Let’s assume that right after a->next has been set in line 3, the concurrent worker thread scans and marks the node referenced by a, then follows next and processes c. If at that point our program hasn’t come around to setting c->next to b in line 5, the garbage collector will never encounter b—it will have been lost because for a brief moment that node wasn’t part of the list.

We can recruit the help of our friend the write barrier to deal with this issue, but we need to extend its scope a little: instead of just recording stores that point from the major heap to the nursery, it must now also record stores pointing from the major heap to the major heap. In other words: all reference stores that happen in the major heap, no matter where they point to, must be recorded.[2]

With this new write barrier the store to c->next will record the node c as requiring a re-scan. This re-scanning of all modified major heap objects happens in the final marking pause, so it’s not further disturbed by the mutator.

What if the final marking pause happens before c->next is set in line 5, however? The node b might not be pointed to by any object in the heap, so we won’t find it that way, yet it is still live. It’s live because it’s referenced by the variable b on the stack, so the final marking pause must also re-scan all roots.[3]

Nursery collections

Nursery collections that happen while concurrent mark is running stop the world, but concurrent mark keeps chugging along. It never follows references into the nursery anyway, so there is little potential for conflict.

There is an issue with the card table, however: nursery collections clear the card table, but its data is required during the final marking pause, which might only occur a few nursery collections later. The solution to this is to maintain a separate card table that accumulates the marked bits from the “regular” card table, and to use this union table in the final scan.

Results

This graph shows maximum pause times for a few of the programs in our GC benchmarks suite. All of the pause times are relative to Boehm’s, so values below 1 are better than Boehm’s. Note that the values correspond to the longest stop-the-world pause that occurred when running the program, not the median or some average.

The three SGen configurations are standard mark-and-sweep in blue, concurrent mark-and-sweep in green and concurrent mark-and-sweep in conjunction with cementing, in yellow.

The programs represented in this graph are those whose maximum pause times with Boehm were longer than 10ms. The ones with the longest pause times are on the left—on graph4 Boehm had a pause of 2.5s—which are those that benefit most from concurrent mark, as expected: the longest pause on graph4 is less than 180ms with concurrent mark. I have not yet investigated why graph8 is such an outlier.

For concurrent mark the roots not only include the usual suspects like thread stacks and global variables, but also nursery objects. In fact, the initial marking pause performs a nursery collection, after which only pinned objects will be left in the nursery. Those are treated as roots, or rather the references they contain are. ↩

Since we use a single card table for both kinds of references, this will increase the number of cards marked that don’t have references pertinent to nursery collections. The card scanning overhead will therefore increase, but in practice this doesn’t seem to be a problem. An alternative would be to use two separate card tables, one for references to nursery objects and the other for references to major heap objects. That would be more expensive both in terms of memory usage as well as in write barrier performance. ↩

This entails doing a full mark-scan loop, of course, which might still take a very long time in pathological cases. Imagine a program that keeps a huge linked list around, referenced only by a single object on the major heap. At some point concurrent mark kicks in. The program now loads the reference to the list into a local variable and nulls the reference on the heap. The garbage collector will not be able to mark the list until it scans the stack in the final marking pause, and will now have to traverse the list with the world stopped. ↩

Introduction

In previous installments of my blog post series on SGen, Mono’s generational garbage collector, I wrote about the major and minor heaps and their organizations. What I have for the most part neglected to pontificate on is how they interact with each other.

The advantage of generational garbage collectors comes from being able to collect the minor heap, also referred to as the nursery, very quickly because it doesn’t have to look at the objects in the major heap. There is a problem, however, which this illustration from my talk on SGen explains:

Let’s say we collect the nursery, starting from the single root reference. We find a graph of four live objects, colored blue, and promote them to the major heap, discarding all the others, including the three white ones. But wait—there’s a reference in some object on the major heap pointing to one of the objects we have just destroyed! This cannot stand.

Where did this reference come from? Consider the previous nursery collection. We’ll ignore pinning for now, so after that collection the nursery was empty. The mutator (that’s what garbage collector people affectionately call the pesky thing that insists on messing with the heap and that most people refer to as the program) allocated that white object, and stored a reference to it in an object on the major heap—in that order, but not necessarily in direct succession.

One way to handle those references is to scan the whole major heap for them and to treat them as roots. That would make nursery collections as slow as major collections, so it’s not a good idea. What’s needed is a way to record all such references happening between nursery collections. The nursery collector would then process them like ordinary roots and everything would be in order again.

Write barriers

Write barriers[1] do exactly that: On every store of a reference in an object they check whether the reference resides in the major heap and points to the nursery. If those conditions are satisfied, the reference is recorded for the benefit of the next nursery collection. This sounds like an awful overhead for each store, but it really isn’t, and for almost all applications it’s far outweighed by the advantages from generational collection.[2]

A note on terminology: The literature sometimes uses “write barrier” to refer to just the code that identifies references to remember, and sometimes to refer to both the code that identifies as well as records those references. I shall be using the term in the latter sense.

SGen’s write barrier is a so-called card table. We also implement an alternative, a sequential store buffer, which I will not discuss here.

Card tables

The basic idea is to conceptually divide the major heap into small, equally sized regions, called cards, and to have one bit per card that is set—the card is marked—when a reference to a nursery object is written to it.[3]

The advantages of this scheme are that it requires only a handful of instructions to record a store, so it can be inlined, and that its space overhead is small and proportional to the heap size. The disadvantage is that it doesn’t record the exact locations of the references we’re interested in.

In SGen’s case, cards are 512 bytes large, so the nursery collector needs to scan roughly that amount of memory to potentially locate a single reference.[4] The references it does find it treats like ordinary roots: the referenced objects are promoted to the major heap and the references are updated to reflect their new locations. The newly promoted objects are then scanned in turn, of course.

Once all marked cards have been scanned there are no more references from the major heap to the nursery, so all cards mark bits are cleared, to be ready for the next round.

Pinned objects

So far we have been assuming that when we encounter a reference from the major heap to the nursery, we can always move the nursery object to the major heap, thereby rendering the bothersome reference harmless.

As I explained in the blog post on the nursery, sometimes objects become pinned, which implies that they cannot be moved. This threatens to confound our cunning plan: we can only clear the card mark bits because there are no references left that we need to keep track of. The solution here is simple: Cards containing references to pinned objects must stay marked so that they will be processed again during the next nursery collection, in the hopes that the object referred to will no longer be pinned.

This can lead to another potential problem: Assume that some nursery object remains pinned, and that there are lots of objects in the major heap that refer to it.[5] Lots of cards would remain perpetually marked and nursery collections would become dominated by card scanning. That is in fact what we noticed on two benchmarks in our small GC benchmark suite—nursery collections took much longer than expected and the programs took twice as long or more to run with SGen than with Boehm.

Cementing

I just implemented a new feature—not yet pushed to master—that solves this new problem in a very simple way. I call it “cementing”, because it institutionalizes bothersome pinned objects: if we encounter a certain number of references from the major heap to a particular pinned nursery object, we just give up on it. Instead of hoping that it will become unpinned soon, we cement it—the punishment for badly behaved pinned objects. A cemented object will always be pinned at the start of future nursery collections as a matter of principle. The consequences of this are that it will be treated as a root, and that it will not be moved, which in turn means that we don’t have to worry about references to it from the major heap any more: the object is kept alive and the references can stay the same. Cards with references to cemented objects can be cleared, reducing the card scanning load.

Major collections give us a blank slate because we need to scan all live major heap objects anyway, so we use them as opportunities to reset cementing, giving those poor objects another chance after all.

The preliminary results from cementing are encouraging. It doesn’t seem to introduce performance regressions, and for the two benchmarks affected, it works wonders:

The red bars represent the wall clock run time with Boehm. The green bars are SGen run times without cementing, and blue are SGen with cementing.

There is at least one other meaning for the term “write barrier”, namely a machine instruction (or series of instructions) that ensures that all stores preceding it are seen by other threads before the stores following it. Those are not the barriers we’re looking for. ↩

An alternative way of solving the reference problem is a read barrier. Read barriers make it possible to write moving concurrent collectors, a magnificent example of which is Azul’s C4 collector. If you want a glimpse of the future of garbage collection, read the paper. ↩

SGen uses a whole byte per card for marking. Not only can whole bytes be written easier than single bits, but multiple threads can do so without conflicts between contiguous ones. If we used individual bits, we’d have to use CAS to avoid losing marks by other threads. ↩

Nitpickers will notice that a marked card might not even contain a single reference to a nursery object—the mutator can write a reference, hence mark the card, and then later overwrite it with null, for example. ↩

A simple way to trigger this is to allocate some object high up on the call stack, keep a reference to it on the stack, and then to construct a long list, with every node containing a reference to said object. Keep the list around and its nodes will eventually be promoted, staining the major heap with perpetually marked cards. ↩

Introduction

SGen is Mono’s generational garbage collector. DTrace is a programmable dynamic tracing framework and is available for OS X and Solaris, among others.

DTrace provides an interface for applications to define their own probes which can then be observed, and actions triggered. We have recently added a few probes to SGen and I’d like to show a few simple things that can be done with them.

The program I’m using to demonstrate DTrace here is IronPython, running pystone. SGen’s DTrace probes are currently only available on OS X, on mono master.

The graph for 0 shows collection times for the nursery, the graph for 1 shows times for the major collections. We can see here that the majority of nursery collections took about half a millisecond, with 4 outliers taking around 10 ms. There were 3 major collections, two of which took around 16 ms and the third one barely took any time at all.

Apple’s Instruments profiling application gathers data using DTrace and can be extended with custom DTrace scripts. Here’s what a visualization of SGen collector pause times looks like:

The dots lower down in the scatter diagram are likely nursery collections, the higher up ones major collections.

GC lock

Not counting garbage collections, how often and for how long is the GC lock held?

SGen takes the lock quite often, but apart from one outlier at about 1 millisecond, it’s released very quickly. Further analysis (not presented here) shows that the outlier is a result of thread-pool initialization during startup, when a few objects are allocated pinned, requiring allocating some memory from the operating system.

SGen has probes for five different kinds of object allocation, hence the repetition. User-space pointers to the namespace and class name are passed in the third and fourth argument, respectively, so we concatenate those with a dot in the middle to give the full class name.

Whenever possible SGen uses the managed allocator – a small managed piece of code to quickly allocate objects in the nursery, which handles the most common case. The managed allocator is not instrumented with a DTrace probe, however, so for this script to function properly the managed allocator must be turned off via the MONO_GC_DEBUG environment variable.

This looks a bit suspicious. We had about 100 nursery collections, so according to the object counts above there were on average 13 object arrays and 14 string arrays pinned per nursery collection, in addition to lots of other objects. All of those supposedly amount to only about 4 kilobytes of memory. Are we counting something wrong or are these very short arrays? Let’s see:

Apparently those arrays really are very short, all around 16 to 32 bytes long. But maybe the probe is lying to us about the object size? If we know about Mono’s internal object layout we can read out the array length directly from the heap1:

Each of them gives us the distribution of the array lengths for array objects of specific sizes. If everything is correct then for each array object size there must only be one array length, which is exactly what we get. For example, all 481 array objects of size 20 bytes have a length of one.

In fact, the numbers show that the sizes we get are exactly what we should expect given Mono’s object layout on 32-bits: A reference array with a given length has a size of 16 + 4*n bytes.

Footnotes:

Introduction

In this installment of my series on SGen, Mono’s new garbage collector, we shall be looking at how finalizers and weak references are implemented, and why you (almost certainly) should not use finalizers.

Tracking object lifetime

Both finalizers and weak references need to track the lifetime of certain objects in order to take an action when those objects become unreachable. To that end SGen keeps lists of finalizable objects and weak references which it checks against at the end of every collection.

If the object referred to by a weak reference has become unreachable, the weak reference is nulled.

If a finalizable object is deemed unreachable by the collector, it is put onto the finalization queue and it is marked, since it must be kept alive until the finalizer has run. Of course, all objects that it references have to be marked as well, so the main collection loop is activated again.

Generations

A nursery collection cannot collect an object in the major heap, so it is not necessary to check the status of old objects after a nursery collection. That is why SGen keeps separate finalization and weak reference lists for the nursery and major heaps.

Invoking finalizers

SGen uses a dedicated thread for invoking finalizers. The finalization queue is processed one object at a time. As long as an object is in the finalization queue it is also considered live, i.e. the finalization queue is a GC root.

Resurrection

Resurrecting an object means making it reachable again from within its finalizer or a finalizer that can still reach the object (or via a tracking weak reference). The garbage collector does not have to treat this case specially—until the finalizer has run the object is considered live by virtue of its being in the finalization queue, and afterwards it is live because it is reachable through some other root(s).

Tracking weak references

If an object is weakly referenced and has a finalizer, the weak reference will be nulled during the same collection as the finalizer is put in the finalization queue. That is not always desirable, especially for objects that might be resurrected.

Tracking references solve the problem by keeping the reference intact at least until the finalizer has run. Once the finalizer has finished, a tracking reference acts like a standard weak reference, i.e. it will be nulled once the object becomes unreachable, typically during the next collection (unless the object was resurrected).

When SGen encounters a tracking reference, instead of nulling it, it turns it into a non-tracking reference. The referenced object is now on the finalization queue and therefore considered live again, so the reference will not be nulled before the finalizer has run.

Why finalization is Evil

Finalization should not be used to manage scarce resources, such as file descriptors.

Time of finalization is not determined

Unless you force a garbage collection and wait for the finalizers to finish, the time at which they are run is not determined. If your program does not allocate a lot of memory, or your heap is huge, it might take a very long time until a major collection is triggered, so dead objects on the major heap might not be finalized in time before your scarce resource is depleted.

In this highly recommended talk (starting at 8:35) Cliff Click gives an example where it was necessary to put a hack into the JVM garbage collector that triggered a collection whenever the system ran out of file handles, because Apache Tomcat relied on finalizers to reclaim them.

Finalizers run one after another

A finalizer that performs a time consuming task will delay the execution of other finalizers. An especially worrisome case is finalizers that do potentially blocking I/O. Depending on the timeout of the operation, other finalizers can be blocked for a long time.

Finalizers make objects live longer

Since finalizable objects are considered live until their finalizers have run, they also consume memory until their finalizers have run and the next garbage collection is triggered. This not only applies to the finalizable objects themselves but to all objects they reference as well. This is particularly problematic if finalization is blocked by an ill-behaved finalizer.

Order of finalization is not determined

A finalizer cannot count on other finalizers running before or after it, irrespective of whether it references or is referenced by those objects or not. In other words, a finalizer cannot assume that an object it references has not been finalized yet. That is true even if that object is known to be strongly held by a GC root, namely when shutting down.

An exception to this are critical finalizers, which are run after “normal” finalizers, but between normal finalizers there is no determined order, neither between critical finalizers.

[Update]

It turns out that SGen’s handling of tracking references, as described above, is not only conceptually wrong, it was also incorrectly implemented, which, ironically, mostly fixed the bug in the design.

Our conceptual mistake was to demote tracking references to non-tracking ones the first time the object became unreachable, which would have led to the reference being non-tracking even if its target was re-registered for finalization. The bug in the implementation was that we didn’t actually do that. Instead, SGen would keep tracking references around until their targets became really, truly unreachable, even via finalization, i.e. until it was done finalizing, without any further chance of resurrection. Only at that point would it turn the tracking reference into a non-tracking one, thereby making the object alive once again and keeping it around for one more garbage collection cycle.

The bug is now fixed. Thanks to Alan for noticing this case and investigating.

Introduction

In the last post of my series on SGen, Mono’s new garbage collector, I talked about the young generation, called the nursery, and its collector. In this post I shall talk about the old, or major, generation and the two different collectors that implement it.

The older of the two, the copying collector, is mostly used for reference nowadays. It has been supplanted by the Mark-and-Sweep collector in practical use because it performs better in most cases and still has room for improvement.

The Mono man page gives details on how to specify which collector to use and how to change their parameters with the environment variable MONO_GC_PARAMS.

Copying

The copying major collector is a very simple copying garbage collector. Instead of using one big consecutive region of memory, however, it uses fixed-size blocks (of 128 KB) that are allocated and freed on demand. Within those blocks allocation is done by pointer bumping.

At a major collection all objects for which this is possible are copied to newly allocated blocks. Pinned objects clearly cannot be copied, so they stay in place. All blocks that have been vacated completely are freed. In the ones that still contain pinned objects we zero the regions that are now empty. We don’t actually reuse that space in those blocks but just keep them around until all their objects have become unpinned. This is clearly an area where things can be improved.

Why two collectors?

The copying collector was implemented when SGen was still young. It shares most of the copying machinery with the nursery collector, so it was easy to implement. As a production collector, however, it is not suited for most workloads. The old generation is expected to be large and composed of objects that have a long life, so copying them is bad from the perspective of memory usage (it might double for the duration of the collection) as well as of cache behavior.

Mark-and-Sweep

Mark-and-Sweep is SGen’s default major collector and is actively being developed and improved.

In contrast to the copying collector a mark-and-sweep collector has to deal with individual objects being freed and new objects filling up that space again. (This is different from pinned objects leaving holes in blocks for the copying collector because in that case it is expected that there will only be a tiny number of remaining pinned objects, so the free regions will be large, whereas here it is expected that many objects survive, leaving lots of small holes). This creates the problem of fragmentation – lots of small holes between objects that cannot be filled anymore, resulting in lots of wasted space.

Blocks

The basic idea of SGen’s Mark-and-Sweep collector is to have fixed-size blocks, each of which is divided into equally sized object slots of a variety of different sizes – this is similar to what Boehm does. One advantage of this over allocating objects sequentially is that the holes are always the same size, so if there are new objects arriving at that size the holes can always be filled.

Of course it is not feasible to support blocks with slots of every conceivable object size, so we space out the different supported sizes so as not to waste too much space (in the future we might want to choose the sizes dynamically, to fit the workload). Another thing that works to our advantage is that, since the slots have a known size, we don’t have to keep tabs on where they start and end. All we need is one mark bit per object which we keep in a separate region for better memory locality (we could use unused bits in the first object word, like the nursery collector does). For faster processing we’re actually keeping one bit per potential object start address, i.e. 1 bit per 8 bytes.

Since we only have a limited number of different slot sizes, we can also keep free lists, so allocation is more efficient because it doesn’t involve search.

During garbage collection it is necessary to get from an object to its block in order to access its mark bit. The size of each block is 16 KB, and they are allocated on 16 KB boundaries, so to get from an object that is inside a block to the address of its containing block a simple logical AND is sufficient. Unfortunately, large objects live outside of blocks and for them this operation would yield a bogus block address. To determine whether an object is large we need to examine its class metadata which is referenced from its VTable, requiring three loads (the “fixed heap” variant of the collector solves this problem and is described below.)

Each block has a “block info” data structure that contains metadata about the block. Apart from the size of the object slots and several other pieces of data it also contains the mark bits.

Fixed Heap

In the standard Mark-and-Sweep collector the blocks are allocated on demand, so they are potentially scattered all over the process’s address space. The fixed heap variant, on the other hand, allocates (using mmap) a heap of a fixed size (excluding the large object space) on startup and then assigns blocks out of that space. The advantage here is that we can check whether a pointer is within a block simply by checking whether it falls within the bounds of the allocated space, so no loads are necessary to get to its mark bit.

In standard Mark-and-Sweep the first word of each block links to its block info. Fixed heap can forego this load as well because the block infos are also allocated sequentially, so all that is needed to get to a block info is its index which is the same as its block’s index, which can be determined from the block address.

The obvious disadvantage of fixed heap Mark-and-Sweep is that the heap cannot grow beyond its fixed size. On some workloads fixed heap yields significant performance improvements. I will try to give some benchmarking results in a later post.

Non-Reference Objects

Objects that don’t contain references don’t have to be scanned, which implies that they also don’t have to be put on the gray stack. To handle this special case more efficiently Mark-and-Sweep uses separate blocks for objects with versus without references. The block info contains a flag that says which type the block is.

For the fixed heap collector this separation means that we don’t even have to load a single word of an object or its metadata if it doesn’t contain references.

Evacuation

Even despite the segregation of blocks according to fixed object sizes one form of memory waste can still occur. If the workload shifts from having lots of objects of a specific size to having only a few of them, a lot of blocks for that object size will have very low occupancy – perhaps a single object will still be live in the block and no new objects will arrive to fill the empty slots.

Unlike Boehm, SGen has the liberty to move objects around. During the sweep stage of a collection we keep statistics on the occupancy of blocks of all slot sizes. If the occupancy for a given slot size falls below a certain (user definable) threshold then we mark that slot size for evacuation on the next major collection.

When blocks of a slot size are being evacuated Mark-and-Sweep acts like a copying collector on the objects within those blocks. Instead of marking them they are copied to newly allocated blocks, filling them sequentially and emptying all the sparsely occupied ones which will then be freed during the following sweep.

Blocks with pinned objects are exempt from this whole process since they cannot be freed anyway.

Note that the occupancy situation might change between a sweep and the following major collection – the workload could change in the mean time and produce lots of live objects of that size which would all be copied, without actually saving any space. SGen cannot tell which objects will be live without doing the mark phase because that’s the mark phase’s job, so we take the risk of that happening.

As a side-note: Sun’s Garbage-First collector solves this problem by running the mark phase concurrently with the mutator, so it is actually able to tell at the start of a garbage collection pause which blocks will contain mostly garbage.

Parallel Mark

The mark phase can be parallelized with reasonable effort. Essentially, a number of threads each start with a set of root objects independently. Each thread has its own gray stack, but care must be taken that work is not repeated. Two threads can compete for marking the same object, and if they both put the object into their gray stacks, the object will be scanned twice, which is to be avoided.

Some care must also be taken with handling objects that still reside in the nursery. They have to be copied to the major heap and a forwarding pointer must be installed. The forwarding pointer can only be installed after the space on the major heap has been allocated (otherwise where would it point to?), so two threads might do that independently but only one will win out and the other will have done the allocation in vain.

It is not actually sufficient to just partition the set of root objects between the worker threads – some threads might end up having a far smaller share of the live object graph to work with. SGen currently uses a shared gray stack section buffer where worker threads deposit parts of their gray queues for the others to take.

Parallel mark achieves quite impressive performance gains on some workloads, but on most the gains are still somewhat disappointing. I suspect the work distribution needs improvement.

Concurrent Sweep

The sweep phase goes through all blocks in the system, resets the mark bits, zeroes the object slots that were freed and rebuilds the free lists.

None of those data structures are actually needed until the next nursery collection, so the sweep lends itself to be done in the background while the mutator is running, which is what Mark-and-Sweep can do optionally. The next nursery collection will in that case wait for the sweep thread to finish until it proceeds.

Introduction

This is the second part of a series of blog posts on SGen, Mono’s new garbage collector. In this installment we shall be taking a closer look at the first generation, also called the “nursery”.

Nursery

As the name implies, the nursery is the generation where objects are allocated, or, to be more poetic, born. In SGen, it is a contiguous region of memory that is allocated when SGen starts up, and it does not change size. The default size is 4 MB, but a different size can be specified via the environment variable MONO_GC_PARAMS when invoking Mono. See the man page for details.

Allocation

In a classic semi-space collector allocation is done in a linear fashion by pointer bumping, i.e. simply incrementing the pointer that indicates the start of the current semi-space’s region that is still unused by the amount of memory that is required for the new object.

In a multi-threaded environment like Mono using a single pointer would introduce additional costs because we would have to do at least a compare-and-swap to increment it, potentially looping because it might fail due to contention. The standard solution to this problem is to give each thread a little piece of the nursery exclusively from which it can bump-allocate without contention. Synchronization is only needed when a thread has filled up its piece and needs a new one, which should be far less frequent. These pieces of nursery are called “thread local allocation buffers”, or “TLABs” for short.

TLABs are assigned to threads lazily. The relevant pointers, most notably the “bumper” and the address of the TLAB’s end, are stored in thread local variables. Currently TLABs are fixed in size to 4 KB (actually that’s the maximum size – they might be smaller because the nursery can be fragmented). In the future we might dynamically adapt that size to the current situation. The less threads that are allocating objects, the larger the TLABs should be so that they need to be assigned less often. On the other hand, the more threads we allocate TLABs to the sooner we run out of nursery space, so we will collect prematurely. In that situation the TLABs should be smaller. Ideally a thread should get a TLAB whose size is proportional to its rate of allocation.

Calling from managed into unmanaged code involves doing a relatively costly transition. To avoid this the fast paths of the allocation functions are managed code. They only handle allocation from the TLAB. If the thread does not have a TLAB assigned yet or it is not of sufficient size to fulfill the allocation request we call the full, unmanaged allocation function.

Collection

Collecting the nursery is not much different in principle from collecting the major heap, so much of what I discuss here is also valid there.

Coloring

The collection process can be thought of as a coloring game. At the start of the collection all objects start out as white. First the collector goes through all the roots and marks those gray, signifying that they are reachable but their contents have not been processed yet. Each gray object is then scanned for further references. The objects found during that scanning are also colored gray if they are still white. When the collector has finished scanning a gray object it paints it black, meaning that object is reachable and fully processed. This is repeated until there are no gray objects left, at which point all white objects are known not to be reachable and can be disposed of whereas all black objects must be considered reachable.

One of the central data structures in this process, then, is the “gray set”, i.e. the set of all gray objects. In SGen it is implemented as a stack.

The roots

As I mentioned above as well as in my overview post, the collection starts with the roots. These include, most prominently, the references on the stacks of all managed threads and static class fields. In addition to that there are numerous other roots that the runtime registers and uses for internal purposes. Since SGen is a generational collector, the roots for the nursery collection also include all references from the major heap to objects in the nursery. These are kept track of with write barriers which I will discuss in a later blog post.

Forwarding

In a copying collector like SGen’s nursery collector, which copies reachable objects from the nursery to the major heap, coloring an object also implies copying it, which in turn requires that all references to it must be updated. To that end we must keep track of where each object has been copied to. The easiest way to do this is to use a forwarding pointer. In Mono, the first pointer-sized word of each object is a pointer to the object’s vtable, which is aligned to at least 4 bytes. This alignment leaves the two least significant bits for SGen to play with during a collection, so we use one of them to indicate whether that particular object is forwarded, i.e. has already been copied to the major heap. If it is set, the rest of the word doesn’t actually point to the vtable anymore, but to the object’s new location on the major heap. The second bit is used for pinning, which I will discuss further down.

The loop

Here is a simplified (pinning is not handled) pseudo-code implementation of the central loop for the nursery collector:

In line 2 we pop an object out of the gray stack and in line 3 we loop over all the references that it contains. refp is a pointer to the reference we’re currently looking at. In line 4 we fetch the actual reference. We are only interested in collecting the nursery, so if the reference already points to the old generation we skip it (lines 5 and 6). If the object is already forwarded (line 7) we fetch its new location (line 8). Otherwise we must copy it to the major heap. To that end we allocate the space required (line 10), do the actual copying (line 11) and forward the object (line 12). That newly copied object must also be processed, so we push it on the gray stack (line 13). Finally, for both cases, we update the reference to point to the new location of the object (line 15).

Pinning

In the overview post I mentioned that we currently cannot scan stack frames precisely, i.e. we do not know whether a word that looks like a reference just does so by coincidence or is actually one. Work for scanning managed stack frames precisely is under way, but for unmanaged frames this is not possible anyway because the C compiler does not provide us with the necessary information (some runtimes avoid this problem by not exposing managed references to unmanaged code directly).

Because we lack this knowledge we must proceed conservatively and assume that what looks like a reference is one, which means considering the object it points to as reachable. On the other hand, since we cannot be sure, we are not allowed to modify this supposed reference, because it might just be a number. That implies that we also cannot move the object. In other words the object is “pinned”.

Pinned objects are a bit of a complication. Having pinned objects in the nursery means that we cannot clean it out completely and so it becomes fragmented. It also requires an additional check in the central loop.

Finishing up

After all the copying is done what’s left is to clean up the nursery for the next round of allocations. SGen doesn’t actually zero the nursery memory at this point, however. Instead that happens at the point when TLABs are assigned to threads. The advantage is not only that it potentially gets distributed over more than one thread, but also that is has better cache behavior – the TLAB is likely to be touched again soon by that thread.

What has to happen at this point, though, is to take score of the regions of nursery memory that are available for allocation, which means everything apart from pinned objects. Since we have a list of all pinned objects this is done very efficiently – I will discuss this “pin queue” in a later post. It is these “fragments” from which TLABs will be assigned in the next mutator round.

I should mention that there is a bit of additional work to be done prior to fragment creation, namely making sure unreachable finalizable objects are finalized and weak references are zeroed if the objects they pointed to have become unreachable. Both are very similar issues and shall be discussed in a later post.

Introduction

SGen is Mono’s new garbage collector that we have been working on intensely for almost two years now and that has been becoming stable and competitive during the past few months.

In this series of blog posts I will try to explain how garbage collection works in general, how SGen works in particular, how to get the best performance out of it and, finally, what our plans are for the future.

This first post will give a very brief overview over what a garbage collector does and how it does it, before outlining the broad architecture of SGen.

Why a new garbage collector?

Since its inception Mono, like many other garbage collected runtimes, has been using the Boehm-Demers-Weiser collector, which I shall refer to as “Boehm” henceforth. Boehm’s main advantages are its portability, stability and the ease with which it can be embedded. It is designed as a garbage collector for C and C++, not for managed languages, however, so it does not come as a surprise that it falls short for that purpose compared to collectors dedicated to such languages or runtimes.

Our goal with SGen was to overcome Boehm’s limitations and provide better performance for managed applications, in terms of allocation speed, collector pauses as well as memory utilization. In this and the following posts of this series I will mention points where we improve upon Boehm whenever appropriate.

Garbage collection

Before digging into the details of SGen it seems prudent to discuss how garbage collection actually works. I will only discuss topics that are relevant to SGen, and even paint those only with very broad strokes. For more comprehensive overviews of garbage collection see “Uniprocessor Garbage Collection Techniques” by Wilson or, for more detailed information, Jones and Lins’s book.

The garbage collector, in short, is that part of the language’s or platform’s runtime that keeps a program from exhausting the computer’s memory, despite the lack of an explicit way to free objects that have been allocated. It does that by discovering which objects might still be reached by the program and getting rid of the rest. It considers as reachable objects those that are directly or indirectly referenced by the so-called “roots”, which are global variables, local variables on the stacks or in registers and other references explicitly registered as roots. In other words, all those objects are considered reachable, as well as those that they reference, and those that they reference, and so on. The pedantic will note that the program might not actually be able to reach all of those. Computing the minimum set of objects that the program could reach is impossible, however, which we know thanks to thisguy.

SGen, as well as Boehm, are “stop-the-world” collectors, meaning that they do their work while the actual program, slightingly called the “mutator” in garbage collection lingo, is stopped.

There are three classic garbage collection algorithms: Mark-and-sweep, Copying, and Compaction, of which only the first two are relevant to our discussion of SGen since we have not implemented a compacting collector and have no plans of doing so in the foreseeable future.

Mark-and-Sweep

Mark-and-sweep is the oldest and probably most widely implemented garbage collection algorithm. Boehm is a mark-and-sweep collector.

The idea is to have a “mark” bit on each object that is set on all reachable ones in the “mark” phase of the collection, by recursively traversing references starting from the root set. Of course in practice this is not actually implemented recursively, but by using a queue or stack.

The “sweep” phase then frees the memory occupied by the objects that were not marked and resets the mark bits on the marked ones.

Of course many variants of this algorithm exist that vary in details.

Copying

Traditional mark-and-sweep has two main weaknesses. First, it needs to visit all objects, including the unreachable ones. Second, it can suffer from memory fragmentation like a malloc/free allocator.

A copying collector addresses both problems by copying all reachable objects to a new memory region, allocating them linearly one after the other. The old memory region can then be discarded wholesale. The classic copying collector is the so called “semi-space” algorithm where the heap is divided into two halves. Allocation is done linearly from one half until it is full, at which point the collector copies the reachable objects to the second half. Allocation proceeds from there and at the next collection the now empty first half is used as the copying destination, the so-called “to-space”.

Since with a copying collector objects change their addresses it is obvious that the collector also has to update all the references to the objects that it moves. In Mono this is not always possible because we cannot in all cases unambiguously identify whether a value in memory (usually on the stack) is really a reference or just a number that points to an object by coincidence. I will discuss how we deal with this problem in a later installment of this series.

Generational garbage collection

It is observed that for most workloads the majority of objects either die very quickly or live for a very long period of time. One can take advantage of this so-called “generational hypothesis” by dividing the heap into two or more “generations” that are collected at different frequencies.

Objects begin their lives in the first generation, also referred to as the “nursery”. If they manage to stick around long enough they get “promoted” to the second generation, etc. Since all objects are born in the nursery it grows very quickly and needs to be collected often. If the generational hypothesis holds only a small fraction of those objects will make it to the next generation so it needs to be collected less frequently. Also, it is expected that while only a small fraction of the objects in the nursery will survive a collection, the percentage will be higher for older generations, so a collection algorithm that is better suited to high survival rates can be used for those. Some collectors go so far as to give objects “tenure” at some point, making them immortal so they don’t burden the collection anymore.

One difficulty with generational collection is that it’s not quite possible to collect a young generation without looking at the older generations at all, because the mutator might have stored a reference to a young generation object in an older generation object. Even if that young object is not referenced from anywhere else it must still be considered alive. Clearly scanning all objects in older generations for such references would defeat the whole purpose of separating them in the first place. To address this problem generational collectors make the mutator register all new references from old to young generations. This registration is referred to as a “write barrier” and will be discussed in a later installment. It is also possible to register reads instead of writes, with a “read barrier”, but this is impractical without hardware support. It’s obvious that the write barrier must be very efficient since it’s invoked for every write to a reference field or array member.

SGen

SGen, which historically stands for “Simple Generational”, is a generational garbage collector with two generations. The nursery is collected with a copying collector that immediately promotes all live objects, if possible, to the old generation (or “major heap”).

For the old generation we have two collectors that the user can choose between: A mark-and-sweep and a copying collector. The mark-and-sweep collector is available in four variants, with and without a parallel mark phase and with a dynamic or fixed-size heap.

In addition to that SGen has a separate space for large objects (larger than 8000 bytes), called the “Large Object Space”, or “LOS”, which is logically part of the old generation. Large objects are not born in the nursery but directly in the LOS, and they are not moved.

One major difficulty SGen has to deal with is objects that are “pinned”, which means that they cannot be moved. The reason for this is usually that they are referenced from stack frames that we cannot scan “precisely”, i.e. for which we do not know what is a reference and what is not. Work is currently under way to scan managed stack frames precisely, but we will always have to handle unmanaged stack frames, too, for which no type information is available and which can therefore only be scanned “conservatively”. More on this in one of the following posts.

Future installments

Having covered a few basic concepts of garbage collection I shall go into more detail about SGen in the blog posts following this one. Here is a tentative list of the posts I intend to write in the coming weeks. Comments and suggestions are of course welcome.

Allocating objects and the minor collector

How is the nursery organized, how are new objects allocated and how is the nursery collected?

The major collectors

An overview of SGen’s major collectors, mark-and-sweep and copying. I’ll discuss how they organize the heap and how they collect.