In a mark-sweep-compact garbage collection algorithm, you have to stop the world while relocating objects, because moving an object makes the reference graph inconsistent: you have to rewrite the value of every reference that points to the moved object.

But what if you had a hash table with the object ID as the key and the pointer as the value, and references pointed to that ID instead of the object's address? Then fixing up references would only require changing one value, and a pause would only be needed if an object were written to while it was being copied...
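A minimal sketch of that scheme, assuming a simple map from IDs to current locations (the class and method names here are mine, purely illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed scheme: references store an object ID,
// and a central table maps each ID to the object's current location.
class HandleTable {
    private final Map<Long, Object> idToObject = new HashMap<>();
    private long nextId = 0;

    long allocate(Object obj) {           // allocation: register the object, hand out its ID
        long id = nextId++;
        idToObject.put(id, obj);
        return id;
    }

    Object dereference(long id) {         // every access pays this extra lookup
        return idToObject.get(id);
    }

    void relocate(long id, Object copy) { // compaction: move the object, patch ONE entry;
        idToObject.put(id, copy);         // no reference in the heap needs rewriting
    }
}
```

Compaction then never has to walk the heap rewriting references; the cost moves into dereference(), which is the trade-off the answers below focus on.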

3 Answers

Updating references is not the only thing that requires a pause. The standard algorithms commonly grouped under "mark-sweep" all assume that the entire object graph remains unaltered while it is being marked. Correctly handling modifications (new objects created, references changed) requires rather tricky alternative algorithms, such as the tri-color marking algorithm. The umbrella term is "concurrent garbage collection".
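For reference, a hedged sketch of what tri-color marking computes (the data layout and names are mine, not from any real VM): objects are white (unseen), grey (seen, children not yet scanned) or black (fully scanned), and a concurrent collector must preserve the invariant that no black object points to a white one while the mutator runs:

```java
import java.util.*;

// Sketch of tri-color marking over an object graph given as adjacency lists.
// White = not yet reached, grey = on the worklist, black = fully scanned.
class TriColorMark {
    final Map<Integer, List<Integer>> refs; // object ID -> IDs it references

    TriColorMark(Map<Integer, List<Integer>> refs) { this.refs = refs; }

    Set<Integer> mark(Collection<Integer> roots) {
        Set<Integer> black = new HashSet<>();
        Deque<Integer> grey = new ArrayDeque<>(roots);
        while (!grey.isEmpty()) {
            int obj = grey.pop();
            if (!black.add(obj)) continue;            // already scanned
            for (int child : refs.getOrDefault(obj, List.of()))
                if (!black.contains(child)) grey.push(child);
        }
        return black;                                 // everything not marked is garbage
    }
}
```

In a stop-the-world collector this loop runs with the mutator paused; the concurrent variants the answer mentions have to cope with edges changing while the grey set is being drained.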

But yes, updating references after compaction also needs a pause. And yes, using indirection (e.g. via a persistent object ID and a hash table to real pointers) can greatly reduce the pausing. It might even be possible to make this part lock-free if one so desires. It would still be as tricky to get right as any low-level shared-memory concurrency, but there is no fundamental reason it wouldn't work.

However, it would have severe disadvantages. Aside from taking extra space (at least two extra words per object), it makes every dereference much more expensive. Even something as simple as reading a field now involves a full hash table lookup. I'd estimate the performance hit to be far worse than that of incremental tracing.

Well, we have a lot of memory today, so we could have, let's say, a 50 MB table, and the hash could be a simple modulo, so only one instruction...
– mrpyo, Apr 17 '14 at 22:06


@mrpyo fetching the size of the hash table, modulo operation, dereference from hash table offset to get the actual object pointer, dereference to the object itself. Plus possibly some register shuffling. We end up at 4+ instructions. Also, this scheme has problems concerning memory locality: Now, both the hash table and the data itself have to fit into the cache.
– amon, Apr 17 '14 at 22:10

@mrpyo You need one entry (object ID -> current address) per object, right? And regardless of how cheap the hash function is, you will have collisions and need to resolve them. Also what amon said.
– delnan, Apr 17 '14 at 22:11

@amon it's only a matter of time before CPUs have 50MB or more of cache :)
– Ӎσᶎ, Apr 18 '14 at 1:49


@Ӎσᶎ By the time we can put 50 MiB of transistors on a chip and still have latency low enough for it to work as L1 or L2 cache (L3 caches are already up to 15 MiB in size, but usually off-chip AFAIK and far worse latency than L1 and L2), we'll have accordingly massive quantities of main memory (and data to put in it). The table can't be fixed size, it must grow with the heap.
– delnan, Apr 18 '14 at 10:59

All problems in computer science can be solved by another level of indirection … except for the problem of too many layers of indirection

Your approach does not solve the problem of garbage collection so much as move it up one level. And at what cost! Now every memory access goes through another pointer dereference. We can't cache the resolved location, because the object might have been relocated in the meantime; we must always go through the object ID. In most systems this indirection is not acceptable, and stopping the world is assumed to have a lower total runtime cost.
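To make the caching problem concrete, here is a sketch under the same assumptions as the question (the class and field names are mine): relocation patches only the table entry, so any raw reference resolved earlier is stale:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: why a resolved pointer cannot be cached across a compaction,
// while the object ID itself stays valid throughout.
class StalePointerDemo {
    final Map<Long, int[]> table = new HashMap<>();

    // Compaction: copy the object elsewhere and patch the single table entry.
    void relocate(long id) {
        table.put(id, table.get(id).clone());
    }
}
```

Caching the result of table.get(id) before relocate() runs leaves the caller holding the old copy; only a fresh lookup through the ID is guaranteed current, which is exactly the per-access cost described above.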

I said your proposition only moves the problem; it does not solve it. The issue is the reuse of object IDs. The object IDs are now our equivalent of pointers, and there is only a finite number of them. It is conceivable (especially on a 32-bit system) that during the lifetime of your program more than INT_MAX objects will have been created, e.g. in a loop like

while (true) {
    Object garbage = new Object();
}

If we just increment the object ID for each object, we will run out of IDs at some point. Therefore we have to find out which IDs are still in use and which are free so that they can be reclaimed. Sound familiar? We are now back at square one.
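A hedged sketch of that corner (the counter and free list here are my own illustration): once fresh IDs run out, the allocator can only hand out IDs it knows are dead, which is the very liveness question the collector was supposed to answer:

```java
import java.util.ArrayDeque;

// Sketch: a naive ID allocator. Once nextId hits the limit, we can only hand
// out IDs that have been explicitly recycled -- i.e. we must already know
// which objects are dead. That is the garbage collection problem again.
class IdAllocator {
    private final ArrayDeque<Long> freeIds = new ArrayDeque<>();
    private long nextId = 0;
    private final long maxId;

    IdAllocator(long maxId) { this.maxId = maxId; }

    long allocate() {
        if (nextId < maxId) return nextId++;    // fresh IDs remain
        Long recycled = freeIds.poll();         // otherwise reuse a dead one
        if (recycled == null)
            throw new IllegalStateException("out of IDs: must find dead objects first");
        return recycled;
    }

    // Calling this correctly requires already knowing the object is unreachable.
    void free(long id) { freeIds.push(id); }
}
```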

One can presumably use IDs that are just 'large enough', say 256-bit bignums? I am not saying this idea is good overall, but you can almost certainly get around reusing IDs.
– Vality, Apr 18 '14 at 3:21

@Vality realistically yes – as far as we can see, that would get around the issue of ID reuse. But this is just another “640K ought to be enough for anybody” argument, and it doesn't actually solve the problem. A more catastrophic aspect is that the size of all objects (and of the hash table) would have to grow to accommodate these oversized pseudo-pointers, and that during the hash access we would need to compare this bigint to other IDs, which would probably hog multiple registers and take multiple instructions to complete (on 64-bit: 8×load, 4×compare, 3×and, which is a 5× increase over native ints).
– amon, Apr 18 '14 at 8:07

Yeah, you would run out of IDs after some time and would need to change all of them, which would require a pause. But possibly it would be a rare event...
– mrpyo, Apr 18 '14 at 9:22

@amon Very much agreed, all very good points there; it is far better to have a genuinely sustainable system, I agree. This is going to be unbearably slow whatever you do, so it is only of theoretical interest anyway. Personally I am not a big garbage collector fan anyway, however :P
– Vality, Apr 18 '14 at 11:19

@amon: there's more code in the world than just this that would go wrong once you wrap a 64 bit ID (584 years of nanoseconds, and you can probably arrange for memory allocation to take 1ns especially if you don't shard the global counter that spits out the IDs!). But sure, if you don't need to rely on that then you don't.
– Steve Jessop, Apr 18 '14 at 17:30

There is no error in your line of thought; you've just described something very close to how the original Java garbage collector worked.

The original Java virtual machine [6] and some Smalltalk virtual machines use indirect pointers, called handles in [6], to refer to objects. Handles allow easy relocation of objects during garbage collection since, with handles, there is only one direct pointer to each object: the one in its handle. All other references to the object indirect through the handle. In such handle-based memory systems, while object addresses change over the lifetime of objects and therefore cannot be used for hashing, handle addresses remain constant.

In Sun’s current implementation of the Java Virtual Machine, a reference to a class instance is a pointer to a handle that is itself a pair of pointers: one to a table containing the methods of the object and a pointer to the Class object that represents the type of the object, and the other to the memory allocated from the Java heap for the object data.
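A rough sketch of the handle layout both quotes describe, with field names of my own invention: the handle itself never moves, so every reference can safely point at it, while the data behind it is free to be relocated:

```java
// Illustrative model of a JVM-style handle (field names are mine): a fixed
// pair of pointers. References point at the Handle, which never moves;
// relocation only reassigns the data pointer inside it.
class Handle {
    final Object classAndMethods;  // stands in for the method table / Class pointer
    Object data;                   // stands in for the heap data pointer, patched on relocation

    Handle(Object classAndMethods, Object data) {
        this.classAndMethods = classAndMethods;
        this.data = data;
    }
}
```

Because every handle has the same size, handles can be allocated from a non-fragmenting pool, which is the point Steve Jessop's comment below makes.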

Presumably these handles weren't keys in a hashtable (as in the question), though? There's no need, just a structure containing a pointer. Then the handles are all the same size, so they can be allocated out of a heap allocator, which by its nature doesn't need internal compaction since it doesn't get fragmented. You might mourn that the large blocks used by that allocator cannot themselves be relocated, which can be solved by another level of indirection ;-)
– Steve Jessop, Apr 18 '14 at 17:37

@SteveJessop yes, there wasn't a hashtable in the gc implementation, though the value of the handle was also the value returned by Object.getHashCode()
– Pete Kirkham, Apr 18 '14 at 20:40