Friday, October 16, 2009

GC improvements

In the last week, I (Armin) have been taking some time off the
JIT work to improve our GCs. More precisely, our GCs now take
one or two words less for every object. This further reduces the
memory usage of PyPy, as we will show at the end.

Background information: RPython object model

We first need to understand the RPython object model as
implemented by our GCs and our C backend. (Note that the
object model of the Python interpreter is built on top of
that, but is more complicated -- e.g. Python-level objects
are much more flexible than RPython objects.)
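
The example snippet did not survive in this copy of the post; a small hypothetical reconstruction matching the description below (a class A with a field x and a method f, and a subclass B adding a field y) could look like this:

```python
# Hypothetical reconstruction of the example classes the diagrams
# below describe: A has a field x and a method f; B adds a field y.
class A(object):
    def __init__(self, x):
        self.x = x

    def f(self):
        return self.x + 1

class B(A):
    def __init__(self, x, y):
        A.__init__(self, x)
        self.y = y
```

An instance of B can be passed anywhere an instance of A is expected, and calling f on it dispatches through B's vtable.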

The instances of A and B look like this in memory (all cells
are one word):

An instance of A:

    GC header
    vtable ptr of A
    hash
    x

An instance of B:

    GC header
    vtable ptr of B
    hash
    x
    y

The first word, the GC header, describes the layout. One half of it
encodes the shape of the object, including where the object contains
further pointers, so that the GC can trace it. The other half
contains GC flags (e.g. the mark bit of a mark-and-sweep GC).

The second word is used for method dispatch. It is similar to a
C++ vtable pointer. It points to static data that is mostly a
table of methods (as function pointers), containing e.g. the method f
of the example.

The hash field is not necessarily there; it is only present in classes
whose hash is ever taken in the RPython program (which includes being
keys in a dictionary). It is an "identity hash": it works like
object.__hash__() in Python, but it cannot just be the address of
the object in case of a GC that moves objects around.
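
The Python-level behaviour of the identity hash can be checked directly (a small illustration, not PyPy-specific code):

```python
# The default hash of a plain object is its identity hash,
# i.e. exactly what object.__hash__() returns.
class Node(object):
    pass

n = Node()
assert hash(n) == object.__hash__(n)

# The identity hash is stable, so the object works as a dict key.
d = {n: "value"}
assert d[n] == "value"
```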

Finally, the x and y fields are, obviously, used to store the
values of the fields. Note that instances of B can be used in
places that expect a pointer to an instance of A.

Unifying the vtable ptr with the GC header

The first idea for saving a word in every object comes from the
observation that both the vtable ptr and the GC header store
information about the class of the object. Therefore it is natural
to try to keep only one of them. The problem is that we still need
bits for the GC flags, so the field we have to remove is the
vtable pointer.

This means that method dispatch needs to be more clever: it
cannot directly read the vtable ptr, but needs to compute it
from the half-word of the GC header. Fortunately, this can be
done with no extra instruction at the assembler level. Here is
how things look in the end, assuming a 32-bit x86
machine (but note that, as usual, we just generate portable C).

The trick for achieving efficiency is that we store all
vtables together in memory, and make sure that they don't take
more than 256 KB in total (16 bits, plus 2 bits of alignment).
Here is what the assembler code (produced by the normal C
compiler, e.g. gcc) for calling a method looks like. Before
the change:
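
(The original listings did not survive in this copy; the sketch below is a plausible reconstruction, with vtable_start and method_offset standing for compile-time constants.)

```
MOV EDX, [EAX+4]                 ; load the vtable ptr from object EAX
MOV EDX, [EDX+method_offset]     ; load the function ptr from the vtable
CALL EDX
```

After the change:

```
MOVZX EDX, WORD PTR [EAX]                     ; load the 16-bit half of the GC header
MOV EDX, [vtable_start+4*EDX+method_offset]   ; fetch the method from the packed vtables
CALL EDX
```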

Note that the complex addressing scheme done by the second MOV
is still just one instruction: the vtable_start and
method_offset are constants, so they are combined. And as the
vtables are anyway aligned at a word boundary, we can use
4*EDX to address them, giving us 256 KB instead of just 64 KB
of vtables.
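
The same idea can be modelled at a high level. Here is a toy Python sketch (hypothetical names, nothing from PyPy's actual sources) of keeping a small class id in the header and dispatching through one packed table:

```python
# All vtables live in one packed list; an object stores only a small
# index (a "class id" that fits in 16 bits) instead of a full
# vtable pointer.
vtable_area = []

def register_class(methods):
    vtable_area.append(methods)
    return len(vtable_area) - 1      # the class id

A_ID = register_class({'f': lambda obj: obj['x'] + 1})
B_ID = register_class({'f': lambda obj: obj['x'] * 2})

def call_method(obj, name):
    # plays the role of: MOV EDX, [vtable_start + 4*EDX + method_offset]
    return vtable_area[obj['class_id']][name](obj)

a = {'class_id': A_ID, 'x': 10}
b = {'class_id': B_ID, 'x': 10}
assert call_method(a, 'f') == 11
assert call_method(b, 'f') == 20
```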

Optimizing the hash field

In PyPy's Python interpreter, all application-level objects
are represented as an instance of some subclass of W_Root.
Since all of these objects could potentially be stored in a
dictionary by the application Python program, all these
objects need a hash field. Of course, in practice, only a
fraction of all objects in a Python program end up having
their hash ever taken. Thus this field of W_Root is wasted
memory most of the time.

(Up to now, we had a hack in place to omit the hash field on a few
classes like W_IntegerObject, but it meant that the Python
expression object.__hash__(42) would raise a TypeError on PyPy.)

The solution we have now implemented (also used by some Java GCs,
among others) is to add a hash field to an object only when the
(identity) hash of that object is actually taken. This means
that we had to enhance our GCs to support this. When objects
are allocated, we don't reserve any space for the hash:

object at 0x74B028:

    GC header: ...00...
    x
    y

When the hash of an object is taken, we use its current memory
address, and set a flag in the GC header saying that this
particular object needs a hash:

object at 0x74B028:

    GC header: ...01...
    x
    y

If the GC needs to move the object to another memory location,
it will make the new version of the object bigger, i.e. it
will also allocate space for the hash field:

object at 0x825F60:

    GC header: ...11...
    x
    y
    hash: 0x74B028

This hash field is immediately initialized with the old memory
address, which is the hash value that we have returned so far for
the object. To avoid disturbing the layout of the object, we
always put the extra hash field at the end. Of course, once set,
the hash value does not change even if the object needs to
move again.
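
The state transitions above can be modelled in a few lines of Python (a toy sketch with hypothetical names; the real work happens inside the GC, at the level of raw memory):

```python
# Toy model of the lazy identity-hash scheme described above.
FLAG_HASH_TAKEN = 0x1   # hash was requested at the current address
FLAG_HASH_FIELD = 0x2   # object carries an extra trailing hash field

class GCObject:
    def __init__(self, address):
        self.address = address   # stands in for the object's location
        self.flags = 0
        self.hash_field = None   # meaningful only when FLAG_HASH_FIELD is set

def identity_hash(obj):
    if obj.flags & FLAG_HASH_FIELD:
        return obj.hash_field          # hash already materialized
    obj.flags |= FLAG_HASH_TAKEN       # materialize it on the next move
    return obj.address                 # the current address serves as hash

def move(obj, new_address):
    # When a hashed object moves, grow it by one word and freeze the hash.
    if obj.flags & FLAG_HASH_TAKEN and not obj.flags & FLAG_HASH_FIELD:
        obj.hash_field = obj.address   # old address becomes the permanent hash
        obj.flags |= FLAG_HASH_FIELD
    obj.address = new_address

obj = GCObject(0x74B028)
h1 = identity_hash(obj)      # the current address
move(obj, 0x825F60)          # the hash field is materialized during the move
h2 = identity_hash(obj)
assert h1 == h2 == 0x74B028  # the hash is stable across moves
```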

Results

Running the following program on PyPy's Python interpreter
with n=4000000:
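
The program itself was lost from this copy of the post; a hypothetical reconstruction matching the description (allocate n small instances and keep them all alive) might be:

```python
# Hypothetical reconstruction: the original benchmark listing was lost.
# It allocates n small objects and keeps them all alive.
class A(object):
    def __init__(self, i):
        self.i = i

def main(n):
    lst = [A(i) for i in range(n)]
    return len(lst)

if __name__ == '__main__':
    main(4000000)
```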

The total amount of RAM used on a 32-bit Linux is 247 MB,
completing in 10.3 seconds. On CPython, it consumes 684 MB
and takes 89 seconds to complete... This nicely shows that
our GCs are much faster at allocating objects, and that our
objects can be much smaller than CPython's.

Armin Rigo & Carl Friedrich Bolz

31 comments:

Not really GC related and you may have covered this in another post, but how does PyPy handle id() in a world where the object may move? Is the hash field reused for this when necessary as well? If so, how do you deal with the possibility of another object being allocated at the same address as the original object? If not, how do you avoid having an object's id() change when it's moved?

kbob: If PyPy is anything like CPython the randomness isn't so important. The CPython dictionary hash collision resolution strategy is extremely efficient, even amongst hashes with very similar values.

Shams: Excellent question! The implementation of id that we have is basically a weak key dict mapping objects to ids on demand. This has the fun side-effect that the ids of PyPy's objects start at 1 and count up from there.

This is rather inefficient (e.g. your garbage collections become linearly slower the more objects you have that have their id taken), but there is not much else you can do. Jython uses a similar solution. For this reason, calling id a lot is essentially discouraged in code you want to run on PyPy.
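
That strategy can be sketched in ordinary Python (a hypothetical helper, not PyPy's actual implementation):

```python
import itertools
import weakref

# Sketch of an id() built on a weak key dict: ids are handed out on
# demand, starting at 1, and entries disappear when objects die.
_id_map = weakref.WeakKeyDictionary()
_next_id = itertools.count(1)

def demand_id(obj):
    try:
        return _id_map[obj]
    except KeyError:
        _id_map[obj] = new_id = next(_next_id)
        return new_id

class X(object):
    pass

a, b = X(), X()
assert demand_id(a) == 1
assert demand_id(b) == 2
assert demand_id(a) == 1     # stable across calls
```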

kbob: I think they should be random enough. You get a collision if you ask the hash of object a, then a collection happens that moves a, then you ask object b for its hash and object b happens to be in the place where object a was before. That sounds unlikely.

If you write contrived code that has a loop that repeatedly allocates an object, asks for its hash by putting it into a dict and then forces a nursery collection, you can get collisions: all those objects will be at the beginning of the nursery when their hash is taken. Again, this is unlikely to occur in practice.

Alex: you are right. We use exactly CPython's algorithm for implementing dicts, so having bad hash functions is not a big problem. However, if you really have hash value collisions (e.g. everything hashes to 1) your dict still degenerates to essentially a linear search.

Whenever I talk to other people about your project, I always state you are the best example I can imagine of REAL innovation in computer languages.

That said, I gather the only thing making id() different from hash() is that you need to guarantee that the values for live objects are always unique.

You could just use the same strategy as with the hash, sticking the id value along the object the next time the object is moved by the GC.

Meanwhile, from the time id() is called to the time the object is moved, you can just temporarily store an {address: id} mapping somewhere. Entries would be removed from the map once the objects get moved. From then on the id would be attached to the object.

If GC cycles are frequent, the map doesn't have to grow too large.

I don't know if the need for id reuse after the id space gets exhausted is important or not. Once you get to the end of the space, you would have to scan the map and heap to find a convenient "hole" to reuse, I suppose.

Thanks, Carl. Following up what Skandalfo said, (although this is probably a poor forum for such discussions), it seems like you could reuse the hash field for id as well. Given that the minimum size for a Python object is > 1 byte, you should have at least that much space for offsetting the hash/id. As the GC/allocator has to store information about addresses and blocks anyway it should be a relatively simple matter of building and maintaining a bloom filter of offsets in use for a particular base address.

Of course, this also constraints the addresses at which Python objects may be allocated and the lower bits in the address may already be used for other purposes...

Skandalfo, Shahms: I guess there are possible ways to make id a bit faster than what we have now. What we have now is well-tested and works reasonably well. I assume anyway that there is not too much Python code whose performance depends critically on having an extremely efficient implementation of id (and if there is, I am prepared to ask the author to rewrite the code instead :-) ).

Shahms: I confess I don't understand your proposal. Do you mean you can have at most as many live objects as the available address space divided by the object alignment?

When I talked about id space I wasn't referring to the memory required to store the per-object id value, but the fact that if you assign the id values using sequential values, and those values are, for instance, 64 bit integers, you could theoretically create and destroy a lot of objects in a long lived process and the sequence would wrap around.

About making hash/id the same, I've just checked that CPython does indeed use the id() value as the value returned by the default hash() implementation.

You could just do the same, and use the id value as the "master" one. For hash() you would just call id(). This allows you to use just one value attached to the objects for both functions.

The cost of that approach would be having to assign an id immediately (having to put it into the temporary map, then having to look it up in the map until the next time the object is moved) for the call to hash() (with no call to id()) too.

The good thing compared to the weak key dict, is that the temporary map doesn't need to be garbage collected at all. The entries are removed when objects are moved (or collected).

Carl, no doubt you're right. I know that I can probably count the number of times I've needed to use id() on one hand and I'm pretty sure the vast majority of those cases was sticking an unhashable object in a dict.

Probably Guido should have refrained from making it available in CPython at the time. I suppose it was just easy to add it to the language with the memory allocation model of CPython. The fact is that I don't really see any use for id() once you have the "is" operator and the hash() method...

Too bad for the current implementation of pickle and deepcopy. The fault in that case is CPython's general view that id() is cheap, despite repeated attempts to convince them otherwise (these attempts have been made notably by guys from Jython, even before PyPy's time; indeed, id() is a mess for any implementation apart from CPython's simple non-moving GC).

A suitable replacement would be e.g. a 'collections.identitydict' type, if someone feels like going Yet Another Time to python-dev with this issue.
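
Such a type can be approximated in pure Python today (a sketch; collections.identitydict does not actually exist):

```python
# Sketch of an identity-keyed dict: keys are compared with "is" and
# hashed with the identity hash, so unhashable objects work too.
class IdentityDict(object):
    class _Key(object):
        __slots__ = ('obj',)
        def __init__(self, obj):
            self.obj = obj
        def __hash__(self):
            return object.__hash__(self.obj)
        def __eq__(self, other):
            return self.obj is other.obj

    def __init__(self):
        self._d = {}

    def __setitem__(self, key, value):
        self._d[self._Key(key)] = value

    def __getitem__(self, key):
        return self._d[self._Key(key)]

    def __contains__(self, key):
        return self._Key(key) in self._d

lst = []                 # lists are unhashable as ordinary dict keys
d = IdentityDict()
d[lst] = "found"
assert lst in d
assert [] not in d       # an equal but distinct object is a different key
assert d[lst] == "found"
```

Unlike id(), this only forces the identity hash of its keys to be taken, which a moving GC can store compactly.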

Is there any possibility to translate pypy under OSX 10.6 as 32bit? Translation works but I get an "ValueError: bad marshal data" when running pypy-c. I assume that is due to the fact that I got a 64bit binary.

Wouldn't it free up the GC from all that burden if only a set of live ids were kept? (ie: no weak dict)

So, when you get an id() call, you check the object to see if there's a cached id (much like the hash hack) - if not, you generate a random (or sequential) unused id and store it both in the "live ids" set and in the object's structure, as done with hash values.

So, successive calls to id() would be as fast as in CPython, and garbage collection would be fast too (only an extra set deletion per object whose id was obtained).

In fact, this set could be implemented as a bit array with "free lists", which could be very very efficient, given that its size will be bound by the number of live objects.

But that can be patched - the weak key dict could still be used for those objects that haven't been collected yet. Since addition of the id would most likely happen in the nursery, or the first generation at most (big assumption), I don't think the dict would grow very big even under heavy id() usage.

I'm a bit astonished by your need to pack the vtables together within 256KB. How many bits do you need for mark-and-sweep marking or similar stuff? The usual solution I've seen is to use the low two bits of the vtable pointer for flags, and mask them off when reading the vtable pointer. Would that work here?

If that isn't enough, then you have to pack vtables together as you do (maybe in a bigger space if you can use more bits).

I can think of one place where I use a lot of id() calls, and that's in PEAK-Rules' generic function implementation, for indexing "is" tests.

For example, if you have a bunch of methods that test if "x is Something" (for different values of Something), then a dictionary of id()'s is used to identify which of these tests went off. While the total number of Somethings isn't likely to be high, the weakref dict in PyPy means that every 'x' the function is called with will end up burning memory and speed to hold an id forever.

While it's perhaps the case that I could avoid this by using a linear search (ugh) in cases where the number of Somethings is small, it's an example of a place where id() makes an operation neat, fast, and simple in regular Python.

Of course, if there were another way to put arbitrary (i.e possibly-unhashable, comparable only by identity) objects in a dictionary, and then determine whether a given object was one of them, that'd certainly be a suitable substitute.

Or, if PyPy offered a temp_id() that would simply let you *check* identity, without forcing the object to hold onto it, that'd work fine too. Say, if there was a has_id() that told you if an id() is outstanding for the object already, or a get_id() that returned None for an object whose id() had never been taken.

With an API like that, I could prevent memory/speed blowup by not having each call of the function adding more objects to PyPy's id() dict.

(Heck, perhaps such an API should be added across Python versions, i.e., to CPython and Jython as well.)