Garbage Collection for Python

Portable Garbage Collection

Summary

Reference cycles involving lists, tuples, instances, classes,
dictionaries, and functions are found. Instances with
__del__ methods are handled in a sane way. It is easy to
add GC support to new types. GC enabled Python is binary
compatible with regular Python.

Generational collection works (currently three generations).
The overhead measured by pybench is about 4 percent. Virtually
all extension modules should work unchanged (I had to modify new
and cPickle in the standard distribution). A new module called
gc is available for tuning the collector and setting debugging
options.

The collector should be portable across all platforms. The
patched version of Python passes all regression tests and runs
Grail, Idle and Sketch without problems.

The portable garbage collection has been included in Python since
version 2.0. It is enabled by default. Be happy.

Why do we need it?

The current version of Python uses reference counting to keep
track of allocated memory. Each object in Python has a
reference count which indicates how many objects are pointing
to it. When this reference count reaches zero the object is
freed. This works well for most programs. However, there is one
fundamental flaw with reference counting and it is due to
something called reference cycles. The simplest
example of a reference cycle is one object that refers to
itself. For example:

>>> l = []
>>> l.append(l)
>>> del l

The reference count for the list created is now one. However,
since it cannot not be reached from inside Python and cannot
possibly be used again, it should be considered garbage. In the
current version of Python, this list will never be freed.

Creating reference cycles is usually not good programming
practice and can almost always be avoided. However, sometimes
it is difficult to avoid creating reference cycles and other
times the programmer does not even realize it is happening. For
long running programs such as servers this is especially
troublesome. People do not want their servers to run out of
memory because reference counting failed to free unreachable
objects. For large programs it is difficult to find how
reference cycles are being created.

What about "traditional" Garbage Collection?

Traditional garbage collection (eg. mark and sweep or stop and
copy) usually works as follows:

Find the root objects of the system. These are
things like the global environment (like the __main__ module
in Python) and objects on the stack.

Search from these objects and find all objects reachable
from them. This objects are all "alive".

Free all other objects.

Unfortunately this approach cannot be used in the current
version of Python. Because of the way extension modules work,
Python can never fully determine the root set. If the root set
cannot be determined accurately we risk freeing objects still
referenced from somewhere. Even if extension modules were
designed differently, the is no portable way of finding what
objects are currently on the C stack. Also, reference counting
provides some nice benefits in terms of locality of memory
reference and finalizer semantics that Python programmers have
come to expect. What would be best is if we could find a way of
still using reference counting but also free reference cycles.

How does this Approach Work?

Conceptually, this approach does the opposite of traditional
garbage collection. Instead of trying to find all reachable
objects it tries to find unreachable objects. This is much
safer because if the algorithm fails we are no worse off than
with no garbage collection (except for the time and space
wasted).

Since we are still using reference counting, the garbage
collector only has to find reference cycles. The reference
counting will handle freeing other types of garbage. First we
observe that reference cycles can only be created by
container objects. These are objects which can hold
references to other objects. In Python lists, dictionaries,
instances, classes, and tuples are all examples of container
objects. Integers and strings are not containers. With this
observation we realize that non-container objects can be
ignored for the purposes of garbage collection. This is a
useful optimization because things like integers and strings
should be fast and small.

Our idea now is to keep track of all container objects.
There are several ways that this can be done but one of the
best is using doubly linked lists with the link fields inside
the objects structure. This allows objects to be quickly
inserted and removed from the set as well as not requiring
extra memory allocations. When a container is created it is
inserted into this set and when deleted it is removed.

Now that we have access to all the container objects, how
to we find reference cycles? First we to add another field to
container objects in addition to the two link pointers. We will
call this field gc_refs. Here are the steps to find
reference cycles:

For each container object, set gc_refs equal to
the object's reference count.

For each container object, find which container objects
it references and decrement the referenced container's
gc_refs field.

All container objects that now have a gc_refs
field greater than one are referenced from outside the set of
container objects. We cannot free these objects so we move
them to a different set.

Any objects referenced from the objects moved also cannot
be freed. We move them and all the objects reachable from
them too.

Objects left in our original set are referenced only by
objects within that set (ie. they are inaccessible from
Python and are garbage). We can now go about freeing these
objects.

The trouble with Finalizers

One problem with our grand scheme is the use of finalizers. In
Python these are __del__ methods on instances. With
reference counting finalizers work fine. When an object's
reference count goes to zero the finalizer is called before the
object is freed. This is straight forward and easily understood
by programmers.

With garbage collection, calling finalizers becomes a
tricker problem, especially in the face of reference cycles. If
two objects in a reference cycle both have finalizers, what
should be done? Which should be called first? After calling the
first finalizer the object cannot be freed as the second
finalizer still may access it.

Since there is no good solution to this problem, cycles
that are referenced from objects with finalizers are not freed.
Instead these objects are added to a global list of
uncollectable garbage. It should almost always be possible for
the program to be rewritten so that this does not occur. As a
last resort, the program can access the global list and free
cycles in a way that makes sense for application.

What are the costs?

As some people say, there is no such thing as a free lunch.
However, this form of garbage collection is fairly cheap. One
of the biggest costs is the three extra words of memory
required for each container object. There is also the overhead
of maintaining the set of containers. For the current version
of the collector this overhead accounts for about a 4% slowdown
according to
pybench.

The collector currently keeps track of three generations of
objects. By adjusting the collection parameters the time spent
collecting can be made as small as desired. For some
applications it may make sense to disable automatic collection
and explicitly call the collection routine. However, with the
default collection parameters, the time spent collecting does
not seem to be significant when running pybench. Obviously
applications that allocate container objects extensively will
cause more time to be spent collecting.

The current patch adds a new configure option to enable the
collector. Python with the collector is binary compatible with
standard Python. If the option is disabled there is no effect
on the workings of the Python interpreter.

How can I use it?

Just download a current version of Python. The garbage collector
has been included since version 2.0 and is enabled by default. If
you are using Python 1.5.2 there is an old patch available that may work. If you
are on Windows you can download a replacement python15.dll.

Boehm-Demers Conservative Garbage collection

This patch adds modifies Python 1.5.2
to use the Boehm-Demers conservative garbage collector. You
must apply this patch
first however. Reference counting is still used. The garbage
collector only frees memory that the reference counting does
not (ie. reference cycles). This seems to provide the best
performance. To use: