Saturday, October 26, 2013

If you've ever tried to run your programs with coverage.py under PyPy,
you've probably experienced some incredible slowness. Take this simple
program:

def f():
    return 1

def main():
    i = 10000000
    while i:
        i -= f()

main()

Running time coverage.py run test.py five times, and looking at the best
run, here's how PyPy 2.1 stacks up against CPython 2.7.5:

Python          Time       Normalized to CPython
CPython 2.7.5   3.879s     1.0x
PyPy 2.1        53.330s    13.7x slower

Totally ridiculous. I got turned onto this problem because on one of my
projects CPython takes about 1.5 minutes to run our test suite on the build
bot, but PyPy takes 8-10 minutes.

So I sat down to address it. And the results:

Python          Time       Normalized to CPython
CPython 2.7.5   3.879s     1.0x
PyPy 2.1        53.330s    13.7x slower
PyPy head       1.433s     2.7x faster

Not bad.

Technical details

So how'd we do it? Previously, using sys.settrace() (which coverage.py
uses under the hood) disabled the JIT. Except it didn't just disable the JIT,
it did it in a particularly insidious way — the JIT had no idea it was being
disabled!

Instead, every time PyPy discovered that one of your functions was a hotspot,
it would start tracing to observe what the program was doing, and right when it
was about to finish, coverage would run and cause the JIT to abort. Tracing
is a slow process that normally pays for itself by generating fast machine
code at the end. But we never actually got to the "generate fast machine
code" stage: we paid all the cost of tracing, then aborted, and reaped
none of the benefits.

To fix this, we adjusted some of the heuristics in the JIT to teach it how
sys.settrace(<tracefunc>) works. Previously the JIT treated the trace
function as an opaque call which receives the frame object, and couldn't
tell whether or not it messed with that frame. Now we let the JIT look
inside the <tracefunc> function, so it can see that coverage.py isn't
messing with the frame in any weird way: it just reads the line number and
file path out of it.
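To make that concrete, here is a sketch of what a coverage-style trace function looks like. The names below are illustrative, not coverage.py's actual internals; the point is that the function only reads the line number and file path from the frame it is given:

```python
import sys

# Set of (filename, line) pairs seen during tracing.
executed = set()

def tracer(frame, event, arg):
    # coverage.py-style: only *read* the frame, never mutate it.
    if event == "line":
        executed.add((frame.f_code.co_filename, frame.f_lineno))
    return tracer  # keep receiving line events in this frame

def demo():
    x = 1
    y = x + 1
    return y

sys.settrace(tracer)
demo()
sys.settrace(None)

print(len(executed))  # 3: one entry per executed line of demo()
```

Because the trace function is side-effect free with respect to the frame, the JIT can now prove that running it does not invalidate the code it is compiling.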

I asked several friends in the VM implementation and research field if they
were aware of any other research into making VMs stay fast when debugging tools
like coverage.py are running. No one I spoke to was aware of any (but I
didn't do a particularly exhaustive review of the literature, I just tweeted at
a few people), so I'm pleased to say that PyPy is quite possibly the first VM
to work on optimizing code in debugging mode! This is possible because of our
years spent investing in meta-tracing research.

Happy testing,
Alex

Wednesday, October 16, 2013

Hi all,

The sprint in London was a lot of fun and very fruitful. In the last
update on STM, Armin was working on improving and specializing the
automatic barrier placement. There is still a lot to do in that area,
but that work is merged now. Specializing and improving barrier placement
is still to be done for the JIT.

But that is not all. Right after the sprint, we were able to squeeze out
the last obvious bugs in the STM-JIT combination. However, the performance
was nowhere near what we want. So since then, we have fixed some of the most
obvious issues. Many came from RPython erring on the side of caution
and e.g. making a transaction inevitable even if that was not strictly
necessary, thereby limiting parallelism. Another problem came from
incrementing counters every time a guard fails, which caused transactions
to conflict on these counter updates. Since these counters do not have
to be completely accurate, we now update them non-transactionally, with
a chance of small errors.

There are still many such performance issues of various complexity left
to tackle: we are nowhere near done. So stay tuned or contribute :)

Performance

Now, since the JIT is all about performance, we want to at least
show you some numbers that are indicative of things to come.
Our set of STM benchmarks is very small unfortunately
(something you can help us out with), so this is
not representative of real-world performance. We tried to
minimize the effect of JIT warm-up in the benchmark results.

The machine these benchmarks were executed on has 4 physical
cores with Hyper-Threading (8 hardware threads).
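As a reminder of the kind of workload these numbers describe, here is a minimal CPU-bound threaded program (an illustrative sketch, not one of the actual benchmarks). On CPython and regular PyPy the GIL serializes the threads; PyPy-STM aims to actually run them in parallel:

```python
import threading

# Each thread runs a pure-Python counting loop: CPU-bound work with
# no I/O, so the GIL allows no parallel speedup on ordinary Pythons.
def count_down(n, results, i):
    total = 0
    while n:
        total += 1
        n -= 1
    results[i] = total

results = [0] * 4
threads = [threading.Thread(target=count_down, args=(100000, results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results))  # 400000
```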

Raytracer from stm-benchmarks:
Render times in seconds for a 1024x1024 image:

Interpreter    Base time: 1 thread    8 threads (speedup)
PyPy-2.1       2.47                   2.56 (0.96x)
CPython        81.1                   73.4 (1.1x)
PyPy-STM       50.2                   10.8 (4.6x)

For comparison, disabling the JIT gives 148s on PyPy-2.1 and 87s on
PyPy-STM (with 8 threads).

On a second benchmark, disabling the JIT gives 492ms on PyPy-2.1 and 538ms on
PyPy-STM.

Try it!

All this can be found in the PyPy repository on the stmgc-c4
branch.
Try it for yourself, but keep in mind that this is still experimental
with a lot of things yet to come. Only Linux x64 is supported right
now, but contributions are welcome.

Summary

What the numbers tell us is that PyPy-STM is, as expected,
the only one of the three interpreters where multithreading gives a large
improvement in speed. What they also tell us is that, obviously, the
result is not good enough yet: it still takes longer on an 8-threaded
PyPy-STM than on a regular single-threaded PyPy-2.1. However, as you
should know by now, we are good at promising speed and delivering it...
years later :-)

But it has been two years already since PyPy-STM started, and this is
our first preview of the JIT integration. Expect major improvements
soon: with STM, the JIT generates code that is completely suboptimal in
many cases (barriers, allocation, and more). Once we improve this, the
performance of the STM-JITted code should come much closer to PyPy 2.1.

Cheers

Remi & Armin

Tuesday, October 15, 2013

Hello everyone.

We're pleased to announce that as of today,
the default PyPy comes with a GC that has much smaller pauses than yesterday.

Let's start with explaining roughly what GC pauses are. In CPython each
object has a reference count, which is incremented each time we create
references and decremented each time we forget them. This means that objects
are freed each time they become unreachable. That is only half of the story
though. First note that when the last reference to a large tree of
objects goes away, you have a pause: all the objects are freed. Your
program is not progressing at all during this pause, and this pause's
duration can be arbitrarily large. This occurs at deterministic times,
though. But consider code like this:
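For example (a minimal sketch; the attribute names are illustrative):

```python
import gc

class A(object):
    pass

gc.collect()   # start from a clean slate

a = A()
b = A()
a.other = b    # a points to b
b.other = a    # b points to a: a reference cycle
del a
del b
# Both objects are now unreachable, but each still has a reference
# count of 1, so only the cyclic collector can reclaim them.
freed = gc.collect()
```

After the `del` statements, `freed` counts the objects that only the cyclic collector could free.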

This creates a reference cycle. It means that while we deleted references to
a and b from the current scope, they still have a reference count of 1,
because they point to each other, even though the whole group has no references
from the outside. CPython employs a cyclic garbage collector which is used to
find such cycles. It walks over all objects in memory, starting from some known
roots, such as type objects, variables on the stack, etc. This solves the
problem, but can create noticeable, nondeterministic GC pauses as the heap
becomes large and convoluted.

PyPy essentially has only the cycle finder: it does not bother with reference
counting; instead it walks the live objects every now and then (this is a big
simplification; PyPy's GC is much more complex than that). Although this might
sound like a missing feature, it is really one of the reasons why PyPy is so
fast: at the end of the day, the total time spent managing memory is lower in
PyPy than in CPython. However, as a result, PyPy also has the problem of GC
pauses.

To alleviate this problem, which is essential for
applications like games, we started to work on incremental GC, which spreads
the walking of objects and cleaning them across the execution time in smaller
intervals. The work was sponsored by the Raspberry Pi foundation, started
by Andrew Chambers and finished by Armin Rigo and Maciej Fijałkowski.

Benchmarks

Everyone loves benchmarks. We did not measure any significant speed difference
on our quite extensive benchmark suite on speed.pypy.org. The main
benchmark that we used for other comparisons was translating the topaz
ruby interpreter using various versions of PyPy and CPython. The exact
command was python <pypy-checkout>/bin/rpython -O2 --rtype targettopaz.py.
Versions:

topaz - dce3eef7b1910fc5600a4cd0afd6220543104823

pypy source - defb5119e3c6

pypy compiled with minimark (non-incremental GC) - d1a0c07b6586

pypy compiled with incminimark (new, incremental GC) - 417a7117f8d7

CPython - 2.7.3

The memory usage of CPython, PyPy with minimark and PyPy with incminimark is
shown here. Note that this benchmark is quite bad for PyPy in general: the
memory usage is higher and the time taken is longer. This is due
to the JIT warmup being both memory-hungry and inefficient (see below).
The important point, though, is that the new GC is no worse than the old one.

EDIT: Red line is CPython, blue is incminimark (new), green is minimark (old)

However, the GC pauses are significantly smaller. For PyPy, the way to
get GC pause times is to measure the time between start and stop while running
with PYPYLOG=gc-collect:log pypy program.py; for CPython, the magic
incantation is gc.set_debug(gc.DEBUG_STATS) and parsing the output.
For what it's worth, the average and total for CPython, as well as the total
number of events, are not directly comparable, since it only shows the cyclic
collector, not the reference counting. The only comparable thing is the
number of long pauses and their duration. In the tables below, pause durations
are sorted into 8 buckets, each meaning "below or equal to the threshold".
The output is generated using the gcanalyze tool.
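For CPython, the incantation looks like this; DEBUG_STATS makes every cyclic collection print its timing statistics to stderr, which you can then parse:

```python
import gc

# Ask the cyclic collector to print a stats block (including elapsed
# time) to stderr for every collection it performs.
gc.set_debug(gc.DEBUG_STATS)
gc.collect()      # this collection prints its statistics
gc.set_debug(0)   # back to normal
```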

CPython:

150.1ms  300.2ms  450.3ms  600.5ms  750.6ms  900.7ms  1050.8ms  1200.9ms
5417     5        3        2        1        1        0         1

PyPy minimark (non-incremental GC):

216.4ms  432.8ms  649.2ms  865.6ms  1082.0ms  1298.4ms  1514.8ms  1731.2ms
27       14       6        4        6         5         3         3

PyPy incminimark (new incremental GC):

15.7ms  31.4ms  47.1ms  62.8ms  78.6ms  94.3ms  110.0ms  125.7ms
25512   122     4       1       0       0       0        2

As we can see, while there is still work to be done (the 100ms ones could
be split among several steps), we did improve the situation quite drastically
without any actual performance difference.

Note about the benchmark - we know it's a pretty extreme case of JIT
warmup, we know we suck on it, we're working on it and we're not afraid of
showing PyPy is not always the best ;-)

Nitty gritty details

Here are some nitty gritty details for people really interested in
Garbage Collection. This was done as a patch to "minimark", our current
GC, and called "incminimark" for now. The former is a generational
stop-the-world GC. New objects are allocated "young", which means that
they initially live in the "nursery", a special zone of a few MB of
memory. When the nursery is full, a "minor collection" step moves the
surviving objects out of the nursery. This can be done quickly (a few
milliseconds) because we only need to walk the young objects that
survive --- usually a small fraction of all young objects, and by far
not all of the objects alive at this point, only the young
ones. However, from time to time this minor collection is followed by a
"major collection": in that step, we really need to walk all objects to
classify which ones are still alive and which ones are now dead
("marking") and free the memory occupied by the dead ones ("sweeping").
You can read more details here.

This "major collection" is what gives the long GC pauses. To fix this
problem we made the GC incremental: instead of running one complete
major collection, we split its work into a variable number of pieces and
run each piece after every minor collection for a while, until there are
no more pieces. Each piece does a fraction of the marking, or a
fraction of the sweeping. This adds a few milliseconds after each of these
minor collections, rather than requiring hundreds of milliseconds in one
go.

The main issue is that splitting the major collections means that the
main program is actually running between the pieces, and so it can
change the pointers in the objects to point to other objects. This is
not a problem for sweeping: dead objects will remain dead whatever the
main program does. However, it is a problem for marking. Let us see
why.

In terms of the incremental GC literature, objects are either "white",
"gray" or "black". This is called tri-color marking. See for example
this blog post about Rubinius, this page about LuaJIT, or the Wikipedia description. The
objects start as "white" at the beginning of marking; become "gray" when
they are found to be alive; and become "black" when they have been fully
traversed. Marking proceeds by scanning gray objects for pointers to
white objects. The white objects found are turned gray, and the gray
objects scanned are turned black. When there are no more gray objects,
the marking phase is complete: all remaining white objects are truly
unreachable and can be freed (by the following sweeping phase).
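The marking loop just described can be sketched in a few lines of Python (a toy model, not PyPy's actual code), with the heap as a dict mapping each object to the objects it points to:

```python
# Toy tri-color mark phase over an explicit object graph.
WHITE, GRAY, BLACK = 0, 1, 2

def mark(heap, roots):
    color = {obj: WHITE for obj in heap}   # everything starts white
    gray = list(roots)
    for obj in gray:                       # roots are found alive
        color[obj] = GRAY
    while gray:                            # scan gray objects...
        obj = gray.pop()
        for child in heap[obj]:            # ...for pointers to white ones
            if color[child] == WHITE:
                color[child] = GRAY
                gray.append(child)
        color[obj] = BLACK                 # fully traversed
    # No gray objects left: remaining white objects are unreachable.
    return [obj for obj in heap if color[obj] == WHITE]

heap = {"a": ["b"], "b": [], "c": ["a"]}   # "c" is not reachable from "a"
print(mark(heap, ["a"]))                   # ['c']
```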

In this model, the important part is that a black object can never point
to a white object: if the latter remains white until the end, it will be
freed, which is incorrect because the black object itself can still be
reached. How do we ensure that the main program, running in the middle
of marking, will not write a pointer to a white object into a black
object? This requires a "write barrier": a piece of code that runs
every time we set a pointer in an object or array. It checks whether
some (hopefully rare) condition is met, and calls a function if that is
the case.

The trick we used in PyPy is to consider minor collections as part of
the whole, rather than focus only on major collections. The existing
minimark GC had always used a write barrier of its own to do its job,
like any generational GC. This existing write barrier is used to detect
when an old object (outside the nursery) is modified to point to a young
object (inside the nursery), which is essential information for minor
collections. Actually, although this was the goal, the write barrier
code itself is simpler: it just records all old objects into which we
write any pointer --- young or old. As we found out over
time, doing so is not actually slower, and can even be a
performance improvement: for example, if the main program does a lot of
writes into the same old object, we don't need to check over and over
again whether the written pointer points to a young object. We just
record the old object in some list the first time, and that's it.
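That "record it once" behaviour can be sketched like this (a toy model; the real barrier is generated by the RPython translator, and the names here are mine):

```python
# Toy model of minimark's write barrier.
class Obj:
    def __init__(self):
        self.fields = {}
        # Set on old objects; cleared on the first recorded write and
        # set again after each minor collection.
        self.track_writes = True

modified_old_objects = []

def write_barrier(obj, field, value):
    # Record the old object on its *first* write only; later writes
    # into the same object skip the list append entirely.
    if obj.track_writes:
        obj.track_writes = False
        modified_old_objects.append(obj)
    obj.fields[field] = value

o = Obj()
write_barrier(o, "x", 1)
write_barrier(o, "y", 2)          # second write: no new entry
print(len(modified_old_objects))  # 1
```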

The trick is that this unmodified write barrier works for incminimark
too. Imagine that we are in the middle of the marking phase, running
the main program. The write barrier will record all old objects that
are being modified. Then at the next minor collection, all surviving
young objects will be moved out of the nursery. At this point, as we're
about to continue running the major collection's marking phase, we
simply add to the list of pending gray objects all the objects that we
just considered --- both the objects listed as "old objects that are
being modified", and the objects that we just moved out of the nursery.
A fraction of the objects in the former list were black; this means that
they are turned back from black to gray. This technique
implements nicely, if indirectly, what is called a "backward write
barrier" in the literature: the color is changed backward, against the
usual direction "white -> gray -> black", thus making more work for the
GC. (This is as opposed to a "forward write barrier", where we would
also detect "black -> white" writes but instead turn the white object
gray.)

In summary, I realize that this description is less about how we turned
minimark into incminimark, and more about how we differ from the
standard way of making a GC incremental. What we really had to do to
make incminimark was to write logic that says "if the major collection
is in the middle of the marking phase, then add this object to the list
of gray objects", and put it at a few places throughout minor
collection. Then we simply split a major collection into increments,
doing marking or sweeping of some (relatively arbitrary) number of
objects before returning. That's why, after we found that the existing
write barrier would do, it was not much actual work, and could be done
without major changes. For example, not a single line from the JIT
needed adaptation. All in all it was relatively painless work. ;-)

Cheers,
armin and fijal
