Friday, November 26, 2010

We're pleased to announce the 1.4 release of PyPy. This is a major breakthrough
in our long journey, as PyPy 1.4 is the first PyPy release that can translate
itself faster than CPython. Starting today, we are using PyPy more for
our every-day development. So may you :) You can download it here:

What is PyPy

PyPy is a very compliant Python interpreter, almost a drop-in replacement
for CPython. It is fast (pypy 1.4 and cpython 2.6 comparison).

New Features

Among its new features, this release includes numerous performance improvements
(which made fast self-hosting possible), a 64-bit JIT backend, and serious
stabilization work. As of now, we consider the 32-bit and 64-bit Linux versions
of PyPy stable enough to run in production.

More highlights

PyPy's built-in Just-in-Time compiler is fully transparent and
automatically generated; it now also has very reasonable memory
requirements. The total memory used by a very complex and
long-running process (translating PyPy itself) is within 1.5x to
at most 2x the memory needed by CPython, for a speed-up of 2x.

More compact instances. All instances are as compact as if
they had __slots__. This can give programs a big gain in
memory. (In the example of translation above, we already have
carefully placed __slots__, so there is no extra win.)

In our previous blog post, we talked about how fast PyPy can translate
itself compared to CPython. However, the price to pay for the 2x speedup was
a huge amount of memory: so huge, in fact, that a standard -Ojit
compilation could not be completed on 32-bit because it required more than the
4 GB of RAM addressable on that platform. On 64-bit, it consumed
8.3 GB of RAM instead of the 2.3 GB needed by CPython.

This behavior was mainly caused by the JIT: at the time we wrote that blog
post, the generated assembler was kept alive forever, together with some
big data structures needed to execute it.

In the past two weeks Anto and Armin attacked the issue in the jit-free
branch, which has recently been merged to trunk. The branch solves several
issues. Its main idea is that if a loop has not been executed for a certain
amount of time (controlled by the new loop_longevity JIT parameter), we
consider it "old" and no longer needed, and we deallocate it.
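
For illustration, here is a minimal sketch of how such a parameter can be tuned
from application code. This is an assumption-laden example rather than anything
from the original post: it relies on PyPy's built-in pypyjit module and its
set_param interface, and the value 500 is an arbitrary pick, not a recommendation.

try:
    import pypyjit
    # consider loops "old" (and thus freeable) sooner than the default
    pypyjit.set_param("loop_longevity=500")
except ImportError:
    pass  # running on CPython: there is no JIT to tune

The same kind of parameter can also be passed when starting PyPy, along the
lines of pypy --jit loop_longevity=500 script.py.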

(In the process of doing this, we also discovered and fixed an
oversight in the implementation of generators, which led to generators being
freed only very slowly.)

To understand the freeing of loops some more, let's look at how many loops are
actually created during a translation.
The purple line in the following graph shows how many loops (and bridges) are
alive at any point in time with an infinite longevity, which is equivalent to
the situation we had before the jit-free branch. By contrast, the blue
line shows the number of loops that you get in the current trunk: the
difference is evident, as now we never have more than 10000 loops alive, while
previously we got up to about 37000. The time on the X axis is expressed
in "Giga Ticks", where a tick is the value read from the CPU's Time Stamp
Counter.

The grey vertical bars represent the beginning of each phase of the
translation:

annotate performs control flow graph construction and type inference.

rtype lowers the abstraction level of the control flow graphs with types to that of C.

pyjitpl constructs the JIT.

backendopt optimizes the control flow graphs.

stackcheckinsertion finds the places in the call graph that can overflow the C stack and inserts checks that raise an exception instead.

database_c produces a database of all the objects the C code will have to know about.

source_c produces the C source code.

compile_c calls the compiler to produce the executable.

You can see nicely how the number of alive graphs drops shortly after the
beginning of each new phase.

Those two fixes, freeing loops and generators, improve the memory usage greatly:
now, translating PyPy
on PyPy on 32-bit consumes 2 GB of RAM, while on CPython it consumes 1.1 GB.
This result can even be improved somewhat, because we are not actually freeing
the assembler code itself, but
only the large data structures around it; we can consider it as a residual
memory leak of around 150 MB in this case. This will be fixed in the
jit-free-asm branch.

The following graph shows the memory usage in more detail:

the blue line (cpython-scaled) shows the total amount of RAM that the
OS allocates for CPython. Note that the X axis (the time) has been
scaled down so that it spans the same range as the PyPy one, to ease the
comparison. (In reality, CPython took more than twice as long as PyPy to
complete the translation.)

the red line (VmRss) shows the total amount of RAM that the
OS allocates for PyPy: it includes both the memory directly handled by
our GC and the "raw memory" that we need to allocate for other tasks, such
as the assembly code generated by the JIT

the brown line (gc-before) shows how much memory is used by the GC
before each major collection

the yellow line (gc-after) shows how much memory is used by the GC
after each major collection: this represents the amount of memory which is
actually needed to hold our Python objects. The difference between
gc-before and gc-after (the GC delta) is the amount of memory that the GC
uses before triggering a new major collection

By comparing gc-after and cpython-scaled, we can see that PyPy
uses roughly the same amount of memory as CPython for storing the application
objects (due to reference counting, the memory usage in CPython is always very
close to the actually necessary memory). The extra memory
used by PyPy is due to the GC delta, to the machine code generated by the JIT,
and probably to other external effects (e.g. memory fragmentation).

Note that the GC delta can be set arbitrarily low (another recent addition;
since the default value depends on the actual RAM of your computer,
translation probably still works on a machine with exactly 2 GB, because in
that case the GC delta, and thus the total memory usage, will be somewhat
lower than reported here), but the cost is more
frequent major collections and thus a higher run-time overhead. The same is
true for the memory needed by the JIT, which can be reduced by telling the JIT
to compile less often or to discard old loops more frequently. As often
happens in computer science, there is a trade-off between space and time, and
currently for this particular example PyPy runs twice as fast as CPython by
doubling the memory usage. We hope to improve even more on this trade-off.

On 64-bit, things are even better, as shown by the following graph:

The general shape of the lines is similar to the 32-bit graph. However, the
relative difference to CPython is much better: we need about 3 GB of RAM, just
24% more than the 2.4 GB needed by CPython. And we are still more than 2x
faster!

The net result of our work is that now translating PyPy on PyPy is practical
and takes less than 30 minutes. It's impressive how quickly you get used to
translation taking half the time -- now we cannot use CPython any more for that
because it feels too slow :-).

Monday, November 15, 2010

As some of you already know, I've recently started working on a
very large radio telescope at SKA South Africa. This telescope's
operating software runs almost exclusively on Python (several high-throughput
pieces are in C or CUDA, or are executed directly by FPGAs). Some cool telescope pictures:

(photos courtesy of SKA South Africa)

Most of the operation software uses the KatCP protocol to talk between devices.
The implementation currently in use is open-source software with a custom
home-built server and client. As part of my experiments, I've implemented a
Twisted-based version and ran both the default implementation and the Twisted
one on top of CPython and PyPy to see how they perform.

There are two testing scenarios: the first one tries to saturate the connection
by setting up multiple sensors that report state every 10ms; the second one
measures the round trip between sending a request and receiving the response.
Both benchmarks measure the number of requests per 0.2s, so higher is better. The X axis shows the number of simultaneously connected clients.
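
To make the round-trip scenario concrete, here is a minimal sketch of the idea
in Twisted. This is not the actual benchmark (that lives in the KatCP
repository); the ?watchdog request and the port number are hypothetical,
KatCP-style placeholders.

from twisted.internet import reactor
from twisted.internet.protocol import ClientFactory
from twisted.protocols.basic import LineReceiver

class RoundTripClient(LineReceiver):
    delimiter = '\n'

    def connectionMade(self):
        self.count = 0
        reactor.callLater(0.2, self.report)
        self.sendLine('?watchdog')       # hypothetical KatCP-style request

    def lineReceived(self, line):
        self.count += 1                  # one round trip completed
        self.sendLine('?watchdog')       # immediately issue the next request

    def report(self):
        print 'requests per 0.2s:', self.count
        self.count = 0
        reactor.callLater(0.2, self.report)

class RoundTripFactory(ClientFactory):
    protocol = RoundTripClient

reactor.connectTCP('localhost', 1235, RoundTripFactory())
reactor.run()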

All benchmark code is available in the KatCP repository.

The results are as follows:

As you can see, Twisted generally has a larger overhead for a single client
but scales better as the number of clients increases. That is, I think,
expected, since Twisted has extra layers of indirection. The round-trip
degradation of Twisted has to be investigated, but for us scenario 1 is by far
the more important one.

Across the board, PyPy performs much better than CPython for both
Twisted and the home-made solution, which I think is a pretty good result.

Note: we haven't rolled this setup into production yet, but there is a good
chance that both Twisted and PyPy will be used there in the near future.

Cheers,
fijal

Saturday, November 13, 2010

As could be foreseen from my Call for Memory Benchmarks post a while ago, I am
currently working on improving the memory behaviour of PyPy's Python
interpreter. In this blog post I want to describe the various kinds of data a
Python instance can store. Then I want to describe how a branch that I worked
on, and that was recently merged, implements the various features of instances
in a very memory-efficient way.

Python's Object Model

Storing the Class

Every instance knows which class it belongs to. This information is accessible
via the .__class__ attribute. It can also be changed to other (compatible
enough) classes by writing to that attribute.

Instance Variables

Every instance also stores an arbitrary number of attributes (also called
instance variables). The instance variables used can vary per instance, which is
not the case in most other class-based languages: traditionally (e.g. in
Smalltalk or Java) the class describes the shape of its instances,
which means that the
set of admissible instance variable names is the same for all instances of a
class.

In Python on the other hand, it is possible to add arbitrary attributes to an
instance at any point. The instance behaves like a dictionary mapping attribute
names (as strings) to the attribute values.

This is actually how CPython implements instances. Every instance has a
reference to a dictionary that stores all the attributes of the instance. This
dictionary can be reached via the .__dict__ attribute. To make things more
fun, the dictionary can also be changed by writing to that attribute.
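
A tiny, runnable illustration of this dict-like behaviour (plain Python; works
on both CPython and PyPy):

class Point(object):
    pass

p = Point()
p.x = 1
p.y = 2
print p.__dict__          # {'x': 1, 'y': 2} (key order may vary)
p.__dict__['z'] = 3       # writing through the dict adds an attribute
print p.z                 # 3
p.__dict__ = {'a': 42}    # the whole dictionary can even be replaced
print p.a                 # 42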

(The picture glosses over a number of details, but it still shows the essential
issues.)

This way of storing things is simple, but unfortunately rather inefficient. Most
instances of the same class have the same shape, i.e. the same set of instance
attribute names. That means that the key part of all the dictionaries is
identical (shown grey here). Therefore storing that part repeatedly in all
instances is a waste. In addition, dictionaries are themselves rather large.
Since they are typically implemented as hashmaps, which must not be too full to
be efficient, a dictionary will use something like 6 words on average per key.

Slots

Since normal instances are rather large, CPython 2.2 introduced slots, to make
instances consume less memory. Slots are a way to fix the set of attributes an
instance can have. This is achieved by adding a declaration to a class, like
this:

class B(object):
    __slots__ = ["x", "y", "z"]

Now the instances of B can only have x, y and z as attributes
and don't have a dictionary at all. Instead, the instances of B get
allocated with enough size to hold exactly the number of instance variables that
the class permits. This clearly saves a lot of memory over the dictionary
approach, but has a number of disadvantages. It is obviously less flexible, as
you cannot add additional instance variables to an instance if you happen to
need to do that. It also introduces a set of rules and corner cases that can
sometimes be surprising (e.g. instances of a subclass of a class with slots
will get a dict if the subclass itself has no slots declaration).
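
A quick demonstration of the reduced flexibility, using the class B above:

b = B()
b.x = 1          # fine: x is declared in __slots__
try:
    b.w = 2      # w is not declared, so this raises AttributeError
except AttributeError:
    print 'cannot add undeclared attributes'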

Using Maps for Memory-Efficient Instances

As we have seen in the diagram above, the dictionaries of instances of the same
class tend to look very similar and share all the keys. The central idea to use
less memory is to "factor out" the common parts of the instance dictionaries
into a new object, called a "map" (because it is a guide to the landscape of the
object, or something). After that factoring out, the representation of the
instances above looks something like this:

Every instance now has a reference to its map, which describes what the instance
looks like. The actual instance variables are stored in an array (called
storage in the diagram). In the example here, the map describes that the
instances have three attributes x, y and z. The numbers after the
attributes are indexes into the storage array.

If somebody adds a new attribute to one of the instances, the map for that
instance is changed to another map that also contains the new attribute,
and the storage has to grow by a field for the new attribute. The maps are
immutable, immortal and reused as much as possible. This means that two
instances of the same class with the same set of attributes will have the same
map. This also means that the memory the map itself uses is not too important,
because it will potentially be amortized over many instances.

Note that using maps makes instances nearly as small as if the correct slots had
been declared in the class. The only overhead is the indirection to the
storage array, which is needed because new instance variables can be added at
any time, while new slots cannot.

The rest of the post describes a number of further details that occur if
instances are implemented using maps.

Supporting Dictionaries with Maps

The default instance representation with maps as shown above works without
actually having a dictionary as part of each instance. If a dictionary is
actually requested, by accessing the .__dict__ attribute, it needs to be
created and cached. The dictionary is not a normal Python dictionary, but a thin
wrapper around the object that forwards all operations to it. From the user's
point of view it behaves like a normal dictionary though (it even has the
correct type).

The dictionary needs to be cached, because accessing .__dict__ several times
should always return the same dictionary. The caching happens by using a
different map that knows about the dictionary and putting the dictionary into
the storage array:

Things become really complex if the fake dict is used in strange ways. As long
as the keys are strings, everything is fine. If somebody adds other keys to the
dict, they cannot be represented by the map any more (which supports only
attributes, i.e. string keys in the __dict__). If that happens, all the
information of the instance will move into the fake dictionary, like this:

In this picture, the key -1 was added to the instance's dictionary. Since
using the dictionary in arbitrary ways should be rare, we are fine with the
additional time and memory that the approach takes.
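
As a sketch of what such "strange" usage looks like from the application side
(deliberately contrived code; the internal devolution it triggers is invisible
to the user):

class A(object):
    pass

a = A()
a.x = 1
d = a.__dict__        # the thin wrapper described above
d[-1] = 'strange'     # a non-string key: the map cannot represent this
print d[-1]           # the dict still behaves normally...
print a.x             # ...and the existing attribute keeps working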

Slots and Maps

Maps work perfectly together with slots, because the slots can just be stored
into the storage array used by the maps as well (in practice there are some
refinements to that scheme). This means that putting a __slots__ declaration
on a class has mostly no effect, because the instance only stores the values
of the attributes (and not the names), which is equivalent to the way slots
are stored in CPython.

Implementation Details

In the diagrams above, I represented the maps as flat objects. In practice this
is a bit more complex, because it needs to be efficient to go from one map to
the next when new attributes are added. Thus the maps are organized in a tree.

The instances with their maps from above look a bit more like this in practice:

Every map describes just one attribute of the object, with a name and an
index. Every map also has a back field that points to another map
describing what the rest of the object looks like. This chain ends with a
terminator, which also stores the class of the object.

The maps also contain the information necessary for making a new object of
class A. Immediately after the new object has been created, its map is the
terminator. If the x attribute is added, its map is changed to the
second-lowest map, and so on. The blue arrows show the sequence of maps that
the new object goes through when the attributes x, y, z are added.

This representation of maps as chains of objects sounds very inefficient if an
object has many attributes. The whole chain has to be walked to find the index.
This is true to some extent. The problem goes away in the presence of the JIT,
which knows that the chain of maps is an immutable structure, and will thus
optimize away all the chain-walking. If the JIT is not used, there are a few
caches that try to speed up the walking of this chain (similar to the method
cache in CPython and PyPy).
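
To make the scheme concrete, here is a toy sketch of the map chain in pure
Python. This is not PyPy's actual implementation (which lives at interpreter
level and includes the refinements and caches mentioned above); it only
illustrates the data structure:

class Terminator(object):
    # end of the chain; also remembers the class of the object
    def __init__(self, cls):
        self.cls = cls
        self.transitions = {}     # attribute name -> next map, so maps are shared

    def index_of(self, name):
        return -1                 # attribute not found

    def length(self):
        return 0

    def add_attribute(self, name):
        if name not in self.transitions:
            self.transitions[name] = Map(self, name, self.length())
        return self.transitions[name]

class Map(Terminator):
    # describes one attribute: its name, its index into the storage
    # array, and a back link to the map for the rest of the object
    def __init__(self, back, name, index):
        self.back = back
        self.name = name
        self.index = index
        self.transitions = {}

    def index_of(self, name):
        if name == self.name:
            return self.index
        return self.back.index_of(name)   # walk the chain

    def length(self):
        return self.index + 1

class Instance(object):
    def __init__(self, terminator):
        self.map = terminator     # a fresh object starts at the terminator
        self.storage = []

    def setattr(self, name, value):
        index = self.map.index_of(name)
        if index >= 0:
            self.storage[index] = value        # attribute already exists
        else:
            self.map = self.map.add_attribute(name)
            self.storage.append(value)         # storage grows by one field

    def getattr(self, name):
        index = self.map.index_of(name)
        if index < 0:
            raise AttributeError(name)
        return self.storage[index]

Because maps are shared via the transitions dictionaries, two instances that
gain the same attributes in the same order end up with the very same map object:

terminator = Terminator('A')
obj1 = Instance(terminator)
obj2 = Instance(terminator)
for obj in [obj1, obj2]:
    obj.setattr('x', 1)
    obj.setattr('y', 2)
    obj.setattr('z', 3)
assert obj1.map is obj2.map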

Results

It's hard to compare the improvements of this optimization in a fair way, as
the trade-offs involved are just very different. Just to give an impression: a
million objects of the same class with three fields each, on a 32-bit system,
take:

without slots:

182 MiB memory in CPython

177 MiB memory in PyPy without maps

40 MiB memory in PyPy with maps

with slots:

45 MiB memory in CPython

50 MiB memory in PyPy without maps

40 MiB memory in PyPy with maps

Note how maps make the objects even a bit more memory-efficient than CPython
instances with slots. Also, using slots brings no additional savings in PyPy.
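
For reference, the kind of micro-benchmark these numbers come from looks
roughly like the following. This is a hedged reconstruction, not the exact
script; the memory figures are read externally (e.g. VmRSS in
/proc/<pid>/status) once all objects are allocated.

class A(object):
    # for the "with slots" variant, add: __slots__ = ['x', 'y', 'z']
    pass

objects = []
for i in xrange(1000000):
    a = A()
    a.x = i
    a.y = i
    a.z = i
    objects.append(a)
raw_input('now measure the process memory')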

Conclusion

Maps are a powerful approach to shrinking the memory used by many similar
instances. I think they can be pushed even further (e.g. by adding information
about the types of the attributes) and plan to do so in the following months.
Details will be forthcoming.

Wednesday, November 10, 2010

PyPy joins the Software Freedom Conservancy

Good news: PyPy is now a member of the Software Freedom Conservancy (SFC),
see the SFC blog post. This allows us to manage the non-profit monetary aspects
of the project independently from any company or particular persons. We can
now officially receive donations from people preferring either the right or
the left side; see the Donate buttons on our home page and on our blog.
You can use PayPal or Google Checkout, and donations are tax-exempt in the
USA and hopefully soon in Europe as well.

What is it going to be used for? For the immediate future, we intend to use
the donations to fund travel to PyPy sprints for core contributors
who otherwise couldn't afford to come. So if you have no time but some
money, you can help encourage coding contributors to care for PyPy.
If we end up with bigger sums, we'll see and take suggestions. Spending
decisions will be made by core PyPy people according to non-profit guidelines,
and we'll post information from time to time about how much we received and
where the money went.

If you have any questions regarding the SFC membership or donations,
you may send email to sfc at pypy.org, which is read
by Carl Friedrich Bolz, Jacob Hallen and Holger Krekel - the initial
PyPy SFC representatives on behalf of the PyPy team. Many thanks go
out to Bradley M. Kuhn for helping to implement the PyPy SFC membership.

cheers,

Holger & Carl Friedrich

Tuesday, November 9, 2010

We have to admit: even though we have been writing for years about the
fantastic speedups that the PyPy JIT gives, we, the PyPy developers, still
don't use it for our daily routine. Until today :-).

Readers brave enough to run translate.py to translate PyPy by themselves
surely know that the process takes quite a long time to complete: about an
hour on super-fast hardware and even more on average computers. Unfortunately,
translate.py happened to be a bad match for our JIT and thus ran much
slower on PyPy than on CPython.

One of the main reasons is that the PyPy translation toolchain makes heavy use
of custom metaclasses, and until a few weeks ago metaclasses disabled some of
the central optimizations that make PyPy so fast. During the recent
Düsseldorf sprint, Armin and Carl Friedrich fixed this problem and
re-enabled all the optimizations even in the presence of metaclasses.

So, today we decided that it was time to benchmark PyPy against itself again.
First, we tried to translate PyPy using CPython as usual, with the following
command line (on a machine with an "Intel(R) Xeon(R) CPU W3580 @ 3.33GHz" and
12 GB of RAM, running a 32-bit Ubuntu):
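
(The exact command line and the per-phase timing table from the original post
did not survive in this copy. Judging from the surrounding text, the invocation
was along these lines; this is a reconstruction, not a verbatim quote:)

python ./translate.py -Ojit --no-allworkingmodules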

Yes, it's not a typo: PyPy is almost two times faster than CPython!
Moreover, we can see that PyPy is faster in each of the individual steps apart
from compile_c, which consists of just a call to make to invoke gcc.
The slowdown there comes from the fact that the Makefile also contains a lot of
calls to the trackgcroot.py script, which happens to perform badly on PyPy;
we have not yet investigated why.

However, there is also a drawback: on this specific benchmark, PyPy consumes
much more memory than CPython. The reason the command line above contains
--no-allworkingmodules is that if we include all the modules, the
translation crashes when it is 99% complete, because it consumes all of the
4 GB of memory addressable by a 32-bit process.

A partial explanation is that, so far, the assembler generated by the PyPy JIT
is immortal, and the memory allocated for it is never reclaimed. This is
clearly bad for a program like translate.py, which is divided into several
independent steps and for which most of the code generated in each step could
safely be thrown away once the step has completed.

If we switch to 64-bit we can address the whole 12 GB of RAM that we have, and
thus translating with all working modules is no longer an issue. This is the
time taken with CPython (note that it does not make sense to compare with the
32-bit CPython translation above, because that one does not include all the
modules):

The results are comparable with the 32-bit case: PyPy is still almost 2 times
faster than CPython. And it also shows that our 64-bit JIT backend is as good
as the 32-bit one. Again, the drawback is in the consumed memory: CPython
used 2.3 GB while PyPy took 8.3 GB.

Overall, the results are impressive: we knew that PyPy can be good at
optimizing small benchmarks and even middle-sized programs, but as far as we
know this is the first example in which it heavily optimizes a huge, real-world
application. And, believe us, the PyPy translation toolchain is complex
enough to contain all kinds of dirty tricks and black magic that make Python
lovable and hard to optimize :-).
