Friday, October 16, 2009

In the last week, I (Armin) have been taking some time off the
JIT work to improve our GCs. More precisely, our GCs now take
one or two words less for every object. This further reduces the
memory usage of PyPy, as we will show at the end.

Background information: RPython object model

We first need to understand the RPython object model as
implemented by our GCs and our C backend. (Note that the
object model of the Python interpreter is built on top of
that, but is more complicated -- e.g. Python-level objects
are much more flexible than RPython objects.)
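The post refers to two classes A and B whose definitions are missing from this copy. A plausible RPython-style reconstruction (the exact bodies, including the method f, are assumptions) is:

```python
class A(object):
    def __init__(self, x):
        self.x = x
    def f(self):              # the method dispatched via the vtable below
        return self.x * 42

class B(A):
    def __init__(self, x, y):
        A.__init__(self, x)
        self.y = y
    def f(self):
        return self.x + self.y
```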

The instances of A and B look like this in memory (all cells
are one word):

instance of A:

    GC header
    vtable ptr of A
    hash
    x

instance of B:

    GC header
    vtable ptr of B
    hash
    x
    y

The first word, the GC header, describes the layout. One
half-word encodes the shape of the object, including where it
contains further pointers, so that the GC can trace it. The
other half contains GC flags (e.g. the mark bit of a
mark-and-sweep GC).

The second word is used for method dispatch. It is similar to a
C++ vtable pointer. It points to static data that is mostly a
table of methods (as function pointers), containing e.g. the method f
of the example.

The hash field is not necessarily there; it is only present in classes
whose hash is ever taken in the RPython program (which includes being
keys in a dictionary). It is an "identity hash": it works like
object.__hash__() in Python, but it cannot just be the address of
the object in case of a GC that moves objects around.

Finally, the x and y fields are, obviously, used to store the value
of the fields. Note that instances of B can be used in places that
expect a pointer to an instance of A.

Unifying the vtable ptr with the GC header

The first idea for saving a word in every object comes from the
observation that both the vtable ptr and the GC header store
information about the class of the object. It is therefore natural
to try to keep only one of them. The problem is that we still need
bits for the GC flags, so the field that we have to remove is the
vtable pointer.

This means that method dispatch needs to be more clever: it
cannot directly read the vtable ptr, but needs to compute it
from the half-word of the GC header. Fortunately, this can be
done with no extra instruction at the assembler level. Here is
how things look in the end, assuming a 32-bit x86 machine
(but note that as usual we just generate portable C).

The trick for achieving efficiency is that we store all
vtables together in memory, and make sure that they don't take
more than 256 KB in total (16 bits, plus 2 bits of alignment).
Here is what the assembler code (produced by the normal C
compiler, e.g. gcc) for calling a method looks like. Before
the change:
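(The listings themselves are missing from this copy; what follows is an illustrative reconstruction. Register choices, offsets and syntax are assumptions, not the exact gcc output.) Two dependent loads were needed:

```
MOV EDX, [EAX + 4]              ; load the vtable ptr (second word of the object)
MOV EDX, [EDX + method_offset]  ; load the function pointer from the vtable
CALL EDX
```

And after the change, the first load reads the half-word of the GC header instead:

```
MOVZX EDX, WORD PTR [EAX + 2]                    ; load the 16-bit half of the GC header
MOV EDX, [vtable_start + 4*EDX + method_offset]  ; indexed load of the function pointer
CALL EDX
```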

Note that the complex addressing scheme done by the second MOV
is still just one instruction: the vtable_start and
method_offset are constants, so they are combined. And as the
vtables are anyway aligned at a word boundary, we can use
4*EDX to address them, giving us 256 KB instead of just 64 KB
of vtables.
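In portable terms, the address computation performed by that second MOV can be sketched in Python (the names, the base address, and the assumption that the type id sits in the low 16 bits are all illustrative, not PyPy's actual code):

```python
VTABLE_START = 0x08000000  # assumed base address of the 256 KB vtable area

def vtable_address(gc_header):
    # One half-word of the GC header identifies the type; the other
    # half holds the GC flags.  Multiplying the 16-bit index by 4
    # (the word size) gives an offset covering 4 * 64K = 256 KB.
    type_id = gc_header & 0xFFFF
    return VTABLE_START + 4 * type_id
```

A method lookup is then a single indexed load at `vtable_address(header) + method_offset`; since the base and the offset are compile-time constants, the C compiler folds them into one instruction.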

Optimizing the hash field

In PyPy's Python interpreter, all application-level objects
are represented as an instance of some subclass of W_Root.
Since all of these objects could potentially be stored in a
dictionary by the application Python program, all these
objects need a hash field. Of course, in practice, only a
fraction of all objects in a Python program end up having
their hash ever taken. Thus this field of W_Root is wasted
memory most of the time.

(Up to now, we had a hack in place to save the hash field
on a few classes like W_IntegerObject, but that meant that
the Python expression "object.__hash__(42)" would raise
a TypeError in PyPy.)

The solution we implemented now (done by some Java GCs, among
others) is to add a hash field to an object when the
(identity) hash of that object is actually taken. This means
that we had to enhance our GCs to support this. When objects
are allocated, we don't reserve any space for the hash:

object at 0x74B028:

    GC header: ...00...
    x
    y

When the hash of an object is taken, we use its current memory
address, and set a flag in the GC header saying that this
particular object needs a hash:

object at 0x74B028:

    GC header: ...01...
    x
    y

If the GC needs to move the object to another memory location,
it will make the new version of the object bigger, i.e. it
will also allocate space for the hash field:

object at 0x825F60:

    GC header: ...11...
    x
    y
    hash: 0x74B028

This hash field is immediately initialized with the old memory
address, which is the hash value that we returned so far for the
object. To avoid disturbing the layout of the object, we always
put the extra hash field at the end. Of course, once set,
the hash value does not change even if the object needs to
move again.
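The scheme can be sketched in Python as a toy model (illustrative only, not the real GC code; the two booleans stand in for the "01" and "11" header states above):

```python
class MovingGCObject:
    def __init__(self, address):
        self.address = address      # current memory address
        self.hash_taken = False     # "01" state: hash taken, not yet stored
        self.stored_hash = None     # "11" state: hash field appended on move

    def identity_hash(self):
        if self.stored_hash is not None:
            return self.stored_hash # object moved already: use the saved field
        self.hash_taken = True      # remember that the hash was taken
        return self.address         # hash is simply the current address

    def move_to(self, new_address):
        # When a flagged object moves, grow it by one word and save the
        # old address there, so the hash stays stable forever after.
        if self.hash_taken and self.stored_hash is None:
            self.stored_hash = self.address
        self.address = new_address
```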

Results

Running the following program on PyPy's Python interpreter
with n=4000000:
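The program itself is missing from this copy of the post. A plausible reconstruction matching the description (allocating n small instances and keeping them alive, so that per-object overhead dominates) is:

```python
# Hypothetical reconstruction of the benchmark; the original listing
# is not shown here.
class X(object):
    def __init__(self, i):
        self.i = i

def main(n):
    lst = [X(i) for i in range(n)]  # keep all n objects alive
    return len(lst)

# main(4000000)   # the invocation used for the figures below
```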

The total amount of RAM used on a 32-bit Linux is 247 MB,
completing in 10.3 seconds. On CPython, it consumes 684 MB
and takes 89 seconds to complete... This nicely shows that
our GCs are much faster at allocating objects, and that our
objects can be much smaller than CPython's.

Armin Rigo & Carl Friedrich Bolz


Thursday, October 15, 2009

As the readers of this blog already know, I've been working on porting the
JIT to CLI/.NET for the last months. Now that it's finally possible to get a
working pypy-cli-jit, it's time to do some benchmarks.

Warning: as usual, all of this has to be considered an alpha version:
don't be surprised if you get a crash when trying to run pypy-cli-jit. Of
course, things are improving very quickly so it should become more and more
stable as days pass.

For this time, I decided to run four benchmarks. Note that for all of them we
run the main function once in advance, to let the JIT recognize the hot
loops and emit the corresponding code. Thus, the results reported do
not include the time spent by the JIT compiler itself, but give a good
measure of the quality of the code generated by the JIT. At this point in time,
I know that the CLI JIT backend spends way too much time compiling stuff, but
this issue will be fixed soon.
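The measurement scheme just described can be sketched as follows (a hypothetical helper, not the actual benchmark driver):

```python
import time

def bench(main, *args):
    main(*args)                 # warmup run: let the JIT compile the hot loops
    start = time.time()
    main(*args)                 # measured run: mostly JIT-generated code
    return time.time() - start
```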

f1.py: this is the classic PyPy JIT benchmark. It is just a function
that does some computationally intensive work with integers.

floatdemo.py: this is the same benchmark involving floating point
numbers that has already been described in a previous blog post.

oodemo.py: this is just a microbenchmark doing object-oriented stuff
such as method calls and attribute access.

richards2.py: a modified version of the classic richards.py, with a
warmup call before starting the real benchmark.

The benchmarks were run on a Windows machine with an Intel Pentium Dual Core
E5200 2.5GHz and 2GB RAM, both with .NET (CLR 2.0) and Mono 2.4.2.3.

Because of a known mono bug, if you use a version older than 2.1 you need
to pass the option -O=-branch to mono when running pypy-cli-jit, else it
will just loop forever.

For comparison, we also ran the same benchmarks with IronPython 2.0.1 and
IronPython 2.6rc1. Note that IronPython 2.6rc1 does not work with mono.

So, here are the results (expressed in seconds) with Microsoft CLR:

    Benchmark    pypy-cli-jit   ipy 2.0.1   ipy 2.6   ipy 2.0.1/pypy   ipy 2.6/pypy
    f1           0.028          0.145       0.136     5.18x            4.85x
    floatdemo    0.671          0.765       0.812     1.14x            1.21x
    oodemo       1.25           4.278       3.816     3.42x            3.05x
    richards2    1228           442         670       0.36x            0.54x

And with Mono:

    Benchmark    pypy-cli-jit   ipy 2.0.1   ipy 2.0.1/pypy
    f1           0.042          0.695       16.54x
    floatdemo    0.781          1.218       1.55x
    oodemo       1.703          9.501       5.31x
    richards2    720            862         1.20x

These results are very interesting: under the CLR, we are between 5x faster
and 3x slower than IronPython 2.0.1, and between 4.8x faster and 1.8x slower
than IronPython 2.6. On the other hand, on mono we are consistently faster
than IronPython, up to 16x. It is also interesting to note that
pypy-cli runs faster on CLR than on mono for all benchmarks except richards2.

I have not investigated yet, but I think that the culprit is the terrible
behaviour of tail calls on CLR: as I already wrote in another blog post,
tail calls are ~10x slower than normal calls on CLR, while being only ~2x
slower than normal calls on mono. richards2 is probably the benchmark that
makes the most use of tail calls, thus explaining why we have a much better
result on mono than on CLR.

The next step is probably to find an alternative implementation that does not
use tail calls: this will probably also improve the time spent by the JIT
compiler itself, which is not reported in the numbers above but which is so
far surely too high to be acceptable. Stay tuned.


Tuesday, October 6, 2009

Hello.

We've just merged the branch which adds float support to the x86 backend.
This means that floating point operations are now super fast
in PyPy's JIT. Let's have a look at an example, provided by
Alex Gaynor
and stolen from the Factor blog.

The original version of the benchmark was definitely tuned for the performance needs of CPython.

For running this on PyPy, I changed it to a slightly simpler version of the
program, and I'll explain a few changes that I made, which reflect current
limitations of PyPy's JIT. They're not very deep and they might already be
gone by the time you're reading this:

Usage of __slots__. This is a bit ridiculous, but we spent quite a bit
of time speeding up normal instances of new-style classes, which are now
very fast, yet instances of classes with __slots__ are slower. To be
fixed soon.

Usage of reduce. This one is even more obscure: reduce is not
recognized as something that produces loops in a program. Moving to
a pure-Python version of reduce fixes the problem.

Using x ** 2 vs x * x. In PyPy, reading a local variable is a
no-op when JITted (the same as reading a local variable in C). However,
multiplication is a simpler operation than the power operation.
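For illustration, the last two points can be addressed along these lines (a sketch; myreduce and magnitude_squared are hypothetical names, not taken from the original program):

```python
def myreduce(func, sequence, initial):
    # A pure-Python reduce: the loop is a plain bytecode loop,
    # which the JIT can trace, unlike the built-in reduce.
    result = initial
    for item in sequence:
        result = func(result, item)
    return result

def magnitude_squared(x, y):
    # Multiplication instead of the power operator in the hot path.
    return x * x + y * y    # rather than x ** 2 + y ** 2
```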

I also included the original Java benchmark. Please note that the
original Java version is similar to my modified one (not the one
specifically tuned for CPython), and while the JVM is much faster,
it's very good that we can even compare :-)

Cheers
fijal
