Thursday, October 15, 2009

First pypy-cli-jit benchmarks

As the readers of this blog already know, I've been working on porting the
JIT to CLI/.NET for the last few months. Now that it's finally possible to get a
working pypy-cli-jit, it's time to do some benchmarks.

Warning: as usual, all of this has to be considered an alpha version:
don't be surprised if you get a crash when trying to run pypy-cli-jit. Of
course, things are improving very quickly, so it should become more and more
stable as days pass.

This time, I decided to run four benchmarks. Note that for all of them we
run the main function once in advance, to let the JIT recognize the hot
loops and emit the corresponding code. Thus, the reported results do
not include the time spent by the JIT compiler itself, but give a good
measure of how good the code generated by the JIT is. At this point in time,
I know that the CLI JIT backend spends way too much time compiling stuff, but
this issue will be fixed soon.

f1.py: this is the classic PyPy JIT benchmark. It is just a function
that does some computationally intensive work with integers.

floatdemo.py: this is the same benchmark involving floating point
numbers that has already been described in a previous blog post.

oodemo.py: this is just a microbenchmark doing object oriented stuff
such as method calls and attribute access.

richards2.py: a modified version of the classic richards.py, with a
warmup call before starting the real benchmark.
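The warmup methodology described above (run the main function once, then time a second run) can be sketched roughly as follows. This is only an illustration: the `main` workload and the `timed_after_warmup` helper are hypothetical names made up for this sketch, not the actual benchmark scripts.

```python
import time

def main(n=200000):
    # Hypothetical integer-heavy loop standing in for a benchmark's main function.
    total = 0
    i = 0
    while i < n:
        total += (i * i) % 7
        i += 1
    return total

def timed_after_warmup(func, *args):
    # Warmup run: gives the JIT a chance to find the hot loops and compile them.
    func(*args)
    # Measured run: only this second call contributes to the reported time.
    start = time.time()
    result = func(*args)
    elapsed = time.time() - start
    return result, elapsed

if __name__ == "__main__":
    result, elapsed = timed_after_warmup(main)
    print("result=%d elapsed=%.3fs" % (result, elapsed))
```

With this shape, whatever time the JIT spends compiling during the first call is deliberately left out of the measurement.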

The benchmarks were run on a Windows machine with an Intel Pentium Dual Core
E5200 2.5GHz and 2GB RAM, with both .NET (CLR 2.0) and Mono 2.4.2.3.

Because of a known Mono bug, if you use a version older than 2.1 you need
to pass the option -O=-branch to mono when running pypy-cli-jit, otherwise it
will just loop forever.

For comparison, we also ran the same benchmarks with IronPython 2.0.1 and
IronPython 2.6rc1. Note that IronPython 2.6rc1 does not work with Mono.

So, here are the results (expressed in seconds) with Microsoft CLR:

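| Benchmark | pypy-cli-jit | ipy 2.0.1 | ipy 2.6 | ipy 2.0.1 / pypy | ipy 2.6 / pypy |
|-----------|--------------|-----------|---------|------------------|----------------|
| f1        | 0.028        | 0.145     | 0.136   | 5.18x            | 4.85x          |
| floatdemo | 0.671        | 0.765     | 0.812   | 1.14x            | 1.21x          |
| oodemo    | 1.25         | 4.278     | 3.816   | 3.42x            | 3.05x          |
| richards2 | 1228         | 442       | 670     | 0.36x            | 0.54x          |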

And with Mono:

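| Benchmark | pypy-cli-jit | ipy 2.0.1 | ipy 2.0.1 / pypy |
|-----------|--------------|-----------|------------------|
| f1        | 0.042        | 0.695     | 16.54x           |
| floatdemo | 0.781        | 1.218     | 1.55x            |
| oodemo    | 1.703        | 9.501     | 5.31x            |
| richards2 | 720          | 862       | 1.20x            |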

These results are very interesting: under the CLR, we are between 5x faster
and 3x slower than IronPython 2.0.1, and between 4.8x faster and 1.8x slower
than IronPython 2.6. On the other hand, on Mono we are consistently faster
than IronPython, up to 16x. It is also interesting to note that
pypy-cli runs faster on CLR than on Mono for all benchmarks except richards2.
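For readers who want to check the arithmetic, here is a small sketch that recomputes the CLR ratios and the "faster/slower" wording from the raw timings. The timings are copied from the CLR results above; the helper names `ratio` and `describe` are made up for this example.

```python
# Timings copied from the Microsoft CLR results above:
# benchmark -> (pypy-cli-jit, IronPython 2.0.1, IronPython 2.6rc1)
CLR_TIMINGS = {
    "f1": (0.028, 0.145, 0.136),
    "floatdemo": (0.671, 0.765, 0.812),
    "oodemo": (1.25, 4.278, 3.816),
    "richards2": (1228, 442, 670),
}

def ratio(other_time, pypy_time):
    """Return other/pypy: above 1 pypy-cli-jit is faster, below 1 it is slower."""
    return other_time / float(pypy_time)

def describe(r):
    """Phrase a ratio as 'N.Nx faster' or 'N.Nx slower' from pypy-cli-jit's side."""
    if r >= 1.0:
        return "%.1fx faster" % r
    return "%.1fx slower" % (1.0 / r)

if __name__ == "__main__":
    for name in sorted(CLR_TIMINGS):
        pypy, ipy201, ipy26 = CLR_TIMINGS[name]
        print("%-10s vs ipy 2.0.1: %s, vs ipy 2.6: %s" % (
            name, describe(ratio(ipy201, pypy)), describe(ratio(ipy26, pypy))))
```

A ratio below 1 is reported as its inverse, so the 0.36x of richards2 against IronPython 2.0.1 comes out as roughly 2.8x slower, which is the "3x slower" quoted above.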

I haven't investigated yet, but I think that the culprit is the terrible
behaviour of tail calls on CLR: as I already wrote in another blog post,
tail calls are ~10x slower than normal calls on CLR, while being only ~2x
slower than normal calls on Mono. richards2 is probably the benchmark that
makes the most use of tail calls, which would explain why we get a much better
result on Mono than on CLR.

The next step is probably to find an alternative implementation that does not
use tail calls: this will probably also reduce the time spent by the JIT
compiler itself, which is not reported in the numbers above but is so far
surely too high to be acceptable. Stay tuned.

Oh, I didn't know about .NET 4 beta. Have you got any link that explains how they fixed the tail call stuff? I'll surely give it a try.

About the .NET integration: no news on this front. Nowadays I'm fully concentrated on the JIT because I need some (possibly good :-)) results for my PhD thesis. When pypy-cli-jit is super-fast, I'll try to make it also useful :-)

@Michael: from the link you posted, it seems that the tail call improvements in .NET 4 are only for x86_64, but my benchmarks were run on 32-bit, so I don't think it makes a difference. Anyway, I'll try to benchmark with .NET 4 soon, thanks for the suggestion.

@Anonymous: the paper is interesting, but I don't think it's usable for our purposes: throwing and catching exceptions is incredibly costly in .NET, so we cannot really use them too heavily. The fact that the paper says nothing about performance is also interesting :-)