Friday, February 4, 2011

PyPy faster than C on a carefully crafted example

Good day everyone.

The recent round of optimizations, especially loop invariant code motion,
has been very good for small to medium examples. Work is ongoing to make
them scale to larger ones; in the meantime, there are a few examples worth
showing how well they perform. The following example, besides benefiting
from loop invariant code motion, also shows a difference between static and
dynamic compilation. In fact, after applying all the optimizations a C
compiler does, only a JIT can use the extra bit of runtime information to
run even faster.

Hence, PyPy is 50% faster than C on this carefully crafted example. The
reason is obvious: a static compiler can't inline across file boundaries.
In C you can somewhat circumvent that, but it still wouldn't work with
shared libraries. In Python, even though the whole import system is
completely dynamic, the JIT can dynamically find out what can be inlined.
The example would work equally well for Java and other decent JITs; it's
nevertheless good to see that we play in the same space :-)
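To make the setup concrete, here is a minimal sketch of the kind of benchmark described above. The module layout, function names, and loop bound are illustrative assumptions, not the original benchmark code: the idea is that `add()` lives in a separate module (the analogue of a separate `.c` file compiled on its own), so a static C compiler cannot inline the call, while PyPy's JIT traces through the import boundary and inlines it anyway.

```python
# x.py (in the real setup) would contain only:
#
#     def add(a, b):
#         return a + b
#
# and the driver below would do "from x import add". The definition is
# inlined here only so the sketch is self-contained.

def add(a, b):           # stands in for the cross-module add()
    return a + b

def main(n):
    i = 0
    a = 0.0
    while i < n:
        a = add(a, 1.0)  # the call the JIT inlines across module boundaries
        i += 1
    return a

if __name__ == "__main__":
    print(main(1000000))
```

Under CPython every iteration pays the full cost of a Python-level call; under PyPy the traced loop contains just the float addition, which is what makes the comparison with separately compiled C interesting.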

Cheers,
fijal

EDIT: Updated GCC version


There's another simple case where PyPy could (in principle) do very much better than standard C: turn pow(x, i) into sqrt(x*x*x) when i == 3/2, and other such reductions. In practice, if you don't know what i is at compile time, you often bundle the simplifications into a function (at the cost of some ifs), but a JIT could do a very nice job on this automagically whenever i is fixed, which it usually is.
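The reduction this comment describes can be checked directly; the sample values and tolerance below are illustrative assumptions, and the two function names are made up for the sketch:

```python
import math

def pow_general(x):
    # The general form: a full pow() with a non-integer exponent.
    return x ** 1.5

def pow_reduced(x):
    # The reduced form: sqrt(x*x*x), typically cheaper than pow().
    return math.sqrt(x * x * x)

# The two forms agree up to floating-point rounding.
for x in (0.5, 2.0, 123.456):
    assert abs(pow_general(x) - pow_reduced(x)) <= 1e-12 * pow_general(x)
```

A JIT that observes the exponent is always 1.5 at runtime could apply this rewrite automatically, which is the point the comment is making.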

@haypo print the result so the loop doesn't get removed as dead code. Besides, the real problem is that -flto is unfair, since Python imports more closely resemble shared libraries than statically compiled files.

First, it's conceptual: C is almost as optimized as assembly (it's often referred to as a super-assembler), so even if PyPy ends up generating some assembly code, it first has to evaluate the runtime environment to figure out the types of variables before it can emit that code, and all this process is not free... so PyPy can only asymptotically approach the level of C and assembly.

Second, the test is flawed: I made a slight modification that shouldn't change the results: I inlined add() in both the Python and the C versions. Oh, surprise: PyPy keeps the same time, whereas the C version becomes 4x faster than before (without inlining).

So to make it fair, we need to use the best capabilities of both languages:
- Python: I'm sure the author provided the best Python implementation, and the fact that inlining add() doesn't change the results kinda proves this.
- C: when you inline the function, you get much faster code.

@Eric This post is not trying to argue that Python is "better" or even faster than C. It is just pointing out that certain classes of optimizations (i.e. whole program optimizations) come naturally to the PyPy JIT.

This is, of course, only one small facet of why a program runs fast. The author admits that it is a contrived example to illustrate the point.

Taking the point to an extreme, one could see a PyPy program run faster than a C program if the C program made many calls to simple shared libraries. For example, if one dynamically links a C stdlib into their program, and uses it heavily, the equivalent python code may conceivably run faster.

Please read the title of this article again: "PyPy faster than C on a carefully crafted example"

Based on a specific example or not, it doesn't matter: I'm simply not comfortable reading strong statements like this that are obviously false to any serious computer scientist and misleading to beginners. It's false because it's the conclusion of a test which is biased.

The root of benchmarking is to get rid of any bias. In this case the obvious bias is that the PyPy side is optimized and the C side isn't (as demonstrated above with inlined functions).

You can't transpose to real life only what you want and not the rest: your argument that in real life the C code could use an external library and hence be slower is valid, but then you have to compare against real-life Python scripts, which can't be optimized by PyPy as much as this crafted example. So in real life you get C code that may be slowed down a bit by dynamic linking, and Python scripts that are much slower, because PyPy isn't ready to match C speed for everything (yet).

If you want to use a crafted Python example, you have to compare it to a crafted C example, so that you can compare apples with apples.

All that is methodology; that said, the JIT is quite powerful, and it's impressive in itself to beat CPython by a large margin.

Eric: Your comments about "real life" are irrelevant - the post is about a specific, contrived example. I don't think anyone would argue that a high-level, garbage-collected language like Python could ever beat out C in general - it's simply a demonstration that, in a very specific instance, equivalent code in Python and C can run faster in Python because the JIT makes optimizations that can't occur at compile time.

point taken, but do update the article to take my remark into account: both the title and the conclusion of the "demonstration" are false, even on a contrived example, as you can barely find any C code that would be slower than the code generated by your JIT, for the simple reason that C is really close to assembly and a JIT adds overhead.

Please don't digress, what I say is simple: the article states that PyPy generates code faster than C on a crafted example. I demonstrate there is more optimized C code than the author's, hence that the whole article is wrong... end of story.

You're right, people very often use dynamic linking. However the following is not a reasonable piece of Python code:

def add(a, b): return a + b

People rarely use that and more importantly they don't write a loop that calls it 1 billion times.

The point is that the reasoning spans two levels (hence is flawed/biased):
- in Python, the author took a crafted piece of Python that is not meaningful in real life, chosen because it has the property of doing what he wants at the PyPy level;
- in C, the author uses a very common mechanism that isn't fully optimized (not as much as the Python/PyPy side is).

I know you will not agree, since you're all proud that "PyPy is faster than C" (lol, it's nonsense even on a "crafted example"), but you have to compare apples with apples.

@Eric what you don't understand is the point of the article. The actual point is to demonstrate a nice property of the PyPy JIT, which is able to generate fast code when it can. Comparing to C in this manner proves that PyPy's generated machine code is relevant with regard to speed. Of course this example is fragile because it relies on suboptimal C code, but that serves only to prove the point about PyPy.