Tuesday, August 2, 2011

PyPy is faster than C, again: string formatting

String formatting is probably something you do just about every day in Python,
and never think about. It's so easy, just "%d %d" % (i, i) and you're
done. No thinking about how to size your result buffer, whether your output
has an appropriate NULL byte at the end, or any other details. A C
equivalent might be:

char x[44];
sprintf(x, "%d %d", i, i);

Note that we had to stop for a second and consider how big the numbers might get,
and overestimate the size: 44 = (20 digits for the largest 64-bit number + 1 for
the sign) * 2, + 1 for the space, + 1 for the NUL byte. It took the authors of
this post, fijal and alex, 3 tries to get the math right on this :-)
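
The size arithmetic can be sanity-checked in a couple of lines of Python (a quick sketch, not part of the original post):

```python
# The largest unsigned 64-bit value, 2**64 - 1, has 20 decimal digits.
assert len(str(2**64 - 1)) == 20

# (20 digits + 1 sign) per number, two numbers, + 1 space + 1 NUL byte.
size = (20 + 1) * 2 + 1 + 1
print(size)  # 44
```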

This is fine, except you can't even return x from this function; a fairer
comparison might be:

char *x = malloc(44 * sizeof(char));
sprintf(x, "%d %d", i, i);

x is slightly overallocated in some situations, but that's fine.

But we're not here to just discuss the implementation of string
formatting, we're here to discuss how blazing fast PyPy is at it, with
the new unroll-if-alt branch. Given the Python code:
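
The original post embedded the benchmark's Python source at this point, but it is missing from this copy. Based on the surrounding discussion, it was presumably a loop along these lines (the exact iteration count is our guess):

```python
import time

def main():
    # Format two integers over and over; the result is deliberately unused.
    for i in range(1000000):  # iteration count is an assumption
        "%d %d" % (i, i)

start = time.time()
main()
print("took %.2f seconds" % (time.time() - start))
```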

We ran the Python code under PyPy, at the head of the unroll-if-alt branch, and
compiled the C equivalent with GCC 4.5.2 at -O4 (other optimization levels were
tested; this one produced the best performance). It took 0.85 seconds to
execute under PyPy, and 1.63 seconds with the compiled binary. We
think this demonstrates the incredible potential of dynamic
compilation: GCC is unable to inline or unroll the sprintf call,
because it sits inside of libc.

The malloc version, which as discussed above is more comparable to the Python,
gives a result of 1.96 seconds.

Summary of performance:

Platform                Time     Relative to C
GCC (stack)             1.63s    1x
GCC (malloc)            1.96s    0.83x
CPython                 10.2s    0.16x
PyPy (unroll-if-alt)    0.85s    1.9x

Overall PyPy is almost 2x faster. This is a clear win for dynamic
compilation over static: the sprintf function lives in libc and so
cannot be specialized for the constant format string, which has to be parsed
every time it's executed. In the case of PyPy, we specialize
the assembler if we detect the left-hand side of the modulo operator
to be a constant string.

Cheers,
alex & fijal


How are those two loops equivalent? You're not printing anything in the Python loop. I/O buffering etc. can eat quite a bit of runtime. It would also be nice to see what the particular improvements in this "unroll-if-alt" branch are.

This doesn't really have anything to do with dynamic compilation, but cross module optimization. There are static compilers, such as the Glasgow Haskell Compiler, that do this. If the compilation strategy depended on runtime data (e.g. measure hot spots), it would be dynamic compilation.

*yawn* If you want to see ridiculously fast string formatting, look at Boost's Spirit library (specifically Karma). Small test case, but the point is well illustrated: http://www.boost.org/doc/libs/1_47_0/libs/spirit/doc/html/spirit/karma/performance_measurements/numeric_performance/int_performance.html Or look at Spirit's input parser for even integers: http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html

What machine are you on that an int is 64 bits? Hardly anybody uses ILP64 or SILP64 data models ( http://en.wikipedia.org/wiki/64-bit#Specific_C-language_data_models ). Maybe a fourth try is in order? :P

So when are you going to teach PyPy that the result of an unused string formatting can be deleted, and then delete the loop? ;)

I'm not sure how you'd get there from a tracing JIT, though. With Python, you still have to call all the formatting and stringification methods because they might have side effects. You only get to know that the entire operation is a no-op after you've inlined everything, but by then it will be at a low enough representation that it's hard to tell.

@Anonymous 2: Even if all the time were spent in malloc/free, PyPy has to dynamically allocate the string data structure, as well as provide a buffer to fill with the characters from the integers, since it has no way of knowing how much space will be needed (could be a custom integer class).

However, you're right that malloc and free are slow and a good gc system would have a faster allocator.

Please read the GCC manual. There you will see that every optimization level above 3 is handled as if it were 3: '-O4' is nothing else than '-O3'.

It is also known that optimizing with -O3 may lead to several problems at runtime (e.g. memory delays for short programs or memory allocation failures in larger programs). That's why the recommended optimization level is '2' (or 's' for embedded systems) and not '3'.

For all of you complaining about the test environment: PyPy would still have to do that internally. If they should be truly comparable, then you would also need to include snprintf inside the loop, making C even slower. Also, I doubt you will get a 200% performance boost from a scheduler change.

unroll-if-alt will be included in 1.6 right? Also when will 1.6 be released?

@Andrew, @hobbs: Oh, sorry, I overlooked the "s" in "sprintf". It would still be nice to compare the generated machine code to explain the differences.

Whenever someone claims that language L1 implementation A is faster than language L2 implementation B, there are obvious questions about (1) the fairness of the comparison and (2) what is being measured. In this case PyPy is specializing on the format string (does that require library annotations?), which a C compiler could do in principle here (but probably doesn't). So I'm always a bit suspicious when I see these kinds of comparisons.

@Johan: GHC's cross-module optimization often comes at the expense of binary compatibility. A JIT has a big advantage here.

What performance impact does the malloc/free produce in the C code? AFAIK Python allocates memory in larger chunks from the operating system. Probably Python does not have to call malloc again after it has allocated the first chunk.

AFAIK each malloc/free crosses the boundaries between user-mode/kernel-mode.

So, IMHO you should compare against the numbers of a C program which does not allocate dynamic memory more than once and uses an internal memory management system.

The point here is not that the Python implementation of formatting is better than the C standard library, but that dynamic optimisation can make a big difference. The first time the formatting operator is called its format string is parsed and assembly code for assembling the output generated. The next 999999 times that assembly code is used without doing the parsing step. Even if sprintf were defined locally, a static compiler can’t optimise away the parsing step, so that work is done redundantly every time around the loop.
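
The parse-once effect described here can be imitated in plain Python by splitting the format string ahead of time and reusing the pieces. This is only a toy sketch of the idea (handling %d specifiers only), not how PyPy's JIT actually works:

```python
def compile_format(fmt):
    """Parse a format string containing only %d specifiers once,
    and return a function that reuses the parsed pieces."""
    chunks = fmt.split("%d")  # done once, ahead of time

    def formatted(*args):
        # Interleave the stringified arguments with the literal chunks;
        # no parsing happens on this fast path.
        out = [chunks[0]]
        for arg, chunk in zip(args, chunks[1:]):
            out.append(str(arg))
            out.append(chunk)
        return "".join(out)

    return formatted

fmt = compile_format("%d %d")
print(fmt(3, 4))  # 3 4
```

Each subsequent call to fmt skips the parsing step entirely, which is the same work-saving that the JIT-specialized assembly achieves.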

In a language like Haskell something similar happens. A string formatting function in the style of sprintf would take a format string as a parameter and return a new function that formats its arguments according to that string. The new function corresponds to the specialized assembly code generated by PyPy’s JIT. I think if you wanted to give the static compiler the opportunity to do optimizations that PyPy does at runtime you would need to use a custom type rather than a string as the formatting spec. (NB my knowledge of functional-language implementation is 20 years out of date so take the above with a pinch of salt.)

The C code shown does not do any malloc/free. The sprintf function formats the string into the char array x, which is allocated on the stack. It is highly unlikely that the sprintf function itself mallocs any memory.

You wrote: "We think this demonstrates the incredible potential of dynamic compilation, ..."

I disagree. You tested a microbenchmark. Claims about compiler or language X winning over Y should be made after observing patterns in real programs. That is: execute or analyse real C programs which are making use of 'sprintf', record their use of 'sprintf', create a statistics out of the recorded data and then finally use the statistical distributions to create Python programs with a similar distribution of calls to '%'.

@Anonymous: this branch, unroll-if-alt, will not be included in the release 1.6, which we're doing right now (it should be out any day now). It will only be included in the next release, which we hope to do soonish. It will also be in the nightly builds as soon as it is merged.

Is string/IO performance in general being worked on in Pypy? Last I looked Pypy showed it was faster than CPython in many cases on its benchmarks page, but for many string/IO intensive tasks I tried Pypy v1.5 on, it was slower.

@Connelly yes, for some definition of working (being thought about). That's one reason why twisted_tcp is slower than other twisted benchmarks. We however welcome simple benchmarks as bugs on the issue tracker.

This is a horribly flawed benchmark which illustrates absolutely nothing. First of all, an optimizing JIT should be (easily) able to detect that your inner loop has no side effects and optimize it away. Secondly, with code like that you should expect all kinds of weird transformations by the compiler, hence you can't be really sure what you are comparing here. As many here have pointed out, you should compare the output assembly.

Anyway, if you really want to do a benchmark like that, do it the right way. Make the loop grow a string by continuous appending and write the string to the file in the end (time the loop only). This way you will get accurate results which really compare the performance of two compilers performing the same task.
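
A sketch of the benchmark the commenter proposes, in Python (our interpretation, not code from the post): grow the output by continuous appending, and time only the loop.

```python
import io
import time

def bench(n):
    buf = io.StringIO()  # output grown by continuous appending
    start = time.time()
    for i in range(n):
        buf.write("%d %d\n" % (i, i))
    elapsed = time.time() - start  # time the loop only
    # Writing buf.getvalue() to a file would happen here, outside the timing.
    return elapsed, buf.getvalue()

elapsed, text = bench(100000)
print("%.3f s, %d bytes" % (elapsed, len(text)))
```

Because every formatted string is retained in the buffer, the compiler cannot legally delete the loop as dead code, which addresses the objection above.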

Note for the JS example: I did not want to use something like i + " " + i because it bypasses the format function call. Obviously, using the + operator the nodejs example would be much faster (but pypy probably as well).

Also, I used PyPy 1.5 as I did not find any precompiled PyPy 1.6 for OS X.

@tt: This is a very inefficient C/C++ implementation of the idea "make a loop grow a string by continuous appending and write the string to the file in the end". In addition, it appears to be an uncommon piece of C/C++ code.

Well, I never said anything about writing super efficient C code. Anyway, I don't see how you want to implement string formatting more efficiently, if we talk about the general usage scenario. You can't really reuse the old string buffer; you basically have to allocate a new one each time the string grows. Or pre-allocate a larger string buffer and do some substring copies (which results in much more complicated code). Anyway, malloc() on OS X is very fast.

My point is: even this C code, which you call inefficient, is around 6 times faster than pypy 1.5.

The comparison is good because both use standard language features to do the same thing; using a third-party lib is not the same. Then I would have to code the same implementation in RPython, and people would still complain, due to RPython often being faster than C regardless.

Python could have detected that the loop is not doing anything, but given that one of the values could have had a __str__ call, that could've broken some code. Anyway, the C compiler could also see that you didn't do anything with the value and optimize it away the same way.

@Antiplutocrat: Honestly, I expected a bit more objectivity from posters here. I am really disappointed that you compare me to "haters" (whoever that may be).

Your point about unroll-if-alt is absolutely valid, and I myself have explicitly stated that I did not use that feature. At no point have I claimed that the original blog post's conclusion was wrong; it is still very well possible that PyPy 1.6 is faster than C in this usage scenario. The main goal of my post was to make clear that the original benchmarks were flawed, as they grant the compiler too much room for unpredictable optimizations. I believe that my benchmark code produces more realistic results, and I suggest that the authors of this blog entry re-run the benchmark using my code (or something similar which controls for unpredictable optimizations).

@Anonymous: I agree with your other paragraphs, but not with the one where you wrote that "... OLDER version (4.5.x) of GCC whilst a newer version (4.6.x) is available with major improvements to the optimizer in general".

I am not sure, what "major improvements" in GCC 4.6 do you mean? Do you have benchmark numbers to back up your claim?

As far as well-written C code is concerned, in my opinion, there haven't been any "major improvements" in GCC for more than 5+ years. There have been improvements of a few percent in a limited number of cases - but nothing major.

Even LTO (link-time optimization (and let's hope it will be safe/stable to use when GCC 4.7 is released)) isn't a major boost. I haven't seen LTO being able to optimize calls to functions living in dynamic libraries (the bsearch(3) function would be a nice candidate). And I also haven't seen GCC's LTO being able to optimize calls to/within the Qt GUI library when painting pixels or lines onto the screen.

The main point of the PyPy article was that run-time optimizations in PyPy have a chance of surpassing GCC in certain cases.

Personally, I probably wouldn't willingly choose to work on a project like PyPy - since, err, I believe that hard-core JIT optimizations on a dynamically typed language like Python are generally a bad idea - but I am (in a positive way) eager to see what the PyPy team will be able to do in this field in the years to come.

@⚛: I indeed do not have benchmarks for these claims, but GCC 4.6 indeed added some newer optimization techniques to its assortment. Maybe these did not have a significant influence in said case, but they might have somewhere else. I'm merely saying: you can't really compare the latest hot inventions with something that has been surpassed (e.g. compare Java 7 to a program output by Visual C++ back from the VS 2003 IDE).

All in all, I'm not saying that Python sucks, and I don't want to sound like a fanboy (on the contrary, Linux uses a great deal of Python, and if this could mean a major speedup, then why the hell not ;).

I guess I was pissed off because the article sounds very much fanboyish and pro-Python (just look at the title alone).