Friday, January 25, 2008

In the last days we improved platform-support of PyPy's Python interpreter.
Jean-Paul Calderone has been tirelessly working for some time now on setting up a
buildbot for translating and testing PyPy. So far the basic mechanisms are
working and the buildbot is running on various machines, including some that
Michael Schneider (bigdog) lets us use, one of them being a Windows machine,
the other one with a 64bit Linux (lots of thanks to those two, you are
awesome!).

What is still missing is a nice way to visualize the test results to quickly see
which tests have started failing on which platforms. There is a prototype
already, which still needs some tweaking.

The availability of these machines has triggered some much-needed bug-fixing in
PyPy to make our Python interpreter work better on Windows and on 64 bit Linux.
Maciek and Michael Schneider worked on this quite a bit last week, with the
result that PyPy supports many more extension modules now on Windows and 64 bit
Linux. Since we now have the buildbot the hope is that the support also won't
disappear soon :-).

In the last days we improved platform-support of PyPy's Python interpreter.
Jean-Paul Calderone has been tirelessly working for some time now on setting up a
buildbot for translating and testing PyPy. So far the basic mechanisms are
working and the buildbot is running on various machines, including some that
Michael Schneider (bigdog) lets us use, one of them being a Windows machine,
the other one with a 64bit Linux (lots of thanks to those two, you are
awesome!).

What is still missing is a nice way to visualize the test results to quickly see
which tests have started failing on which platforms. There is a prototype
already, which still needs some tweaking.

The availability of these machines has triggered some much-needed bug-fixing in
PyPy to make our Python interpreter work better on Windows and on 64 bit Linux.
Maciek and Michael Schneider worked on this quite a bit last week, with the
result that PyPy supports many more extension modules now on Windows and 64 bit
Linux. Since we now have the buildbot the hope is that the support also won't
disappear soon :-).

Monday, January 21, 2008

(yes, C as in language, not c as in speed of light). I looked recently at the great computer language shootout, for some benchmarks and to make some speed comparisons. I use this benchmark, modified it to be rpythonic-enough and compared speeds. The code is here (the only change from the Python version was to create a class instead of tuple, so actually this version is more OO). Also the benchmark is very likely flawed because it favours better GCs :).
So, here we go:

Language:

Time of run (for N=14):

Python version running on Python 2.5.1, distribution

25.5s

Python version running on PyPy with generational GC

45.5

Python with psyco

20s

RPython translated to C using PyPy's generational GC

0.42s

compiling the Haskell version with GHC 6.6.1

1.6s

compiling the C version with gcc 4.1.2 -O3 -fomit-frame-pointer

0.6s

Also worth noticing is that when using psyco with the original version (with tuples) it is very fast (2s).

So, PyPy's Python interpreter is 80% slower than CPython on this (not too horrible), but RPython is 40% faster than gcc here. Cool. The result is mostly due to our GC, which also proves that manual memory-management can be slower than garbage collection in some situations. Please note that this result does not mean that RPython is meant for you. It requires a completely different mindset than the one used to program in Python. Don't say you weren't warned! :-)

(yes, C as in language, not c as in speed of light). I looked recently at the great computer language shootout, for some benchmarks and to make some speed comparisons. I use this benchmark, modified it to be rpythonic-enough and compared speeds. The code is here (the only change from the Python version was to create a class instead of tuple, so actually this version is more OO). Also the benchmark is very likely flawed because it favours better GCs :).
So, here we go:

Language:

Time of run (for N=14):

Python version running on Python 2.5.1, distribution

25.5s

Python version running on PyPy with generational GC

45.5

Python with psyco

20s

RPython translated to C using PyPy's generational GC

0.42s

compiling the Haskell version with GHC 6.6.1

1.6s

compiling the C version with gcc 4.1.2 -O3 -fomit-frame-pointer

0.6s

Also worth noticing is that when using psyco with the original version (with tuples) it is very fast (2s).

So, PyPy's Python interpreter is 80% slower than CPython on this (not too horrible), but RPython is 40% faster than gcc here. Cool. The result is mostly due to our GC, which also proves that manual memory-management can be slower than garbage collection in some situations. Please note that this result does not mean that RPython is meant for you. It requires a completely different mindset than the one used to program in Python. Don't say you weren't warned! :-)

Saturday, January 19, 2008

After having spent the last few days on understanding PyPy's JIT,
today I went back hacking the clr module. As a result, it is now
possible to import and use external assemblies from pypy-cli,
including Windows Forms

Here is a screenshot of the result you get by typing the following at
the pypy-cli interactive prompt:

Unfortunately at the moment you can't do much more than this, because
we still miss support for delegates and so it's not possibile to
handle events. Still, it's a step in the right direction :-).

After having spent the last few days on understanding PyPy's JIT,
today I went back hacking the clr module. As a result, it is now
possible to import and use external assemblies from pypy-cli,
including Windows Forms

Here is a screenshot of the result you get by typing the following at
the pypy-cli interactive prompt:

This is still to be considered an alpha version; there are few known
bugs and probably a lot of unknown ones :-), so don't expect it to
work in every occasion. Still, it's a considerable step towards real
world :-).

A while ago Amit Regmi, a student from Canada, started working on the
clr module improvements branch as a university project.

During the sprint Carl Friedrich, Paul and me worked more on it and
brought it to a mergeable state.

It adds a lot of new features to the clr module, which is the
module that allows integration between pypy-cli (aka PyPy.NET) and
the surrounding .NET environment:

full support to generic classes;

a new importer hook, allowing things like fromSystemimportMath and so on;

.NET classes that implements IEnumerator are treated
as Python iterators; e.g. it's is possile to iterate over them
with a for loop.

This is still to be considered an alpha version; there are few known
bugs and probably a lot of unknown ones :-), so don't expect it to
work in every occasion. Still, it's a considerable step towards real
world :-).

Friday, January 18, 2008

Over the years PyPy has (ab?)used various external software for different
purposes, and we've discovered bugs in nearly all of them, mostly by pushing them
to their limits. For example, many compilers are not happy with 200MB of
source in one file. The Microsoft C compiler has a limit of 65536 lines of code
per file and the CLI was raising "System.InvalidProgramException: Method
pypy.runtime.Constants:.cctor () is too complex.", where too complex probably
means "too long". Just for fun, today we collected all projects we could think of
in which we found bugs:

So one could say that PyPy is really just the most expensive debugging tool
ever :-).

Over the years PyPy has (ab?)used various external software for different
purposes, and we've discovered bugs in nearly all of them, mostly by pushing them
to their limits. For example, many compilers are not happy with 200MB of
source in one file. The Microsoft C compiler has a limit of 65536 lines of code
per file and the CLI was raising "System.InvalidProgramException: Method
pypy.runtime.Constants:.cctor () is too complex.", where too complex probably
means "too long". Just for fun, today we collected all projects we could think of
in which we found bugs:

Monday, January 14, 2008

The Leysin sprint has started since yesterday morning in the usual location. The view is spectacular (see photo) the weather mostly sunny. The following people are sprinting:

Maciej Fijalkowski

Armin Rigo

Toby Watson

Paul deGrandis

Antonio Cuni

Carl Friedrich Bolz

So it is a rather small sprint.

We started working on various features and performance improvements for the high level backends (JVM and .NET) and on implementing ctypes for PyPy. Later this week we plan to spend a few days on the JIT, because Anto and I both need to get into it for our respective university projects.

The Leysin sprint has started since yesterday morning in the usual location. The view is spectacular (see photo) the weather mostly sunny. The following people are sprinting:

Maciej Fijalkowski

Armin Rigo

Toby Watson

Paul deGrandis

Antonio Cuni

Carl Friedrich Bolz

So it is a rather small sprint.

We started working on various features and performance improvements for the high level backends (JVM and .NET) and on implementing ctypes for PyPy. Later this week we plan to spend a few days on the JIT, because Anto and I both need to get into it for our respective university projects.

Friday, January 11, 2008

PyPy contains a framework for writing custom Garbage Collectors, and a few simple GCs have been written in this framework. A common issue with all these GCs is how to find all the stack roots, i.e. all the pointers to live GC-managed objects currently stored in local variables, in all the callers of the current function. The current solution is to maintain a custom shadow stack of roots, where all functions push and pop copies of their local variables of type "GC pointer". Clearly this is an overhead. Can we remove it?

LLVM has recently grown some support for this. By emitting markers in the LLVM source and with the help of a bit of custom C++ code, we can generate stack maps for the functions compiled by LLVM. Then, with 100% non-portable code in our framework GC's root finding algorithm, we can walk the machine stack and locate where in each stack frame LLVM stores the GC pointers. (Yes, I mean non-portable: LLVM offers no help for doing that. Maybe it will at some point, though I didn't manage to explain why this is an issue to people working on this in LLVM so far...). I've tried that approach in the llvmgcroot branch. Over the manually-managed shadow stack, this gives speed improvements which are, very roughly, on the order of 5%.

Note that this prevents some optimizations in LLVM, because it forces it to allocate all local variables of type "GC pointer" in the stack; it cannot keep them in registers and it must assume that they can be changed more or less at any time (as moving GCs do). Can we do better?

Actually, yes. We can even do better in the C backend, using a GCC hack. GCC has this nice extension:

asm("bla", constrains);

This is meant to generate assembler instructions directly from C. Internally, GCC considers the whole asm() as a single regular instruction of its intermediate language; the constrains are expressed in the same way as the constrains for all the prebuilt intermediate language instructions. They express things like input and output operands of the instruction, whether they can live in memory or in registers, whether the whole instruction has side-effects, etc. The nice thing about asm() is that it doesn't kill any optimization whatsoever in GCC - it's your job to make sure that you use the correct constrains.

So what I've tried in the asmgcroot branch is to use asm() as markers. In this branch, the C backend produces code like this after each function call, for each local variable containing a live GC pointer:

asm("/* GCROOT %0 */" : "=g"(localvar) : "0"(localvar) : "memory");

This causes GCC to emit the following line in the assembler file it generates:

/* GCROOT register-or-memory-containing-localvar */

I won't go in the details of the asm() line above - the constrains are just enough to make sure that GCC doesn't optimize too much, but don't prevent most optimizations from occurring. For example, the localvar can be in a register.

The assembler will just ignore the line above; it is a comment. But what we can do is write our own tool parsing the assembler files. This tool locates the /* GCROOT */ comments and follows where the register or memory location in the comment comes from (to do this it must follow the control flow and data flow of the function). This allows it to build a stack map: for each call instruction it knows exactly which registers and frame stack locations contain a live GC pointer. The stack map is then emitted in an extra assembler file that we link with the rest. As with LLVM above, the stack map is then used at run-time by non-portable code written in our GC's stack root tracker.

Yes, that's rather insane. But at least, we don't need to modify the assembler file - just read it. If GCC is too clever in its optimizations, the custom parser will get lost and complain cleanly; but I think that it is relatively safe in the sense that GCC optimizations should not be able to make the custom parser produce wrong results.

The branch is not merged because it's probably too insane to merge (not to mention, it's probably not portable to non-GCC compilers, and it is completely platform-specific). Still, it gives good results, better that the pure LLVM approach - on the order of 10% to 25% speed-ups for pypy-c.

PyPy contains a framework for writing custom Garbage Collectors, and a few simple GCs have been written in this framework. A common issue with all these GCs is how to find all the stack roots, i.e. all the pointers to live GC-managed objects currently stored in local variables, in all the callers of the current function. The current solution is to maintain a custom shadow stack of roots, where all functions push and pop copies of their local variables of type "GC pointer". Clearly this is an overhead. Can we remove it?

LLVM has recently grown some support for this. By emitting markers in the LLVM source and with the help of a bit of custom C++ code, we can generate stack maps for the functions compiled by LLVM. Then, with 100% non-portable code in our framework GC's root finding algorithm, we can walk the machine stack and locate where in each stack frame LLVM stores the GC pointers. (Yes, I mean non-portable: LLVM offers no help for doing that. Maybe it will at some point, though I didn't manage to explain why this is an issue to people working on this in LLVM so far...). I've tried that approach in the llvmgcroot branch. Over the manually-managed shadow stack, this gives speed improvements which are, very roughly, on the order of 5%.

Note that this prevents some optimizations in LLVM, because it forces it to allocate all local variables of type "GC pointer" in the stack; it cannot keep them in registers and it must assume that they can be changed more or less at any time (as moving GCs do). Can we do better?

Actually, yes. We can even do better in the C backend, using a GCC hack. GCC has this nice extension:

asm("bla", constrains);

This is meant to generate assembler instructions directly from C. Internally, GCC considers the whole asm() as a single regular instruction of its intermediate language; the constrains are expressed in the same way as the constrains for all the prebuilt intermediate language instructions. They express things like input and output operands of the instruction, whether they can live in memory or in registers, whether the whole instruction has side-effects, etc. The nice thing about asm() is that it doesn't kill any optimization whatsoever in GCC - it's your job to make sure that you use the correct constrains.

So what I've tried in the asmgcroot branch is to use asm() as markers. In this branch, the C backend produces code like this after each function call, for each local variable containing a live GC pointer:

asm("/* GCROOT %0 */" : "=g"(localvar) : "0"(localvar) : "memory");

This causes GCC to emit the following line in the assembler file it generates:

/* GCROOT register-or-memory-containing-localvar */

I won't go in the details of the asm() line above - the constrains are just enough to make sure that GCC doesn't optimize too much, but don't prevent most optimizations from occurring. For example, the localvar can be in a register.

The assembler will just ignore the line above; it is a comment. But what we can do is write our own tool parsing the assembler files. This tool locates the /* GCROOT */ comments and follows where the register or memory location in the comment comes from (to do this it must follow the control flow and data flow of the function). This allows it to build a stack map: for each call instruction it knows exactly which registers and frame stack locations contain a live GC pointer. The stack map is then emitted in an extra assembler file that we link with the rest. As with LLVM above, the stack map is then used at run-time by non-portable code written in our GC's stack root tracker.

Yes, that's rather insane. But at least, we don't need to modify the assembler file - just read it. If GCC is too clever in its optimizations, the custom parser will get lost and complain cleanly; but I think that it is relatively safe in the sense that GCC optimizations should not be able to make the custom parser produce wrong results.

The branch is not merged because it's probably too insane to merge (not to mention, it's probably not portable to non-GCC compilers, and it is completely platform-specific). Still, it gives good results, better that the pure LLVM approach - on the order of 10% to 25% speed-ups for pypy-c.

Tuesday, January 8, 2008

Armin and me have been working on PyPy's parser and bytecode compiler for the Python language in the last days. Armin implemented several bytecode optimizations that CPython has since a while whereas I tried to refactor our tokenizer and parser (because our existing parser is rather slow and also not very nice code). Armin is mostly done whereas the new parser is not very far yet.

What is done, however, is the Python tokenizer. It is implemented in the usual way, by using a set of regular expressions to generate a deterministic finite automaton (DFA). This automaton is then turned into a big function which does the actual tokenization. Of course the picture is not quite as simple for Python, because it is not possible to tokenize Python using only regular expressions. To generate the proper "indent" and "dedent" tokens it would be necessary to keep state (the previous indentation levels) which a DFA cannot do. This is solved by postprocessing the tokens that the tokenizer produces to turn whitespace tokens into the proper indent and dedent tokens.

For debugging purposes I implemented a visualization tool for DFAs using PyPy's pygame-based graph viewer. The graph viewer is able to visualize interactively any graph given in the graph-description language of Graphviz. Looking at the tokenizing DFA for Python is rather instructive, both for understanding how tokenizing works and (maybe) for understanding the Python language. To try it, download the dot file of the DFA and run from a pypy checkout:

$ python pypy/bin/dotviewer.py tokenizer.dot

The following is a screenshot of the graphviewer:

For people who don't want do checkout PyPy I generated a (rather big) png for the DFA.

Next thing I would like to do (apart from actually finishing the parser, of course :-) ) is visualize the Python grammar itself using syntax diagrams or something similar. So far I couldn't really find a program to do that, though.

Armin and me have been working on PyPy's parser and bytecode compiler for the Python language in the last days. Armin implemented several bytecode optimizations that CPython has since a while whereas I tried to refactor our tokenizer and parser (because our existing parser is rather slow and also not very nice code). Armin is mostly done whereas the new parser is not very far yet.

What is done, however, is the Python tokenizer. It is implemented in the usual way, by using a set of regular expressions to generate a deterministic finite automaton (DFA). This automaton is then turned into a big function which does the actual tokenization. Of course the picture is not quite as simple for Python, because it is not possible to tokenize Python using only regular expressions. To generate the proper "indent" and "dedent" tokens it would be necessary to keep state (the previous indentation levels) which a DFA cannot do. This is solved by postprocessing the tokens that the tokenizer produces to turn whitespace tokens into the proper indent and dedent tokens.

For debugging purposes I implemented a visualization tool for DFAs using PyPy's pygame-based graph viewer. The graph viewer is able to visualize interactively any graph given in the graph-description language of Graphviz. Looking at the tokenizing DFA for Python is rather instructive, both for understanding how tokenizing works and (maybe) for understanding the Python language. To try it, download the dot file of the DFA and run from a pypy checkout:

$ python pypy/bin/dotviewer.py tokenizer.dot

The following is a screenshot of the graphviewer:

For people who don't want do checkout PyPy I generated a (rather big) png for the DFA.

Next thing I would like to do (apart from actually finishing the parser, of course :-) ) is visualize the Python grammar itself using syntax diagrams or something similar. So far I couldn't really find a program to do that, though.