Tuesday, October 14, 2008

Background

Finally, the JIT post :-). First some background: Despite our plans at the end
of the EU period, PyPy's Python interpreter didn't get a good and widely
applicable JIT in the last year. The reason for that was that we discovered that
although the basic idea to generate JIT compilers is good, the concrete
prototype made during the EU period is basically flawed. It could have
been pushed a bit farther, but would have run into deep troubles eventually. One
of the problems would have been performance instability: change a seemingly
unrelated bit in your source program, and the performance changes in unexpected
ways, which is clearly not desirable. Another problem with that old approach is
that too much assembler code is generated, leading to memory problems, and also
that the generated assembler is bad in various ways, e.g. it is hard in that
approach to do proper register allocation.

Therefore we decided that it would be worthless to pursue this direction much
further. Instead we tried to research approaches to fixing the inherent
problems. This research was largely done in Prolog and I eventually wrote my
Master's thesis about it. From the Prolog work we got some good insights into
what needs to be done and what sorts of techniques are needed. Also, it inspired
Armin to do some more exploration on a small Python prototype which used the
lessons learned from Prolog and also some additional ideas from tracing JITs. So
far, however, the prototype is neither in RPython, nor much integrated with
PyPy.

This research is not the only thing happening in the JIT-area. During the last
year, Antonio Cuni was working on bringing the JIT to pypy-cli. This
consisted mostly of writing a .NET backend for the old JIT-generator. Some
further work is being done since August by John Witulski, who is writing an
AMD64 backend for the JIT-generator for his Bachelor's thesis.

Where to go from there

During the sprint we discussed in which directions we should continue now. We
plan to work quite a bit on the JIT in the coming months. Both Armin and Anto
are in Düsseldorf for four months, and them and me plan to mostly work on the
JIT (as well as giving a lecture on "Dynamic Programming Languages", trying to
ensnare some more students).

The first step will be to experiment a bit more with Armin's prototype. So far
it looks rather promising, but there are some unsolved issues that we need to
look into first. The first issue is to think a bit about how to efficiently do
profiling to compile only important code paths. The other large issue are
so-called "virtualizables". Roughly speaking, they are the frame objects of the
interpreter from which the JIT is generated. They need special treatment,
because on the one hand it is important that they get optimized away to make the
code fast, since the frames are accessed all the time for the local variables;
on the other hand they should still be usable for introspection if code is
around that is trying to look into them.

When this is done, the prototype needs to be ported to RPython, which is a
non-trivial task, since it is rather dynamic so far (it is rather important that
the unresolved issues are done before the porting, because once the prototype is
in RPython, experimentation will be harder). The porting has the potential to be
tedious, but in a sense it is "just work", as opposed to unclear research.

At this point it will become important to think about the backend interface. The
interface that the old frontend used to produce assembler code won't be usable
for the new approach, so things need to be rearranged slightly. Afterwards the
backends will have more information and be invoked at a slightly higher level,
which should allow them to produce better code.

When all this is done, the JIT generator will be in a rather good state and it
should become possible (modulo a lot of details, of course), to use it on the
Python interpreter.

Conclusion

I am intentionally not attaching any time estimates to this blog post. So far
our time estimates have not been very accurate when it comes to the JIT, which
only lead to disappointment when the JIT failed to materialize. We hope that we
will progress in interesting ways in the next four months, but who knows. Note
that we are really quite disappointed ourselves that it took so much longer than
we planned and hoped. The reason for this is mostly that this work really is
research and sometimes it is just hard to predict what sort of problems turn up.
Partial evaluation (the basis for our JIT generator) is a 30 years old technique
that was always just promising and never really successful, so the fact that we
think we can solve its problems in a few years is very much hubris anyway :-).
On the positive side, we think that we now know these problems much better than
ever before and that we have a plan that has a chance to succeed.

Also we are still convinced that our approach has huge potential, despite the
difficulties. If we manage to pull it off, it should be significantly simpler to
support new language features in the JIT and also to get speedups on some rather
interesting bits of the language. Some ideas we are having include generating a
JIT for the regex engine or speed up ctypes-bindings to be nearly as fast as an
extension module (or faster?). Also the JIT will be such that by construction
the JIT-generated code behaves identical to the original code, which isn't
always true for Psyco, for example.

Background

Finally, the JIT post :-). First some background: Despite our plans at the end
of the EU period, PyPy's Python interpreter didn't get a good and widely
applicable JIT in the last year. The reason for that was that we discovered that
although the basic idea to generate JIT compilers is good, the concrete
prototype made during the EU period is basically flawed. It could have
been pushed a bit farther, but would have run into deep troubles eventually. One
of the problems would have been performance instability: change a seemingly
unrelated bit in your source program, and the performance changes in unexpected
ways, which is clearly not desirable. Another problem with that old approach is
that too much assembler code is generated, leading to memory problems, and also
that the generated assembler is bad in various ways, e.g. it is hard in that
approach to do proper register allocation.

Therefore we decided that it would be worthless to pursue this direction much
further. Instead we tried to research approaches to fixing the inherent
problems. This research was largely done in Prolog and I eventually wrote my
Master's thesis about it. From the Prolog work we got some good insights into
what needs to be done and what sorts of techniques are needed. Also, it inspired
Armin to do some more exploration on a small Python prototype which used the
lessons learned from Prolog and also some additional ideas from tracing JITs. So
far, however, the prototype is neither in RPython, nor much integrated with
PyPy.

This research is not the only thing happening in the JIT-area. During the last
year, Antonio Cuni was working on bringing the JIT to pypy-cli. This
consisted mostly of writing a .NET backend for the old JIT-generator. Some
further work is being done since August by John Witulski, who is writing an
AMD64 backend for the JIT-generator for his Bachelor's thesis.

Where to go from there

During the sprint we discussed in which directions we should continue now. We
plan to work quite a bit on the JIT in the coming months. Both Armin and Anto
are in Düsseldorf for four months, and them and me plan to mostly work on the
JIT (as well as giving a lecture on "Dynamic Programming Languages", trying to
ensnare some more students).

The first step will be to experiment a bit more with Armin's prototype. So far
it looks rather promising, but there are some unsolved issues that we need to
look into first. The first issue is to think a bit about how to efficiently do
profiling to compile only important code paths. The other large issue are
so-called "virtualizables". Roughly speaking, they are the frame objects of the
interpreter from which the JIT is generated. They need special treatment,
because on the one hand it is important that they get optimized away to make the
code fast, since the frames are accessed all the time for the local variables;
on the other hand they should still be usable for introspection if code is
around that is trying to look into them.

When this is done, the prototype needs to be ported to RPython, which is a
non-trivial task, since it is rather dynamic so far (it is rather important that
the unresolved issues are done before the porting, because once the prototype is
in RPython, experimentation will be harder). The porting has the potential to be
tedious, but in a sense it is "just work", as opposed to unclear research.

At this point it will become important to think about the backend interface. The
interface that the old frontend used to produce assembler code won't be usable
for the new approach, so things need to be rearranged slightly. Afterwards the
backends will have more information and be invoked at a slightly higher level,
which should allow them to produce better code.

When all this is done, the JIT generator will be in a rather good state and it
should become possible (modulo a lot of details, of course), to use it on the
Python interpreter.

Conclusion

I am intentionally not attaching any time estimates to this blog post. So far
our time estimates have not been very accurate when it comes to the JIT, which
only lead to disappointment when the JIT failed to materialize. We hope that we
will progress in interesting ways in the next four months, but who knows. Note
that we are really quite disappointed ourselves that it took so much longer than
we planned and hoped. The reason for this is mostly that this work really is
research and sometimes it is just hard to predict what sort of problems turn up.
Partial evaluation (the basis for our JIT generator) is a 30 years old technique
that was always just promising and never really successful, so the fact that we
think we can solve its problems in a few years is very much hubris anyway :-).
On the positive side, we think that we now know these problems much better than
ever before and that we have a plan that has a chance to succeed.

Also we are still convinced that our approach has huge potential, despite the
difficulties. If we manage to pull it off, it should be significantly simpler to
support new language features in the JIT and also to get speedups on some rather
interesting bits of the language. Some ideas we are having include generating a
JIT for the regex engine or speed up ctypes-bindings to be nearly as fast as an
extension module (or faster?). Also the JIT will be such that by construction
the JIT-generated code behaves identical to the original code, which isn't
always true for Psyco, for example.

At the beginning of this year, PyPy grew ctypes support, thanks to generous
support by Google. This made it possible to interface with C libraries from
our Python interpreter, something that was possible but rather tedious before.
What we are lacking so far is a way to interface to large C++ libraries (like
GUI libraries). During the sprint we had a brainstorming session about possible
approaches for fixing this shortcoming.

Those all have the property that they produce some code that is then compiled
with a compiler to produce a CPython extension. The produced code also uses
functions from CPython's C-API. This model is not simple to use for PyPy in its
current state. Since PyPy generates C code automatically, a fixed C-level API
does not exist (it is not unlikely that at one point in the future we might have
to provide one, but not yet). At the moment, PyPy very much has a "Don't call
us, we call you"-approach.

A very different approach is followed by the Reflex package, which is
developed at CERN (which has an incredible amount of C++ libraries). It is not
mainly intended for writing Python bindings for C++ libraries but instead
provides reflection capabilities for C++. The idea is that for every C++ shared
library, an additional shared library is produced, which allows together with
Reflex to introspect properties of C++ classes, methods, etc. at runtime. These
facilities are then used for writing a small generic CPython extension module,
that allows CPython to use any C++ library for which this reflection information
was generated.

This approach is a bit similar to the ctypes module, apart from the fact
that ctypes does not use any reflection information, but the user has to
specify the data structures that occur in the C code herself. This makes it
sometimes rather burdensome to write cross-platform library bindings.

For PyPy the approach seems rather fitting: We would need to implement only the
generic extension module and could then use any number of C++ libraries. Of
course some more evaluation is needed (e.g. to find out whether there are any
restrictions for the C++ code that the library can use and how bothersome it is
to get this reflection information for a large library) but so far it seems
promising.

At the beginning of this year, PyPy grew ctypes support, thanks to generous
support by Google. This made it possible to interface with C libraries from
our Python interpreter, something that was possible but rather tedious before.
What we are lacking so far is a way to interface to large C++ libraries (like
GUI libraries). During the sprint we had a brainstorming session about possible
approaches for fixing this shortcoming.

Those all have the property that they produce some code that is then compiled
with a compiler to produce a CPython extension. The produced code also uses
functions from CPython's C-API. This model is not simple to use for PyPy in its
current state. Since PyPy generates C code automatically, a fixed C-level API
does not exist (it is not unlikely that at one point in the future we might have
to provide one, but not yet). At the moment, PyPy very much has a "Don't call
us, we call you"-approach.

A very different approach is followed by the Reflex package, which is
developed at CERN (which has an incredible amount of C++ libraries). It is not
mainly intended for writing Python bindings for C++ libraries but instead
provides reflection capabilities for C++. The idea is that for every C++ shared
library, an additional shared library is produced, which allows together with
Reflex to introspect properties of C++ classes, methods, etc. at runtime. These
facilities are then used for writing a small generic CPython extension module,
that allows CPython to use any C++ library for which this reflection information
was generated.

This approach is a bit similar to the ctypes module, apart from the fact
that ctypes does not use any reflection information, but the user has to
specify the data structures that occur in the C code herself. This makes it
sometimes rather burdensome to write cross-platform library bindings.

For PyPy the approach seems rather fitting: We would need to implement only the
generic extension module and could then use any number of C++ libraries. Of
course some more evaluation is needed (e.g. to find out whether there are any
restrictions for the C++ code that the library can use and how bothersome it is
to get this reflection information for a large library) but so far it seems
promising.

Sunday, October 12, 2008

One of the discussions that happened during the sprint was about how to approach
the next PyPy release. There hasn't been a release since the end of the EU
period, which is not an optimal situation. Therefore we plan to make a 1.1
release at the beginning of next year, ideally before Pycon US. We'd also like
to move towards time-based releases. This will be greatly helped by the
new buildbot infrastructure, which allows us to decide when the
state of the codebase is stable enough to release.

Another goal of the release is to involve more people from the wider PyPy
community by having bugdays and generally asking for more support. This will be
particularly useful for bugs on platforms that no one of the core developers
group is using.

Feature-wise the release will mostly contain CPython 2.5 language support,
including some new extension modules (like ctypes, expat, sqlite).
In addition we plan to make it easier to actually install and use the PyPy
Python interpreter, which means some sort of proper installation procedure and
supporting distutils on top of PyPy. Another part of the release will be
support for fully sand-boxing an interpreter.

In contrast to our last release, we will focus mainly on PyPy's Python
Intepreter and more particularly its C-version. There are also various
experimental interpreters that PyPy contains, like for Prolog, Smalltalk,
JavaScript and Scheme. We also don't intend to put the LLVM and Javascipt
backends in the release, since they are essentially unmaintained and at least
partially broken. If anybody is particularly interested in one of these
components, please feel free to step up and take responsibility for them.
Another thing that the release won't contain is a JIT. I plan to make another
blog-post about this soon, stay tuned.

One of the discussions that happened during the sprint was about how to approach
the next PyPy release. There hasn't been a release since the end of the EU
period, which is not an optimal situation. Therefore we plan to make a 1.1
release at the beginning of next year, ideally before Pycon US. We'd also like
to move towards time-based releases. This will be greatly helped by the
new buildbot infrastructure, which allows us to decide when the
state of the codebase is stable enough to release.

Another goal of the release is to involve more people from the wider PyPy
community by having bugdays and generally asking for more support. This will be
particularly useful for bugs on platforms that no one of the core developers
group is using.

Feature-wise the release will mostly contain CPython 2.5 language support,
including some new extension modules (like ctypes, expat, sqlite).
In addition we plan to make it easier to actually install and use the PyPy
Python interpreter, which means some sort of proper installation procedure and
supporting distutils on top of PyPy. Another part of the release will be
support for fully sand-boxing an interpreter.

In contrast to our last release, we will focus mainly on PyPy's Python
Intepreter and more particularly its C-version. There are also various
experimental interpreters that PyPy contains, like for Prolog, Smalltalk,
JavaScript and Scheme. We also don't intend to put the LLVM and Javascipt
backends in the release, since they are essentially unmaintained and at least
partially broken. If anybody is particularly interested in one of these
components, please feel free to step up and take responsibility for them.
Another thing that the release won't contain is a JIT. I plan to make another
blog-post about this soon, stay tuned.

Saturday, October 11, 2008

The Düsseldorf sprint is currently in full progress and this post will try to
summarize what progress has been made in the last days. We are (again) sprinting
at the STUPS group of the Düsseldorf University. You can find the sprint
announcement and the daily planning file.

Holger and Samuele put quite some effort over several days into setting up and
improving PyPy's testing infrastructure. PyPy has a variety of tests. On the one
hand, there are of course our own tests. But then we also have the CPython tests
that should be run on top of pypy-c. Up to now we used a custom-made pile of
hacks, held together by lots of duct-tape. It consisted of a variety of
different machines running different things with different reporting solutions.
Some of the old test-results can still be found on wyvern. Now we are moving
to a buildbot based solution together with a custom reporter to have a view
similar to the old one. Some details are not quite finished yet, but most of the
things are already working rather well (currently all the results displayed
are from the 2.5-merge branch).

Another large (and ongoing) topic of work is the 2.5 branch. It contains the
work done by our Summer-of-Code student, Bruno Gola, of adding CPython 2.5
features to PyPy's Python interpreter. While Bruno implemented most language
features and imported the 2.5 stdlib into PyPy, a lot of details were still
missing. In the last days nearly everybody worked on fixing small issues and
failing stdlib tests. While doing that we tried to categorize some CPython tests
as implementation dependant so that we can skip them when running on PyPy.

Memory Improvements

One goal of the sprint is to measure and to reduce the memory behaviour of our
Python interpreter. The idea is to make pypy-c a realistic option for use on
embedded devices. By memory behaviour we mean both the
dynamic memory usage (how much bytes does a dict or an instance take) as well as
the size of the executable and details of the GC strategy.

Alexander, Carl Friedrich and Antonio did some work on analyzing the static data
that a pypy-c executable contains. Our executables have the tendency to be
rather large, both due to a lot of code and due to a large amount of static
data. The analysis didn't give any really surprising results, the problem is
mostly that we have a lot of static data originating from a bit everywhere in
our program. Two big offenders are the unicodedata-module with about 750 KB
of static data and the multimethod-tables with about 150 KB of data.

Armin, Iko, Anto and Maciek worked on a new approach to malloc-removal. This is
(for PyPy) a crucial optimization of the translation toolchain that performs
escape analysis to find out which objects don't outlive the frame they were
allocated in. Since RPython is garbage-collected we usually have a lot of
allocations, so it is important to statically get rid of many of them. To
successfully do that, some inlining is needed to give the analysis more context.
This leads to the fact that we have rather aggressive inlining-settings to allow
as much malloc-removal as possible. The new approach tries to inline functions
only if this actually leads to the successful removal of a malloc operation. The
code is not finished quite yet, so it remains to be seen how successful it will
be.

Before the sprint Maciek had started to work on a mark-compact GC for PyPy. The
idea is that it is better for memory-constrained-environments because it does
not double the memory-requirements during collections. During the sprint Armin
and Maciek worked on cleaning up the code a bit and then merging the branch.
An interesting property of the mark-compact GC is that after a collection all
the memory that is not currently used by the program is returned to the
operating system. Right now the GC is not as fast as our more mature ones, but
it probably will be the basis for future tweaking.

A small thing that was done by Alexander and Carl Friedrich to make objects smaller is
to enable shared instance dictionaries also for instances of old-style
classes. Before it worked only for instances of new-style classes. Shared
instance dictionaries are a way to reduce the memory-usage of instances. In the
optimal case, it gives the same memory-savings that __slots__ are giving,
but without any behavioural changes. Conceptually it is very similar e.g. to
the notion of "map" in the Self project, or the hidden classes that Google Chrome's V8
is using (click on the link, there are nice graphics). The
difference is that for them it is mostly a way to get faster attribute access,
and PyPy is so far only using it form memory savings (but that might change in
the future).

In parallel to all the other work, John Witulski worked tirelessly on advancing
the AMD64-JIT-backend. John has the implementation of this backend as the topic
of his Bachelor's thesis. He is progressing quite well (especially also
considering that this is his first sizeable Python project ever), just sometimes
being impaired by such annoyances as errors in the official Intel documentation.
By now the backend is supporting many integer operations and control flow.

The Düsseldorf sprint is currently in full progress and this post will try to
summarize what progress has been made in the last days. We are (again) sprinting
at the STUPS group of the Düsseldorf University. You can find the sprint
announcement and the daily planning file.

Holger and Samuele put quite some effort over several days into setting up and
improving PyPy's testing infrastructure. PyPy has a variety of tests. On the one
hand, there are of course our own tests. But then we also have the CPython tests
that should be run on top of pypy-c. Up to now we used a custom-made pile of
hacks, held together by lots of duct-tape. It consisted of a variety of
different machines running different things with different reporting solutions.
Some of the old test-results can still be found on wyvern. Now we are moving
to a buildbot based solution together with a custom reporter to have a view
similar to the old one. Some details are not quite finished yet, but most of the
things are already working rather well (currently all the results displayed
are from the 2.5-merge branch).

Another large (and ongoing) topic of work is the 2.5 branch. It contains the
work done by our Summer-of-Code student, Bruno Gola, of adding CPython 2.5
features to PyPy's Python interpreter. While Bruno implemented most language
features and imported the 2.5 stdlib into PyPy, a lot of details were still
missing. In the last days nearly everybody worked on fixing small issues and
failing stdlib tests. While doing that we tried to categorize some CPython tests
as implementation dependant so that we can skip them when running on PyPy.

Memory Improvements

One goal of the sprint is to measure and to reduce the memory behaviour of our
Python interpreter. The idea is to make pypy-c a realistic option for use on
embedded devices. By memory behaviour we mean both the
dynamic memory usage (how much bytes does a dict or an instance take) as well as
the size of the executable and details of the GC strategy.

Alexander, Carl Friedrich and Antonio did some work on analyzing the static data
that a pypy-c executable contains. Our executables have the tendency to be
rather large, both due to a lot of code and due to a large amount of static
data. The analysis didn't give any really surprising results, the problem is
mostly that we have a lot of static data originating from a bit everywhere in
our program. Two big offenders are the unicodedata-module with about 750 KB
of static data and the multimethod-tables with about 150 KB of data.

Armin, Iko, Anto and Maciek worked on a new approach to malloc-removal. This is
(for PyPy) a crucial optimization of the translation toolchain that performs
escape analysis to find out which objects don't outlive the frame they were
allocated in. Since RPython is garbage-collected we usually have a lot of
allocations, so it is important to statically get rid of many of them. To
successfully do that, some inlining is needed to give the analysis more context.
This leads to the fact that we have rather aggressive inlining-settings to allow
as much malloc-removal as possible. The new approach tries to inline functions
only if this actually leads to the successful removal of a malloc operation. The
code is not finished quite yet, so it remains to be seen how successful it will
be.

Before the sprint Maciek had started to work on a mark-compact GC for PyPy. The
idea is that it is better for memory-constrained-environments because it does
not double the memory-requirements during collections. During the sprint Armin
and Maciek worked on cleaning up the code a bit and then merging the branch.
An interesting property of the mark-compact GC is that after a collection all
the memory that is not currently used by the program is returned to the
operating system. Right now the GC is not as fast as our more mature ones, but
it probably will be the basis for future tweaking.

A small thing that was done by Alexander and Carl Friedrich to make objects smaller is
to enable shared instance dictionaries also for instances of old-style
classes. Before it worked only for instances of new-style classes. Shared
instance dictionaries are a way to reduce the memory-usage of instances. In the
optimal case, it gives the same memory-savings that __slots__ are giving,
but without any behavioural changes. Conceptually it is very similar e.g. to
the notion of "map" in the Self project, or the hidden classes that Google Chrome's V8
is using (click on the link, there are nice graphics). The
difference is that for them it is mostly a way to get faster attribute access,
and PyPy is so far only using it form memory savings (but that might change in
the future).

In parallel to all the other work, John Witulski worked tirelessly on advancing
the AMD64-JIT-backend. John has the implementation of this backend as the topic
of his Bachelor's thesis. He is progressing quite well (especially also
considering that this is his first sizeable Python project ever), just sometimes
being impaired by such annoyances as errors in the official Intel documentation.
By now the backend is supporting many integer operations and control flow.

Friday, October 10, 2008

As we already blogged, in the last half-year or so, Michael Leuschel, Armin
and me did a lot of JIT generator work on a Prolog prototype. The idea was to
experiment more quickly with some techniques than what would have been possible
with RPython. These experiments were quite successful in themselves. With very
little code we managed to get a JIT that is not doing too badly when compared to
existing projects for Prolog.

This Prolog work was also the subject of my Master's thesis. I finished the
thesis about two weeks ago (and since then have been mostly sleeping and then
sprinting). The thesis should be self-contained when it comes to explaining the
JIT concepts but needs knowledge of Prolog to be understandable.

As we already blogged, in the last half-year or so, Michael Leuschel, Armin
and me did a lot of JIT generator work on a Prolog prototype. The idea was to
experiment more quickly with some techniques than what would have been possible
with RPython. These experiments were quite successful in themselves. With very
little code we managed to get a JIT that is not doing too badly when compared to
existing projects for Prolog.

This Prolog work was also the subject of my Master's thesis. I finished the
thesis about two weeks ago (and since then have been mostly sleeping and then
sprinting). The thesis should be self-contained when it comes to explaining the
JIT concepts but needs knowledge of Prolog to be understandable.