Saturday, March 12, 2011

Controlling the Tracing of an Interpreter With Hints, Part 1: Controlling the Extent of Tracing

The question I was asked most often during my recent US trip was how exactly
the hints work that interpreter authors can use to improve the execution speed
of the programs running on their interpreters. Since those hints are not really
documented all that well, I decided to write blog posts about them. This is the
first one.

Background

First, let's recap some basics: PyPy's approach to implementing dynamic
languages is to write an interpreter for
the language in RPython. This interpreter can be translated to C and then
further to machine code. The interpreter consists of code in the form of a
large number of generated C functions and some data. Similarly, the user
program consists of functions in the language the interpreter executes.

As was explained in a blog post and a paper two years ago, PyPy's JIT is a
meta-tracer. Since we want to re-use our tracer for a variety of languages, we
don't trace the execution of the user program, but instead trace the execution
of the interpreter that is running the program. This means that the traces
don't contain the bytecodes of the language in question, but RPython-level
operations that the interpreter did to execute the program.

On the other hand, the loops that are traced by the tracer are the loops in the
user program. This means that the tracer stops tracing after one iteration of
the loop in the user function that is being considered. At this point, it can
have traced many iterations of the interpreter main loop.

Here's a diagram of this process:

On the left you see the levels of execution. The CPU executes the binary of
PyPy's Python interpreter, which consists of RPython functions that have been
compiled first to C, then to machine code. Some of these functions contain
loops, others don't. The interpreter runs a Python program written by a
programmer (the user). If the tracer is used, it traces operations on the level
of the interpreter. However, the extent of the trace is determined by the loops
in the user program.

How Far Should Tracing Go

When the tracer encounters a function call at the interpreter level, e.g. the
interpreter main loop calling a helper function, it can do one of two things:

it can trace into the helper function, effectively inlining it into the trace.

it can not trace into the function and instead record a call to that function
as an operation in the trace. Such a call operation in the trace is sometimes
called residual call.

As a default, the tracer will try to trace into the helper because that will
give more information to the optimizer, allowing it to do a better job. This is
particularly important for the allocation removal optimization, because if a
freshly allocated object is passed as an argument to a residual call, its
allocation cannot be optimized away.

There is a problem however if the helper function itself contains a loop. The
tracer records the linear sequence of operations that are being executed. Thus
when it encounters a loop on the interpreter level it records all the
operations of every iteration of the loop itself, with the net effect of
unrolling it. The only places where the tracer stops and tries to close the
trace is in the main loop of the interpreter. When the tracer encounters the
main loop, it also checks whether the original user loop has been closed, and
thus whether it can stop tracing.

For most helper functions in the interpreter that contain loops, fully
unrolling does not make sense. If a loop is unrolled, the trace is specific to
the number of iteration that was seen during tracing. If the trace is later
executed with a different number of iterations, the trace will be left via a
guard failure, which is inefficient. Therefore the default behaviour of the
tracer is to never trace into a function on the interpreter level that contains
a loop, but to trace into all non-looping helper functions.

This default behaviour is essentially a heuristic, but one that usually makes
sense. We want to produce just enough traces to make the resulting code
efficient, but not more. Therefore we trace as much as possible (everything by
default) except the functions which loops where tracing would produce code that
is less general than it could be.

As an example for a helper with a loop, take string concatenation. It loops over
the characters of both arguments and copies them over into the result string. It
does not make sense to unroll the loops in this function. If we do that,
the resulting trace can only be used for strings of the length that was seen
during tracing. In practise, the string lengths are usually different each run,
meaning that the trace with unrolling is not run to completion in most cases.

Influencing the Default Behaviour

Sometimes the default behaviour is not actually what is wanted. This is
something the interpreter author has to decide, usually by looking at the traces
that are produced and deciding that they should be improved. There are two ways
in which the default is wrong:

false negatives: if a helper function that does contain a loop should
be traced into, unrolling the loop.

false positives: if a helper function that does not contain a loop is
inlined into the trace, but the interpreter author decides that this is not
helpful.

If the interpreter author finds false negatives or false positives, she can fix
that by applying a hint to the tracer. These hints take the form of function
decorators (which both live in the pypy.rlib.jit module). In the next two
subsections I will describe these two function decorators and their use.

Unrolling Functions With Loops

The first decorator, used to fix false negatives, is the unroll_safe
decorator. It is used to tell the tracer to always trace into a function that
has a loop, effectively unrolling the loop. This decorator should be used only
if the loop in the helper function is expected to always run for the same number
of iterations. This sounds like a strong restriction, in practise this is less
severe: The number of iterations needs to only be the same in the context where
the helper functions is traced from.

It is easiest to understand this condition via an example. Let's look at the
BUILD_TUPLE bytecode in Python. It takes one argument, the length n of
the tuple being built. The bytecode pops n arguments from the stack, turns
them into a tuple and pushes that tuple on the stack. Thus the function that
implements BUILD_TUPLE in PyPy's Python interpreter calls a helper
popvalues which pops n values from the stack and returns them in a list.
This helper is implemented with a loop and would thus not be traced into by
default. The loop in the helper can run for very different numbers of
iterations, because it is used in a variety of places. However, for every
concrete BUILD_TUPLE bytecode, the argument will be constant. Therefore it
is safe (and even necessary) to annotate popvalues with the unroll_safe
decorator.

A different example is the implementation of the isinstance builtin. It is
used to check whether an object a is an instance of a class B like
this: isinstance(a, B). The second argument of the function can also be a
tuple of classes to check whether an object is an instance of one of a number of
classes: isinstance(a, (A, B, C, D)). To implement this second case, the
implementation of isinstance contains a loop iterating over the elements of
the tuple. The number of loop iterations can vary, but is usually fixed for each
individual call site which typically just lists a few classes in the source
code. Therefore it is also safe to annotate the implementation of isinstance
with the unroll_safe decorator.

Preventing the Tracing of Functions

The second decorator dont_look_inside is used to fix false positives. It
tells the JIT to never trace into the decorated function and just always produce
a residual call instead. This decorator is in many ways less important than the
unrolling one (except for a special situation that I will describe in a
follow-up post). It is used if tracing into a function is not expected to yield
any speed benefits, because the optimizer will not be able to improve it much.
This is often the case if the called helper function does not contain any
"dynamic" behaviour. In such a situation it is better to just leave the function
call in the trace, because that produces less code.

An example would be the import mechanism in Python. It's very unlikely that any
performance improvement can be had by turning part of it into assembler.
Therefore we hide it from the tracer by annotating them with
dont_look_inside.

Conclusion

In this post we discussed two hints that can be used to control precisely which
parts of the interpreter should be meta-traced. If these hints are used
carefully, this can go a long way to making the interpreter produce traces that
contain exactly the interesting part of the execution, and will contain calls to
the functions that can not be optimized by tracing techniques.

The question I was asked most often during my recent US trip was how exactly
the hints work that interpreter authors can use to improve the execution speed
of the programs running on their interpreters. Since those hints are not really
documented all that well, I decided to write blog posts about them. This is the
first one.

Background

First, let's recap some basics: PyPy's approach to implementing dynamic
languages is to write an interpreter for
the language in RPython. This interpreter can be translated to C and then
further to machine code. The interpreter consists of code in the form of a
large number of generated C functions and some data. Similarly, the user
program consists of functions in the language the interpreter executes.

As was explained in a blog post and a paper two years ago, PyPy's JIT is a
meta-tracer. Since we want to re-use our tracer for a variety of languages, we
don't trace the execution of the user program, but instead trace the execution
of the interpreter that is running the program. This means that the traces
don't contain the bytecodes of the language in question, but RPython-level
operations that the interpreter did to execute the program.

On the other hand, the loops that are traced by the tracer are the loops in the
user program. This means that the tracer stops tracing after one iteration of
the loop in the user function that is being considered. At this point, it can
have traced many iterations of the interpreter main loop.

Here's a diagram of this process:

On the left you see the levels of execution. The CPU executes the binary of
PyPy's Python interpreter, which consists of RPython functions that have been
compiled first to C, then to machine code. Some of these functions contain
loops, others don't. The interpreter runs a Python program written by a
programmer (the user). If the tracer is used, it traces operations on the level
of the interpreter. However, the extent of the trace is determined by the loops
in the user program.

How Far Should Tracing Go

When the tracer encounters a function call at the interpreter level, e.g. the
interpreter main loop calling a helper function, it can do one of two things:

it can trace into the helper function, effectively inlining it into the trace.

it can not trace into the function and instead record a call to that function
as an operation in the trace. Such a call operation in the trace is sometimes
called residual call.

As a default, the tracer will try to trace into the helper because that will
give more information to the optimizer, allowing it to do a better job. This is
particularly important for the allocation removal optimization, because if a
freshly allocated object is passed as an argument to a residual call, its
allocation cannot be optimized away.

There is a problem however if the helper function itself contains a loop. The
tracer records the linear sequence of operations that are being executed. Thus
when it encounters a loop on the interpreter level it records all the
operations of every iteration of the loop itself, with the net effect of
unrolling it. The only places where the tracer stops and tries to close the
trace is in the main loop of the interpreter. When the tracer encounters the
main loop, it also checks whether the original user loop has been closed, and
thus whether it can stop tracing.

For most helper functions in the interpreter that contain loops, fully
unrolling does not make sense. If a loop is unrolled, the trace is specific to
the number of iteration that was seen during tracing. If the trace is later
executed with a different number of iterations, the trace will be left via a
guard failure, which is inefficient. Therefore the default behaviour of the
tracer is to never trace into a function on the interpreter level that contains
a loop, but to trace into all non-looping helper functions.

This default behaviour is essentially a heuristic, but one that usually makes
sense. We want to produce just enough traces to make the resulting code
efficient, but not more. Therefore we trace as much as possible (everything by
default) except the functions which loops where tracing would produce code that
is less general than it could be.

As an example for a helper with a loop, take string concatenation. It loops over
the characters of both arguments and copies them over into the result string. It
does not make sense to unroll the loops in this function. If we do that,
the resulting trace can only be used for strings of the length that was seen
during tracing. In practise, the string lengths are usually different each run,
meaning that the trace with unrolling is not run to completion in most cases.

Influencing the Default Behaviour

Sometimes the default behaviour is not actually what is wanted. This is
something the interpreter author has to decide, usually by looking at the traces
that are produced and deciding that they should be improved. There are two ways
in which the default is wrong:

false negatives: if a helper function that does contain a loop should
be traced into, unrolling the loop.

false positives: if a helper function that does not contain a loop is
inlined into the trace, but the interpreter author decides that this is not
helpful.

If the interpreter author finds false negatives or false positives, she can fix
that by applying a hint to the tracer. These hints take the form of function
decorators (which both live in the pypy.rlib.jit module). In the next two
subsections I will describe these two function decorators and their use.

Unrolling Functions With Loops

The first decorator, used to fix false negatives, is the unroll_safe
decorator. It is used to tell the tracer to always trace into a function that
has a loop, effectively unrolling the loop. This decorator should be used only
if the loop in the helper function is expected to always run for the same number
of iterations. This sounds like a strong restriction, in practise this is less
severe: The number of iterations needs to only be the same in the context where
the helper functions is traced from.

It is easiest to understand this condition via an example. Let's look at the
BUILD_TUPLE bytecode in Python. It takes one argument, the length n of
the tuple being built. The bytecode pops n arguments from the stack, turns
them into a tuple and pushes that tuple on the stack. Thus the function that
implements BUILD_TUPLE in PyPy's Python interpreter calls a helper
popvalues which pops n values from the stack and returns them in a list.
This helper is implemented with a loop and would thus not be traced into by
default. The loop in the helper can run for very different numbers of
iterations, because it is used in a variety of places. However, for every
concrete BUILD_TUPLE bytecode, the argument will be constant. Therefore it
is safe (and even necessary) to annotate popvalues with the unroll_safe
decorator.

A different example is the implementation of the isinstance builtin. It is
used to check whether an object a is an instance of a class B like
this: isinstance(a, B). The second argument of the function can also be a
tuple of classes to check whether an object is an instance of one of a number of
classes: isinstance(a, (A, B, C, D)). To implement this second case, the
implementation of isinstance contains a loop iterating over the elements of
the tuple. The number of loop iterations can vary, but is usually fixed for each
individual call site which typically just lists a few classes in the source
code. Therefore it is also safe to annotate the implementation of isinstance
with the unroll_safe decorator.

Preventing the Tracing of Functions

The second decorator dont_look_inside is used to fix false positives. It
tells the JIT to never trace into the decorated function and just always produce
a residual call instead. This decorator is in many ways less important than the
unrolling one (except for a special situation that I will describe in a
follow-up post). It is used if tracing into a function is not expected to yield
any speed benefits, because the optimizer will not be able to improve it much.
This is often the case if the called helper function does not contain any
"dynamic" behaviour. In such a situation it is better to just leave the function
call in the trace, because that produces less code.

An example would be the import mechanism in Python. It's very unlikely that any
performance improvement can be had by turning part of it into assembler.
Therefore we hide it from the tracer by annotating them with
dont_look_inside.

Conclusion

In this post we discussed two hints that can be used to control precisely which
parts of the interpreter should be meta-traced. If these hints are used
carefully, this can go a long way to making the interpreter produce traces that
contain exactly the interesting part of the execution, and will contain calls to
the functions that can not be optimized by tracing techniques.

Sorry if the following questions are naive, but you post makes me wonder if not tracing at all the functions which contain loops with a varying number of iteration means that no optimization is possible at all for those loops? Also, wouldn't it be possible to detect there is a loop and produce a special kind of trace in that case which do not duplicate the body of the loop? I guess that if it was possible and useful, you'd have done it, so I guess the real question is: why doesn't this work?

@Victor: yes, there are probably ways to do place some of the hints more automatically. However, you will always have to look at the traces and think about how to improve them, so we chose the pragmatic path and didn't do anything magic.

Yes, functions in the interpreter with loops that we do not trace are not optimized at all. For most of these functions this is not a problem, e.g. string concatenation does not have much optimization potential anyway. However, there are some functions with loops (like the implementation of the map builtin) that would benefit from tracing, and we don't have a good general solution for that yet.

One of the ideas for solutions are indeed to try to start new traces in the interpreter functions with loops. We did not get around to playing with this yet, as there are not so many cases in the Python interpreter where this leads to a huge benefit.

I'm puzzled now. I fail to see why those loops "do not have much optimization potential". I can understand that it's hard to optimize them because of the trace problem but I thought they would benefit from optimization like any other code (eg avoiding boxing/unboxing temporary variables), especially since they are within a loop, hence any gain will be multiplied by the number of iterations.

@Gaëtan:is it possible that you are mixing up the two levels involved? The post talked only about functions in the interpreter, not about the functions in pure Python that a user of the interpreter might write. To clarify:

- All loops on the application level, i.e. in the program the user wrote, are traceable and will be traced if they are executed often enough.

- Some loops in the interpreter itself are not. Most of these loops do not do any boxing/unboxing, so they won't benefit from optimization. For some of the loops that would benefit we added some manual hacks to trace them anyway, e.g. for the implementation of "map". Some others still need to be improved, e.g. any, all, zip, ...

While I understand special casing to overwrite the default tracing/not-tracing rules can help performance, I wonder how well are the default heuristics performing. Do you have any bulk part estimation of the performance loss by turning off special casing? And how many hints (related to whether to trace or unroll) do you have to introduce to PyPy?

Thanks :-). No, I didn't really do benchmarks yet, plan to do so in the future (these blog posts will turn into a paper soonish).

There are about 20-30 unroll_safe hints and equally many dont_look_inside hints. Some of them are really important, ie the speed would be abysmal without them. Most of them are really in the bytecode dispatch area, they are cases that e.g. Jython would not have, because in Jython the Python-to-Java compiler takes care of them.

No, I wasn't confusing the two levels involved (if pypy wasn't optimizing variable-length loops in userlevel code, it wouldn't optimize much I guess).

My point was more theoretical: I guess that, in theory, those loops would benefit from optimizations like any other part of the interpreter. Your answer leads me to believe that *in practice* this isn't an issue because there are either not that many of them in the interpreter and/or they are not in speed critical parts and most of those that are important speed-wise have been taken care of manually in some way or another.

@Gaëtan: yes, that's a good interpretation. At some point we might still think about a more general solution for this problem, to get the remaining rare cases fixed, but for now we have a lot of the common ones covered.