Tuesday, June 10, 2008

List comprehension implementation details

List comprehensions are a nice feature in Python. They are, however, just
syntactic sugar for for loops. E.g. the following list comprehension:

def f(l):
    return [i ** 2 for i in l if i % 3 == 0]

is sugar for the following for loop:

def f(l):
    result = []
    for i in l:
        if i % 3 == 0:
            result.append(i ** 2)
    return result

The interesting bit about this is that list comprehensions are actually
implemented in almost exactly this way. If one disassembles the two functions
above, one gets very similar bytecode for both (apart from some details, such as
the fact that the append in the list comprehension is done with a special
LIST_APPEND bytecode).
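
A minimal sketch of that comparison, assuming a Python 3 interpreter where dis.get_instructions is available. The names f_comp, f_loop and opnames are made up for this illustration, and the exact opcodes differ between interpreter versions:

```python
import dis


def f_comp(l):
    return [i ** 2 for i in l if i % 3 == 0]


def f_loop(l):
    result = []
    for i in l:
        if i % 3 == 0:
            result.append(i ** 2)
    return result


def opnames(code):
    """Collect opcode names from a code object and any nested code objects."""
    names = set()
    for instr in dis.get_instructions(code):
        names.add(instr.opname)
    for const in code.co_consts:
        # Some interpreter versions compile the comprehension body into a
        # separate nested code object, so look inside those as well.
        if hasattr(const, "co_code"):
            names |= opnames(const)
    return names


comp_ops = opnames(f_comp.__code__)
loop_ops = opnames(f_loop.__code__)

# Opcodes that appear in only one of the two versions.
print(sorted(comp_ops - loop_ops))
print(sorted(loop_ops - comp_ops))
```

The comprehension version should show LIST_APPEND among its opcodes, while the explicit loop performs the append as an ordinary method call.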

Now, when doing this sort of expansion there are some classical problems: what
name should the intermediate list that is being built get? (I say classical
because this is indeed one of the problems of many macro systems.) What CPython
does is give the list the name _[1] (and _[2]... with nested list
comprehensions). You can observe this behaviour with the following code:
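
A minimal sketch of such a check. The function name names_inside_comprehension is made up for this illustration. It relies on dir() with no arguments listing the names of the current local scope, evaluated here from inside the comprehension. On the CPython versions current when this post was written, the printed list is expected to include the hidden name _[1]; newer interpreters compile comprehensions differently and will not show it.

```python
def names_inside_comprehension():
    # dir() is called while the comprehension is running, so it sees
    # whatever names the interpreter keeps in scope at that point.
    return [dir() for i in [0]][0]


print(names_inside_comprehension())
```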

Now to the real reason why I am writing this blog post. PyPy's Python
interpreter implements list comprehensions in more or less exactly the same way,
with one tiny difference: the name of the variable, which in PyPy is $list0.

Now, that shouldn't really matter for anybody, should it? Turns out it does. The
following way too clever code is apparently used a lot:

__all__ = [__name for __name in locals().keys() if not __name.startswith('_')
           or __name == '_']

In PyPy this will give you a "$list0" in __all__, which will prevent the
import of that module :-(. I guess I need to change the name to match CPython's.
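
Why a stray name in __all__ prevents the import can be sketched as follows. The module name demo_mod and the attribute visible are hypothetical, and the module is built in memory only so that the example is self-contained:

```python
import sys
import types

# Hypothetical module whose __all__ lists a name that the module
# does not actually define, as happens with the hidden list variable.
mod = types.ModuleType("demo_mod")
mod.visible = 1
mod.__all__ = ["visible", "$list0"]
sys.modules["demo_mod"] = mod


def star_import_error():
    """Run 'from demo_mod import *' and return the exception, if any."""
    namespace = {}
    try:
        exec("from demo_mod import *", namespace)
    except AttributeError as exc:
        return exc
    return None


print(repr(star_import_error()))
```

The star import looks up every name listed in __all__ on the module, so a single entry that is not a real attribute is enough to make the whole import fail.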

Lesson learned: no detail is obscure enough to not have some code depending
on it. Problems at this level of obscurity are mostly what we are fixing
in PyPy at the moment.

11 comments:

In fairness, the clever code does not depend on the name looking as it actually does in CPython; the clever code merely expects that variables auto-created by Python internals will begin with an underscore. Which is far more reasonable than actually expecting the specific name "_[1]" (and, wow, you're right, that does look weird; you've shown me something I've never seen before about Python!) to turn up in the variable list.

I would have said "Lesson learned: when MIT hackers in the 1960's come up with some funny thing called GENSYM, it's not just because they're weird; it really does serve a purpose". But then I'm an asshole Lisp hacker. :-)

anonymous: Using gensym for getting the symbol wouldn't have helped in this case at all. The gensymmed symbol would still have shown up in the locals() dictionary. So depending on whether the gensym implementation returns symbols that start with an underscore or not, the same bug would have occurred.

turingtest: I agree that that would be preferable, but it's sort of hard with the current interpreter design. Also, it's a pragmatic implementation in that the interpreter didn't have to change at all to add the list comps.

The code's not overly clever, it's ridiculous, because it exactly duplicates the effects of not having __all__ at all. "from foo import *" already won't import names prefaced with an underscore. Also, from the Google code search it looks like it's mostly used in Paste; most of the other hits are false positives.

The "from foo import *" case (without __all__ defined) is a good enough reason to match the CPython naming, though, the useless code in Paste notwithstanding.

arkanes: no, the "from foo import *" case isn't really changed by the different choice of symbols because the new variable is really only visible within the list comprehension and deleted afterwards. It doesn't leak (as opposed to the iteration variable).

arkanes: This is not the same as not having __all__ defined. __all__ would skip the function _() which is used to mark and translate strings with gettext. In other words, it is emulating the default no __all__ behavior and adding in _()

Carl: doesn't the "$list0" get imported without the all? If not what keeps it from causing a problem normally? Could you not just delete the $list0 variable after assigning it to the LHS?