Micro-Optimizations in Python Code: Speeding Up Lookups

I'm going to show you how a micro-optimization can speed up your Python code by
a whopping 5%. 5%! It can also annoy anyone who has to maintain your code.

But really, this is about explaining code you might occasionally see
in the standard library or in other people's code. Let's take an example
from the standard library, specifically the collections.OrderedDict class:
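As a stand-in for the library source, here is a minimal sketch of a dict subclass whose __setitem__ has the same shape. Only the signature, with its dict_setitem=dict.__setitem__ default, comes from this post; the class name OrderTrackingDict and the list-based bookkeeping are assumptions made up for illustration and are not how the standard library does it.

```python
class OrderTrackingDict(dict):
    """Hypothetical stand-in for OrderedDict; NOT the stdlib implementation."""

    def __init__(self):
        super().__init__()
        self.insertion_order = []  # assumed bookkeeping: a plain list of keys

    def __setitem__(self, key, value, dict_setitem=dict.__setitem__):
        # Record a key the first time it is set.
        if key not in self:
            self.insertion_order.append(key)
        # dict_setitem is an ordinary local name inside this method.
        return dict_setitem(self, key, value)


d = OrderTrackingDict()
d["b"] = 1
d["a"] = 2
d["b"] = 3
print(d.insertion_order)  # ['b', 'a']
```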

Notice the last arg: dict_setitem=dict.__setitem__. It makes sense if you
think about it. To associate a key with a value, you'll need to provide a
__setitem__ method which takes three arguments: the key you're setting, the
value associated with the key, and the __setitem__ method of the
built-in dict class. Wait. OK, maybe the last argument makes no sense.

Scope Lookups

To understand what's going on here, we need to take a look at scopes. Let's
start with a simple question: if I'm in a Python function and I encounter
something named open, how does Python go about figuring out the value of
open?

The short answer is that without knowing the contents of the GLOBAL and the
LOCAL sections, you can't know for certain the value of open. Conceptually,
Python checks three namespaces for a name (ignoring nested scopes to keep
things simple):

locals

globals

builtins

So in the myfunc function, if we're trying to find a value for open,
we'll first check the local namespace, then the globals namespace, then the
builtins namespace. And if open is not defined in any namespace, a
NameError is raised.
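The lookup order can be observed directly with eval, which accepts explicit globals and locals dictionaries. This is a small sketch of the order only: the helper name lookup is made up here, and an eval'd expression is not compiled the same way as a function body.

```python
import builtins


def lookup(name, global_ns, local_ns):
    """Resolve `name` against the given namespaces (hypothetical helper)."""
    return eval(name, global_ns, local_ns)


# Found in locals first, even though globals also defines it.
print(lookup("open", {"open": "global open"}, {"open": "local open"}))  # local open
# Not in locals, so the globals entry is used.
print(lookup("open", {"open": "global open"}, {}))  # global open
# In neither, so the lookup falls through to the builtin.
print(lookup("open", {}, {}) is builtins.open)  # True
# Defined nowhere: NameError.
try:
    lookup("no_such_name", {}, {})
except NameError as exc:
    print(type(exc).__name__)  # NameError
```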

Scope Lookups, the Implementation

The lookup process above is just conceptual. The way CPython actually
implements it leaves us room to exploit.

Look at the differences between foo and bar. Right away we can
see that at the bytecode level Python has already determined
what's a local variable and what is not, because foo is using
LOAD_FAST and bar is using LOAD_GLOBAL.
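One way to see this for yourself is with the dis module. The bodies of foo and bar below are assumptions chosen to show the contrast, and load_ops is a helper invented for this sketch.

```python
import dis


def foo(a):
    return a      # `a` is a parameter, so it is local to foo


def bar():
    return a      # `a` is never assigned in bar, so it is not local


def load_ops(func):
    """Return the names of the LOAD_* instructions in func's bytecode."""
    return [ins.opname for ins in dis.get_instructions(func)
            if ins.opname.startswith("LOAD_")]


print(load_ops(foo))  # a LOAD_FAST variant
print(load_ops(bar))  # includes LOAD_GLOBAL
```

bar is never called, so the missing global does no harm. Exact opcode names differ a little between CPython versions, which is why no literal output is shown.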

We won't get into the details of how Python's compiler knows when to emit which
bytecode (perhaps that's another post), but suffice it to say Python knows which
type of lookup it needs to perform when it executes a function.

One other thing that can be confusing is that LOAD_GLOBAL is used
for lookups in the global as well as the builtin namespace. You can
think of this as "not local", again ignoring the issue of nested scopes.
The C code for this is roughly [1]:

Even if you've never seen any of the C code for CPython, the above code is
pretty straightforward. First, check if the key name we're looking for is in
f->f_globals (the globals dict), then check if the name is in
f->f_builtins (the builtins dict), and finally, raise a NameError if
both checks failed.
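For readers who would rather not read C, the same order of checks can be modeled in Python. This is a rough model of the logic described above, not CPython's code; load_global and _MISSING are names made up for the sketch.

```python
import builtins

_MISSING = object()  # sentinel, so that a stored None still counts as found


def load_global(name, f_globals, f_builtins):
    """Pure-Python model of the globals-then-builtins lookup."""
    value = f_globals.get(name, _MISSING)        # first dictionary lookup
    if value is _MISSING:
        value = f_builtins.get(name, _MISSING)   # second dictionary lookup
        if value is _MISSING:
            raise NameError(f"name {name!r} is not defined")
    return value


builtin_ns = vars(builtins)
print(load_global("len", {}, builtin_ns) is len)     # True
print(load_global("len", {"len": 42}, builtin_ns))   # 42
```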

Binding Constants to the Local Scope

Now when we look at the initial code sample, we can see that the
last argument is binding a function into the local scope of a function.
It does this by assigning a value, dict.__setitem__, as the default
value of an argument. Here's another example:
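A sketch of what such a pair of functions might look like. The name not_list_or_dict and the builtins involved come from this post; the _fast suffix and the underscore-prefixed parameter names are assumptions.

```python
def not_list_or_dict(value):
    # isinstance, dict and list are looked up by name on every call.
    return not (isinstance(value, dict) or isinstance(value, list))


def not_list_or_dict_fast(value, _isinstance=isinstance, _dict=dict, _list=list):
    # The defaults were evaluated once, when the def statement ran.
    return not (_isinstance(value, _dict) or _isinstance(value, _list))


samples = [1, "text", [1, 2], {"k": "v"}, (1, 2)]
print([not_list_or_dict(s) for s in samples])       # [True, True, False, False, True]
print([not_list_or_dict_fast(s) for s in samples])  # [True, True, False, False, True]
print(not_list_or_dict_fast.__defaults__ == (isinstance, dict, list))  # True
```

One cost of the trick is visible in the signature: a caller can pass extra positional arguments and silently replace _isinstance or _dict.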

We're doing the same thing here, binding what would normally be objects
that are in the builtin namespace into the local namespace instead.
So instead of requiring the use of LOAD_GLOBAL (a global lookup),
Python will use LOAD_FAST. So
how much faster is this? Let's do some crude testing:
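A crude benchmark along these lines can be put together with timeit. The function bodies, the helper best_time, and the repeat counts are all assumptions for this sketch. The gap may be much smaller on recent CPython releases, which cache global lookups.

```python
import timeit


def not_list_or_dict(value):
    return not (isinstance(value, dict) or isinstance(value, list))


def not_list_or_dict_fast(value, _isinstance=isinstance, _dict=dict, _list=list):
    return not (_isinstance(value, _dict) or _isinstance(value, _list))


def best_time(func, repeat=5, number=200_000):
    """Best-of-`repeat` wall time for `number` calls of func(50)."""
    return min(timeit.repeat("f(50)", globals={"f": func},
                             repeat=repeat, number=number))


slow = best_time(not_list_or_dict)
fast = best_time(not_list_or_dict_fast)
print(f"global lookups: {slow:.4f}s")
print(f"local lookups:  {fast:.4f}s")
print(f"difference:     {(slow - fast) / slow:+.1%}")
```

Absolute timings depend on the machine and the Python version, so no sample output is shown.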

Or in other words, that's an 11.9% improvement[2]. That's way more than the
5% I promised at the beginning of this post!

There's More to the Story

It's reasonable to think that the speed improvement is because LOAD_FAST
reads from the local namespace whereas LOAD_GLOBAL will first check the
global namespace before falling back to checking the builtin namespace. And in
the example function above, isinstance, dict, and list all come
from the builtin namespace.

However, there's more going on. Not only are we able to skip an additional
lookup with LOAD_FAST, it's also a different type of lookup.

The C code snippet above showed the code for LOAD_GLOBAL, but here's the
code for LOAD_FAST:

We're retrieving the local value by indexing into an array. It's not shown
here, but oparg is just an index into that array.
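The index is visible from Python too. In this sketch (the function pick_second is made up for the purpose), the argument of the LOAD_FAST instruction is a position in the function's tuple of local names rather than a name:

```python
import dis


def pick_second(first, second):
    return second


code = pick_second.__code__
print(code.co_varnames)  # ('first', 'second')

# Grab the single local-variable load in the function body.
load = next(ins for ins in dis.get_instructions(pick_second)
            if ins.opname.startswith("LOAD_FAST"))
# load.arg is the raw oparg: an integer position, not a dictionary key.
print(load.arg, code.co_varnames[load.arg])  # 1 second
```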

Now it's starting to make sense. In our first version, not_list_or_dict
had to perform 4 lookups, and each name was in the builtins namespace, which
we only look at after looking in the globals namespace. That's 8 dictionary
key lookups. Compare that to directly indexing into a C array 4 times,
which is what happens in the second version of not_list_or_dict, where
every lookup uses LOAD_FAST under the hood. This is why lookups in the local
namespace are faster.
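The count of 4 can be checked by tallying LOAD_GLOBAL instructions in each version. The function bodies and the helper count_global_loads are assumptions, and the doubling assumes every name misses in globals and is then found in builtins.

```python
import dis


def not_list_or_dict(value):
    return not (isinstance(value, dict) or isinstance(value, list))


def not_list_or_dict_fast(value, _isinstance=isinstance, _dict=dict, _list=list):
    return not (_isinstance(value, _dict) or _isinstance(value, _list))


def count_global_loads(func):
    """Number of LOAD_GLOBAL instructions in func's bytecode."""
    return sum(ins.opname == "LOAD_GLOBAL" for ins in dis.get_instructions(func))


n_slow = count_global_loads(not_list_or_dict)
n_fast = count_global_loads(not_list_or_dict_fast)
print(n_slow, n_fast)  # 4 0
print(n_slow * 2)      # 8 dictionary lookups: one miss in globals, one hit in builtins, per name
```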

Wrapping Up

Now the next time you see this in someone else's code you'll know what's
going on.

And one final thing. Please don't actually do these kinds of optimizations
unless you really need to. And most of the time you don't need to. But when
the time really comes, and you really need to squeeze out every last bit of
performance, you'll have this in your back pocket.