Dabeaz

Friday, August 12, 2011

An Inside Look at the GIL Removal Patch of Lore

As most Python programmers know, people love to hate
the Global Interpreter Lock (GIL). Why can't it
simply be removed? What's the problem? However, if you've been around the
Python community long enough, you might also know that the
GIL was already removed once before--specifically, by Greg Stein who created
a patch against Python 1.4 in 1996. In fact, here's a link:

This patch often gets mentioned in discussions regarding the
GIL--especially as justification for keeping the GIL. For
example, see the Python
FAQ, the forum post It
Isn't Easy to Remove the GIL, or this mailing list discussion on
Free
Threading. These discussions usually point out that the patch
made the performance of single-threaded applications much worse--so
much so that the patch couldn't be adopted. However, beyond that,
technical details about the patch are somewhat sketchy.

Despite using Python since early 1996, I will
freely admit that I never really knew the details of Greg's GIL
removal patch. For the most part, I just vaguely knew that someone
had attempted to remove the GIL, that it apparently killed the performance
of single-threaded apps, and that it subsequently faded into
oblivion. However, given my recent interest in making the GIL better, I
thought it might be a fun challenge to turn on the time machine,
see if I could actually find the patch, and compile a version of GIL-less Python in order to take a peek
under the covers.

So, in this post, I'll do just that and try to offer some commentary on some of
the more interesting and subtle aspects of the patch--in particular,
aspects of it that seem especially tricky or problematic.
Given the increased interest in concurrency, the GIL, and other matters, I hope
that this information might be of some use to others, or at the very
least, help explain why Python still has a GIL. Plus, as the saying
goes, those who don't study the past are doomed to repeat it. So,
let's jump in.

Python 1.4

In order to play with the patch, you must first download and build
Python-1.4. You can find it on python.org if
you look long enough. After a bit of Makefile twiddling, I was
able to build it and try a few things out on a Linux system.

Using Python-1.4 is a big reminder of how much Python has
changed. It has none of the nice features you're used to using now (list
comprehensions, sets, string methods, sum(),
enumerate(), etc.). In playing with it, I realize that about
half of everything I type results in some kind of error. I
digress.

Thread support in Python-1.4 is equally minimal. The
threading.py module that most people know doesn't yet
exist. Instead, there is just the lower-level thread module
which simply lets you launch a thread, allocate a mutex lock, and not much
else. Here's a small sample:
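
Something along these lines is about all it gives you (a minimal sketch; the
worker function and the lock-based "join" are my own illustration, but the
thread-module calls are the real API):

    import thread

    def worker(name, n):
        # spin for a while, then announce completion
        while n > 0:
            n = n - 1
        print name, 'finished'
        done.release()                   # no join() exists, so signal with a lock

    done = thread.allocate_lock()
    done.acquire()                       # lock starts out held; worker releases it
    thread.start_new_thread(worker, ('worker-1', 1000000))
    done.acquire()                       # blocks until the worker is done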

Under the covers though, the implementation is
remarkably similar to modern Python. Each thread consists of a C
function that runs the specified Python callable and there is a GIL that
guards access to some critical global interpreter state.

A Reentrant Interpreter?

If you read the threading.README file included in the patch,
you will find this description:

These patches enable Python to be "free threaded" or, in other words,
fully reentrant across multiple threads. This is particularly important
when Python is embedded within a C program.

The stated goal of making Python "fully reentrant" is important, so
let's discuss what that means.

All Python code gets compiled down to an intermediate "machine
language" before it executes. For example, consider a simple function
like this:
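
For instance, a countdown loop of the following sort (a minimal stand-in; the
print statement is only there so you can watch it make progress):

    def countdown(n):
        while n > 0:
            print n
            n = n - 1

On a modern Python, you can see the intermediate instructions such a function
compiles to with dis.dis(countdown).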

Now, here's the critical bit--as a general rule, most low-level
interpreter instructions tend to execute atomically. That is, while
executing an instruction, the GIL is held, and no preemption is
possible until completion.

For a large majority of interpreter instructions, the lack of
preemption is rarely a concern because they execute almost
immediately. However, every now and then, a program may execute a
long-running operation. Typically this happens if an operation is
performed on a huge amount of data or if Python calls out to
long-running C code that doesn't release the GIL.

Here is a simple example you can try to see it. Launch the
above countdown function in its own thread like this (using the legacy
thread module).
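
A sketch of the kind of interactive session this describes (the particular
long-running operation is only an illustration and assumes a modern Python 2.x
interpreter; any single call that churns through a huge amount of data while
holding the GIL will do):

    >>> import thread
    >>> # start the countdown, then immediately hog the GIL with one
    >>> # long-running C-level call (sum() never yields while it runs)
    >>> thread.start_new_thread(countdown, (10000000,)); sum(xrange(100000000))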

If you do this, everything should just grind to a halt. You will see
no output from the countdown thread at all and Ctrl-C will be frozen.
In a nutshell, this is the issue with preemption (or lack thereof).
Due to the GIL, it's possible for one thread to temporarily block the progress of
all other threads on the system.

The lack of preemption together with the GIL presents certain challenges when Python is
embedded into a multithreaded C program. In such applications, the
Python interpreter might be used for high-level scripting, but also for
things such as event handling and callbacks. For example, maybe you have some C
thread that's part of a game or visualization package that's calling
into the Python interpreter to trigger event-handler methods in real
time. In such code, control flow might pass from Python, to C, back
to Python, and so forth.

Needless to say, it's possible that in such an environment, you might
have a collection of C or C++ threads that compete for use of the
Python interpreter and are forced to synchronize on the GIL. This means that the interpreter might
become a bottleneck for the whole system. If, somehow, you could get
rid of the GIL, then any thread would be allowed to use the
interpreter without worrying about other threads. For example, a C++
program triggering a Python event callback wouldn't have to concern
itself with other Python threads---the callback would simply run
without being blocked. This is what you get by making the
interpreter fully reentrant.

It is in this context of embedding that the GIL removal patch should
probably be viewed. At the time it was created,
significantly more Python users were involved with integrating Python
with C/C++ applications. In my own area of
scientific computing, people were using Python to build interactive
data visualization tools which often involved heavy amounts
of CPU computation. I knew others who were trying to use
Python for internal scripting of commercial PC video games. For all of
these groups, removal of the GIL (if possible) was viewed as desirable
because doing so would simplify programming and improve the user experience (better
responsiveness of the GUI, fewer stalls, etc.). If you've ever been
sitting there staring at the spinning beachball on your Mac wishing it
would just go away, well, then you get
the general idea.

Exploring the Patch

If you download the free-threading patch, you will find that it is a
relatively small set of files that replace and add functionality to Python-1.4. Here is
a complete list of the modified files included in the patch:

As you can see, it's certainly not a rewrite of the entire
interpreter. In fact, if you run a diff across the Python-1.4 source
and the patch, you'll find that the changes amount to about 1000 lines
of code (in contrast, the complete source code to Python-1.4 is about
82000 lines as measured by 'wc').

I won't go into the details of applying or compiling the patch
except to say that detailed instructions are included in the README should you want
to build it yourself.

Initial Impressions

With the patch applied, I tried to do a few rough performance tests (note: I
ran these under Ubuntu 8.10 on a dual-core VMWare Fusion instance
running on my Mac). First, let's just write a simple spin-loop and see what happens:
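
The exact benchmark doesn't matter much; something like the following tight
loop, timed with the time module, captures the idea (a sketch, not the
original test script):

    # spin.py : single-threaded spin-loop benchmark (illustrative)
    import time

    def count(n):
        while n > 0:
            n = n - 1

    start = time.time()
    count(10000000)
    print 'elapsed:', time.time() - start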

Here, the GIL-less Python is only about 4 times slower. Now, I'm just
slightly more impressed.

To test threads, I wrote a small sample that subdivided the work
across two worker threads in an embarrassingly parallel manner (note: this
code is a little wonky due to the fact that Python-1.4 doesn't
implement thread joining--meaning that you have to do it yourself with
the included binary-semaphore lock).
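
Roughly, the idea is something like this, with each worker handling half of
the iterations and a pair of locks standing in for the missing join() (a
sketch under those assumptions, not the original script):

    # spin2.py : the same work split across two threads (illustrative)
    import thread, time

    def count(n, done):
        while n > 0:
            n = n - 1
        done.release()                    # signal completion

    done1 = thread.allocate_lock(); done1.acquire()
    done2 = thread.allocate_lock(); done2.acquire()

    start = time.time()
    thread.start_new_thread(count, (5000000, done1))
    thread.start_new_thread(count, (5000000, done2))
    done1.acquire()                       # wait for both workers to finish
    done2.acquire()
    print 'elapsed:', time.time() - start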

If you run this code with the GIL, the execution time is about
2.5 seconds or approximately 1.3 times slower than the single-threaded
version (1.9 seconds). Using the GIL-less Python, the execution time is
18.5 seconds or approximately 1.45 times slower than the
single-threaded version (12.7 seconds). Just to emphasize, the GIL-less Python
running with two threads is more than 7 times slower than the
version with a GIL.

Ah, but what about preemption you ask? If you return to the example
above in the section about reentrancy, you will find that removing the
GIL does, indeed, allow free threading and long-running calculations
to be preempted. Success!

Needless to say, there might be a few reasons why the patch quietly
disappeared.

Under the Covers

Okay, the performance is terrible, but what is actually going on
inside? Are there any lessons to be learned? A look at the source
code and related documentation reveals all.

Capturing Thread State

For free-threading to work, each thread has to isolate its
interpreter state and not rely on C global variables. The threading patch does this by defining a new
data structure such as the following:
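
In rough outline, it is a C structure along these lines (the field names here
are approximations based on the description that follows, not the exact
declarations from the patch):

    /* Per-thread interpreter state (sketch) */
    typedef struct _PyThreadState {
        /* formerly global exception state */
        PyObject *curexc_type;
        PyObject *curexc_value;
        PyObject *curexc_traceback;

        /* formerly global tracing/profiling hooks */
        PyObject *sys_tracefunc;
        PyObject *sys_profilefunc;

        /* assorted other former globals */
        PyObject *sort_comparefunc;  /* compare callback used by list sorting */
        int c_error;                 /* error flag used by the complex number code */
        /* ... */
    } PyThreadState;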

Essentially, all global variables in the interpreter have been picked
up and moved into a per-thread data structure. Some of these values are obvious
candidates such as exception information, tracing, and profiling
hooks. Other parts are semi-random. For example, there is storage
for the compare callback function used by list sorting and a global
error handling variable (c_error) used to propagate errors across
internal functions in the implementation of complex numbers.

To manage multiple threads, the interpreter builds a linked-list of
all active threads. This linked list contains the thread-identifier
along with the corresponding PyThreadState
structure. For example:
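
Something like this (again, a sketch of the arrangement rather than the exact
code):

    /* One entry per active thread (sketch) */
    typedef struct _threadstate_entry {
        long thread_id;                       /* identifier of the thread */
        PyThreadState *state;                 /* that thread's interpreter state */
        struct _threadstate_entry *next;      /* next entry in the list */
    } threadstate_entry;

    static threadstate_entry *thread_states = NULL;   /* head of the list */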

Whenever a thread wants to get its per-thread state information, it
simply calls a function PyThreadState_Get(). This function
scans the linked-list searching for the caller's thread-identifier.
When found, the matching thread state structure is moved to the front
of the linked list and the value returned. Here is a short example
that illustrates the usage, with the relevant bits highlighted:
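
The pattern looks roughly like this (modeled on the exception-setting code,
with field names matching the sketch above; not the literal patch source):

    void
    PyErr_SetObject(PyObject *exception, PyObject *value)
    {
        PyThreadState *pts = PyThreadState_Get();  /* <-- find this thread's state */

        Py_XINCREF(exception);
        Py_XINCREF(value);
        Py_XDECREF(pts->curexc_type);
        Py_XDECREF(pts->curexc_value);
        pts->curexc_type = exception;       /* <-- store into per-thread state, */
        pts->curexc_value = value;          /*     not into a C global          */
    }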

Bits and pieces of the thread state code live on in Python 3.2 today. In
particular, per-thread state is captured in a similar data structure and
there are functions for obtaining the state. In fact, there is even a
function called PyThreadState_Get().

Fine-Grained Locking of Reference Counting

Memory management of Python objects relies on reference counting. In
the C API, there are macros for manipulating reference counts:
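
They boil down to unprotected adjustments of each object's ob_refcnt field,
roughly like this (simplified; the Python-1.4 spellings differ slightly):

    #define Py_INCREF(op)   ((op)->ob_refcnt++)
    #define Py_DECREF(op)                                   \
            if (--(op)->ob_refcnt != 0)                     \
                    ;                                       \
            else                                            \
                    _Py_Dealloc((PyObject *)(op))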

These macros, along with their NULL-pointer safe variants Py_XINCREF and
Py_XDECREF, are used throughout the Python source. A quick
search of the Python-1.4 source reveals about 250 uses.

With free-threading, reference counting operations lose their
thread-safety. Thus, the patch introduces a global reference-counting mutex
lock along with atomic operations for updating the count. On Unix,
locking is implemented using a standard
pthread_mutex_t lock (wrapped inside a PyMutex structure) and the following functions:
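
In outline, the machinery looks like this (the function names Py_SafeIncr and
Py_SafeDecr are placeholders of mine; PyMutex_Lock and PyMutex_Unlock are the
names that actually appear in the patch):

    static PyMutex *refcnt_mutex;          /* one global lock for all refcounts */

    void Py_SafeIncr(int *refcnt)          /* placeholder name */
    {
        PyMutex_Lock(refcnt_mutex);
        (*refcnt)++;
        PyMutex_Unlock(refcnt_mutex);
    }

    int Py_SafeDecr(int *refcnt)           /* placeholder name */
    {
        int result;
        PyMutex_Lock(refcnt_mutex);
        result = --(*refcnt);
        PyMutex_Unlock(refcnt_mutex);
        return result;                     /* caller deallocates when this hits 0 */
    }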

The Py_INCREF and Py_DECREF macros are then
redefined to use these thread-safe functions.

On Windows, fine-grained locking is achieved by redefining
Py_INCREF and Py_DECREF to use
InterlockedIncrement and
InterlockedDecrement calls (see Interlocked
Variable Access [MSDN]).
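
That is, roughly (assuming the reference count field can be treated as a long):

    #define Py_INCREF(op)   InterlockedIncrement(&(op)->ob_refcnt)
    #define Py_DECREF(op)                                            \
            if (InterlockedDecrement(&(op)->ob_refcnt) != 0)         \
                    ;                                                \
            else                                                     \
                    _Py_Dealloc((PyObject *)(op))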

On Unix, it must be emphasized that simple reference count
manipulation has been replaced by no fewer than three function calls,
plus the overhead of the actual locking. It's far more expensive.

As a performance experiment, I decided to comment out the
PyMutex_Lock and PyMutex_Unlock calls and run the
interpreter in an unsafe mode. With this change, the execution time of my
single-threaded 'spin-loop' dropped from 12.7 seconds to about 3.9 seconds. The threaded version dropped from 18.5 seconds to about 4 seconds. [ Note: corrected due to an unnoticed build error when trying this experiment initially ].

Clearly fine-grained locking of reference counts is the major
culprit behind the poor performance, but even with the locking taken away,
reference counting is still very sensitive to any kind of extra overhead
(e.g., an extra function call). In this case, the interpreter is still about
twice as slow as Python with the GIL.

Locking of Mutable Builtins

Mutable builtins such as lists and dicts need their own locking to
synchronize modifications. Thus, these objects grow an extra lock
attribute per instance. For example:
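
For the list object, the change is along these lines (a sketch of the idea,
not the literal diff):

    typedef struct {
        PyObject_VAR_HEAD
        PyObject **ob_item;
        PyMutex *lock;        /* NEW: per-instance lock, drawn from a shared pool */
    } PyListObject;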

Virtually all underlying methods (append, insert, setitem, getitem,
repr, etc.) then use the per-instance lock under the covers.

An interesting aspect of the implementation is the way that locks are
allocated. Instead of allocating a dedicated mutex lock for each
list or dictionary, the interpreter simply keeps a small pool of
available locks. When a list or dict first needs to be locked, a lock
is taken from the pool and used as long as it is needed (until the instance
is no longer being manipulated by any threads). At this point, the
lock is simply released back to the pool.
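
A sketch of how such a pool might be managed (the function names and pool
layout here are my own illustration, not lifted from the patch):

    #define LOCK_POOL_SIZE 32

    static PyMutex *lock_pool[LOCK_POOL_SIZE];  /* idle locks available for reuse */
    static int lock_pool_count = 0;
    static PyMutex *pool_mutex;                 /* guards the pool itself */

    static PyMutex *
    acquire_instance_lock(void)
    {
        PyMutex *m;
        PyMutex_Lock(pool_mutex);
        if (lock_pool_count > 0)
            m = lock_pool[--lock_pool_count];   /* reuse a pooled lock */
        else
            m = PyMutex_New();                  /* hypothetical allocator */
        PyMutex_Unlock(pool_mutex);
        return m;
    }

    static void
    release_instance_lock(PyMutex *m)
    {
        PyMutex_Lock(pool_mutex);
        if (lock_pool_count < LOCK_POOL_SIZE)
            lock_pool[lock_pool_count++] = m;   /* hand it back for reuse */
        else
            PyMutex_Free(m);                    /* hypothetical deallocator */
        PyMutex_Unlock(pool_mutex);
    }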

This scheme greatly reduces the number of needed locks. Generally
speaking, the number of locks is about the same as the number of
active threads. Deeply nested data structures (e.g., lists
of lists of lists) may also increase the number of locks needed if certain
recursive operations are invoked. For example, printing a deeply
nested data structure might cause a lock to be allocated for each level
of nesting.

Locking of mutable containers does not involve anything more
sophisticated than a mutex lock. For example, no attempt has been
made to utilize reader-writer locks.

Locking of Various Internal Operations

In various places throughout the interpreter, there are low-level
book-keeping operations, often related to memory management and optimization.
Their implementation often relies upon the use of unsafe C static variables.

For this, the patch defines a single mutex lock dedicated to such
critical sections, along with a pair of C macros for acquiring and
releasing it.
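
In essence (the macro names here are illustrative; the patch's own spellings
may differ):

    static PyMutex *critical_mutex;     /* one lock shared by all these sections */

    #define BGN_CRITICAL_SECTION()  PyMutex_Lock(critical_mutex)
    #define END_CRITICAL_SECTION()  PyMutex_Unlock(critical_mutex)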

As an example of use, consider the int object. Due to the fact that
integers are used so frequently, the underlying C code uses a custom
memory allocator and other tricks to avoid excessive calls to C's
malloc and free functions. Here is an example of
the kind of thing you see in the patch (changes highlighted):
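
Roughly speaking, allocation of an int from the shared free list gets wrapped
in the critical-section macros, something like this (a sketch following the
macro names above, not the literal diff):

    PyObject *
    PyInt_FromLong(long ival)
    {
        register PyIntObject *v;

        BGN_CRITICAL_SECTION();                  /* NEW: the free list is shared */
        if (free_list == NULL &&
            (free_list = fill_free_list()) == NULL) {
                END_CRITICAL_SECTION();          /* NEW */
                return NULL;
        }
        v = free_list;
        free_list = (PyIntObject *)v->ob_type;   /* free-list link hides in ob_type */
        END_CRITICAL_SECTION();                  /* NEW */

        v->ob_type = &PyInt_Type;
        v->ob_ival = ival;
        _Py_NewReference((PyObject *)v);
        return (PyObject *)v;
    }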

A quick analysis of the patch shows that there are about 20 such
critical sections that have to be protected. This includes low-level
code in the implementation of ints, tuples, dicts as well as code
generally related to the runtime of the interpreter (e.g., module
imports, signal handling, sys module, etc.).

Although all of these sections share the same lock, the associated
overhead appears to be negligible compared to that of reference
counting.

Other Tricky Bits

Although the patch addresses some fundamental issues needed to make
free-threading work, it is only a small start. In particular, no
effort has been made to verify the thread-safety of any standard
library modules. This includes a large body of C code that would have
to be audited in detail to identify and fix potential race
conditions.

Certain parts of the Python implementation also remain problematic.
For example, certain low-level C API functions such as
PyList_GetItem() and PyDict_GetItem() return
borrowed references (i.e., references returned without increasing the
object's reference count). Although it seems remote, there is a possibility that such
functions could return a reference to an object that then gets destroyed by
another thread.

Final Words

Looking at the patch has been an interesting trip through history,
but is there anything to learn from it? This is by no means an
exhaustive list, but a few thoughts come to mind:

Reference counting is a really lousy memory-management technique
for free-threading. This was already widely known, but the performance
numbers put a more concrete figure on it. This will definitely be
the most challenging issue for anyone attempting a GIL removal patch.

For mutable types, you need per-instance locking. However,
through clever lock management, you can probably do it with a
relatively small number of locks (proportional to the number
of threads) as opposed to actually putting a dedicated lock on every
instance. The performance impact of such locking warrants further
study--especially given the heavy use of dictionaries throughout the
interpreter.

Various internal parts of the interpreter will need locking, but
such locking doesn't appear to have as much of an impact as one might
expect (at least I wasn't able to measure a huge performance
hit due to it).

Python 3 already includes a few critical pieces needed to make
free-threading work. In particular, there are data structures that
capture per-thread state and functions for obtaining that state.
Because of that, if you were to eliminate the GIL, most of the
effort would tend to focus on locking as opposed to isolating
state.

Even though you might be able to patch the interpreter with a
small amount of code, verifying the thread safety of all standard library
modules (both Python and C code) will probably be a daunting
(and possibly never-ending) endeavor.

Despite removing the GIL, I was unable to produce any
performance experiment that showed a noticeable improvement on
multiple cores. Really, the only benefit (ignoring the horrible
performance) seen in pure Python code was having preemptible
instructions.

That's about it for now. I hope you found this trip through time
interesting. In a future installment, I'll explore the problem of
pushing locked reference counting as far as it can possibly go. As a
preview, a simple patch involving less than a dozen lines of code
makes the whole GIL-less Python run more than twice as fast. However,
can it go even faster? Stay tuned.