PEP: async
Title: Asynchronous Callbacks and Intrinsic Language Parallelism
Version: 0.1
Last-Modified: $Date$
Author: Trent Nelson <trent@trent.me>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 02-Dec-2012
Python-Version: 3.x
Abstract
========
The goal of this PEP is to produce a Python interpreter that, through
the introduction of new language constructs and library interfaces,
will allow code to be written that automatically exploits all
available CPU cores.
It proposes to achieve this by introducing two inter-related concepts:
new parallel language constructs and a new asynchronous library.
Rationale
=========
More and more cores are readily becoming available in both consumer
hardware and enterprise servers. Python does not have a way to
naturally exploit multiple cores. This PEP tries to address this
issue by introducing language support for parallel constructs and
asynchronous callbacks.
Relationship to Existing Efforts, PEPs and Design Discussions
=============================================================
Asynchronous IO became a hot topic on python-ideas during September,
October and November of 2012. One output of those discussions was
tulip, a new async IO library by Guido van Rossum. There is also
PEP-3153, 'Asynchronous IO Support', which tackles some of the issues
regarding async IO.
With regards to parallelism, the GIL has always been a contentious
issue, and the mail archives are filled with numerous attempts to
remove it (or simply lament its performance impact).
The reason the proposal here is being submitted in its own PEP is
because it specifically addresses the two issues in concert: handling
asynchronous events *and* providing truly concurrent execution of
Python code within a single interpreter.
It's also worth noting that the proposed solution is low-level enough
that Tulip, Twisted, Tornado, PEP-3153 and any other library can
easily leverage the primitives introduced by this PEP.
Use Cases
=========
Primary - Event Driven
----------------------
There are two use cases this PEP addresses. The primary use case is
writing software that is event-driven; for example, an SMTP or HTTP
server, which waits for client connections and then "services" them.
Note that the emphasis is on event-driven, not "server software"; a
load-testing client should be just as able to spin up tens of
thousands of connections to a target server as that same server is
able to handle them.
The goal of this PEP for the primary use case is to allow Python to
utilise the underlying system resources as fully as possible; if
a high-performance, multi-threaded native C/C++ library can handle
65k clients and saturate multiple gigabit links (by utilising all
cores), then Python should be able to come as close to this as
possible (minus the interpreter overhead).
Secondary - Parallel Task/Data Decomposition
--------------------------------------------
The secondary use case is software that is not primarily driven by
external events or IO (although it may still perform such work:
reading files, connecting to a database, etc.), but that may deal
with "embarrassingly parallel" data structures, and thus would also
benefit from parallel language constructs. Such constructs would be
things like native map/reduce support and parallel task execution.
The goal of this PEP for the secondary use case is to provide elegant
language primitives that allow code to be written in such a way that
all available cores will be exploited. This use case differs from the
first in that it will be primarily a single thread of execution that
periodically runs in parallel, only to collapse back down to a single
thread of execution again. Compare this to the event-driven example,
where server/client software revolves around an event loop that will
facilitate parallel execution automatically (based on the receipt of
events), without requiring the programmer to explicitly delineate
"parallel" sections of code.
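For comparison, the fan-out/collapse shape of the secondary use case
can be sketched today with the stdlib's ``concurrent.futures`` (the
function names below are illustrative, not part of this proposal).
Note that under the current GIL, a thread pool gives you this *shape*
but not true multicore execution for CPU-bound work, which is
precisely the gap this PEP targets:

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(n):
    # self-contained unit of work: input via arguments, output via return
    return n * n

def sum_of_squares(data, workers=4):
    # fan out the kernel across workers, then collapse back to one result
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(kernel, data))
```

The proposed language constructs would make this single-thread /
fan-out / collapse pattern intrinsic, rather than mediated by an
explicit pool object.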
Requirements
============
It is important to establish the requirements for such a significant
language change up-front, before enumerating possible solutions. (It
may even be worth separating these requirements out into their own
informational PEP, which would be evaluated and approved
independently of any solutions.) Agreeing on a set of requirements
for asynchronous
and parallel primitives allows multiple solutions to be evaluated
objectively.
Proposed solutions, including the one in this PEP, should include a
traceability matrix that maps each requirement to satisfying design
and implementation details/features.
1. The solution must not introduce a performance penalty of more than
4% when running existing Python code that hasn't been converted to
take advantage of the new primitives. (4% has been picked as it
was the overhead introduced by the garbage collector implementation
circa Python 2.0.) This is often referred to in GIL-removal threads
as "performance must not be worse for single-threaded code".
2. The solution should scale linearly in proportion to the number of
cores available. Scale in this context refers to both performance
(throughput, connection concurrency and overall latency) as well as
resource usage (memory overhead).
3. In addition to scaling linearly, there should not be an excessive
performance penalty for code written against the new primitives if
there is only one core available.
4. The solution must not rely on major backwards-incompatible changes
to the public C API, especially with regards to the GIL. That is,
the solution should be compatible with existing extension code that
relies on GIL semantics (such as acquiring and releasing the GIL).
5. The solution may introduce minor changes to the C API that extension
modules will have to accommodate in order to compile against the
target version of Python, even if said module uses none of the new
primitives. The solution should attempt to keep such changes to
the absolute minimum (i.e. adding #defines, possibly calling new
CPython initialization methods when loading, etc).
6. The solution will not "automagically parallelize" existing code.
6.1 Specifically, the solution will not suddenly allow concurrent
execution of threading.Thread() instances. These objects
(and their underlying Python object counterparts) will still
be subject to the GIL.
6.2 Parallelization will only apply to new code written against
the new async/parallel primitives.
7. The new asynchronous and parallel primitives can be used alongside
legacy code without incurring any additional performance penalties.
(Legacy code in this context refers to IO multiplexing via select/
poll/epoll/kqueue and handling "blocking" code by deferring it to
threading.Thread() instances.)
8. Although the solution will present a platform-agnostic interface to
the new primitives, it shall be designed against the platform with
the best native support for underlying asynchronous and parallel
primitives. This allows such a platform (or platforms) to run at
their full potential. On platforms with lesser support for such
primitives, the solution should mimic the semantics of the interface
using traditional techniques for "simulating" asynchronicity (i.e.
select/poll, blocking threads, etc).
9. The solution should be low-level enough to facilitate easy adoption
by existing libraries such as Twisted and Tornado, yet high-level
enough to be pleasant to work with directly without the need for
such libraries.
10. The solution should not incur excessive cognitive overhead; the
average Python programmer should be able to easily grok the
core concepts and start writing high-performance code within a
single sitting. It should not expose leaky-abstractions that
require substantial background knowledge in parallel programming
or socket/file multiplexing.
11. The parallel aspect of the solution must offer tangible benefits
over and above those provided by the multiprocessing module, which
already facilitates concurrency through the use of multiple Python
processes.
12. The solution may introduce new restrictions or constraints against
code written to use the new primitives, subject to the following:
* The constraint is essential for meeting the requirements
outlined in this section (i.e. performance, maintaining
GIL semantics, etc).
* The constraint should be logical to the Python programmer
within the context of asynchronous/parallel execution.
For example, a constraint that says a nonlocal variable
cannot be affected from within a parallel execution context
is reasonable within the context of parallel programming.
Disallowing the use of generators, list comprehensions or
integer arithmetic within a parallel execution context,
however, is not.
* Any violation of the new constraints/restrictions shall be
detected and propagated as exceptions. (If this is not
possible for performance reasons, then the cost of silent
errors needs to be weighed against the performance benefits
on a case-by-case basis.)
Impact of Requirements on Solution Design
=========================================
Implementation Constraints
--------------------------
The following requirements significantly constrain the possible ways a
design could be implemented:
* Negligible performance penalties for existing code (or code
written using the new parallel and asynchronous primitives
but still executing on a single core) [R1, R3, R7].
* No change to opaque GIL semantics (from the perspective of
extension modules) [R4, R5].
These requirements rule out any solutions that attempt to remove the
GIL (in favour of fine-grained locking, for example). Such experiments
have always resulted in code that runs orders of magnitude slower in
a single-threaded context (due to the overhead incurred by locking
and unlocking mutexes).
None of these requirements are new concepts -- every time GIL removal
comes up for discussion, these requirements are cited as reasons why
a given proposal won't be feasible.
However, the common trait of past proposals is that they focused on
the wrong level of abstraction: threading.Thread(). Discussions always
revolve around what can be done to automatically make threading.Thread
instances run concurrently.
threading.Thread() is a Dead End
--------------------------------
That line of thinking is a dead end. In order for existing
threading.Thread() code to run concurrently, you'd need to replace the GIL with
fine-grained locking, such that only one thread could access the GC
and memory allocation routines at a time (specifically with regards
to reference counting). And fine-grained locking is a dead end due
to the overhead incurred by single-threaded code.
Formalizing Concurrency Entry and Exit Points
---------------------------------------------
This is the rationale behind [R6]: threading.Thread() will not become
magically concurrent, nor will any concurrency be observed on multiple
cores unless the new primitives are explicitly used.
That requirement is significant, because it prescribes the guidelines
and constraints under which true concurrency can be achieved:
* The GIL must be held.
* The new primitives must be used.
Performance Requirements
------------------------
[R11] stipulates that the solution must offer tangible benefits over
and above what could be achieved through multiprocessing. In order
to meet this requirement, the solution will need to leverage the most
optimal multicore/threading programming paradigms, such as eliminating
read/write contention for shared data, using lockless data structures
where possible, and reducing context switching overhead incurred when
excessive threads are all vying for attention.
So, the solution must be performant, and that performance must be
achieved within the confines of holding the GIL.
The Importance of New Language Constraints
------------------------------------------
However, the final requirement, [R12], affords us the ability to
introduce new language constraints within the context of the new
primitives referred to in [R6/7], as long as the constraints are:
- Acceptable within the context of parallel execution.
- Essential for achieving both high-performance and meeting the
other strict guidelines (maintain GIL semantics, etc).
The notion of what's considered an 'acceptable' constraint will
obviously vary from person to person, and will undoubtedly be best
suited to a BDFL-style pronouncement. It's worth noting, though,
that the community's willingness to accept certain constraints will
be directly proportional to the performance improvements said
constraints offer.
For example, a possible constraint may be that Python code executing
within a parallel context is not able to affect any global/nonlocal
variables.
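One way such a violation could be detected and propagated as an
exception (per [R12]) is a bytecode scan of the callable before it
enters the parallel context. The sketch below is purely
illustrative, uses today's ``dis`` module, and all names in it are
hypothetical, not part of this proposal:

```python
import dis

class ParallelConstraintError(Exception):
    """Hypothetical error raised when code breaks a parallel-context rule."""

def assert_parallel_safe(func):
    # Reject callables that write to global or closed-over variables;
    # reads are permitted in this sketch, mirroring the constraint above.
    forbidden = {'STORE_GLOBAL', 'DELETE_GLOBAL', 'STORE_DEREF'}
    for instr in dis.get_instructions(func):
        if instr.opname in forbidden:
            raise ParallelConstraintError(
                '%s mutates nonlocal state (%s)' % (func.__name__,
                                                    instr.opname))
    return func
```

A real implementation would more likely enforce this at runtime in
the interpreter loop, but the shape of the check is the same.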
Solution Proposal - Overview
============================
In order to help solidify some of the concepts being presented, here's
the general overview of the solution the PEP author has in mind. Keep
in mind it's only an example, and it is being presented in order to
stimulate ideas and further discussion.
The essence of the idea is this: in addition to the main interpreter
thread, an additional thread is created for each available CPU core.
These threads idle/wait against a condition variable by default. That
is, when there's nothing to be done in parallel, they do nothing.
The main interpreter thread's behaviour when executing non-parallel code
is identical to how it behaves now when executing normal Python code;
all GIL, memory and GC semantics are preserved.
Because we have introduced new language primitives, we can detect
entry to and exit from parallel execution contexts at the interpreter
level (i.e. in ``PyEval_EvalFrameEx``), via specific op-codes, for
example.
Upon detecting entry to a parallel context, the main interpreter
thread preps any relevant data structures, then signals to all other
threads to begin execution. The other threads are essentially frame
execution pipelines, performing the same role as the main
``PyEval_EvalFrameEx`` function, only customized to run in parallel.
It will be helpful to visualize the sort of Python code that these
frame pipelines will execute. Remember that there are two use cases
this PEP targets: event-driven clients/servers, and parallel data/task
decomposition.
In either use case, the idea is that you should be able to separate
out the kernels of logic that can be performed in parallel (or
concurrently, in response to an external event). This is analogous to
GPU programming, where one loads a single "kernel" of logic (a C
function) into hundreds or thousands of GPU hardware threads, which
then perform the action against different chunks of data. The logic
is self-contained; once primed with input data, it doesn't rely on
anything else to produce its output data.
A similar approach would be adopted for Python. The kernel of logic
would be contained within a Python callable. The input would simply
be the arguments passed to the callable. The optional output would be
whatever the callable returns, which would be utilized for parallel
task/data decomposition algorithms (i.e. map/reduce). For the event
driven pattern, in lieu of a return value, the callable will invoke
one of many "parallel safe" methods in order to move the processing of
the event to the next state/stage. Such methods will always return
immediately, enqueuing the action to be performed in the background.
It is these callables that will be executed concurrently by the other
"frame pipeline" threads once the main thread signals that a parallel
context has been entered.
The concurrent frame pipelines (modified versions of
``PyEval_EvalFrameEx``) simply churn away executing these callables.
Execution of the callable is atomic in the sense that, once started,
it will run 'til completion.
Thus, presuming there are available events/data, all cores will be
occupied executing the frame pipelines.
This will last for a configurable amount of time. Once that time
expires, no more callables will be queued (or the threads will sleep
via some other mechanism), and the main thread will wait 'til all
parallel execution has completed.
Upon detecting this, it performs any necessary "post-parallel"
cleanup, releases the GIL (giving other threads and pending signals a
chance to be serviced), re-acquires it, and then, presuming there is
more parallel computation to be done, starts the whole process again.
Thus, the program flow can be viewed as a constant switch between
single-threaded execution (indistinguishable from current interpreter
behaviour) and explicit parallel execution, where all cores churn away
on a modified ``PyEval_EvalFrameEx`` that has been primed with a
"parallel" callable.
During this parallel execution, the interpreter will periodically take
breaks and switch back to the single-thread model, release the GIL, do
any signal handling and anything else it needs to in order to play
nice with legacy code (and extensions). Once it acquires the GIL
again, it can un-pause the parallel execution, and all cores continue
executing the parallel callables.
As long as the parallel context is still active (i.e. hasn't been
explicitly exited), this cycle repeats over and over. Single thread,
to concurrent parallel callable execution, to single thread, back to
concurrent parallel callable execution, etc.
Exiting a parallel context will probably require slightly different
behaviour depending on whether it was an event driven context, or
parallel task/data decomposition. For the latter, it is likely that
some sort of a "parallel finalization" method needs to be run (like
the "reduce" part of map/reduce), typically against the "output data"
each parallel callable "returned" (i.e. via a Python ``return foo``).
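As a sequential model of this flow (all names below are hypothetical;
in the real design each chunk's kernel would run on its own
core-bound pipeline thread), the kernel/finalization shape looks
like:

```python
from functools import reduce

def kernel(chunk):
    # side-effect free: everything it needs arrives as arguments,
    # and its only "output" is the return value
    return sum(chunk)

def parallel_context(data, pipelines=4):
    # split the data, run the kernel per chunk ("parallel" phase),
    # then run the finalization step (the "reduce" of map/reduce)
    chunks = [data[i::pipelines] for i in range(pipelines)]
    partials = [kernel(c) for c in chunks]
    return reduce(lambda a, b: a + b, partials)
```

The exit from the parallel context corresponds to the ``reduce`` call:
only once every kernel has returned does the single main thread
combine the per-callable results.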
Concurrent Frame Execution Pipelines and Memory, GC and Ref. Counting
---------------------------------------------------------------------
Those familiar with CPython internals will have probably noticed a
flaw in the logic presented above: you can't have multiple threads
evaluating Python frames (i.e. ``PyEval_EvalFrameEx``) concurrently because
of the way Python currently allocates memory, counts references and
manages garbage collection.
That is 100% correct. You can't run the interpreter concurrently
because none of the memory allocation, reference counting and garbage
collection facilities are thread safe. Incrementing a reference to an
object translates to a simple integer increment; there is no mutex
locking and unlocking, no atomic ops or memory barriers protecting
critical sections. In fact, not only are they not thread safe, the
assumption that there is only one, global thread of execution pervades
every aspect of the implementation.
It is *this* aspect of CPython that limits concurrent execution; not
the GIL. The GIL is simply the tool used to enforce this pervasive
assumption that there is only a single global thread of execution.
Again, it's worth noting that prior attempts at achieving concurrency
were done by adding all sorts of fine grained locking around critical
sections of memory allocation, ref. counting and GC code.
That has a disastrous impact on CPython's performance. Incrementing
and decrementing a C integer variable now becomes a dance involving
mutex locking and unlocking. Combine that with the fact those two
actions are probably the single most frequently performed action by a
CPython interpreter, and you can appreciate how futile any approach
relying on fine-grained locking is.
If fine-grained locking is out, how can we allow multiple CPU cores
to execute a (modified) ``PyEval_EvalFrameEx`` concurrently without breaking
all of the existing memory allocation, reference counting and garbage
collection framework?
The idea offered by this PEP is as follows: each parallel thread is
set up such that it thinks it's using the normal global malloc/refcnt
facilities, when in fact it has a localized instance.
All object allocation, deallocation, reference counting and garbage
collection that takes place during concurrent ``PyEval_EvalFrameEx`` execution
is localized to that thread, and we maintain the illusion of a single
global thread of execution.
We can do this because our design entails only ever having a single
"parallel frame execution pipeline" thread for each available core.
This means that the thread's execution won't ever be interrupted by
another identical thread trying to access the same memory facilities.
(Additionally, we can set the thread affinity such that it always runs
on the same CPU core. This allows us to benefit from cache locality,
among other things.)
Thus, all of the existing memory allocation, reference counting and
garbage collection facilities can be used without modification by
each thread's ``PyEval_EvalFrameEx`` pipeline. As everything is local to that
thread/CPU, no locking is required around ref. counting primitives.
No locking means no additional overhead.
In fact, because each parallel thread will only ever be executing
short-lived "parallel friendly" callables, we could omit the garbage
collection functionality completely in favour of free()'ing allocated
memory as soon as the refcnt hits 0.
Or we could even experiment with omitting reference counting entirely
and just automatically free all memory upon completion of the parallel
callable. After all, the callable shouldn't be affecting any global
(nonlocal) state (i.e. it's side-effect free), and as we're providing
well defined mechanisms for communicating the results of execution
back into the program flow (via ``return ...`` or calling other
"parallel-safe" methods), the lifetime of all other objects allocated
during the callable ends as soon as that callable completes.
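The isolation property being described here (per-thread facilities
that never need locking because no other thread can see them) can be
illustrated from Python with ``threading.local``. This is only an
analogy; the proposal concerns C-level allocation and refcounting
machinery, not Python-level storage:

```python
import threading

state = threading.local()      # each thread sees its own attributes

def pipeline(tid, results):
    # "allocations" land in thread-local storage; no locks are needed
    # because no other thread can ever observe this thread's objects
    state.objects = []
    for i in range(3):
        state.objects.append((tid, i))
    results[tid] = list(state.objects)

results = {}
threads = [threading.Thread(target=pipeline, args=(t, results))
           for t in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each pipeline builds its own list without any synchronization, yet
the results never interleave: that is the illusion of a single global
thread of execution, maintained per thread.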
The Constraints Imposed by Thread-Local Object Allocation
---------------------------------------------------------
Remember that [R12] says that constraints may be introduced by the new
parallel facilities as long as the following conditions are met:
- The constraint is essential for performance reasons.
- The constraint is logical to your average Python programmer.
- The benefit yielded by concurrent execution outweighs the
inconvenience of the constraint.
The implication of thread-local object allocation facilities is that
parallel code will have no visibility to nonlocal/global objects.
Thus, parallel callables will have to rely solely on their function
arguments providing them with everything they need to perform their
work. Likewise, they can't affect global variable state, they can
only communicate the results of their execution back to the main
program flow by either returning results (via Python ``return``),
or calling one of the "parallel friendly" methods to push execution
onto the next state/stage.
At surface level, these constraints hopefully meet the criteria in
[R12]; they're essential for performance, they're logical within the
context of parallel programming, and the benefit of concurrency
outweighs the cost of writing "parallel friendly" callable code.
There will undoubtedly be more constraints that surface once a working
prototype has been implemented, but for now, it's probably safe to
assume the biggest one will be the limited view of the universe (from
the perspective of available/known objects) presented to code that
executes within a parallel callable.
The Importance of Never Blocking
================================
The vast majority of discussion so far has focused on the parallel
aspect of this PEP. That is, introducing new primitives that allow
Python code to run concurrently across multiple cores as efficiently
as possible.
The term "as efficiently as possible", within the context of a Python
interpreter executing within a parallel context (i.e. with all CPU
cores running parallel callables), means the following:
- Each CPU core is spending 100% of its time in userland,
executing the modified ``PyEval_EvalFrameEx`` pipeline.
- There is no cross-talk between cores; code in core 1 doesn't
try to communicate with code in core 2 half way through the
parallel callable.
- There is no read/write contention for shared data, which means
no need for synchronization primitives (at least in the most
common code paths).
- The code within the callable never blocks.
The first three points are addressed in the previous section. The
final point, that callable code never blocks, is discussed in this
section.
Blocking, in this context, refers to a thread calling something that
doesn't immediately/always return in a deterministic fashion. System
calls that involve writing and reading to sockets/files are examples
of calls that can "block".
Once blocked, a thread is useless. Not only does execution of the
current parallel callable come grinding to a halt -- the entire
pipeline stalls. Execution only resumes when the thread "unblocks",
typically when underlying IO has been completed.
Because our implementation relies heavily on the notion of binding a
single thread to each CPU core, when one of these parallel pipeline
threads stalls, the entire CPU core stalls -- there are no other
threads ready and waiting in the wings to pick up execution.
So, blocking must be avoided at all costs. Luckily, this is far from
being a foreign concept -- the ability to avoid blocking is critical
to the success of all networking/event libraries.
Different platforms provide different facilities for avoiding blocking
system calls. On POSIX systems, programs primarily rely on setting a
socket or file descriptor to "non-blocking"; alternatively, a blocking
call can be deferred to a separate thread, which allows the main
program flow to continue unimpeded.
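The POSIX "non-blocking" approach mentioned above looks like the
following in Python; both the high-level socket route and the
equivalent low-level ``fcntl`` route are shown (POSIX-only, as
``fcntl`` does not exist on Windows):

```python
import fcntl
import os
import socket

s = socket.socket()
s.setblocking(False)   # high-level route: socket timeout becomes 0.0

# equivalent low-level POSIX route: set O_NONBLOCK on the descriptor
flags = fcntl.fcntl(s.fileno(), fcntl.F_GETFL)
fcntl.fcntl(s.fileno(), fcntl.F_SETFL, flags | os.O_NONBLOCK)
```

Once a descriptor is in this mode, reads and writes that would block
instead fail immediately (with ``EAGAIN``/``EWOULDBLOCK``), and the
program must multiplex readiness via select/poll/epoll/kqueue.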
Windows, on the other hand, has much more sophisticated facilities for
achieving high-performance IO by way of excellent asynchronous IO
support (usually referred to as "overlapped IO"), IO completion ports,
thread pools, efficient locking primitives and more.
AIX has facilities equal to that of Windows XP/2003 (overlapped IO and
IOCP, but no intrinsic thread pools). Solaris introduced a similar
API to IOCP called "event ports", which is functionally equivalent to
AIX and Windows XP/2003.
The key point to take away from this is that although all platforms
provide a means for achieving non-blocking or asynchronous calls, how
each platform actually goes about doing it is vastly different.
The Requirement for Asynchronous Callbacks
==========================================
From the perspective of the parallel callable, how the underlying
operating system goes about asynchronous IO or non-blocking calls is
an irrelevant implementation detail.
However, what *is* important is the set of methods made available
to parallel callables that allow potentially-blocking calls to be
made in a "parallel-safe" manner.
There are many ways to expose this functionality to programs. In
fact, how this functionality should be exposed is exactly the topic
being discussed on python-ideas and prototyped in libraries like
tulip. It is also the essence of existing networking libraries like
Twisted and Tornado, and is the focus of another PEP: PEP 3145.
This PEP proposes an alternate approach to async IO compared to the
libraries above. It was first presented to python-ideas on the 30th
of November, and the core concept hasn't changed since.
(http://mail.python.org/pipermail/python-ideas/2012-November/018076.html)
The relevant code example is as follows::

    class Callback:
        __slots__ = [
            'success',
            'failure',
            'timeout',
            'cancel',
        ]

    async module/class async:

        def getaddrinfo(host, port, ..., cb):
            ...

        def getaddrinfo_then_connect(.., callbacks=(cb1, cb2)):
            ...

        def accept(sock, cb):
            ...

        def accept_then_write(sock, buf, (cb1, cb2)):
            ...

        def accept_then_expect_line(sock, line, (cb1, cb2)):
            ...

        def accept_then_expect_multiline_regex(sock, regex, cb):
            ...

        def read_until(fd_or_sock, bytes, cb):
            ...

        def read_all(fd_or_sock, cb):
            return self.read_until(fd_or_sock, EOF, cb)

        def read_until_lineglob(fd_or_sock, cb):
            ...

        def read_until_regex(fd_or_sock, cb):
            ...

        def read_chunk(fd_or_sock, chunk_size, cb):
            ...

        def write(fd_or_sock, buf, cb):
            ...

        def write_then_expect_line(fd_or_sock, buf, (cb1, cb2)):
            ...

        def connect_then_expect_line(..):
            ...

        def connect_then_write_line(..):
            ...

        def submit_work(callable, cb):
            ...

        def submit_blocking_work(callable, cb):
            ...

        def submit_scheduled_work(callable, schedule, cb):
            """
            Allows for periodic work to be submitted (i.e. run every
            hour).
            """
            ...

        def submit_future_work(callable, delay, cb):
            """
            Allows for work to be executed at an arbitrary point in
            the future.
            """
            ...

        def submit_synchronized_work(callable, cb):
            """
            Submits a callable to be executed on the main Python
            interpreter thread (i.e. outside of any parallel context
            execution).
            """
            ...

        def run_once(..):
            """Run the event loop once."""

        def run(..):
            """Keep running the event loop until exit."""
The two key aspects of these methods are: a) every method (except the
run_once/run ones) must be passed a callback, and b) every method
*always* returns immediately. From the perspective of the parallel
callable, everything happens asynchronously.
Other aspects of this proposed API worth noting (particularly where it
differs from conventional proposals to date):
- There's a single flat interface to *all* asynchronous calls.
This is in stark contrast to the object-oriented/interface
approach used by Twisted, which relies on concepts/classes
such as Protocols, Producers, Consumers, Transports, etc.
- It has been written in such a way to satisfy [R8], which states
that the interface should focus on allowing the platform with
the best asynchronous primitives to run natively, and simulating
the behaviour elsewhere. This means that a function like
getaddrinfo() is exposed via the interface, because Windows
provides a native way to call this method asynchronously.
- Every single function bar the submit_* ones deals with IO in some
way or another. All possible operations you could call against
a socket or file descriptor are accessible asynchronously. This
is particularly important for calls such as connect and accept,
which may not have asynchronous counterparts on some systems.
(In which case, the "asynchronicity" would be mimicked in the
same fashion Twisted et al do today.)
- There are a lot of combined methods::

      def getaddrinfo_then_connect(.., callbacks=(cb1, cb2)):
      def accept_then_write(sock, buf, (cb1, cb2)):
      def accept_then_expect_line(sock, line, (cb1, cb2)):
      def accept_then_expect_multiline_regex(sock, regex, cb):
      def read_until(fd_or_sock, bytes, cb):
      def read_until_lineglob(fd_or_sock, cb):
      def read_until_regex(fd_or_sock, cb):
      def read_chunk(fd_or_sock, chunk_size, cb):
      def write_then_expect_line(fd_or_sock, buf, (cb1, cb2)):
      def connect_then_expect_line(..):
      def connect_then_write_line(..):
These combined methods provide optimal shortcuts for the most
common actions performed by client/server event driven software.
The idea is to delay the need to run Python code for as long as
possible -- or in some cases, avoid the need to call it at all,
especially in the case of error handling, where the interface
can detect whether the incoming bytes are erroneous or not.
For example, write_then_expect_line() automatically handles
three possible code paths that could occur after data has been
written to a network client:
- The expected line is received -> cb.success() is executed.
- The expected line is not received. Either the line didn't
match the expected pattern, the line was too long, had
invalid chars, etc. Whatever the case, cb.failure() is
automatically called -- which would be configured to send an
error code (if appropriate to the protocol) and then close
the connection.
- A terminated ``\r\n`` line isn't received within the timeout
specified by cb.timeout. cb.failure() is run with the
failure type indicating 'timeout'.
The general principle is to try and delay the need to execute
Python code for as long as possible -- and, when it can't be
delayed any longer, make sure the Python code has enough data to
do useful work.
This frees the Python programmer from writing repetitive IO
processing code (e.g. is my buffer full, has a full line been
received, has the correct pattern been detected, etc.) and lets
them focus on program logic -- "when a line matching this pattern
is received, run this Python code" (i.e. the callable accessible
via the cb.success callback).
Additionally, by keeping all the buffer processing and input
detection work in C, the Python interpreter doesn't need to
switch back to Python code until there's something useful to
do.
This becomes an important optimization when dealing with heavily
loaded servers. If you've got 64k concurrent clients reading
and writing to your server, the faster you can serve each
request, the better you're going to perform. Minimizing the
number of times Python code is being executed over the course of
a given interval will improve the overall performance.
- Finally, there is a set of methods specifically tailored towards
  handling blocking calls::

      def submit_work(callable, cb):

      def submit_blocking_work(callable, cb):

      def submit_scheduled_work(callable, schedule, cb):
          """
          Allows for periodic work to be submitted (i.e. run every
          hour).
          """

      def submit_future_work(callable, delay, cb):
          """
          Allows for work to be executed at an arbitrary point in
          the future.
          """

      def submit_synchronized_work(callable, cb):
          """
          Submits a callable to be executed on the main Python
          interpreter thread (i.e. outside of any parallel context
          execution).
          """
These methods serve an important role. The proposed solution
advises against using threading.Thread, because such threads
inhibit the parallel capabilities of the interpreter: it has to
let them run at regular intervals, pausing the parallel pipelines
to do so. There therefore needs to be a way to submit non-IO work
that may potentially block from within the context of a parallel
callable.
This is what the submit_* methods are for. They allow for
submission of arbitrary callables. Where and how the callable
is executed depends on what submit method is used. Pure
"compute" parallel callables can be submitted via submit_work.
Blocking calls can be submitted via submit_blocking_work (this
allows the underlying implementation to differentiate between
blocking calls that need to be handled in a separate, on-demand
thread, versus compute callables that can simply be queued to
one of the pipelines).
Additionally, work can be submitted for processing
"synchronously", that is, when parallel processing is paused (or
has stopped) and the main interpreter thread is executing as a
single thread with the GIL acquired and visibility to all
global/nonlocal variables. This provides a means to affect the
global state of the program from within a parallel callable
*without* requiring shared data synchronization primitives.
(Note that most of these methods have equivalent counterparts in
libraries like Twisted, e.g. reactor.callFromThread and
reactor.callLater.)
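A rough sketch of how the submit_* family might route work internally
follows. The ``Dispatcher`` class and its queue names are assumptions
made purely for illustration; they are not part of the proposal:

```python
import queue

class Dispatcher:
    """Routes submitted callables to the queue matching their kind."""
    def __init__(self):
        self.compute = queue.Queue()       # parallel pipeline threads
        self.blocking = queue.Queue()      # on-demand helper threads
        self.synchronized = queue.Queue()  # main thread, GIL held

    def submit_work(self, fn, cb=None):
        self.compute.put((fn, cb))

    def submit_blocking_work(self, fn, cb=None):
        self.blocking.put((fn, cb))

    def submit_synchronized_work(self, fn, cb=None):
        self.synchronized.put((fn, cb))

    def drain_synchronized(self):
        # Called by the main thread while parallel execution is
        # paused; only here may global state be touched safely.
        done = []
        while not self.synchronized.empty():
            fn, _cb = self.synchronized.get()
            done.append(fn())
        return done

d = Dispatcher()
d.submit_synchronized_work(lambda: 'global state updated')
done = d.drain_synchronized()
```

The key property is that synchronized work is deferred onto a queue
that only the main thread ever drains, so no locking of global state
is needed.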
Tying It All Together
=====================
With the provision of asynchronous callbacks, we have all the pieces
we need to facilitate our main goal of exploiting multiple cores
within the context of the following two use cases:
- Writing high-performance, event-driven client/server software.
- Writing parallel task/data decomposition algorithms.
We are able to exploit multiple cores by introducing a set of parallel
primitives (the specifics of which we've yet to discuss) that the main
interpreter thread uses to signal the entry and exit of parallel
execution contexts.
We have introduced the concept of "parallel callables". These are
simply Python functions that can be run concurrently across multiple
cores. In order to achieve concurrent execution of Python code as
efficiently as possible, some constraints need to be observed by the
code within such a "parallel callable":
- The code should never block. This is achieved by providing an
asynchronous facade for all possible IO operations.
- The code will not have visibility to any global variables or
objects (globals() will literally return an empty dict (or maybe
just the same dict as locals())).
- Because no shared or global state can be accessed directly from
within the body of the code, any "input data" the code requires
must be made accessible via normal function arguments.
- The lifetime of all objects created within a parallel callable
ends when the callable finishes (that is, they are explicitly
deallocated).
- Within the context of parallel task/data decomposition, if a
callable needs to communicate the result of its work back to the
main program, it does so by simply returning that data (i.e. via
a normal Python ``return foo`` statement).
- Within the context of event-driven clients/servers, callables
can use a set of async.submit_*_work() facilities to either
communicate with the main program or affect global program
state in some way.
- The code within the parallel callable should be reduced to the
simplest expression of program logic suitable for exposing as an
atomic unit. In general, more callables with less functionality
are preferable to fewer callables with greater functionality.
- Parallel callables are idempotent (an important property if they
are to be executed concurrently). Provided with the same input,
they'll always produce the same output. They can't affect each
other's state, nor do they leak state (i.e. object allocation)
once finished, nor can they affect nonlocal/global objects
unless the standard communication primitives are used (using
``return`` or async.submit*).
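These constraints closely mirror what concurrent.futures already asks
of pool workers today: inputs arrive via arguments, results leave via
``return``, and no shared state is touched. A present-day analogue (a
thread pool is used here for brevity; under this proposal such
callables would execute truly concurrently on the parallel pipelines):

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(chunk):
    # A well-behaved "parallel callable": pure and self-contained;
    # it communicates only via its argument and return value.
    return len(chunk.split())

chunks = ['a b c', 'd e', 'f']
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(word_count, chunks))
```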
With these constraints in place, we propose achieving interpreter
parallelism/concurrency by creating a thread for each CPU core
and, upon a signal from the main thread that parallel computation
can begin, running a modified PyEval_EvalFrameEx pipeline in each
of those threads.
The parallel pipeline will only ever execute parallel callables. It
will execute a single callable to completion (i.e. until the function
has finished), perform any data marshalling required by ``return``,
free all objects allocated during execution of the callable, and then
check whether or not the main thread has paused (or completed) parallel
execution. This determines whether the parallel thread yields control
back to a waiting main interpreter thread, or whether it can pull down
another callable off its queue and start the process again.
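The pipeline loop just described can be sketched in pure Python
(illustrative only; the real loop would live inside the interpreter,
and "marshalling" and per-callable cleanup would be C-level steps):

```python
import queue
import threading
import time

def pipeline_worker(work_q, results, paused, stop):
    # Run callables to completion, one at a time, checking between
    # callables whether the main thread has paused parallel execution.
    while not stop.is_set():
        if paused.is_set():
            time.sleep(0.01)        # yield to the main thread
            continue
        try:
            fn, args = work_q.get(timeout=0.01)
        except queue.Empty:
            continue
        results.put(fn(*args))      # "marshal" the return value
        # (per-callable object cleanup would happen here)

work_q, results = queue.Queue(), queue.Queue()
paused, stop = threading.Event(), threading.Event()
t = threading.Thread(target=pipeline_worker,
                     args=(work_q, results, paused, stop))
t.start()
for i in range(4):
    work_q.put((lambda x: x * x, (i,)))
out = sorted(results.get() for _ in range(4))
stop.set()
t.join()
```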
We identify that the major issue with past attempts at
concurrency is that they all try to make threading.Thread
suddenly behave concurrently. This requires revamping a
significant number of CPython internal structures such that
they're protected by locks. This adds an unacceptable overhead;
frequent operations like incref and decref, which are currently
trivial and almost free from a "cycle cost" perspective, suddenly
need to be protected with mutex locks and unlocks.
This sort of an approach will never work with the current memory
allocation, reference counting and garbage collection implementation,
because it has all been written with the assumption that there is a
single global thread of execution at any one time.
We propose an alternate approach for achieving concurrency without
the overhead of fine-grained locking: provide each thread pipeline
with its own, localized object allocation. This would be done by
replacing the global object head static pointers currently used with a
set of thread-local replacements. This maintains the illusion of a
single global thread of execution from the perspective of CPython
internals.
We also suggest exploring avenues such as dropping garbage collection
altogether and simply deallocating objects (i.e. free()'ing them)
as soon as their refcount hits 0, or even doing away with reference
counting altogether and automatically deallocating everything that was
allocated as part of the "parallel callable cleanup" steps each thread
must take before starting a new callable.
This proposal introduces a new constraint: no visibility to
global/nonlocal variables from within the context of a parallel
callable. The callable must be written such that it has all the data
it needs from its function arguments. If it needs to affect global
program state, it does so by returning a value or using one of the
async.* methods.
Our proposed approach also meets all of the requirements with regards
to maintaining GIL semantics. It does this by having the main
interpreter thread periodically pause parallel execution in order to
release the GIL, as well as perform other bits of housekeeping that
can only be done with the GIL held from the single-threaded execution
context of the main interpreter thread (which, unlike parallel
threads, has access to the "global" memory state, garbage collection,
etc).
Obviously, the longer the parallel threads are permitted to run
without being paused the better, but legacy code that relies on GIL
semantics (especially extension modules) will still run without
issue in the meantime.
Because we don't introduce any fine-grained locking primitives in
order to achieve concurrency, no additional overhead is incurred by
Python code running in the normal single-threaded context or the new
parallel context.
As part of ensuring parallel callables run as efficiently as possible,
we propose a new "async facade" API. This API exposes methods that
allow potentially-blocking system calls to *always* be executed
asynchronously. Calls to these methods always return instantly,
ensuring that parallel callable code never blocks.
The choice to expose low-level methods asynchronously versus proposing
a higher level async API (such as Producer/Consumer/Transport/Protocol
etc) was a conscious one; just like the parallel callables, these
methods are intended to be idempotent; they do not need any class
state or instance objects in order to be invoked. They can simply be
called directly from the context of a parallel callable with the
intended target object (i.e. a socket or file descriptor) as the first
argument.
The proposed async facade also leverages the new concept of parallel
callables via the explicit callback requirements. As long as the same
"parallel callable" constraints are adhered to by a callback (i.e.
cb.success), the execution can be scheduled for the parallel pipeline
automatically. The more "parallel callables" used to handle program
state, the longer Python can stay executing parallel contexts. This
is important for being able to achieve high performance across all
cores in heavily loaded situations.
Also note that we don't discuss how the backend for the async facade
should be implemented, nor do we leak any underlying platform details
into the interface. There is no mention of IOCP, select, poll, epoll,
kqueue, non-blocking sockets/fds or AIO. These are all implementation
details that the "parallel callable" code shouldn't be interested in.
All it needs is an asynchronous interface to all system calls that
could potentially block, as well as a means to submit general parallel
work and work that needs to be done by the main interpreter thread
outside of the parallel execution context.
The final point worth mentioning regarding the async facade is that it
isn't attempting to compete with Tulip, Twisted, Tornado or any other
async IO library.
In fact, its primary goal isn't async IO at all. Its primary goal is
to provide parallel callables with a means to call potentially-blocking
system calls asynchronously, such that they don't stall the parallel
thread pipeline.
Because the new primitives being proposed are designed to exploit
multiple cores concurrently, the idea is that subsequent versions of
Twisted could be written to leverage the new capabilities quite
easily. This applies to any other event loops like Tornado et al.
XXX TODO
========
Example Python Code Using New Parallel Primitives
-------------------------------------------------
I haven't thought much about what the new parallel primitives will
look like from a syntactical perspective. I'm confident that the most
elegant solution will naturally present itself as part of subsequent
PEP reviews.
The primary goal of this PEP is to introduce the concept of parallel
primitives. Such primitives have well defined entry and exit points,
which the main interpreter thread relies upon to control parallel
execution contexts.
There are two use cases that need to be catered for: event-driven
client/server programs and parallel task/data decomposition.
A possible example of what the former would look like::
    class SimpleEchoServer(async.TCPServer):
        def __init__(self, cb):
            cb.success = self.connection_established
            async.accept(self.sock, cb)

        @parallel.event
        def connection_established(self, cb):
            cb.success = self.line_received
            async.expect_line(self.sock, cb)

        @parallel.event
        def line_received(self, data, cb):
            cb.success = self.connection_established
            async.write(self.sock, data, cb)

    server = SimpleEchoServer()
    async.register(server)
    async.run()
The parallel task/data decomposition primitives will probably
share some similarities with the OpenMP style of interface, as
it's well suited to this sort of task::
    @parallel.map
    def linelen(line):
        return len(line)

    # Runs linelen(line) in parallel against each sequence
    # element in ``lines``, then sums the results.
    total = parallel.reduce(text.splitlines(), linelen,
                            lambda x, y: x + y)
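The intended semantics of that sketch, expressed serially with today's
builtins (``parallel.map`` and ``parallel.reduce`` are proposed names
that do not exist yet):

```python
from functools import reduce

def linelen(line):
    return len(line)

text = "alpha\nbeta\ngamma"
# Map linelen over each line, then fold the results with addition --
# the result parallel.reduce would compute across multiple cores.
total = reduce(lambda acc, n: acc + n,
               map(linelen, text.splitlines()), 0)
```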
Underlying Implementation of Async Facade
-----------------------------------------
More discussion is warranted regarding how the async facade would be
implemented on different platforms. POSIX platforms will use epoll
or kqueue, falling back to poll. Windows, AIX and Solaris can use
their native AIO facilities with IOCP or event ports.
How each underlying implementation interacts with the parallel
primitives also needs to be discussed. Are separate threads used, is
it all done in the main thread during parallel breaks, etc.
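The kind of readiness notification any backend must supply can be
shown with the lowest common denominator, select.select(); this is
illustrative only, and a real backend would prefer epoll, kqueue,
IOCP or event ports as noted above:

```python
import select
import socket

# A connected pair of non-blocking sockets stands in for a client
# and server endpoint.
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

a.send(b'ping\r\n')
# Wait (up to 1s) until b is readable, then consume the bytes --
# the readiness step every backend implementation must provide.
readable, _, _ = select.select([b], [], [], 1.0)
received = readable[0].recv(1024) if readable else b''
a.close()
b.close()
```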
References
==========
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
vim:set ts=8 sw=4 sts=4 et syntax=rst tw=70: