Saturday, January 14, 2012

Transactional Memory (II)

Here is an update about the previous blog post about the
Global Interpreter Lock (GIL). In 5 months, the point of view
changed quite a bit.

Let me remind you that the GIL is the technique used in both CPython and
PyPy to safely run multi-threaded programs: it is a global lock that
prevents multiple threads from actually running at the same time. The
reason to do that is that it would have disastrous effects in the
interpreter if several threads access the same object concurrently --- to
the point that in CPython even just manipulating the object's reference
counter needs to be protected by the lock.

So far, the ultimate goal to enable true multi-CPU usage has been to
remove the infamous GIL from the interpreter, so that multiple threads
could actually run in parallel. It's a lot of work, but this has been
done in Jython. The reason that it has not been done in CPython so far
is that it's even more work: we would need to care not only about
carefully adding fine-grained locks everywhere, but also about reference
counting; and there are a lot more C extension modules that would need
care, too. And we don't have locking primitives as performant as
Java's, which have been hand-tuned since ages (e.g. to use help from the
JIT compiler).

But we think we have a plan to implement a different model for using
multiple cores. Believe it or not, this is better than just removing
the GIL from PyPy. You might get to use all your cores without ever
writing threads.

You would instead just use some event dispatcher, say from Twisted, from
Stackless, or from your favorite GUI; or just write your own. From
there, you (or someone else) would add some minimal extra code to the
event dispatcher's source code, to exploit the new transactional features
offered by PyPy. Then you would run your program on a
special version of PyPy, and voilà: you get some form of automatic parallelization.
Sounds magic, but the basic idea is simple: start handling multiple
events in parallel, giving each one its own transaction. More about
it later.

First, why would this be better than "just" removing the GIL? Because
using threads can be a mess in any complex program. Some authors (e.g.
Lee) have argued that the reason is that threads are fundamentally
non-deterministic. This makes it very hard to reason about them.
Basically the programmer needs to "trim" down the non-determinism (e.g.
by adding locks, semaphores, etc.), and it's hard to be sure when he's
got a sufficiently deterministic result, if only because he can't write
exhaustive tests for it.

By contrast, consider a Twisted program. It's not a multi-threaded
program, which means that it handles the "events" one after the other.
The exact ordering of the events is not really deterministic, because
they often correspond to external events; but that's the only source of
non-determinism. The actual handling of each event occurs in a nicely
deterministic way, and most importantly, not in parallel with the
handling of other events. The same is true about other libraries like
GUI toolkits, gevent, or Stackless.

(Of course the Twisted and the Stackless models, to cite only these two,
are quite different from each other; but they have in common the fact
that they are not multi-threaded, and based instead on "events" ---
which in the Stackless case means running a tasklet from one switch()
point to the next one.)

These two models --- threads or events --- are the two main models we
have right now. The latter is more used in Python, because it is much
simpler to use than the former, and the former doesn't give any benefit
because of the GIL. A third model, which is the only one that gives
multi-core benefits, is to use multiple processes, and do inter-process
communication.

Consider the case of a big program that has arbitrary complicated
dependencies. Even assuming a GIL-less Python, this is likely enough to
prevent the programmer from even starting a multi-threaded rewrite,
because it would require a huge mess of locks. He could also consider
using multiple processes instead, but the result is annoying as well:
the complicated dependencies translate into a huge mess of inter-process
synchronization.

The problem can also be down-sized to very small programs, like the kind
of hacks that you do and forget about. In this case, the dependencies
might be simpler, but you still have to learn and use subtle locking
patterns or a complex inter-process library, which is overkill for the
purpose.

(This is similar to how explicit memory management is not very hard for
small programs --- but still, nowadays a lot of people agree that
automatic memory management is easier for programs of all sizes. I
think the same will eventually be true for using multiple CPUs, but the
correct solution will take time to mature, like garbage collectors did.
This post is a step in hopefully the right direction :-))

Let me introduce the notion of independent events: two events are
independent if they don't touch the same set of objects. In a multi-threaded
world, it means that they can be executed in parallel without needing any lock
to ensure correctness.

Events might also be mostly independent, i.e. they rarely access the same
object concurrently. Of course, in a multi-threaded world we would still need
locks to ensure correctness, but the point is that the locks are rarely causing
pauses: lock contention is low.

Consider again the Twisted example I gave above. There are often several
events pending in the dispatch queue (assuming the program is using 100%
of our single usable CPU, otherwise the whole discussion is moot). The case I am
interested in is the case in which these events are generally mostly
independent, i.e. we expect few conflicts between them. However
they don't have to be proved independent. In fact it is fine if
they have arbitrary complicated dependencies as described above. The
point is the expected common case. Imagine that you have a GIL-less
Python and that you can, by a wave of your hand, have all the careful
locking mess magically done. Then what I mean here is the case in which
such a theoretical program would run mostly in parallel on multiple
core, without waiting too often on the locks.

In this case, the solution I'm proposing is that with minimal tweaks
in the event dispatch loop, we can
handle multiple events on multiple threads, each in its own transaction.
A transaction is basically a tentative execution of the corresponding
piece of code: if we detect conflicts with other concurrently executing
transactions, we abort the whole transaction and restart it from
scratch.

By now, the fact that it can basically work should be clear: multiple
transactions will only get into conflict when modifying the same data
structures, which is the case where the magical wand above would have
put locks. If the magical program could progress without too many
locks, then the transactional program can progress without too many
conflicts. In a way, you get even more than what the magical program
can give you: each event is dispatched in its own transaction, which
means that from each event's point of view, we have the illusion that
nobody else is running concurrently. This is exactly what all existing
Twisted-/Stackless-/etc.-based programs are assuming.

Note that this solution, without transactions, already exists in some
other languages: for example, Erlang is all about independent events.
This is the simple case where we can just run them on multiple cores,
knowing by construction of the language that you can't get conflicts.
Of course, it doesn't work for Python or for a lot of other languages.
From that point of view, what I'm suggesting is merely that
transactional memory could be a good model to cope with the risks of
conflicts that come from not having a special-made language.

Of course, transactional memory
(TM) is not a perfect solution either. Right now, the biggest issue is
the performance hit that comes from the software implementation (STM).
In time, hardware support (HTM) is likely to show up and help
mitigate the problem; but I won't deny the fact that in some cases,
because it's simple enough and/or because you really need the top
performance, TM is not the best solution.

Also, the explanations above are silent on what is a hard point for TM,
namely system calls. The basic general solution is to suspend other
transactions as soon as a transaction does its first system call, so
that we are sure that the transaction will succeed. Of course this
solution is far from optimal. Interestingly, it's possible to do better
on a case-by-case basis: for example, by adding in-process buffers, we
can improve the situation for sockets, by having recv() store in a
buffer what is received so that it can be re-recv()-ed later if the
transaction is aborted; similarly, send() or writes to log files can be
delayed until we are sure that the transaction will commit.

From my point of view, the most important point is that the TM solution
comes from the correct side of the "determinism" scale. With threads,
you have to prune down non-determinism. With TM, you start from a
mostly deterministic point, and if needed, you add non-determinism. The
reason you would want to do so is to make the transactions shorter:
shorter transactions have less risks of conflicts, and when there are
conflicts, less things to redo. So making transactions shorter
increases the parallelism that your program can achieve, while at the
same time requiring more care.

In terms of an event-driven model, the equivalent would be to divide the
response of a big processing event into several events that are handled
one after the other: for example, the first event sets things up and fires the second
event, which does the actual computation; and afterwards a third event
writes the results back. As a result, the second event's transaction
has little risks of getting aborted. On the other hand, the writing
back needs to be aware of the fact that it's not in the same transaction
as the original setting up, which means that other unrelated
transactions may have run in-between.

These, and others, are the problems of the TM approach. They are "new"
problems, too, in the sense that the existing ways of programming don't
have these problems.

Still, as you have guessed, I think that it is overall a win, and
possibly a big win --- a win that might be on the same scale for the age
of multiple CPUs as automatic garbage collection was 20 years ago for
the age of RAM size explosion.

Stay tuned for more!

--- Armin (and reviews by Antonio and Fijal)

UPDATE: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.

Here is an update about the previous blog post about the
Global Interpreter Lock (GIL). In 5 months, the point of view
changed quite a bit.

Let me remind you that the GIL is the technique used in both CPython and
PyPy to safely run multi-threaded programs: it is a global lock that
prevents multiple threads from actually running at the same time. The
reason to do that is that it would have disastrous effects in the
interpreter if several threads access the same object concurrently --- to
the point that in CPython even just manipulating the object's reference
counter needs to be protected by the lock.

So far, the ultimate goal to enable true multi-CPU usage has been to
remove the infamous GIL from the interpreter, so that multiple threads
could actually run in parallel. It's a lot of work, but this has been
done in Jython. The reason that it has not been done in CPython so far
is that it's even more work: we would need to care not only about
carefully adding fine-grained locks everywhere, but also about reference
counting; and there are a lot more C extension modules that would need
care, too. And we don't have locking primitives as performant as
Java's, which have been hand-tuned since ages (e.g. to use help from the
JIT compiler).

But we think we have a plan to implement a different model for using
multiple cores. Believe it or not, this is better than just removing
the GIL from PyPy. You might get to use all your cores without ever
writing threads.

You would instead just use some event dispatcher, say from Twisted, from
Stackless, or from your favorite GUI; or just write your own. From
there, you (or someone else) would add some minimal extra code to the
event dispatcher's source code, to exploit the new transactional features
offered by PyPy. Then you would run your program on a
special version of PyPy, and voilà: you get some form of automatic parallelization.
Sounds magic, but the basic idea is simple: start handling multiple
events in parallel, giving each one its own transaction. More about
it later.

First, why would this be better than "just" removing the GIL? Because
using threads can be a mess in any complex program. Some authors (e.g.
Lee) have argued that the reason is that threads are fundamentally
non-deterministic. This makes it very hard to reason about them.
Basically the programmer needs to "trim" down the non-determinism (e.g.
by adding locks, semaphores, etc.), and it's hard to be sure when he's
got a sufficiently deterministic result, if only because he can't write
exhaustive tests for it.

By contrast, consider a Twisted program. It's not a multi-threaded
program, which means that it handles the "events" one after the other.
The exact ordering of the events is not really deterministic, because
they often correspond to external events; but that's the only source of
non-determinism. The actual handling of each event occurs in a nicely
deterministic way, and most importantly, not in parallel with the
handling of other events. The same is true about other libraries like
GUI toolkits, gevent, or Stackless.

(Of course the Twisted and the Stackless models, to cite only these two,
are quite different from each other; but they have in common the fact
that they are not multi-threaded, and based instead on "events" ---
which in the Stackless case means running a tasklet from one switch()
point to the next one.)

These two models --- threads or events --- are the two main models we
have right now. The latter is more used in Python, because it is much
simpler to use than the former, and the former doesn't give any benefit
because of the GIL. A third model, which is the only one that gives
multi-core benefits, is to use multiple processes, and do inter-process
communication.

Consider the case of a big program that has arbitrary complicated
dependencies. Even assuming a GIL-less Python, this is likely enough to
prevent the programmer from even starting a multi-threaded rewrite,
because it would require a huge mess of locks. He could also consider
using multiple processes instead, but the result is annoying as well:
the complicated dependencies translate into a huge mess of inter-process
synchronization.

The problem can also be down-sized to very small programs, like the kind
of hacks that you do and forget about. In this case, the dependencies
might be simpler, but you still have to learn and use subtle locking
patterns or a complex inter-process library, which is overkill for the
purpose.

(This is similar to how explicit memory management is not very hard for
small programs --- but still, nowadays a lot of people agree that
automatic memory management is easier for programs of all sizes. I
think the same will eventually be true for using multiple CPUs, but the
correct solution will take time to mature, like garbage collectors did.
This post is a step in hopefully the right direction :-))

Let me introduce the notion of independent events: two events are
independent if they don't touch the same set of objects. In a multi-threaded
world, it means that they can be executed in parallel without needing any lock
to ensure correctness.

Events might also be mostly independent, i.e. they rarely access the same
object concurrently. Of course, in a multi-threaded world we would still need
locks to ensure correctness, but the point is that the locks are rarely causing
pauses: lock contention is low.

Consider again the Twisted example I gave above. There are often several
events pending in the dispatch queue (assuming the program is using 100%
of our single usable CPU, otherwise the whole discussion is moot). The case I am
interested in is the case in which these events are generally mostly
independent, i.e. we expect few conflicts between them. However
they don't have to be proved independent. In fact it is fine if
they have arbitrary complicated dependencies as described above. The
point is the expected common case. Imagine that you have a GIL-less
Python and that you can, by a wave of your hand, have all the careful
locking mess magically done. Then what I mean here is the case in which
such a theoretical program would run mostly in parallel on multiple
core, without waiting too often on the locks.

In this case, the solution I'm proposing is that with minimal tweaks
in the event dispatch loop, we can
handle multiple events on multiple threads, each in its own transaction.
A transaction is basically a tentative execution of the corresponding
piece of code: if we detect conflicts with other concurrently executing
transactions, we abort the whole transaction and restart it from
scratch.

By now, the fact that it can basically work should be clear: multiple
transactions will only get into conflict when modifying the same data
structures, which is the case where the magical wand above would have
put locks. If the magical program could progress without too many
locks, then the transactional program can progress without too many
conflicts. In a way, you get even more than what the magical program
can give you: each event is dispatched in its own transaction, which
means that from each event's point of view, we have the illusion that
nobody else is running concurrently. This is exactly what all existing
Twisted-/Stackless-/etc.-based programs are assuming.

Note that this solution, without transactions, already exists in some
other languages: for example, Erlang is all about independent events.
This is the simple case where we can just run them on multiple cores,
knowing by construction of the language that you can't get conflicts.
Of course, it doesn't work for Python or for a lot of other languages.
From that point of view, what I'm suggesting is merely that
transactional memory could be a good model to cope with the risks of
conflicts that come from not having a special-made language.

Of course, transactional memory
(TM) is not a perfect solution either. Right now, the biggest issue is
the performance hit that comes from the software implementation (STM).
In time, hardware support (HTM) is likely to show up and help
mitigate the problem; but I won't deny the fact that in some cases,
because it's simple enough and/or because you really need the top
performance, TM is not the best solution.

Also, the explanations above are silent on what is a hard point for TM,
namely system calls. The basic general solution is to suspend other
transactions as soon as a transaction does its first system call, so
that we are sure that the transaction will succeed. Of course this
solution is far from optimal. Interestingly, it's possible to do better
on a case-by-case basis: for example, by adding in-process buffers, we
can improve the situation for sockets, by having recv() store in a
buffer what is received so that it can be re-recv()-ed later if the
transaction is aborted; similarly, send() or writes to log files can be
delayed until we are sure that the transaction will commit.

From my point of view, the most important point is that the TM solution
comes from the correct side of the "determinism" scale. With threads,
you have to prune down non-determinism. With TM, you start from a
mostly deterministic point, and if needed, you add non-determinism. The
reason you would want to do so is to make the transactions shorter:
shorter transactions have less risks of conflicts, and when there are
conflicts, less things to redo. So making transactions shorter
increases the parallelism that your program can achieve, while at the
same time requiring more care.

In terms of an event-driven model, the equivalent would be to divide the
response of a big processing event into several events that are handled
one after the other: for example, the first event sets things up and fires the second
event, which does the actual computation; and afterwards a third event
writes the results back. As a result, the second event's transaction
has little risks of getting aborted. On the other hand, the writing
back needs to be aware of the fact that it's not in the same transaction
as the original setting up, which means that other unrelated
transactions may have run in-between.

These, and others, are the problems of the TM approach. They are "new"
problems, too, in the sense that the existing ways of programming don't
have these problems.

Still, as you have guessed, I think that it is overall a win, and
possibly a big win --- a win that might be on the same scale for the age
of multiple CPUs as automatic garbage collection was 20 years ago for
the age of RAM size explosion.

Stay tuned for more!

--- Armin (and reviews by Antonio and Fijal)

UPDATE: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.

If you go that road, you will certainly find out that Transactional Memory is much, much harder to get right than it looks like in today effectful/imperative languages. Sure, it looks wonderful on paper, but if your language doesn't help you control side-effects it will give you a very hard time.

Currently, there is satisfying STM support in Haskell (because of its tight type-based control of side-effects) and Clojure (beacuse of its tight control on mutability), and it might be getting into Scala.

I doubt Python can easily get such control, at least without an important reorganization of idiomatic practices and frameworks, that go beyond the "let's be event-driven" decision. Which makes your "this is going to work magically" story a bit hard to believe.

There has been intense research on this topic for some decades now, and several attempts at getting it to work in current mainstream languages have mostly failed.

One major use-case for multithreading involves a large, unchanging data structure which many threads access. I.e., the data structure is loaded by a parent task, then not modified again; a number of threads are then spawned to use it for calculations.

In CPython, the GIL makes this impossible if only because the reference counters need to be protected. With Cython in threads, however, you can turn off the GIL and do some work on C-style data structures.

I'm wondering whether the STM PyPy effort could have a very useful, and very early, benefit: simply enabling an unchanging data structure to be accessed by a number of processors via the kinds of events you describe. There wouldn't be a need for transactions, because the programmer would take responsibility for only sharing unchanging structures between simultaneously-executing events.

But it seems like the basic requirements for this kind of facility might be met in in early stage of STM development. And a solution that allowed multiple processors to access large, unchanging structures would be very useful in certain applications. I know I have one in mind that I'm looking at CPython/Cython for, but I'd rather see if I could get the performance I need from PyPy.

@Anonymous: in the extract you cite I meant "the two main models in Python". As far as I can tell, Go does concurrency by enforcing all communications to go via channels, so I would classify it as a "special-made" language. This solution might be nice and usable, but it does not really work at all in languages like Python.

@gasche: I know about Haskell, Clojure and Scala, and I just read the two blog posts you pointed to.

I'm not talking about giving explicit TM to the end programmers. I'm instead considering TM as an internal, implementation-only feature. That makes it very similar to GCs.

I know the points and issues of traditional TM systems, which are nicely reported by Joe Duffy in "A (brief) retrospective on transactional memory". These are of course perfectly valid issues, but I think they do not apply (or "not that much") in the particular context I'm talking about. For example, this includes the large sections about nested transactions, and about consistency between the transactional and non-transactional worlds (Weak or Strong Atomicity, The Privatization Problem). Even "Where is the Killer App?" is obvious in this case: any existing Twisted App is potentially a Killer App.

Sorry for not including references to papers. I must admit I don't know any paper that describes a similar use case for TM.

Your idea could work for other easy to inject into points, such as loops, and comprehensions. Especially with much of the work in pypy already done for identifying information about loops.

How does this compare to grand central dispatch and blocks? http://en.wikipedia.org/wiki/Grand_Central_Dispatch

Events are a very good way to model concurrency, and are widely used. It is a great place to dispatch concurrency into parallelism.

Closures/blocks provide a fairly decent way to get some of the protection of STM - and in many programs give you the 80% solution. For code that plays nicely and avoids mutable, or global data - this works. Luckily, a lot of event based code is already written in this way. As you say, they are "generally mostly independent".

Making the bad cases a quick fail, like in JavaScript worker threads could be an ok option. As soon as someone tries to access global data(do a system call, access the DOM, or access data outside the closure even), the program would fail there. Then you could fix those cases, or "add non-determinism" as you say. I think I'd prefer fail fast here, rather than have to detect these problems, and have them silently pass by.

You still have scheduling problems, and trying to figure out task size. As well, this does not solve lots of other problems. However, it is cool that it could be applied automatically, and probably 'safely'.

Another random thought... you could probably mark chunks of code as 'pure' as your run through them, and if they do a system call or mutate global data mark them as 'unpure' and don't try them again.

I very much look forward to reading your results as you implement more.

Update: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.

@Armin: That transaction code looks very simple. It seems trivial to implement a map/mapReduce style function on top of your transaction module.

It is a very similar API to worker pool APIs which many thread using programs use. The main difference is that you combine the join() in the run method. It seems that a threaded web server for example could use this? What would happen if each incoming request comes in, and is put into the transaction (and say the 10th request has an error)? Would it be better to use multiple transactions?

@notme: yes, a web server or anything can use this instead of using threads. It's of course missing a convincing select() or poll() version for that.

The details haven't been thought out; right now an exception interrupts everything. In an STM model it's unclear if concurrent transactions should still be allowed to complete or not. Anyway the point is that exceptions should not really occur because precisely they interrupt everything --- you would typically add instead in every transaction code like "try: .. except: traceback.print_exc()".

I'm not sure I buy your conclusions here. STM is not a panacea for solving concurrency issues, and it has some key limitations that limit its general applicability.

On what granularity do you plan to have transactions? How do you know? Perhaps the VM will have enough knowledge of a given thread's activities to limit transactional overhead to only those structures in memory that are shared, but there still needs to be some indirection in case another thread hops in and starts making changes.

Where do transactions start and end? In STMs I know, the in-transaction overhead for reading and writing data is *much* higher, since it needs to know if someone else has committed a transaction first and be able to roll back.

Perhaps this is all intended to be hidden, and you never actually have "threads" that the user can see. But if you're going to parallelize, you'll have threads *somewhere* that are going to contend for resources. If they're going to contend for resources, even in an STM, they're going to have to check for contention, register their interest, and then you're back to the indirection overhead.

Perhaps I'm not understand what your end goal is. You can't simply turn the world into a series of transactions unless you want every read and write to have transaction overhead or you have some clear way of limiting transaction overhead to only where it's needed. You cite Erlang...but Erlang deals with immutable objects, and there's far less need for anything like an STM. Others have mentioned Clojure...but again, Clojure is mostly immutable structures, and transactional overhead is limited to Refs, where you'll make single coarse-grained reads and writes.

Am I missing the point? Are you not suggesting VM-wide STM, with the resulting transactional overhead for every read and write?

@Charles: Indeed, I am suggesting VM-wide STM, with the resulting transactional overhead for every read and write. I actually got such a VM yesterday (with no GC): it seems to be about 10x slower on a single thread.

Note that even 10x slower is a plus if it scales to dozens of processors. But of course, a better point of view is that some years ago the regular pypy *was* 10x slower than CPython. It was a lot of efforts but we managed to make it only 1.5-2x slower. And this is all without counting the JIT. If STM bogs down to a generally-not-triggered read barrier before every read, then the performance impact could be well under 2x.

Please note also that I don't care about Java-like performance where even loosing 10% of performance would be a disaster. If we end up with a pypy-tm that is 2x slower than a regular pypy, I would be quite happy, and I believe that there is a non-negligible fraction of the Python users that would be, too.

On granularity: for now I'm going with the idea that the granularity is defined "naturally" in the source program as the amount of work done every time some central dispatch loop calls some code. There might be several dispatch loops in total, too. This is true in the cases I can think of: typical Twisted or Stackless programs, pypy's "translate.py", the richards benchmark, etc.

Please look at http://paste.pocoo.org/show/539822/ for an example of what I'm talking about. It's a diff against the standard richards.py: it is a pure Python user program in which I added calls to the new 'transaction' module. At this level there is no hint of Transactional Memory.

@Gary Robinson: (off-topic:) for this kind of use case, you can use os.fork() after the immutable data is ready. It "kind of works" both in pypy and in cpython, although not really --- in cpython the reference counts are modified, causing the pages to get unshared between processes; and in pypy the garbage collector (GC) has the same effect, so far. It could be solved in pypy by more tweaks the GC.

@armin: @Anonymous: in the extract you cite I meant "the two main models in Python". As far as I can tell, Go does concurrency by enforcing all communications to go via channels, so I would classify it as a "special-made" language. This solution might be nice and usable, but it does not really work at all in languages like Python.

Armin, Stackless Python uses a model that at the API level is very similar to Go. Go borrows from the Bell Labs family of languages (i.e. Newsqueak). The fundamental idea is that message pasing is used to share information between threads/processes/coroutines. In this regard, Go is in the same camp as say, Erlang (although the messaging systems are different).

What I think is interesting and workable for Python are efforts in languages like Polyphonic C# (see the paper "Scalable Join Patterns") and Concurrent/Parallel ML, where lock-free libraries and STM techniques are used under the hood to improve the efficiency of the messaging/synchronisation system. In this fashion, the programmer has a conceptually clean concurrency model and still can make the important decisions about how to partition the problem.

@daniel@Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python library

I have looked at Python-CSP a long time ago. I recall it being verbose. However I use Stackless Python. And using PyPy's stackless.py, I implemented select() and join patterns. Sometimes I wish I had language support: they cut down on silly mistakes and make the code less verbose for simple cases. However what I have found is that the language can get in the way. For instance, in Go, one has to come up with hacks to do some simple like do a select on an arbitrary number of channels. Perhaps I am wrong but I suspect stuff like select()'s design was influenced by the fact Newsqueak was originally designed to make a windowing system easier to write. So one is monitoring only a handful of channels. In constrast, this is not the way Stackless Python programmes are written.