12/21/2012

12-21-12 - Coroutine-centric Architecture

I've been talking about this for a while but maybe haven't written it all clearly in one place.
So here goes. My proposal for a coroutine-centric architecture (for games).

1. Run one thread locked to each core.

(NOTE : this is only appropriate on something like a game console
where you are in control of all the threads! Do not do this on an OS like Windows where other apps may also be
locking to cores, and you have the thread affinity scheduler problems, and so on).

The one-thread-per-core set of threads is your thread pool.
All code runs as "tasks" (or jobs or whatever) on the thread pool.

The threads never actually do ANY OS Waits. They never switch. They're not really threads, you're not using
any of the OS threading any more. (I suppose you still are using the OS to handle signals and such, and
there are probably some OS threads that are running which will grab some of your time, and you want that;
but you are not using the OS threading in your code).

2. All functions are coroutines.
A function with no yields in it is just a very simple coroutine.
There's no special syntax to be a coroutine or call a coroutine.

All functions can take futures or return futures. (a future is just a value that's not yet ready).
Whether you want this to be totally implicit or not is up to your taste about how much of the operations
behind the scenes are visible in the code.

When you call a function, it is not a "branch", it's just a normal function call. If that function yields, it yields the whole
current coroutine. That is, it's just like threading and waits, but rather with coroutines and yields.

Batch starting is one of those things that people often leave out. Starting tasks one by one is just like
waiting for them one by one (instead of using a wait_all), it causes bad thread-thrashing (waking up and
going back to sleep over and over, or switching back and forth).

3. Full stack-saving is crucial.

For this to be efficient you need a very small minimum stack size (4k is probably good) and you need stack-extension on demand.

You may have lots of pending coroutines sitting around and you don't want them gobbling all your memory with 64k stacks.

Full stack saving means you can do full variable capture for free, even in a language like C where tracking references is hard.

4. You stop using the OS mutex, semaphore, event, etc. and instead use coroutine variants.

Instead of a thread owning a lock, a coroutine owns a lock. When you block on a lock it's a yield of the coroutine instead a full OS wait.

Getting access to a mutex or semaphore is an event that can trigger coroutines being run or resumed. eg. it's a future just like the
return from an async procedural call. So you can do things like :

future<int> y = some_async_func();
yield( y , my_mutex.when_lock() );

which yields your coroutine until the joint condition is met that the async func is done AND you can get the lock on "my_mutex".

Joint yields are very important because they prevent unnecessary coroutine wakeup. While coroutine thrashing is not nearly as bad as
thread thrashing (and is one of the big advantages of coroutine-centric architecture (in fact perhaps the biggest)).

You must have coroutine versions of all the ops that have delays (file IO, networking, GPU, etc) so that you can yield on them
instead of doing thread-waits.

5. You must have some kind of GC.

Because coroutines will constantly be capturing values, you must ensure their lifetime is >= the life of the coroutine. GC is
the only reasonable way to do this.

I would also go ahead and put an RW-lock in every object as well since that will be necessary.

6. Dependencies and side effects should be expressed through args and return values.

You really need to get away from funcs like

void DoSomeStuff(void);

that have various un-knowable inputs and outputs. All inputs & outputs need to be values so that they
can be used to create dependency chains.

When that's not directly possible, you must use a convention to express it. eg. for file manipulation
I recommend using a string containing the file name to express the side effects that go through the
file system (eg. for Rename, Delete, Copy, etc.).

7. Note that coroutines do not fundamentally alter the difficulties of threading.

You still have races, deadlocks, etc. Basic async ops are much easier to write with coroutines, but
they are no panacea and do not try to be anything other than a nicer way of writing threading.
(eg. they are not transactional memory or any other auto-magic).

to be continued (perhaps) ....

Add 3/15/13 : 8. No static size anything. No resources you can run out of.
This is another "best practice" that goes with modern thread design that I forgot to list.

Don't use fixed-size queues for thread communication; they seem like an optimization or simplification
at first, but if you can ever hit the limit (and you will) they cause big problems. Don't assume a
fixed number of workers or a maximum number of async ops in flight, this can cause deadlocks and be a
big problem.

The thing is that a "coroutine centric" program is no longer so much like a normal imperative C program.
It's moving towards a functional program where the path of execution is all nonlinear. You're setting a big graph to evaluate, and then you just need
to be able to hit "go" and wait for the graph to close. If you run into some limit at some point during
the graph evaluation, it's a big mess figuring out how to deal with that.

Of course the OS can impose limits on you (eg. running out of memory) and that is a hassle you have to deal
with.

8 comments:

CPS is basically the way code is transformed (by the compiler not a human :-)) in most Scheme compilers. Each function is given the next function to execute. This make call/cc free and therefore one shot continuation (coroutine) also free.

Note that this requires proper tail-call recursion optimization but you do not need to save the stack.

However, while it is pretty straightforward in scheme where you only have expressions and only very few primary constructs (lambda, if, set!, defime mostly), I am not sure CPS transform is doable in C which has statement and expressions.

Also, look at Racket (for example) which has a nice (but not usually efficient unfortunately due to GC issues) and consistent way to do multi-threaded computations all along with continuations basically mixing future/touch to off-load expressions in other threads and call/cc to emulate coroutine (and exceptions, return...)

On the non-functional side, there's a few new imperative languages with that kind of design too, most prominently Go (www.golang.org) and Rust (www.rust-lang.org). Both check pretty much every point on your check list, with one distinction: both don't automatically promote return values to futures. They do both have built-in communication primitives though: Go has channels and Rust pipes, and the idiomatic way to handle this in both languages is to pass in a channel/pipe end point to the coroutine and send a value once it's done. (Both also have lambdas so it's easy to wrap a function that doesn't send its return value that way).

Both don't have batch-starting (that I've seen) or any explicit dependency tracking between coroutines; again they instead phrase this in terms of the communication primitives: If you're waiting for something to finish before you start a dependent event, make it send you a dummy value on completion. Both do have async waits for multiple events at once (usually wait-any, not wait-all).

Both have proper coroutines with stack-saving and small default stack size (~4k). Both use segmented stacks that grow dynamically and aren't necessarily contiguous (so there's no need to reserve large amounts of address space per coroutine).

Neither will expose any of the OS threading primitives by default, and instead use proper async versions that yield instead of blocking.

Both are GC'ed; Rust has a peculiar but interesting model where everything defaults to stack allocated (with compiler-verified memory safety, i.e. pointers to stack variables may not escape their assigned lifetime) and there's two separate heaps: the "managed heap", which is per-coroutine and fully GC'ed (the compiler enforces that such values don't become visible to other coroutines), and the "exchange heap", which is shared but may only contain objects that have exactly one pointer to them (similar to std::unique_ptr). Because the managed heaps are all separate, they can be GCed on a per-coroutine basis without problems. Interesting model, but I have no idea how well it works in practice. Go is just "GC everything that can't be on the stack", which is less efficient but also a lot simpler to reason about :)

Both don't have any locks per object, and instead emphasize using the built-in comm primitives over shared memory. Seems like the right call for normal usage. Go now also comes with a built-in race checker courtesy of Dmitry Vyukov for the parts that do use shared memory.

Most co-routine implementations I've seen lack the ability to carefully set up and share data between the "scheduler" and the coroutine itself.

What I mean is - you can always make coroutines function by saving the entire execution context of the coroutine every time you switch away, and restoring it every time you switch back. But if you're switching between coroutines that share a lot of state, it's not necessarily very efficient.

Ideally you'd have some notion of "sibling" coroutines - and the notion that when you yield you're only going to yield to a sibling, not to just any coroutine anywhere in the entire program. In 99% of cases the sibling is exactly the same code, just in a different instance. That allows the compiler to make certain assumptions about which registers are live, which will be dirtied, etc.

It also allows the compiler to schedule coroutines into each other. For example - you know that whenever you start a coroutine of this family, the first thing it does is run a prologue that loads the top four stack entries into registers r0-r3 - this happens on every entry point to the coroutine. OK, so if you're somewhere that you know is heading for a switch to this type of coroutine, you can do this there, rather than wait until you're actually in the coroutine wanting to use those registers as addresses or whatever.

In functions there's already some of the above functionality in function signatures (i.e. a particular signature makes a family of functions - you don't know which one you're going to call, but it's one of only a few) - but it seems like the same concept would be useful in coroutines.

The above was all learned doing pixel shaders as coroutines of course - you get a substantial perf increase.

I didn't meant that implementing coroutines requires GC. I meant that I believe a realistic/usable larger application that is coroutine-centric should use GC, otherwise the lifetime management in micro-coroutines is too heinous.

boost.context is part of boost-libraries since 1.51 (boost.org).boost.coroutine will be released with boost-1.53 (next week).boost.fiber is available at github.com/olk/boost-fiber (but not ready yet, need some fixes and docu).