general thread model motivations?

About the purpose of threads — especially green threads — I would like to hear suggestions about models, metaphors, terminology, or tactics about why this form of code organization is a good idea. A very high-level, hand-wavy perspective is what I hope to explore, to inform docs one might write for new users of code who ask simply, "But why?" If anyone wants to contribute, I'll also post short comments too, though I'm more interested in what other folks think. Note other ways of organizing code amount to the same thing, just folded differently; for example, coroutines are just green threads by another name and focus, at least in broad terms.

I've been thinking about green threads all the time lately, outside work, imagining diagrams and prose to explain the point, as starting context to motivate a library architecture whose VM model revolves around green threads (and groups of them associated under one identity in simulation of a green process). Working on solutions for a long time can make one focus too hard on the answer, and not enough on the question; so I hope I can get folks to ask rhetorical questions that amount to vague problem statement(s) causing one to seek a thread oriented design.

Ideas I had earlier this evening got profoundly vague as I tried to generalize. I suspected thread use causes hierarchy flattening, and simplifies by making many complex things uniform in organization. Another half-formed idea likened thread use to a kind of indirection, but a flat top-level indirection, with threads as equal peers. You see I'm trying to squint and find really coarse form and structure here.

Here's one last anecdote for entertainment's sake. I'm not sure what year it was — maybe 1996? — but I had a chance to ask Ike Nassi at Apple what he thought needed to be done for Dylan to gel/progress/matriculate/etc, because he was in charge of the Dylan group and I thought he might have a view from on high. He surprised me by saying he wished Dylan had defined a thread model. My first reaction was, "What does that have to do with anything?" But now I get it. So I wonder what I didn't grasp then, which I learned since, that makes the difference.

Your post implies that you consider green threads to be a form of concurrent programming; however, green threads could also be a form of parallelization. The difference is important, since there are a lot of consequences associated with each.

If you define green threads to be concurrent then I think you also need to choose a form of concurrency control (synchronization on shared memory, message passing, transactional memory...) since that choice will also have a lot of consequences.

Ah, I meant this green thread definition (which is typical): a fiber scheduled cooperatively in user space for multi-tasking, which looks like a thread as far as code organization goes, but does not involve a native thread scheduled by an OS kernel. In practice this looks a bit like coroutines, but with an API resembling threads. In particular, a fiber has to have a "stack" whose backtrace matches that of any native thread, or else it wouldn't look like a thread; but a stack can be as small as a single activation frame, i.e. easily much less than 1K.
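A minimal sketch of the idea, using Python generators as a stand-in for fibers (names like `scheduler` and `worker` are mine, not from any particular library): each `yield` is an explicit point where the user-space scheduler may switch fibers, and no OS threads or kernel scheduling are involved.

```python
from collections import deque

def scheduler(fibers):
    """Round-robin over fibers until all have finished."""
    ready = deque(fibers)
    while ready:
        fiber = ready.popleft()
        try:
            next(fiber)            # run until the fiber yields (cooperates)
            ready.append(fiber)    # still alive: reschedule it
        except StopIteration:
            pass                   # fiber finished; drop it

log = []

def worker(name, steps):
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield                      # cooperative context-switch point

scheduler([worker("a", 2), worker("b", 2)])
# Interleaving is deterministic because switches happen only at yields.
```

Because the switch points are explicit, you can reason precisely about where interleaving can and cannot occur, which is the property discussed below.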

In a PL (programming language) context, one can ask whether a language can even express the idea — whether there is support — and if so how it might be organized. For example, in Scheme you can roll a fiber engine just using first class continuations. I think it was Dybvig's description of such an engine library that confused me in 1990. (Semantics of calling a continuation in Scheme can confuse people when it amounts to supplanting a current fiber with one from a continuation — the stack swaps — while passing data between them.) On this site we might ask whether fibers in a PL are a good idea, despite inter-fiber communication (IFC) gotchas.

There is concurrency involved even if there's just one core and one native thread, in that code from different fibers can be interleaved as a scheduler pleases at well-defined points where context switch is known to be possible. For example, if fibers are driving async interfaces elsewhere, each async call typically blocks a fiber until an async reply occurs, and thus context switch is likely.

Threads and fibers both imply shared memory visibility, requiring synchronization primitives to cooperate safely, but these can be cheap for fibers, especially when preemptive context switch will not occur. (A critical section in a fiber spanning more than one async call necessarily requires a cooperative green lock, but locks can be skipped when context switch cannot occur.) However, shared-nothing message passing is also desirable when you want location opacity, making it possible to move components to remote positions; you even want this when simulating distributed nodes in one local address space, since otherwise it's hard to test remote logic at least once in cheap local setups where it's easier to debug.

Presence of fibers in a PL would affect semantics of what it means to share mutable or (preferably) immutable state, perhaps refcounted when local, and how control is managed in language-visible mechanisms. Your second paragraph lists motivations for why you'd want detail defined by a PL instead of a library.

Threads are the "worse is better" model of concurrency. From the implementers' perspective, threads are the easiest way to hack multi-tasking or concurrency into an imperative language. But from the users' perspective, threads are painfully, pervasively, and persistently difficult to use safely and correctly. Threads are a fountain of accidental complexity.

Whether you consider threads a "good idea" depends on whether you favor transparency of implementation vs. simplicity of interface.

I'm of the second philosophy. While I can respect the historical context in which threads were invented, I think that threads compare extremely poorly to many other concurrency models. Even for the non-compositional models (e.g. actors model) the ability to reason about when and where interactions may occur in a specific implementation is a great advantage. Many formal concurrency models can be superior even in performance aspects, e.g. via support for static optimizations, inlining, and partial-evaluation across concurrent behaviors. Many models are naturally 'green' in the sense that we can efficiently compute many logical processes on one CPU - perhaps even by blending machine code, rather than context-switching.

Sadly, it seems we've landed ourselves in a situation where most programmers don't know of any alternative for concurrency than 'threads'. So if you want to explain any other model of concurrency, you'll need to describe it in terms of 'safe threading' or 'automatic threading', and explain it in terms of implementation rather than abstraction.

Green threads can be understood as an effort to improve performance for an awful but familiar concurrency model by adding layers. In some ways, I feel they're the worst of both philosophies: green threads sacrifice both transparency and simplicity on the altar of efficiency, and do nothing to raise the optimization ceiling. But they do have a mitigating benefit: compared to coarse OS threads, green threads are much cheaper to use in abstractions, which can help encourage a higher level of programming (i.e. without worrying as much about the performance details).

I admire the fearless way you offer criticism and disagreement, and wish other folks did so too, because traction in analysis is hard without feedback questioning premises one might otherwise blindly assume. As part of docs for a library, I would like to dissuade folks from using code by offering as many reasons as I can about why a library is a bad idea, unless you somehow clear all the hurdles making it lose. I'd like even more arguments why fibers are a bad idea, if they come to mind later, so I can warn folks off more strongly. [It occurs to me this might sound like sarcasm; nope, I'm completely serious.]

If you have a better model which I can explain to folks in few words, I'd like to recommend that instead of a worse model, before I dive into worse in earnest. But it would have to sound appealing and actionable after a short description, or it wouldn't work as well at driving people away, which would be useful.

I have seen engineering teams thrash back and forth between preference for thread-oriented versus event-oriented architecture. Personally, I find pre-emptive native threads quite irritating when pursued aggressively. But it's hard to fault the clear view of what code intends to do (except when it's so easy for other threads to subtly sabotage any intentions).

I'd like to see something with a happy medium allowing either data-centric or code-centric analysis without confusion and/or great effort. First I want to see a clear dataflow plan, and then I want to see code achieving that without really twisty organization. In other words, I'd like the sort of simplicity making mistakes obvious to occur when auditing either data or code, and I thought clear pipelines in fiber-oriented code might partially suit both.

I agree with dmbarbour that threads are generally a poor choice for applications implementing concurrency. Using threads safely is difficult, and brings to mind an analogy to Greenspun's tenth rule.

To me, the main problem with threading is that it changes the semantics of a language without changing any of its syntax. This gives the impression that going from, for example, "Java" to "Java with threads" is an incremental step, when actually it can undermine so many assumptions that single-threaded Java code is based on, that it's better to think of it as a different language (or at least dialect).

One example of a 'mainstream' alternative to threading is JavaScript's concurrency model. It uses "web workers", which are processes (i.e. no shared memory). The processes can interact by message-passing: code which spawns a process is given a handle which it can pass objects into. This makes a *copy* of the objects available to the process, via an event. Likewise, the process can pass objects into the handle to trigger an event in the calling process, which again *copies* the objects.
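To make the copy-on-send semantics concrete, here is a hedged single-process simulation in Python (real web workers use structured clone across process boundaries; `post_message` and `worker_receive` are illustrative names, not anyone's actual API):

```python
import copy
from queue import Queue

inbox = Queue()

def post_message(obj):
    """Deep-copy on send, mimicking structured clone: no sharing."""
    inbox.put(copy.deepcopy(obj))

def worker_receive():
    return inbox.get()

msg = {"items": [1, 2, 3]}
post_message(msg)
msg["items"].append(4)          # sender mutates its original after sending...

received = worker_receive()
# ...but the receiver's copy is unaffected: received["items"] == [1, 2, 3]
```

The copy at the boundary is exactly what rules out the shared-mutable-state hazards discussed elsewhere in this thread, at the cost of copying.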

You can also get zero copy effects easily, in one process, using refcounted content fragments that are immutable (at least by convention). Doing the same between OS processes is irritating via shared memory, partly because cleaning things up is hard if one process dies.

Interesting discussion. I think the main problem we are currently facing is the distance between implementing efficient parallel algorithms on current hardware and how we want to express an algorithm in a programming language, or even abstractly. As an example, look at this really interesting paper by one of my coworkers, Todd Mytkowicz, and others:
SIMD parallelization of applications that traverse irregular data structures

In particular, they show how to efficiently parallelize lexers using SIMD units. For me, the interesting part is that we tend to think about regular expressions in terms of non-deterministic finite automata (NFAs), which seem inherently non-parallel at first sight. The steps to go from a general NFA to a fully parallel SIMD algorithm are non-trivial, though, and can only be taken if the NFA is both specified declaratively enough (i.e., such that it can be recognized by a compiler) and if the compiler has very specific domain knowledge.

To me that means we have a lot of work to do as language designers: how do we create future language abstractions that are high-level enough to support these kinds of parallel applications? I guess we need languages that offer powerful DSL extensions at the syntactic level, but that also allow extensible compilers where one can implement specific optimization stages, for example to compile to SIMD.

Threads, IMO, are a reasonable technique at the machine-code level. Programmers ought to be using better abstractions and then relying on the compiler/interpreter to optimize, possibly to optimize into a threaded model.

As for what the better abstraction is, I would say that it's probably processes; each program operating "independently" and communicating via well-defined protocols (probably via ports, but not necessarily). Alternatively you could have variables/buffers that each thread regards as "volatile," which may be written by other threads, and protect everything else.

As dmbarbour says, threads as we understand them are a fountain of accidental complexity. When we have better abstractions and languages to translate things into, threads will probably be one of the major barriers to translating old programs onto new systems.

I've implemented a "safe" threading abstraction, based on fork/exit/wait. I'm not the only one who's implemented this; I've seen it in several languages now.

Here is how it works.

A thread may "fork" into any number of other threads (not just one), but then halts until all of those threads have halted. Each "child" thread can see all the data its "parent" thread can see, but any attempts to write that data are treated as copy-on-write; that means that if a thread has changed the value of something, then a copy of that thing has been allocated for *that thread* to hold the new value. Data allocated by the child thread, including altered copies of data that belongs to the parent, is fully owned by the child thread; no locks, traps, mutexes, or anything else required.

The crucial thing to note here is that because the parent (grandparent, great-grandparent, g'g'grandparent, etc) thread is stopped, it will not change the value of anything visible to one of its children while that child is running. Because of the copy-on-write semantics, no thread can see (and be confused by) writes committed by a different thread.

When the parent thread "spawns" its children, it allocates space reserved for each child to write back into. Memory allocated by child threads can be "willed" to the parent process, but because of copy-on-write, it is not possible for any child to place a reference to anything newly-allocated (anything which that child has allocated) into anything already-allocated (which the other children could see). Putting a reference to a newly-allocated object into the reserved "result" space is the only way the child can make the parent become aware of such things when the parent resumes execution.

When a thread exits, all of the memory allocated by that thread (except for that willed to the parent thread) is reclaimed, so writing into this reserved space (which the other children cannot see) is the only way for threads to communicate results, and results can be communicated only to the parent thread.

With this model, a threaded application can use multiple CPUs and be completely lock-free because shared state is strictly controlled. In fact, there is no shared state between threads all of which are running at the time when the state is shared.

The downside is that you need an extra level of indirection to refer from one allocated datum to another, so pointer-following involves an extra memory access. ie, rather than have a variable store a pointer to another datum, it must store the ID number of that datum. Copy-on-write keeps the ID number, but creates a new address. Each thread has its own table mapping ID numbers to addresses. Copy-on-write allocates the datum, then copies the changed value into it, then updates the table. All references to that datum in the entire affected thread, via this indirection, now refer to the copy of the datum at its new address and thus with its new value.
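Here is a small sketch of the fork/exit/wait discipline in Python, under the simplifying assumption that children run sequentially while the parent is stopped (the model permits them to run in parallel precisely because they share nothing writable); `fork_and_wait` and `inc_x` are illustrative names, not anyone's actual API:

```python
from types import MappingProxyType

def fork_and_wait(parent_env, children):
    """Run each child against a copy-on-write view of parent_env; collect results.

    The parent is 'stopped' for the duration: the only writes it ever
    observes are each child's explicit return value.
    """
    frozen = MappingProxyType(parent_env)   # children cannot write through this
    results = []
    for child in children:
        local = dict(frozen)                # child's private (logical) copy
        results.append(child(local))        # writes land in the copy only
    return results

env = {"x": 23}

def inc_x(local):
    local["x"] += 1      # copy-on-write: mutates the child's copy, never env
    return local["x"]

results = fork_and_wait(env, [inc_x, inc_x])
# Both children saw x == 23, so both return 24; env["x"] is still 23.
```

No locks or mutexes appear anywhere, because no two live threads ever write the same datum; combining the children's results is the parent's job alone.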

(I like that plan, especially the copy-on-write parts. If it's okay, instead of uppercase COW as an acronym for copy-on-write, let's use lowercase cow at times when there's no chance we're talking about farm animals. It's more casual, assumes familiarity by other folks, and downplays its role as arcane jargon. For folks who don't know copy-on-write, it's not that arcane. It just means when you want to write something, you instead make a new mutable copy of the parts you alter, while perhaps sharing parts of the old immutable data.)

My personal use-case for fibers has parts resembling elements from your fork plan, but with less of the address-space isolation character of an OS process (or an Erlang-style immutable-data process). I should describe a copy-on-write stream data structure I take for granted as instrumental in simplifying the issue of avoiding conflict. I'm sure other folks have a similar data type in existing libraries I haven't noticed. I recently noticed HN (Hacker News) folks point at code using an ArrayBuffer type, and for all I know it could be the same thing, but I have no idea. [And it's referenced by the Transferable Objects material cited by Manuel J. Simoni above.] It's a mutable stretchy vector of immutable refcounted iov (iov is short for struct iovec), so iovs can be safely shared because all mutation is copy-on-write with respect to bytes in iovs.

For brevity, call this type ybuf or thorn-buf, but on a whiteboard just write B for buf. Starting ten years ago I called this type deck in analogy to a deck of cards, where each iov is like an immutable card that can be reshuffled into sub-decks. I plan to name libraries, and layers within, after irritants like thorn and barnacle (meaning no salt), so that's all thorn means. Lowercase y is just a substitute for thorn glyph þ. [Note this type of refcounted buffer is not used where I work — if it was, it would be part of secret sauce I can't describe. But they use something else.]

An instance of B is just a sequence of bytes, like an in-memory file, but composed of immutable fragments in pointer-plus-length format, also known as an iov. If each refcounted object (aka rod) is a subclass of R, and we use V as the iov type, then each fragment is represented as pair (R, V), and thus each buf B is just a vector of (R, V) pairs, plus a bit of metainfo like length in pairs and total logical size in bytes. Appending to B just appends one or more (R, V) pairs and increments the refcount of each R involved. Editing B by append — or insert, remove, write or anything — is just slice editing that replaces one subset of (R, V) pairs with another subset of pairs, altering refcounts accordingly. We can edit the inline iov V to refer to fewer bytes of a fragment involved, but we never touch actual bytes in a fragment. In fact, it's never actually necessary to copy a fragment for copy-on-write, because we can just refer to fewer bytes. So copying in B is mostly by reference via refcounting.
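A toy sketch of this structure in Python, leaning on the runtime's own refcounting and on `bytes` immutability to stand in for refcounted iov fragments (`Buf` and its methods are illustrative names for the buf B described above):

```python
class Buf:
    """A mutable vector of immutable fragments: each entry is
    (fragment, start, length), standing in for an (R, V) pair."""

    def __init__(self):
        self.pairs = []

    def append(self, fragment, start=0, length=None):
        length = len(fragment) - start if length is None else length
        self.pairs.append((fragment, start, length))

    def size(self):
        return sum(n for _, _, n in self.pairs)

    def to_bytes(self):
        return b"".join(f[s:s + n] for f, s, n in self.pairs)

    def share(self):
        """Zero-copy 'copy': a new Buf referencing the same fragments."""
        b = Buf()
        b.pairs = list(self.pairs)      # copies only the pair vector
        return b

big = b"hello, world"
a = Buf()
a.append(big)

b = a.share()
b.pairs[0] = (big, 0, 5)   # narrow b's view to "hello": slice editing,
                           # no fragment bytes are touched or copied
```

After the edit, `a` still sees the full fragment and `b` sees a prefix of the very same bytes, which is the "copying is mostly by reference" property.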

Fibers can use B as a pipe with fiber-safe copy-on-write to stream both inputs and outputs, provided iov bytes remain immutable, and provided refcounts can be safely updated — trivial when single-native-threaded. So fibers can simulate process isolation when i/o occurs using B, because they can't observe mutation, since it never occurs while refcount is above zero, even though it's just cooperative immutability. One fiber can keep using bytes already handed off to another fiber in a pipeline, because immutable iov fragments are safe to continue accessing. For example, one leg of a fiber pipeline can implement write-behind without increasing pipeline latency, because write can occur after handing data to another fiber. Each fiber has its own private mutable B instances, but (R, V) pairs can be shared when bytes in a V fragment are immutable.

A general fiber use-case I have in mind sounds like, "I want to perform complex mostly-independent actions on as many TCP connections as I can in a server at one time." A specific sub-case of this includes work I've been doing the last six-plus years in TCP deduplication. (I tell folks all compression is dictionary-based compression — different codecs just vary where a dictionary lives and how dict entry life cycle is managed, so secret sauce is mostly about identity of content and cache invalidation.) If I was using fibers, I would want each of many thousands of connections to consist of a pipeline of several fibers. So as many as 100,000 fibers might need to co-exist efficiently, where each needed at most a few activation frames. Concurrency applies because fragment encode/decode transformation is commutative, while transmission is associative because order must be preserved. When this is done in C, process isolation is not an option, but it's still nice to get as much safety as feasible.

When the parent thread "spawns" its children, it allocates space reserved for each child to write back into.

In fibers I plan to get a free ride from activation frame representation here. A called function can return a refcounted value of any size, because that value can be kept alive by a callee's frame until the caller gets its own ref and releases the callee's activation frame. A value so returned can ride all the way up until a child fiber returns it to a parent fiber. (Shared address space simplifies handoff.)

When a thread exits, all of the memory allocated by that thread (except for that willed to the parent thread) is reclaimed, so writing into this reserved space (which the other children cannot see) is the only way for threads to communicate results, and results can be communicated only to the parent thread.

When fibers refcount everything, space allocated by a fiber is reclaimed when refs hit zero. The complex part occurs when ref chains can be very long and require high latency to finish freeing refcounted objects. So this means a fiber that exits must stuff its frame into a list of things to be released by one or more garbage collection fibers that just release references.

The downside is that you need an extra level of indirection to refer from one allocated datum to another, so pointer-following involves an extra memory access. ie, rather than have a variable store a pointer to another datum, it must store the ID number of that datum. Copy-on-write keeps the ID number, but creates a new address. Each thread has its own table mapping ID numbers to addresses. Copy-on-write allocates the datum, then copies the changed value into it, then updates the table. All references to that datum in the entire affected thread, via this indirection, now refer to the copy of the datum at its new address and thus with its new value.

I know what you mean. The first time I used copy-on-write was in a 1996-97 storage system named Quilt, meant to replace Bento in OpenDoc, which did its own memory-mapping via page cache because it had to work on end-user Mac systems where virtual memory was all but turned off. (It was killed by the purchase of NeXT.) The main purpose of the design was to live inside a fixed memory footprint, unlike Bento, by making everything disk-based but keeping as much in RAM as possible. OpenDoc had a feature called "drafts" that terrified me until I realized copy-on-write solved the problem. Each draft was an immutable snapshot of an earlier version of a document, which you could revert to at will. By using copy-on-write, I was able to incrementally replace anything modified, and the new IDs were reachable from the root of the tree describing the document version involved.

Quilt actually had two layers of copy-on-write: one in high level btree layers and another in low level block i/o layers. Editing btrees in high level code caused copy-on-write replacement of tree nodes. Then down in the pseudo virtual memory layer, any time a page was modified, this remapped the old page offset to an offset of a new mutable copy, resulting in a log-structured file system in each OpenDoc file. This avoided altering any old page on disk until a user committed changes via save, because a crash had to safely revert to previous content without any chance of data loss. I love cow algorithms, they're awesome. (I briefly pitched the idea to QuickTime folks because it supported media editing with costs proportional to edit size.)

I'm under the impression most functional programming systems with immutable data actually perform updates by altering ID tables so new versions replace old ones in symbol tables. It's something folks should think about.

I like this copy-on-write because it gives each thread a private copy, which is like distributed memory, which is much easier to understand.

Here's what I don't get: how can you get something like sequential consistency? What if two spawned threads do an increment of the same item. They both get a copy of the master value, increment that? No matter what the master thread does at the join, you can't get an increment of an increment.

The whole point of the model is that you do not simultaneously have two or more different threads writing something. That is precisely the issue that turns out to be very hard to manage correctly. Race conditions, locks, mutexes, incomplete writes read by other threads, etc... Shared mutable state is the whole problem with threading.

If the master value is 23, and then thread-A tries to increment it, the master value is still 23, but thread-A has its own version of the same value, which is 24. If thread-B then tries to increment it, then thread-B also gets its own version of the same value, which is also 24. If A and B both want to return their version of the value to the parent thread, they can do that on exit -- but the parent thread has to be the one to combine the increments, because it's the only one that can write its own value.

I was going to answer too, but you said it better. I can amplify: If state in the parent needs change, only the parent can do it in that model, so children can only return values, which a parent can ignore if desired.

Shared mutable state is the whole problem with threading.

But worse is shared mutable state without threading discipline, when no other organizing principle is pursued. Of course, saying "not as crazy as some things" isn't a lot of praise for threads.

Ok, how does B get 24? From the master, I assume. So when does A give its 24 back to the master? You say it "can" do that on exit. Is that the only possibility? In that case B has to wait for A to finish before it can even get the 24 value. Or can A give it back earlier? But then the master becomes a central coordinator for all memory writes, which is a big bottleneck.

Yes, I understand the problem with simultaneous writes, I just don't understand how this gives a scalable solution.

If a "Master" thread has the value X=23, then while that is true, all threads that it spawns will start with the value X=23. If it spawns A and B at the same time, then A and B will both start out seeing X=23. The answer to your question (I paraphrase here) of How Does B See The Increment Made By A, is that B does NOT see the increment made by A unless it is a child of A started after A made the increment and is therefore running while A is stopped.

Using this model, if I want to spawn a thousand threads that all post increments to some variable starting at zero, what I will do under the hood is spawn two threads giving instructions to both to spawn 500 more, wait until they exit, and add the two values they return to get my new value for the variable. Each of those two threads will have done the same, except that they each started two threads giving each the mission of spawning 250 instead of 500, etc.

At the bottom level, you'll have a thousand threads all running simultaneously, all posting increments to their own value for the variable, none of them will be slowed down by mutexes or locks or any other synch primitive, and not a single one of them will ever see a single increment made by any of the others. Nine levels of "delegating threads" will resume when both of their children have exited, combine the work their two children did, and pass it on to their own parent thread. At the top level, the "master thread" will combine the work that its two "delegating threads" got done and come up with the final answer.
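The delegating-thread scheme above can be sketched with plain recursion standing in for fork/wait, which is faithful here because the model forbids interaction between live siblings anyway (`spawn_increments` is an illustrative name):

```python
def spawn_increments(n):
    """Total of n independent increments of a variable starting at 0.

    Each level 'forks' two children, waits for both, and combines their
    returned values. No child ever sees another child's increment; only a
    stopped-then-resumed parent combines results.
    """
    if n == 1:
        return 1                        # leaf thread: one private increment
    left = spawn_increments(n // 2)     # first delegating child
    right = spawn_increments(n - n // 2)
    return left + right                 # parent combines on resume

total = spawn_increments(1000)   # 1000, with no locks and no shared writes
```

The recursion depth is about log2(1000), i.e. roughly ten levels of delegating threads, matching the "nine levels" arithmetic above.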

"B does NOT see the increment made by A unless it is a child of A". So what happens when B tries to update that shared variable in the case that it is NOT a child of A? It updates the old value and so its increment action is discarded? So you're giving up sequential consistency? Or does a runtime error occur or something?

I'm not exactly following your story here. Also it seems to me that you have an awful lot of stopped threads. You need very fast context switching or your efficiency goes down the tubes. Maybe you can point me to a publication?

Ray's threads are effectively pure functions operating in their own scratch-space. Updates to the scratch-space are discarded when the pure function returns. The caller (parent thread) only sees a special return value. The scratch-space starts as a logical copy of the parent's space. For performance, the physical copying happens only on write, i.e. each child can have its own set of dirty pages.

Very little context switching is needed, because threads never interact. This is a very dysfunctional (or purely functional?) family of threads, where the parents remain sternly silent and always outlive their children, and the siblings never speak to one another. It is possible to run each child thread to completion, in sequence, without ever context-switching between active threads.

But it *is* necessary that if threads do communicate, they use a proper protocol and an explicitly purposed communications channel, such as a pipe or something. Writing a new value to a variable mustn't be communication, because if it is, then it's pretty much impossible to do anything without "communicating" unintentionally. When there is no way of getting anything done without "communicating," there is too much chance of communicating accidentally and wrongly.

I feel like Edsger Dijkstra must have felt when he was explaining "goto considered harmful" and someone kept asking, "but without goto, how do I make arbitrary control jumps from one point to another?"

This is a form of structured programming - you remember realizing that with loops and if statements and procedure calls you could express all the same algorithms you can using goto? Well, this is the same. I'm claiming that with threads that do not use mutable shared state, you can express the same algorithms that you can with mutable shared state. When you keep asking how to get mutable shared state in this model, you're asking how to avoid getting any benefit from it.

I suppose it's possible to add some kind of special, lock-controlled or mutex-controlled "special" shared memory space or shared variable type to this model, in the same way that "goto" was supported in new languages that used structured programming but also had to serve as direct translation targets for code written in languages that didn't.
But for god's sake don't teach anyone to program that way! It is a completely useless complication that does not promote clear thinking or the solution of problems.

Any use of a lock or mutex or of shared mutable state in general is a sign that the code is designed wrongly, cripples it so that it will not scale efficiently with increasing numbers of processing cores, and is a sign that it ought to have been structured better.

As Dijkstra would have responded to the question of "how do I make arbitrary control jumps without goto," so I respond to the question of "how do I get shared mutable state." It's a self-defeating question because you can avoid useless complications and think more clearly about the problems, while giving up no expressive power, if you don't.

I can see how this sort of model might be useful for performance, but I get the impression that it'd be really painful for abstraction. Forking twice in succession has different behavior and semantics than a one-time four-way fork, which means we can't reason compositionally. Peripherally, there are likely to be issues integrating with FFI and such.

I'm knee deep in adding fibers and channels to my language so this is a topic close to my heart. I find the model really appealing at some intuitive level, so here's a couple of hand-wavey explanations of why:

It matches how you would intuitively reason about some problems

Caveat: I like imperative languages. I like declarativity too, but I'm OK with code in the small that expresses how things are done.

I think many times if you informally talk about how you'd solve a problem, that description involves some amount of concurrency. Consider a recipe that's like:

Start boiling the pasta.

While that's going on prepare the sauce.

When both are done, mix together and serve.

Some kind of concurrency lets you write code similar to that. Lately, I'm more interested in green threads because I like having some level of ability to reason about when context switches happen.
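As a hedged illustration of that recipe, here is how it might read with cooperative tasks (Python's asyncio, chosen only for concreteness; the sleeps stand in for actual cooking):

```python
import asyncio

# The recipe as cooperative tasks: start the pasta, prepare the sauce
# "while that's going on", combine when both are done.
async def boil_pasta():
    await asyncio.sleep(0.02)   # stand-in for waiting on the pot
    return "pasta"

async def prepare_sauce():
    await asyncio.sleep(0.01)
    return "sauce"

async def dinner():
    # Both tasks run concurrently on one thread; the event loop
    # switches between them at the await points.
    pasta, sauce = await asyncio.gather(boil_pasta(), prepare_sauce())
    return f"{pasta} + {sauce}"

result = asyncio.run(dinner())
```

The code reads almost exactly like the informal recipe, which is the intuition being claimed.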

It lets you organize programs around independent pipelines

One example where the lightbulb went on for me about channels and fibers was Rob Pike's talk about lexing. I find the idea of organizing a parser around:

A lexing fiber that consumes characters and emits tokens.

A parsing fiber that consumes tokens and emits an AST.

To be really beautiful and appealing. Fibers and channels let you express that pattern directly without having to do a bunch of state machine shenanigans. Instead of your lexer returning tokens, it can send them and not have to unwind its callstack before each token.
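A minimal sketch of that pipeline, using Python generators as stand-ins for fibers (the `lex` and `parse` names are illustrative, not from any real compiler): the lexer emits tokens without unwinding its call stack before each one.

```python
# Lexing fiber: consumes characters, emits tokens via yield,
# keeping its own state (the partial word) alive between tokens.
def lex(text):
    word = ""
    for ch in text:
        if ch.isspace():
            if word:
                yield word   # send a token; lexer state is preserved
                word = ""
        else:
            word += ch
    if word:
        yield word

# Parsing "fiber": consumes the token stream, emits a (toy) AST.
def parse(tokens):
    return ["expr", list(tokens)]

ast = parse(lex("one two three"))
```

No explicit state machine appears anywhere; the suspended generator frame *is* the lexer's state.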

It's about state, not concurrency

I think callstacks are the most terse, useful data structure in your program. Recursion and local variables give you a trivial way to store a bunch of stuff and even compose complex aggregate "objects".

Once I started thinking of callstacks as an ad-hoc data structure, I realized that it's really handy to have lots of them. Without fibers, you have to unwind and destroy one callstack before you can create another. Fibers let you have as many as you want and keep them around as long as you want. To me, they feel like the natural evolution of closures.
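A small illustration of a call stack as a keepable data structure (Python generators, illustrative only): a recursive generator freezes its whole chain of activation frames at each yield, and you can hold onto that suspended stack as long as you like.

```python
# In-order traversal where the recursion itself stores the traversal
# state: each yield suspends the entire chain of stack frames.
def in_order(tree):
    if tree is None:
        return
    left, value, right = tree
    yield from in_order(left)
    yield value
    yield from in_order(right)

tree = ((None, 1, None), 2, (None, 3, None))
walker = in_order(tree)   # a live call stack we can keep around
first = next(walker)      # stack is now paused mid-recursion
rest = list(walker)       # resume and drain the rest later
```

Without this, an equivalent iterator would need an explicit stack structure maintained by hand.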

I see so many things going wrong with this recipe, e.g. if executed by a naive interpreter. For example, you never say: "look and see whether the pasta is done boiling", much less when to look again or how to tell. And even if you fixed that, it seems we can get into a 'stuck' state where the pasta is finished boiling but we aren't finished preparing the sauce (because we're only told to prepare the sauce while the pasta boils).

Code that seems 'intuitive' is dangerous if its behavior sometimes runs counter to our intuitions. Computers don't have human intuitions or experience. They won't fill the gaps. They'll do what we say, not what we mean.

lets you organize programs around independent pipelines

If there exists a concurrency model that prevents you from organizing programs around independent pipelines, I'm not aware of it. If there exists a concurrency model much more inconvenient for organizing programs into pipelines than explicit imperative threads and channels, I'm not aware of that either. Are you comparing this to a non-concurrent model?

I see so many things going wrong with this recipe, e.g. if executed by a naive interpreter.

Of course. It's a bad recipe for a human, much less a computer to follow. My point was just that when describing many "recipes" (i.e. ways to solve a problem) to other humans, we often use some concurrent terminology. Given that, I think it's helpful to have concepts that very roughly map to those in the language, though of course much more precisely defined.

Code that seems 'intuitive' is dangerous if its behavior sometimes runs counter to our intuitions.

This is true, but I don't know if that point leads to any firm guidelines. Certainly, no programming language will ever perfectly follow our intuitions. Our intuitions are not even internally consistent.

But that also doesn't mean we should abandon intuitiveness entirely. It's a multidimensional continuum, so you can never win, but like other areas of usability, you can make measurable progress.

If there exists a concurrency model that prevents you from organizing programs around independent pipelines, I'm not aware of it. If there exists a concurrency model much more inconvenient for organizing programs into pipelines than explicit imperative threads and channels, I'm not aware of that either. Are you comparing this to a non-concurrent model?

I'm thinking of callback-based asynchrony à la JavaScript and other "evented IO" systems. In those, you have to twist your entire callstack into awful CPS (as opposed to CSP!) soup.

Callstacks (activation frames) are in fact a very flexible and useful data structure. When I decided to support an arbitrary number of them, I moved them to the heap and made them subject to garbage collection like all other data structures.

Each "thread" (using the copy-on-write system I described above) can have its own call stack. The portion that's identical (i.e., the stack that existed when "spawn" was called) is shared. A function that was called in the parent thread can return (and return different things) in every child thread, and then after the child threads exit, the parent thread can return something else. Usually whatever the "most successful" child returned, in a concurrent-search application, or a sum of what all children returned in other applications, etc. Also, each child thread can extend its call stack in different directions and with different arguments.

If I want concurrent threads to communicate, that would need to be done via a shared channel -- at the moment I'm using sockets/pipes for that.

Yet, when you stand back, that's the worst parallelization: passing lexer tokens to a parser thread is incredibly communication heavy.

I don't think munificent was talking about parallelization, just concurrency. Whenever they are conflated, we always get misunderstandings like this.

As someone who has built a serious tree-level incremental compiler before, I feel that I've come close to building a parallel compiler. We are sort of limited by not knowing tree boundaries without first parsing, which we could somewhat solve through secondary indicators such as file boundaries or using shallow parsers that rely on verbose parse constructs in the language (curly braces, semicolons). Then it's just a matter of using my glitchy clean techniques that trace inter-tree dependencies and reprocess tree nodes in parallel until every node is processed with the correct dependencies (the only inter-node dependencies come from symbol tables and directly from your parent). Classic speculative parallelism!

I think we could have done this for Scala, and there would have probably been a performance benefit. But in this case, only tree node type checking would have been parallelized for a benefit, and even then at a larger node grain (like blocks or methods, not for expression tree nodes or even statement nodes).

Languages aren't really supporting the design and implementation of proper parallel algorithms. I suspect hardware-backed threads aren't going away for a long time because of this.

Occam, CUDA, and MapReduce? Explicit use of threads is definitely dying in the parallel computing world. The problem is that most languages like Go are just aiming at concurrency. Even many language designers still claim the problems are essentially the same and don't get it at all.

I don't think munificent was talking about parallelization, just concurrency. Whenever they are conflated, we always get misunderstandings like this.

Ah, rereading, I think you're right. However, then I disagree even more -- unless the example is for something like an incremental/streaming lexer/parser (e.g., for network-bound computations), the interleaved design is convoluted.

As someone who has built a serious tree-level incremental compiler before, I feel that I've come close to building a parallel compiler. ... Classic speculative parallelism!

Getting language support for this type of programming (some sort of super STM? variants of Daan's work? etc.) would be awesome. Parallelizing incremental evaluation is hard because of the fine granularity; it's hard to find big chunks of data parallelism. Getting speculation support for the independent cases would be cool -- our grammar synthesis work has been veering in that direction.

Occam, CUDA, and MapReduce... Explicit use of threads is definitely dying in the parallel computing world.

As a designer of parallel DSLs, I'm not arguing against restricted models, but I do not expect them to be used in a significant number of cases.

For example, the big thing in MapReduce the last few years was adding iteration and in-memory support, which is fundamentally the same as the undergraduate level optimization of tiling your loops. The current abstractions quickly break down, so while we wait for system designers to reinvent these things (and not even venture too far forward), production devs have to break out of the frameworks and do them manually.

I expect parallel model escape hatches to remain necessary (and to be exercised) for the foreseeable future.

Parallelizing incremental evaluation is hard because of the fine granularity; it's hard to find big chunks of data parallelism. Getting speculation support for the independent cases would be cool -- our grammar synthesis work has been veering in that direction.

I don't think this is the right problem: parallel and incremental are just duals and are fairly orthogonal (at least, this is what I get from the 2 for the price of 1 paper). If you do the work to parallelize your computation, then you've basically done the work to also incrementalize it.

My current challenge problem is live programming a kinect program: assuming the input is recorded, how do I "repair" execution that might shoot through multiple frames but does not interrupt so many computations within each frame. Chopping up the computations into a bunch of pieces (manually specified) and then just seeing how the dominoes fall in terms of broken dependencies seems more viable than anything else I can think of.

I expect parallel model escape hatches to remain necessary (and to be exercised) for the foreseeable future.

I agree, but these don't necessarily lead to being exposed directly to hardware threads. But pragmatically I guess the distinction is not very important; it's still about having low-level control by breaking out of a high-level abstraction.

I don't think this is the right problem: parallel and incremental are just duals and are fairly orthogonal (at least, this is what I get from the 2 for the price of 1 paper).

Yes and no. The intuition behind incrementalizing and parallelizing using similar properties is old, and Daan's work was about nicely applying it to task graphs.

An underlying issue I've had with most PL projects looking at task parallel computations is that they assume really big tasks where you don't have to do much to parallelize them, such as ignoring issues like temporal locality. For example, a benchmark in Daan's paper was taken from my own work, but the workload was multiplied (by orders of magnitude?) so as to achieve a speedup.

Returning to "the right problem", it really is of parallelizing the incremental computation. For browsers, *most* of the computation is incremental. We've shown how to do bulk parallel, and traditional results hint at how to do incremental, but we really need to parallelize the incremental case. It gets harder in many ways (the active data region is dynamically determined), but easier in others (speculative execution's repair is, ideally, incremental, reducing to a problem we're already solving!).

I suspect a lot of important computations are like that, such as when developing and running big data computations: you only modified part of a query, or only some websites you're analyzing changed. Should I get a next job, this is one of the questions high on my list of throwing a student at :)

The intuition behind incrementalizing and parallelizing using similar properties is old, and Daan's work was about nicely applying it to task graphs.

Agreed, but they document the intuition, so it's a nice reference.

An underlying issue I've had with most PL projects looking at task parallel computations is that they assume really big tasks where you don't have to do much to parallelize them, such as ignoring issues like temporal locality.

The converse of this is that we ignore computations where there is already active research and strong needs, like machine learning/deep neural networks. Their fuzziness never fits into our exact world view.

We've shown how to do bulk parallel, and traditional results hint at how to do incremental, but we really need to parallelize the incremental case.

I naively think the incremental case is already parallel. I mean, just process the dirty nodes on your worklist in parallel, right? Or if you have a real graph, parallel traversal to clean it up doesn't feel like a big problem. I guess the problem is in defining incremental, which I've always seen as a destroy-and-rebuild style affair.

you only modified part of a query, or only some websites you're analyzing changed. Should I get a next job, this is one of the questions high on my list of throwing a student at :)

The problem is that there is already an active field of research far away from systems that is looking at this; I mentioned Adobe Insight before (or at least I was supposed to get back to you). We thought about making a play in this area but found it already to be very crowded.

An approach I've found that seems quite promising is to partition the problem, i.e. such that different sub-programs run in different logical "partitions". The partitions then become a locus of aggregate parallel processing, while communication between partitions is much more loosely coupled.

I was inspired to this approach by FRP.Sodium (a Haskell implementation of push-based FRP). For my Reactive Demand Programming implementation, Sirea, I use batches:

Each partition has a thread.

Each thread runs in steps.

During a step, communications (updates) to other partitions aggregate in batches. There is one 'outbox' per source partition for each destination partition. Batches are delivered atomically at the end of the step.

I use bounded-buffers for loose synchronization, fairness, and resource control. Only a limited number of "batches" may be in-flight or unprocessed per outbox.

At the start of each step, a partition will process ALL available batches. This includes both batches from many sources and multiple batches from single sources. For RDP, updates for the same signal are composed ('piggybacked'), but that isn't essential.

Since all batches are processed each step, there is never any risk of deadlock due to cyclic bounded buffers. Thus, there are no constraints on communication between partitions. Ad-hoc, dynamic communication patterns are safe.

Since batches are loaded atomically, we can guarantee a nice property of "snapshot consistency", which is weaker than global consistency but is still very convenient.

The partitions offer parallelism, while the batching offers the coarse granularity needed for efficient performance. The model should scale naturally, since it's easy to treat different services, agents, or clients as operating in different partitions. The only trick is you'll need to model batch-based communication, but that isn't difficult (e.g. just use TCP and frame each batch.)
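A rough sketch of the outbox/batch mechanics described above (Python, all names illustrative; the actual Sirea implementation is in Haskell and differs in detail): updates accumulate in an outbox during a step and the whole batch is delivered at once, then the receiving partition ingests every available batch at the start of its next step.

```python
import queue

# Illustrative partition: each has an inbox of batches; a step first
# ingests ALL available batches atomically, then does its work and
# aggregates outgoing updates into a single outbox batch.
class Partition:
    def __init__(self):
        self.inbox = queue.Queue()  # batches arriving from other partitions
        self.state = []

    def step(self, updates_to_send):
        # 1. Ingest every batch that has arrived (snapshot consistency:
        #    we never observe half a batch).
        while not self.inbox.empty():
            self.state.extend(self.inbox.get())
        # 2. Aggregate this step's outgoing updates into one batch.
        outbox = list(updates_to_send)
        return outbox

a, b = Partition(), Partition()
batch = a.step(["x", "y"])   # a's step produces one batch for b
b.inbox.put(batch)           # delivered atomically at end of a's step
b.step([])                   # b ingests the whole batch next step
```

Bounding the number of undelivered batches per outbox (not shown) is what provides the loose synchronization and back-pressure mentioned above.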

The partitions-and-batching approach is easy enough to adapt, e.g. down to hard real-time systems or up to large dynamic structures (e.g. align partitions with data volumes or an index), so long as the partitions are relatively stable over time.

If you have a highly volatile structure, and hence need to "repartition" every step, it would be futile to name and align partitions with that particular structure. But in most cases where incremental computing is suitable, you can find at least one relatively stable structure for partitioning that will hold for a lot of incremental steps.

I'm curious what "world" you live in where you simultaneously (a) believe you'll benefit from incremental computing, and (b) can't find any relatively stable structure for aligning partitions, and (c) can't create a relatively stable structure (e.g. using stable indexing/clustering techniques or reconsidering the original requirements).

The world is a typical AJAX interaction with the DOM: some random subset of the page must be rerendered (CSS selectors -> cascade -> tree normalization -> layout solving). The actual mutations are few (some attributes changed, like adding a bit of content), but the impact is of the surrounding region. This is the incremental computation.

To make the incremental computation parallel, each stage of the above pipeline is basically a parallel for loop or a parallel traversal of some dynamically determined subtree. Different stages may act upon differently-sized neighborhoods.

The issue with parallelization is that the incremental computation is big enough that we want to make it faster / more energy-efficient, but it is tiny enough that you need to worry about data movement, task creation / scheduling overheads, etc. Dynamically scheduling each stage of the pipeline (in the sense of Cilk) is a no go; domain properties / static analysis / etc. become really useful at these granularities. It took me a lot of data representation and scheduling optimizations to get 10x parallel speedups on the predictable full-page case where you can assume the data/time costs of handling individual page nodes are sufficiently similar; doing the same for smaller regions is going to be hard.

Dynamic web pages and DOM are actually among examples that I believe will work very well with partitioning and batching, though it may require some browser support (so we can atomically update many variables from one batch or step without rendering intermediate states). But I'm working under the assumption that I control the web page, perhaps via code generation.

You're treating the page as a random variable, from which you shall somehow squeeze an incremental and parallel implementation. I grant, this is a much greater challenge, and is valuable for retroactive benefits. But I do not see it as essential to any problem domain.

I don't think munificent was talking about parallelization, just concurrency. Whenever they are conflated, we always get misunderstandings like this.

Ah, rereading, I think you're right. However, then I disagree even more -- unless the example is for something like an incremental/streaming lexer/parser (e.g., for network-bound computations), the interleaved design is convoluted.

What exactly do you find convoluted about it? The lexer has a very simple interface and can be described as a finite automaton, or with a regular expression. Nothing convoluted here. The parser is made simpler by the tokenization — that's the whole point of introducing the lexer after all.

Perhaps you're saying that the communication between the two processes/threads/coroutines is convoluted? But that would be a shortcoming of the language, not of the algorithm.

No, I was referring to the intent of the design, not the design itself. If performance was the goal, then this design doesn't really give us that; if performance wasn't the goal, then what is the use case of the design here? Why not just use the normal stream design that we always use for compilers? Heck, if you want to hide the state machine, we even already have yield return constructs.

So far remarks have been very helpful when they show readers of docs may expect specific circumstances when certain terms are used. I should probably never use the phrase "green threads" in docs because folks can't easily distinguish them from native threads, except perhaps in footnotes confirming a suspicion readers have that fibers are green threads. My last paragraph below summarizes what I thought this afternoon, and paragraphs before it just provide context, though in an awkward, disjointed way.

A library I have in mind won't target parallelism in hardware at all. Explaining why hardware parallelism is not the goal of concurrency with fibers should be a main point in early introductory material, to avoid confusion. The conversation above about parallelism helped me see some folks may have the objective of optimizing speed by using more hardware resources at once. When one asks, "How can I use all these cores?" one tends to think of threads solely as a means of parallelism.

In contrast, in an i/o bound situation, one can have the problem of keeping just one core busy when data trickles in slowly enough, as when many network connections pass through rate-limited WANs (wide area networks). However many cores you have, getting more utilization out of each one gets more bang per core. (When a license grants rights to process at different rates, the top end for any given configuration is limited by how busy you can keep the processors in each model offered.) Of course you can always just waste cycles by copying data around to no effect, but that's utilization without benefit.

I first worked on a significantly threaded platform in 1990, using C++ on 8MHz machines with one core (Motorola 68K). The main purpose of threads then was to serve both end-user UI presentation demands as well as pursuing something productive in the background at the same time. In this context, the user represents the other parallel processor. Otherwise batch processing devoted to one task at a time would be more efficient use of desktop hardware. The classic system demo ran multiple graphics simulations at the same time in different windows, or the same window. (I know this is old and boring, but I mention it so you can think about what it means.) The point of interleaving work was to have a responsive UI and also keep the resources busy doing whatever a user wanted.

The Netscape browser in 1997 was quite frustrating when the UI froze, because it was single threaded on Macs, and only performed background work via an ad hoc homebrew mechanism, behaving something like a small number of fibers dedicated to specific purposes, but without any formal thread-like organization. Anything using the disk heavily just froze the UI, except where an incremental-progress API was added, typically requiring one new hand-rolled mechanism in each case. While browsers are much better now, they seldom seem terrific given the huge increase of hardware resources since (if you ignore disk iops). With a nice modern OS, we can throw multiple processes at such problems, but we still can't run tens of thousands of processes, because an OS process is so heavyweight it consumes a big percentage of resources by itself and fights with other processes in a way driving utilization high without getting any benefit — probably just swapping like mad. (Keep in mind how few disk iops we have.)

Whether a problem solution involves N cores and 1 problem, or N problems and 1 core, may depend on whether you can get data to a core fast enough to keep it busy. When all your data is already in hand, stored locally, performing a complex operation (more) quickly may require parallel use of as many cores as feasible. But when all the data isn't known yet, because it's still showing up from disk or network, keeping even one core busy may require N processes, threads, and/or fibers. When N gets really big, it can't be processes.

[Edit: I had another idea I was going to wedge in there, but now it's too late to incorporate organically. Maybe if I append what seems a separate topic, you'll see how it's related. This question is one example from a family of similar ideas: "Why doesn't every process have an embedded http server running in one or more fibers, so you can ask questions about internals?" When you build an app, why isn't there a command line interface to ask your app questions about how it's going? Why aren't inspectors standard? Why can't you browse stats or a view of internal processing in your app from a browser, when it runs outside the browser? It might sound like I'm asking why debuggers suck, but that's whiny and I mean something more constructive: why can't your language support adding other activities in the same process with support functions enhancing your ability to understand what's happening?]
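As a toy illustration of that embedded-server idea (Python stdlib only; the stats dictionary is a stand-in for whatever real internals an app would expose), a process can answer HTTP questions about itself from a background thread while its main work continues:

```python
import threading, json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for internal state the app wants to make inspectable.
STATS = {"requests_handled": 0, "queue_depth": 3}

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve a JSON snapshot of internal stats on any GET.
        body = json.dumps(STATS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the host app's stdout quiet

# Port 0 asks the OS for any free port; a daemon thread runs the server
# so the main program keeps doing its real work.
server = HTTPServer(("127.0.0.1", 0), StatsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
# ...main program continues; `curl http://127.0.0.1:<port>/` shows stats.
```

In a fiber-based runtime the server loop would simply be one more fiber instead of an OS thread, which is the point of the question above.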

I often see you speaking favorably of "green threads", and I think I know what you're talking about by them. I don't have criticisms to share because I think they're pretty keen too, but you were asking about motivations for using them...

I use delimited continuations for such "green threads". There's no multi-core benefit. The benefit is having multiple contexts to switch between. For example, a UI that can yield while awaiting user input. So... probably much like you were doing on the 68K!

Having multiple contexts keeps each context leaner. This might be what you meant by threads tending to flatten code? By having this kind of a control-flow primitive, interaction between subsystems can be simplified if part of the interaction is waiting... or in particular if the subsystems would otherwise be written as re-entrant state machines which have to trickle down to their correct state. I'm in games, and it seems the bulk of such code is written in a style of stateful re-entrant code: i.e. objects with an "update" method. It pains me to see. It often leads to ugly code. And I think programmer determination is the only thing making it work in practice.

You could write Artificial Intelligence (to intelligence as artificial flower is to flower) in a fairly arbitrary manner, as complex as you like, as if it was the lone process on a system except for magical functions to act or perceive... and then support multiple running instances using green threads with some sensible points where they yield -- such as awaiting a perceptual update, or the results of a fine-grained action. These are natural points where the code is waiting for other systems anyway. Perfect. On the other hand, without threads, design choices might be constricted to suit a regular re-entry point ("update"). As the complexity of this AI increases, the complexity of conforming to the needs of "the rest of the system" increases. Complexity begets complexity, and then you start hitting a brick wall sooner than necessary.

I'm not even particularly interested in parallel hardware threads. Fork, sure. GPU yeah. But dealing with fine-grained concurrency... blech. ;) Green threads, as you've been using the term -- I'd be sad to not have them. Thanks to Oleg for writing delimcc for OCaml!

You might look into usage in Python, as I've come across a fair number of references to people using green threads in it. I'm always surprised they're not more commonly used! Maybe people think "oh, threads that don't help me leverage all my cores -- what's the point!?" Because, as you've discovered, "threads" have become tied with hardware parallelism.

It's okay if you don't criticize; your related motivations and ideas are enough to ponder. (Useful criticism is hard when it targets basic premises. Attacking detail is easier, amounting to expression of taste more than analysis.)

I haven't looked into delimited continuations much, because I'm happy to work with plain continuations. I probably won't mention delimited variations in docs, unless I can usefully summarize them in something like a paragraph or two, without reference to any other source beyond something I said nearby. Since I'll aim to write code a person can understand completely by themselves, with associated docs, I'll never refer to papers in preference to clear local docs. Since you like delimited continuations, perhaps you (or someone else) can define delimited continuations in 25, 50, or 100 words — the fewer the better. But it has to be a definition that doesn't refer to another source, like a paper, since then it's just a pointer and not a definition. The worst possible thing you could do is have Oleg post a reply here that amounts to, "Read my work." :-) I find definitions useful when I expect they make sense to a C coder who doesn't want to learn any CS theory more complex than a grasp of stack call discipline.

Note I like to see folks use a variety of useful tools: I don't have much of an agenda beyond ensuring something I do is simple enough. Other folks can go as complex as they like; I have no problem with that. I expect to include dialogs in docs, amounting to conversation between characters about topics, especially when viewpoints differ substantially, so I can convey notions in conflict more clearly than I can in "consistent global first person narrating authority" perspective. A fiction will be maintained that my code is being written by a guy named Wil, who resembles me slightly. (My middle name is Wil.) At least one character will aim to make Wil stop, because Wil's output interferes with agendas other folks pursue. I'll try not to make it too funny, but when you imitate things people actually do, it tends to be humorous. Partly I mention dialogs here so I can cheaply refer to the idea again.

I like the general idea you suggest that multiple contexts may allow you to focus on less in individual leaner contexts. It's like the idea of wearing a hat representing a narrow role. I'd be interested in further characterization of code in games, if it's about clarity of context and organization that might have something to do with scope, timing, and dependencies. Ugly code is useful to hear described; a new variant can suggest novel ideas for improvement.

It sounds like your paragraph on AI (in games?) is about keeping game intelligence in the background in a way that it does not interfere overly much with actual game play. I can relate to that. I first tried Skyrim on a PS3, and it had spectacular UI latency once a character reached a certain level, reportedly because the AI started getting quite busy when a lot of things might apply to ongoing play. I complained to my sons the devs had clearly done something wrong. So I bought an Xbox — a third one since my sons each had one — just to play Skyrim without it becoming unplayably jerky with massive frame drop. Anyway, you could say I'm interested in the use of fibers to ensure rendering is served in a way that keeps the UI smooth. However, I'm not especially interested in UI per se. (I do backend stuff generally.)

Whenever I hear about Python getting more green thread support of one flavor or another, that makes me happy. It's a good way to get some of the performance latent in modern very fast processors, which ought to be very much faster than those I used in the 80's. I was perfectly happy with the response time of an SE30. I'd probably use Python more if it had no features with magic "you don't need to understand how this works" elements in their definitions. Mainly I just want to wrap interfaces written in lower level languages for interactive testing.

Your last remark reminded me of a joke I plan to put in docs; I hope you take no offense. It's just a nod toward a tendency for folks to ask, "Have you tried XYZ?" for some value of XYZ. In dialogs, a couple characters — a boy and a girl — dressed like Krishnas will ask, "Have you tried Ouija?" And then they'll earnestly proselytize the use of ouija boards for programming. Thanks for giving me an excuse to mention that.

It took me a while to understand enough of delimited continuations just to implement a yield that worked. So my understanding is quite poor. But, you seem comfortable with continuations... the delimited part is adding a "prompt", which is like a handle or mark on the stack at that point. This is used to limit the continuation, rather than it being "the entire rest of the program".

Disclaimer: I could very well be butchering the concept and getting things wrong...

When I "prompt" and call a function... this new function is executing, but if it yields, its context is preserved while execution returns to where the prompt was set. At this point I think the situation may be the same as coroutines... although instead of my mainline yielding, it resumes the yielded context. I might just be using different terms, as it seems exactly like coroutines yielding to each other.
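The prompt-then-yield shape described above can be sketched with Python generators (a loose analogy, not a real delimited-continuation implementation; all names are invented): creating the generator marks the boundary, and each yield hands control back to that boundary with the function's context preserved.

```python
def task():
    # Runs until its first yield; its context is then frozen in place.
    x = yield "paused"          # control returns to where the "prompt" was set
    yield f"resumed with {x}"   # resuming continues from right here

# Creating the generator is roughly the "prompt": it marks the spot
# where control lands whenever task() yields.
t = task()
print(t.send(None))   # runs task() up to its first yield -> "paused"
print(t.send(42))     # resumes the preserved context -> "resumed with 42"
```

As the comment says, this looks exactly like coroutines yielding to each other: the caller and the generator simply take turns.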

Delimited continuations seem to be a pared-down generalization of control flow, such that use cases are wide-open. I'm only giving one use here, so that doesn't describe what they are. It's like trying to describe monads!

Ugly code in games... well, I guess it's subjective.

Games are largely developed with C++ (though Python and C# are popular for Indie games). The typical style is to encapsulate everything in objects with init, update, and deinit methods. So updating the game by one time-step is updating every active object by one timestep. This idea is very simple, and much like poetic meter or limiting oneself to painting with one brush and one color, it can bring out some creativity and it "works". Anything you do just has to work within this constraint... the advantage being that you don't have to think about possible better approaches at a higher level -- you always know you're making a new object and hooking it up via init/update/deinit.
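A minimal sketch of that init/update/deinit style (illustrative Python rather than the usual C++; the class names are made up): the whole game advances one time-step by updating every active object one time-step.

```python
class GameObject:
    """The conventional contract every game object fits into."""
    def init(self): pass
    def update(self, dt): pass
    def deinit(self): pass

class Blinker(GameObject):
    """A toy object that toggles its visibility twice a second."""
    def init(self):
        self.time = 0.0
        self.visible = True
    def update(self, dt):
        self.time += dt
        if self.time >= 0.5:
            self.time -= 0.5
            self.visible = not self.visible

# Advancing the game by one time-step = updating every active object.
objects = [Blinker()]
for obj in objects:
    obj.init()
for frame in range(60):          # one second at 60 frames per second
    for obj in objects:
        obj.update(1 / 60)
for obj in objects:
    obj.deinit()
```

Everything, however complicated, gets folded into this one loop, which is the constraint (and the comfort) the paragraph above describes.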

Class hierarchies can get very hairy. Worse yet, to avoid bugs from altering existing parts of the hierarchy, one might duplicate large sections and modify them... duplicating bugs; multiplying the variations of bugfixes. But I imagine this is commonplace in any mid to large project with deadlines.

Over the past decade there's been some popularity in using a database for game-object properties rather than embedding state in gigantic class hierarchies. I laid the groundwork for this at one company, and I underestimated the culture shock from not having the usual object.update framework. That was a real mess... leading me to realize how important familiarity was. It took several months for people to adjust. The concepts weren't difficult... we just develop go-to habits and patterns. The idea is that instead of heterogeneous collections of properties per object, you are dealing with tables of homogeneous properties -- objects then being arbitrary collections of these properties. And here, instead of object.update, you update the tables, so you can efficiently process all instances of a given property.
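The tables-of-homogeneous-properties idea can be sketched in a few lines (a toy illustration in Python; real systems use packed arrays, and the table names here are invented):

```python
# Each property lives in its own table keyed by object id; an "object"
# is just whichever rows happen to share an id across tables.
positions  = {}   # id -> (x, y)
velocities = {}   # id -> (vx, vy)
healths    = {}   # id -> hit points

positions[1], velocities[1] = (0.0, 0.0), (1.0, 0.0)   # a moving thing
positions[2] = (5.0, 5.0)                              # a static thing
healths[2] = 100

def update_movement(dt):
    # Instead of object.update, we update a whole table at once:
    # everything that has a velocity (and a position) moves.
    for oid, (vx, vy) in velocities.items():
        x, y = positions[oid]
        positions[oid] = (x + vx * dt, y + vy * dt)

update_movement(1.0)
print(positions[1])   # -> (1.0, 0.0): only the moving thing advanced
```

The per-property update loop is what makes it efficient to process all instances of a given property in one pass.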

Subsystems in games tend to have much more interesting structure. Physics, rendering... sometimes AI (computer controlled characters) are interesting, but often they're state-machines.

I know the problem you ran into in Skyrim. I was hitting that too. Though rather than buy an Xbox I just focused on finishing the main storyline. :) Bethesda have an interesting AI system, though it doesn't always shine through. It's quite complex and I can imagine several ways it could hog too many resources without easily regulating itself. Really, that is a common enough problem, isn't it? -- code monitoring its own time usage. A progress bar is a simple example -- we typically add checkpoints in processing, but this is quite crude, and reliant on hand-tweaking based on observation. Yucky stuff. For throttling resource usage you can at least have an outside observer (watchdog) decide your time is up.

I'm trying to have a PL-relevant reaction to your post, but mostly coming up dry. Let's see... on ugliness, if you identified an ugly effect you see in games, and related it to something about the language involved, that could drive a subthread. You noted hairy class hierarchies; that seems cross-language and common to most things favoring an object-oriented model. One can find passive aggression in declarative typology, because it doesn't really grease progress toward objectives. Does the Dewey Decimal System help authors generate content? No, but it saddles them with the latent expectation of being pigeonholed. I can't come up with any axe to grind here, though.

I generally like Bethesda games. (I don't see any way to make this on-topic PL, so this probably needs to be my last on it.) I like open worlds, and hate scripted outcomes. For example, I loved InFamous, but hated and stopped playing InFamous II because you pretty much had to do exactly what you were supposed to do next, once you reached a certain point. Lack of scripting in Bethesda games is great. When I first played Fallout 3, at first I was deeply puzzled because the game didn't seem to care if I ever did anything -- and then I got to like it. To my secret shame, I have wasted at least a thousand hours playing Fallout 3 in the last five years. Usually I see at how low a level I can reach all 100s in stats for a new character, which involves collecting skillbooks while getting the least experience possible at the same time. (Last play-through I managed it at the start of level 17 with 225 skillbooks read, without overshooting the 100 max.) Part of the relaxation is that it has no point.

Actually, that helps me make a point about new coding libraries: they generally don't have any point either — just something to do.

"Why doesn't every process have an embedded http server running in one or more fibers, so you can ask questions about internals?" When you build an app, why isn't there a command line interface to ask your app questions about how it's going? Why aren't inspectors standard? Why can't you browse stats or a view of internal processing in your app from a browser, when it runs outside the browser? [...] Why can't your language support adding other activities in the same process with support functions enhancing your ability to understand what's happening?

These are good questions.

Some languages do support this sort of introspection (Smalltalk being the canonical example). Some languages at least enable it, with reflection and a managed runtime.

For the issue of browser-based inspectors in particular: the term 'AJAX' was coined in 2005. It took six years for it to penetrate; XMLHttpRequest was invented in 1999. I would not be surprised to see more web-based IDEs and debuggers over the next five to ten years.

Of course, there is still a huge performance cost associated with such features. Even a carefully designed debugger will tend to keep extra state and hinder optimizations. So I wouldn't expect this sort of feature from every process.

Smalltalk means a language, and typically also a specific system using that language. So while the PL doesn't require images or inspectors, that's what you get in image-based descendants of Smalltalk-80 systems. The name conflates PL and OS features. (Java typically suffers from similar conflation as well; it's whatever folks want to lump into a release using other Java tech.)

So if I say "embedded http server" this includes very minimal versions too, as long as when you connect to a port advertised as understanding http, you get browseable results — possibly in the most primitive sort of text-based interface imaginable. When you say "web based", which I carefully didn't, this implies more: perhaps everything anyone associates with large web frameworks. (Maybe folks go instantly to: But how are you going to embed Apache as a fiber? Or something else that sounds big.)

Suppose you add printf() based logging to an app for debugging. This sounds normal to folks, and they don't ask, "But how did you embed all of TeX in your app?" Because printing doesn't mean everything you might do when printing. The urge to go all-or-nothing is less present when printing. (It would be funny if folks demanded Unicode support when your C app logs.) If you tell a friend your garage now has a work-table for minor carpentry, they won't plan to come by and build a Boeing 747.

Let's segue into typical debug build practice here, and relate this to optional support integrated into an app at runtime. It's common practice to have more than one way to build software; a "debug" build usually turns on more symbols, turns off more optimization, and includes whatever hacks you added last for tracing, logging, and auditing data structures in redundant representation for comparison. If you add http debugging too, maybe a define for DEBUG_HTTP_PORT tells your code to link in a minimal optional http framework, and listen on that port. I won't belabor this further. (And clearly it would be crazy to leave that enabled in production code, unless you put serious engineering into every phase of development through testing and security audit.)

Maybe all you support is a static set of URLs that can read things you want to see. Your app's process state might be exposed in a homebrew imitation of a proc file system in Linux. Or maybe you go nuts and make debug support programmable, letting you upload scripts that run as fibers in your app. To actually make your app debuggable in a step-wise fashion in a web-browser would be do-able, but an enormous amount of work (that I can't imagine anyone doing unless they expected to sell product); among other things, it would require running on a VM with integral remote debugging support. Fun as a hack, agonizing as a product.

My original point was that folks typically write software in a monolithic fashion that owns an entire process address space in a way that you can't add other independently executing entities, even when debugging, because it would seriously interfere, at many levels. Such extreme interference doesn't seem necessary. And the value of flexibility when debugging seems like it would be high. We tend to improve only things we measure. The ability to look at what code is doing is a kind of informal measurement, with your expectations as a ruler.

Here's another anecdote (if you hate these, just tell me to stop, I can adjust). One unspecified job I had in the 90's included an OS feature supporting external requests to launch a new thread inside a process, if everything was set up the right way. I used it to spawn a remote interpreted inspector, using a dynamic language wrapper around native C++ interfaces. We jokingly called this feature a "virus insertion framework", and were told in no uncertain terms to stop saying that before someone overheard. But it's moot now.

folks typically write software in a monolithic fashion that owns an entire process address space in a way that you can't add other independently executing entities, even when debugging

There are many memory-safe languages. And in such languages, it isn't difficult to introduce independently executing entities. But independent entities aren't what you need. For debugging (or extension, or other purposes) you need dependent entities. Pages provided from the http fiber must depend on the state in the primary process behavior.

Often, especially in memory-safe languages, there is no semantic access to the information you wish to display. Therefore, you must bypass the semantics with support of the language's runtime or compiler. Or you can hack the semantics to add a bypass, like reflection. Naturally, there are many security and safety risks involved with bypassing or weakening semantics. But these can be mitigated in a purely observational 'debugging' role.

There are programming styles that readily support semantic accessibility: blackboard systems, publish-subscribe systems, data-centric models, logic programming. Pushing most state into external resources (such as a database) helps a lot. Encapsulating ad-hoc state in stacks, semaphores, etc. can hurt a lot.

When you say "web based", which I carefully didn't, this implies more

All I meant by "web based" is that you can use a COTS web browser and have clickable links. Of course, if you need to continuously hit the refresh button, that will certainly hurt the user's experience. A live dashboard or stream is appropriate for understanding a dynamic system.

How the http server is implemented is, I think, a lesser concern than how it obtains whatever data it is to present. Especially if you want this to be "standard" for "every process".

I don't have a good answer, but saying nothing seems lame. Making points related to PL is hard. Nothing comes to mind except something so abstract it's almost worthless: I expect chaos, meaning I expect such divergent semantics and mechanisms that standardization has little value or meaning when it comes to process inspection. I wish trying to inspect was standard, starting small and simple, and being satisfied with some useful evidence in preference to pursuit of ideals.

Generators are iterators where the producer runs as a green thread / coroutine. It is a protocol: the green thread acts as the producer, and when it yields a value to the calling thread, that value is taken as the next value of the iterator.
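That protocol is directly visible in Python generators (a small illustrative example):

```python
def numbers(limit):
    # The producer: suspended at each yield, resumed by the consumer.
    n = 0
    while n < limit:
        yield n   # the yielded value becomes the iterator's next value
        n += 1

# The consumer just sees an ordinary iterator; the suspend/resume
# protocol underneath is invisible to it.
print(list(numbers(5)))   # -> [0, 1, 2, 3, 4]
```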

...then why are all the other models implemented in them? I say this not because I think threads are "good"; in fact I would recommend using a higher-level model if you possibly can. But most other models are not as complete, and I've often found that when push came to shove I had to drop down to threads and build my own buggy higher-level model, per the variation of Greenspun's tenth rule above. The fact that more and more of these buggy higher-level models start to look like dataflow I find particularly interesting.

Operating systems provide threads. And many APIs and FFIs are thread aware (e.g. OpenGL uses thread-local state) or just not thread safe, so clients must be aware of threads for interoperability. Regarding why threads are widely used, I'd say it's a matter of inertia and integration. If our operating systems and popular libraries were developed in another concurrency model, that model would be the new base for implementation.

If you ignore the integration issues with FFI and OS, threads are no more 'complete' than other concurrency paradigms. And they lead to a lot of awkwardness, e.g. because inter-process communication is vastly different from inter-thread communication.

I think it would be better to invoke Greenspun's tenth only if you found yourself modeling threads atop another concurrency model (e.g. atop actors).

Threads are a low-level assembly construct useful for expressing concurrency and parallelism. The only question is whether programmers should use them directly, and if not, what higher level abstraction works best to hide them.

GPU "threads" are very different in the sense that they are truly SIMD. There are also some interesting real dataflow machines and transputers that have fundamentally different low-level views of parallel and concurrent execution than what we are used to on contemporary hardware.

One could imagine that someday...when we are finally close to executing on something like brains...our notion of thread won't make any sense anymore.

Green threads can be motivated by comparing them to two alternatives: OS threads and callbacks. It goes something like this: when you have things happening asynchronously, you don't want to block on one thing while you could be doing something else. For example, if you make an HTTP request and a disk read, you don't want to do the HTTP request and wait until you get a response, and then do the disk read, because you can do those at the same time. There needn't be parallelism in your own program, just parallelism between your CPU and the external entities (e.g. the internet and your hard disk). Your program is still doing only one thing at any given time, but while your program is doing something the disk and the internet are busy for you in parallel with your program. Green threads are better than OS threads because they are lighter (faster context switch and less memory), and they are better than callbacks because you can write code linearly as you normally would instead of having to do a manual CPS transform. There are also alternatives, like futures, which may or may not be better.
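The overlap between the program and its external entities can be demonstrated with Python's asyncio (a sketch: asyncio.sleep stands in for network and disk latency, and the function names are invented):

```python
import asyncio, time

async def fake_http_request():
    await asyncio.sleep(0.2)   # stands in for network latency
    return "response"

async def fake_disk_read():
    await asyncio.sleep(0.2)   # stands in for disk latency
    return "bytes"

async def main():
    start = time.monotonic()
    # Both waits overlap, so the total is about 0.2 s rather than 0.4 s:
    # the program does one thing at a time, while the "network" and the
    # "disk" are busy for it in parallel.
    results = await asyncio.gather(fake_http_request(), fake_disk_read())
    print(results, "in", round(time.monotonic() - start, 1), "s")

asyncio.run(main())
```

The code reads linearly, which is the claimed advantage over hand-written callback chains.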

Another motivation for green threads is what munificent writes above: not for concurrency in reacting to external events but for structuring a program internally. I'm not convinced that green threads are a good idea for this. By green threads I mean threads that can cooperatively yield to the scheduler, which will then select another green thread to run. The thread that is selected to run is non-deterministic. The scheduler may happen to be deterministic, but which thread is selected to run is certainly an implementation detail that you don't want to rely on, so from the point of view of the programmer it is non-deterministic.

If you are using green threads for structuring a program internally then non-determinism doesn't buy you anything; you will want to eliminate it later with synchronization constructs (e.g. in the lexer-parser example he does that with channels). Rather than introducing non-determinism and then eliminating it, it seems to me better to use a construct that is already deterministic but still allows you to decouple lexer from parser. Coroutines are the obvious choice.

Coroutines differ from green threads in that you don't yield to the scheduler; you explicitly yield to the coroutine that you want to run. In this example, the lexer would explicitly yield a token to the parser, and the parser would explicitly yield to the lexer when it needs a new token. This is fully deterministic at all conceptual levels. You can implement green threads on top of coroutines by writing a yield-to-scheduler function that picks another coroutine from a list of suspended coroutines, and yields to it.
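The deterministic lexer-parser coupling can be sketched with Python generators (purely illustrative; the function names are made up): the parser resumes the lexer exactly when it wants the next token, so no scheduler is involved and the interleaving is fixed.

```python
def lexer(text):
    # The lexer is suspended between tokens. It runs exactly when the
    # parser resumes it, so the control transfer is explicit and
    # deterministic: no scheduler picks who runs next.
    for word in text.split():
        yield ("NUM", int(word)) if word.isdigit() else ("ID", word)

def parse(text):
    result = []
    for token in lexer(text):   # each iteration resumes the lexer once
        result.append(token)
    return result

print(parse("foo 42 bar 7"))
```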

If you have decided that you want coroutines in your language you may also consider including delimited continuations, which are strictly more powerful. The language Eff is another alternative that has a construct similar to delimited continuations, but possibly more compositional (does anybody know a clear relation between continuations and Eff?). Yet another alternative is monadic do-notation, which alleviates some but not all of the trouble of writing in continuation passing style. As far as I know the jury is still out on what the best delimited-continuation-like construct is.

If you have decided that you want coroutines in your language you may also consider including delimited continuations, which are strictly more powerful.

Only in the sense that multishot continuations are strictly more powerful than one-shot continuations. Multishot coroutines seem equally powerful as delimited continuations.

The language Eff is another alternative that has a construct similar to delimited continuations, but possibly more compositional (does anybody know a clear relation between continuations and Eff?)

Coincidentally, this is almost exactly what I asked a few days ago, specifically re: Oleg's article on extensible interpreters which uses delimited continuations to compose effects and handlers in what seems like a very similar fashion to Eff's algebraic effects.

IIUC, the paper talks about a generalized yield with an explicit run operator which needs to be wrapped around code that may yield (it's easy to see the analogy to reset of course). But do any existing languages with coroutines have such an operator?

asymmetric coroutines are implementable in terms of shift/reset

And the other way around? [Talking about coroutines as they exist, not the generalized ones from the paper.]

And the other way around? [Talking about coroutines as they exist, not the generalized ones from the paper.]

If you're referring to specific implementations of coroutines, then we should discuss the implementations (like my old C library, which takes coroutine arguments and so is compatible with capability discipline). See Revisiting Coroutines where I discuss the capability implications of different coroutine designs. Only unlabelled, asymmetric coroutines are problematic due to "coroutine capture".

The coroutine abstraction itself is not inherently capability insecure, though certain implementations may be. Restricted coroutines as exist in C# are also not capability insecure, since "yield return" is simply an implicit "resume" capability in EROS/KeyKOS parlance. Because the depth of such coroutines can only ever be 1, there is no problem with coroutine capture. Basically, yield return should return a value to the yielder as well (the resume capability) to prevent coroutine capture in asymmetric designs.

Lua coroutines look like unlabelled asymmetric coroutines so they would be problematic. Any others?

Yes, multishot coroutines are yet another mechanism to express effects, although as far as I know they are an uncommon form of coroutines.

I also asked the same question a while ago in that thread you linked to, and at that time nobody gave a complete answer (in the form of giving Eff as syntactic sugar on top of delimited continuations, or giving an example to show that it's not possible), but maybe it's still unknown now.

I think this comment is pretty definitive. It implements the general delimited yield/resume in Eff, so algebraic effects in Eff are at least as expressive as delimited continuations. Thanks for mentioning your comment, that thread was interesting and I had forgotten some of the subjects covered there. The parallels are still pretty compelling between Eff and EDLS so I'm still hoping someone with more knowledge will chime in with an incisive analysis.

Your whole first paragraph is very useful to me. When things could occur asynchronously, I'll usually have an additional goal to make this so, even if standard api leans toward synchronous. In a dialog, some character will have to ask, "But why would you want async anywhere?" It's easy to keep asking why.

(Anecdote: folks discussing api design at Taligent in the 1991-92 timeframe asked themselves, "Sync or Async, which is better?" Sync advocates came up with this: well, if the api is sync, you can always spawn a native thread to make it async — problem solved, because that thread can just sit there and wait for reply. This left async advocates saying, "Um, well, but what about..., uh," with a distracted look while pointing fingers at half-formed ideas. At the time it didn't seem reasonable to attempt more than N async operations where N exceeded an okay number of native threads. But now with a bigger gap in processor vs i/o speeds, it's easy to start far more operations than threads you want to spawn.)

For example, if you make an HTTP request and a disk read, you don't want to do the HTTP request and wait until you get a response, and then do the disk read, because you can do those at the same time. There needn't be parallelism in your own program, just parallelism between your CPU and the external entities (e.g. the internet and your hard disk).

I can generalize this to discussion of things that can happen in parallel with one's program, to avoid waiting upon them serially if async dispatch can be managed. A gospel checklist of questions can probably be worked out, so randomly considered use-cases can consult appropriate related checklist items. (Are you trying to do X? One answer is Y.)

Rather than introducing non-determinism and then eliminating it it seems to me better to use a construct that is already deterministic but still allows you to decouple lexer from parser. Coroutines are the obvious choice. Coroutines differ from green threads in that you don't yield to the scheduler, but you explicitly yield to the coroutine that you want to run.

Oh, that's a really good point. I think of a green thread fiber as having a continuation which is the one run next when scheduled. So "independently schedulable" is part of what a fiber adds to a continuation. But with coroutines sharing one schedulable system entity, a producer and consumer (for example) can take turns sharing it, by swapping which continuation runs in one fiber.

If you have green threads, you can still have multiple coroutines per fiber, so they share the same timeslice from a scheduler's perspective. In a high level programming language, you must carefully expose a way to do this; in C-with-continuations you can have a fiber switch the continuation that runs as needed. (In C you can do really disturbingly wrong things, some of which must be documented carefully as poison for your program. In addition to proper use, docs ought to have sections about bad usage too. "For topical use only; do not take internally. See a doctor if accidentally ingested.")

If coroutines sharing a scheduler's timeslice is the intent, there's more than one way to do this. For example, a group of fibers can belong to a related family — a green process — and support for timeslice sharing in a green process might make sense in a scheduler. It would have the advantage of reducing competing ways to do a thing, and so improve clarity.

You can implement green threads on top of coroutines by writing a yield-to-scheduler function that picks another coroutine from a list of suspended coroutines, and yields to it.

That's basically what a trampoline in a fiber scheduler does. Returning to the trampoline's dispatch method means, "Yield to scheduler," so that's the point where cooperative context switch can occur. It's how both up-calls and down-calls are done — stack-winding by invocation or stack-unwinding by return. And the C stack resets to a known standard state at that point. Docs for fibers must have a metaphor to explain stack structure, since higher parts of a call stack are heap-based frames, while leaf calls to ordinary C methods add a transient C stack to the bottom, which gets unwound completely by return to the trampoline.
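The trampoline's dispatch loop can be sketched with Python generators standing in for fibers (all names here are invented for illustration): yielding returns control to the loop, which is the point where a cooperative context switch happens and the next runnable fiber gets the timeslice.

```python
from collections import deque

log = []   # records the interleaving so the round-robin order is visible

def fiber(name, steps):
    for i in range(steps):
        log.append(f"{name}{i}")
        yield   # "return to the trampoline": give the timeslice back

def trampoline(fibers):
    # The dispatch loop. Resuming a fiber is the down-call; the fiber
    # yielding back here is where a cooperative context switch occurs,
    # and the next runnable fiber is dispatched.
    ready = deque(fibers)
    while ready:
        f = ready.popleft()
        try:
            next(f)          # context switch into the fiber
            ready.append(f)  # still runnable: back of the queue
        except StopIteration:
            pass             # fiber finished; drop it

trampoline([fiber("A", 2), fiber("B", 3)])
print(log)   # -> ['A0', 'B0', 'A1', 'B1', 'B2']
```

The analogue of the C-stack reset is that each `next(f)` starts from the trampoline's own frame, with nothing of the previous fiber's transient state left below it.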

If you have decided that you want coroutines in your language you may also consider including delimited continuations, which are strictly more powerful.

Well I get coroutines for free with continuations based on cactus stacks of activation frames. While I have given delimited continuations a cursory look, I have a very intense dislike of systems expressed as abstract operators, and thus far I have blown them off. (And since I don't matter, this does no one any harm.) But I'm intellectually curious and still want to understand, even though I don't care. I bet you might be able to define them indirectly for me, without reference to operators, by describing a problem they intend to solve. If you assume I understand continuations completely, just describe a way I can shoot off my foot because I'm not using delimited continuations. Since I'm imposing, don't feel obligated on this, though. However, I think other folks would like it, and it may spur uptake if you care.

If you assume I understand continuations completely, just describe a way I can shoot off my foot because I'm not using delimited continuations. Since I'm imposing, don't feel obligated on this, though. However, I think other folks would like it, and it may spur uptake if you care.

Delimited continuations are to undelimited continuations, as try/catch is to "exit".

Also, as Oleg is fond of pointing out, continuations are always delimited by something, usually the operating system. The same is true of "exit", which is like throwing an 'uncaught exception'; from this perspective all exceptions are eventually caught, but some make it all the way back to the operating system.

Personally I've read a lot about delimited continuations and call/cc but have yet to use either of them for anything. I find delimited continuations more intuitive though. I still don't know if I've grasped call/cc correctly.

(I thought this was going to be clear, but be warned it gets less so by the end.)

That's helpful. It sounds like the scope and bounds of delimited continuations are narrower than continuations generally. I assume this is for both concrete and abstract reasons. One concrete reason sounds like, "I can optimize stacks better when commitments are narrower." (Specifically, you might want a contiguous C-style stack as much as possible for efficiency.) And a similar abstract reason goes, "I have more room to vary language definition when the operators promise fewer things, constraining me less."

Including "exit" in your analogy may be mistaken in a direction opposite of what you might mean, since there's only one exit, while undelimited continuations can go more places than delimited continuations (I think).

For my purposes, I don't need either the concrete optimization or the abstract flexibility. Activation frame lists will perform no worse than C code I have seen that folks find acceptable as "high performance." A sort of high-level language I want on top will present fibers as integral, with a C api exposed based on those activation frame lists as the expected implementation, so abstract flexibility won't have value.

In a kind of C-with-continuations, every continuation visible to a fiber scheduler would be a first class value, in the sense a fiber can replace one with another to change control flow. The following construction aims to explain what this means, by giving you something to visualize.

Draw the call tree of a program as a graph with main() as the root node. Nodes in this graph are activation frames: state that might be mutable when executing associated code. (The same idea as a stack frame convention in assembler.) Informally you can think of each node as a block of code, but since the same re-entrant code is shared by every other invocation of a function, it's really the activation frame each time that differs, so a node is a frame.

Each time you invoke a function from code in a given source activation frame, add a new node for the destination activation frame. Between source and destination frame nodes, add two new directed edges (or arcs): a down-call from source to destination, and an up-call from destination to source. Each up-call is how a return occurs; each up-call is a first class continuation of the sort you see in call/cc for Scheme: it's a reified return address.

Note each down-call (invoking a function) is also a first class continuation as far as a fiber scheduler is concerned, but saving them (beyond context switching) has no big value when you can always just make a new one and call a function again. Doing this is cheap, trivial, and normal in most languages, and makes recursion so easy to get. So it's up-calls that are relatively weird when they can be invoked more than once. (But unless repeat return is planned, chaos can ensue, poisoning an app by corrupting memory as understood by the runtime model; that's the scary part in C.)

As functions call one another and return, we create a graph of activation frame nodes connected by arcs representing code jumps. If you ignore up-call arcs, the graph is an acyclic tree of activation frames reached by down-call arcs alone. So it looks like a tree when we think about who calls who. However, we don't store down-call arcs persistently — down-calls are ephemeral, typically only reachable from a fiber's state representing what it's doing now, which is where a scheduler stores this when a context switch occurs. If we stored all down-calls persistently, the graph would be strongly connected in a way never letting us garbage collect anything. So generally, down-calls are only among roots pointing at leaves of the call-tree still running in fibers. Up-calls representing stacks are what we keep around.

When we look at a given fiber's stack, all we see is a linked list from the current frame up to the activation frame for main(). An up-call continuation for return is stored in each activation frame. (And an up-call continuation can be stored anywhere else too as a first class value.)
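The frame-list structure can be modeled in a few lines (a toy Python sketch; real frames hold registers and locals, and the names here are invented): frames are heap objects whose caller link plays the role of the stored up-call continuation, and holding any frame keeps its whole chain reachable.

```python
class Frame:
    """An activation frame as a heap object. The caller link plays the
    role of the up-call continuation stored in each frame."""
    def __init__(self, name, caller=None):
        self.name = name
        self.caller = caller

def stack_of(frame):
    # A fiber's "stack" is just the linked list from the current frame
    # up to the frame for main().
    chain = []
    while frame is not None:
        chain.append(frame.name)
        frame = frame.caller
    return chain

main_frame = Frame("main")
leaf = Frame("producer", caller=Frame("run", caller=main_frame))
# Holding a reference to `leaf` keeps its entire chain of frames
# reachable, so returning that way remains possible later.
print(stack_of(leaf))   # -> ['producer', 'run', 'main']
```

With refcounting (or any GC), this is exactly why storing an up-call continuation somewhere keeps its stack alive, as the next paragraph describes.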

When frames are refcounted, storing an up-call continuation somewhere keeps the stack for that continuation alive, so you can return that way as many times as you like. For example, if it's a consumer waiting for a producer to generate data, it's a coroutine. This doesn't require a dedicated fiber to run, just a willingness for any fiber to yield to it. I'm guessing this starts to sound confusing. Yes, the potential for chaos is large: plenty of rope to hang yourself.

If you constrain yourself mainly to using fibers, the only parts of your historical call graph still around are those still reachable in stack-frame lists owned by fibers, where a fiber points at the leaf-most part of a given flow of control. That point can be headed downward by new function calls, or it can be headed upward by returning to caller, resuming in each caller's continuation (as expressed by an up-call arc). But you could get fancy and save more continuations in data structures, say to put a variety of signal or exception handlers where a process can find and invoke them.

[Edit: you may prefer alternate directions of "in" and "out" to replace down and up, rewriting text above so down-call equates to in-call, and up-call equates to out-call. This coordinate system metaphor matches in- and out-parameter terminology for argument value movement direction.]

Incidentally, your notions of "up" and "down" w/r/t function calls are opposite mine; I think of call stacks as growing upward with calls and having the "top of the stack" get lower with returns.

I think that a good metaphor for undelimited continuations is the "return" statement familiar from most languages. When I'm explaining continuations, I ask people to think of them as the "return" statement from a given function. When you reify it, you make it into a function-like object whose value you can save somewhere in a structure that endures past the end of the function. Then, after the function has already returned, you can call the continuation to make it return again and again, restoring all of the stack that was there when it originally returned. That stored "return" function is what an undelimited continuation is.

Call/cc is the only function that can reify a continuation, and it can only reify its *own* continuation, and it complicates the semantics by automatically calling another function with call/cc's own continuation as an argument.

But the function that it calls can be a continuation reified by another earlier invocation of call/cc, in which case it amounts to a context switch between sequences of cooperatively-multitasking coroutines. The called continuation saves the continuation that it's been passed as an argument, does whatever it does, then uses call/cc to call that continuation giving it its own continuation as an argument, yielding back to its caller. ... and they can do that all day.

This is all clever, but I'm convinced that it's one of several possible wrong ways to go about capturing and using undelimited continuations. Aside from being nearly useless in pure-functional programming (try to use call/cc without mutation, and without getting into a loop; you won't be able to implement context switching), experience has shown that if you have more than one abstraction that is implemented in terms of undelimited continuations, it's very hard to make it so that they don't interfere with one another and also work without creating a memory leak.

I want to know when folks dislike things, especially if there's a good reason, and you usually have good reasons. So any time you have a reaction, "I hate this," assume what bothers you greatly interests me. (There seemed a hint of that with call/cc; note I'm not actually using call/cc.)

Incidentally, your notions of "up" and "down" w/r/t function calls are opposite mine; I think of call stacks as growing upward with calls and having the "top of the stack" get lower with returns.

I edited my comment to offer alternative directions of "out" and "in", but my coordinate system directional metaphors are aligned like this: more is inward is deeper is down. (If I wrote a dialog about this, going for an Inception joke would be mandatory.) While I can offer several reasons for this, I still want to know when it monkey-wrenches your natural way to visualize.

I have a bias though, which might mean everything I say is just rationalization. Mac stacks in Motorola 68K grew downward, so calling a function pushed a new stack frame in the negative direction. Starting in 1990, I spent three years debugging C++ in 68K assembler because we had no symbolic debugging. The only way to watch code execute was in 68K assembler, so each time I did this, a metaphor of "push means down" was reinforced. However, I don't think that's the biggest influence.

I think my coordinate system for "vertical coordinates increase downward" comes from typical user interface presentation, such as text like this where later lines appear below earlier ones. I write text files all the time, and I know later text appears at positive file offsets, so I associate positive offsets with downward UI presentation. And then I also see lots of C code, where struct fields have the same orientation, where earlier fields shown first have lesser address offsets, while later fields have greater address offsets when declared on subsequent text lines. I might use more spatial metaphors internally, making me want to align all of them.

One of my coworkers (a guy in his 50's like me) said to me, "You know, real trees have roots at the bottom and leaves on top." But when I asked him if he has ever seen someone draw a "tree" in programming that way, he said no. Instead, the "more is down" metaphor seems to dominate tree presentation, when folks always start at the root (say when discussing a recursive traversal) then add children underneath. For example, the expression ((3 + 4) * 5) might often be diagrammed like this, with leaves downward:

    *
   / \
  +   5
 / \
3   4

When I think of function calls, I use the same model, with callee children shown below a parent caller.

When I was thinking about this last week, I also noticed the gdb debugger also uses "up" as a command to change frames in the direction of main() while "down" directs gdb to change frames toward a call tree leaf. I was trying to decide whether trees in diagrams as shown above would fit images people consider normal when visualizing code execution. Does gdb's choice bother some folks?

(Last week I was thinking about a diagram showing C stacks in an XY plane as above, with CPS transformed fiber stacks shown in a ZX plane at right angles. But in both planes, a convention of "more is down" would be used. A ZX plane in perspective would present a root as away, with child nodes toward a reader. The point of the diagram was to show returning to the scheduler required return to the ZX plane, with a local C stack completely unwound. I first started using plane metaphors when thinking about copy-on-write algorithms, with incremental change going in another patch plane.)

Call/cc is the only function that can reify a continuation, and it can only reify its *own* continuation, and it complicates the semantics by automatically calling another function with call/cc's own continuation as an argument.

We don't have to do it the way call/cc does. You got me thinking about this, though. I hadn't given much thought to letting folks express explicit continuation references in C code before it gets CPS transformed. Mostly I was going to do TCO with rewritten async methods. But if I let folks write code depending on TCO in the original C, then the original C will have different stack behavior, since a typical C compiler does no tail call optimization. Suppose I want to (a) let folks refer to "my continuation" in a function, and (b) replace the default continuation with an explicitly passed continuation. I can handle (a) by letting folks define the symbol understood as a keyword denoting "my continuation", which could be defined as a dummy value when run as the original C. Then (b) can be handled by what looks like a function call but is actually a special form. Just for example, suppose the user invokes the compiler with selfcc as the keyword for "my continuation", and CC() as the special form for explicit continuation passing. Then this function call in C could mean "pass my continuation to foo() for TCO":

foo(alpha, beta CC(selfcc));

The weird absence of a comma between beta and CC() lets us define CC() as a macro that becomes the empty string when original C code is compiled without CPS transform.

Aside from being nearly useless in pure-functional programming (try to use call/cc without mutation, and without getting into a loop; you won't be able to implement context switching), experience has shown that if you have more than one abstraction that is implemented in terms of undelimited continuations, it's very hard to make it so that they don't interfere with one another and also work without creating a memory leak.

I agree it seems hard to express context switch without mutation, as it involves updating a fiber's execution state when suspended. (And probably some kind of special form syntax is necessary for coroutines not to look weird.) But this seems normal to me in C. I guess I haven't thought much about mechanisms I might expose in a Lisp or Smalltalk syntax. I've been assuming I won't have any feel for what looks reasonable to users when I know too much about internals. If you have suggestions for syntax, I want to see them when they occur to you.

Surely "less is down" if it's recursion starting at the root. You'll want your leaves to be minimal elements, surely?

I gather you're working from, "Recursion must have less work at each step, ending in a base case that stops recursion." When recursion sheds load at each step, there's "less" as you proceed.

But there's multiple ways to say "more" when describing this. For example, if you start a description at a root (with all the work), as you continue there's more to say, even though there's less work. More here means more presentation. Or, if you're using a stack, as you recurse there's more stack used as you push frames, even though each new frame gets less of the original problem to handle.

I like your objection by the way. It's a good example of why dialogs are useful. Having fictional third parties try hard to misunderstand creates an excuse to go over the same thing several times from different angles. (Not that you were trying to misunderstand; one wants to write characters that way, or else there's no plot.)

One of the things we did as students (and which I still sometimes do) is to run programs without a machine, using sheets of paper to stand-in for memory and invocation frames. Mostly this is now merely a way of reasoning about and explaining programs using a concrete set of physical objects to show how things work and keep track of details, but it is also in this way that I have on two occasions "hand-compiled" a compiler starting from its own source code in the language it implemented, which is somewhat magical given new hardware for which no programming system (not even a cross-compiler) yet exists.

The technique is fairly simple; you put a sheet of paper on the table to be the environment for "main"; when it calls something, you put another sheet on top of it to be the environment for the called procedure; when that procedure returns you take the sheet off the stack to reveal the "main" environment below it, which you are returning to. Etc. Off to the side, you have a binder with numbered pages to serve as your "heap"; "heap" addresses that can be stored in the "stack" of invocation frames, are literal page numbers which allows you to look things up (and write things down) in the notebook as necessary.

Each piece of paper in the call stack has the names and current values of local variables writ thereupon, along with a "return address" expressed as the line number at which to resume execution when you "return" to this sheet by taking the sheet on top of it off. Mutation, in languages that have that, is as simple as crossing off one variable=value line and writing another below it.

One result of this is that when I deal with literal, physical stacks in programming, they grow upward, with the paper environments of called routines laid literally on top of the paper environments of the calling routines, where gravity holds them in place on the wooden desktop.

Whether the addresses are greater toward the "top" of the stack or toward the "bottom" is entirely an irrelevant consideration to me; in fact I don't even know how you're mapping up and down in terms of greater and smaller addresses, nor care. Do the address numbers measure depth, like the ocean, or altitude, like the sky? Both make exactly the same kind of sense. If one of them is more "obvious" to someone else then it's solely because someone else's head doesn't work the way mine does.

As to diagrams of trees, etc., we draw them downward because we write lines of text downward. When we come to the point in our explanation where we need to draw a tree, the area above it is already filled with text, and unless we can guess immediately how much vertical space we need, the area below it is either too close (meaning we don't have enough room for the tree) or too far (meaning we waste vertical space). Conversely if the tree grows downward we know exactly where to start (on the next line below our text) and we will waste no space at the bottom because once we have reached the bottom we will know exactly where to begin writing text again. Thus we use our representational space effectively. But if it were our typographical convention to have later lines above earlier lines rather than below them in text, then the same considerations would result in drawing diagrams of trees upward from the root.

Anyway, I'm always puzzled by the "directions" people use as a metaphor for things that are, strictly speaking, non-directional - like memory addresses in a computer. Most of these analogies are opaque to me. Why for example do most binary trees have "left" and "right" children rather than "first" and "second" children? I have no idea. Why is the most significant bit called the "leftmost" or the "rightmost?" Beats the heck out of me. Why are architectures "big" or "little" endian depending on the storage sequence of multibyte variables? It's fricking random and makes no literal sense. But it seems to make some kind of sense to enough people, enough of the time, that a vocabulary has emerged. I just wish I understood that process.

The whole returning with a continuation confuses me. A stack tells you where you're going and not where you've been. You don't return to a stack, you continue with it. Of course the real problem is describing continuations in terms of call & return which is a bit like describing bricks and mortar in terms of cathedrals.

It's a frame of reference problem. (And yes, this overloading of "frame" in another sense is awkward, but almost inescapable when every word means more than one thing, and we scrabble about applying old metaphors that re-use words in different but confusingly similar ways.) Part of the problem is that I'm not showing real examples of C code before and after CPS transformation, so it's like hearing someone describe plays in a game you didn't see on television.

If you use a trampoline, it simulates a stack used by the layer below, and this simulation goes in the metaphorical ZX plane mentioned above. When a trampoline calls a leaf C method, it runs in the YX plane. If such a method wants to "call" an async method, this occurs in the ZX plane, and thus disturbingly involves returning to the trampoline (going outward in the YX plane), even though logically you're calling another method (going inward in the ZX plane).

Imagine writing code like that manually. I was able to do it, but slowly; not nearly fast enough to crank out code at a useful rate while still catching errors as I went. So no one wants to see that. That's why there should be a C-to-C compiler that preprocesses input C and outputs a CPS-transformed version like the above. However, if it has a bug, it must be debugged with that twisty model in mind. (But if you ever write code heavily driven by callbacks with an async API, that's twisty too.)

Can I think of anything similar? Say you make a system call in C, which does a context switch to the kernel. In effect, you return to the kernel and suspend user space code until later unblocked at the kernel's leisure. Except you didn't explicitly say you wanted to yield to the kernel, it happened implicitly.

I should write up some real docs, so some of this discussion is premature. It's like listening to a couple guys drinking beer and talking about how they would invade Grenada. As an observer, you want to say, "Doesn't look like you're invading anything to me." (This is intended as mild self-deprecation, if it's not clear.)

The source of my confusion is the fact that a function/procedure activation only ever returns once. With this in mind, describing continuations in terms of return is just, well, wrong. The name "continuation" is a good one, as it is where we continue after a return; it is not the return itself. Return is mechanism whereas continuation is program --- the same function called from the same address with the same arguments using unchanged globals can have different continuations (say different caller's caller).

I'm going to disagree with this, and maybe show too why that's dangerous when mutation can occur:

The source of my confusion is the fact that a function/procedure activation only ever returns once.

Someone can probably write an example in Scheme (but it won't make sense to most folks who already find call/cc confusing). So let's use C pseudo code instead, assuming new keyword selfcc refers to a function's continuation, which allows it to be saved somewhere and later invoked again. Suppose continuation_t is the type of selfcc; if I save it in a global cache, I can reach it again later. (Note semantics of refcounting are ignored, so you would have to pretend a C++ style operator=() allowed something complex to happen during assignment, or that someone knew a particular global array was a GC root.)

This example shows how y() can return more than once, causing w() to also return more than once:

Suppose the first time w() is called, it has activation frame Fw1, and the first call to y() has activation frame Fy1. Then continuation selfcc inside y() will have stack Fy1->Fw1->rest-of-stack. So after we stuff selfcc into the global array shown, the continuation and its stack remain alive, permitting another invocation of that continuation. If we assume calling a continuation looks like dereferencing a C function pointer (but that's not reasonable) it could look like this:
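A pseudo-C sketch of the shape being described (my reconstruction, since selfcc and continuation_t are hypothetical extensions; this is not compilable C):

```c
/* Global cache of continuations; pretend it's a known GC root. */
continuation_t g_conts[8];

int y(void) {
    g_conts[0] = selfcc;    /* save y's continuation, keeping it alive */
    return 1;               /* first return, to w()'s frame Fw1 */
}

void w(void) {
    int v = y();            /* control can come back here repeatedly */
    z(v);                   /* side effect redone on each "return" */
}

/* Invoking a saved continuation as if it were a C function pointer: */
void y_zombie(int i, int value) {
    (*g_conts[i])(value);   /* "return" from y() once more, with value */
}
```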

If we call y_zombie(0, k_odd_prime), and the continuation at index zero is still the one with stack top Fy1, it will invoke the out-call continuation returning to activation frame Fw1 ... again. And then z() will be called again, redoing whatever side effect it has, but with a new value this time. Clearly you could get insanely bad effects with this.

So it's not a good idea in general, but it's allowed by the model, unless somewhere declarative syntax is used to say (for example) functions called by w() are not permitted to return multiple times, otherwise failing w() by assertion. (That would give a static guarantee that dynamic violations would be trapped.) But this shows a function activation can return more than once.

[Edit: as Dillinger and Watson note in conversation below, activation frame Fy1 in the stack of y() above doesn't really belong in the continuation for y(), because the continuation can come from another call location that never yet saw an invocation of y(). So presentation above, in which I'll leave this error intact, confuses one thing for another, which I'll explain another time if I can figure out how to make it on-topic for thread and fiber models. When refcounting in C, we may pass arbitrary values back from a callee to a caller's continuation (possibly skipping callers in between elided by TCO), and want the callee's frame to live long enough for a ref handoff without a GC race condition. So something about the runtime invoking a continuation ensures the callee's frame is referenced when a continuation is called, at least until the caller can process a returned value. But I was wrong to say y()'s whole stack was in its continuation.]

You obviously understand what a continuation is, and there is no doubt that you know what happens when a function returns. However, I would suggest that you've confused the two. I would like to demonstrate to you why I'm saying this.

Here's an example in my little dialect of Scheme which I believe implements continuations in a correct way --- a continuation is just a reification of the control and value stack and does not contain any references to the function activation whose continuation it represents.

I define the identity function 'I' and I capture the continuation of one activation of 'I' and call it 'cont'.

You are correct in that the continuation of a function Q contains no reference to any activation frame of Q. Rather, it must contain a reference to whatever activation frame was the dynamic *parent* of function Q.

What I expressed as 'returning from Q again' is more precisely speaking 'continuing as though Q had just returned again.'

I think that a way less confusing than call/cc for capturing continuations would be to treat them as a formal parameter and bind them to a name at the same time you're binding actual arguments to other formal parameters.

With that in mind, in a made-up Schemish syntax, I can write something like this, to create an explanation of continuations that ordinary mortals can follow a heck of a lot more easily than the usual explanation in terms of call/cc:

;; define a variable
(define Qreturn)
Qreturn
==> #uninitialized

(define Qfunc
  ;; the continuation is captured as a formal and named d
  (lambda (a b (cont d))
    ;; here Qreturn is initialized with the continuation
    ;; as its value.
    (set! Qreturn d)
    (+ a b)))

;; the 'continuation argument' is invisible; it is a virtual argument
;; rather than an actual argument.
(Qfunc 1 2)
==> 3

;; Invoking the continuation with arg 5 commands it to
;; continue as though the most recent invocation captured
;; in Qreturn had returned 5.
(Qreturn 5)
==> 5
;; And of course you can do that any number of times.
(Qreturn 10)
==> 10

;; defining Zfunc so there will be a call stack when Qfunc
;; returns
(define Zfunc
  (lambda (a b c)
    (+ (Qfunc a b) c)))
(Zfunc 1 2 3)
==> 6
(Zfunc 2 3 4)
==> 9

;; Now, commanding it to return as though the most recent invocation
;; of Qfunc had returned something else can be seen to restore the
;; stack that Qfunc returned to, which contains one invocation frame
;; of Zfunc.
(Qreturn 3)
==> 7
(Qreturn 10)
==> 14
(Qreturn 2)
==> 6

The reason I encourage people to think of the continuation as 'returning from Qfunc' is that calling (d arg) from within the body of Qfunc has exactly the same effect that people expect to see from 'return(arg);' when they are writing code in other languages. If the name we bound to the continuation had been 'return' instead of 'd' it would be a very natural and familiar idiom to use it that way.

The behavior people see as "strange" about reified continuations doesn't become relevant until the value of the continuation is assigned to a variable with longer extent than the invocation frame of the function that it reifies the return to, like Qreturn in the above example. Even at that, when you access the continuation directly instead of obscuring it via call/cc, it's a heck of a lot easier to understand what it is.

The "undead function" in the above is not Qfunc; as you note, the continuation contains no reference to any activation frame of Qfunc. The "undead function" in the above is Zfunc, which is named in honor of its "Zombie" status. Zfunc has already returned when we invoke Qreturn again. The old activation frame for Zfunc gets loaded onto the stack again and the part of Zfunc that came after the return of Qfunc gets executed again. Zfunc has come back from the grave, with its old dead a and b values, hungering for the flesh of the living!

And contrary to your intuition that a function only ever returns once, Zfunc, though called only once, will return to the REPL, from the same activation frame, as many times as Qreturn is invoked.

Actually Zfunc's activation frame was never claimed by the Reaper, and the a and b values are still very much alive. Because the captured continuation, which was still in scope, contained a reference to the activation frame, it remained live and the garbage collector passed over it.

Yep I was wrong, functions can return multiple times. I got too enthusiastic with my continuation-is-not-return standpoint. Even in my example the special form 'set!' returns to the REPL twice from the same frame.

"The function which has its continuation captured does not return every time the continuation is invoked" is more along the lines of what I wanted to establish. Confusing return and continuation invocation isn't that helpful, I don't think.