and a bunch of ideas that have been banging around in my head for a while.

I have always been very leery of multithreaded programming. I understand the concepts well enough. And sometimes threads really and truly are the only option. This is particularly true at the OS level, where you have only one Ethernet card and need to mediate access between several different things running at the same time that all share the same underlying OS.

But it seems to me that having several threads of execution share memory in user-space processes is a mixed bag, and possibly not really worth it. I have several different reasons for thinking this. And a few of those reasons arise directly out of experience with modern multiprocessor system design.

Event-driven is better for single processor systems

My first thought on encountering the idea of threads was that they were totally ridiculous to use on anything but a multiprocessor system, provided you had an adequate OS interface for dealing with asynchronous events. Processing in discrete, event-sized chunks that voluntarily give up the processor when they're done often results in better throughput, because the processor's context is not randomly whipped around by context switches. Locality is everything.

And it's an easier model to get things right in. You don't have to worry that your program might be interrupted at any point and that some random thing will come along and change the bit of stuff you're working with, unless you and the other random thing have an explicit agreement not to do that.
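The event-driven model described above can be sketched as a small select()-based loop. This is an illustrative sketch, not any particular framework's API; the `event_loop` function and its handler convention are invented for the example:

```python
import os
import select

def event_loop(handlers):
    """Run callbacks for whichever fd is ready; exit when every fd hits EOF.

    Each callback handles one small, discrete chunk of input and returns
    promptly, voluntarily giving up the processor; nothing ever blocks on
    a single piece of work.
    """
    while handlers:
        ready, _, _ = select.select(list(handlers), [], [])
        for fd in ready:
            data = os.read(fd, 4096)
            if data:
                handlers[fd](data)
            else:                 # EOF: the writer closed its end
                os.close(fd)
                del handlers[fd]

# Tiny demonstration: one pipe feeding one handler.
r, w = os.pipe()
received = []
os.write(w, b"event payload")
os.close(w)
event_loop({r: received.append})
```

Because each handler runs to completion before the loop looks for the next ready descriptor, no two chunks of work can ever interleave mid-update, which is exactly the "explicit agreement" property argued for above.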

Library design issues

The only time I think threads are even a little bit justified is when you're writing a library that needs to do significant processing but doesn't want to block the thing that called it while it gets the processing done. And even then I think there are better ways to do things.

Your library should either let the calling program handle the concurrency issue or provide explicit hooks to allow an event oriented framework to call into the library to accomplish the task in small, discrete chunks.

If the library takes the first option, the program could conceivably even fork (create a whole new process) to accomplish the task and provide the result to the original process on a pipe. If your task really takes that long, the extra overhead of writing the result onto a pipe should contribute to the overall time only very minimally.
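The fork-and-pipe option might look like the following sketch. The `run_in_child` helper and the use of a million-element sum as the "long task" are invented for illustration:

```python
import os
import pickle

def run_in_child(func, *args):
    """Fork a worker process, run func there, and hand the result back
    to the parent on a pipe.

    The pipe write is a tiny fixed cost; any task long enough to justify
    the fork in the first place dwarfs it.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: compute, write result, exit
        os.close(r)
        with os.fdopen(w, "wb") as out:
            pickle.dump(func(*args), out)
        os._exit(0)
    os.close(w)                       # parent: read the result back
    with os.fdopen(r, "rb") as inp:
        result = pickle.load(inp)
    os.waitpid(pid, 0)
    return result

total = run_in_child(sum, range(1_000_000))
```

A real parent would register the read end of the pipe with its event loop rather than reading synchronously, but the shape of the pattern is the same.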

If your library takes the second option, each chunk should take a fairly small amount of time so the calling program can be effectively interactive if it needs to be. This can be hard for some problems, so for those problems I suspect the fork and write the result back on a pipe option is likely better.
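The second option — exposing the task as small, discrete chunks — maps naturally onto a generator in Python. The checksum task here is an invented stand-in for whatever significant processing the library does:

```python
def checksum_in_chunks(data, chunk_size=4096):
    """A library task broken into small, discrete steps.

    Each step does a bounded amount of work and then yields control back
    to the caller, so an interactive event loop can interleave its own
    work between steps. The final yield carries the result.
    """
    total = 0
    for start in range(0, len(data), chunk_size):
        for byte in data[start:start + chunk_size]:
            total = (total + byte) % 65536
        yield None            # one chunk done; caller regains control
    yield total               # last step delivers the answer

task = checksum_in_chunks(bytes(range(256)) * 100)
result = None
for step in task:             # an event loop would call next(task) between events
    result = step
```

The calling program decides when each chunk runs, so responsiveness is entirely under its control — which is the point of this option.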

If the library is pausing to do IO, the only polite way to write it (IMNSHO) is to provide hooks to allow an event driven application to call back into your library so it can handle the IO. And your library should never block, but simply inform the calling application of which event will allow your library to continue.

It's even better if your library can provide a way for the calling application to do all the IO on its behalf, so that the application can virtualize the IO if need be. This future-proofs your library against a whole class of possible changes in the underlying system. Insisting that the calling application do this, though, is going to make your library really hard to use in the common case.
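A common shape for such a library is a parser that never touches a file descriptor itself: the application performs every read and simply feeds the bytes in. The `LineReader` class and its method names below are invented for illustration:

```python
class LineReader:
    """A library object that never does IO itself.

    The caller reads from whatever transport it likes (socket, pipe,
    test harness, FUSE-backed file...) and feeds the bytes in, so the
    application can fully virtualize the IO.
    """

    def __init__(self):
        self._buffer = b""

    def feed(self, data):
        """The caller performed the read; hand the bytes to the library."""
        self._buffer += data

    def lines(self):
        """Return the complete lines accumulated so far."""
        *complete, self._buffer = self._buffer.split(b"\n")
        return complete

reader = LineReader()
reader.feed(b"first line\nsec")       # bytes arrive in arbitrary chunks
reader.feed(b"ond line\npartial")
got = reader.lines()
```

Because the library holds only its own parsing state and declares nothing about where bytes come from, the application is free to change transports without the library ever knowing.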

There is really only a slight difference in degree between a 'shared memory' multiprocessor system and a LAN-based multiprocessor system

Long winded justification

First, I'm going to define a hardware spectrum....

To start, two terms... I realize that these categories have a somewhat fuzzy boundary.

Cooperating nodes

Nodes that may or may not share the same overall goal, and that are talking to each other so that each might achieve its own individual goal.

Cooperating nodes may be actively trying to subvert other nodes. They are cooperating only in the sense that they are taking the rules by which the other nodes communicate into account.

Each node has its own discrete state that makes sense in the context of its goal, but may not make much sense when considered in the context of the overall state of the system. It is quite questionable whether it makes any sense at all to consider the overall state of the system comprising all cooperating nodes as something coherent and describable.

Communal nodes

Nodes that inherently and always share the same goal.

Nodes in a cluster doing a weather simulation fall into this category. You can consider the state of the weather simulation to be fully spread out among all the nodes. And while each node may have its own state that is discrete, those states really don't make sense unless they are considered within the context of all the nodes making up the shared state of the system.

Now that those two terms are defined, here are some more fuzzy categories.

Internet

Cooperating nodes separated by arbitrary amounts of latency.

The fact that nodes are merely cooperating and not communal is of the utmost importance here. Nodes must successfully cooperate while still making sure they aren't compromising their own goals. In order to do this they have to subject all incoming communications to a great deal of scrutiny and have sophisticated systems for dealing with 'gaming' and/or fraud by other nodes. This means that any kind of implicit sharing of state is almost certainly a horrible idea.

This is also one of the reasons (though not so explicitly stated in the referenced paper) that I think RPC is a bad idea.

WAN

Communal nodes that are potentially separated by more than 2 msecs of latency. There may be useful distinctions to make for latencies above 2 msecs, but I definitely think that anything above about 20 msecs is essentially the same from a system design standpoint.

One notable cutoff here is the range for playing games that are heavily based on human reaction times. The minimum reaction time for humans is about 90ms. This means that for latency to not provide so much noise that it swamps reaction time, you want at most 15ms of latency. The maximum physically possible radius in which this level of latency can be achieved is about 1300 miles, and in current practice is probably about 60 miles.

At about 150ms, latency is so high that people notice it in almost every interaction. The radius for this is about 13500 miles, which is conveniently large enough to cover the entire Earth. But in current practice the radius is more like 3000 miles, which is only enough for a continent.
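The radius figures above follow from round-trip signalling at the speed of light. A rough check of the arithmetic (assuming propagation at the vacuum speed of light, which is an upper bound; the figures in the text are slightly more conservative, and real links through fiber and switches do considerably worse):

```python
SPEED_OF_LIGHT_MILES_PER_SEC = 186_282

def max_radius_miles(round_trip_latency_sec):
    """Largest radius at which a signal can make the round trip within
    the given latency budget, assuming vacuum-speed-of-light propagation.
    The signal must travel out and back, hence the division by two."""
    return SPEED_OF_LIGHT_MILES_PER_SEC * round_trip_latency_sec / 2

gaming_radius = max_radius_miles(0.015)       # 15 ms budget -> ~1400 miles
noticeable_radius = max_radius_miles(0.150)   # 150 ms budget -> ~14000 miles
```

The same arithmetic, with tighter latency budgets, produces the LAN, NUMA, and SMP radius ceilings given below.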

Of course, most systems in which human reaction times are a factor are also ones in which the nodes should be considered to be cooperating and not communal.

LAN

Communal nodes that are separated by a latency of between 100 and 200 µsecs, and never more than 2 msecs.

Due to hard restrictions on the propagation of information through space, LANs will never be larger than about 180 miles in radius. With current technology the maximum size of a LAN is probably around 2 miles.

NUMA

Communal nodes that are separated by fewer than 200 µsecs of latency, but still more than 500-1500 nsecs. NUMA architectures are also distinguished from SMP in that, though all of the state of the system is ostensibly free for any node to access, state 'owned' by a node is significantly cheaper for that node to access than it is for any other node. Latency is still an important part of the distinction, though, since current SMP systems do not actually conform to the SMP ideal of all nodes having equal access to the state of the system.

Some might say that I'm missing the fact that in NUMA architectures one node can treat the memory of other nodes as part of the same memory space. My response is that with proper OS support you can use virtual memory to make this the case in most situations. You can even do it on a LAN, though it's of dubious usefulness with such high latency. But even on a LAN, the latency of fetching swapped-out memory from a hard drive is much higher than the latency of fetching a memory block from another node.

On an interesting note, up until clock speeds went into the 100s of MHz most SMP architectures would be classified as NUMA by this definition.

The maximum physically possible size of a NUMA system is approximately a 17 mile radius. With current technology the maximum size is probably closer to 20 meters.

SMP system

Communal nodes separated by less than 1500 nsecs of latency, in which all state can be accessed equally by all nodes. This latter requirement isn't actually true of most current SMP systems because of the extensive use of caching.

The theoretical maximum size of an SMP system is about a 200 meter radius. With current technology the limit is about a 10 meter radius.

I think latency is the biggest difference here. The only reason you don't see NUMA architectures on a LAN is because the nodes on the LAN are too far apart in terms of latency to make algorithms implemented on top of LAN based NUMA even remotely efficient.

And latency is a hard difference that won't go away. It is based on a theoretical physical limitation of the universe that no data at all contradicts. It is unlike our current limitations on processor speed: the theoretically possible maximum processor speeds are still many orders of magnitude beyond current practice, while the theoretical minimum latencies are only 1-3 orders of magnitude below it. This means we can expect the disconnect between raw processor speed and latency to get worse over time rather than better.

The biggest point I had in making up that spectrum is that latency is the biggest distinction. But in each of those systems there is a core of state that is 'local' and has very little (L1 cache) or no (registers) latency associated with access. And in each of those systems latency can only be expected to become more of a factor as processor speeds increase. Even in an SMP system there are now instructions (memory barriers) that tell processors that supposedly shared memory state needs to be communicated to other processors in the system, and those instructions are considered very expensive to execute.

Therefore, after carefully plotting out these distinctions, I'm going to tell you that I think they are all essentially the same in terms of how you should program for them. The only distinction that matters is between cooperating and communal nodes.

Consequences and elucidation

So I think it's likely better now, and in the future, to design your program to have a number of threads of execution that each have their own local state. If they need to share any state, that sharing should be sparingly and explicitly communicated to other threads.

The reason is that all methods of implicit sharing introduce potentially unpredictable delays into your program, even in an SMP environment. Additionally, implicit sharing requires laboriously instrumenting your code with expensive instructions to declare when you need some sort of exclusive access to something.

And since your programs share so little anyway, it makes little sense for them to share their entire space of addressable memory with each other. It's better for them to share only well defined sections, or communicate through the OS using pipes or other forms of RPC.

I would really like Unix developers to add these system facilities to support this model...

First, a sort of reverse mmap that lets you turn a set of pages into a file descriptor that you can then hand to another process via the existing mechanism for passing file descriptors between processes. The receiving process can then mmap the memory itself. This allows processes to relatively easily create a shared memory channel to communicate on, bypassing the few extraneous copies between kernel memory and userspace inherent in using pipes.

The OS could also provide a communication channel (like a UNIX domain socket) where it promised that if you wrote whole memory pages, those pages would be marked as COW pages, and the receiving process would read them by basically having its page table updated, provided it read them into a page-aligned data area. If you wanted to make sure they were never copied, you could simply munmap them after you wrote them.
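The facility as described doesn't exist, but for related processes its effect can be approximated today with an anonymous shared mapping inherited across fork(). This is only a sketch of the idea — it doesn't cover handing pages to an unrelated process — and it relies on the Unix behavior that an anonymous shared mmap is visible to both sides of a fork:

```python
import mmap
import os

# An anonymous shared mapping: passing -1 as the fd gives
# MAP_SHARED | MAP_ANONYMOUS, so after fork() both processes see the
# same physical pages and data crosses the process boundary with no
# kernel-to-userspace copies.
channel = mmap.mmap(-1, 4096)

pid = os.fork()
if pid == 0:                     # child: write directly into the shared pages
    channel[:5] = b"hello"
    os._exit(0)

os.waitpid(pid, 0)               # parent: the child's write is visible here
message = bytes(channel[:5])
channel.close()
```

A real channel would of course layer some framing and synchronization on top of the raw pages; the point is only that the bytes are never copied through the kernel the way pipe data is.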

Secondly, fully supported asynchronous IO. This is sort of implemented, but seems to be generally an afterthought. Asynchronous IO has obviously existed, and been very important, for serial and network device IO. But it needs to work for disk IO as well, and be considered so important that nobody would even think of shipping a kernel in which it wasn't fully supported. It would also be really nice if disk IO could be implemented underneath by mapping pages directly and using COW when things were magically aligned.
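The asymmetry complained about here is easy to demonstrate: non-blocking IO works naturally for pipes and sockets, where a read that would block returns immediately instead of stalling the program, while regular-file IO is exactly where Unix has historically lacked an equivalent. A minimal sketch of the pipe side:

```python
import os

# Mark the read end of a pipe non-blocking: a read with nothing
# available raises BlockingIOError (EAGAIN) instead of stalling,
# which is what lets an event loop stay responsive.
r, w = os.pipe()
os.set_blocking(r, False)

try:
    os.read(r, 4096)             # nothing written yet
    outcome = "data"
except BlockingIOError:
    outcome = "would_block"      # an event loop would now wait for readability

os.write(w, b"ready")
data = os.read(r, 4096)          # now the read completes immediately
```

Try the same `os.set_blocking` trick on a descriptor for a regular file and reads still block on the disk — which is precisely the gap the paragraph above wants closed.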

The existence of these facilities would provide even further motivation for library developers to allow the IO of their libraries to be 'virtualized'. This is because then their library could take advantage of details of how the application could communicate with other processes that only the application developer would know because the application developer is aware of all the environmental details for that particular application.

Comments:

Threading can certainly be overused. For most applications I think event loops and/or coöperating processes are good enough. I spent quite a while working on a moderately large heavily-threaded program, and although threading gave us many benefits that we couldn't have gotten from a pure event-loop design, it also gave us many bugs. Threading tends to fail badly, too; you get intermittent crashes or deadlocks or corruption rather than a nice contained misbehavior.

I think we're seeing lately a reaction to some of the threading mania of the '90s ("every window has its own thread! no, every BUTTON has its own thread!"). But I think threading has its place; I just don't think we've fully learned how, or when, to use it yet.

Explicit I/O isn't the only thing that can block a thread. Implicit I/O is pretty common: a threaded program can make use of the time spent filling a page fault, but there's no reasonable way an event-loop program can. A page fault can take several seconds if it involves spinning up your laptop's HD. Add in things like lazily-fetched ORM objects and you can even do "explicit" I/O at arbitrary and unexpected places in your program. (And then there are things like mapping memory from a FUSE filesystem that's providing a view of data on a dynamic ad-hoc network— sorry, I read too many Plan9 papers at an impressionable age...)

Note that many of the problems with threading can still occur with multi-process programs. Some of them can still occur with event loops in a single process (deadlocks of a sort, for example).

(As for COW faults, I thought all modern OSs already used copy-on-write for disk IO and page-flipping for IPC when possible? Maybe I was spoiled by Mach/NeXTstep, which does all this stuff. I remember trying to find a bad block on a disk once by doing "tar cvf - /badfs > /dev/null", and failing because the OS was lazy enough not to ever read the pages off the disk since the null device never faulted them in...)

It is possible that OS X does this, though I suspect it doesn't. But no other Unix I know of does that. It's possible I'm just ignorant, but I don't think so.

Implicit I/O is usually what ends up killing my event-driven programs, and sometimes it's not really my fault, just some other program using a lot of memory and pushing mine off to swap. There, latency can really go through the roof, where one request that needs a page from the swap gets the whole thing stuck (well, at least one thread of it), and leaves other requests (that could possibly be fulfilled without delay) waiting. Sometimes, reading or writing to some mmap() that is actually far away (over the PCI bus, or further) is what keeps latency up and overall throughput down...

You are also correct when you bring up that multi-process (or single-threaded event-driven) programs can still have tricky problems, some of them specific to event-driven programs. The other thing is that once you add threads to get some concurrency and use multi-core systems, the door is opened, and it seems to me that the biggest step is going from one thread to more than one.

FreeBSD had the COW on non-blocking write(), but it was taken out, IIRC. If I remember right, changing the memory mappings involved flushing the TLB every time, even when the COW never kicked in (and it was an extra cost when it did), which was too much overhead. They were pretty quiet about it, understandably, but it makes sense when you consider that one of the reasons sendfile() exists, and is better than mmap()ping a file, is that the former does not cause TLB invalidations (and also doesn't eat address space).

I can't find a reference to the removal, unfortunately, so I might be mistaken, but I'm quite certain they removed it.

After looking that up, I'll have to agree with you. Though I greatly prefer code examples to mathematical formalisms. If I ever learn a language based on Pi calculus I may have to go over to that Wikipedia entry and put in a few example programs to demonstrate the formalisms.