...the OS has to allow the individual process itself to manage its own threads in a very lightweight way.

However, if the machine has multiple CPUs, then separate threads can run concurrently on separate CPUs. Any or all of those threads can become an association of fibers. The scheduling of the threads will be managed by the system, but within any given thread, the choice of which fiber of the set associated with that thread runs next is cooperative, under control of the program logic.
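To make the cooperative part concrete, here is a minimal sketch of that kind of hand-over inside a single thread. It uses Python generators as a stand-in for fibers rather than the Win32 fiber API itself, and the names `fiber` and `run_fibers` are made up for the illustration.

```python
from collections import deque

def fiber(name, steps, log):
    """A hypothetical 'fiber': does one step of work, then gives up control."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # cooperative switch point; nothing preempts us between yields

def run_fibers(fibers):
    """Round-robin over the fibers inside one thread until all have finished."""
    ready = deque(fibers)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # resume the fiber until its next yield
            ready.append(current)  # still alive: put it at the back of the queue
        except StopIteration:
            pass                   # fiber finished; drop it

log = []
run_fibers([fiber("a", 2, log), fiber("b", 3, log)])
print(log)  # ['a:0', 'b:0', 'a:1', 'b:1', 'b:2']
```

The system never steps in here: a fiber that never yields would starve the others, which is exactly the trade-off of leaving the switching to the program logic.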

Well, while they are an interesting part of the Win32 API that I had never seen before, fibers ultimately don't do anything towards the original article topic. My phrase on the process managing its own threads was probably not worded well.

I've seen Sun papers on their research of async CPUs and it looks mighty interesting. Essentially, they view the CPU as a farm of really tiny specialized units, which are not synchronized to a single clock signal. As I understood, a processing thread (either a separate process, or whatever) can lock, say, the ALU for 5ns to do a division, while two other threads do memory loads (each being, say, 2ns). Anywho, it's supposed to be different from Pentium's way of parallelizing instructions in that there is no central clock pulse that everything marches to. Interesting stuff...

Now, if the OS supports hundreds of thousands of threads and is run on a big soup of such asynchronous CPU tidbits, there is a possibility of micro-threading. At that stage, yes, the interpreter would do very well by being able to very quickly allocate and tear down these tiny parallel tasks automatically.

I agree. Fibers are an interesting and potentially useful architectural feature of Win32, but they do not address the OP topic directly.

There are two types of parallelism that need to be addressed.

Vector parallelism.

This kind of parallelism is the kind that is already fairly well handled by dedicated vector processors (as the big iron guys call them) and by GPUs and DSPs. It is the kind where a single, identical, often CISC, opcode is performed on a large number of datapoints in parallel.

This kind is quite easily dealt with in hardware. The depth of the parallelisation is basically controlled by how much silicon you are prepared to dedicate to the operation.

A generalisation of this would be to allow a whole sequence of opcodes (an entire subroutine or block of code) to be applied to a large number of datapoints simultaneously. Again, the datapoints would be loaded into "special memory", hardwired to operate each opcode on all of those registers in parallel.

The criteria for what constitutes a suitable block of code (i.e. reentrant code without side effects or dependencies beyond its parameters, and probably with a single result from each call) would be a software (compiler or interpreter) decision. It would also require good optimisers (whether compiler or interpreter) to make best use of such parallelism.
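As a rough software model of that generalisation, the sketch below applies one self-contained block to every datapoint. The thread pool merely stands in for the hardware lanes (in standard CPython it will not actually execute the arithmetic simultaneously), and `scale_and_offset`, `apply_to_all` and the `lanes` parameter are names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def scale_and_offset(x):
    """Hypothetical 'suitable block': reentrant, no side effects,
    depends only on its parameter and returns a single result."""
    y = x * 3
    return y + 1

def apply_to_all(block, datapoints, lanes=4):
    """Apply the same block to every datapoint; results keep the input order."""
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        return list(pool.map(block, datapoints))

results = apply_to_all(scale_and_offset, [0, 1, 2, 3, 4, 5, 6, 7])
print(results)  # [1, 4, 7, 10, 13, 16, 19, 22]
```

Because the block touches nothing beyond its argument, the calls can be made in any order, or all at once, without changing the results. That is the property the compiler or interpreter would have to establish before handing the block to the hardware.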

Sync point parallelism.

This is where different sections--usually sequential--of the source code can be run in parallel until a point is reached where they share a common dependency. Although analogous to the pipelining that many modern processors do, to get best benefit it needs to be done at a macro (source code or function point) level rather than the micro (a few sequential opcodes) level, as is done with pipelining.
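A small sketch of what I mean, at the function-point level: two sections that would normally run one after the other are started together, and execution only waits where their results are both needed. The functions and data are invented for the illustration, and a thread pool stands in for whatever the OS or hardware would really provide.

```python
from concurrent.futures import ThreadPoolExecutor

def load_prices():
    """Independent section 1 (hypothetical): needs nothing from section 2."""
    return [10, 20, 30]

def load_quantities():
    """Independent section 2 (hypothetical): needs nothing from section 1."""
    return [1, 2, 3]

def total(prices, quantities):
    """The common dependency: needs the results of both sections."""
    return sum(p * q for p, q in zip(prices, quantities))

with ThreadPoolExecutor(max_workers=2) as pool:
    prices_future = pool.submit(load_prices)
    quantities_future = pool.submit(load_quantities)
    # Sync point: .result() blocks until each section has finished.
    order_total = total(prices_future.result(), quantities_future.result())

print(order_total)  # 140
```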

I think (within the limitations of my sparse knowledge of silicon) that pipelining has gone just about as far as it can go. The economics of branch-point prediction, and the costs of flushing the pipeline when the BPP goes wrong, severely limit the effectiveness of pipelining beyond a certain, rather low, limit.

In order for best use of sync point parallelism to be made, OSs will need to radically alter the scheduling schemes they use. The current round-robin within priority groups, with starvation priority promotion, and processes (or threads as currently implemented) as atomic entities, will have to be replaced by a much finer-grained mechanism. And compilers and interpreters will have to get a lot cleverer to make good use of it.

In effect, the processor pool will be driven by a central queue of 'units of code' that need processing, overseen by a 'macro-processor'. Each individual process will constitute a stream of VHL opcodes. The macro-processor will enqueue these, interleaving the VHL opcodes from different processes according to priorities etc. The pool of processors will pull the next available unit of code off the central queue and execute it (without regard to what process it belongs to), and then go back and grab the next available. Reentrancy will be paramount, as will a capability-based security mechanism.
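The nearest thing I can offer to a diagram is a toy model: a priority queue of self-contained units fed from several 'processes', and a pool of workers that each grab whatever unit is next. Everything here (`run_pool`, the process names, the convention that a lower number means higher priority) is an assumption made for the sketch; it ignores security entirely, and real scheduling would be far more involved.

```python
import queue
import threading

def run_pool(units, workers=3):
    """units: list of (priority, process_id, callable); lower priority runs first.
    Returns (process_id, result) pairs in whatever order the units completed."""
    work = queue.PriorityQueue()
    for seq, (priority, pid, unit) in enumerate(units):
        # seq breaks priority ties, so the callables themselves are never compared
        work.put((priority, seq, pid, unit))

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                _, _, pid, unit = work.get_nowait()  # next unit, from any process
            except queue.Empty:
                return                               # queue drained: worker stops
            value = unit()                           # unit is self-contained
            with lock:
                results.append((pid, value))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

units = [
    (1, "proc_a", lambda: 2 + 2),
    (0, "proc_b", lambda: 3 * 3),
    (1, "proc_b", lambda: 10 // 3),
    (0, "proc_a", lambda: len("abc")),
]
results = run_pool(units)
print(sorted(results))  # [('proc_a', 3), ('proc_a', 4), ('proc_b', 3), ('proc_b', 9)]
```

The workers neither know nor care which process a unit came from, which is why each unit has to be reentrant and carry everything it needs with it.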

An interesting, and site-topical, side effect is that interpreted code will probably carry a much smaller penalty relative to compiled code, as the compilers will need to produce streams or groups of small, self-contained units of code.

Once code is compiled/interpreted into these self-contained units, it then becomes possible to transmit these units to external processors or pools of cooperating machines to achieve massively parallel operations across peer groups of LAN/net-connected machines.

It's a hard concept to describe in words, and my attempts at an ASCII art diagram left much to be desired. There are probably no web references I can give either as it is very much a prediction of where I personally think things will go, rather than a recounting of any individual piece I have read.

A sort of mental mish-mash, reading between the lines of everything I have read.