I've had a few pieces of mail asking me to clarify my scheme for
compiling a functional language to execute in parallel on a
shared-memory machine. I'll just give a quick overview.
As I said previously, if you're happy with combinator graphs and
reduction architectures, then fine. I personally feel that the
hardware technology isn't quite there yet, and, more importantly, I'm
not sure how much use reduction execution is for other languages or
for an entire operating system; I think it's too specialised. My
compilation scheme requires a fairly conventional control-flow
architecture with threads and shared memory, plus a simple semaphore
operation on memory locations.
The compiler is in two passes (at least :-)). The first pass has a
simple rule which says that all function applications should be forked
off to a new process, and the parent should wait for the result. On
its own, this is fairly simple-minded and fairly useless at this
stage. For example:

fun f x = (f1 x, f2 x, f3 x)

A call to `f' creates a new process context. That process forks
a child for `f1', waits for it to finish, forks a child for `f2', and so
on. Finally it builds a tuple record and returns it to its caller.
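As a sketch of what the first pass produces (this is my own Python
rendering with threads standing in for processes, and invented
stand-ins for `f1', `f2', `f3' -- not the actual intermediate code):

```python
import threading

def fork(fn, x):
    # Fork the application fn(x) off to a new process (here, a thread)
    # and hand back something the parent can wait on.
    box = {}
    t = threading.Thread(target=lambda: box.__setitem__('v', fn(x)))
    t.start()
    return t, box

def wait(handle):
    # The parent blocks until the child finishes, then takes the result.
    t, box = handle
    t.join()
    return box['v']

def f1(x): return x + 1      # arbitrary stand-ins for the real f1/f2/f3
def f2(x): return x * 2
def f3(x): return x - 3

# fun f x = (f1 x, f2 x, f3 x) after the first pass: fork a child for
# each application in turn, waiting for each before starting the next.
def f(x):
    a = wait(fork(f1, x))
    b = wait(fork(f2, x))
    c = wait(fork(f3, x))
    return (a, b, c)
```

So `f(3)' here gives `(4, 6, 0)', at the cost of a fork and a join per
application -- which is exactly why this pass alone is useless.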
The second pass is where things get interesting. It is worth noting
that the second pass is *nothing more* than a peephole optimiser on
the intermediate code. We are attempting to get a result back to the
parent while still doing useful work; this is where the parallelism
starts to come from.
Firstly, we can return the record to the parent *while it is
empty*. We create the record, return it to the parent, and then deal
with `f1', `f2', `f3'. The semaphore operations in the memory mean
that the parent will block if it tries to access a record field, but
it can do other things first. If `f1', `f2' and `f3' return records,
then `f' might get them back with empty fields; then, `f' can return
to its parent while the `fi's are still going, and so on. The net
result is that empty records get claimed from the heap, and structures
are filled in in parallel by a number of processes.
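Again as a sketch of my own (invented names, a `threading.Event' flag
standing in for the semaphore on each memory location): the fields
become guarded slots, so the record can be handed back empty.

```python
import threading

class Slot:
    # A record field guarded by the semaphore operation: a reader that
    # arrives before the writer blocks until the field is filled in.
    def __init__(self):
        self._full = threading.Event()
    def put(self, v):
        self._v = v
        self._full.set()
    def get(self):
        self._full.wait()
        return self._v

def f1(x): return x + 1      # stand-ins for the real f1/f2/f3
def f2(x): return x * 2
def f3(x): return x - 3

def f(x):
    # Build the record *empty* and return it at once; children fill
    # the fields in while the parent gets on with other work.
    record = (Slot(), Slot(), Slot())
    for slot, fi in zip(record, (f1, f2, f3)):
        threading.Thread(target=lambda s=slot, g=fi: s.put(g(x))).start()
    return record

r = f(3)          # returns immediately, fields possibly still empty
# r[0].get() blocks only if that field has not been filled in yet
```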
Next, there are the usual tail-recursion optimisations:

fun f x = g x

Execution of `f' can pass silently to `g'.
Finally, we attempt to bound the amount of concurrency. Creating a
new process for each function is very expensive, so we use tail
recursion optimisation techniques to knock a lot of these out. The
above example is an obvious one: use `f's process for `g', rather than
create a new one. In fact, the optimiser can go one better than this:

fun f x = cons(..., g x)

This is "structurally tail recursive"; `f's last action is to activate
`g', followed by some structure assignment. So, I have an optimisation
which makes this iterative. `f' builds the cons cell, passes it back
to the parent, and then jumps to `g'.
This has some interesting implications. Consider:

fun f 0 = nil
| f n = cons(n, f(n-1))

fun g(cons(x, y)) = cons(x+1, g y)

Then, `f 100' will start a process to generate a list of 100 numbers.
`g(f 100)' will start a process to generate 100 numbers, and another
process to consume the first list in parallel and generate a second
list.
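Here is my own Python rendering of that pipeline (a blocking
write-once cell stands in for the semaphored memory, and I've added an
explicit nil case to `g' so the sketch terminates -- none of these
names come from the thesis):

```python
import threading

class Slot:
    # Write-once cell; get() blocks until put() has happened.
    def __init__(self):
        self._full = threading.Event()
    def put(self, v):
        self._v = v
        self._full.set()
    def get(self):
        self._full.wait()
        return self._v

def cons(head, tail_slot):
    return (head, tail_slot)

def f(n):
    # fun f 0 = nil | f n = cons(n, f(n-1)), made "structurally tail
    # recursive": build each cons cell, hand it back, then loop for
    # the tail instead of recursing.
    out = Slot()
    def producer():
        slot, k = out, n
        while k > 0:
            cell = cons(k, Slot())
            slot.put(cell)
            slot, k = cell[1], k - 1
        slot.put(None)          # nil
    threading.Thread(target=producer).start()
    return out

def g(in_slot):
    # fun g(cons(x, y)) = cons(x+1, g y): a second process consuming
    # the first list while f is still producing it.
    out = Slot()
    def consumer():
        src, dst = in_slot, out
        while True:
            cell = src.get()    # blocks until f has filled this cell
            if cell is None:
                dst.put(None)
                return
            new = cons(cell[0] + 1, Slot())
            dst.put(new)
            src, dst = cell[1], new[1]
    threading.Thread(target=consumer).start()
    return out

def to_list(slot):
    # Walk the list, blocking on each cell as needed.
    xs, cell = [], slot.get()
    while cell is not None:
        xs.append(cell[0])
        cell = cell[1].get()
    return xs
```

With this, `to_list(g(f(5)))' gives `[6, 5, 4, 3, 2]', with the two
lists being produced and consumed concurrently.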
This generalises to `fun f x = (a, b, c, d, e, f, g x)' and so on.
That's about it. My thesis also dabbled rather unsuccessfully with
some polymorphic typechecking issues, and I integrated a concurrent
Prolog compiler into the functional language as well.
If you want a copy, I think it costs money (5 quid or something).
Either write to:
Dorothy McKie,
Department of Computer Science,
James Clerk Maxwell Building,
The King's Buildings,
University of Edinburgh EH9 3JZ
SCOTLAND

Or e-mail to: dam@uk.ac.ed.ecsvax, asking for ordering details. Sorry,
but I don't know much about administrative matters like this... I'll
pass on the enquiries I've had already.