Netmulticore works!

Ocaml and multicore programming
- by Gerd Stolpmann,
2011-04-12

Netmulticore is my attempt at solving the multicore puzzle for
Ocaml. It has reached now a development stage so that I can run test
programs, and I see real speedups. Although not everything is perfect
yet, the API has stabilized a bit, and it is close to being ready for
broader testing. Expect a test release in the next days - I hope to
finish it before the Ocaml Meeting at Friday, so you guys have
something to talk about. (Unfortunately, I cannot visit.)

The approach of Netmulticore is unusual, but AFAIK the only one that
works without modifying the Ocaml runtime and/or compiler.
Instead of using kernel threads and implicitly sharing memory,
Netmulticore forks full-fledged processes as thread replacements, and
allocates explicit shared memory that is accessible by all worker
processes. By doing this allocation directly at program startup, it is
ensured that all processes see the shared memory block at the same
address (which would not be the case if the processes mapped the block
individually). The shared block is now treated in a very special way
- shared memory has to cope with some difficulties that normally do
not exist. All in all, this creates a setup where each process has its
normal Ocaml heap (which is not shared with other processes) plus
access to the shared block. This is not a bad situation provided we
can make the access to the shared block as convenient as possible, and
this is what Netmulticore is really about. If we can achieve a nice
programming API for dealing with shared data, the multicore issue is
solved, and probably even in a more scalable way than in other
runtimes where all threads have to share a single heap, and constantly
step on each others' feet.

Self-contained shared heaps

So what we want to have is that we can store normal Ocaml values
into the shared block - including all regular data representations
such as tuples, variants, records. Also, we want to allow
modifications of values - immutable shared memory is not worth the
fun. Netmulticore solves both issues - it provides direct read access
to Ocaml values stored in the shared block, and it allows
modifications (but the programmer has to follow a special API).

The initially allocated shared block is broken down in smaller
units called shared heaps. The heaps do not have a fixed size
but can grow and shrink (just like the regular Ocaml heap). The heaps
are now the containers for the Ocaml values: When creating a heap, one
can copy an initial Ocaml value into it, and by following the special
rules for mutation it is possible to put more values into it (or
remove values). The shared heaps are structured similarly to the
normal Ocaml heap, and contain the value blocks densely packed one
after the other. Free areas are managed with a free list. When the
shared heap fills up, a specially implemented garbage collector tries
to find unreachable values and reclaims the space (using the
mark-and-sweep design). Shared heaps have a lock which synchronizes
accesses to it, and, unfortunately, limits the degree of
concurrency. Especially, only one process can write to a heap at a
time. The programmer can, if necessary, work around this limitation by
using several heaps - each heap has its own lock, so tricky
application designs can avoid lock contention.

This sounds like a nice idea, but there are some traps. In the next
two paragraphs I'll try to give an impression of the difficulties. The
essence is, after all, that managing shared heaps requires a
disciplined programmer, or the memory gets corrupted. This is the
downside of the Netmulticore approach - it is quite easy to crash the
program by not following one of the programming rules. (This is,
however, also true for "normal" multi-threaded programming.)

The first difficulty is that this design requires that the shared
heaps are self-contained. This means that no pointer must ever exist
that points from a shared heap to the normal process-local Ocaml heap,
or from a shared heap to a different shared heap. The first kind of
pointer would cause invalid memory accesses if a second process
dereferenced such a pointer. The second kind of pointer confuses the
garbage collector. What is still allowed, of course, are pointers from
process-local memory to the shared heaps. (The garbage collector built
into the Ocaml runtime fortunately does not follow such pointers.)
However, one should be careful: the garbage collector cleaning the
shared heap from time to time will not see that such "external"
pointers exist, and will not keep the referenced data alive. It is
left to the programmer to do something about it.

The second difficulty is how to actually do mutation in shared
heaps. If you have every written a memory manager, you'll probably
know the problem: Each allocation can cause a GC run, and this can
invalidate what you've just put into the heap but is not yet
considered as reachable by the GC. The solution is to keep a set of
further roots, i.e. pointers the GC must also consider although they
are not in the memory region the GC manages. I omit here the details -
the point is that mutation requires a special procedure so that such
additional roots can be managed. This is a bit like declaring
arguments of wrapper functions with the CAMLparam macros. A similar
convention exists for Netmulticore, only that it has to be done on the
Ocaml level.

Higher-level data structures

Fortunately, the programmer does not need to remember all this
low-level stuff most of the time, because there are a number of
ready-to-use data containers. These are already developed on top of
the raw shared heap structure, and are a lot safer to use:

Netmcore_array: Keeps data in an array, and provides synchronization
for accessing array elements

Netmcore_matrix: a two-dimensional array

Netmcore_buffer: A shared string buffer where one can add strings
at the end and remove data from the beginning

Netmcore_queue: A shared queue very much like the Queue module of
the standard library, but again with additional synchronization

Netmcore_hashtbl: A shared hash table very much like the Hashtbl module of
the standard library, but again with additional synchronization

Netmcore_ref: A single shared variable (like a shared "ref" variable)

In some sense, these modules are "ports" of the corresponding data
structures that are provided by the standard library. During porting I
had especially to change the way the data is mutated so the mentioned
programming rules are followed. (This is the main reason why these
ports exist - you cannot put a normal Hashtbl into shared memory,
because the normal mutation breaks the rules for shared heaps.
There is no such problem when read-only data structures are copied
to shared heaps.)

Synchronization primitives

The mentioned data types have - to some degree - built-in protection
against uncontrolled parallel access. Sometimes, however, it is useful
to have additional ways of managing synchronization:

Netmcore_sem: Semaphores

Netmcore_mutex: Mutexes (including normal and recursive ones)

Netmcore_condition: Condition variables

The condition variables have a bit unpleasent API, because it is the
task of the caller to allocate a special block of memory for each
process that can be suspended. In system-level implementations of
condition variables this block can be hidden from the user (it can be
put into the thread control block). As we don't have access to
something like a process-local but nevertheless shared place the only
solution I'm seeing right now is to delegate this obligation to the
caller. But anyway, the important message is that Netmulticore provides
condition variables, and that it is thus easy to signal (or "broadcast")
suspended processes.

Message passing

Netmulticore also integrates nicely with Camlbox, the message passing
API that exists since Ocamlnet-3. Camlboxes allow it to send Ocaml
values from a number of sender processes to a single receiver process.
The implementation of Camlboxes also uses shared memory (and not
sockets), and is very fast.

The code, please!

What I've described in this article already works so far. Netmulticore
is distributed as part of Ocamlnet, and is right now only available
in the svn repository:

The skeptical reader is very encouraged to look into the examples,
just to see how nice the code using Netmulticore looks. It is the first
time you can use shared memory from Ocaml without having to deal with
data marshalling issues (because we don't use this technique). Actually,
the code looks very much like multi-threaded programming, only that
here and there different primitives need to be used.

What next?

It is very important to create sample programs that use a library like
Netmulticore, because problems only show up in practice, and are hard
to predict. There are already three non-trivial examples, and I've
plans to write a few more. Expect also blog articles about this,
and how the performance is (and the numbers I got so far are promising).

The Netmulticore implementation has reached some "beta quality", at
least. We need a few improvements here and there, but generally the
code exists and works.

It is generally a good idea to watch out how to make Netmulticore
programming safer. As pointed out before, it is right now easy to
crash the program (with a segfault) when missing one of the special
programming rules. The OCaml compiler could perhaps help here and
prevent some mistakes. What I've especially in mind here is a typing
annotation whether a value is in a shared heap. This annotation would
normally be invisible to the user (like the polarity annotation), but
jump in at compile time when a reference to a non-shared value is
stored into a shared value. However, this annotation would probably
require a modification of the compiler.

If anybody is interested in testing Netmulticore, please write me.
Optimizing programs for multicore is tricky, and getting more
experience here would allow us to make a good step forward.