Multicore OCaml
===============
Stephen Dolan, 22nd January '14
After much scholarly discussion (i.e. heated arguments), Leo White and
I have settled on the following as a more-or-less sane design for a
multicore version of OCaml. I suspect many people have strong opinions
about this sort of thing; feel free to direct them our way.
Most of this document is about garbage collection, which accounts for
almost all of the work needed to build a multicore OCaml. GC in OCaml
is interesting because of pervasive immutability: many objects are
immutable, allowing us to play various GC games like copying objects
without caring which copy future accesses touch. On the other hand,
OCaml programs are expected to sustain a truly obscene rate of
allocation for immutable objects.
Operations on immutable objects are very fast in OCaml: allocation is
by bumping a pointer, initialising writes (the only ones) are done
with no barriers, and reads require no barriers. Most of the following
design is based around keeping these operations as fast as they are at
the moment, with some compromises for mutable objects.
Xavier Leroy and Damian Doligez wrote a paper about efficient
multicore GC for OCaml [1] (well, for Concurrent Caml Light), based
around having many thread-private heaps and a single shared heap, and
maintaining the invariant that there are no pointers from the shared
to the private heaps.
This means that whenever an attempt is made to write a pointer to a
private object into the shared heap, the private object and all
objects reachable from it are promoted to the shared heap en
masse. The problem with this is that this eagerly promotes many
objects that were never really shared: just because an object is
pointed to by a shared object does not mean another thread is actually
going to attempt to access it.
We'd like to go for something similar but lazier along the lines of
the multicore Haskell work [2], where objects are promoted to the
shared heap whenever another thread actually tries to access
them. This means our sharing operation is slower, since it requires
synchronisation of two different threads, but we need to do a lot less
of it. It also avoids some pathological cases since storing a large
temporary data structure in a shared reference won't generally require
it to be copied in its entirety.
[1]: Damien Doligez and Xavier Leroy. "A concurrent, generational
garbage collector for a multithreaded implementation of ML."
POPL 1993.
[2]: Simon Marlow, Simon Peyton Jones, and Satnam Singh. "Runtime
support for multicore Haskell." ICFP 2009.
Threads
-------
The first problem is getting multiple threads of OCaml to play nicely
in the same memory space. The main issue here is that the runtime is
full of global mutable state. Some of it is actual runtime-wide state
(e.g. the location of the end of the minor heap), and some of it is
gratuitous (e.g. the regex engine stores state in globals for called
functions to read, instead of passing it as arguments).
There's a project ongoing at OCamlPro to get rid of global state by
packing it all into a context structure that gets passed around. This
is quite an invasive patch (essentially adding a new argument to
nearly every function, and changing every access to global state), but
allows you to do things like having two independent OCamls in the same
process (a caravan of camls?).
I don't particularly care about having two independent OCamls in one
process, so I've gone for a simpler approach: marking global state as
thread-local. This is a vastly smaller patch, and lets multiple
threads use OCaml independently.
This is implemented. It amounts to some grunt work annotating
variables, and some scripts to verify that the .text section of the
runtime's object files is empty (so we can be certain we've
eliminated all global state). There are a couple of patches from this
work to eliminate some of the more gratuitous uses of state which
might be worth sending upstream independently of any other multicore
work.
Message-passing
---------------
The next part is a means for the threads to interrupt each other, to
synchronise GC operations. Each thread has a multiple-writer,
single-reader message queue, which it frequently polls for new
messages. The polling mechanism re-uses the existing mechanism to
check for unix signals, and is implemented for the bytecode
interpreter but not yet for native code.
In native code, I intend to have threads sending messages reset the
minor heap pointer so that the receiving thread is interrupted at next
allocation (again, this is the same mechanism currently used for unix
signals).
It's important that a thread respond to a message in a bounded period
of time, and that messages are only processed at GC safepoints. This
might require adding new safepoints during function calls or loops,
although OCaml programs generally allocate so frequently that we might
get away without this.
During long-running C calls, a thread will not be able to handle
messages. In this case, we will notify the other threads of this
by releasing a relatively heavyweight lock, which shouldn't add much
to the runtime of a slow call.
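A minimal sketch of the message-passing machinery described above,
assuming a mutex-protected multiple-writer, single-reader queue with a
cheap "pending" flag that the owning thread polls at safepoints (the
names and data layout here are illustrative, not the actual runtime's):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct message { struct message *next; int kind; void *payload; };

struct msg_queue {
    pthread_mutex_t lock;   /* serialises the multiple writers */
    struct message *head;
    atomic_int pending;     /* cheap flag polled by the single reader */
};

/* Writer side: push a message and raise the flag so the owning
   thread notices at its next safepoint. */
void msg_send(struct msg_queue *q, struct message *m) {
    pthread_mutex_lock(&q->lock);
    m->next = q->head;
    q->head = m;
    pthread_mutex_unlock(&q->lock);
    atomic_store(&q->pending, 1);
}

/* Reader side, called at safepoints by the queue's owner: returns the
   pending messages (most recent first) or NULL if none arrived. */
struct message *msg_poll(struct msg_queue *q) {
    if (!atomic_load_explicit(&q->pending, memory_order_relaxed))
        return NULL;                 /* fast path: nothing to do */
    pthread_mutex_lock(&q->lock);
    struct message *msgs = q->head;
    q->head = NULL;
    atomic_store(&q->pending, 0);
    pthread_mutex_unlock(&q->lock);
    return msgs;
}
```

In native code the "flag" would instead be the reset minor-heap
pointer, so that the poll costs nothing beyond the allocation check the
thread already performs.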
Shared and private heaps
------------------------
The shared heap is a large memory area available to all threads, with
a non-moving concurrent GC. Immutable objects will be allocated in the
private heaps (see below) and promoted to the shared heap if they
survive long enough or are accessed by multiple threads. For mutable
objects, there are a couple of options, discussed in a section below.
The GC will be a mostly-concurrent mark-and-sweep. Each thread
maintains its own grey stack and periodically does some marking
work. Marking is by changing the GC bits in the object header, which
is done without any synchronisation between threads. This means there
is a race condition whereby multiple threads will attempt to mark the
same object, but since marking is idempotent this won't cause any
problems.
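The racy-but-benign marking can be sketched as follows, under a
hypothetical header layout with a two-bit colour field (the real bit
positions would be whatever the runtime's headers use). Two threads may
both see an object as unmarked and both store the marked colour, but
the stores write the same value, so the race is harmless:

```c
#include <stdint.h>

typedef uintptr_t header_t;

#define COLOR_SHIFT 8                               /* illustrative */
#define COLOR_MASK  ((header_t)3 << COLOR_SHIFT)
#define COLOR_GREY  ((header_t)1 << COLOR_SHIFT)
#define COLOR_BLACK ((header_t)2 << COLOR_SHIFT)

/* Returns 1 if this thread (possibly among others, racily) marked the
   object, in which case its fields should be pushed onto this
   thread's grey stack; 0 if it was already grey or black. */
int try_mark(header_t *hd) {
    header_t h = *hd;
    if ((h & COLOR_MASK) != 0)
        return 0;               /* already marked */
    *hd = h | COLOR_GREY;       /* plain store: no CAS, idempotent */
    return 1;
}
```

The worst the race costs is the same object being pushed onto two grey
stacks and scanned twice, which affects performance, not correctness.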
When a thread finishes marking, it will trigger a stop-the-world phase
by interrupting every other thread via their message queues. During
the stop-the-world phase, each thread will scan its roots and mark
any unmarked objects they find. Generally, this does not involve much
marking: the stop-the-world phase exists so that all threads can
verify that marking has completed successfully, so should be short.
After the stop-the-world phase completes, any unmarked objects are
garbage and available to be swept.
This part of the memory system is fairly pedestrian and in no way
novel. I quite like VCGC [3] as a design for the mark-and-sweep
section, since it avoids having explicit phase transitions between
"marking" and "sweeping" which are a traditional source of bugs, and I
intend to use an allocator design based on size-segmented thread-local
pages along the lines of Streamflow [4], since this seems to have very
good multicore behaviour and decent fragmentation performance.
Initially, it won't support compaction. Stop-the-world compaction can
be implemented without too much pain later on, but I consider
incremental compaction to be outside the scope of this project (and
besides, the OCaml community seems to be OK without it at the
moment).
Each thread also has a private heap, which is used to allocate
immutable objects, and possibly mutable objects depending on how we
choose to handle those (see below). Private heaps can be collected
without synchronisation, and hopefully without modifying the existing
minor collection code (much).
One part of minor collection is updating intergenerational pointers
(that is, pointers which used to point to private objects which have
been moved). In the current single-threaded GC this can happen at any
point and be correct, but in a multithreaded GC we need to ensure this
happens after all of the objects are promoted and initialised, to
prevent any other threads seeing a half-formed shared object.
Objects that survive long enough are promoted to the shared
heap. Thus, dead cycles of shared and non-shared objects will be
collected eventually, when they become cycles of purely shared
objects.
[3]: Lorenz Huelsbergen and Phil Winterbottom. "Very concurrent
mark-&-sweep garbage collection without fine-grain
synchronization." ISMM 1998.
[4]: Scott Schneider, Christos D. Antonopoulos, and Dimitrios
S. Nikolopoulos. "Scalable, locality-conscious multithreaded memory
allocation." ISMM 2006.
Reading and writing
-------------------
The following invariants are maintained:
* No thread ever follows a pointer to another thread's private heap
* Immutable fields of shared objects point to shared objects
So, the private heaps may point to themselves and to the shared heap,
the values on a thread's stack may point to that thread's private heap
and to the shared heap, and shared heap pointers may point anywhere.
The second invariant is easily maintained, since objects are promoted
all at once: an immutable object only points to older objects, and
an immutable object becomes shared only after all of the objects older
than it have become shared.
This means that reading from and matching on immutable objects will
require no extra overhead in order to maintain the first invariant,
since an immutable object that one thread can access can never point
to another thread's private heap.
Mutable fields are more complex, and require both a write barrier and
a read barrier. The write barrier (caml_modify) is just a standard
write barrier maintaining the tricolor invariant for the shared GC,
and maintaining a per-thread remembered set of pointers from the
shared to the private heap.
The read barrier examines the value being read, and decides whether it
can be used directly as follows:
* If it is an integer, it may be used directly
* If it is a pointer to the shared heap, it may be used directly
* Otherwise, it is a pointer to a private heap:
* If this thread's, it may be used directly
* Otherwise, it must be requested from the thread owning the
private heap
This seems quite complex, but has a surprisingly efficient
implementation (described at the end of this document).
The last case, of a pointer to another thread's private heap, is the
only one that requires any processing. In this case, which we refer to
as a "read fault", a message is sent to the other thread
requesting the object. Upon receipt of this message, the other thread
copies the requested object to the shared heap, including its
transitive closure, and replies with the address of the object in the
shared heap. The requesting thread blocks until this reply arrives.
This is expensive, but should be rare. It happens only when multiple
threads are contending for the same mutable reference, which involves
expensive inter-core cache line transfers in any case.
Code written in a message-passing style may hit this case quite
frequently if immutable, newly-created messages are being passed to
code running on a different thread. However, this can be solved at
user-level in the implementation of the message-passing library: if
messages are being passed to a queue which has recently been accessed
by a different thread, they can be pre-emptively promoted to the shared
heap so the fault is avoided.
Read faults
-----------
When a thread tries to access an object in the private heap of another
thread, a read fault occurs. In this case, the thread performing the
access sends a message to the thread that owns the private heap in
question, asking it to perform the read instead.
Upon receipt of such a request, a thread performs the read, copies the
resulting object to the shared heap, and replies with the new shared
copy.
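The copy-with-transitive-closure step can be sketched with a toy block
representation (a made-up layout: real blocks would carry the runtime's
headers, and the forwarding pointer would be left behind for the next
minor collection to fix up). The key points are that integers are
tagged and copied as-is, and that forwarding pointers preserve sharing:

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy block: a forwarding pointer plus word-sized fields.  Tagged
   integers have the low bit set; other values point to blocks. */
struct block {
    struct block *forward;   /* NULL until promoted */
    size_t nfields;
    uintptr_t field[];
};

static int is_pointer(uintptr_t v) { return v != 0 && (v & 1) == 0; }

/* Copy b and everything it reaches into freshly allocated "shared"
   storage.  Installing the forwarding pointer before scanning the
   fields makes shared structure (and, for mutable objects, cycles)
   come out right. */
struct block *promote(struct block *b) {
    if (b->forward)
        return b->forward;                  /* already promoted */
    struct block *s = malloc(sizeof *s + b->nfields * sizeof(uintptr_t));
    s->forward = NULL;
    s->nfields = b->nfields;
    b->forward = s;
    for (size_t i = 0; i < b->nfields; i++) {
        uintptr_t v = b->field[i];
        s->field[i] = is_pointer(v)
            ? (uintptr_t)promote((struct block *)v)
            : v;
    }
    return s;
}
```

A production version would also have to publish the copies with the
right memory ordering, so that no thread sees a half-initialised
shared object.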
There is a race condition where a field can be modified to point to
another thread's private heap while the read fault message is
in flight. This means that the thread servicing the fault will itself
fault, causing another message to be sent. This actually causes no
problems: the faulting is guaranteed to terminate, since the race
condition requires that another thread overwrite the field during the
fault processing. In the worst case, every thread on the system gets
dragged into servicing the fault, and there are no threads left to
overwrite the field and cause the race.
Mutable objects
---------------
We have a number of options for dealing with mutable objects. The
differences between the options are slight changes to allocation
behaviour and barrier code, so it's not difficult to implement several
of these options and compare them for performance.
The issue is that read faults cause copies of private objects to be
made in the shared heap. For objects with only immutable fields, this
is semantically fine, and programs can use the copies
interchangeably. For mutable fields, however, it is important that
only the shared copy is used, otherwise programs might see stale
copies of already-modified objects. We can deal with this in several
ways:
* Allocate all mutable objects directly in the shared heap, so this
problem doesn't arise.
* Have an extra check in the read barrier for mutable fields to see
if there's a more recent shared copy of this object.
* Scan the private heap after every read fault to update pointers to
mutable objects.
The first option would mean losing generational collection for mutable
objects, although a non-moving generational scheme [5] for the shared
heap would fit reasonably well. The third option may turn out to be
the most efficient, if the fault rate is low enough.
[5]: Katsuhiro Ueno, Atsushi Ohori, and Toshiaki Otomo. "An efficient
non-moving garbage collector for functional languages." ICFP 2011.
Read barrier implementation
---------------------------
To make code that heavily uses mutable objects run acceptably fast, it
is important that the read barrier have low overhead in the
uncontended (common) case. The condition we are checking for is quite
complex, but happily can be implemented with a single branch, no
memory references, and some bit-twiddling.
We arrange the memory space of the OCaml process so that all of the
minor heaps are arranged within a single, large, power-of-two aligned
memory region, and each individual heap is a power-of-two aligned
region within that. This can be done quite portably on Posix platforms
using mmap(), or on Windows using VirtualAlloc().
Using hypothetical 16-bit addresses, we assume for the sake of an
example that the private heaps are all allocated within 0x4200 -
0x42ff, with the individual thread's private heaps at:
* thread 1: 0x4220 - 0x422f
* thread 2: 0x4250 - 0x425f
* thread 3: 0x42a0 - 0x42af
* thread 4: 0x42b0 - 0x42bf
It's not necessary that all of the memory in the region 0x4200 -
0x42ff be allocated or even mapped, as long as no part of the shared
heap lies within that region. Again, this can be done with some mildly
fiddly but not especially difficult mmap() trickery.
So, if we have a pointer p, then it points to a private heap if and
only if its high bits are 0x42, and the bits in the middle indicate
which private heap. So, we can check whether it lies in the current
thread's private heap by comparing its bits with a pointer to
somewhere in the current thread's private heap.
We have such a pointer available: the current minor heap pointer is
stored in a register in native code. We XOR this value with the
pointer p, resulting in a value whose bits we write as 0xPQRS. Our
various conditions for the read barrier can be read off from these
bits: if the low bit of S is 1, then it's an int. Otherwise, if PQ is
nonzero, then it's a pointer to the shared heap. Finally, if PQ is
zero and R is zero, then it's a pointer to the current thread's
private heap, while if PQ is zero and R is nonzero it's a pointer to
another thread's private heap.
Checking whether PQ or the low bit of S are nonzero gets us the first
two conditions. Otherwise, if PQ is zero we need to check whether R is
zero. Happily, we can combine these conditions with more
bitgrinding. We subtract 0x10 from 0xPQRS. If R is zero, this carries
upwards, and causes PQ to become nonzero (it will never cause PQ to
become zero, since we ensure that the first page of memory is not used
for the shared heap). So, all of our conditions can be checked
simultaneously by checking whether (0xPQRS - 0x10) has any of its high
bits or its lowest bit set. There's an instruction to test bitmasks on
most architectures (x86 calls this "test", PowerPC calls it "and."),
bringing us to a total of three instructions: xor, sub and test.
In pseudo-amd64 code, assuming the value to be tested is in %rax and
the minor heap pointer is in %r15, we get:
    xor  %r15, %rax
    sub  $0x10, %rax
    test $0xff01, %rax
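For concreteness, the same check rendered in C over the hypothetical
16-bit example addresses above (a sketch for exposition, not the
actual runtime code; a real heap would use full-width pointers and
masks derived from the chosen region sizes):

```c
#include <stdint.h>

/* Returns nonzero if v can be used directly: it is a tagged integer,
   a shared-heap pointer, or a pointer into our own private heap.
   Returns zero exactly when v points into another thread's private
   heap, i.e. when a read fault must be taken. */
int fast_path(uint16_t v, uint16_t young_ptr) {
    uint16_t x = v ^ young_ptr;     /* xor  %r15, %rax  */
    x -= 0x10;                      /* sub  $0x10, %rax */
    return (x & 0xff01) != 0;       /* test $0xff01, %rax */
}
```

Note that the branch is also safe to take spuriously: sending a thread
into the slow path when the fast path would have sufficed costs time
but not correctness, so rare borrow-related edge cases can simply be
re-checked in the slow path.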