1 Introdution

All of GHC's concurrency primitives are written in C code and is baked in as a part of the RTS. This precludes extensibility as well as making it difficult to maintain. Ideally, the concurrency libraries will be implemented completely in Haskell code, over a small subset of primitive operations provided by the RTS. This will provide a Haskell programmer the ability to build custom schedulers and concurrency libraries. For an earlier attempt at this problem, please look at Peng Li's paper [1].

2 Substrate primitives

Substrate primitives are the primitives exposed by the RTS, on top of which user-level concurreny libraries are built.

3 Concurrency libraries

In order to support the construction of extensible user-level schedulers in GHC, special care has to be taken about blocking concurrency actions. When there is no default scheduler, the user-level scheduler must be made aware of the blocking action, and more interestingly, the blocking action of the user-level scheduler.

3.1 Motivation

The interaction between user-level schedulers and blocking actions is motivated through actions on MVars.The semantics of takeMVar is to block the calling thread if the MVar is empty, and eventually unblock and return the value when the MVar becomes full. Internally, when a thread blocks on an MVar, it switches
to the next runnable thread. This assumes that the takeMVar has knowledge about the scheduler. In particular, the current implementation of takeMVar knows how to perform the following:

Block action: blocking the current thread on a condition, and switching to another runnable thread.

Unblock action: placing the unblocked thread back into the scheduler data structure.

3.2 Proposal

The new, scheduler agnostic version of takeMVar (say takeMVarPrim), will have the type:

takeMVarPrim :: PTM ()-> PTM ()-> MVarPrim a ->IO a

where the first and second arguments are the block and unblock actions. If the blocking and unblocking actions are known, takeMVar with its usual type can be obtained simply by partial application:

Since the MVar implementation is independent of the schedulers, even threads from different schedulers can perform operations on the same MVar. The knowledge of schedulers is completely embedded in the block and unblock actions. Block and unblock actions can be obtained from up-call handlers.

4 Interaction with RTS

In the current GHC implementation, runtime manages the threading system entirely. By moving the threading system to the user-level, several subtle interactions between the threads and the RTS have to be handled differently. This section goes into details of such interactions, lists the issues and potential solutions.

4.1 Up-call handlers

In order to support interaction between the scheduler and RTS, every Haskell thread must have the following up-call handlers:

switchToNext :: SwitchStatus ->IO()
unblockThread :: SCont ->IO()

switchToNext

implements code necessary to switch to the next thread from the calling thread's scheduler, and suspends the calling thread with the given status.

explicitly takes an SCont as an argument. This might seem strange at first. But this signature allows helper threads created by the RTS to inherit a scheduler. This mechanism will be explained later.

The up-call handlers are stored in the StgTSO thread structure so that the RTS may find it. They are traced during a GC as a part of tracing the thread. It is the responsibility of schedulers to install the up-call handlers during thread creating. Currently, up-call handlers are installed using the following primitives exposed by the substrate:

where the given SCont is the target thread. Ideally, this needs to be a part of newSCont primitive.

4.2 Interaction with GC

In the vanilla GHC implementation, each capability maintains a list of runnable Haskell threads. Each generation in the GC also maintains a list of threads belonging to that generation. At the end of generational collection, threads that survive are promoted to the next generation. Whenever a new thread is created, it is added to generation0's thread list. During a GC, threads are classified into three categories:

Runnable threads: Threads that are on the runnable queues. These are considered to be GC roots.

Reachable threads: Threads that are reachable from runnable threads. These threads might be blocked on MVars, STM actions, etc., complete or killed.

At the end of a GC, all unreachable threads that are blocked are prepared with

BlockedIndefinitely

exception and added to their capability's run queue. Note that complete and killed reachable threads survive a collection along with runnable threads, since asynchronous exceptions can still be invoked on them.

In the lightweight concurrency implementation, each capability has just a single runnable thread. Each generation still maintains a list of threads belonging to that generation. During a GC, threads are classified into reachable and unreachable. RTS knows whether a thread is blocked or complete since this is made explicit in the switch primitive.

4.2.1 Problem 1 - Indefinitely blocked threads

In the LWC implementation, since there is no notion of a runnable queue of threads for a capability, how do we raise

BlockedIndefinitely

exception?

We need to distinguish between blocked on an unreachable concurrent data structure and an unreachable scheduler. The programmer makes this distinction explicit through the thread status argument as a part of context switch primitives.

4.2.1.1 Blocked on an unreachable concurrent data structure

If the MVar is unreachable, the scheduler might still be reachable, and some runnable thread is potentially waiting pull work off this scheduler. Thread blocked on an unreachable MVar will be blocked with thread status

BlockedOnConcDS

. In this case, we can prepare the blocked thread for raising the asynchronous exception as we do in the vanilla implementation. Subsequently, RTS need only to evaluate the blocked thread's unblock action, which will enqueue the blocked thread on its scheduler. But on which thread do we execute the unblock action? In the LWC implementation, each capability has only ever one thread in its run queue.

The solution proposed here is similar to finalizer invocations. We create an array of IO () actions with the following structure:

exception. unblock_currentThread and switchToNext_currentThread correspond to the

unblockThread

and

switchToNext

upcalls of the (only) thread currently on this capability. Next, we create a helper thread with the following closure applied to the array constructed previously.

rtsSchedulerBatchIO :: Array (IO())->IO()

When given an array of IO () actions,

rtsSchedulerBatchIO

performs each IO action it one-by-one. The net effect of executing the new thread is to add the resurrected threads to their corresponding schedulers and waking up the original thread that was running on this capability.

The newly created thread inherits the scheduler of the thread that was running on the scheduler. This is done by copying the upcall handlers. This is necessary since the newly created helper thread might also get blocked due to PTM actions, blackholes, etc,.

4.2.1.2 Blocked on a unreachable scheduler

This case is a bit tricky. If a thread is blocked on an unreachable scheduler, we need to find a scheduler for this thread to execute. But which scheduler? RTS does not know about any other user-level schedulers.

We might fall back to the vanilla GHC's solution here, which is to prepare the blocked thread for asynchronous exception and add it to the current capability's queue of threads blocked on scheduler. At the end of GC, RTS first raises BlockedIndefinitelyOnScheduler exception on all the threads blocked on scheduler, and finally switches to the actual computation (current thread). This solution is not ideal since we do not eliminate schedulers completely from RTS.

4.2.2 Problem 2 - Detecting deadlock

In the vanilla implementation, whenever RTS finds there are no runnable threads, a GC is forced, that might potentially release the deadlock. This will happen since any indefinitely blocked threads will be woken up with asynchronous exceptions. In the LWC implementation, how would the runtime distinguish between a scheduler that might actively be spinning, looking for more work and a thread execution? There might be multiple schedulers running on a capability, and no one scheduler might know that all schedulers on the capability are waiting for work. It might not be a good idea to trigger a GC whenever a scheduler runs out of work.

Proposal 1

Every capability keeps a count of SConts spawned as schedulers, and empty schedulers. When these counts become equal, a GC is triggered. If no threads are unblocked by this GC, then we are really deadlocked. There is a possibility of false positives with this scheme since a scheduler might be slow to let the RTS know that it has in fact found work. How do we deal with such momentary false positives?

Proposal 2

Treat the first Haskell thread (proto_thread) created on any capability as a special thread whose only job is to create threads to execute work. It enqueues itself on the scheduler it creates. But none of the threads on the scheduler will switch to this proto_thread, unless there is nothing else to switch to. The proto_thread, when resumed, will force a GC. However, this solution assumes there is a single scheduler data structure at the lowest level per capability.

A capability might really not be deadlocked, since work might be generated by other cores. For example, MVar synchronization might release threads that will be enqueued to schedulers on this capability. Should performing GC on a deadlock be a global property of all capabilities?

4.3 Bound threads

Bound threads [2] are bound to operating system threads (tasks), and only the task to which the haskell thread is bound to can run it. In the vanilla implementation, bound threads are created using

forkOS

primitive, which creates a new task to run this bound thread.

A bound thread and its bound task have the invariant that if the bound thread is running, it is running on its bound task, and when the bound thread is blocked, its bound is suspended. When a bound thread is resumed, RTS checks if the current task is its bound task. If not, current task is suspended, and the current capability is passed to the bound task, which resumes the bound thread.
Thus, bound threads are handled transparently from the programmer's point of view, and the programmer never sees the tasks.

LWC implementation

We would like to have the same interface in the LWC implementation. Bound threads can be created with the new substrate primitive:

newBoundSCont ::IO()->IO SCont

which has the same type signature as

newSCont

. But unlike

newSCont

, a new OS thread (a task), is created in a suspended state and bound to the new thread. During a user-level context switch, the target thread is assigned as the current capability's thread and the control is returned to RTS scheduler loop. The RTS scheduler loop, as it does for the vanilla implementation, takes care of passing capabilities between tasks, creating worker tasks, etc., if the control switches to or from a bound thread.

4.4 Safe foreign calls

A safe foreign calls does not impede the execution of other Haskell threads on the same scheduler, if the foreign call blocks, unlike unsafe foreign calls. A safe foreign call is typically more expensive than its unsafe counterpart since it potentially involves switching between Haskell threads. At the very least, a safe foreign call involves releasing and re-acquiring capability.

4.4.1 Anatomy of a safe foreign call

Every capability, among other things, has a list of tasks (returning_tasks) that have completed their safe foreign call. The following are the steps involved in invoking a safe foreign call:

Before the foreign call, release the current capability to another worker task.

Perform the foreign call.

Add the current task to returning_tasks list.

Reacquire the capability.

Resume execution of Haskell thread.

The first action performed by the worker task that acquired the capability is to check if returning_tasks is not empty. If so, the worker yields the capability to the first task in the returning_task list (fast path). Otherwise, the worker proceeds to run the next task from the run queue (slow path). Thus, in the fast path, the haskell thread never switches.

4.4.2 Problem

In the LWC implementation, the worker does not have the reference to the scheduler to pick the next task from. And for the same reason, when the task returns from the foreign call, it needs to know what to do with the Haskell thread, whether to switch to it (fast path) or add it to the scheduler data structure, to which it does not have a reference to. Even if the RTS had a reference to the scheduler data structure, it must be implemented in such a way that it is operable by both C and Haskell code.

5 Status

5.1 Done

5.2 Todo

Implement substrate primitives for scheduler actions. This will obviate the need for explicitly passing scheduler actions as arguments to concurrency primitives. The type signature are given below. These are obtained from the upcall handlers installed for the current thread. For