When using the systhread library with at least one computation-intensive thread, the scheduler behaves very badly: preemption happens only rarely. This is demonstrated, for example, by the following program when run in native mode:
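The program itself is not reproduced here; a minimal sketch of the kind of program described (thread bodies, names, and loop sizes are invented for illustration) might look like:

```ocaml
(* Hypothetical reconstruction: one thread doing short units of work
   ("FAST"), one doing long ones ("SLOW").  Both allocate, so pending
   signals can in principle be handled at GC points. *)

let fast_done = ref false

let fast () =
  let n = ref 0 in
  while not !fast_done do
    ignore (Array.make 100 0);            (* allocate through the GC *)
    incr n;
    if !n mod 1_000_000 = 0 then print_endline "FAST"
  done

let slow () =
  for _ = 1 to 5 do
    (* a long computation that allocates but does not otherwise interact *)
    for _ = 1 to 2_000_000 do ignore (Array.make 10 0) done;
    print_endline "SLOW"
  done

let () =
  let t = Thread.create fast () in
  slow ();
  fast_done := true;
  Thread.join t
```

With a fair scheduler one would expect many "FAST" lines per "SLOW" line; the report observes roughly as many of each.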

In the output, there are roughly as many "FAST" lines as "SLOW" lines, meaning that preemption only happens when printing to standard output. I would expect to see many more "FAST" lines than "SLOW" lines (i.e., the slow computation should be preempted by the fast one even when the slow one does not interact with the outside world).

Note that both computations allocate memory through the GC, so the signals used for preemption can indeed be handled. This can be seen by adding the following lines at the beginning of the previous code, which restores the expected behavior:

I.e., we periodically force preemption by suspending the currently executing thread for a very short time.
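The lines in question are not reproduced at this point; judging from the later comment in this thread that quotes them, they amount to installing a briefly sleeping handler for the tick thread's signal:

```ocaml
(* Override the Thread library's SIGVTALRM handler (the signal sent by
   the tick thread): sleeping even one microsecond forces a reschedule. *)
let preempt _signal = Thread.delay 1e-6

let () =
  Sys.set_signal Sys.sigvtalrm (Sys.Signal_handle preempt)
```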

My analysis of the problem is that there are two flaws in the implementation of the systhread library:

1- The "tick thread" regularly sends a signal so that yield() is called regularly. But on Linux, the only thing yield() does is release and re-acquire the master lock. Since sched_yield() apparently doesn't do what we want here, one solution could be to use usleep(1) instead. My experiments seem to indicate that the loss in performance is not measurable.
2- Actually, 1- would be fine if the implementation of the master lock guaranteed fairness. But it doesn't. Another solution would therefore be to use a fair implementation of the master lock (such as a ticket lock). This should not make the implementation much more complicated, given that the master lock is already implemented "by hand".
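For illustration only (the runtime's master lock lives in C, so this is an OCaml-level sketch of the idea, not a patch): a ticket lock serves acquirers strictly in arrival order, so no thread can be starved.

```ocaml
(* Sketch of a ticket lock built on Mutex/Condition.  Each acquirer
   takes a ticket; tickets are served strictly in increasing order. *)
type ticket_lock = {
  mutable next_ticket : int;   (* next ticket to hand out *)
  mutable now_serving : int;   (* ticket currently allowed to proceed *)
  mutex : Mutex.t;
  served : Condition.t;
}

let create () =
  { next_ticket = 0; now_serving = 0;
    mutex = Mutex.create (); served = Condition.create () }

let lock l =
  Mutex.lock l.mutex;
  let my_ticket = l.next_ticket in
  l.next_ticket <- my_ticket + 1;
  while l.now_serving <> my_ticket do
    Condition.wait l.served l.mutex
  done;
  Mutex.unlock l.mutex

let unlock l =
  Mutex.lock l.mutex;
  l.now_serving <- l.now_serving + 1;
  Condition.broadcast l.served;
  Mutex.unlock l.mutex
```

The real master lock would implement the same counters in C, but the fairness argument is identical: a thread signalled by the tick thread is guaranteed to get the lock before the interrupted thread re-acquires it.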

Jane Street did some experiments with using [nanosleep] (similar to your point 1) and as far as I remember, the conclusion was that we would be keen for the runtime to provide an option to use that, rather than the current [yield].

The fairness issue you mention is especially visible under Linux, because the Linux scheduler favors throughput over fairness, and because sched_yield() was made unusable in Linux 2.6. Windows is more fair as far as I can remember.

I'm open to using nanosleep(1 ns) instead of sched_yield() under Linux, provided someone conducts serious experiments to measure the overhead, if any.

But it's unclear to me why we would want fairness in thread scheduling and be willing to invest time and possibly degrade performance to get it. What is wrong with the behavior reported in this MPR?

There are a few different schemes that we considered and experimented with. The two that I have numbers for are:

1. dropping the `#ifndef` in `yield` to re-instate the `sched_yield` call; and
2. using `nanosleep(1ns)` instead of `sched_yield`.

In addition to these modifications, we modified the Async scheduler to call `Thread.yield` in its main loop. You can see the location of this change here, though that location is currently occupied by a nanosleep that can be enabled and disabled through a configuration flag (disabled by default):

Briefly, we have an Async-based client and server that support two commands: `get-updates` and `ping`. A `get-updates` request will cause the server to continuously stream bytes to the client until the client closes the connection. A `ping` request on the other hand will cause the server to send a short response and close the connection.

What we're testing, then, is the latency of pings to a server that is servicing a `get-updates` request. We'd expect a fair system to result in low ping times, on the order of tens of milliseconds on average.

You can see the results in the gist linked above (units are in seconds and include process startup time for the client, so subtract 50 ms to approximate the real latency). Unfortunately, the tabular data is not rendered properly in Mantis.

As you can see, the baseline had a good median but a bad average and tail, with correspondingly high variance. `sched_yield` had a worse median, but that bought you a better average and tail.

And as is clear from the table, `nanosleep` is just an across-the-board improvement. It's a clear winner in terms of thread fairness, though it doesn't really speak to the performance impact on the long-running operation. In addition, it certainly has the downside that it makes the current thread unrunnable. How severe a penalty you take by making the current thread unrunnable really depends on how much work the other threads in your system have to do before releasing the lock again. Ideally, we would stay runnable but just go to the back of the run queue, which is exactly the behavior that was changed in `sched_yield` so long ago.

It might make sense to somehow expose the number of threads waiting on the lock. That way, decisions on how to effect a yield can be left up to the application. In Async, we approximate this using the number of threads that are in use from our thread pool. This quantity conflates threads that may be making progress, threads that are blocked in a system call, and threads that are blocked waiting to acquire the OCaml lock. In the absence of a change to `Thread.yield`, it would be nice to tease out just that last component, so that use cases like ours don't have to over-approximate the condition under which yielding is a productive action.
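To make the suggestion concrete, the sketch below invents a `waiting_for_lock` primitive. No such function exists in the Thread module; the stub here always returns 0, whereas a real version would query the runtime. It shows how an application-level scheduler could decide when yielding is productive:

```ocaml
(* Stub for a hypothetical runtime primitive: the number of threads
   currently blocked trying to acquire the OCaml master lock. *)
let waiting_for_lock () = 0

(* Yield only when some other thread is actually waiting for the lock,
   so that yielding is never a wasted reschedule. *)
let maybe_yield () =
  if waiting_for_lock () > 0 then Thread.yield ()
```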

Thanks for these experiments, which indeed confirm what I observed informally.

I think Xavier was rather interested in measuring the overhead in terms of computation time. If a worker thread (i.e., a thread that uses CPU time) gets interrupted 20 times per second and (at least) one context switch occurs each time, this has a cost. If this cost is indeed negligible (which I guess it should be, but I never did serious experiments), then I see no reason not to make this change.

I tried to do what Xavier suggested, by executing N computational threads in parallel. However, the fact that several threads actually execute at the same time (i.e., preempt each other) has an important effect on the allocation pattern, and hence on the performance of the GC. It becomes quite difficult to get relevant numbers: depending on the computation, the version with preemption performs better or worse by more than 5%.

Instead, I used a different setup: I measure the throughput of one computation thread running at the same time as an I/O thread. The computation thread is designed to both allocate and compute things. It does 1000 runs of a computation that lasts about 0.1s and then prints statistics about the timings.

The I/O thread essentially echoes on stdout numbers that it reads from stdin. I did experiments where I either give no input to the program or a small amount of input data. The presence of a small amount of input does not seem to have any impact on performance. Giving it a large amount of input would obviously be unfair, since the non-preemptive version would not handle I/O while the preemptive version would.
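A scaled-down sketch of this setup (the run count and work size here are much smaller than the 1000 runs of ~0.1 s used in the actual experiment, and the statistics are reduced to a mean):

```ocaml
(* Computation thread body: both allocates and computes. *)
let work () =
  let acc = ref 0 in
  for i = 1 to 200_000 do
    let a = Array.make 8 i in            (* allocate through the GC *)
    acc := !acc + a.(0)
  done;
  !acc

(* I/O thread: echo stdin to stdout until end of file. *)
let echo () =
  try
    while true do print_endline (input_line stdin) done
  with End_of_file -> ()

let runs = 10
let total = ref 0.

let () =
  ignore (Thread.create echo ());
  for _ = 1 to runs do
    let t0 = Sys.time () in
    ignore (work ());
    total := !total +. (Sys.time () -. t0)
  done;
  Printf.printf "mean run time: %.6f s\n" (!total /. float_of_int runs)
```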

The difference between the preemptive and the non-preemptive version is the presence of the following lines in the preemptive version of the code:

```ocaml
let preempt _signal = Thread.delay 1e-6

let () =
  Sys.set_signal Sys.sigvtalrm (Sys.Signal_handle preempt)
```

This effectively overrides Thread's SIGVTALRM handler with mine. My handler does exactly what the new handler of Thread would do.

I attach my source file if you want more details about the actual code. All the experiments were done with 4.06.0. I can't see any reason why trunk would make any difference here.

Moreover, running the executable several times seems to indicate that the results are fairly reproducible: we cannot see any significant performance change. Even though the preemptive version seems a bit slower, it is only by about 0.5%.