Here is the annotated slidedeck from my recent talk on the internals of Multicore OCaml GC. The aim is to provide an overview of the Multicore OCaml GC design with a few deep dives into some of the interesting parts of the GC design. The post has also been discussed elsewhere at Hacker News and reddit (r/programming, r/ocaml). Happy to answer questions!

Thanks! I read the document earlier in the day and I think it is a truly excellent document -- it ramps up slowly enough in term of sophistication, so that people that already know a bit about GC stuff but don't routinely work with them can refresh their memories on the fly, and it goes into far more details than what I have previously seen (except an in-person talk by Stephen a couple years ago, but many things seem to have changed since then.).

One thing that I did not realize about your design before is the very subtle interaction between the fiber switching machinery and the GC. This aspect is fairly complex and easy to not think of when reading a description of the GC in "normal" functioning (without fiber switches), or when reading descriptions of the user-side of fiber scheduling (ML programming with effects, where it is natural to forget about GC interaction).

Thanks @gasche! My intention was indeed to make the multicore GC design more approachable to non-experts so that we may have informed conversation around the design. Happy that you found the fiber-GC interactions interesting. This is something that is not widely known (GC books don't talk about concurrency) since only a few languages have them. I've seen interesting discussions in golang around GC and goroutines. For example, [0,1]. Some of these are useful choices to consider for multicore OCaml.

How does fiber switching and its complexity compare to the costs incurred when doing concurrency the monadic way? Generally speaking, it seems like Lwt/Async are building up large data structures analogous to these stacks, and that they use the default GC scheme, whereas by embedding and specializing the handling of these data structures in the runtime, we should get better performance. Does that sound right?

Lwt/Async unroll the execution stack into heap data structures, not to dissimilar to CPS translating the code. In particular, what are essentially stack frames in fibers are allocated as intermediate closures in monadic libraries. We do see better performance in this case. See performance results (a bit dated) in http://kcsrk.info/slides/multicore_fb16.pdf.