Today I made a typo in the shell: instead of m trc (m being my alias for less and trc the name of some trace file), I accidentally typed mt rc. To my surprise, the system didn’t respond with zsh: command not found: mt. It responded with this message:

Say, what? I’ve been using Unix-like systems for REDACTED years, but I don’t remember ever having used an mt command. And so I checked the fine manual:

MT(1) GNU CPIO MT(1)
NAME
mt - control magnetic tape drive operation
SYNOPSIS
mt [-V] [-f device] [--file=device] [--rsh-command=command] [--version]
operation [count]
DESCRIPTION
This manual page documents the GNU version of mt. mt performs the
given operation, which must be one of the tape operations listed below,
on a tape drive.
The default tape device to operate on is taken from the file
/usr/include/sys/mtio.h when mt is compiled. It can be overridden by
giving a device file name in the environment variable TAPE or by a com‐
mand line option (see below), which also overrides the environment
variable.
...

Now I admit that I am not too keen to upgrade my Ubuntu, but it’s not that ancient, either. It turns out that an Ubuntu distribution released in 2018 actually installs a program to control magnetic tape drive operation! I know, I know, the program is likely useful for other things as well, and tar is technically also short for “tape archiver”, but it still feels a bit silly to carry this cruft into the 21st century.

And people say that Windows is arcane in its support for DOS graphics and x86 addressing modes.

Several months ago a StackOverflow user asked an interesting question:

Imagine we have an iterator, say iter(range(1, 1000)). And we have two functions, each accepting an iterator as the only parameter, say sum() and max(). In SQL world we would call them aggregate functions. Is there any way to obtain results of both without buffering the iterator output?

The question went unnoticed for some time, perhaps because it didn’t specify the customary python-asyncio. I found it challenging, so I revived it with an answer that provides solutions using the various async tools available in Python¹. It soon rose to blog-post-length and might be interesting to my readers, so I’m reproducing it here.

¹ Technically not all solutions because it doesn’t cover curio and trio, but they are in this respect close enough to asyncio that the asyncio example can be easily adapted to them.

Two functions, one iterator

Let’s consider how to apply two aggregate functions to the same iterator, which we can only exhaust once. The initial attempt (which hardcodes sum and max for brevity, but is trivially generalizable to an arbitrary number of aggregate functions) might look like this:

This implementation has the downside that it stores all the generated elements in memory at once, despite both functions being perfectly capable of stream processing. The question anticipates this cop-out and explicitly requests the result to be produced without buffering the iterator output. Is it possible to do this?

Serial execution: itertools.tee

It certainly seems possible. After all, Python iterators are external, so every iterator is already capable of suspending itself. How hard can it be to provide an adapter that splits an iterator into two new iterators that provide the same content? Indeed, this is exactly the description of itertools.tee, which appears perfectly suited to parallel iteration:

The above produces the correct result, but doesn’t work the way we’d like it to. The trouble is that we’re not iterating in parallel. Aggregate functions like sum and max never suspend – each insists on consuming all of the iterator content before producing the result. So sum will exhaust it1 before max has had a chance to run at all. Exhausting elements of it1 while leaving it2 alone will cause those elements to be accumulated inside an internal FIFO shared between the two iterators. That’s unavoidable here – since max(it2) must see the same elements, tee has no choice but to accumulate them. (For more interesting details on tee, refer to this post.)

In other words, there is no difference between this implementation and the first one, except that the first one at least makes the buffering explicit. To eliminate buffering, sum and max must run in parallel, not one after the other.

Threads: concurrent.futures

Let’s see what happens if we run the aggregate functions in separate threads, still using tee to duplicate the original iterator:

Now sum and max actually run in parallel (as much as the GIL permits), threads being managed by the excellent concurrent.futures module. It has a fatal flaw, however: for tee not to buffer the data, sum and max must process their items at exactly the same rate. If one is even a little bit faster than the other, they will drift apart, and tee will buffer all intermediate elements. Since there is no way to predict how fast each will run, the amount of buffering is both unpredictable and has the nasty worst case of buffering everything.

To ensure that no buffering occurs, tee must be replaced with a custom generator that buffers nothing and blocks until all the consumers have observed the previous value before proceeding to the next one. As before, each consumer runs in its own thread, but now the calling thread is busy running a producer, a loop that actually iterates over the source iterator and signals that a new value is available. Here is an implementation:

This is quite some amount of code for something so simple conceptually, but it is necessary for correct operation.

produce() loops over the outside iterator and sends the items to the consumers, one value at a time. It uses a barrier, a convenient synchronization primitive added in Python 3.2, to wait until all consumers are done with the old value before overwriting it with the new one in next_val. Once the new value is actually ready, a condition is broadcast. consume() is a generator that transmits the produced values as they arrive, until detecting STOP. The code can be generalized run any number of aggregate functions in parallel by creating consumers in a loop, and adjusting their number when creating the barrier.

The downside of this implementation is that it requires creation of threads (possibly alleviated by making the thread pool global) and a lot of very careful synchronization at each iteration pass. This synchronization destroys performance – this version is almost 2000 times slower than the single-threaded tee, and 475 times slower than the simple but non-deterministic threaded version.

Still, as long as threads are used, there is no avoiding synchronization in some form. To completely eliminate synchronization, we must abandon threads and switch to cooperative multi-tasking. The question is is it possible to suspend execution of ordinary synchronous functions like sum and max in order to switch between them?

Fibers: greenlet

It turns out that the greenlet third-party extension module enables exactly that. Greenlets are an implementation of fibers, lightweight micro-threads that switch between each other explicitly. This is sort of like Python generators, which use yield to suspend, except greenlets offer a much more flexible suspension mechanism, allowing one to choose who to suspend to.

This makes it fairly easy to port the threaded version of max_and_sum to greenlets:

The logic is the same, but with less code. As before, produce produces values retrieved from the source iterator, but its send doesn’t bother with synchronization, as it doesn’t need to when everything is single-threaded. Instead, it explicitly switches to every consumer in turn to do its thing, with the consumer dutifully switching right back. After going through all consumers, the producer is ready for the next iteration pass.

Results are retrieved using an intermediate single-element list because greenlet doesn’t provide access to the return value of the target function (and neither does threading.Thread, which is why we opted for concurrent.futures above).

There are downsides to using greenlets, though. First, they don’t come with the standard library, you need to install the greenlet extension. Then, greenlet is inherently non-portable because the stack-switching code is not supported by the OS and the compiler and can be considered somewhat of a hack (although an extremely clever one). A Python targeting WebAssembly or JVM or GraalVM would be very unlikely to support greenlet. This is not a pressing issue, but it’s definitely something to keep in mind for the long haul.

Coroutines: asyncio

As of Python 3.5, Python provides native coroutines. Unlike greenlets, and similar to generators, coroutines are distinct from regular functions and must be defined using async def. Coroutines can’t be easily executed from synchronous code, they must instead be processed by a scheduler which drives them to completion. The scheduler is also known as an event loop because its other job is to receive IO events and pass them to appropriate callbacks and coroutines. In the standard library, this is the role of the asyncio module.

Before implementing an asyncio-based max_and_sum, we must first resolve a hurdle. Unlike greenlet, asyncio is only able to suspend execution of coroutines, not of arbitrary functions. So we need to replace sum and max with coroutines that do essentially the same thing. This is as simple as implementing them in the obvious way, only replacing for with async for, enabling the async iterator to suspend the coroutine while waiting for the next value to arrive:

One could reasonably ask if providing a new pair of aggregate functions is cheating; after all, the previous solutions were careful to use existing sum and max built-ins. The answer will depend on the exact interpretation of the question, but I would argue that the new functions are allowed because they are in no way specific to the task at hand. They do the exact same thing the built-ins do, but consuming async iterators. I suspect that the only reason such functions don’t already exist somewhere in the standard library is due to coroutines and async iterators being a relatively new feature.

With that out of the way, we can proceed to write max_and_sum as a coroutine:

Although this version is based on switching between coroutines inside a single thread, just like the one using greenlet, it looks different. asyncio doesn’t provide explicit switching of coroutines, it bases task switching on the await suspension/resumption primitive. The target of await can be another coroutine, but also an abstract “future”, a value placeholder which will be filled in later by some other coroutine. Once the awaited value becomes available, the event loop automatically resumes execution of the coroutine, with the await expression evaluating to the provided value. So instead of produce switching to consumers, it suspends itself by awaiting a future that will arrive once all the consumers have observed the produced value.

consume() is an asynchronous generator, which is like an ordinary generator, except it creates an async iterator, which our aggregate coroutines are already prepared to accept by using async for. An async iterator’s equivalent of __next__ is called __anext__ and is a coroutine, allowing the coroutine that exhausts the async iterator to suspend while waiting for the new value to arrive. When a running async generator suspends on an await, that is observed by async for as a suspension of the implicit __anext__ invocation. consume() does exactly that when it waits for the values provided by produce and, as they become available, transmits them to aggregate coroutines like asum and amax. Waiting is realized using the next_val future, which carries the next element from it. Awaiting that future inside consume() suspends the async generator, and with it the aggregate coroutine.

The advantage of this approach compared to greenlets’ explicit switching is that it makes it much easier to combine coroutines that don’t know of each other into the same event loop. For example, one could have two instances of max_and_sum running in parallel (in the same thread), or run a more complex aggregate function that invoked further async code to do calculations.

The following convenience function shows how to run the above from non-asyncio code:

Performance

Measuring and comparing performance of these approaches to parallel execution can be misleading because sum and max do almost no processing, which over-stresses the overhead of parallelization. Treat these as you would treat any microbenchmarks, with a large grain of salt. Having said that, let’s look at the numbers anyway!

Measurements were produced using Python 3.6 The functions were run only once and given range(10000), their time measured by subtracting time.time() before and after the execution. Here are the results:

max_and_sum_buffer and max_and_sum_tee: 0.66 ms – almost exact same time for both, with the tee version being a bit faster.

max_and_sum_threads_simple: 2.7 ms. This timing means very little because of non-deterministic buffering, so this might be measuring the time to start two threads and the synchronization internally performed by Python.

max_and_sum_threads: 1.29 seconds, by far the slowest option, ~2000 times slower than the fastest one. This horrible result is likely caused by a combination of the multiple synchronizations performed at each step of the iteration and their interaction with the GIL.

max_and_sum_greenlet: 25.5 ms, slow compared to the initial version, but much faster than the threaded version. With a sufficiently complex aggregate function, one can imagine using this version in production.

max_and_sum_asyncio: 351 ms, almost 14 times slower than the greenlet version. This is a disappointing result because asyncio coroutines are more lightweight than greenlets, and switching between them should be much faster than switching between fibers. It is likely that the overhead of running the coroutine scheduler and the event loop (which in this case is overkill given that the code does no IO) is destroying the performance on this micro-benchmark.

max_and_sum_asyncio using uvloop: 125 ms. This is more than twice the speed of regular asyncio, but still almost 5x slower than greenlet.

Running the examples under PyPy doesn’t bring significant speedup, in fact most of the examples run slightly slower, even after running them several times to ensure JIT warmup. The asyncio function requires a rewrite not to use async generators (since PyPy as of this writing implements Python 3.5), and executes in somewhat under 100ms. This is comparable to CPython+uvloop performance, i.e. better, but not dramatic compared to greenlet.

async/await and the event loop

Python 3.5 officially introduced asynchronous functions, also known as coroutines, and the await keyword. Based on the yield from extension to generators available since Python 3.3, await allows a seemingly blocking operation to instead suspend the execution of the async function, and be resumed later. The mix of synchronous appearance with asynchronous execution makes it a perfect replacement for callback-based programming, the infamous callback hell. Async/await was added to Python for the asyncio ecosystem, but it soon inspired asyncio alternatives such as curio and trio, and got adopted by old players in the async field. And none of this is unique to Python – async and await were previously present in C# and have since made their way into JavaScript, Scala, and even Rust.

A common theme with async/await coroutines is that their use tends to be centered around an event loop. This is because coroutines, much like generators, and in contrast to ordinary functions, cannot be started and left to complete without assistance. Before it produces a result, a coroutine can suspend itself and expect its caller to resume it later, once some condition is met. Multiple coroutines can run concurrently, using await to turn what would normally be a blocking operation into a cooperative context switch. All this requires coroutines to be driven by a dispatcher with which they coordinate their suspensions. Reasons for suspension will vary, but most of them will ultimately boil down to waiting for data, a resource, or a timeout. Since multiplexing IO and timeout events is the job description of polling event loops, a coroutine scheduler is typically integrated with an event loop.

asyncio can of course do much more than run a single coroutine to completion. It can submit a coroutine to the event loop without waiting for it to finish, allowing independent coroutines to run concurrently. It provides combinators such as gather and wait which make it convenient for coroutines to start other coroutines in parallel, using await to get their results when available. It supports queuing ordinary functions to execute alongside coroutines, as well as running chunks of synchronous code in a thread or process pool and awaiting it as if it were a coroutine. Finally, once submitted to the event loop, coroutines can be manipulated like futures, making it easy to hook them to classic async code that relies on callbacks. Taken together, all these provide a very powerful toolbox for real-world asynchronous programming — as long as the code is run inside the asyncio event loop.

But what happens if we can’t run the asyncio event loop? Python callbacks sometimes run embedded inside the event loop of a game, in an application using a foreign GUI toolkit, or in a web server on an appliance. Is there a way to use coroutine in such environments, or are we condemned to revert to callback hell?

One option is to implement a subset of the event loop interface on whatever event loop we are running under. This is in principle explicitly allowed by asyncio, whose event loop is specified by PEP 3156. Doing so in practice is still a large undertaking best left to dedicated projects. For example, gbulb, a project that implements the asyncio event loop on top of GLib’s, contains a decent amount of bridge code, and still warns of the various impedance mismatches between GLib and asyncio. Some differences are fundamental, such as GLib allowing recursive looping while asyncio doesn’t, and some are subtle, such as incompatible approaches to multi-threaded execution or platform-specific differences. Implementing the asyncio event loop just to add some coroutines to an existing application is a non-starter.

It turns out that there are ways of executing coroutines that do not require having an event loop.

Coroutines with continuations

Coroutine-like constructs have been around for much longer than Python-style generators. The Scheme language famously pioneered continuations, a concept more powerful and general than coroutines, but also harder to understand and harder yet to efficiently implement. Scheme didn’t have a concept of coroutines, but it allowed any function to capture a continuation, which would allow it to be later resumed at that exact point. The powerful call/cc primitive could be used to implement all kinds of control flow abstractions, not limited to coroutines. In case of coroutines, both suspension and resumption can be expressed in terms of requesting a continuation and at the right time, and applying it later to switch to the previous execution point. Scheme continuations could be called any number of times, which posed a burden on their efficient implementation. This is why full-fledged continuations, despite their undeniable power, are almost universally shunned by mainstream language implementors.

The success of Python generators showed that more limited forms of suspendable functions can be efficiently implemented and still be well-accepted by programmers. Where a Scheme continuation is almost as heavy as a thread because it needs to store a snapshot of the entire call stack, a running Python generator only needs to store its local variables, and those of its subgenerators if any.

The rise of callback-based programming that led to the invention of async/await sparked a renewed interest in coroutines, including those with explicit continuations. In May 2017 Kotlin 1.1 introduced experimental support for just such coroutines. Kotlin coroutines deftly avoid the downsides of full-fledged Scheme-style continuations by instituting the same limitations as those of Python generators: there is a clear boundary between suspending and non-suspending code, and each continuation can only be resumed once.

Kotlin divides functions into suspending functions, statically marked with a suspend modifier, and regular non-suspending ones. To call a suspending function from normal code, one must invoke the createCoroutine primitive or its high-level cousins launch and async. Once inside a suspending function, one can freely call other suspending functions. Unlike in Python, await is not explicit, calling another suspending function automatically awaits it. Most importantly, a suspending function may choose to suspend itself by calling suspendCoroutine, a function without a direct Python equivalent. suspendCoroutine suspends execution, like Python’s yield, but before doing so, generates a continuation object and passes it to the coroutine. The coroutine will be resumed when someone calls the resume method on the continuation object.

This example uses suspendCoroutine to resume the same coroutine in a different thread, after a 1-second delay:

resume_later trivially uses the venerable Timer to spawn a new thread that will call cont.resume(None) a second later. The real magic happens in suspendCoroutine, which:

creates a continuation and immediately (before suspending!) invokes resume_later with the continuation;

then returns an awaitable object that suspends the coroutine when awaited.

But why is all this a big deal? How is it useful?

Notice how everything is done without relying on the asyncio event loop. In fact, not only doesn’t the code rely on a particular event loop, it doesn’t presuppose the existence of an event loop at all! Using explicit suspend/resume, a coroutine can suspend itself and its callers, previously arranging its resumption. And it can achieve this relying only on the resources at its disposal, without needing anything like a full-fledged PEP 3156 event loop. For example, a version of asyncio.sleep that works within the GTK event loop might look like this:

Usefulness of explicit suspend is not limited to sleeping, it could be used to await any callback, such as update ticks from the GTK frame clock, or arbitrary widget signals. Widget setup code reacting to multiple events such as realize and map must currently be dispersed acrossed several callbacks. With suspend they could be easily expressed as a single coroutine that awaits each signal it’s interested in.

Use cases extend beyond GUI toolkits. Explicit suspend has the potential to bring benefits of coroutines to any callback-based environment, no matter how minimalistic.

Now that we’re hooked, let’s implement suspendCoroutine.

Continuations in Python

How await works

A Python coroutine, also called an async function, is a thin wrapper around a generator, where yield signals a suspension and return signals coroutine completion. The @coroutine decorator can be used to promote an ordinary generator into an awaitable coroutine:

@types.coroutine
def inner():
yield 1
yield 2
return "foo"

Other async functions can now use await inner() to delegate their execution to inner(). But what will await return? Let’s try to drive the coroutine to completion using its send method:

As far as outer is concerned, the individual values yielded by inner do not exist, they are directly transferred all the way up to the caller of send. On the other hand, the foo value returned by inner is never seen by send, it is consumed by the await in outer. In its StopIteration instance send() only receives the bar value returned by outer. await can be thought of as syntactic sugar for:

await is implemented with the same code as yield from, which delegates execution to a sub-generator, which is (as authors of asyncio noticed) conceptually the same thing as awaiting a coroutine. The original asyncio was in fact entirely based on generators and yield from.

Suspensions in asyncio

The event loop drives a coroutines by doing something similar to what await does: invoking send() until it signals it is finished by raising StopIteration. The difference is that the event loop switches to a difference coroutine after each send, and switches to waiting for IO or timeouts when there are no runnable coroutines.

Any coroutine that suspends must be resumed at a later point by the event loop. A naked “suspend me” primitive is not provided by asyncio — every asyncio suspension is paired with the code that arranges for timely resumption. To see how this works, consider a minimal implementation of asyncio.sleep():

At this level, things start to look similar to the Kotlin version. Suspension itself is clearly represented by yield. Prior to suspension, the coroutine arranges for its resumption with loop.call_later. The continuation is apparently represented by the “current task”, the object that owns the coroutine inside the event loop, conveniently provided by asyncio. The task object inherits from asyncio.Future, which means it is resumed with the set_result() method, analogous to cont.resume(). Since the task is tracked by the event loop, it will automatically notice that it has finished and will resume a coroutine that awaits simple_sleep.

Explicit continuation

Now let’s see if we can write a version of simple_sleep that doesn’t depend on asyncio. First, we’ll need a simple driver to invoke the coroutine from synchronous code:

def start(coro):
try:
coro.send(None)
# at this point the coroutine is suspended
except StopIteration:
pass # the coroutine chose not to suspend

This is very different from an event loop. For one, there is no loop! The name start reflects that we don’t really drive the coroutine, we simply set it in motion, and leave it to do the rest.

Calling start on asyncio-style coroutines such as inner and outer won’t produce a useful result – it will simply pick up the first yielded value and exit immediately, discarding the value. This is exactly what we want — our coroutines know how to resume themselves, so there is no need for start to continue executing after the coroutine has been given a chance to run. Also, if we ever hope to invoke start from a callback-based system, both start and the later continuations must be non-blocking, i.e. each of them must only execute the coroutine code until the next suspension point.

Next, we need to change simple_sleep to use threading.Timer to arrange for its continuation, renaming it to thread_sleep:

Here we have a little problem. Exactly what callback do we pass to timer.Thread? Creating a Future like asyncio does wouldn’t help because a freshly created future would not be observed by anyone, so its set_result would have no effect. However, start already has exactly the continuation we’re looking for — it’s coro.send(). But we don’t and cannot access to the running coroutine object stored somewhere on the stack. If only we could only somehow tell start to telegram us the continuation, we’d be set.

One way to do so would be for start to leave the current continuation in a global variable, like asyncio does with the current task. But there is an even neater approach, inspired by Curio, and that is to request the continuation from our non-coroutine caller! Something like this:

Now we have two suspensions, one to request the continuation, and the other to actually suspend. We’ll also change start to handle the first one:

def _resume(coro, value):
# the pattern of catching StopIteration is frequent enough that a
# convenience function comes useful
try:
coro.send(value)
return True
except StopIteration:
return False
def start(coro):
if not _resume(coro, None):
return # the coroutine chose not to suspend at all
# we are at first suspension, the coroutine requested the continuation
cont = lambda value: _resume(coro, value)
_resume(coro, cont)
# we are at second suspension, the coroutine is responsible
# to resume itself by invoking its continuation

It works! start exited immediately, as expected, but the coroutine arranged for itself to continue. As strange as it looks to use _resume to send _resume to the function, it actually works. The await construct drives thread_sleep for as long as it suspends, passing each suspensions to its caller until it reaches a non-coroutine caller such as start().

Explicit suspension

Of course, we don’t want to write coroutines like thread_sleep using bare generators – we want to have a suspension primitive that is itself awaitable, the equivalent of Kotlin’s suspendCoroutine. Looking at thread_sleep, it is straightforward to generalize it:

This works exactly the same as thread_sleep, but leaving it up to the caller to decide what to do with the continuation, by passing it as to a function of the caller’s choosing. (This is why Scheme calls its primitive “call with current continuation”.) thread_sleep can now be an actual async function that awaits suspend:

start(greet()) works just like before. The original greet example translated from Kotlin to Python would now also work, just with suspendCoroutine changed to suspend. In fact, other than in the implementation of suspend, we never again have to use raw generators to implement suspension.

As a final change, the suspend API that accepts a function requires defining a temporary function for even very simple uses. This is not very ergonomic in Python, where suspension would be better expressed as an async context manager.

corocc

The code presented above is available as the corocc module on github. It includes tests and several usage examples and is published under the MIT license. Although it is more complete than the snippets shown above, it should at this stage be considered experimental.

The goal of corocc is to bring the power of explicit continuations to Python. This should serve to allow coroutines to be used in a wider range of applications than previously possible.

Relation to greenlets

Invoking explicit continuations sounds in principle similar to the switching of greenlets, “green thread” objects provided by the greenlet library. Beyond the superficial similarity, the two do not share the same design. corocc is based on async functions and the await keyword, themselves based on Python generators and yield from. Greenlets are based on a full-featured green-thread design without distinction between blocking and suspending functions, so that any function may decide to switch context. Also, greenlets are organized into a tree hierarchy with exception propagation upward. These feature come at a cost of increased weight and also some limitations.

The two differ in behavior with threading. This is an area where continuation-based coroutines can lift the limitations of coroutines that were never inherent to coroutines themselves, but to the event loop design. There is nothing preventing a corocc coroutine from starting execution in one thread and continuing in another, and so on for every continuation. greenlet on the other hand explicitly documents that “It is not possible to mix or switch between greenlets belonging to different threads.” (Of course, switching a coroutine between threads is not possible in asyncio either.)

Cancellation

Canceling coroutines and futures from the outside is a standard event loop feature. In the explicit continuation paradigm, canceling a coroutine might prevent its continuation, possibly canceling its timeout if it was arranged through corocc.

Execution context

asyncio comes with functions gather and wait that combine coroutines running in parallel into a single awaitable object. Both of those are available in corocc, implemented on top of the on-done callback provided by the option of start to connect to a Future. However, fully implementing them requires a generic way to schedule a timeout task, the corocc equivalent of asyncio.sleep().

A generic corocc.sleep() would definitely require some knowledge of the execution context, perhaps obtained from a context provided to corocc.start(). Kotlin’s launch supports a context argument with similar semantics. The context would provide primitives like call_soon and call_later that could be easily adapted to new environments.

corocc.sleep() would use corocc.suspending() to retrieve the start context using the same protocol that currently retrieves the continuation. Then it would suspend itself and use context.call_later() to arrange for the continuation to be invoked later, regardless of the current event loop. The difference between this kind of execution a PEP 3156 event loop is that this mechanism is explicitly designed to hook into existing event loop systems, not to run the show.

Mixing with… other coroutines?

If corocc coroutines can be integrated into classic event loops, can we fit them into async event loops? Can we await a corocc coroutine from a Curio coroutine and the other way around? This doesn’t look very useful at first because async event loops are natively able to drive coroutines – but it might allow the same coroutine code to work under async and regular event loops!

It is not clear that this can be easily achieved, given the larger liberties provided to explicitly continued code. But with some ground rules and with support in the framework, it might be possible to do it anyway. And when the reward is access to a large body of established and well-tested code, it is well worth a try.

asyncio is the Python library for asynchronous programming and coroutine-based concurrency bundled with the language since version 3.4. As asynchronous programming and coroutines are currently hotly discussed in context of different programming languages (e.g. Kotlin, Rust, JavaScript, only to name a few), I decided to take a look at asyncio. Soon I found it fascinating and before I knew it I started spending time on the asyncio StackOverflow tag.

A StackOverflow participant asked me to recommend resources for learning about asyncio. While the official documentation provides a good reference manual, it doesn’t work as a tutorial – despite the authors’ best intentions, it is hard to follow unless one already understands the underpinnings. Fortunately there are quite a lot of introductory articles about asyncio. As with learning any novel concept, it takes some time to wrap one’s head around it, and one often needs to look at the same thing from many different angles before reaching that a-ha! moment that makes it all worth it. Here is a list of resources that I personally found enlightening.

Basic

Exploring Python 3’s Asyncio by Example – a more detailed introductory text. It shows spawning coroutines in parallel and using aiohttp for real-world tasks. It is written in the older yield from syntax, but don’t be put off by that, just imagine await in place of yield from, as their semantics in asyncio is identical. (Also, yield from is still supported, so the code works in modern asyncio just fine.)

Advanced

Python Concurrency From the Ground Up – a talk by the inimitable Dave Beazley. During the talk, Dave implements a full-featured event loop with support for coroutine dispatch, timeouts, and IO polling, typing and testing all of it in front of live audience. A friend likened this talk to the programming equivalent of jazz improvisation combined with stand-up comedy. Although the actual asyncio event loop implementation differs from the one presented there, this talk is a fantastic aid for understanding the dispatch of Python coroutines, and how it relates to the classic “event loop” concept. If you’re curious about coroutine-native event loops as presented in the talks, look up Dave’s curio and Nathaniel J. Smith’s trio libraries.

In the previous installment we showed that while atomic types provided by the Rust standard library can be used for lock-free access to shared values, memory reclamation must be ensured manually because Rust’s normal scoping rules do not cleanly map to lock-free concurrency.

Crossbeam

The problem of memory reclamation in lock-free data structures is not unique to Rust, it is shared by other languages without garbage collection, most notably C++. Different solutions have been proposed, sporting exotic names such as quiescent state based reclamation, epoch-based reclamation, and hazard pointers. See Tom Hart’s thesis for an extensive description of the memory reclamation strategies and analysis of their benefits and drawbacks.

In Rust the currently favored strategy is the epoch-based memory reclamation, a clever scheme that keeps track of objects marked for destruction in three thread-local bins. Each bin corresponds to an “epoch”, somewhat similar to a GC generation. When a thread becomes active, i.e. it is expected to start executing lock-free code, it locally stores the current epoch number (0-2) and uses it until deactivated. During this period, objects slated for destruction will be registered in the corresponding bin. A thread performing GC will first check whether all currently active threads belong to the current epoch. If that is the case, it means that there are no threads remaining from the previous epoch, and the epoch number is atomically bumped (incremented modulo 3). The thread that succeeds in incrementing the epoch proceeds to destroy objects from the bin of two epochs ago. For example, the thread that increments epoch from 1 to 2 can at that point safely destroy objects in bin 0. Objects in bin 1 cannot yet be destroyed because the epoch was just switched from 1 to 2, and there can still be active threads from epoch 1. But no new epoch 1 threads are being activated, and as soon as existing ones deactivate, all active threads will have been from epoch 2. At this point it will be safe to bump the epoch to 0 and drop objects from bin 1.

The nice thing about epoch-based memory reclamation is that it is a good fit for libraries, since it can be fully embedded inside the code that, say, implements a lock-free queue, without the rest of the application having to know anything about it. Rust’s implementation of epoch-based memory reclamation is provided by the Crossbeam crate. Aaron Turon’s original blog post is an excellent read on the topic, describing both Crossbeam and epoch-based memory reclamation in some detail using the classic Treiber’s stack as an example.

This version is very similar to the version from last article based on the imaginary AtomicCell, except it adapts to the requirements of Crossbeam. Let’s first cover the basics:

source and value are Atomic, Crossbeam’s equivalent of AtomicCell. Option is not needed because Crossbeam always allows options by representing None values as null.

Before calling Atomic::swap and Atomic::load, the thread needs to be “pinned”, i.e. marked as active within the current epoch. The guard returned by epoch::pin serves as proof that the thread has been pinned, and automatically marks it as inactive when destroyed. A reference to this guard can be sent to helper methods such as try_transform.

Crossbeam’s Atomic::swap accepts Owned, an object similar to Box that guarantees that the value we’re storing is heap-allocated and owned by the caller (who just transferred that ownership to swap). This is similar to AtomicCell::swap from last post, except Crossbeam’s design allows reusing a previously extracted box.

Methods working with Atomic accept an Ordering argument, with the same meaning as in Rust’s atomic types. The initial test of source requests the least strict Relaxed ordering, which is safe because the source, if non-null, will be re-fetched using a stricter ordering once again in try_transform.

The key feature introduced by Crossbeam lies in the mostly-automatic memory management implemented on top of epoch reclamation. Atomic::swap returns a Shared guard which encapsulates the pointer obtained from AtomicPtr::swap and provides safe access to the underlying object, concurrently observable by other threads. The lifetime bound on the returned Shared ensures that it cannot outlive the guard returned by epoch::pin(), preventing the object from being collected while reachable through Shared. Once we are done with the object, we must manually mark it for collection. This is an unsafe operation and something Crossbeam cannot attempt automatically because it cannnot prove that the retrieved pointer is not still used elsewhere in the data model, for example in a linked list chaining to the pointer. We know no such reference exists, so it’s safe to deallocate the object. Atomic::load is used exactly the same way, only without the final deallocation.

try_transform extracts the source value published by set_source by calling std::ptr::read, a function that moves the object from an arbitrary location and returns it by value. After the call to std::ptr::read, the memory where the object resided is treated as uninitialized, and it is left to Crossbeam to free it at a later epoch switch. std::ptr::read is marked unsafe because Rust cannot trace the pointer to prove that we own the object on that location. But since we pass it the location freshly swapped out that set_source won’t ever read, we know calling std::ptr::read is safe. An unsafe block hiding unsafe implementation inside a completely safe public API forms the essence of unsafe Rust. A safe function is not only one that uses no unsafe code, but also one that can be called with any kind of argument without incurring undefined behavior.

This version of LazyTransform satisfies the requirements of the exercise. It is not only lock-free, but also wait-free because it avoids compare-and-swap retry loops. The size of the LazyTransform object equals the size of two pointers and one bool, and the pointers only ever allocate the amount memory needed to store S and T respectively. Given the requirements, that is as memory-efficient as it gets.

Coco

Having written and tested the above code, I had expected it to be the final version of the code. However, running some additional test code had a strange effect on my memory monitor – the program was leaking memory, and in large quantities! I had both expected and in prior runs observed the memory to fluctuate due to epoch-based memory reclamation, but this was different. What I observed here was memory consumption monotonically growing for as long as the program was running. Also, the leak could only be reproduced when using a value type that allocates, such as a String. It looked like Crossbeam was simply not dropping the unreachable objects.

Carefully looking at the code, it is obviously inconsistent in its memory management of shared values. set_source simply forgets about the previous value, presumably expecting guard.unlinked(prev) to dispose of it. But try_transform() uses std::ptr::read() to move source data out of the Crossbeam-managed Owned container, and also calling guard.unlinked afterwards. They cannot both be correct: either guard.unlinked doesn’t drop the underlying object and guard.unlinked(prev) in set_source leaks memory, or it does drop and guard.unlinked(source) in try_transform results in a double free because the underlying source_data was moved to transform_fn and dropped there.

I posted a StackOverflow question and, again to my surprise, it turned out that not running destructors was a known limitation of the current Crossbeam. The description of Crossbeam does state that “the epoch reclamation scheme does not run destructors [emphasis in the original], but merely deallocates memory.” This means that Crossbeam’s guard.unlink(prev_value) deletes the dynamically allocated storage for T internally created by Atomic<T>, but doesn’t drop the underlying T instance. That works for the lock-free collections supported by current Crossbeam, which automatically remove items “observed” by the collection user (no peeking is allowed) and take ownership of the object inside, similar to our AtomicCell::swap. Support for such semantics fits the needs of a queue or stack, but not e.g. a lock-free map, or even of a simple container such as LazyTransform.

Maintainers of Crossbeam are aware of the issue and are working on a new version which will include many improvements, such as the support for full dropping of objects and an improved and tunable garbage collection. A preview of the new Crossbeam design is already available in the form of the Concurrent collections (Coco) crate, whose epoch-based reclamation implements the object dropping we need, and also optimizes epoch::pin.

Switching to Coco finally resolves the memory leak and leads to the following LazyTransform implementation:

Compared to Crossbeam, the differences are minor, and mostly to Coco’s advantage.

Where appropriate, defer_drop is used to drop the object in addition to the memory that it occupied. This eliminates the leak. The inconsistency regarding ptr::read is no longer present – when ptr::read is used to move the object out of the Coco-managed memory, defer_free is used in place of defer_drop.

epoch::pin no longer returns a guard, it now accepts a closure that will be run with the thread pinned to the current epoch (“active”). This makes no practical difference in our example, but might reduce readability of Crossbeam code that embedded flow control constructs such as return or break inside a pinned block.

Finally, accessing the value through a shared reference now requires an unsafe block. This is unfortunate, as pinning was explicitly designed to guarantee safety of such access. The problem was that such access was really safe only when memory orderings were correctly specified. As this was impossible to enforce statically, unsafe was introduced to eliminate a serious soundness issue in current Crossbeam.

Performance

After taking the trouble to write the code, it makes sense to measure it and see how much of a performance benefit Rust brings to the table. The Java version is admittedly much shorter (although not necessarily easier to devise) because it can rely on a volatile variable to achieve atomic access to an object. Likewise, memory reclamation is a non-issue because it is transparently handled by the GC. But surely this comes at a cost? Even with the advantage GC brings to lock-free code, Rust is a statically typed ahead-of-time compiled language specifically targeted for systems and high-performance programming.

The benchmark simulates a busy single producer thread that occasionally publishes a randomly generated value, and then remains busy spending CPU for several microseconds. At the same time, 8 consumer threads are continuously reading the transformed and (most of the time) cached value, trivially inspecting it in order to prevent a very clever compiler from optimizing away the whole loop. The whole benchmark is run three times, allowing the JVM to warm up the JIT, and also to make it easier to spot anomalies between runs.

For the Rust version, build it with cargo build --release and run target/release/bench.

For the Java version, cd to src/java, byte-compile it with javac *.java and run it with java Benchmark.

Results

On my 2012 desktop workstation with 3GHz Xeon W3550, the Java benchmark reports an average of 7.3 ns per getTransformed invocation. The Rust benchmark reports 128 ns in get_transformed, a whopping 17 times slower execution. These timings are in stark contrast with the original Crossbeam article which documents the lock-free queue implemented in Crossbeam as not only competitive with, but consistently faster than java.util.concurrent.ConcurrentLinkedQueue. What could explain such a performance difference in this case?

Let’s consider the most common “happy case” for get_transformed, when it simply returns the cached value. The Java version performs the following:

an atomic load of source with the sequentially consistent ordering (the docs define get as having “the memory effects of reading a volatile variable”, which is sequentially consistent in Java.)

if non-null, as it will be in the happy case, an atomic load of transformed.

So we have two atomic loads, a check against null, and looping overhead. The Rust version also performs two loads, a relaxed load of self.value and a sequentially consistent load of self.value. However, behind the scenes it additionally does the following:

Pin the epoch

Check the garbage bins for dead objects

Clone the cached String value, which allocates

In the get_transformed caller, destroy the cloned String, again using the allocator

For a start, using a String value that Rust clones and Java only returns by pointer would appear to favor Java. Since a typical payload object is expected to be a complex object, it would surely be more efficient to make Payload an Arc<String>. “Cloning” the payload will only increment a reference count and string allocations will be eliminated. However, making this change not only fails to pay off, it makes the code even slower, with an average get_transformed invocation now taking 290 ns!

Breakdown

To make sense of this measurement, I decided to strip down down get_transformed to its very basics, breaking its contract where necessary, just to see which part takes what time. Here are the findings, now measuring only the “happy case” obtained with PRODUCE_ITERS reduced to 1. Repeating the benchmark showed some variation in numbers, but not significant enough to change their overall meaning. Keep in mind that absolute figures are obtained on my old desktop; a modern computer would be significantly faster.

single unsafe relaxed load of u64 value: 3 ns

epoch::pin() + u64 payload: 26 ns

like the above, but payload that allocates, Box<u64>: 74 ns

Arc<u64> payload: 230 ns

String (3 chars): 95 ns

String (128 chars) -> 105 ns

String (1024 chars) -> 136 ns

Arc<String> (any string size) -> 231 ns

String (2048 chars) -> 280 ns

Pinning the epoch costs around 23 ns on my machine, 26 ns measurement minus the 3 ns load and some loop overhead. This is consistent with the documentation cautioning of 10-15 ns pin time on a modern computer. This is likely the only work done, as no allocation is needed, and the thread-local garbage bins are empty. The u64 payload we’re cloning is Copy, so its clone() just loads the primitive value. No garbage is generated in the “happy case” because neither the source nor the cached value are written to, only read.

One surprising finding is that atomic reference counting is expensive, especially so when there is high contention over access to the object. It is no wonder that Rust opted to implement a separate single-threaded reference-counted type – using atomics adds a large overhead to Arc::clone compared to Rc::clone. (Also confirmed by separately benchmarking that comparison only.) Compared to the cost of cloning an Arc, string allocation and copying are fantastically optimized. It takes strings of almost 2 kilobytes for String::clone to match the cost of contended Arc::clone. Most surprisingly, it turns out that a heap allocation is actually cheaper than increasing and decreasing an atomic reference count. Allocation time can be obtained by subtracting the Box<u64> timing from the u64 one, which pegs allocation at under 50ns, in an 8-thread scenario. jemalloc’s segmented locking seems very successful in this scenario.

It would appear that this kind of micro-benchmark favors GC-backed languages, although it’s hard to tell by how much. It would be interesting to extend it to include some sort of processing and test whether the cumulative cost of garbage collection elsewhere in the program tips the scale.

Conclusion

Implementing the exercise was an interesting learning experience in both Rust and lock-free programming. The material presented here of course only scratches the surface of the topic. Jeff Preshing’s articles provide a much more in-depth treatment and further references.

Within Rust, Crossbeam and its successor Coco provide a convenient way to implement custom lock-free algorithms. This convenience does come at a cost – the mere cost of pinning the epoch would make Rust’s get_transformed fast path 3-4 times slower than equivalent Java. The real challenge comes when sharing objects among threads. Timings show that lock-free Rust requires very careful allocation design, as the cost of memory management can easily dwarf the lock-free operations that were supposed to be chosen for their efficiency. Specifically, Arc is not a panacea and can even add significant overhead to performance-sensitive designs. If access to a reasonably small object is needed, it may actually be more efficient to clone the object than to expose it through Arc. If some portion of a large object needs to be accessed, it may be more efficient to temporarily expose the reference to the object to a closure provided by the caller, which can then pick up the information it needs. Whatever solution is chosen, there does not appear to be a silver bullet that would fit all kinds of objects.

Both Crossbeam and Coco require unsafe in a couple of key places, so they are not as elegant as the statically checked lock-based design offered by the standard library. The libraries themselves are not to blame here – it is a hard problem and might require additional research and possibly even support from the language to resolve satisfactorily. This area is still under active research, especially in the wider C++ community, and it will be interesting to follow how it will develop.

The previous article showed a mutex-based implementation of the LazyTransform value container. However, the exercise was explicit about implementing a “lock-free” container. What does that mean, exactly?

Limitations of locks

The interaction between LazyTransform methods was carefully designed to prevent deadlock, but in a busy system one could expect significant contention for the source mutex between get_transformed and set_source. This is not optimal; we would like set_source to just store the new data without any waiting at all, and get_transformed to keep returning the old cached value until a new one is observed.

Consider also the contention for the value read-write lock. After get_transformed performs the transformation, it must acquire the value lock in write mode, which means it must wait until all the readers release it. What is even worse is that once the value lock is finally acquired in write mode, other readers must wait for the writer to release it. Of course, the writer only holds the lock for as long as it takes to store new_value, which is an extremely efficient operation, but all readers are still blocked, and if the writing thread gets suspended by the OS scheduler, all readers will remain blocked until the writer thread resumes. The readers, on the other hand, hold the lock for as long as it takes to clone the value, which is not bounded in any way. What is common to all these cases is that progress of the entire system hinges on the behavior or scheduling of one thread, and this is what lock-free algorithms are designed to avoid.

In a lock-free system, the program will make progress as long as any of its threads are allowed to run. Being lock-free is not just about refraining from using the OS lock primitives, but about carefully designing the system so it cannot lock up, be it by deadlock or livelock or unfortunate thread scheduling patterns. As Jeff Preshing puts it, a lock-free application won’t lock up even with hypothetical thread scheduling decisions being made by your worst enemy.

Another reason to want to avoid mutexes is their cost on object size. On x86_64 Linux, a pthread_mutex_t, internally used by Rust’s Mutex, takes 40 bytes of memory, and a RwLock takes 56. To make Mutex movable (a requirement for all Rust types), the actual OS lock object is dynamically allocated. The smallest possible Rust mutex, a Mutex<()>, takes 16 bytes of memory for the object itself (8 for the pointer to the dynamically allocated OS mutex and 8 more for the poison flag). With each LazyTransform object requiring ~200 bytes of memory and three dynamic allocations just to store the locks, we would not want to have millions of such objects lying around. If we don’t really need the waiting functionality offered by a full-featured Mutex and RwLock, we would do well to eliminate them and significantly reduce the footprint of LazyTransform.

AtomicBool

For a start, let’s eliminate the transform_lock mutex, an easy target as it holds no data and is only ever acquired with try_lock. As used, this lock was already “lock-free” in the above sense, because owning it would never cause another thread to stall.

We could replace it with a bool, but then try_lock would be doomed to a race condition in the window between checking the previous lock state and changing it to “locked”. Not to mention that Rust would see this as a data race and the compiler would reject access to a bool without protecting it with a mutex. Fortunately, a mutex is not the only option for avoiding data races. Modern processors support a number of specialized instructions to read and update values atomically, including one that fetches a machine value and atomically set the new one. In Rust, these instructions are exposed as methods on types in the std::sync::atomic module. The stable channel provides AtomicBool, AtomicUsize, AtomicIsize, and AtomicPtr<T>. For transform_lock, we can use AtomicBool to store a flag indicating whether the lock is currently taken. The resulting try_lock would look like this:

swap atomically stores true into the lock and returns the previous value – the operation Java AtomicBoolean calls getAndSet. If the previous value was false, it means it was us that changed the value from false to true and thus successfully acquired the lock. Conversely, if the previous value was true, it means that the lock had been already taken by someone else, and our store of true was a no-op. The Ordering argument tells the compiler which memory order is requested for the operation. The memory order corresponds to guarantees that place constraints on how the compiler and the CPU can reorder data reads and writes, both ordinary and atomic ones, around one particular atomic read/write; see the nomicon for a detailed explanation. For transform_lock we use the acquire and release ordering, so named because of its usefulness in implementing locks. A store with the “release” ordering guarantees that all writes executed prior to the atomic store will be observable by any thread that observes the newly stored atomic value. It is used when releasing a lock by setting the “locked” flag to false with the “release” ordering, which signals to other threads that the lock is now available for the taking, but also ensures that the objects that have been updated while the lock was held are available for reading. Conversely, any code that observes the false value using the “acquire” ordering can be sure that its subsequent load of other locations will also observe the results of all writes executed before the “release” store.

Note that the AtomicBool::swap method operates on the object by &self shared reference. This is another example of the conceptual magic like the one employed by mutexes and described in the previous post. The difference is that with AtomicBool no waiting takes place, so we are not trusted to ever get a mutable reference to the underlying bool. Instead, we tell AtomicBool how to change the object, and it executes the change through hardware mechanisms that are guaranteed to be exclusive, which allows it to declare the mutating methods to take &self.

Lock abstraction

One disadvantage of directly using an AtomicBool compared to a mutex is that we don’t get a fancy guard that automatically unlocks the lock when going out of scope. Instead, we must remember to unlock it ourselves, which can be difficult to guarantee during panic. A robust implementation would wrap the AtomicBool into a lock object that returns a guard that implements Drop:

In accordance with Rust conventions, the method that attempts the lock is named try_lock. Our version returns an Option that tells the caller whether taking the lock succeeded and, if so, contains the guard object to automatically unlock it on exit. Rust’s Mutex::try_lock has a more sophisticated return value that allows for other kinds of errors, but the logic of using it is essentially the same.

Using this type, transform_lock gets declared as LightLock and initialized as LightLock::new(). get_transformed automatically becomes panic-safe and also cleaner and more maintainable because it no longer needs to remember to release the lock before every exit point:

Getting rid of one mutex wasn’t that hard. Can the same logic apply to other locks? transform_lock was distinctive in that it never guarded concrete data, it was only used to serialize access to the transformation of source into value. This ensured that source and value locks, which are actually exclusive in nature, were held as briefly as possible. Once the lock was obtained, the actual protection of source and value from access by different threads was left to their respective locks. To safely access stored values without locks, we will need some way to access the values atomically like we did with the bool stored by the AtomicBool.

Atomic arbitrary values

Ideally we would have at our disposal a generic atomic type, let’s call it AtomicCell<X>, that provides a swap method to atomically modify the shared value and retrieve the old, and load to only read the existing one. With source and value fields declared as AtomicCell<Option<T>> and AtomicCell<Option<S>> respectively, LazyTransform could look like this:

get_transformed is now split into two methods: get_transformed, which now only returns the cached value, and a non-public try_transform, which actually attempts the transformation. set_source atomically swaps in the new source, and drops the old one (by virtue of ignoring the return value of swap). try_transform does the exact same thing with the source, and also with value when it obtains a new one. When the old value needs to be read, load() is called to access a reference to the underlying value, which is cloned.

The problem with the above code is that AtomicCell doesn’t exist. Rust’s standard std::sync::atomic module is limited to atomic manipulation of pointer-sized values which correspond to the operands of the CPU’s atomic instructions. Thus the most promising building block for implementing an AtomicCell is the AtomicPtr type, which provides atomic access to a pointer. An AtomicCell<T> would be represented by an AtomicPtr<T> that references T stored dynamically. The dynamic allocation is unfortunate, but unavoidable if we ever want to support T larger than a pointer. Using a pointer, AtomicCell::swap could look like this:

AtomicCell::new simply creates a cell whose ptr points to a valid heap-allocated object, which is AtomicCell‘s invariant. The interesting part happens in swap.

In swap the new object, owned by the local variable received from the caller, is moved to the heap and swapped into AtomicPtr as a raw pointer. The old object is retrieved from the same call to AtomicPtr::swap, also as a raw pointer, immediately converted back to a Box whose contents are moved from it, and returned by value. The ownership of each object is clear at every point. Values stored in the AtomicPtr are (conceptually) owned by the AtomicPtr. When writing a value to the pointer, the ownership over the new value is relinquished to the AtomicPtr using Box::into_raw, and at the same time the ownership of the old value is reasserted with Box::from_raw.

This last part is also the first time we’ve had to use unsafe in the implementation. Box::into_raw is not unsafe because it only gives back a raw pointer, trusting the caller to arrange for the object to be eventually destroyed. Even if the caller fails to uphold their end of the bargain, the worst thing that can happen is a leak, already possible with std::mem::forget. Box::from_raw is unsafe because it reinterprets a raw pointer as a box without any proof that the raw pointer is unreachable from elsewhere or even valid. We, however, know both things to be true. The pointer must be valid because we are only ever passing it pointers created by Box::into_raw, as only such pointers are stored in the AtomicPtr. We also know that the object is not reachable from another thread because we have just obtained it from swap, meaning it has already been disowned by the previous owner. Thus our swap is sound and the unsafe block is a mere implementation detail. Taking ownership is not only safe, but necessary if we want to prevent the old value from leaking.

swap was easy. Now what about AtomicCell::load, which we also need to implement lock-free reading? Based on AtomicPtr::load, a load could be imagined like this:

This code is written in analogy to swap, only without modifying the existing. Unfortunately it is also unsound, which means that it breaks Rust’s ownership rules and causes a crash. The problem is that loadtakes ownership of the object behind the atomic pointer without also changing the pointer to point to something else. This leads to double free as soon as load is called again to observe the same pointer, or in any subsequent call to swap, which will attempt to take ownership of a freed value. And there is no easy way to fix it, either. For example, suppose we defined load() to return Option<&T>, thus avoiding taking ownership of the observed object. This would prevent double free, but it wouldn’t work on the type level because there is no way to express the lifetime of the returned reference. While multiple loads would no longer be an issue, a swap() could still invalidate the value while a reference is live. Immediately cloning the observed object and returning a new value is also unsound because the reference can be invalidated by a swap before T::clone finishes. The atomic_cell crate resolves the issue using a lock around clone(). Locking is performed with a spinlock, whose lock looks like LightLock::try_lock above, except it retries taking the lock in a busy loop until it succeeds. Despite its efficiency when the lock is uncontended, a spinlock is still a lock and it is not acceptable in the middle of a lock-free algorithm.

Ideally load() would return a guard that provided a reference to the underlying value, much like the guard returned by Mutex::lock. The existence of the guard will prevent writers like swap() from destroying the object. Instead, swap will store the new raw pointer into AtomicPtr and mark the old one for deletion, deferring its actual destruction until all the load guards have been dropped. It is far from obvious how to implement this as efficiently as possible, and of course without heavy synchronization, lest we lose the very benefits lock-free code was supposed to provide.

Lock-free access to arbitrary values with correct memory management is the topic of the next article.

As a learning exercise I set out to implement a simple lock-free algorithm in Rust. It was inspired by a problem posed at job interviews at a company where a friend works. The problem is simple enough that it can be tackled by a beginner, but tricky enough to require some thought to get right – and Rust presents several new challenges compared to the original Java.

This series of articles presents the evolution of a simple Rust lock-free container starting from single-threaded, progressing to a multi-threaded variant with locks, and finally settling on a lock-free implementation, discussing the trade-offs at each step. A basic understanding of Rust and of multi-threading programming is assumed, but the articles might be useful to adventurous beginners at both. Do note, however, that the author is not an expert at lock-free programming, so there might be errors – if you find some, please do leave a comment.

The Exercise

Implement a class providing access to a cached value. In more details:

Write a LazyTransform class that stores a single value, allowing it to be updated as needed. The value is potentially expensive to compute, so the setter method, set_source receives a “source” that will be used to compute the final value using a transformation function received in the LazyTransform constructor. Transformation must not be attempted until requested by get_transformed. Once generated, the value is cached and returned by further invocations of get_transformed, until invalidated by a new call to set_source.

A single-threaded version can be summarized with the following Python:

The class must support being called from multiple threads, with the following semantics:

set_source and get_transformed can and will be called on the same LazyTransformer instance in parallel;

Once set_source completes, future invocations of get_transformed must eventually start returning the new value.

Read-heavy usage pattern is expected, so get_transformed must not block regardless of how many times set_source or get_transformed are called in other threads. The one exception is when a new source is detected – it is allowed for get_transformed to block until the transformation finishes before returning the transformed value (and caching it for future calls).

The code must be lock-free: neither set_source nor get_transformed should get stuck waiting on each other, even if they are called in quick succession or in parallel by many threads, or both.

Rust API tradeoffs

Before proceeding to parallelization, let’s review how the above interface would map to Rust’s type system. Ideally we’d want to place as few restrictions as possible on the type of values used for the source and the final objects; either could be as simple as a single u32 or a huge heap-allocated object. We know, for example, that both the source and the value type must be Send, because they need to be accessed from threads different from those that create them.

Another necessary restriction is that the final value type must be Clone. Why? Consider how the concept of “returning of cached value”, the return self.value line in the above Python, maps to Rust. In Python the semantics are clear because all of its objects are heap-allocated, and you always get the same instance shared. This is also the specified by the original Java exercise, which returns an Object. But a correct Rust implementation needs to deal with an actual value stored in the cache, and has three options for returning it:

move the object out of the container, typically by making the value field an Option and returning self.value.take();

return a reference to the object, return &self.value;

clone the object and return the cloned value to the caller.

The first option obviously doesn’t work because it would prevent get_transformed to return the cached value more than once. The second option looks feasible until one considers that the returned reference cannot be allowed to outlive the stored value. Since the stored value can be invalidated by a call to set_source, which can happen literally at any time, it is clear that allowing a reference to be returned would be unsound. Indeed, all such attempts are promptly rejected by the borrow checker.

Although cloning at first appears like it would be inefficient for arbitrary objects, it actually provides the greatest flexibility for the user. Light values, such as numeric IDs which are Copy (and hence also Clone), or small strings which are cheap to clone, can be placed in the cache as-is. Heavy values, on the other hand, can be dynamically allocated and accessed as Arc<ActualData>, ensuring that their clone only increments a reference count, providing the semantics one would expect from equivalent Python or Java. If needed, one can even combine the two and store a tuple of a light object and a heavy one.

So, LazyTransform needs to be generic on the value type (T) and the source type (S). But let’s not forget the transformation function received by the constructor. Fixing its type to fn(S) -> T would limit it to stateless functions, and we would like the user to be able to provide an arbitrary closure for transformation. One option would be to accept a generic function object in the constructor and box it in a Box<Fn(S) -> T>, but that would impose a dynamic allocation on each LazyTransform instance, as well as an indirection when invoking the function. If the transformation function is known at compile time and carries no state, it should incur neither storage nor run-time indirection overhead. This is easily achieved by adding a third type parameter, that of the transformation function type.

As a demonstration of the API, here is a single-threaded implementation of the container:

In spirit this is exactly the same thing as the original Python, except sprinkled with a healthy dose of static typing.

Compile-time thread safety

What happens if we try to share an instance of LazyTransform among threads? Rust will prevent that at compile time — invoking a &mut method from multiple threads would require creating multiple &mut references to the same object, which is prevented by the borrow checker. For example, the following doesn’t compile:

lt gets moved into the closure executed by the new thread, but then it is no longer available for use by the main thread. Sending it by reference wouldn’t work because there can exist only one &mut reference to an object, so we wouldn’t be allowed to send the same reference to multiple threads. Allocating LazyTransform dynamically and using Arc to share it among threads wouldn’t help either because Arc only provides shared access to the data it owns.

In Rust, supporting parallel access to a container requires not only a change in implementation, but also in method signatures. This is an intentional design decision – while Python or Java single-threaded code will happily execute when called from multiple threads, providing incorrect results, the Rust version will refuse to compile when the thread-unsafe LazyTransform object is accessed from two threads.

The compiler uses simple rules to decide whether it is safe to share an object between threads:

Methods invoked from more than one thread must accept &self rather than &mut self. This rule is enforced by the borrow checker for single-threaded code as well.

The object must not contain values of types specifically blacklisted for multi-threaded access even through shared references. In Rust terms, its type must “be Sync”, meaning it implements the Sync marker trait, which most objects do. Examples of non-Sync types are Rc or Cell, and both have thread-safe equivalents.

At a glance, the first rule seems to rule out LazyTransform as a multi-threaded type. Both its public methods clearly modify the object, with set_source even doing that in a way that is observable from the outside. Changing the signatures to accept &self instead of &mut self fails to compile because both methods modify the data behind the &self shared reference, which is prohibited. Accessing an object inside &self will also result in further shared references that are read-only.

To modify data, we must find a way to obtain an exclusive mutable reference from the shared reference to self. This is not allowed for ordinary objects because the compiler would have no way to ensure that writes are exclusive, i.e. that while a thread holds a mutable reference to a value, no other thread can read it or write to it. However, if we could statically convince Rust that the reference’s ownership of the data will be exclusive, it would be within the rules to allow the conversion. This is where mutexes come in.

Mutexes

Rust’s Mutex type provides read-write access to the value it protects, using the appropriate operating system primitive to ensure that this can be done by only one thread at a time. Here is an implementation of LazyTransform updated to use a mutex:

Both methods now operate on &self, relying on the mutex to obtain write access to the data in self.state. As far as method signatures are concerned, this is the final version of the API – all future versions will only differ in implementation.

The storage is now split into transform_fn, which is itself immutable and can be invoked from a shared reference, and state, the mutable part of the object’s state moved to a separate struct and enclosed in a mutex. As can be seen here, Rust’s Mutex is a container that holds and owns the data it protects. While that coupling looks strange at first, it enables the mutex to safely grant read-write access to the data it owns.

Calling Mutex::lock() waits until an exclusive OS lock is acquired, then returns a “guard” object, that both LazyTransform methods store in a local variable called state. The mutex will not be unlocked until the guard goes out of scope. Therefore the existence of a live guard represents a proof that the mutex is locked and therefore provides read-write access to the underlying data.

In Rust’s twist on mutex semantics, the very meaning of the act of locking a mutex is obtaining temporary exclusive write access to its data through a temporary guard object. Despite self being a shared reference, a successful self.state.lock() grants access to &mut LazyState that may last for as long as the mutex is locked (guard exists) and no more. This is the crux of the way Rust prevents data races through static analysis.

Other than the curious mutex design, there is nothing really interesting about the code itself. Once the mutex is locked, both functions do exactly the same thing that their single-threaded counterparts did. While this code is thread-safe in the sense Rust promises, i.e. free from data races, it is still very far from being efficient when invoked in parallel, even ignoring the stringent lock-free requirements. In particular, get_transformed is extremely inefficient in a read-heavy scenario because each call blocks all other calls even when set_source isn’t called at all. When a transformation is in progress, all the other readers are blocked until it is finished.

Fine-grained locking

To minimize the amount of time spent waiting, we can take advantage of the following facts:

The methods are operating on two distinct pieces of data, source and value. set_source, for example, doesn’t access value at all. The two fields can be protected with different locks.

get_transformed has two distinct modes of operation: a fast one when it only returns the cached value, and the slow one when it detects that the source has changed and it needs to calculate the new value. The vast majority of calls to get_transformed can be expected to belong to the “fast” scenario.

Here is an implementation that uses finer-grained locking to ensure that readers don’t wait for either writers or other readers:

In this implementation there is no longer a “state” structure protected by a coarse mutex, we are back to individual fields. The source field is protected by its own mutex and the value field is protected by a separate RwLock, which is like a mutex, except it allows read access by multiple concurrent readers that don’t block each other. Finally, a new transform_lock field doesn’t protect any particular piece of data, it serves as something resembling a conventional mutex.

set_source locks the source mutex and replaces the source with the new value. It assigns to *locked_source because locked_source is just the variable holding the guard, and assigning Option<S> to it would be a type error. Since the guard provides automatic access to &mut Option<S>, *locked_source at the left-hand side of the assignment serves to both coerce the guard to &mut Option<S> (returned by guard’s implementation of DerefMut::deref_mut) and at the same time to dereference it, so that the value behind the reference is replaced with the new one.

get_transformed is more sophisticated. It first ensures that only a single call attempts to interact with the writer at one time. This is for two reasons: first, to avoid set_source being “attacked” by a potentially large number of readers in a read-heavy scenario. Second, we want to prevent more than one transformation happening in parallel, which would require the result of one expensive transformation to be thrown away. The synchronization is implemented using try_lock, which immediately returns if the lock could not be obtained. In case of failure to lock, get_transformed gives up and returns the cached value, which meets its requirements. If it acquires transform_lock, it proceeds to check whether a new source is available, again with a try_lock and a fallback to returning the cached value. This ensures that get_transformed gets out of the way of set_source as much as possible. If it acquires the source lock, it uses Option::take() to grab the new value, leaving None in its place. If the captured source is not None, meaning a new source was published since the last check, get_transformed performs the transformation, caches its result, and returns a copy.

get_transformed uses a RwLock to ensure that readers don’t wait for each other, but that the update is exclusive. RwLock nicely maps to Rust’s ownership system by RwLock::read returning a guard that provides shared reference to the underlying data, and RwLock::write returning a guard that provides a mutable reference.

This implementation is about as good as it can get with the use of locks. The problem statement, however, requires get_transformed and set_source not to block each other regardless of how often they are invoked. The above implementation will attempt an exclusive lock of source just to check if a new source has appeared. When this lock succeeds, set_source will be blocked for the duration of the lock. In a read-heavy scenario, get_transformed will be called often and by many different threads, and it is easy to imagine it hogging the lock enough to slow down set_source, which must wait to acquire the lock (it cannot use try_lock) in order to do its job.

Have a critical cron’d automated task? You’d like to be notified if something fails? With the ubiquity of smartphones, you can notice an error right away and take action.Wow, someone wrote to me! It’s from someone named “mdadm”. Must be spam again!

Computers sending emails for various purposes is nothing new. I have a couple of critical cron jobs on my home computer; syncing the family photos to my remote server, backing up the said remote server to my local computer, etc. These are all tasks that are defined in the daily crontab, and without a proper or any alerting system, something can go wrong and you can really find yourself in a bind if it turns out the backup procedure died months ago because the ssh key changed or something. You can either check the backup or automated task every single day to make sure nothing went wrong, or you can setup a robust alerting system that will send you an email if something goes wrong. This is not the only use case, Mdadm can also send you an email if a disk drops from a RAID array etc.

Setting up a Gmail relay system with Postfix

Installing and managing an email service is difficult, and you have to contend with all sorts of issues, is your server blacklisted, do you have the appropriate SPF records, is your IP reverse resolvable to the domain name etc, etc. Most of these requirements are difficult or impossible with a simple home computer behind a router without an FQDN. With the relay, you’ll be able to send an email without having to worry if it’ll end up in spam, or not delivered at all as it will be sent from a real Gmail account. Luckily, it’s extremely simple to set it up:

If everything went fine, you should be getting the email promptly from your new gmail account. I haven’t tried with other email providers, but it should all be pretty much the same.

Usage example

Now that you have a working relay, it’s time to put it to good use. Here is a simple template script with two key functions that can be sourced through Bash so you can use it within other scripts without having to copy & paste it around.

flock(1) is used to make sure there is only one instance of the script running, and we’re checking the exit status of the commands. If you don’t need instance locking, you can simply forego the lock_failure function. The actual work is contained in another subshell which is terminated if any of the commands in the chain fail and sends an email advising you to check $LOG.

Conclusion

A lot of Linux services like Mdadm or S.M.A.R.T. have a feature to send emails if something goes wrong. For example, I set it up to send me an email if a drive fails inside my RAID 1 array, I just had to enter my email address in a variable called MAILADDR in the mdadm.conf file. A couple of days later, I received an email at 7 AM; ooooh someone emailed me. I had a rude awakening, it was Mr. Mdadm saying that I have a degraded array. It turned out to be the SATA cabling that was at fault, but still. This could have gone unnoticed for who knows how long and if the other disk from the RAID 1 failed later on, I could have had serious data loss. If you want to keep your data long term you can’t take any chances, you need to know if your RAID has blown up and not rely on yourself to check it out periodically, you won’t, you can’t, that’s why we automate.

Be careful when you write these programs. If your script is buggy and starts sending a lot of emails at once for no good reason, Gmail will block your ass faster than you can say “Linux rules!” If you’re blocked by Gmail, you might miss an important email from your computer.

In a comment on my answer to a StackOverflow question about callbacks in Rust, the commenter asked why it is necessary to specify 'static lifetime when boxing closures. The code in the answer looks similar to this:

It seems redundant to specify the lifetime of a boxed object that we already own. In other places when we create a Box<T>, we don’t need to add 'static to T’s trait bounds. But without the 'static bound the code fails to compile, complaining that “the parameter type CB may not live long enough.” This is a strange error – normally the borrow checker complains of an object not living long enough, but here it specifically refers to the type.

Let’s say Processor::set_callback compiled without the lifetime trait on Fn(). In that case the following usage would be legal as well:

When analyzing set_callback, Rust notices that the returned box could easily outlive the data referenced by the CB closure and requires a harder lifetime bound, even helpfully suggesting 'static as a safe choice. If we add 'static to the bound of CB, set_callback compiles, but crash_rust predictably doesn’t. In case the desire was not to actually crash Rust, it is easy to fix the closure simply by adding move in front of it closure, as is is again helpfully suggested by the compiler. Moving s into the closure makes the closure own it, and it will not be destroyed for as long as the closure is kept alive.

This also explains the error message – it is not c that may not live long enough, it is the references captured by the arbitrary CB closure type. The 'static bound ensures the closure is only allowed to refer to static data which by definition outlives everything. The downside is that it becomes impossible for the closure to refer to any non-static data, even one that outlives Processor. Fixing the closure by moving all captured values inside it is not always possible, sometimes we want the closure to capture by reference because we also need the value elsewhere. For example, we would like the following to compile: