I have always been interested in concurrent programming through the use of
threads. This inclination did not originate from a need to exploit
hardware-level parallelism for performance reasons, but rather from my belief
that threads establish a model of reasoning about your code that encapsulates
concurrency in the overall program flow in a more convenient and natural way. I
cannot express the relief I felt when I first ported an asynchronous piece of
networking code to PThreads. But my joy did not last long: context switching,
memory and cache-miss costs, and, more importantly, the physical constraints
that bound the number of threads were waiting for me at the end of the tunnel.
Luckily, it did not take much time to discover the notion of light-weight
processes, which later introduced me to Protothreads. That being said, I must
admit that I fell back to my single-threaded asynchronous event loop many times
rather than maintain a Protothreads-flavored PThreads spaghetti. I suppose
every programmer has at some point dreamed of light-weight threads that map to
a multitude of physical CPU cores via some sort of mechanism in the background;
the many similar efforts in the literature (Plan 9 rfork, Solaris LWP, NetBSD
lwp_create) back this claim.

Threads, Coroutines, and Fibers

The heart of the problem lies in the question of mapping light-weight threads
to physical processor cores, given that the operating system has no prior
knowledge of what is going on at the application level. As an inevitable
consequence of this problem, developers came up with the idea of providing
application-level constructs to hint the underlying process scheduling
mechanism about the concurrency operating at the application level. Among many
other attempts, coroutines and fibers are the two best-known examples to
emanate from this research. Quoting from Wikipedia,

Fibers describe essentially the same concept as coroutines. The distinction,
if there is any, is that coroutines are a language-level construct, a form of
control flow, while fibers are a systems-level construct, viewed as threads
that happen not to run concurrently. Priority is contentious; fibers may be
viewed as an implementation of coroutines, or as a substrate on which to
implement coroutines.

If I were to restate the given description from a more naive perspective
targeting the Java Virtual Machine: coroutines are language-level constructs
that form a way of cooperatively executing multiple tasks, whereas fibers are
light-weight threads scheduled preemptively by the virtual machine. There is a
set of coroutine implementations available for the JVM, but in this post I
will stick with two particular fiber implementations: Akka from Typesafe and
Quasar from Parallel Universe.

Akka and Quasar

Last year I had plenty of time to play around with Akka during The Principles
of Reactive Programming course. Claiming that Akka is just a toolkit to
realize fibers on the JVM would be a fairly incomplete definition; it is much
more than that. It is designed for resiliency from top to bottom, and offers a
gigantic toolbox of common messaging pattern shortcuts, adaptive load
balancing, routing, partitioning, configuration-driven remoting, and so on,
just to name a few. Additionally, one should note that it is battle-tested
with real-world workloads, both as a standalone application and as a component
of the Typesafe Platform.

On the other hand, Quasar is a relatively new player in this league. That
being said, it is implemented as a compact library with, in my opinion, a
fairly sufficient set of features, though it still has a long way to go to
catch Akka in terms of out-of-the-box functionality. In addition, while Akka
provides just actor abstractions at the programming-language level, Quasar
ships both actors and fibers. (Actually, Quasar actors are built on top of
fibers.) Quoting from Quasar core developer Ron Pressler's post:

Quasar and Akka provide completely different abstractions. Quasar provides
fibers while Akka provides an asynchronous-programming framework based on
callbacks. Both libraries then implement actors using their respective
abstractions. […] fibers are lightweight threads, which
means that their most important property – even before being light-weight –
is being threads. The thread abstraction is a series of sequential
computations with the ability to block – on IO or synchronization. Callbacks,
on the other hand, are a completely different abstraction. Akka’s callbacks
(and therefore actors), are not allowed to block, and therefore do not provide
a thread abstraction; hence they’re not fibers.

Quasar’s main design goal was the desire to make the simple, familiar,
threading abstraction (i.e., synchronous, blocking code) cheap, as kernel
threads are expensive, and blocking a kernel thread carries a lot of overhead.
Quasar is used to run servlets that serve hundreds of thousands of concurrent
requests – instead of mere thousands – all without changing your existing
Java code. We want developers get the full performance benefits of
asynchronous code – which Akka also offers – while keeping their APIs and
synchronous programming model – which Akka certainly doesn’t. Quasar makes
threading cheap, while Akka abandons the thread abstraction altogether. In
that respect, Quasar is a lot more similar to Erlang and Go (which, of course,
provided the inspiration to Quasar), which also use the thread abstraction but
provide a lightweight thread implementation.

Akka’s design goals are different, and we think Quasar is far simpler to use
than Akka, requires less mental overhead and far-less re-engineering, while at
the same time being more feature-rich.
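The blocking-versus-callback distinction in the quote above can be illustrated with plain JDK primitives, independently of either library. The sketch below is my own (not from the original post) and uses `CompletableFuture`: the first style blocks the calling thread until the value is ready, just as fiber code blocks a fiber; the second registers a callback and blocks no thread.

```java
import java.util.concurrent.CompletableFuture;

public class BlockingVsCallback {

    // Simulate an asynchronous IO operation completing on another thread.
    static CompletableFuture<Integer> fetch() {
        return CompletableFuture.supplyAsync(() -> 42);
    }

    public static void main(String[] args) throws Exception {
        // Thread abstraction: sequential code that simply blocks for the result.
        int value = fetch().get();
        System.out.println("blocking got " + value);

        // Callback abstraction: the continuation runs whenever the value arrives;
        // no thread sits blocked waiting for it. (The trailing get() is only
        // there so the demo does not exit before the callback fires.)
        fetch().thenAccept(v -> System.out.println("callback got " + v)).get();
    }
}
```

The difference is purely in the programming model: fibers make the first style cheap, while callback frameworks force the second.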

The Benchmark

In order to get an idea of how the two implementations compare in terms of
performance, I put together a small benchmark suite
(fiber-test) that employs both Akka (2.3.5)
and Quasar (0.6.0) for the well-known
thread-ring problem. (A number of threads are spawned and connected in a ring
structure; through this ring a message – an integer comprising 4 bytes – is
circulated until a given number of message passings has been made.) In
addition, I included a version of the benchmark that uses Java Threads to
establish a baseline.
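For reference, the thread-ring structure can be sketched in plain Java. This is a toy version with made-up small parameters, not the actual fiber-test code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;

public class ThreadRing {
    public static void main(String[] args) throws Exception {
        final int workers = 5;    // the real benchmark uses 503 threads
        final int hops = 20;      // and 1e7 message passings
        @SuppressWarnings("unchecked")
        SynchronousQueue<Integer>[] ring = new SynchronousQueue[workers];
        for (int i = 0; i < workers; i++) ring[i] = new SynchronousQueue<>();
        CountDownLatch done = new CountDownLatch(1);

        for (int i = 0; i < workers; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        int token = ring[id].take();        // block until the token arrives
                        if (token == 0) {                   // countdown exhausted: stop
                            System.out.println("token stopped at worker " + id);
                            done.countDown();
                            return;
                        }
                        ring[(id + 1) % workers].put(token - 1); // pass it along the ring
                    }
                } catch (InterruptedException ignored) { }
            });
            t.setDaemon(true);    // let the JVM exit while the others still block on take()
            t.start();
        }

        ring[0].put(hops);        // inject the token
        done.await();
    }
}
```

Every hop is a blocking handoff between two threads, which is exactly what makes the benchmark punishing for kernel threads and cheap for fibers.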

In fiber-test, you will observe that
benchmark.sh
employs taskset to bind the JVM to a given
set of CPUs on the system. However, for a small number (1-2) of CPUs, Quasar
freezes randomly and Java Threads become totally unresponsive. Hence, I
dropped taskset for the figures presented here.
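For completeness, pinning a JVM with taskset looks roughly like this (the jar path is made up for illustration; see benchmark.sh in fiber-test for the real invocation):

```shell
# Restrict the JVM (or any command) to CPUs 0-3; taskset ships with util-linux.
taskset -c 0-3 java -jar target/benchmarks.jar
```

The kernel scheduler then only ever places the process's threads on the listed cores, which is handy for studying scaling behavior at a fixed core count.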

Tests were run on an Ubuntu GNU/Linux 14.04 (LTS) system with a 6-physical-core
Intel(R) Xeon(R) E5645 2.40GHz processor, using a 64-bit Java HotSpot VM
(build 1.8.0_20-ea-b05). JMH is set to run 5+5
warmup and measurement iterations using 3 JVM forks. There are 503 threads in
each benchmark and 1e7 message passings. Results are in milliseconds per
benchmark.

| Fiber Impl.   | Min.      | Avg.           | Max.      | Stddev.  | 99.9% Confidence Interval |
|---------------|-----------|----------------|-----------|----------|---------------------------|
| Quasar Fibers | 695.613   | 721.457 (29x)  | 758.231   | 20.796   | [699.226, 743.689]        |
| Akka Actors   | 856.807   | 911.295 (23x)  | 963.403   | 34.345   | [874.578, 948.012]        |
| Quasar Actors | 1553.224  | 1606.718 (13x) | 1660.329  | 35.756   | [1568.492, 1644.943]      |
| Java Threads  | 15730.028 | 20709.233 (1x) | 33084.117 | 4338.423 | [16071.196, 25347.270]    |

(Speedups in parentheses are relative to the Java Threads average.)

The results show that even in the worst case (which corresponds to Quasar
Actors in the figures) the fiber implementations on average perform 13 times
better than native Java threads. In addition, Akka Actors and Quasar Fibers
improve this figure to 23x and 29x, respectively. All that being said, I could
not explain the performance difference between Quasar Fibers and Quasar
Actors, given that actors are implemented using fibers in Quasar. In order to
investigate further, I profiled the benchmarks using
-agentlib:hprof=cpu=samples,interval=20,depth=3. (See the
HPROF
documentation for further information on the parameters.)

As anticipated, Java Threads were bitten by the park/unpark methods in
sun.misc.Unsafe, whereas Akka Actors and Quasar Fibers perform some sort of
interleaving between routines that avoids these calls, and that totally rocks.
Unfortunately, it turns out that the channel mechanisms (e.g.,
co.paralleluniverse.strands.channels.SingleConsumerQueueChannel) in Quasar
Actors somehow cause the Quasar core to issue park calls between the message
receivers, and these consume the majority of the fiber runtime.
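For context, park/unpark is the JDK's basic thread-blocking primitive: each park deschedules the kernel thread and each unpark wakes it back up, at the cost of a context switch. It is exposed publicly via java.util.concurrent.locks.LockSupport (which delegates to sun.misc.Unsafe), so a minimal illustration of the pair the profiler keeps pointing at is:

```java
import java.util.concurrent.locks.LockSupport;

public class ParkDemo {
    public static void main(String[] args) throws Exception {
        Thread main = Thread.currentThread();
        // Another thread grants the parked thread its wake-up permit.
        Thread waker = new Thread(() -> LockSupport.unpark(main));
        waker.start();

        // park() puts this kernel thread to sleep until it is unparked.
        // (If the unpark happens first, the stored permit makes park return
        // immediately, so this handoff is race-free.)
        LockSupport.park();
        System.out.println("woken up");
    }
}
```

Every message handoff in the plain Java Threads benchmark pays for such a sleep/wake round trip, which is what fibers manage to avoid by interleaving runnable continuations on a small thread pool.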

The Conclusion

Observing that fiber implementations perform better than standard Java Threads
is nothing new. However, seeing Quasar Fibers deliver such impressive
performance is not something I was expecting, considering that the project is
in its very early stages. That being said, it appears that the channels in
Quasar Actors have some room for improvement and might borrow ideas from their
Akka counterparts. Overall, both fiber implementations show stable and
noteworthy performance figures and stand as a perfect replacement for standard
Java Threads in a considerable number of use cases.