WARNING: This text is deprecated and refers to an old version of ØMQ. It remains here for historical interest. DO NOT USE THIS TO LEARN ØMQ.

HISTORICAL WHITEPAPER

Introduction

This document describes the design decisions made for the ØMQ lightweight messaging kernel (version 0.1) and the rationale behind them. It then outlines areas for further development.

Scope of work

Version 0.1 of ØMQ lightweight messaging kernel is not intended to be a full-blown messaging software. Rather than providing extensive featureset we've focused on proving that the goals of the project - in terms of throughput and latency - are achievable.

Given this goal we've divided the code into "code irrelevant to performance" and "code relevant to performance" and cut the former off completely. The cut off code encompasses all the functionality that is not used during message passing proper. For example, connection initiation, authentication and queue management fall into "irrelevant to performance" category. On the other hand, all the functionality involved in actual message passing is retained. Specifically, passing messages via underlying transport and thread synchronization are the areas that - in our experience - account for 99% of the performance, so these two were elaborated on very carefully.

However, there are two areas not covered in version 0.1 that should in fact fall into the "relevant to performance" category. The following paragraphs discuss them in detail:

The first is message routing. Complex routing, such as routing based on regular expressions or full-blown content-based routing, can take a significant amount of time. Even though we are able to do the routing quite efficiently using the inverted bitmap technique, it can lower message throughput to several tens of thousands of messages per second.

However, we expect most routing to be much simpler than that, ideally requiring no more than a single memory access and thus having no measurable impact on performance.

The second issue is persistence. Once you start persisting messages, your throughput will drop very significantly. Persistence requires access to slow resources like databases or hard disks and therefore cannot be made much faster. Once persistence is involved, the performance of messaging software boils down to the performance of the persistence medium, making optimisation of the messaging software almost irrelevant.

The other way to handle persistence is to process it in parallel with the message transfer. Doing so lowers the overall reliability of the system, but it allows messages to be persisted with zero impact on messaging performance.

So, the persistence can have either devastating or no effect on the messaging system. In either case there is little point in including the functionality into version 0.1.

Is it good for anything?

Given the restricted functionality of version 0.1 of the ØMQ lightweight messaging kernel, is there anything aside from running the test suite that it can be used for?

Actually, yes.

You can think of the product as a kind of better TCP/IP that allows you to connect two applications in the same manner as TCP does, with the following differences:

The transport is message-based rather than stream-based, meaning that you are transporting well-delimited chunks of data rather than raw stream of bytes.

The message throughput is several times higher for ØMQ than it would be for straightforward passing of size-prefixed chunks of data over TCP/IP.

The messages are queued rather than blocking execution once the TCP buffers are full. ØMQ allows you to continue with processing even if messages cannot be sent immediately.

ØMQ scales better on multicore systems. By having worker thread to deal with actual transfer of the data, the transfer can be accomplished on a dedicated CPU core. As a consequence, your message producing/processing application running on a different core isn't slowed down by the actual sending/receiving of the message data.

Version 0.1 is designed with features to appear in the future releases in mind. This code is the foundation for future functionality like message routing, one-to-many transfer, scaling to unlimited number of CPU cores etc.

Supported platforms

Although ØMQ lightweight messaging kernel is intended to be a multiplatform application, version 0.1 can be compiled only on POSIX-compliant systems and it is performance-tuned only for Linux.

What's the point? Using some kind of portability library (like APR) would require no extra effort and it would guarantee portability of the product, free of charge.

Well, there's a good reason not to use a portability layer. The reason is that portability library - in principle - has to provide unified system API to the client. Unifying the API means that the lowest common denominator of functionality available on all the supported systems is wrapped and exposed to the client in a standardized way. However, the lowest common denominator functionality is in no way the best in the terms of performance. What yields a nice performance on Linux may perform poorly on Windows. Fancy features allowing to boost the performance on Solaris may not be available on Linux and therefore won't make their way into the portability library. Some features, although standardized - say asynchronous I/O or unnamed semaphores - may not be implemented on some systems and therefore either can't be implemented in a portable manner or require inefficient workarounds.

The solution we've chosen for ØMQ is to separate the functionality into three areas:

functionality that is irrelevant to performance

functionality that is relevant to performance

functionality that is extremely relevant to performance

The first area is almost everything not directly involved in actual message passing: management functionality, connection initiation, startup & shutdown code etc. Functionality in this area can be implemented using a standard portability library. The performance drawbacks are irrelevant here and having just one codebase for all the platforms makes development less resource consuming and the code much more manageable.

The second area consists of all the code that has to be executed when a message is sent from point A to point B. We cannot rely on a portability library here for the reasons explained above. We have to have a separate codebase for each operating system (or at least for each operating system family). However, we should try to keep this platform-dependent code as short as possible. The intent is to keep it around 500 lines of code per system.

The third area consists of the pieces of code that are stressed the most and that create the worst bottlenecks in the system. These have to be optimized at micro-architecture level and therefore should have different implementations for different processors or processor families (say i386 vs. x86_64). We want to keep this code restricted to individual lines or at most tens of lines.

As for ØMQ version 0.1, the code we present is heavily focused on performance and therefore falls almost entirely into the latter two categories. Most of the code is optimized for a specific operating system (Linux); however, there are a few spots optimized for a specific micro-architecture (32-bit and 64-bit Intel CPUs). The consequence is that there is almost no "irrelevant to performance" category code that would benefit from using a portability library. Therefore, to keep the build and installation processes simple, we've discarded the portability layer and decided to run only on Linux.

Note: Actually, we've been able to compile and run the code on Mac OS X, but we haven't done extensive testing and tuning on this platform yet.

Design

Wire format

The connection implemented by version 0.1 of the ØMQ kernel is what we call a "backend connection". For a full discussion of what the term means, see the messaging enabled network whitepaper. The backend connection is referred to as a type "B" connection in that whitepaper.

Given that backend connection is clearly separated from standard front-end AMQP connection (type "A" connection) we are free to devise a new wire format for it, one that is more favorable for high-performance messaging than AMQP is.

It should be noted that during AMQP development the following problems were deliberately considered out of scope:

Transfer of very small messages (market data). AMQP is designed for messages several hundred bytes long.

Message batching. AMQP has no support for batching several messages into a single unit. Although this is not obvious with AMQP-over-TCP, it becomes clear with AMQP-over-SCTP.

First of all we've asked a simple question: What would the maximal message throughput for different message header sizes be? In other words, how does the amount of additional data that has to be passed with every message affect the number of messages that can be physically passed via say 1Gb Ethernet?

The following diagram shows maximal possible message throughputs on 1Gb network for different message sizes and header sizes of 1, 2, 4 and 8 bytes. Additionally, it shows maximal throughputs for AMQP messages (both with minimal header and average-sized header).

Message sizes up to 10 bytes are not shown on the diagram as these sizes are not very common in the real world; however, the differences in throughput for different header sizes are extremely high (1000% or so) for such messages. Even for sensibly small messages (10-50 bytes), the differences are considerable - they can range up to millions of messages a second.
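The relationship behind the diagram can be approximated with simple arithmetic. The following sketch is an illustration, not the original measurement code. It assumes an ideal 1 Gb/s link (10^9 bits per second) and ignores Ethernet, IP and TCP framing overhead, so the figures are upper bounds only. The name `max_throughput` is hypothetical.

```python
LINK_BITS_PER_SECOND = 1_000_000_000  # assumption: ideal 1 Gb/s link


def max_throughput(body_size, header_size, link_bps=LINK_BITS_PER_SECOND):
    """Upper bound on messages per second, ignoring lower-layer overhead."""
    bytes_per_second = link_bps / 8
    return bytes_per_second / (header_size + body_size)


if __name__ == "__main__":
    # Compare the four header sizes discussed in the text for small bodies.
    for body in (10, 20, 50):
        row = ", ".join(
            f"{hdr}B header: {max_throughput(body, hdr) / 1e6:.2f}M msg/s"
            for hdr in (1, 2, 4, 8)
        )
        print(f"{body:>3}-byte body -> {row}")
```

Under these assumptions a 10-byte body gives roughly 11.4 million messages per second with a 1-byte header and roughly 6.9 million with an 8-byte header, which is in line with the "millions of messages a second" difference mentioned above.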

To get best possible throughput we've opted for 1 byte long header, i.e. single byte length followed by the message body.

To be able to pass messages of 255 bytes or more, there is an escape mechanism built into the ØMQ kernel. A length byte of 255 (0xFF) means that the following 8 bytes should be interpreted as the body size (in network byte order), followed by the message body itself.
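A minimal sketch of this framing, written from the description above rather than taken from the ØMQ sources. The function names `encode` and `decode` are hypothetical, and the sketch assumes that a body of exactly 255 bytes uses the long form, since the value 0xFF is reserved as the escape.

```python
import struct

ESCAPE = 0xFF  # length byte announcing that an 8-byte length field follows


def encode(body: bytes) -> bytes:
    """Frame a message body with a 1-byte or a 9-byte header."""
    if len(body) < ESCAPE:
        return bytes([len(body)]) + body
    # "!Q" packs an unsigned 64-bit integer in network byte order.
    return bytes([ESCAPE]) + struct.pack("!Q", len(body)) + body


def decode(buf: bytes):
    """Return (body, rest), or (None, buf) if the frame is incomplete."""
    if not buf:
        return None, buf
    if buf[0] != ESCAPE:
        size, offset = buf[0], 1
    else:
        if len(buf) < 9:
            return None, buf
        (size,) = struct.unpack_from("!Q", buf, 1)
        offset = 9
    if len(buf) < offset + size:
        return None, buf
    return buf[offset:offset + size], buf[offset + size:]
```

Because `decode` returns the unconsumed remainder, it can be called repeatedly on a buffer that holds several frames read from the socket in one batch.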

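The two layouts are as follows. A message shorter than 255 bytes consists of a 1-byte length followed by the message body. A message of 255 bytes or more consists of the byte 0xFF, an 8-byte length in network byte order, and then the message body.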

Our next question was: Given 1 vs. 9 byte header length for short and long messages, what will the decrease of the throughput be for 254 vs. 255 byte long messages?

The decrease in throughput at 255 bytes doesn't look harmful in any way. It is 3.41%, which is below the 5% we've set as the maximal tolerable deviation, and thus negligible in our opinion.
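One way to arrive at a figure of 3.41% is shown below. The original computation is not given in this document, so this is a reconstruction under the assumption that throughput on a fixed-bandwidth link is inversely proportional to the number of bytes on the wire per message, ignoring lower-layer overhead. The helper names are hypothetical.

```python
def frame_size(body_size):
    """Bytes on the wire: 1-byte header below 255 bytes, 9-byte header otherwise."""
    header = 1 if body_size < 255 else 9
    return header + body_size


def throughput_drop(smaller_body, larger_body):
    """Relative drop in messages per second when moving to the larger body."""
    return 1 - frame_size(smaller_body) / frame_size(larger_body)


if __name__ == "__main__":
    # 254-byte body -> 255 bytes on the wire; 255-byte body -> 264 bytes.
    print(f"{throughput_drop(254, 255):.2%}")
```

A 254-byte message occupies 255 bytes and a 255-byte message occupies 264 bytes, so the drop is 1 - 255/264, or about 3.41%.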

Threading model

Inter-thread synchronization is one of the largest bottlenecks in messaging systems. To mitigate the impact of synchronization, we've chosen to use as few threads as possible.

Unfortunately, using just a single thread and thus avoiding synchronization altogether is not possible. ØMQ should process incoming messages even when the client application is busy processing business tasks. If it did not, incoming messages would be delayed on the other side of the connection until the client application ceded control to ØMQ, thus damaging latency severely.

Therefore, the minimal number of threads to use is two. We will refer to one of them as 'client thread', i.e. the thread owned by client application and used to invoke ØMQ API. The other will be called 'worker thread' - the thread owned by ØMQ kernel and used to process all the messaging hard work.

Obviously, a product with two threads cannot scale to systems with more than two CPU cores. In future releases we would therefore like to extend the ØMQ kernel to use an arbitrary number of client threads and worker threads. However, thread synchronization should be kept minimal even then - each message should pass a thread boundary at most once on the publishing side and at most once on the receiving side.

Inter-thread synchronization

The overall scheme of inter-thread synchronization can be seen on the following picture:

The idea is that messages are passed through a pipe in the standard write and read manner. Be aware that the pipe has to be thread-safe, as there are two threads accessing it concurrently. The interesting case arises when the reader has no more messages to read from the pipe: it notifies the writer using passive synchronization and goes to sleep. Passive synchronization means that the other thread is not notified directly using some kind of async signal; rather, it will be notified once it tries to write the next message to the pipe. When this happens, the writer is aware that the reader is already asleep or at least going to sleep at the moment. It knows that there is a new message available, so it wakes the reader up using active synchronization, i.e. by actively sending a wake-up event to the other thread.
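The protocol described above can be sketched as follows. This is an illustration only: a lock stands in for the atomic operations of the real pipe, so the sketch is neither lock-free nor representative of actual performance. The names `SketchPipe` and `demo` are hypothetical.

```python
import threading
from collections import deque


class SketchPipe:
    """Single-writer, single-reader pipe with passive synchronization."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()  # assumption: replaces the atomic operation
        self._reader_asleep = False

    def write(self, item):
        """Enqueue an item; return True if the reader must be woken up."""
        with self._lock:
            self._items.append(item)
            must_wake = self._reader_asleep
            self._reader_asleep = False
            return must_wake

    def read(self):
        """Return the next item, or None after recording that the reader sleeps."""
        with self._lock:
            if self._items:
                return self._items.popleft()
            self._reader_asleep = True  # passive notification for the writer
            return None


def demo(count=1000):
    pipe, wakeup, received = SketchPipe(), threading.Semaphore(0), []

    def reader():
        while len(received) < count:
            item = pipe.read()
            if item is None:
                wakeup.acquire()       # sleep until actively woken
            else:
                received.append(item)

    thread = threading.Thread(target=reader)
    thread.start()
    wakeups = 0
    for i in range(count):
        if pipe.write(i):              # writer learns the reader is asleep
            wakeup.release()           # active synchronization
            wakeups += 1
    thread.join()
    return received, wakeups
```

The writer pays for an active wake-up only when the reader has actually run dry, so the number of wake-ups returned by `demo` is typically much smaller than the number of messages.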

Synchronization implemented

Although the above scheme looks simple, implementing it in efficient and correct way is tricky.

To make the synchronization as fast as possible we've opted for ultra-optimized lock-free and wait-free pipe. Term lock-free refers to the fact that a thread cannot lock up - every step it takes brings progress to the system. This means that no synchronization primitives such as mutexes or semaphores can be used, as a lock-holding thread can prevent global progress if it is switched out. Wait-free refers to the fact that a thread can complete any operation in a finite number of steps, regardless of the actions of other threads.

Specifically, we've been able to implement both read and write operations on the pipe using single atomic compare-and-swap-pointers operation. This operation can be implemented using single assembler instruction on most modern micro-architectures.

Moreover, we've integrated the passive synchronization (letting the writer know that reader is asleep) with the pipe itself, so there's no need for any additional processing there - trying to read data from pipe unsuccessfully means that the writer will be notified of the fact on the next attempt to write.

Note that pipe implementation (ypipe class) is optimized for the case when there is single writer thread and single reader thread. Using multiple writers or multiple readers will result in race conditions and unspecified behavior.

As for active synchronization, it is implemented in two different ways. One is used to send notification from client thread to worker thread. Given that worker thread in our implementation 'sleeps' using POSIX poll function, the active synchronization is accomplished using local socket pair and sending single byte through it to interrupt the polling.
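The client-to-worker wake-up can be illustrated with a few lines of code. This is a sketch of the general technique on a POSIX system, not the ØMQ implementation. It uses Python's `socket.socketpair` and `select.poll`, and the function name `demo` is hypothetical.

```python
import select
import socket
import threading


def demo():
    # Local socket pair: the client writes to one end, the worker polls the other.
    signal_tx, signal_rx = socket.socketpair()
    woken = threading.Event()

    def worker():
        poller = select.poll()
        poller.register(signal_rx, select.POLLIN)
        # A real worker would register its network sockets here as well.
        events = poller.poll()             # sleeps until a descriptor is readable
        for fd, _mask in events:
            if fd == signal_rx.fileno():
                signal_rx.recv(1)          # consume the wake-up byte
                woken.set()

    thread = threading.Thread(target=worker)
    thread.start()
    signal_tx.send(b"\x00")                # active synchronization: a single byte
    thread.join(timeout=5)
    result = woken.is_set()
    signal_tx.close()
    signal_rx.close()
    return result
```

Registering the signalling socket alongside the network sockets lets the worker sleep in one `poll` call and still react to both network traffic and requests from the client thread.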

Active synchronization from worker thread to client thread is implemented using semaphore class. However, as Linux semaphore itself is quite a heavy-weight object, the actual implementation of semaphore class uses mutex to simulate the semaphore behaviour.
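One possible way to simulate a semaphore with a mutex is sketched below. The document does not show how the semaphore class is actually written, so this is an assumption made for illustration, and the class name `MutexSemaphore` is hypothetical. The sketch relies on the fact that a Python `threading.Lock` may be released by a thread other than the one that acquired it.

```python
import threading


class MutexSemaphore:
    """Binary semaphore simulated with a mutex (illustrative sketch)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._mutex.acquire()      # start 'empty' so that wait() blocks

    def wait(self):
        self._mutex.acquire()      # blocks until post() releases the mutex

    def post(self):
        self._mutex.release()      # lets the waiting thread continue


def demo():
    sem, log = MutexSemaphore(), []

    def client():
        sem.wait()                 # client thread sleeps here
        log.append("woken")

    thread = threading.Thread(target=client)
    thread.start()
    sem.post()                     # worker thread wakes the client
    thread.join(timeout=5)
    return log
```

This only behaves like a semaphore if `post` is never called twice without an intervening `wait`. That condition holds when a wake-up is sent at most once per sleep, as in the passive synchronization scheme described earlier.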

Boosting the synchronization

Although the lock-free and wait-free synchronization that we are using is extremely fast, it is still not fast enough to reach a throughput of several million messages a second.

The principle of further boost of performance is to avoid the synchronization altogether in most cases. Imagine that synchronization is done only for each 100th message. Given that actual enqueueing and dequeueing of messages is almost for free, the time needed to pass huge amount of messages through the pipe will drop to 1%.

To get this kind of behaviour we've modified the pipe to be able to read messages in batches rather than one by one. Each batch is read using a single atomic operation. Note that batching of writes is not implemented yet.
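The effect of batched reads can be sketched as follows. In this illustration a lock stands in for the single atomic operation of the real pipe, and the counter `sync_ops` exists only to make the saving visible. The class name `BatchPipe` is hypothetical. Writes are still synchronized one by one, matching the note that write batching is not implemented yet.

```python
import threading
from collections import deque


class BatchPipe:
    """Pipe whose reader detaches all pending items in one synchronized step."""

    def __init__(self):
        self._lock = threading.Lock()  # assumption: replaces the atomic operation
        self._pending = deque()
        self.sync_ops = 0              # number of synchronized read operations

    def write(self, item):
        with self._lock:
            self._pending.append(item)

    def read_batch(self):
        """Return every pending item, paying for synchronization only once."""
        with self._lock:
            batch, self._pending = self._pending, deque()
            self.sync_ops += 1
        return list(batch)


if __name__ == "__main__":
    pipe = BatchPipe()
    for i in range(100):
        pipe.write(i)
    batch = pipe.read_batch()
    print(len(batch), pipe.sync_ops)
```

Once the batch is detached, the reader can walk through it without touching shared state, so 100 messages cost one synchronized read instead of 100.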

Batching

To get high throughput figures the messages have to be batched to avoid large number of expensive operations (either synchronization or passing OSI stack up and down over and over again).

There are two basic ways to do batching. Sender side batching means that the sender batches the messages and passes them on only when a certain limit or timeout is reached. Receiver side batching means that the receiver reads all the available messages in a single batch irrespective of their number (so the batch can contain 1 message as well as 1000 messages).

In the current implementation we've opted for receiver side batching because that way we can keep latency as low as possible - messages are read instantly when the receiver is free to process them. Sender side batching, on the other hand, delays the messages irrespective of whether the reader is ready to process them or not.

Batching implemented

In version 0.1 implementation there are three points where the batching is done.

When sending messages, worker thread - when ready to write to socket - reads all the messages enqueued by client thread and pushes them into socket using a single sendmsg system call.

On the receiver side, data are read from the socket in batches rather than message by message.

The chunks read from the socket are enqueued and, when the client thread requests a message, all the enqueued chunks are transferred to it in a single operation.
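The first two batching points can be illustrated with a local socket pair. The sketch below is not ØMQ code. It uses Python's `socket.sendmsg`, which is available on POSIX systems, to hand several already framed messages to the kernel in one system call, and a single `recv` to read whatever has arrived. The function names are hypothetical, and the example frames use the 1-byte length header described in the wire format section.

```python
import socket


def send_batch(sock, frames):
    """Push all queued frames to the socket with a single sendmsg call."""
    # Note: a real implementation must handle partial sends.
    return sock.sendmsg(frames)


def demo():
    sender, receiver = socket.socketpair()
    frames = [b"\x03abc", b"\x02de", b"\x01f"]   # three framed messages
    sent = send_batch(sender, frames)
    data = receiver.recv(4096)                   # one read may return many messages
    while len(data) < sent:                      # stream sockets may deliver in pieces
        data += receiver.recv(4096)
    sender.close()
    receiver.close()
    return sent, data
```

The receiver gets one contiguous chunk of bytes and has to split it back into messages using the length headers.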

Slowing down the receiver

When batching is done on the receiver side there is one more problem we have to handle. If the receiver is faster than the publisher, suddenly no batching happens. The publisher writes a message and the receiver immediately reads it without waiting for subsequent messages to form a batch. However, it can be even worse than that. The receiver may try to get a message even before the publisher is able to publish it. In that case, the receiver goes to sleep and has to be woken up once the message is available. Given that going to sleep and waking up can be rather lengthy operations (several microseconds) and that the go-to-sleep/wake-up sequence can take place after every single message, the whole thing can hurt throughput very seriously. It can decrease throughput by a factor of ten or even more.

To prevent this situation, we should deliberately slow down the receiver, specifically in the case where the next phase of the processing has no immediate need for new messages.

However, in version 0.1 of ØMQ kernel the functionality of slowing down the receiver is implemented in a very rough manner. Firstly, out of three batching points, it is present only on the receiving side of the socket. Secondly, it is implemented using simple POSIX nanosleep call meaning that slowing down blocks whole worker thread thus causing even messages being passed in the opposite direction to be blocked. Thirdly, nanosleep implementation on Linux is in no way perfect. Instead of waiting for specific amount of nanoseconds, it rounds the interval up to the next half-millisecond (according to our tests), thus creating intolerable latency bottleneck.

This functionality is to be inspected closely in the next releases of ØMQ kernel.

X-mode

Imagine a single message to be passed via ØMQ. There are no preceding messages, nor subsequent messages. No batching can be done. However, the message still has to pass all the stages of batched message transfer: It has to be written to the sender pipe (atomic compare-and-swap operation), the worker thread has to be woken up (interrupting poll using a single-byte transfer via a socket pair), the message has to be read from the pipe by the worker thread (yet another compare-and-swap), it has to be written to the socket (sendmsg), read from the socket (recv) on the receiver side, passed to the receiver pipe (one more compare-and-swap), the client thread has to be woken up (using the semaphore object) and finally, the message has to be retrieved from the receiver pipe (compare-and-swap).

Passing the message through the socket cannot be avoided; however, the rest of the processing is actually superfluous and causes an unnecessary increase in latency. Atomic compare-and-swaps as well as waking up the client thread using the semaphore object (implemented as a mutex) are quite fast (each operation ranging from tens to hundreds of nanoseconds); however, waking up the worker thread using the socket pair is really slow. It takes approximately 6-8 microseconds. Alternative interruption of poll using POSIX signals seems to yield no considerable improvement (except that signal latency is a bit more deterministic).

Given that best TCP latency we've seen (using Intel's I/OAT technology) is around 15 microseconds and that InfiniBand vendors claim even lower latencies, 8-10 microseconds of overhead observed in our test environment form a considerable portion of end-to-end latency.

The idea of what we call extreme latency mode (or x-mode for short) is that when there are no messages passing through the system, the worker threads on both sides of the connection can cede control of the socket to the client thread. That way the client thread on the sender side can write to the socket directly and the client thread on the receiver side can read the data from the socket directly, thus avoiding all the slow inter-thread synchronisation. Once the client thread cannot write/read more data to/from the socket, it cedes control back to the worker thread, i.e. switches back to standard mode (also called batch mode).

Ceding the ownership of the socket back from client thread to worker thread can be a time consuming operation (up to several microseconds), however, the important thing is that it is done only when there are no data to process, therefore having no negative effect on the overall performance.

With x-mode fully implemented on both sides of the connection, we expect the latency overhead ØMQ adds to the underlying transport (TCP) to drop below 100 nanoseconds.

As for version 0.1 of ØMQ lightweight messaging kernel, ceding of socket to the client thread is implemented only on the sender side. Additionally, when x-mode is on, consumer-slowing-down algorithm is turned off on the receiver side. Following table summarizes the functionality for both x-mode and batch-mode:

| | batch mode | x-mode |
|---|---|---|
| sender | never cedes the socket to the client thread | cedes the socket to the client thread when there is no traffic |
| receiver | never cedes the socket to the client thread; slows down the worker thread to improve the batching ratio | never cedes the socket to the client thread; worker thread is never slowed down to improve batch ratio |

Would you like to join the project?

If you want to join ØMQ project, you have to do some paperwork first. Read our Copyright Assignment and License, sign it - or ask your company to sign it - and send it to us. Once this is done, there are many areas where you can contribute. See below for some ideas of what should be done.

Micro-architecture-level stuff

What's missing is implementation of atomic 'compare and swap pointer' for different micro-architectures. In version 0.1, there are implementations for i386 and x86_64 platforms and GCC compiler (AT&T inline assembler syntax). If you are using different platform or a different compiler, the operation will be simulated using mutex, thus being at least twice as slow.

What's needed is implementation for different micro-architectures (however, some operating systems may provide API to get the functionality irrespective of underlying micro-architecture) and different compilers.

Also, optimization on micro-architecture level is an issue. We've modified the atomic_ptr (class that provides atomic 'compare and swap pointer' functionality) so that it always uses a separate cache line thus avoiding unnecessary cache coherency updates. However, we haven't seen any measurable performance boost afterwards. This area is still open to optimization.

Lock-free and wait-free algorithms

Considerable part of ØMQ's performance is due to efficient lock-free and wait-free data structures. However, we would still like to make them even more efficient. The one missing feature we are aware of is the possibility to write several items into a lock-free and wait-free queue (ypipe) using single atomic operation.

Further kernel optimisation

We would obviously like to make ØMQ even faster. There's lot of experimenting to be done in this area.

Some possibilities are:

Move message parsing from the client thread to the worker thread. That way the client thread would be freed from excessive work and could focus on business processing.

Elaborate on the consumer-side throttling algorithm. The algorithm currently used for consumer-side throttling is extremely rough: it is a simple fixed-time nanosleep (10us). Take into account that nanosleep is not implemented properly on Linux, rounding the time up to the next half-millisecond.

Experiment with producer-side throttling to get better message batching ratio.

Zero-copy

Current version of ØMQ lightweight messaging kernel copies the message data once (to the buffer of the consumer application). Although we haven't seen any measurable performance decrease for message sizes up to 1024 bytes, copying the message may be prohibitive for really large messages.

We should find the best way to do zero-copy. Small messages should be copied to larger buffers to allow effective batching, whereas longer messages should be processed using zero-copy techniques.

Towards a full-blown scheduler

Managing messages in version 0.1 is easy. All the messages sent by the client are passed to the worker thread, then written to the socket etc.

However, if we want ØMQ to scale on multicore systems or send messages to different destinations we have to make things much more complex. We need several worker threads to make use of multiple CPU cores. We need several sockets to send messages to different hosts. We need several client threads to allow the client application to scale on multiple cores. And we need all this mixed together in any possible ratio: an arbitrary number of client threads speaking to an arbitrary number of worker threads which in turn manage an arbitrary number of sockets, all this without affecting throughput or latency in a negative way.

Take into account that the internal state of each single element of the design is quite complex by itself. A client thread may be doing some work irrelevant to messaging. It may be sending or receiving a message at the moment. Alternatively, it may sleep waiting for a message to arrive from any of the several worker threads. A worker thread is processing input from several client threads and passing it to any of multiple sockets, or it is reading messages from socket(s) and passing them to the client thread(s). Several actions may happen at the same time. A single worker thread can at the same moment be requested to get messages from client threads A, B and C and pass them to the sockets D and E, where socket E is actually non-responsive because its TCP buffer is full. At the same time there may be messages available to read in sockets F, G and H… and there may be a message half-read from socket I, where further reading is not possible because the other party has - for some reason (has it failed?) - ceased to send data. Still at the same moment, client threads J and K are waiting for messages and the worker thread should wake them up as the messages are already available. Moreover, all this has to be done in a fair and non-blocking fashion to get deterministic response times for any operation.

It's clear that ØMQ will need some kind of message scheduler in the same way as operating system needs a process scheduler. It is not our intention to provide a full-blown scheduler with the next version. However, we would like to move towards it by providing an ability to pass several parallel streams of data between two endpoints.

By making the streams mutually independent, we hope that the system will scale in linear manner on multicore boxes. Thus, having seen throughput of 2.8 million messages a second using two cores on each end of the connection, we expect to see figures above 10 million messages a second on big multicore boxes.

Our goal is to saturate 1Gb network with small FIX/FAST messages using quadcore boxes. For medium-sized messages we would like to saturate 10Gb network (whether Ethernet or InfiniBand).

Miscellaneous new functionality

Some of the functionality we would like to see in the future releases:

Efficient routing mechanisms. AMQP-style matching algorithms on one hand and new extra-efficient routing algorithms on the other. We call the latter wire-speed routing and it is our intent to cooperate with hardware vendors to make the algorithms efficiently implementable in silicon.

Integration with AMQP. This task boils down to virtualisation of user interface and allowing it to be controlled either locally via API or over network via AMQP.

Reliable multicast as underlying transport. Reliable multicast is still an exotic technology. We would like to explore the implementations available on different platforms, enhance them to suit our needs (if needed) and integrate them into ØMQ framework.

Testing

Any number of tests on different platforms, micro-architectures, with different transport mechanism, on different networking hardware etc. would be helpful, especially the tests showing bottlenecks in the system.