Introduction

The concept of the "Messaging Enabled Network" evolved from an attempt to integrate AMQP with high-performance messaging use cases, such as those encountered in the stock trading business. What follows is an analysis of the bottlenecks in high-performance environments and a discussion of how to avoid them. The resulting network topology is then reconciled with the AMQP model.

Bottlenecks

The first bottleneck is network bandwidth. There are scenarios that require network bandwidth exceeding the available bandwidth by whole orders of magnitude. For example, an application may need to pass 70 megabits of data a second over a one-megabit network.

The second bottleneck is CPU power. Especially with small messages, the broker may be unable to process the messages fast enough to use all the available bandwidth. For example, it may pass at most 100 kilobits a second on a one-megabit network.

The third bottleneck is latency. With market data in particular, latency is paramount. What clients are paying for is an upper bound on message delivery time. Stock quotes become useless very rapidly.

Let's look at these in more detail:

For the first issue, we see that we are passing a large amount of redundant data. If there are 100 clients subscribed to the same data feed, each message is passed a hundred times over the wire, increasing bandwidth usage by a factor of 100.

For the second issue, the problem is that all CPU-intensive computation happens at a single node in the network, namely the broker. Although we may have enough computational power when all the nodes in the network are taken into account, there is currently no way to utilise it.

For the third issue, latency is introduced by the number of intermediary nodes between producer and consumer and, of course, by the time needed to process the message at each intermediary node. The number of nodes on the path is affected by both administrative concerns (security, etc.) and the messaging architecture used. The processing time depends on the number of kernel/user space transitions per message, context switching, possible accesses to persistent storage and the overall performance of the implementation. It is also necessary to keep in mind that intermediary network devices such as routers sit on the message path as well as AMQP brokers.

You will often encounter a throughput vs. latency tradeoff. Think of message batching. When sending messages in batches, you get much better throughput because you don't have to traverse the stack for every message, because the number of on-wire network packets is greatly reduced, and so on. However, latency for the first message in the batch is much worse than it would be otherwise, as it has to wait for the subsequent messages in the batch to arrive.

When facing this dilemma we should make the behaviour configurable, so that the user can choose whether to prefer latency over throughput or the other way round. However, if configurable behaviour is not an option, we should opt for better latency. The rationale is that throughput is scalable (you can buy more bandwidth, distribute the load between several servers, etc.) whereas latency isn't (buying more hardware won't improve your latency in any way; on the contrary, it would add latency to the system).
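A minimal sketch of such a configurable tradeoff follows. The `Batcher` class and its `max_batch` and `max_delay` parameters are hypothetical names chosen for illustration, not part of AMQP or any existing library. The delay is only checked when a new message arrives; a real implementation would also need a timer to flush a batch that nobody adds to.

```python
import time
from typing import Callable, List, Optional


class Batcher:
    """Hypothetical batching sender: collects messages and passes them to `send`.

    max_batch=1 prefers latency (every message is sent at once); larger
    values prefer throughput at the cost of delaying the first message.
    """

    def __init__(self, send: Callable[[List[bytes]], None],
                 max_batch: int = 1, max_delay: float = 0.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._max_batch = max_batch
        self._max_delay = max_delay      # longest wait for the first message, in seconds
        self._clock = clock
        self._pending: List[bytes] = []
        self._first_at: Optional[float] = None

    def publish(self, message: bytes) -> None:
        if not self._pending:
            self._first_at = self._clock()   # the batch starts with this message
        self._pending.append(message)
        waited = self._clock() - self._first_at
        if len(self._pending) >= self._max_batch or waited >= self._max_delay:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(batch)


if __name__ == "__main__":
    low_latency = Batcher(print)                              # default: no batching
    low_latency.publish(b"quote 1")
    high_throughput = Batcher(print, max_batch=3, max_delay=0.5)
    for n in range(3):
        high_throughput.publish(b"quote %d" % n)              # sent as one batch of three
```

Defaulting to `max_batch=1` reflects the rule above: when nothing is configured, the sketch opts for latency.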

Bandwidth

To decrease bandwidth requirements we need to do two things:

If there is no consumer for the message, the message should not even be passed to the network. Although this looks obvious, note that the standard messaging architecture does pass messages over the network (from producer to broker) just to drop them at the broker if nobody is subscribed.

No message should be passed over the wire twice. So even if there are ten consumers for the same message, the message should be passed over the network at most once.

The first rule applies to LAN and WAN in the same way. The broker closest to the producer (or even the producer itself) should know whether there are any subscriptions for the message and, if there are none, drop the message immediately.
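The first rule can be sketched as a producer-side filter. This is an illustration only: `FilteringProducer` and its methods are hypothetical names, and the sketch assumes that subscription changes are somehow propagated back to the producer, a mechanism this paper does not specify.

```python
from collections import Counter
from typing import Callable


class FilteringProducer:
    """Hypothetical producer that drops messages nobody has subscribed to."""

    def __init__(self, transmit: Callable[[str, bytes], None]) -> None:
        self._transmit = transmit          # puts a message on the wire
        self._subscribers = Counter()      # topic -> number of known subscribers
        self.dropped = 0

    def on_subscribe(self, topic: str) -> None:
        self._subscribers[topic] += 1

    def on_unsubscribe(self, topic: str) -> None:
        if self._subscribers[topic] > 0:
            self._subscribers[topic] -= 1

    def publish(self, topic: str, payload: bytes) -> bool:
        """Return True if the message was sent, False if it was dropped locally."""
        if self._subscribers[topic] == 0:
            self.dropped += 1              # never reaches the network
            return False
        self._transmit(topic, payload)
        return True


if __name__ == "__main__":
    producer = FilteringProducer(lambda topic, payload: print("sent", topic, payload))
    producer.publish("IBM", b"101.5")      # dropped: no subscriber yet
    producer.on_subscribe("IBM")
    producer.publish("IBM", b"101.6")      # sent
```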

The second rule has different implications for LAN and WAN.

On a LAN, the current architecture works in the following manner, with the broker unicasting a separate copy of each message to every consumer:

The obvious way to decrease bandwidth usage is multicast:

Still, the message is passed over the LAN twice: first it is unicast from the producer to the broker, then it is multicast from the broker to the consumers. By passing the message directly from the producer to the consumers, we would cut network bandwidth usage (and latency) in half:
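The three LAN variants can be compared by counting how many times a single message crosses the wire. The function and the scheme names below are made up for this illustration, and the count assumes at least one consumer.

```python
def lan_transmissions(consumers: int, scheme: str) -> int:
    """Number of times one message crosses the LAN before every consumer has it."""
    if consumers < 1:
        raise ValueError("expected at least one consumer")
    if scheme == "broker_unicast":
        # producer -> broker, then one unicast copy per consumer
        return 1 + consumers
    if scheme == "broker_multicast":
        # producer -> broker, then a single multicast to all consumers
        return 2
    if scheme == "direct_multicast":
        # producer multicasts straight to the consumers
        return 1
    raise ValueError("unknown scheme: " + scheme)


if __name__ == "__main__":
    for scheme in ("broker_unicast", "broker_multicast", "direct_multicast"):
        print(scheme, lan_transmissions(100, scheme))
```

With 100 consumers the counts are 101, 2 and 1, which is the halving described above when moving from broker multicast to direct multicast.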

On a WAN, the goal of passing a message over the wire exactly once clearly cannot be achieved. There are several LANs on the path from the producer to the consumer, and the message has to be passed at least once on each LAN.

A typical AMQP WAN architecture looks like this, where arrows show message flow:

Note that two of the three brokers are optional. We can do the same with a single broker. The extra brokers are introduced either for security reasons (so that a client does not have to open a connection to a different LAN) or for network architecture reasons (if the broker in the middle is needed to distribute messages to two different LANs instead of a single one).

To improve bandwidth usage, we have to ensure that messages are duplicated as late as possible, ideally just before sending them to consumers:

Here we have postponed message duplication for as long as possible and thus cut, say, four passes over the wire in the LAN in the middle of the picture to just two. However, each message is still passed at least twice over each LAN. Combining the broker and router into a single box would solve this issue:

Lastly, the two messages still passed over the rightmost LAN can be cut down to a single one using multicast, as explained previously:

CPU Usage

There are two ways to improve CPU usage:

Move work to the edges of the network.

Optimise the message-processing stack in each node.

The mainframe era is over and network end-points (clients) are more and more capable. We can therefore plan to move some of the broker's work out to the clients. Processing AMQP commands is not really CPU intensive: commands form only a small fraction of all the work done, possibly below 1%. We therefore focus on the processing of messages and move that work to the clients.

With message processing moved to the clients, the broker's CPU usage is effectively zero. The producers do the routing, but that CPU load is evenly distributed between individual producers. The consumers have to do the queueing (no shared queues are allowed), but that load is likewise distributed among individual consumers.

Optimising the stack, our second strategy, involves some non-trivial issues. We can see several ways to do this:

We can move functionality to hardware. For example, messages can be pre-computed and tagged on the producer (based on the routing data), allowing routers to route them at wire speed even on high-performance networks such as 10 gigabit ones.

We can move the lower part of the stack to the OS kernel, thus minimising kernel/user mode transitions. (For example, dropping messages on the consumer may not even require any user mode support. Dropping messages may in fact be moved even lower, into the network interface cards, and then has no impact on CPU usage at all.)

We can move to a single-threaded architecture when implementing AMQP. Single-threaded processing is dramatically faster than multi-threaded processing, because it involves no context switching or synchronisation/locking. To take advantage of multi-core boxes, we should run one single-threaded instance of the AMQP implementation on each processor core. Individual instances are tightly bound to a particular core, and so run with almost no context switches (for more information see the load distribution whitepaper).

We can integrate AMQP with higher-level business protocols like FIX. Some of the data from a FIX message can be passed to the AMQP layer, so that FIX applications can take advantage of the underlying high-performance stack.
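The one-instance-per-core idea from the list above can be sketched as follows. This is only an outline under stated assumptions: it relies on `os.sched_setaffinity`, which Python exposes on Linux but not on every platform, and `run_instance` is a hypothetical placeholder for a real single-threaded event loop.

```python
import multiprocessing as mp
import os
from typing import List


def available_cores() -> List[int]:
    """Cores this process may run on (falls back to a simple range elsewhere)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def pin_to_core(core: int) -> bool:
    """Bind the calling process to one core; returns False if not supported."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {core})
    except OSError:
        return False
    return True


def run_instance(core: int) -> None:
    """Hypothetical entry point of one single-threaded messaging instance."""
    pin_to_core(core)
    # A real instance would run its single-threaded event loop here.


def launch_one_instance_per_core() -> List[mp.Process]:
    """Start one process per available core and return the process handles."""
    processes = [mp.Process(target=run_instance, args=(core,))
                 for core in available_cores()]
    for process in processes:
        process.start()
    return processes
```

A caller would typically invoke `launch_one_instance_per_core()` from under an `if __name__ == "__main__":` guard and then `join()` the returned processes.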

Latency

Some of the latency-related work was already introduced in the 'Bandwidth' section. By cutting down the number of messages passed on the wire we have, in many cases, also minimised the number of network hops, thus improving latency considerably.

The ideas presented in the 'CPU Usage' section would have a positive effect on latency as well. The less processing power we need to spend on each message, the more messages we are able to process per time unit and thus the lower the latency.

In some scenarios, where latency is paramount, it can be improved by relaxing reliability and ordering constraints. Using UDP for message transport would decrease latency as it exhibits no head-of-line blocking. However, it would introduce unreliability and unordered delivery as a side effect.
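As an illustration of a UDP-based transport, the sketch below sends one datagram over the loopback interface. The 8-byte sequence-number header and the function names are assumptions made for this example, not a format defined by AMQP. Each datagram is delivered on its own, so a lost datagram never holds up the ones behind it, but nothing is retransmitted either.

```python
import socket
import struct
from typing import Tuple

HEADER = struct.Struct("!Q")   # assumed header: 64-bit sequence number, network byte order


def open_receiver(host: str = "127.0.0.1") -> socket.socket:
    """Bind a UDP socket to a port chosen by the operating system."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))
    sock.settimeout(1.0)       # do not wait forever for a datagram that was lost
    return sock


def send_datagram(seq: int, payload: bytes, address: Tuple[str, int]) -> None:
    """Send one message as a single datagram: no connection, no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(HEADER.pack(seq) + payload, address)


def receive_datagram(sock: socket.socket) -> Tuple[int, bytes]:
    """Return (sequence number, payload); raises socket.timeout if nothing arrives."""
    data, _sender = sock.recvfrom(65535)
    (seq,) = HEADER.unpack(data[:HEADER.size])
    return seq, data[HEADER.size:]


if __name__ == "__main__":
    receiver = open_receiver()
    send_datagram(1, b"IBM 101.5", receiver.getsockname())
    print(receive_datagram(receiver))
    receiver.close()
```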

Latency can also be improved by moving producer and consumer close together. This has recently become known as 'proximity' or 'colocation', meaning that producer and consumer are placed close to one another in terms of physical distance and/or network distance. Imagine an algorithmic trading engine located in London trading at NYSE. When a favourable price appears on NYSE, it must be transported to London, where the trading engine decides to post an order. The order must then be transported across the ocean once again, making the latency really high:

The 'proximity' solution means that the box hosting the algorithmic trading engine is placed close to the NYSE, say, in the neighbouring building. That way the latency can be radically reduced:

As can be seen, messages are passed only locally within New York, giving a latency improvement of 10x-100x. However, the trading engine is still administered from London.

Taking this idea to its limit, we can place producer and consumer on the same physical box or even in the same process. In these cases, message transfer can be done in an extremely efficient manner using shared memory or even process-local memory (passing a straight pointer). We call this concept 'ultra-proximity':

Now the latency drops to a few nanoseconds (or microseconds if thread synchronisation is involved), a latency improvement of up to 1,000,000x.
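A minimal sketch of the same-process case, assuming a single thread: the hypothetical `LocalPipe` below hands over a reference to the message object, so nothing is serialised or copied.

```python
from collections import deque
from typing import Any, Deque, Optional


class LocalPipe:
    """Hypothetical in-process link between a producer and a consumer."""

    def __init__(self) -> None:
        self._items: Deque[Any] = deque()

    def send(self, message: Any) -> None:
        self._items.append(message)     # stores a reference, not a copy

    def receive(self) -> Optional[Any]:
        return self._items.popleft() if self._items else None


if __name__ == "__main__":
    pipe = LocalPipe()
    quote = bytearray(b"IBM 101.5")
    pipe.send(quote)
    received = pipe.receive()
    print(received is quote)            # the consumer holds the very same object
```

When producer and consumer run in different threads, the standard library's `queue.Queue` offers a similar hand-over with locking added, which corresponds to the slower, synchronised case mentioned above.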

Comment: Obviously, nobody would really want to host a market data publishing engine, an order execution engine and an algorithmic trading engine in the same process. However, the concept of ultra-proximity may prove useful in grid computation and other areas.

Architectural requirements

To deal with the above issues we need some kind of distributed AMQP solution with as much support from hardware as possible. Basically, this means breaking the standard AMQP broker architecture into separate pieces and distributing them over the network in a manner that minimises bandwidth usage, latency and CPU usage. Each component can be implemented as a separate process or device. We call this distributed architecture “Messaging Enabled Network” (MEN). MEN has the following features:

Supports a range of transport mechanisms for messages including UDP and multicast.

Allows local messages to be passed locally (i.e. if producer and consumer reside in the same process, the messages should be passed as pointers to process-local memory).

Decomposing the broker

Currently the broker architecture, at least as far as message passing is concerned, looks like this:

Clients are connected to the broker via standard AMQP connections (A, which we call the "front-end" connection). Messages received on front-end connections are forwarded to exchanges, where they are routed and stripped of their envelope (e.g. the Basic.Publish command frame). Then they are passed to the appropriate queue. Transfer from exchange to queue is done in-process, by passing simple pointers to messages (B, which we call the "back-end" connection). Messages from the queue are passed back to the AMQP state machine and delivered to subscribed consumers via a front-end connection.

Note that there are important differences between the front-end and back-end connections. The front-end connection is a standard AMQP connection that spans the network. It is used to carry commands as well as messages. Messages are carried with their envelopes.

The back-end connection is not an AMQP connection, but a local data flow. The exact character of this connection is out of scope of the AMQP specification. It carries messages only. Messages are carried without their envelopes.

To make a distributed broker, we make the back-end connection happen over a network. This connection carries only messages (no commands and no message envelopes) over a varied set of transport mechanisms.

By making the back-end connection a network connection, we can separate routing and queueing functionality:
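A minimal sketch of that separation, using hypothetical `Router` and `QueueNode` classes. The back-end connection is modelled as a plain callable that accepts a bare message body; in a standard broker it is a local method call, while in a distributed broker it would be replaced by a function that sends the body over some network transport.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List

BackEnd = Callable[[bytes], None]       # carries message bodies only, no envelope


class Router:
    """Exchange side: receives enveloped messages on the front-end connection."""

    def __init__(self) -> None:
        self._bindings: Dict[str, List[BackEnd]] = defaultdict(list)

    def bind(self, routing_key: str, link: BackEnd) -> None:
        self._bindings[routing_key].append(link)

    def on_publish(self, routing_key: str, body: bytes) -> int:
        """Route one message; returns the number of back-end links it went to."""
        links = self._bindings.get(routing_key, [])
        for link in links:
            link(body)                  # the routing key (envelope) is not forwarded
        return len(links)


class QueueNode:
    """Queue side: stores whatever arrives on its back-end connection."""

    def __init__(self) -> None:
        self.messages: Deque[bytes] = deque()

    def deliver(self, body: bytes) -> None:
        self.messages.append(body)


if __name__ == "__main__":
    router, local_queue, remote_queue = Router(), QueueNode(), QueueNode()
    router.bind("quotes", local_queue.deliver)                         # in-process link
    router.bind("quotes", lambda body: remote_queue.deliver(bytes(body)))  # stand-in for a network link
    print(router.on_publish("quotes", b"IBM 101.5"))
```

Because the router only sees a callable, the same routing code serves both the standard broker and the distributed one; only the link changes.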

There are significant differences between the needs for the front-end and back-end connections. The front-end connection is a stateful link between two parties, with these important properties:

The connection is bidirectional but not symmetric. The dialogue consists of request-response commands, and asynchronous requests with no responses.

Commands are delivered in the order in which they were sent. Otherwise, a single command delivered out of order could cause severe semantic misbehaviour and lead to client application malfunction, deadlock or even a crash.

Commands are delivered reliably, meaning that no command is dropped silently. It is either delivered to the other party, or the connection is torn down. Missing commands would have the same fatal consequences as out-of-order commands.

For the back-end connection, we can profitably relax these requirements:

We don't want the message transport to be connection-based. For example, in a multicast scenario, we want the sender to not even be aware of receivers joining and leaving the multicast group.

We don't need the message transport to be bidirectional.

We don't need to assume a one-to-one relationship. IP multicast and PGM are obvious cases of one-to-many transport scenarios.

We don't need to enforce in-order delivery of messages. With UDP we can take advantage of its immediate delivery. The same applies to the 'unordered' flag in SCTP.

We don't need reliable delivery. If the information transported in messages is highly transient, it's better to sacrifice messages than to load the network with retransmissions and retransmission requests.
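One possible receiver policy under these relaxed requirements is sketched below. It assumes that every message carries a sequence number, which is an assumption of this example rather than something the back-end connection prescribes. Gaps are counted but never re-requested, and a message that arrives after a newer one is discarded as stale, which suits highly transient data such as quotes.

```python
from typing import Optional


class LatestOnlyReceiver:
    """Hypothetical receiver that tolerates loss and reordering."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None
        self.lost = 0        # sequence numbers skipped over, never re-requested
        self.stale = 0       # messages that arrived after a newer one

    def on_message(self, seq: int, payload: bytes) -> Optional[bytes]:
        """Return the payload if it is the newest seen so far, otherwise None."""
        if self.last_seq is not None:
            if seq <= self.last_seq:
                self.stale += 1
                return None
            self.lost += seq - self.last_seq - 1
        self.last_seq = seq
        return payload


if __name__ == "__main__":
    receiver = LatestOnlyReceiver()
    for seq in (0, 1, 3, 2, 4):                  # 2 arrives late, after 3
        print(seq, receiver.on_message(seq, b"quote %d" % seq))
    print("lost:", receiver.lost, "stale:", receiver.stale)
```

An application that can use out-of-order messages would deliver them instead of discarding them; the point of the sketch is only that no retransmission traffic is ever generated.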

Examples

Standard broker

This is the diagram we've seen before. The back-end connection B is a transfer via local memory.

Routing distributed to client

In this scenario, routing is done on the client. This lets the client route messages to different brokers:

Note that A is used to describe both standard AMQP communication (on the broker side) and the client API (on the client side). We assume the two map to each other in a 1:1 relationship and we don't make any distinction between them.

Queueing distributed to client

In this scenario, the client acts as the storage for messages. It can get messages from several brokers:

Brokerless communication

By combining these two examples we get a routing client that speaks directly to a queuing client, bypassing the broker:

Although brokerless communication bypasses the formal AMQP architecture, it is the proper architecture for high-volume scenarios. Later we'll see how the brokerless design can be incorporated into an AMQP infrastructure.

Multicast

Multicast scenarios are important for LAN data distribution in the stock trading business. Note that multicast conforms to the relationship between queue and exchange as described by the AMQP specification, i.e. a message is copied to each queue that has an appropriate binding to the exchange:

To improve latency and bandwidth usage, multicast can be combined with brokerless communication:

Standalone router

This scenario handles the case where AMQP routing functionality is deployed on a network router with few persistent storage or memory resources. Messages are just passed through at wire speed without actually being stored:

Standalone storage

This scenario is interesting when there are storage servers on the network. Each box can service a set of queues without needing to do any routing:

Local eventing

This scenario uses an AMQP client not to connect to any broker, but to do internal messaging (eventing) for the application. It may be used, for example, when implementing the 'ultra-proximity' concept:

Conclusion

We believe that the messaging architecture described above is an efficient and robust basis for building messaging software. We believe that the distributed nature of the architecture allows for placing parts of the system into separate software components and/or separate hardware devices, for moving the components to different geographical locations, and for writing or manufacturing multiple implementations of each component, each with its own unique and useful features. We believe that this kind of flexibility will make it easy for everyone - from big hardware manufacturers to individual software developers - to participate in the ØMQ project.