Inter-thread communications in Java at the speed of light

The story begins with a simple idea: create a developer friendly, simple and lightweight inter-thread communication framework without using any locks, synchronizers, semaphores, waits, notifies; and no queues, messages, events or any other concurrency specific words or tools.

Just get POJOs communicating behind plain old Java interfaces.

It could be something similar to Akka typed actors, but that might be overkill as the new framework has to be ultra-lightweight, and optimized for inter-thread communication on a single multi-core computer.

The Dispatcher Thread receives messages from Actor Threads and sends them into appropriate SPSC queue.

Actor threads, using data from a received message, invoke a corresponding method of the actor instances. By using proxies of other actors, actor Instances send messages to the MPSC queue and then the messages go to the target Actor Thread.

The speed they play was around 500,000 ping/pongs a second; so far so good. However, when compared with the execution speed using just a single thread, it suddenly looks not so good. The code running in a single thread can perform more than 2 billion (2,681,850,373) operations per second!

The difference is more than 5,000 times. This disappointed me. It produces single threaded code that is more effective than multi-thread code in many cases.

I started looking for reasons for the slowness of my ping-pong players. After some investigation and testing I found that the blocking queues that I used to pass messages between actors were affecting performance.

Figure 2: SPSC queue with single producer and single consumer

So I launched a quest for one of the fastest queue implementations in Java as a replacement. I found a great blog by Nitsan Wakart. He has several posts describing some implementations of Single Producer/Single Consumer (SPSC) Lock-Free Queues. The posts were inspired by Martin Thompson’s presentation of Lock-Free Algorithms for Ultimate Performance.

Lock-Free queues provide better performance in comparison to queues based on lock primitives. In the case of lock based queues when one thread gets a lock, other threads will be blocked until the lock is free. In the case of lock free algorithms a producer thread can produce messages without blocking by other producer threads, and consumers will not be blocked by other consumers while reading from the queue.

The performance results of SPSC queues described in Martin Thompson’s presentation and in Nitsan’s blog were incredible - more than 100M ops/sec. It’s more than 10 times faster the JDK’s Concurrent Queue implementations (which performance on Intel Core i7 with 4 cores has been around 8M ops/sec).

With great anticipation I replaced the linked blocking queues connected to each actor with lock-free SPSC queue implementations. Sadly, the performance tests didn’t produce a significant improvement in throughput. It did not take long to realize that the bottleneck was not a SPSC queue but a Multi Producer/ Single Consumer (MPSC) one.

Using SPSC queues in a role of MPSC queue is not a straightforward task; multiple producers can overwrite each other’s values by doing a put operation. SPSC queues just do not have code controlling put operations by multiple producers. Therefore even the fastest SPSC queues would not fix my problem.

For the Multiple Producers/Single Consumer I decided to leverage LMAX Disruptor – a High Performance Inter-thread Messaging Library based on a ring buffer.

Figure 3: LMAX Disruptor with single producer and single consumer

By using Disruptor it’s easy to achieve very low-latency, high-throughput inter-thread message communication. It also provides use cases for different combination of producers and consumers. Several threads can read from the ring buffer without blocking each other:

Figure 4: LMAX Disruptor with single producer and two consumers

A scenario when multiple producers write into the ring buffer with multiple consumers getting messages from it.

The Disruptor was more than twice as fast as the LinkedBlockingQueue for the 3 Producers/1 Consumer case. However this was still a long way from my expectations of producing a 10 times improvement in performance results.

I was frustrated by this order of things and my mind was searching for a solution. As fate had it, I had recently modified my commute to use a subway instead of the old carpool. Suddenly a reverie came over me and my mind started mapping stations to producers and consumers. At one station we have both producers (in the form of a wagon with people exitng from it) and consumers (the same wagon with people who enter it.)

I created a Railway class and used AtomicLong to track the train as it passed from station to station. For a simple scenario I started with a single-train railway.

public class RailWay {
private final Train train = new Train();
// the stationNo tracks the train and defines which station has the received train
private final AtomicInteger stationIndex = new AtomicInteger();
// Multiple threads access this method and wait for the train on the specific station.
public Train waitTrainOnStation(final int stationNo) {
while (stationIndex.get() % stationCount != stationNo) {
Thread.yield(); // this is necessary to keep a high throughput of message passing.
//But it eats CPU cycles while waiting for a train
}
// the busy loop returns only when the station number will match
// stationIndex.get() % stationCount condition
return train;
}
// this method moves this train to the next station by incrementing the train station index…
public void sendTrain() {
stationIndex.getAndIncrement();
}
}

For testing purposes I used the same conditions used in Disruptor performance tests and tests for SPSC queues - tests transfer long values between threads. I created the following Train class, which contains a long array:

By running the test with different train capacity, the results surprised me:

Capacity

Throughput: ops/sec

Latency: ns

1

5,190,883

192.6

2

10,282,820

194.5

32

104,878,614

305.1

256

344,614,640

742. 9

2048

608,112,493

3,367.8

32768

767,028,751

42,720.7

The throughput of transferring messages between two threads reached 767,028,751 ops/sec with train capacity in 32,768 longs. It’s several times faster than SPSC queues in Nitsan’s blog.

Continuing the railway train of thought, I considered what would happen if we would have two trains? I felt it should improve throughput and reduce latency at the same time. Every station will have its own train. While one train will be loading goods at the first station, the second train will unload goods at the second station and vice versa.

Figure 7: Railway with single producer and single consumer uses two trains

Here are the results for the throughput:

Capacity

Throughput: ops/sec

Latency: ns

1

7,492,684

133.5

2

14,754,786

135.5

32

174,227,656

183.7

256

613,555,475

417.2

2048

940,144,900

2,178.4

32768

797,806,764

41,072.6

The results were amazing; it was more than 1.4 times faster than the test results for a single train. For the train capacity of one the latency was reduced from 192.6 nanoseconds to 133.5 nanoseconds; clearly a promising sign.

Therefore my experiments were not over. The latency of transferring messages between threads for the train capacity of 2048 was - 2,178.4 nanoseconds, which was too much. I was considering how to reduce that and created a case with many trains:

Figure 8: Railway with single producer and single consumer uses many trains

I also reduced the train capacity to one long value and started playing with the train count. Below are test results:

Train Count

Throughput: ops/sec

Latency: ns

2

10,917,951

91.6

32

31,233,310

32.0

256

42,791,962

23.4

1024

53,220,057

18.8

32768

71,812,166

13.9

With 32,768 trains the latency of sending a long value between threads was reduced to 13.9 nanoseconds. Playing with the train count and train capacity the throughput and latency can be tuned up to optimal balance when the latency is not so high and the throughput is not so low.

These numbers are great for single producer and single consumer (SPSC); but how could we make that work for several producers and consumers? The answer was simple- add more stations!

Figure 9: Railway with single producer and two consumers

Every thread waits for the next train, then loads/unloads items, and sends the train to the next station. The producer thread puts items to the train while consumers get items from it. Trains constantly move by the circle from one station to another.

In order to test the Single Producer/Multiple Consumer (SPMC) case I created the Railway test with 8 stations. One station belongs to a single producer while other 7 stations belong to consumers. The results are:

For the train count = 256 and train capacity = 32:

ops/sec = 116,604,397
latency nanos = 274.4

For the train count = 32 and train capacity = 256:

ops/sec = 432,055,469
latency nanos = 592.5

As you can see even with eight working threads the test shows pretty good results - 432,055,469 ops/sec with 32 trains and a capacity of 256 longs. During the test all CPU cores were loaded to 100%.

Figure 10: CPU utilization during the Railway test with 8 stations

While playing with the Railway algorithm I almost forgot about my goal; to improve the performance of the Multiple Producers/Single Consumer case.

Figure 11: Railway with three producers and single consumer

I created a new test with 3 producers and 1 consumer. Each train traces the circle from station to station while each producer loads only 1/3 the capacity of each train. By every train the consumer gets all three items received from three producers. The performance test shows the following average results:

ops/sec = 162,597,109
trains/sec = 54,199,036
latency ns = 18.5

That’s pretty good. Producers and the consumer work at a speed of more than 160M ops/sec.

As you can see the Railway approach provides average throughput 162,597,109 ops/sec, whereas the best result using the Disruptor for the same case was only 128,205,128 ops/sec. In the case of LinkedBlockingQueue the best result was just 4,546,281 ops/sec.

The Railway algorithm introduces an easy way for event batching that increases throughput significantly. It can be easily configurable to achieve desired results for throughput/latency by playing with train capacity or train count.

Also the Railway could be used for really complex cases by mixing producers and consumers when the same thread could be used to consume messages, process them and to return the results back to the ring:

It had the following average results: the throughput more than one and a half billion (1,569,884,271) operations per second and the latency equals 1.3 microseconds. As you can see the results for the test are on the same order of magnitude with the results for the single threaded test described at the beginning of the article which was 2,681,850,373 operations per second.

Here I'll leave you to draw your own conclusions.

In a future article I hope to demonstrate how to back the Railway algorithm with Queue and BlockingQueue interfaces for different combinations of producers and consumers. Stay tuned.

About the Author

Aliaksei Papou is the Lead Software Engineer and Architect at Specific Group, a software development company located in Vienna, Austria. Aliaksei has more than 10 years experience in the development of small and large scale enterprise applications. He has a strong belief: writing concurrent code shouldn't be so hard.

this is really a cool design, somehow it seems like a disruptor with multiple ring buffer(train). for disruptor threads will block or limited by the slowest thread(no matter producer or consumer) and this design will using different trains to push the results and delay the blocking.

just one question, in terms of disruptor implementation, do you think in your ultra high throughput case, the trainNoLong should be volatile to force retrieve new value? in case the thread could not get the latest value.

and something more, in my opinion, disruptor is more suitable for "Fork" like operation, which means either the producer is much faster than consumer, so we need to use multiple consumer to serve the messages, or the consume operation could be parallel just like the assembly line example of disruptor.

train is on the other side, which is more suitable for "Join" like operation, producers could be paralleled without coordinate with each other.

something more.this design is more like selector-channel pattern in nio, each consumer's station is a dedicated selector(while loop), and train is channel, the sendTrain is like flip on the channel to coordinate producer and consumer

I'm currently looking into some optimizations in Hazelcast; the most important one is the scheduling of operations for a given partition. Perhaps this might be interesting to reduce the overhead of queuing operations.

Per Martin Thompson's recommendation, you will find yourself getting an even larger performance boost by avoiding division (modulus operator). Since division is not fully pipelined in the processor, it prevents the CPU from performing up to 6-tasks in parallel per clock cycle (vice one: division).

I recently have evaluated the relative works. So, I'd like to share some my opinions here:

1. The thing in the article is not new. In fact, your Train is just a wrapper for long array. You can exchange your "Train" object with Disruptor or j.u.c Queue. The results will show no significant difference, or may be better. Because the bottleneck for this techique is not exchanging but buffering.

Although the author gives this techique a new name - "Railway algorithm", it is essentially similiar to an old techique call "Multiple Buffering"(en.wikipedia.org/wiki/Multiple_buffering), IMHO. One coomment hit the edge by mentioning the nio buffer:) It is a very common techique for high throughput.

However, every technique has its application scope. The batch buffering improve nicely the throughput at the price of heavily worsening the latency. And worsely, before the batch send out, your consumer may starve for waiting but nothing. So, it is truely possible to see a degradation in real world system after applying this kind of technique.

2. latency != 1/throughputthis is a trivial fact: some cpu L1 cache may has ~1TB/s throughput, then is it easy to reason out this cpu's L1 latency? 1 train transfers from A to B in 1 seconds. Then, you may say this transport has 1s latency. If 1000 parallel trains can transfer from A to B in 1 seconds, can we say this transport has become 1ms latency?

"latency" is a well established concept, do not abuse it. So, here, I "leave you to draw your own conclusions" about the "many trains"' and MPSC latencies:)

3. for your SPSC, there is a fundational mistake: no fence to guarantee the visibility. "Thread.yield(); // this is necessary to keep a high throughput of message passing. "And this words is misleading in fact. Sometimes, sleeping(backoffing) to more can get similiar or better result for throughput.

You may feel good in that your test may get passed. It is just a trick for the specific "Thread.yield()" impl in hotspot. "Thread.yield()" seemly offers a fence here. But it can't. The JMM in JLS has explicitly disabled this semantics.

That is, you must offer the fence for your "ultra high throughput Single Producer/Single Consumer". Which type of or where the fence should be placed is any other topic. But Disruptor gives a good example.

4. for your MPSC, as some one hinted, it is fair to compare with mulitple SPSC Disruptors rather than a single MPSC Disruptor. Note that your train suffer from fasle sharing problem and indirection with AtomicInteger. Again note, AtomicInteger is easy to run over itself.

5. finally, SPSC may not be a good point for massive ITC.

Because the space complexity goes in O(N^2), N is the number of threads. It is not rare to see a server with 1k or 2k hardware threads. And "many core" is the final destination of CPU from current sight.

Thanks for the feedback. It's a good addition to the article. You've clarified some things I haven't covered by it and provided a different view for major points. Thank you.

The goal of the article was to provide a non intrusive introduction into the world of high throughput and low latency for inexperienced developers in this topic. I provided it in a way of my current understanding of the topic. As it is. I think it's a good start to get people interested in this topic and draw their attention to the subject.

I'm glad to share my findings with others and I'm open to discuss different aspects.

I've got a number of feedbacks from people having concerns about using AtomicInteger.lazySet instead of using AtomicInteger.incrementAndGet.

Also have to say I do not agree with assumptions I've got from previous comments:

From Lee Yunpeng:

just one question, in terms of disruptor implementation, do you think in your ultra high throughput case, the trainNoLong should be volatile to force retrieve new value? in case the thread could not get the latest value.

And from Jin Mingjian:

3. for your SPSC, there is a fundational mistake: no fence to guarantee the visibility. ...That is, you must offer the fence for your "ultra high throughput Single Producer/Single Consumer".

All Railway examples adheres Single Writer Principle by using a single thread per a station and a single train at a time at the station.

According to Martin's blog post about Single Writer Principle the "ultra high throughput Single Producer/Single Consumer" example uses hardware level memory barriers in the case of x86/x64 architectures:

A quote from the post:

If you have a system that can honour this single writer principle then each execution context can spend all its time and resources processing the logic for its purpose, and not be wasting cycles and resource on dealing with the contention problem. You can also scale up without limitation until the hardware is saturated. There is also a really nice benefit in that when working on architectures, such as x86/x64, where at a hardware level they have a memory model, whereby load/store memory operations have preserved order, thus memory barriers are not required if you adhere strictly to the single writer principle. On x86/x64 "loads can be re-ordered with older stores" according to the memory model so memory barriers are required when multiple threads mutate the same data across cores. The single writer principle avoids this issue because it never has to deal with writing the latest version of a data item that may have been written by another thread and currently in the store buffer of another core.

In order to dispel doubts of Lee, Jin and other readers about absence of explicit memory bariers in the "ultra high throughput Single Producer/Single Consumer" example I modified it to use AtomicInteger with AtomicInteger.lazySet implementation to be sure it works not only on x86/x64.

I have had a little time to digest the code examples and produce a couple of like-for-like benchmarks. Firstly the the MPSC example is quite interesting as it highlights a potential optimisation that had not be been taken advantage of in the Disruptor. If you know and fix the number of publishers at construction time, it is possible to build a structure that has far less contention. The existing multi-producer case supports an large, arbitrary and dynamic number of producers, which is often quite common (think web server). I got into this in more detail on my blog goo.gl/s7F8s1. The headline is that I can get around 390 Mops/sec from the Disruptor vs. 290 Mops/sec from the supplied railway example.

One concern I have about the MPSC implementation is that there is a serialisation between the producers that could cause stalling problems in a real-world application. Each of the producers has a station number and the train must move from station to station in a sequential pattern, as each producer moves the train from it own stationNumber to stationNumber + 1. As an example if you have three producers 0, 1, and 2. If producers 0 and 2 have data to send, but station 1 does not then the train will move from station 0 to station 1, but not progress to station 2. This can be mitigated by simply passing the train onto the next station, but what happens if that thread happens to be blocked (e.g. waiting on a socket for data) or has crashed due to some exception at the time that the train arrives?

I've also looked at the ultra-fast case and there I'm less convinced that the test is measuring anything useful. The test allocates trains with a capacity of 2048 longs. So the test will iterate through the 2K array filling it with values, push to the next station and repeat. So most of the time will be spent filling the array and 1 "op" in the test is considered to the write of a single long to that array. It is possible to do the same thing using the Disruptor by allocating specifying a ring buffer of long[]. I've implemented a performance test that more closely matches this behaviour which is available here goo.gl/AdYACv. I get 1,040 Mops/sec with the Disruptor compared to 629 Mops/sec for the Train. I wasn't able to get the claimed 1.5 Bops/sec, but that is probably to do with differences in hardware. While I can get the Disruptor faster using this pattern, I don't believe this test tells me anything about the performance of the concurrent structure/algorithm.

The ultra-fast case also violates the single writer principal. If you look at the implementation of Train (goo.gl/9FrvyW), but the write method (addGoods) and the read method (getGoods) mutate the same variable (index). Therefore both the producer thread and consumer thread will be updating the same location in memory.

Interesting article. Some of your findings we discovered ourselves in creating Reactor's Dispatchers. We'd love for any contributions you might like to send our way to get our throughput even higher. We're always looking for ways to improve our throughput and make Reactor the fastest reactive fastdata framework around! :)

Hi Aliaksei,First of all, I'm glad you enjoy my blog and find the post useful reference material.I have the following reservations about your article above (I have focused on the SPSC code, and did not find time for the rest):

1. Broken terminology: If you want to compare your implementation with queues and refer to queues throughout your article, why not use an understandable vocabulary that is common to the industry? why make up your own terms to define the same things? IMHO it does not help your readers or your cause.

2. Broken implementation I: As others have kindly pointed out, you failed to apply any memory barriers in your SPSC example. Your reaction is that you added them to "dispel doubts" not for correctness sake. You are wrong. Your implementation is broken without the memory barrier. If the counters are not read as volatile and written using lazySet the compiler is free to hoist them out of loops and reorder the writes such that your events are made visible before they are fully written. Your quote from Martin: "thus memory barriers are not required if you adhere strictly to the single writer principle." - is a common miss-interpretation of the different levels of memory barriers employed. The CPU may not re-order loads/stores, but the compiler is still free to do so if you do not employ lazySet.

3. Broken implementation II: IMO, The Train class is also broken in the way it manages it's internal state (reading the same element twice will decrease the goods count twice, the semantics are inconsistent).

4. Broken implementation III: As Mike points out it also breaks the Single Writer principle, but as it is the 'event' rather than the 'queue' that may not be an issue.

5. Broken measurement I: I'm not sure you haven't used one of the existing benchmarks from the sources you quote, it makes your comparison with other results produced on different hardware with different benchmarks misleading.

6. Broken measurement II:Your benchmarks measure both latency and throughput incorrectly. As Jin gently points out Latency != Time/Throughput. You are measuring the write throughput to your 'event' + 'queue' and your 'event' preparation+ 'queue' offer avg cost. There's no attempt at calculating the arrival time of the event or the completion of all events arrival.

7. Broken measurement III: You are also mixing the warmup results with your later results (your benchmarks don't have separate warmup iterations measurement).

As Mike points out you could have implemented the same functionality, with better performance, by sending your Train event on the Disruptor or any other correct concurrent queue implementation. I apologize if the above sounds harsh, but I feel your article is giving a bad example.Regards,Nitsan

There are a number of issues with this implementation and approach as have been pointed out in the comments. Rather than going straight to publication on a major site I'd recommend getting feedback in a suitable forum to validate and refine an approach. There are a number of excellent discussion groups for this. Two I would recommend are:

Both of these forums have a number of members who specialise in concurrent and lock-free algorithms. If you wrote a personal blog and then asked for feedback in these forums I'm sure you would learn a lot and not mislead the general developer population.

More than the final result and conclusion (which is extremely useful), I like the way you have approached the whole problem. Thanks for sharing the whole story! Chimes with people like who writes programs to earn a living.

I think if you make the event object in disruptor based on an array, you will likely find that the throughput increases accordingly. While you wait for that train to fill, the consumer will probably be stalled, which increases your latency.