A typical push-based API, similar to JMS. One important note: EventConsumer is blocking, meaning the stream won't deliver a new Event until the previous one has been consumed. This is just an assumption I made that does not drastically change the requirements; it's also how message listeners work in JMS. The naive implementation simply attaches a listener that takes around 10 milliseconds to complete:
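Something along these lines; the EventStream name for the source and the exact Event fields are reconstructions based on the description above, and the sleep helper merely simulates roughly 10 ms of work:

    import java.time.Instant;
    import java.util.UUID;
    import java.util.concurrent.TimeUnit;

    interface EventStream {
        void consume(EventConsumer consumer);
    }

    @FunctionalInterface
    interface EventConsumer {
        Event consume(Event event);
    }

    class Event {
        private final Instant created = Instant.now();
        private final int clientId;
        private final UUID uuid;

        Event(int clientId, UUID uuid) {
            this.clientId = clientId;
            this.uuid = uuid;
        }

        Instant getCreated() { return created; }
        int getClientId()    { return clientId; }
        UUID getUuid()       { return uuid; }
    }

    class ClientProjection implements EventConsumer {
        @Override
        public Event consume(Event event) {
            // simulate roughly 10 ms of processing per event
            sleepAround(10);
            return event;
        }

        private static void sleepAround(long millis) {
            try {
                TimeUnit.MILLISECONDS.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }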

It compiles and runs, but in order to figure out that the requirements aren't met we must plug in a few metrics. The most important one is the latency of message consumption, measured as the time between message creation and the start of processing. We'll use Dropwizard Metrics for that:
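Wired in, it could look like this; ProjectionMetrics is an illustrative helper name, and the Slf4jReporter setup is one standard way to have Dropwizard print metrics every second:

    import com.codahale.metrics.Histogram;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Slf4jReporter;
    import org.slf4j.LoggerFactory;

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.TimeUnit;

    class ProjectionMetrics {

        private final Histogram latencyHist;

        ProjectionMetrics(MetricRegistry metricRegistry) {
            final Slf4jReporter reporter = Slf4jReporter.forRegistry(metricRegistry)
                    .outputTo(LoggerFactory.getLogger(ProjectionMetrics.class))
                    .build();
            reporter.start(1, TimeUnit.SECONDS);
            latencyHist = metricRegistry.histogram(
                    MetricRegistry.name(ProjectionMetrics.class, "latency"));
        }

        void latency(Duration duration) {
            latencyHist.update(duration.toMillis());
        }
    }

    class ClientProjection implements EventConsumer {

        private final ProjectionMetrics metrics;

        ClientProjection(ProjectionMetrics metrics) {
            this.metrics = metrics;
        }

        @Override
        public Event consume(Event event) {
            // latency: time from event creation to the start of processing
            metrics.latency(Duration.between(event.getCreated(), Instant.now()));
            sleepAround(10);
            return event;
        }

        private static void sleepAround(long millis) {
            try {
                TimeUnit.MILLISECONDS.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }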

After 30 seconds our application processes events with an average delay of 15 seconds. Not exactly real-time. Obviously the complete lack of concurrency is the reason. Our ClientProjection event consumer takes around 10 ms to complete, so it can handle up to 100 events per second, whereas we need an order of magnitude more. We must scale ClientProjection somehow. And we haven't even touched the other requirements!

Naive thread pool

The most obvious solution is to invoke EventConsumer from multiple threads. The easiest way to do this is by taking advantage of ExecutorService:
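A minimal sketch of such a decorator (the constructor shape is one possible choice; NaivePool is the name used later in this article):

    import java.io.Closeable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class NaivePool implements EventConsumer, Closeable {

        private final EventConsumer downstream;
        private final ExecutorService executorService;

        NaivePool(int size, EventConsumer downstream) {
            this.executorService = Executors.newFixedThreadPool(size);
            this.downstream = downstream;
        }

        @Override
        public Event consume(Event event) {
            // hand the event off to the pool; the calling thread is released immediately
            executorService.submit(() -> downstream.consume(event));
            return event;
        }

        @Override
        public void close() {
            executorService.shutdown();
        }
    }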

We use the decorator pattern here. The original ClientProjection, implementing EventConsumer, was correct. However, we wrap it with another implementation of EventConsumer that adds concurrency. This allows us to compose complex behaviors without changing ClientProjection itself. Such a design promotes:

loose coupling: various EventConsumer implementations don't know about each other and can be combined freely

single responsibility: each does one job and delegates to the next component

open/closed principle: we can change the behavior of the system without modifying existing implementations.

Yet we still see a growing delay, just on a much smaller scale: after 30 seconds the latency reached 364 milliseconds. It keeps growing, so the problem is systematic. We... need... more... metrics. Notice that NaivePool (you'll see soon why it's naive) has exactly 10 threads at its disposal. This should be just about enough to handle a thousand events per second, each taking 10 ms to process. In reality we need a little extra processing power to avoid issues after garbage collection or during small load spikes. To prove that the thread pool is actually our bottleneck, it's best to monitor its internal queue. That requires a little bit of work:
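A sketch of that; the metric name is arbitrary:

    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.MetricRegistry;

    import java.io.Closeable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    class NaivePool implements EventConsumer, Closeable {

        private final EventConsumer downstream;
        private final ExecutorService executorService;

        NaivePool(int size, EventConsumer downstream, MetricRegistry metricRegistry) {
            LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
            // expose the length of the pending-task queue as a Dropwizard Gauge
            Gauge<Integer> gauge = queue::size;
            metricRegistry.register(
                    MetricRegistry.name(NaivePool.class, "queue"), gauge);
            this.executorService =
                    new ThreadPoolExecutor(size, size, 0L, TimeUnit.MILLISECONDS, queue);
            this.downstream = downstream;
        }

        @Override
        public Event consume(Event event) {
            executorService.submit(() -> downstream.consume(event));
            return event;
        }

        @Override
        public void close() {
            executorService.shutdown();
        }
    }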

The idea here is to create the ThreadPoolExecutor manually in order to provide a custom LinkedBlockingQueue instance. We can later use that queue to monitor its length (see: ExecutorService - 10 tips and tricks). The Gauge will periodically invoke queue::size and report it wherever you need it. The metrics confirm that the thread pool size was indeed a problem.

The ever-growing size of the queue holding pending tasks hurts the latency. Increasing the thread pool size from 10 to 20 finally yields decent results and no stalls. However, we still haven't addressed duplicates, nor protected against concurrent modification of events for the same clientId.

Obscure locking

Let's start with avoiding concurrent processing of events for the same clientId. If two events arrive in very quick succession, both related to the same clientId, NaivePool will pick both of them and start processing them concurrently. First we'll at least detect such a situation by keeping a Lock for each clientId:
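A sketch of such a guard (FailOnConcurrentModification is an illustrative name; logging an error is one way to surface the conflict):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    class FailOnConcurrentModification implements EventConsumer {

        private static final Logger log =
                LoggerFactory.getLogger(FailOnConcurrentModification.class);

        private final ConcurrentMap<Integer, Lock> clientLocks = new ConcurrentHashMap<>();
        private final EventConsumer downstream;

        FailOnConcurrentModification(EventConsumer downstream) {
            this.downstream = downstream;
        }

        @Override
        public Event consume(Event event) {
            Lock lock = findClientLock(event);
            if (lock.tryLock()) {
                try {
                    return downstream.consume(event);
                } finally {
                    lock.unlock();
                }
            } else {
                // another thread holds the lock: concurrent modification detected
                log.error("Client {} is already being processed by another thread",
                        event.getClientId());
                return event;
            }
        }

        private Lock findClientLock(Event event) {
            return clientLocks.computeIfAbsent(
                    event.getClientId(), clientId -> new ReentrantLock());
        }
    }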

This is definitely going in the wrong direction. The amount of complexity is overwhelming, but running this code at least reveals there is an issue. The event processing pipeline looks as follows, with one decorator wrapping another:
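Roughly, using the constructor shapes assumed in the sketches above (the event source itself, eventStream, comes from part 2):

    MetricRegistry metricRegistry = new MetricRegistry();
    ProjectionMetrics metrics = new ProjectionMetrics(metricRegistry);
    ClientProjection clientProjection = new ClientProjection(metrics);
    FailOnConcurrentModification failOnConcurrentModification =
            new FailOnConcurrentModification(clientProjection);
    NaivePool naivePool = new NaivePool(10, failOnConcurrentModification, metricRegistry);
    // eventStream is the source of events, covered in part 2
    eventStream.consume(naivePool);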

Once in a while an error message will pop up, telling us that some other thread is already processing an event for the same clientId. For each clientId we associate a Lock that we examine to figure out whether another thread is processing that client at the moment. As ugly as it gets, we are actually quite close to a brute-force solution. Rather than failing when the Lock cannot be obtained because another thread is already processing some event, let's wait a little bit, hoping the Lock will get released:
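A sketch of that variant (again, the class name is illustrative):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    class WaitOnConcurrentModification implements EventConsumer {

        private final ConcurrentMap<Integer, Lock> clientLocks = new ConcurrentHashMap<>();
        private final EventConsumer downstream;

        WaitOnConcurrentModification(EventConsumer downstream) {
            this.downstream = downstream;
        }

        @Override
        public Event consume(Event event) {
            Lock lock = findClientLock(event);
            try {
                // wait up to 1 second for the other thread to release the lock
                if (lock.tryLock(1, TimeUnit.SECONDS)) {
                    try {
                        return downstream.consume(event);
                    } finally {
                        lock.unlock();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // on timeout the event is simply skipped in this sketch
            return event;
        }

        private Lock findClientLock(Event event) {
            return clientLocks.computeIfAbsent(
                    event.getClientId(), clientId -> new ReentrantLock());
        }
    }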

The idea is very similar. But instead of failing, tryLock() waits up to 1 second, hoping the Lock for the given client will be released. If two events come in very quick succession, one will obtain the Lock and proceed, whereas the other will block waiting for unlock() to happen.

Not only is this code really convoluted, it is probably also broken in many subtle ways. For example, what if two events for the same clientId came almost exactly at the same time, but one was clearly first? Both events will ask for the Lock at the same time and we have no guarantee which one will obtain the non-fair Lock first, possibly consuming events out of order. There must be a better way...

Dedicated threads

Let's take a step back and a very deep breath. How do you ensure things aren't happening concurrently? Well, just use one thread! As a matter of fact that's what we did in the very beginning, but the throughput was unsatisfactory. We don't care about concurrency across different clientIds; we just have to make sure events with the same clientId are always processed by the same thread!

Maybe a map from clientId to Thread comes to your mind? Well, that would be overly simplistic. We would create thousands of threads, each idle most of the time as per the requirements (only a few events per second for a given clientId). A good compromise is a fixed-size pool of threads, each responsible for a well-known subset of clientIds. This way two different clientIds may end up on the same thread, but the same clientId will always be handled by the same thread. If two events for the same clientId appear, they will both be routed to the same thread, thus avoiding concurrent processing. The implementation is embarrassingly simple:
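A sketch (the SmartPool name is illustrative, and the % routing assumes non-negative clientIds):

    import java.io.Closeable;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    class SmartPool implements EventConsumer, Closeable {

        private final List<ExecutorService> threadPools;
        private final EventConsumer downstream;

        SmartPool(int size, EventConsumer downstream) {
            this.downstream = downstream;
            this.threadPools = IntStream
                    .range(0, size)
                    .mapToObj(i -> Executors.newSingleThreadExecutor())
                    .collect(Collectors.toList());
        }

        @Override
        public Event consume(Event event) {
            // the same clientId always maps to the same single-threaded pool
            final int threadIdx = event.getClientId() % threadPools.size();
            threadPools.get(threadIdx).submit(() -> downstream.consume(event));
            return event;
        }

        @Override
        public void close() {
            threadPools.forEach(ExecutorService::shutdown);
        }
    }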

This simple algorithm will always use the same single-threaded ExecutorService for the same clientId. Different IDs may end up in the same pool; for example, when the pool size is 20, clients 7, 27, 47, etc. will use the same thread. But this is OK, as long as one clientId always uses the same thread. At this point no locking is necessary, and sequential invocation is guaranteed because events for the same client are always executed by the same thread. Side note: one thread per clientId would not scale, but one actor per clientId (e.g. in Akka) is a great idea that simplifies a lot.

By the way, to be extra safe I plugged in metrics for the average queue size in each and every thread pool, which made the implementation longer:
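An extended sketch that reports the average length across all the per-thread queues as a single Gauge (one possible choice):

    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.MetricRegistry;

    import java.io.Closeable;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    class SmartPool implements EventConsumer, Closeable {

        private final List<ExecutorService> threadPools;
        private final EventConsumer downstream;

        SmartPool(int size, EventConsumer downstream, MetricRegistry metricRegistry) {
            this.downstream = downstream;
            // keep a handle on each queue so we can measure its length later
            List<LinkedBlockingQueue<Runnable>> queues = IntStream
                    .range(0, size)
                    .mapToObj(i -> new LinkedBlockingQueue<Runnable>())
                    .collect(Collectors.toList());
            this.threadPools = queues.stream()
                    .map(q -> (ExecutorService)
                            new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, q))
                    .collect(Collectors.toList());
            // a single Gauge reporting the average pending-queue length across pools
            Gauge<Double> avgQueueLength = () -> queues.stream()
                    .mapToDouble(LinkedBlockingQueue::size)
                    .average()
                    .orElse(0);
            metricRegistry.register(
                    MetricRegistry.name(SmartPool.class, "queues"), avgQueueLength);
        }

        @Override
        public Event consume(Event event) {
            final int threadIdx = event.getClientId() % threadPools.size();
            threadPools.get(threadIdx).submit(() -> downstream.consume(event));
            return event;
        }

        @Override
        public void close() {
            threadPools.forEach(ExecutorService::shutdown);
        }
    }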

Deduplication and idempotency

In a distributed environment it's quite common to receive duplicated events when your producer has at-least-once delivery guarantees. The reasons behind such behavior are beyond the scope of this article, but we must learn to live with the issue. One way is to attach a globally unique identifier (UUID) to every message and make sure on the consumer side that messages with the same identifier aren't processed twice. Each Event has such a UUID. The most straightforward solution under our requirements is to simply store all seen UUIDs and verify on arrival that a received UUID was never seen before. Using ConcurrentHashMap<UUID, UUID> (there is no ConcurrentHashSet in the JDK) as-is would lead to a memory leak, as we would keep accumulating more and more IDs over time. That's why we only look for duplicates within the last 10 seconds. You could technically have a ConcurrentHashMap<UUID, Instant> that maps from a UUID to the timestamp when it was encountered, and use a background thread to remove elements older than 10 seconds. But if you are a happy Guava user, Cache<UUID, UUID> with a declarative eviction policy will do the trick:
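A sketch of such a deduplicating decorator (IgnoreDuplicates is an illustrative name):

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    import java.util.UUID;
    import java.util.concurrent.TimeUnit;

    class IgnoreDuplicates implements EventConsumer {

        private final EventConsumer downstream;

        // entries expire 10 seconds after being written, bounding memory use
        private final Cache<UUID, UUID> seenUuids = CacheBuilder.newBuilder()
                .expireAfterWrite(10, TimeUnit.SECONDS)
                .build();

        IgnoreDuplicates(EventConsumer downstream) {
            this.downstream = downstream;
        }

        @Override
        public Event consume(Event event) {
            final UUID uuid = event.getUuid();
            // putIfAbsent returns null only the first time we see this UUID
            if (seenUuids.asMap().putIfAbsent(uuid, uuid) == null) {
                return downstream.consume(event);
            }
            return event;
        }
    }

In the pipeline this decorator would presumably sit in front of SmartPool, so duplicates are dropped on the delivery thread before any scheduling happens.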

It took us a lot of work to come up with a relatively simple and well-structured (I hope you agree) solution. In the end, the best way to tackle concurrency issues is to... avoid concurrency and run the code that is subject to race conditions in one thread. This is also the idea behind Akka actors (a single message processed per actor) and RxJava (one message processed by a Subscriber). In the next installment we will see a declarative solution in RxJava.

The source code is not on GitHub, but you can reconstruct it from this article; I include everything. The source of events can be found in part 2: http://www.nurkiewicz.com/2016/10/small-scale-stream-processing-kata-part_13.html

Great article covering an important problem when working with data feeds. I think to make it closer to a real-world example we cannot assume that the distribution of events across clients is even. One clientA may emit many more events than clientB. Using your final solution, in the worst case, when we select the clients handled by one thread with the % operator (on the clientId), we may end up with all the active clients being served by just one thread. If that happens we are back to the naive sequential processing described in the first part, which is inefficient. I think a nice solution would be to assign one actor per client (you mentioned it in the "dedicated threads" chapter) and remove it after 10 s of inactivity. 10,000 actors running at once wouldn't, I think, be a big overhead for the system (although I haven't measured it). It would be very interesting to see whether it would run in the same (or comparable) time as the final solution with dedicated threads, which assumes even distribution.