Small scale stream processing kata. Part 2: RxJava 1.x/2.x

In part 1: thread pools we designed and implemented a relatively simple system for processing events in real time. Make sure you read the previous part, as it contains some classes that we'll reuse. Just in case, here are the requirements:

A system delivers around one thousand events per second. Each Event has at least two attributes:

clientId - we expect up to a few events per second for one client

UUID - globally unique

Consuming one event takes about 10 milliseconds. Design a consumer of such stream that:

allows processing events in real time

events related to one client should be processed sequentially and in order, i.e. you cannot parallelize events for the same clientId

if a duplicated UUID appears within 10 seconds, drop it. Assume duplicates will not appear after 10 seconds
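A quick back-of-the-envelope check of these numbers may help: 1,000 events per second at about 10 milliseconds each is roughly 10 seconds of work arriving every second, so a single thread cannot keep up and at least 10 events have to be in flight at once. A minimal sketch of that arithmetic (CapacityEstimate and minimumConcurrency are names made up for illustration, not classes from part 1):

```java
// Back-of-the-envelope capacity check for the requirements above.
final class CapacityEstimate {

    /**
     * Minimum number of workers that must be busy at once so that work does
     * not pile up: arrival rate multiplied by the time spent on one event.
     */
    static int minimumConcurrency(int eventsPerSecond, int millisPerEvent) {
        long busyMillisPerSecond = (long) eventsPerSecond * millisPerEvent;
        // Round up: a fraction of a worker still needs a whole thread.
        return (int) ((busyMillisPerSecond + 999) / 1000);
    }
}
```

For the numbers above, minimumConcurrency(1000, 10) evaluates to 10. This is only a lower bound; it ignores scheduling overhead and bursts.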

What we came up with so far was a combination of thread pools and a shared cache. This time we will implement the solution using RxJava. First of all, I never revealed how EventStream is implemented, only giving the API:

interface EventStream {
    void consume(EventConsumer consumer);
}

In fact for manual testing I built a simple RxJava stream that behaves like the system from the requirements:

Understanding how this simulator works is not essential, but it is quite interesting. First we generate a steady stream of Long values (0, 1, 2...) every millisecond (a thousand events per second) using the interval() operator. Then we delay each event by a random amount of time between 0 and 1_000 microseconds with the delay() operator. This way events appear at less predictable moments in time, which is a bit more realistic. Finally we map (using, ahem, the map() operator) each Long value to a random Event with a clientId somewhere between 1_000 and 1_100 (inclusive-exclusive).

The last bit is interesting. We would like to simulate occasional duplicates. In order to do so we map every event (using flatMap()) to itself (in 99% of the cases). However, in 1% of the cases we return this event twice, where the second occurrence happens between 10 milliseconds and 5 seconds later. In practice the duplicated instance of the event will appear after hundreds of other events, which makes the stream behave quite realistically.
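The duplicate-injection decision described above can be sketched in isolation. This is a plain-JDK illustration rather than the simulator itself: DuplicateInjector and emissionDelaysMillis are hypothetical names, and in the RxJava version the same decision would sit inside flatMap(), with the returned delays turned into delayed emissions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class DuplicateInjector {

    static final double DUPLICATE_PROBABILITY = 0.01; // 1% of events are duplicated
    static final long MIN_DELAY_MILLIS = 10;
    static final long MAX_DELAY_MILLIS = 5_000;

    /**
     * Returns the delays (in milliseconds) at which one event should be emitted.
     * Usually that is a single immediate emission; in about 1% of the cases a
     * second emission follows between 10 ms (inclusive) and 5 s (exclusive) later.
     */
    static List<Long> emissionDelaysMillis(Random random) {
        if (random.nextDouble() >= DUPLICATE_PROBABILITY) {
            return Collections.singletonList(0L);
        }
        long span = MAX_DELAY_MILLIS - MIN_DELAY_MILLIS;
        long duplicateDelay = MIN_DELAY_MILLIS + (long) (random.nextDouble() * span);
        return Arrays.asList(0L, duplicateDelay);
    }
}
```

At a thousand events per second, even the shortest 10 ms delay puts about ten other events between the original and its duplicate, and a typical delay of a few seconds puts thousands in between.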

There are two ways to interact with the EventStream: callback-based via consume() and stream-based via observe(). We can take advantage of Observable<Event> to quickly build a processing pipeline very similar in functionality to part 1, but much simpler.

Missing backpressure

The first naive approach to take advantage of RxJava falls short very quickly:

(ClientProjection, ProjectionMetrics et al. come from part 1.) We get MissingBackpressureException almost instantaneously, and that was expected. Remember how our first solution was lagging, handling events with more and more latency? RxJava tries to avoid that, as well as avoiding overflowing queues. MissingBackpressureException is thrown because the consumer (ClientProjection) is incapable of handling events in real time. This is fail-fast behavior. The quickest solution is to move consumption to a separate thread pool, just like before, but using RxJava's facilities:

By consuming events using flatMap() on a separate Schedulers.io() scheduler, each consumption is invoked asynchronously. This time events are processed in near real time, but there is a bigger problem. I decorated ClientProjection with FailOnConcurrentModification for a reason. Events are consumed independently of each other, so it may happen that two events for the same clientId are processed concurrently. Not good. Luckily, in RxJava solving this problem is much easier than with plain threads:

A little bit has changed. First of all we group events by clientId. This splits a single Observable stream into a stream of streams. Each substream, named byClient, represents all events related to the same clientId. Now if we map over this substream we can be sure that events related to the same clientId are never processed concurrently. The outer stream is lazy, so we must subscribe to it. Rather than subscribing to every event separately, we collect events every second and count them. This way we receive a single event of type Integer every second, representing the number of events consumed per second.
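To make the "stream of streams" idea concrete, here is a minimal plain-JDK sketch of the same partitioning applied to a finite list. It is not the RxJava pipeline: SimpleEvent is a stripped-down stand-in for the Event class from part 1, and GroupByClientSketch and labelsByClient are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupByClientSketch {

    /** Minimal stand-in for Event: only a client id and a label. */
    static final class SimpleEvent {
        final int clientId;
        final String label;

        SimpleEvent(int clientId, String label) {
            this.clientId = clientId;
            this.label = label;
        }
    }

    /**
     * Splits one ordered list of events into one ordered list per client.
     * Arrival order is preserved inside every group, which is the property
     * that lets each group be processed sequentially.
     */
    static Map<Integer, List<String>> labelsByClient(List<SimpleEvent> events) {
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (SimpleEvent event : events) {
            groups.computeIfAbsent(event.clientId, id -> new ArrayList<>())
                  .add(event.label);
        }
        return groups;
    }
}
```

Different groups are independent of each other and could be handed to different threads, while the events inside one group keep their relative order.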

Impure, non-idiomatic, error-prone, unsafe solution of deduplication using global state

Now we must drop duplicate UUIDs. The simplest, yet very foolish, way of discarding duplicates is by taking advantage of global state. We can simply filter out duplicates by looking them up in a cache available outside of the filter() operator:

Accessing global, especially mutable, state from inside operators is very dangerous and undermines the sole purpose of RxJava: simplifying concurrency. Obviously we use a thread-safe Cache from Guava, but in many cases it's easy to miss places where shared global mutable state is accessed from multiple threads. If you find yourself mutating some variable outside of the operator chain, be very careful.
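One example of how subtle this gets: even with a thread-safe collection, a separate "look up, then store" pair of calls can let two threads both conclude that they saw a UUID first. A minimal sketch of doing the check and the write in one atomic step, using only the JDK (SeenUuids and firstTimeSeen are hypothetical names; this version never forgets anything, so it does not yet address the 10-second window):

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class SeenUuids {

    // Shared mutable state: every thread that filters events touches this set.
    private final Set<UUID> seen = ConcurrentHashMap.newKeySet();

    /**
     * Returns true only for the first caller that reports a given UUID.
     * Set.add() checks and records in one atomic step, so two threads racing
     * on the same UUID cannot both get true.
     */
    boolean firstTimeSeen(UUID uuid) {
        return seen.add(uuid);
    }
}
```

A predicate like this could be passed to a filter, but it is still global state living outside the operator chain, with all the caveats above.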

Custom distinct() operator in RxJava 1.x

RxJava 1.x has a distinct() operator that presumably does the job:

es.observe()
    .distinct(Event::getUuid)
    .groupBy(Event::getClientId)

Unfortunately distinct() stores all keys (UUIDs) internally in an ever-growing HashSet. But we only care about duplicates in the last 10 seconds! By copy-pasting the implementation of DistinctOperator I created a DistinctEvent operator that takes advantage of Guava's cache to store only the last 10 seconds' worth of UUIDs. I intentionally hard-coded Event in this operator rather than making it more generic, to keep the code easier to understand:

This solution is much shorter than the previous one based on thread pools and decorators. The only awkward part is the custom operator that avoids a memory leak when storing too many historic UUIDs. Luckily, RxJava 2 comes to the rescue!
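The core idea behind such an operator, remembering a UUID only for a limited window, can be sketched without RxJava or Guava. This is an illustration under assumptions, not the DistinctEvent operator itself: RecentUuidFilter, accept and evictExpired are hypothetical names, and the caller supplies the current time explicitly to keep the class easy to test.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class RecentUuidFilter {

    private final long windowMillis;
    // UUID -> time (in milliseconds) at which it was last accepted.
    private final Map<UUID, Long> acceptedAt = new ConcurrentHashMap<>();

    RecentUuidFilter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the UUID was not accepted within the last windowMillis. */
    boolean accept(UUID uuid, long nowMillis) {
        boolean[] accepted = {false};
        // compute() runs atomically per key, so check and update cannot interleave.
        acceptedAt.compute(uuid, (key, seenAt) -> {
            if (seenAt == null || nowMillis - seenAt >= windowMillis) {
                accepted[0] = true;
                return nowMillis;   // remember (or refresh) the entry
            }
            return seenAt;          // duplicate inside the window: keep the old timestamp
        });
        return accepted[0];
    }

    /** Drops entries older than the window so memory use stays bounded. */
    void evictExpired(long nowMillis) {
        acceptedAt.values().removeIf(seenAt -> nowMillis - seenAt >= windowMillis);
    }

    int size() {
        return acceptedAt.size();
    }
}
```

With a window of 10_000 milliseconds, a UUID accepted at time 0 is rejected at 5_000 and accepted again at 10_000. Someone still has to call evictExpired() periodically; Guava's cache takes care of that kind of housekeeping on its own.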

RxJava 2.x and more powerful built-in distinct()

I was actually this close to submitting a PR to RxJava with a more powerful implementation of the distinct() operator. But first I checked the 2.x branch, and there it was: distinct() that allows providing a custom Collection as opposed to a hard-coded HashSet. Believe it or not, dependency inversion is not only about Spring framework or Java EE. When a library allows you to provide a custom implementation of its internal data structure, this is also DI. First I create a helper method that can build a Set<UUID> backed by a Map<UUID, Boolean> backed by a Cache<UUID, Boolean>. We sure like delegation!
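The Set-backed-by-Map part of that delegation chain is available in the JDK as Collections.newSetFromMap(). The sketch below is only an illustration of the technique: instead of a Guava Cache it uses a size-bounded LinkedHashMap as a stand-in for the backing map (so eviction is by entry count, not by the 10-second window), and BoundedUuidSets and recentUuids are hypothetical names.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

final class BoundedUuidSets {

    /**
     * Builds a Set<UUID> whose storage and eviction are delegated to a Map.
     * The backing map keeps only the maxEntries most recently inserted keys;
     * it stands in for a time-bounded cache exposed as a Map.
     */
    static Set<UUID> recentUuids(int maxEntries) {
        Map<UUID, Boolean> backing = new LinkedHashMap<UUID, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<UUID, Boolean> eldest) {
                return size() > maxEntries; // evict the oldest key once over capacity
            }
        };
        // LinkedHashMap is not thread-safe, so guard the resulting set.
        return Collections.synchronizedSet(Collections.newSetFromMap(backing));
    }
}
```

The resulting set behaves like any other Set from the caller's point of view: add() returns false for a key that is still remembered, while keys evicted by the backing map are silently forgotten. That is exactly why a pluggable collection is enough to bound the memory used by distinct().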

Comments

Awesome post as usual, Tomasz. I have a use case pretty similar to your cache that would be really interesting to follow up with. I'm trying to implement a retry budget for an async IO computation, meaning that if calls fail, only X retries per second are allowed globally in order to not put more stress on the backend.

My naive solution would be to have a global AtomicInteger that gets reset each second by an interval observable. In the IO observable I would then check and decrease that counter during the retryWhen, meaning that only the first X errors during each second would cause a retry.

Even before I read your article my gut feeling told me that having global state referenced by two observables is a bit of a no-no, and I guess it should be implemented using some zip with a windowed observable + take. Do you have any pointers on how to proceed?

I'm actually considering using the same pattern as you with a custom operator hiding the budget counter state.

Long story short, we've actually been using Hystrix extensively but for a couple of reasons would like to retry non-blocking IO clients inside the Hystrix execution instead of as an operator on the Hystrix observable.

As our client uses Rx natively it would be very clean to deal with "plain Rx", but if everything else fails I'll probably look into simply using circuit breakers to achieve similar results.