Monday, November 23, 2015

Last week I created an RxMovers scene showing the toList() operator. This week I will show how to take an Observable emitting List<T> objects, turn each List<T> into an Observable<T>, and flatten all these Observables into a single Observable<T> using a flatMap().

Say you have three RxMovers emptying boxes from a house and loading them into a truck, just like our previous examples. But this time Mover 1, the source Observable, is pushing a dolly of items out of the house. He is an Observable<List<Box>> emitting List<Box> objects.

Mover 2 will act as a flatMap() taking each dolly of boxes and passing each box individually to Mover 3, the Subscriber.

Mover 1 pushes two dollies of boxes before he calls onCompleted(). On each emission, Mover 2 receives the dolly and passes the boxes individually. Effectively, both dollies were consolidated into a single stream of boxes which were received by Mover 3.

Of course, each List<Box> has to be converted into an Observable<Box>, and you accomplish that with the Observable.from() static factory method. This also means that flatMap() can map each emitted item (of any type) to any Observable and not just ones backed by a List.
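Here is a minimal sketch of that pattern (assuming RxJava 1.x on the classpath; the Integer "dollies" are made up, and toList()/toBlocking() at the end are only there to make the flattened result easy to print):

```java
import java.util.Arrays;
import java.util.List;
import rx.Observable;

class FlattenExample {
    // emits two "dollies" of items and flattens them into a single stream
    static List<Integer> flatten() {
        return Observable.just(Arrays.asList(1, 2, 3), Arrays.asList(4, 5))
                .flatMap(list -> Observable.from(list)) // List<Integer> -> Observable<Integer>
                .toList()                               // collect just for demonstration
                .toBlocking()
                .single();
    }

    public static void main(String[] args) {
        System.out.println(flatten()); // both dollies consolidated into one stream
    }
}
```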
The flatMap() is powerful because it transforms an emission into another set of emissions, and these multiple sets of emissions get consolidated into a single stream. flatMap() is arguably one of the most critical operators in RxJava because it contributes so much utility. It is difficult to show every application of flatMap() here, but definitely spend some time playing with it. Taking each emitted item from an Observable and mapping it to another Observable is a powerful concept.

Wednesday, November 18, 2015

If you read my earlier post Understanding Observables, you already know that I like to think of Observable operations as stick figures passing boxes. The analogy made sense to me and it seemed to help others too. As I get more efficient at cheap animation, I'll see if I can cover other operators with these "RxMovers". I even have some ideas to visualize subscribeOn() and observeOn() with these guys.

The toList() operation reveals quite a bit about Observable behavior and how onNext() and onCompleted() calls can be manipulated. In my previous article that covered map(), the onNext() calls happen in sequence down the entire chain and get passed all the way to the final Subscriber.

The toList() operator intercepts onNext() calls. Rather than handing the items further down the chain, it catches them and adds them to a List (specifically an ArrayList). Only when the source Observable calls onCompleted() will it emit the entire List up the chain, and then call onCompleted() up the chain as well.

The toList() in Action

The scene starts with Mover 1 taking boxes out of storage and pushing them to Mover 2, who represents the toList() operation. Instead of passing the boxes immediately to Mover 3, he collects them onto a dolly and does not pass anything along. When Mover 1 has passed all items from the storage unit (which holds three items), he calls onCompleted() on Mover 2 and says they are done.

However, Mover 2 does not pass the onCompleted() call to Mover 3 yet. This is his cue to push the dolly of items to Mover 3. He calls Mover 3's onNext() method and passes the entire List<Box> of items to him. Then Mover 2 calls onCompleted() which signals Mover 3 to close the truck.

The importance of onCompleted()

As you can probably guess, the onCompleted() call is very important with an operation like toList(). Logically, you cannot collect infinite emissions from an infinite Observable into a List, because a List is a finite collection of items. Therefore, if an onCompleted() call is never passed to a toList() operation, you will get some very undesired behavior, ranging from blocking forever while toList() waits for items that never stop coming, to never emitting anything at all.

A few other operators are dependent on the onCompleted() call, such as last() and count(). So always be mindful and ensure the sequences are finite. Let me know if you have any questions.

Friday, November 6, 2015

After doing RxJava for several months, I have found the key to making it extra useful is not just knowing the operators, but knowing how to combine them. The built-in operators can be cleverly composed to create new operations, such as combining flatMap() and subscribeOn() to achieve parallelization.

When I started mixing object-oriented and reactive programming together, I came across some interesting problems and design decisions. When a class contains one or more Observable properties, it is not obvious how to filter on those properties.

For instance, let's say you have a type called Transaction. It has four properties that you typically might expect on a bank transaction object.

Suppose we have an Observable streaming these Transaction objects from some IO source or a reactive JDBC library. If we wanted to filter these emissions to just today's date, that is easy enough using the filter() operator. It only passes emissions that meet the boolean condition specified by a lambda.
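A sketch of what that might look like (assuming RxJava 1.x; this Transaction type and its properties are hypothetical stand-ins, since the original class isn't shown here):

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.List;
import rx.Observable;

class DateFilterExample {
    // a hypothetical Transaction; the actual properties may differ
    static final class Transaction {
        final int id;
        final LocalDate date;
        final String description;
        final BigDecimal amount;

        Transaction(int id, LocalDate date, String description, BigDecimal amount) {
            this.id = id;
            this.date = date;
            this.description = description;
            this.amount = amount;
        }
    }

    // filter() only passes emissions meeting the boolean condition
    static List<Transaction> todaysTransactions(Observable<Transaction> transactions) {
        return transactions.filter(t -> t.date.equals(LocalDate.now()))
                           .toList().toBlocking().single();
    }

    public static void main(String[] args) {
        Observable<Transaction> source = Observable.just(
                new Transaction(1, LocalDate.now(), "COFFEE", new BigDecimal("4.25")),
                new Transaction(2, LocalDate.now().minusDays(1), "RENT", new BigDecimal("800.00")));

        todaysTransactions(source).forEach(t -> System.out.println(t.description));
    }
}
```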

Easy enough, right? But now what if we wanted to filter on the Observable property getBalance()? This property will dynamically push a BigDecimal value reflecting the bank balance after that transaction took place. Say we want to filter the stream of transactions to where the balance is less than $100. The problem is we can't use the filter() operator, thanks to the monad. Try it. It will never compile, because you will always return an Observable, not the boolean which the filter() operator expects.

The problem is we are passing the filter() operator an Observable<Boolean>, not a boolean which it needs. We could call toBlocking() and extract that boolean value but that will break the monad (which is bad).

So how do we filter on this Observable property? I learned through experimentation that flatMap() has a lot of tricks up its sleeve, and we can use it to filter on an Observable property on a stream of objects.

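Here is a sketch of the trick (assuming RxJava 1.x; this minimal Transaction with an Observable getBalance() property is hypothetical):

```java
import java.math.BigDecimal;
import java.util.List;
import rx.Observable;

class BalanceFilterExample {
    // minimal hypothetical Transaction with an Observable property
    static final class Transaction {
        final int id;
        private final BigDecimal balance;

        Transaction(int id, BigDecimal balance) {
            this.id = id;
            this.balance = balance;
        }

        Observable<BigDecimal> getBalance() {
            return Observable.just(balance); // single emission, then onCompleted()
        }
    }

    static List<Transaction> balanceUnder100(Observable<Transaction> transactions) {
        return transactions.flatMap(t ->
                t.getBalance()                                        // Observable<BigDecimal>
                 .filter(b -> b.compareTo(new BigDecimal("100")) < 0) // keep balances under $100
                 .map(b -> t))                                        // map back to the Transaction
            .toList().toBlocking().single();
    }

    public static void main(String[] args) {
        Observable<Transaction> source = Observable.just(
                new Transaction(1, new BigDecimal("250.00")),
                new Transaction(2, new BigDecimal("49.99")));

        balanceUnder100(source).forEach(t -> System.out.println("Transaction " + t.id + " passed"));
    }
}
```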
Study the code above very closely and notice everything that happened inside that flatMap(). For each Transaction, we called getBalance(), which emits the balance. But we immediately filtered where the balance is less than 100. Any balance emissions that pass this condition are then mapped back to the original Transaction item they came from using .map(b -> t). This Transaction is emitted out of the flatMap(), while Transactions that failed to meet the criteria come out as empty Observables that emit nothing.

This is how you can filter on Observable properties. Of course, the above example assumes the getBalance() method returns a single emission followed by an onCompleted() call. You might encounter scenarios where the property emits multiple values, and this approach can still be useful if that is your intention: each condition emission that passes will push the item through, while those that fail will not.

Thursday, November 5, 2015

RxJava is often misunderstood when it comes to the asynchronous/multithreaded aspects of it. The coding of multithreaded operations is simple, but understanding the abstraction is another thing.

A common question about RxJava is how to achieve parallelization, or emitting multiple items concurrently from an Observable. Of course, this definition breaks the Observable Contract which states that onNext() must be called sequentially and never concurrently by more than one thread at a time. Now you may be thinking "how the heck am I supposed to achieve parallelization then, which by definition is multiple items getting processed at a time!?" Widespread misunderstanding of how parallel actually works in RxJava has even led to the discontinuation of the parallel operator.

You can achieve parallelization in RxJava without breaking the Observable contract, but it requires a little understanding of Schedulers and how operators deal with multiple asynchronous sources.

Say you have this simple Observable that emits a range of Integers 1 to 10 upon subscription.

And for simplicity's sake let's just make intenseCalculation() sleep for a random interval before returning the integer back, simulating an intensive calculation. We will also make it print the current thread the computation is occurring on.
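Something like this, as a sketch (assuming RxJava 1.x, with the sleep intervals shortened for brevity):

```java
import java.util.concurrent.ThreadLocalRandom;
import rx.Observable;

class Setup {
    // simulates an intensive calculation with a random sleep, printing the thread
    static int intenseCalculation(int i) {
        try {
            System.out.println("Calculating " + i + " on " + Thread.currentThread().getName());
            Thread.sleep(ThreadLocalRandom.current().nextInt(200));
            return i;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Observable<Integer> source = Observable.range(1, 10); // emits 1 through 10
        source.map(i -> intenseCalculation(i))
              .subscribe(i -> System.out.println("Subscriber received " + i
                      + " on " + Thread.currentThread().getName()));
    }
}
```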

We would want to parallelize and process multiple integers at a time. A common mistake people make is to think "Oh, I just use subscribeOn() and have the source Observable emit items on multiple computation threads just like an ExecutorService."
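That attempt might look like this (a sketch, assuming RxJava 1.x; the Thread.sleep() at the end just keeps the main thread alive while work happens off-thread):

```java
import java.util.concurrent.ThreadLocalRandom;
import rx.Observable;
import rx.schedulers.Schedulers;

class SubscribeOnAttempt {
    static int intenseCalculation(int i) {
        try {
            System.out.println("Calculating " + i + " on " + Thread.currentThread().getName());
            Thread.sleep(ThreadLocalRandom.current().nextInt(200));
            return i;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Observable.range(1, 10)
                  .subscribeOn(Schedulers.computation()) // moves the WHOLE chain to one thread
                  .map(i -> intenseCalculation(i))
                  .subscribe(i -> System.out.println("Subscriber received " + i
                          + " on " + Thread.currentThread().getName()));

        Thread.sleep(3000); // keep the main thread alive
    }
}
```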

However, if you run this code you will get output like this, showing that every integer was emitted on a single computation thread, not several as you might have expected. As a matter of fact, this is as sequential as a single-threaded operation.

Calculating 1 on RxComputationThreadPool-1
Subscriber received 1 on RxComputationThreadPool-1
Calculating 2 on RxComputationThreadPool-1
Subscriber received 2 on RxComputationThreadPool-1
Calculating 3 on RxComputationThreadPool-1
Subscriber received 3 on RxComputationThreadPool-1
Calculating 4 on RxComputationThreadPool-1
Subscriber received 4 on RxComputationThreadPool-1
Calculating 5 on RxComputationThreadPool-1
Subscriber received 5 on RxComputationThreadPool-1
Calculating 6 on RxComputationThreadPool-1
Subscriber received 6 on RxComputationThreadPool-1
Calculating 7 on RxComputationThreadPool-1
Subscriber received 7 on RxComputationThreadPool-1
Calculating 8 on RxComputationThreadPool-1
Subscriber received 8 on RxComputationThreadPool-1
Calculating 9 on RxComputationThreadPool-1
Subscriber received 9 on RxComputationThreadPool-1
Calculating 10 on RxComputationThreadPool-1
Subscriber received 10 on RxComputationThreadPool-1

Well that was not helpful! We did not achieve any effective parallelization at all. We just directed the emissions to happen on another thread named RxComputationThreadPool-1.

So how do we make calculations happen on more than one computation thread? And do it without breaking the Observable contract? The secret is to catch each Integer in a flatMap(), create an Observable off it, do a subscribeOn() to the computation scheduler, and then perform the process all within the flatMap().
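As a sketch (again assuming RxJava 1.x, with shortened sleeps):

```java
import java.util.concurrent.ThreadLocalRandom;
import rx.Observable;
import rx.schedulers.Schedulers;

class ParallelExample {
    static int intenseCalculation(int i) {
        try {
            System.out.println("Calculating " + i + " on " + Thread.currentThread().getName());
            Thread.sleep(ThreadLocalRandom.current().nextInt(200));
            return i;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Observable.range(1, 10)
            .flatMap(i -> Observable.just(i)
                .subscribeOn(Schedulers.computation()) // each inner Observable gets scheduled
                .map(v -> intenseCalculation(v)))      // calculation happens inside the flatMap()
            .subscribe(i -> System.out.println("Subscriber received " + i
                    + " on " + Thread.currentThread().getName()));

        Thread.sleep(3000); // keep the main thread alive
    }
}
```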

Now we are getting somewhere and this looks pretty parallel now. We are getting multiple emissions happening on different computational threads.

Calculating 1 on RxComputationThreadPool-3
Calculating 4 on RxComputationThreadPool-2
Calculating 3 on RxComputationThreadPool-1
Calculating 2 on RxComputationThreadPool-4
Subscriber received 3 on RxComputationThreadPool-1
Calculating 7 on RxComputationThreadPool-1
Subscriber received 4 on RxComputationThreadPool-2
Calculating 8 on RxComputationThreadPool-2
Subscriber received 2 on RxComputationThreadPool-4
Calculating 6 on RxComputationThreadPool-4
Subscriber received 8 on RxComputationThreadPool-2
Subscriber received 6 on RxComputationThreadPool-4
Calculating 10 on RxComputationThreadPool-4
Subscriber received 7 on RxComputationThreadPool-1
Subscriber received 10 on RxComputationThreadPool-4
Subscriber received 1 on RxComputationThreadPool-3
Calculating 5 on RxComputationThreadPool-3
Subscriber received 5 on RxComputationThreadPool-3
Calculating 9 on RxComputationThreadPool-3
Subscriber received 9 on RxComputationThreadPool-3

But how is this not breaking the Observable contract you ask? Remember that you cannot have concurrent onNext() calls on the same Observable. We have created an independent Observable off each integer value and scheduled them on separate computational threads, making their concurrency legitimate.

Now you may also be asking "Well... why is the Subscriber receiving emissions from multiple threads then? That sounds an awful lot like concurrent onNext() calls are happening and that breaks the contract."

Actually, there are no concurrent onNext() calls happening. The flatMap() has to merge emissions from multiple Observables running on multiple threads, but it cannot allow concurrent onNext() calls to happen down the chain, including to the Subscriber. It will not block and synchronize either, because that would undermine the benefits of RxJava. Instead of blocking, it re-uses whichever thread is currently emitting out of the flatMap(). If other threads want to emit items while that thread is busy emitting, they will leave their emissions for the occupying thread to take ownership of.

The example above does not make this obvious, so here is proof. Let's collect the emissions into a toList() before they go to the Subscriber.
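A sketch of that variation (same assumptions as before):

```java
import java.util.concurrent.ThreadLocalRandom;
import rx.Observable;
import rx.schedulers.Schedulers;

class ToListProof {
    static int intenseCalculation(int i) {
        try {
            System.out.println("Calculating " + i + " on " + Thread.currentThread().getName());
            Thread.sleep(ThreadLocalRandom.current().nextInt(200));
            return i;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Observable.range(1, 10)
            .flatMap(i -> Observable.just(i)
                .subscribeOn(Schedulers.computation())
                .map(v -> intenseCalculation(v)))
            .toList() // collect everything before it reaches the Subscriber
            .subscribe(list -> System.out.println("Subscriber received " + list
                    + " on " + Thread.currentThread().getName()));

        Thread.sleep(3000); // keep the main thread alive
    }
}
```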

Notice that the calculations were all happening on different threads, and the flatMap() was pushing items out on one of these threads at a given time. But the last thread to do an emission was RxComputationThreadPool-3. This happened to be the thread pushing items out of flatMap() when it needed to emit the final value 9, so it called onCompleted() which pushed out the list all the way to the Subscriber.

Let me know if you have any questions. Here is the full working example.

About Fixing Thread Pools
Please note, as David mentioned in the comments, that this parallelization strategy does not work with all Schedulers. Schedulers other than computation() could flood the system with too many threads and therefore cause poor performance. The computation scheduler limits the number of concurrent processes that can happen inside the flatMap() based on the number of CPUs your machine has. If you want to use other schedulers like newThread() and io(), which do not limit the number of threads, you can pass a second int argument to flatMap() limiting the number of concurrent processes.

You can also create a Scheduler off an ExecutorService giving you more fine-tuned control if needed. Actually you can get significantly better performance doing this. You can read about it in my other article on Maximizing Parallelization.

Saturday, October 31, 2015

When I first started learning RxJava, I found it very difficult to wrap my head around even its most basic concepts. The idea of a Subscriber's onNext(), onCompleted(), and onError() calls made sense in isolation, but how did they work in a practical reactive process? I thought the Subscriber was the final operation in the chain that consumed the emissions, so when I called Observable.create(), I thought I was calling on the Subscriber at the end of the chain, not the operators between them.

Of course this is not the case, but to a reactive newbie this is not so obvious. I developed a relatable analogy that helped me understand and explain what happens in a reactive chain of Observable operations. This is more or less how I think of Observable operations.

The onNext()

The Observable emits items. I like to think of emissions as "handing" items to an Observer by calling its onNext() method, which accepts the item as an argument.

But Observable operators (like map(), flatMap(), etc.) create Observables that are both an Observable and an Observer.

Here is a relatable example. Take a moving crew (portrayed by my artistically crafted stick figures) passing boxes. A chain of Observable operations is much like a mover handing a box up a chain. Each time the box is pushed to the next person, the onNext() is called on the next guy to pass it.

You could say that the mover on the left is the "source" Observable, since he is the originator creating the emissions. He calls the middle guy's onNext() method to pass the box to him. The second guy in turn immediately calls the third guy's onNext() method, and so on. The final onNext() call will be on the final Subscriber. But the point is the second and third mover are observing the mover on their left. A mover reacts to a box getting pushed to him by turning around and pushing it to the next mover.

Here is a more thorough example showing these three movers each doing something more specific that can be represented in code.

Mover 1 on the far left is the source Observable. He creates the emissions by picking items out of the house. He then calls onNext() on Mover 2, who does a map() operation. When his onNext() is called, he takes that Item and puts it in a Box. Then he calls onNext() on Mover 3. He is the final Subscriber who loads the box on the truck.

If we were to express these three movers as an Observable chain of operations, this is what it might look like.
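Here is a sketch (assuming RxJava 1.x; the item names are made up). Note that this version has no onCompleted() yet:

```java
import rx.Observable;

class MoversExample {
    // Mover 1: the source Observable picking items out of the house,
    // then Mover 2: map() puts each item in a box
    static Observable<String> boxes() {
        Observable<String> mover1 = Observable.create(subscriber -> {
            subscriber.onNext("Lamp");
            subscriber.onNext("Chair");
            subscriber.onNext("TV");
            // no onCompleted() yet -- finite sources are covered next
        });
        return mover1.map(item -> "Box[" + item + "]");
    }

    public static void main(String[] args) {
        // Mover 3: the final Subscriber loads each box on the truck
        boxes().subscribe(box -> System.out.println("Loading " + box + " on truck"));
    }
}
```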

However, if there is a finite number of items in the house we need to implement onCompleted().

The onCompleted()

Observables can be infinite, which means they will never call onCompleted() and are expected to forever emit items. But in this example, it is unlikely there are infinite items in the house. When Mover 1 has removed all items from the house, he needs to tell Mover 2 they are done by calling his onCompleted() method. He will in turn tell Mover 3 and call his onCompleted() method. Mover 3 will then close the truck and shut down. Each mover will clean up and leave in this process as well.

We need to have mover1 call onCompleted() since we created that source Observable from scratch, and that way it will communicate its completion event up the chain. The map() operation on mover2 created an Observable that already implements a call to onCompleted(). It just needs that notification from mover1, which we implemented. Since mover3 is the final Subscriber, we overload another lambda specifying to close the truck when onCompleted() is called.
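Sketched out, that full chain might look like this (same assumptions as before):

```java
import rx.Observable;

class MoversCompleteExample {
    static Observable<String> mover1() {
        return Observable.create(subscriber -> {
            subscriber.onNext("Lamp");
            subscriber.onNext("Chair");
            subscriber.onNext("TV");
            subscriber.onCompleted(); // all items are out of the house
        });
    }

    public static void main(String[] args) {
        mover1().map(item -> "Box[" + item + "]")                      // mover2
                .subscribe(
                    box -> System.out.println("Loading " + box),       // onNext
                    Throwable::printStackTrace,                        // onError
                    () -> System.out.println("Closing the truck"));    // onCompleted
    }
}
```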

The onError()

Errors can obviously happen in any program, but the reactive approach handles them a bit differently. The exception is emitted up the chain so the Subscriber can decide how to handle it. Of course there are operators to intercept the exception, and perform policies like resume next or retry a specified number of times. But generally you should have the Subscriber handle exceptions in some way.

In this case, the map() operation failed. Mover 2 accidentally broke the item while trying to put it in a box. He calls onError() on Mover 3 who happens to be the final Subscriber, and Mover 3 comes down with a broom to handle the error.

Coding-wise, we have to handle the source Observable and catch any errors to pass up the chain. The error in this example does not occur at the source however. It occurs in the middle of the chain in the map() operation, which already implements passing the error up the chain. We also overload the Subscriber with another lambda to handle the error.
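A sketch of the error scenario (assuming RxJava 1.x; the "Glass Vase" item and the breaking condition are invented for illustration):

```java
import rx.Observable;

class MoversErrorExample {
    public static void main(String[] args) {
        Observable<String> mover1 = Observable.create(subscriber -> {
            try {
                subscriber.onNext("Glass Vase");
                subscriber.onCompleted();
            } catch (Exception e) {
                subscriber.onError(e); // errors caught at the source get passed up the chain
            }
        });

        mover1.map(item -> {
                  // map() catches this exception and forwards it via onError()
                  if (item.contains("Glass")) {
                      throw new RuntimeException(item + " broke!");
                  }
                  return "Box[" + item + "]";
              })
              .subscribe(
                  box -> System.out.println("Loading " + box),
                  e -> System.out.println("Sweeping up: " + e.getMessage()), // onError lambda
                  () -> System.out.println("Closing the truck"));
    }
}
```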

Saturday, January 31, 2015

Multithreading can be hard. When you have several threads modifying and contending for shared objects, variables, and resources... many things can go wrong if you are not deliberate and careful.

In Java, that is why it is critical to utilize synchronizers like CountDownLatch, Semaphore, and CyclicBarrier (in addition to other multithreading tools like the volatile and final keywords). These all ensure thread safety or prevent threads from proceeding until certain conditions are met and it is safe to proceed.

An analogy that I find really helpful is comparing this to the Kentucky Derby. The race horses (the "threads") walk up to the gate (the "synchronizer"), but the gate is closed so they cannot proceed any further (in a state of "await()"). Once a certain condition is cleared, the gate opens ("notifyAll()") and the horses are free to charge forward.

The CountDownLatch is probably the synchronizer I use the most. Its policy is simple and effective. It starts at a specified number on construction, and decrements on each call to "countDown()". When it reaches zero, the CountDownLatch lets any other threads waiting on it pass through.

If you have three tasks each running on a separate thread, but you don't want the main thread to proceed until these three tasks are done, create a CountDownLatch and let each task countDown() it when they are done. Have the main thread await() on the CountDownLatch, and when all three tasks are done the main thread will proceed.
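That scenario can be sketched like this (the task bodies are placeholders):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LatchExample {
    static boolean runThreeTasks() {
        CountDownLatch latch = new CountDownLatch(3);
        ExecutorService exec = Executors.newFixedThreadPool(3);

        for (int i = 1; i <= 3; i++) {
            final int task = i;
            exec.submit(() -> {
                System.out.println("Task " + task + " finished");
                latch.countDown(); // decrement the latch as each task completes
            });
        }

        try {
            return latch.await(10, TimeUnit.SECONDS); // the waiting thread blocks here
        } catch (InterruptedException e) {
            return false;
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) {
        if (runThreeTasks()) {
            System.out.println("All three tasks are done, main thread proceeds");
        }
    }
}
```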

The BufferedLatch

However, what if we don't know what the count will be in advance? We do not have a number for the CountDownLatch to start at, and we will only know it later in the process. I ran into this issue a lot dealing with buffered streams from database queries. As I looped through each database record, I wanted to kick it off into a task and submit it to a fixed threadpool. The problem is, once all the records are looped through, how do I pause until all the Runnables processing those records are complete?
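Here is a conceptual sketch of that race condition. An in-memory list stands in for the JDBC ResultSet so the example is runnable, and the FinanceDay type and processFinanceDay() method are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PrematureReturn {
    static final class FinanceDay {
        final String date;
        FinanceDay(String date) { this.date = date; }
    }

    static void processFinanceDay(FinanceDay day) {
        System.out.println("Processing " + day.date + " on " + Thread.currentThread().getName());
    }

    public static void main(String[] args) {
        ExecutorService exec = Executors.newFixedThreadPool(4);
        List<String> records = Arrays.asList("2015-01-01", "2015-01-02", "2015-01-03");

        for (String record : records) {              // stands in for while (rs.next())
            FinanceDay day = new FinanceDay(record); // convert the record to an object
            exec.submit(() -> processFinanceDay(day));
        }

        exec.shutdown();
        // PROBLEM: without a synchronizer, this method can "finish" while the
        // submitted tasks are still processing FinanceDay objects
        System.out.println("Done looping, but tasks may still be running!");
    }
}
```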

Looking at the conceptual code above, we are looping through each record in the query as each record comes in. But we do not know how many records there will be until rs.next() returns false and we finish looping. This is problematic if we are going to convert each record into an object (in this case "FinanceDay"), and then pass that object to a Runnable that runs on an entirely separate thread. If we don't find a synchronization solution, we have an enormous risk of creating a race condition where this method will finish prematurely. Upon completion, the FinanceDay objects may still be getting processed and are not ready for the application to use.

The BufferedLatch solves this problem. Its purpose is very similar to the CountDownLatch, except it is for cases where the countdown number is unknown, and will not be known until later mid-process.

There are three methods that control the state of the synchronizer (incrementRecordCount(), incrementProcessedRecordCount(), and setIterationComplete()) as well as a method for waiting threads (await()).

incrementRecordCount() is called every time a record is iterated.

incrementProcessedRecordCount() is called every time a runnable of that record is completed.

setIterationComplete() is only called once by the looping thread to flag that the iteration is complete.

Any threads waiting for the runnables to complete will need to call await(). In my uses, this always has been the thread that does the iteration and calls setIterationComplete(), and then it calls await() and sits until all the runnables are done.

For our example above, this is how BufferedLatch would be implemented.
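Here is a runnable sketch based on the description above; my BufferedLatch here is reconstructed from the described behavior (the author's actual implementation may differ), and an in-memory list again stands in for the ResultSet:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class BufferedLatchDemo {

    // A sketch of BufferedLatch based on the described behavior
    static final class BufferedLatch {
        private final AtomicInteger recordCount = new AtomicInteger();
        private final AtomicInteger processedCount = new AtomicInteger();
        private volatile boolean iterationComplete = false;

        void incrementRecordCount() {
            if (iterationComplete) {
                throw new RuntimeException("No records can be added after setIterationComplete()");
            }
            recordCount.incrementAndGet();
        }

        void incrementProcessedRecordCount() {
            processedCount.incrementAndGet();
            maybeRelease();
        }

        void setIterationComplete() {
            iterationComplete = true;
            maybeRelease();
        }

        synchronized void await() throws InterruptedException {
            while (!(iterationComplete && processedCount.get() == recordCount.get())) {
                wait();
            }
        }

        private synchronized void maybeRelease() {
            if (iterationComplete && processedCount.get() == recordCount.get()) {
                notifyAll();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BufferedLatch latch = new BufferedLatch();
        ExecutorService exec = Executors.newFixedThreadPool(4);
        List<String> records = Arrays.asList("2015-01-01", "2015-01-02", "2015-01-03");

        for (String record : records) {             // stands in for while (rs.next())
            latch.incrementRecordCount();           // count the record as it is iterated
            exec.submit(() -> {
                System.out.println("Processed " + record);
                latch.incrementProcessedRecordCount(); // count the record as processed
            });
        }

        latch.setIterationComplete(); // no more records are coming
        latch.await();                // block until every record is processed
        exec.shutdown();
        System.out.println("All records iterated and processed");
    }
}
```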

The way this works now is a BufferedLatch is created before any recordset iteration starts. After a record is iterated, converted to a FinanceDay object, and passed off as a Runnable to the executor, incrementRecordCount() is called.

When a Runnable (the lambda passed to the executor's submit() method) finishes processing the FinanceDay, it calls incrementProcessedRecordCount().

After the entire ResultSet is looped through, setIterationComplete() is called to flag that no more records are coming in. The query has been iterated through completely. If incrementRecordCount() is called after that, a RuntimeException will be thrown, because iterating records should not happen after setIterationComplete() is called.

Finally, the original thread will come to the latch's await() method and will pause until the runnables complete by calling incrementProcessedRecordCount() enough times to match the recordCount. After that, every record has been iterated and processed, and the original thread is now free to shutdown the executor and move on.

Conclusions

My only regret is this latch does add some boilerplate to the client code by having four different methods that need to be called, where CountDownLatch typically only has two (countDown() and await()). If anybody has suggestions I am very willing to hear them. But I have not found a latch that accomplishes anything like this, perhaps because this problem is somewhat niche.

One disclaimer: like any multithreading decision... first evaluate if it is even worth multithreading the task in question. Test and ensure there will be performance gains over a single-threaded approach. If your database query is quick, it might be worth importing all the data first before doing anything with it. But if you have worked with painfully slow data connections like me, or are issuing an intensive query, you may want to utilize that idle CPU time and use the solution above.

Lazy initialization is deferring calculation of a value until it is needed, and then caching it for all uses thereafter. It is often used to hold off an expensive calculation and then save the result for future uses.
A typical lazy initialization would look something like this...
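Something like this, as a sketch (the Account class and balance calculation here are made-up stand-ins):

```java
import java.math.BigDecimal;

class Account {
    private BigDecimal balance; // starts out null

    public BigDecimal getBalance() {
        if (balance == null) {            // only calculate on the first call
            balance = calculateBalance();
        }
        return balance;                   // cached for every call after
    }

    private BigDecimal calculateBalance() {
        return new BigDecimal("100.00");  // stand-in for an expensive calculation
    }
}
```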

The "balance" variable is initially null. When the getBalance() method is called, it first checks if it is null and calculates and assigns the "balance" value before returning it. After the initial call, it will not have to calculate it again as the "if (balance == null)" condition will return false.

Lazy initialization is not optimal for most cases. You should always strive to initialize values on construction. Even better, you should make them final and immutable if possible.

But there are certainly cases where lazy initialization is applicable, especially where properties are either a) expensive to build or b) dependent on the object already being constructed. I have found lazy initialization to be greatly needed in business decision process applications that go through a lot of decision calculations. This is not only expensive but also requires the business objects to be constructed and fully aware of themselves before they can engage with any algorithms. This is a good use of lazy initialization.

However, lazy initialization gets even more complicated when you multithread your application (which is inevitable for any reasonably complex application). To prevent race conditions or contention between threads, you have to synchronize on a lock and protect the "balance" value, and you have to check the null condition twice.
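The thread-safe version looks something like this (a sketch of double-checked locking; note I have also marked the field volatile, which the pattern needs to be safe under the Java memory model):

```java
import java.math.BigDecimal;

class ThreadSafeAccount {
    // volatile ensures other threads see the fully constructed value
    private volatile BigDecimal balance;
    private final Object lock = new Object();

    public BigDecimal getBalance() {
        if (balance == null) {            // first check, without the lock
            synchronized (lock) {
                if (balance == null) {    // second check, while holding the lock
                    balance = calculateBalance();
                }
            }
        }
        return balance;
    }

    private BigDecimal calculateBalance() {
        return new BigDecimal("100.00");  // stand-in for an expensive calculation
    }
}
```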

If you have five properties that use lazy initialization in a multithreaded environment, your class getter methods can get messy very quickly like the one above. This pattern is very repetitive, and therefore is a good candidate to be encapsulated into a utility class. With a supplier lambda expression, it becomes even easier.

Here is LazyProperty, a solution to fulfill that need using all the means described.
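Here is a sketch reconstructed from the description that follows (the author's actual implementation may differ); the optional reset() method discussed later is included as well:

```java
import java.util.function.Supplier;

final class LazyProperty<T> {
    private final Supplier<T> supplier;
    private volatile T value;

    private LazyProperty(Supplier<T> supplier) {
        this.supplier = supplier;
    }

    public static <T> LazyProperty<T> forSupplier(Supplier<T> supplier) {
        return new LazyProperty<>(supplier);
    }

    public T get() {
        if (value == null) {              // double-checked locking, encapsulated once
            synchronized (this) {
                if (value == null) {
                    value = supplier.get();
                }
            }
        }
        return value;
    }

    public synchronized void reset() {
        value = null;                     // the next get() recalculates via the supplier
    }
}
```

Usage is then a one-liner per property, e.g. `LazyProperty<BigDecimal> balance = LazyProperty.forSupplier(() -> calculateBalance());` followed by `balance.get()` wherever the value is needed.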

LazyProperty takes care of this entire pattern and is generic so it works with any type. It also uses a lambda supplier to provide instructions on how to construct the value. Invoking LazyProperty is simple. Call LazyProperty.forSupplier() and pass a supplier lambda to the static method. It will return a new LazyProperty which will create and cache the value once called. It will also manage all thread synchronization so it is threadsafe. Use the get() method to extract the value out of LazyProperty.

Just make sure you do not create a stack overflow or null pointer exception by having the supplier call the get() method of the very LazyProperty it is populating. To my understanding, you can safely use references to "this" in the supplier; the lambda holds a reference to the class, which may be in a partially constructed state, but if implemented properly, get() will not be called until the entire class is constructed.

Optionally, you can implement a reset() method to clear the cached value so it can be recalculated again.

The reset() method is also synchronized in similar fashion to the get() by only allowing one thread to check and change the value. Some die hard Immutable-ists like me could argue this introduces undesirable mutability and may present opportunity for bad mutable designs, but I cannot see the harm since the supplier is final. Assuming the supplier is designed correctly, the supplier should always return the same value every time, or the most up-to-date value. Abuse could come from excessive calls to reset() which may hit performance. It could also encourage bad designs by allowing reuse of objects rather than creating new ones once properties change. But if used intelligently, it is nice to have reset() available for certain solutions.