Search This Blog

Improving Performance with Parallelisation and Database Streaming

My current workplace is in the process of shifting services from another ESB implementation to MuleSoft's platform and has already re-written a number of services in a "like-for-like" process - that is, the MuleSoft implementation is as close to a copy/paste of the old implementation as possible, without bug fixes or improvements. While this might sound odd to some, it's part of the bigger plan - first a seamless transition to Mule then enhancements, fixes and optimisations.

As part of this transition, one such service - a search service - has already been migrated. This service provides functionality to support a web application, allowing its users to search for information in real time.

Recently I was asked to have a look at the performance of the search. Even though a like-for-like implementation had been provided, in some cases the search was timing out after 30 seconds - and this is not ideal for a synchronous service which has a user waiting for the results.

It's worth noting a few things here. First, the data model - there are a number of tables in a relational database (Oracle in this case) - there's one core table and a number of tables containing additional information, using 1:N relationships. Next, the maximum number of search results - limited to 2,500 records.

Here's how the flow looked initially:

Breaking it down, there's a number of steps:

Perform a query on the main table to identify the candidate matches

For each match:

Read related rows from each of 3 other tables

Transform and merge the results from each into a single record

Merge all the result records into one payload

The following is a sample log output from this flow, when invoked with a search that returns all possible 2,500 records:

Performance Improvement Test #1 - Scatter/Gather

The obvious issue here is that we're doing 3 additional queries for every match candidate returned - that's 7,501 total queries for the maximum result count! There's a simple change here that might have a noticeable impact on performance - that is, running the 3 queries in parallel for each candidate match - and we can do this easily using Mule's Scatter-Gather processor:

While this did result in an improvement - around 6 seconds, so 28 seconds in all - it was not enough to bring the overall processing time to a suitable level.

Performance Improvement Test #2 - Query Optimisation

With the easy choice out of the question, it's time to face the obvious - 7,501 queries is just too many. If we were using a "NoSQL" document database such as MongoDB, we could store the information in question in a much more suitable format, however like many legacy applications this is not the case - we have the option of either performing the 7,501 queries, or re-writing the query using a cartesian join and returning a (potentially) huge number of rows - or so you'd think.

Luckily Oracle has the capability of performing queries - including nested subqueries - and returning the results as XML in a CLOB (newer versions can also produce JSON, but we're not there yet). This means it was possible to rewrite the query such that the entire 7,501 could be replaced with just one. The actual details of that query have no place in this blog though, however it did lead to a simplified flow which now looked like the following:

As you'd expect, this produced significant gains - we're down to around 16 seconds now. Surely we can do better though, right?

Performance Improvement Test #3 - Database Streaming

After adding some logging it was noted that the bulk of the time in the previous test was taken in the database call, probably as expected, but did it need to take so long? Indeed, running the same query directly in Oracle took nowhere near the same time - so what's happening? Well, as streaming was disabled (the default), Mule was running the query, retrieving all the results, and transforming them into a Map - this means the flow's thread spent a lot of time paused, waiting for the results to appear.

Enabling streaming on the database component (as simple as ticking the box on the configuration window) allows Mule to retrieve a chunk at a time and process that chunk before going back for more - potentially reducing the amount of wait time. In our case, we also defined the fetch (chunk) size as 200 rows, which proved to be optimal.

The one consideration, however, is that the query needs to be changed - instead of returning a single XML document with all matches inside, it now needs to return an XML document per match so that we can iterate through the results. Here's the new flow:

We've achieved a lot so far, but we're still processing the results one by one. While the transformations being done themselves are not complicated, each one does need to convert a CLOB to a string - and that involves reading from a stream which is coming over the network. The final test involves seeing what we can achieve if we do this in parallel - using a multi-threaded approach.

The key to this test is Mule's Request-Reply processor and VM queues. When put together, these can allow you to embed a block of multi-threaded, asynchronous processing within a synchronous flow - but there are some tricks...

Consider the following flow:

Here we have the same query and the same database streaming, however we're pushing the results to a VMQ for processing, then waiting for the responses.

First, we're streaming the results from the database and pushing each to the VMQ individually - so we're processing each result in a separate thread. Note the For Each scope around the outbound VM connector - the payload after a streaming database operation is a result set, not a collection, so needs to be iterated.

Next, notice the Async scope around the processing in the queued flow - this allows us to configure how many threads can run concurrently to process the items.

Finally, notice the unusual usage of the Request-Reply scope. Usually this scope allows you to call a processor with a one-way exchange pattern and treat it like a request-reply pattern, however in this case we have a one-way send, but the flow at the other end is actually sending a reply (by writing to the response queue).

Why is this unusual? Well, the queued flow includes an asynchronous block - this is executed in a separate thread so it's result would normally be lost - that is, the Request-Reply would usually not see it's result. Also, we send to the request queue inside a For Each, however we only have a single read in the Request-Reply processor.

So how does it work? Well, the trick is only visible in the configuration XML:

There's a number of points here:

As we're pushing the items to the VMQ one by one from a result set, we need to manually set the correlation group size. If we were not streaming the database results here we'd have a List and could replace the For Each with a Collection Splitter, in which case Mule could set this value - however seeing we're iterating over the results Mule does not know how many records there are. In this case I had to modify the query so that each row includes the total row count so that it's available here.Note that if this is not set, Mule would not wait for all results to be returned - as soon as one was received it would proceed.

The default behaviour of the Request-Reply scope causes the payload at the end of the flow to be returned in the Reply processor - in this case that's not what we want. Instead, we want to wait for the results from the Async block, which we're manually sending.Removing the MULE_REPLYTO header is a quick "hack" which stops this default behaviour.

The Reply processor includes a resequencer and aggregator - this ensures the results are returned as a collection, in the same order as the requests were sent. If you're not bothered with order, you can leave out the resequencer.

The Request-Reply processor is wrapped in a Choice. In testing we found that using this approach causes the flow to block indefinitely if there were no database records returned - that is, the For Each never actually sends anything however the VMQ read still waits for results.

Conclusion

As we've seen, the process has reduced from around 35 seconds to around 7 seconds by optimising the database access and parallelisation of the results processing using multiple threads and VMQs, so there's a lot to think about when structuring your flows.

Keep in mind also that if this flow was deployed into a Mule cluster, the usage of VMQs would allow the spreading of the load not only to other threads, but also to other nodes in the cluster.

Excellent! One thing that stands-out here is: using the best of each tool/product.1. Databases (especially Oracle) is so powerful, that it allows query optimization before sending it to the client. That is very crucial here.2. Client side: parallelism and multi-threadingThank you for sharing this.

About Me

Brad is a software developer with 20 years experience in Java and 3+ years working with the MuleSoft suite of tools. He holds a number of MuleSoft certifications, including MCD Integration Professional.