Tuning an Akka application

The first section presents the application (actually, a simple microservice) that will be used throughout this post to illustrate the tuning process.

The second section describes the load tests and tools that will be used to measure the application performance.

The next section presents the different test scenarios and the results obtained for each of them.

The last section provides some final considerations when it comes to extrapolating the results to a Production environment.

Application description

Briefly, the application under study is a microservice that receives requests through a REST endpoint and in turn calls a third party SOAP endpoint; the SOAP response is then enriched with data extracted from a Redis database and the final response is sent back to the client.

Components

The application is based on Akka 2.4.11 and Scala 2.11.8 and has these components:

REST endpoint exposed through Akka HTTP

Apache Camel’s CXF component to connect to a third party SOAP endpoint

Rediscala driver to query the Redis database

Thread pools

Now, what is the representation of all these components at runtime?

The components described in the previous section are nice high level abstractions to let the developers do their job easily. However, when it comes to tuning the application, it is necessary to move to a lower level of abstraction in order to analyse the different threads that run the application.

The thread pools found after taking a thread dump of the application are listed below:

akka.actor.default-dispatcher

Description: used by Akka to run the actors.

Default parameters: the default dispatcher uses the default fork-join-executor with these values:

parallelism-factor = 3.0
parallelism-min = 8
parallelism-max = 64

rediscala.rediscala-client-worker-dispatcher

Description: used by the Redis driver to run Redis requests/replies.

Default parameters: same as the default dispatcher above.

ForkJoinPool-2-worker

Description: used by scala.concurrent.ExecutionContext.Implicits.global to run all tasks submitted to it.

Default parameters: the default ExecutionContext uses the Scala ForkJoinPool with these values:

parallelism-factor = number of available processors
parallelism-min = number of available processors
parallelism-max = number of available processors

These values can be modified with the System properties 'scala.concurrent.context.numThreads', 'scala.concurrent.context.minThreads' and 'scala.concurrent.context.maxThreads'.

ForkJoinPool.commonPool-worker

Description: used by java.util.concurrent.ForkJoinPool.common to run all tasks submitted to it.

Default parameters: the Java ForkJoinPool uses this value:

parallelism = number of available processors - 1

This value can be modified with the System property 'java.util.concurrent.ForkJoinPool.common.parallelism'.

default-workqueue

Description: used by the Camel CXF component to execute the SOAP calls (this is the "camel connector thread pool" referred to in the 40-client test below).

Default parameters: the pool size can be modified with the 'lowWaterMark' and 'highWaterMark' properties of the bean 'AutomaticWorkQueueImpl'; with the defaults it is capped at 25 threads.

StatsD-pool

Description: used by the StatsD client to send metrics to the StatsD server.

logback

Description: used by Logback to write logs.
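For reference, the Akka defaults above can be overridden in application.conf. Here is a minimal sketch using the same keys listed above (the values shown are simply the defaults, not a recommendation):

akka {
  actor {
    default-dispatcher {
      fork-join-executor {
        parallelism-factor = 3.0
        parallelism-min = 8
        parallelism-max = 64
      }
    }
  }
}

Likewise, the Scala and Java fork-join pools can be resized on the command line, e.g. with -Dscala.concurrent.context.numThreads=16 or -Djava.util.concurrent.ForkJoinPool.common.parallelism=4.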

Configuration

The application is configured to use a router of actors whose number of routees can be changed between tests. This will allow us to explore the behaviour of the application when combining different numbers of threads and routees.
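As a rough illustration of that setup (the actor class and the configuration key below are made up for the example, not taken from the actual application):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Stand-in for the actor that orchestrates the SOAP call and the Redis enrichment
class RequestHandler extends Actor {
  def receive = { case msg => sender() ! msg }
}

object RouterSetup extends App {
  val system = ActorSystem("tuning-demo")
  // Hypothetical key: the point is that the number of routees is externalised so it can change between tests
  val nrOfRoutees = system.settings.config.getInt("app.nr-of-routees")
  val router = system.actorOf(RoundRobinPool(nrOfRoutees).props(Props[RequestHandler]), "handler-router")
}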

The default dispatcher delegates the calls to the third party to scala.concurrent.ExecutionContext.Implicits.global inside a blocking block. The role of the blocking block is to ensure that parallelism level is maintained despite the blocking operation.

scala.concurrent.ExecutionContext.Implicits.global is also used to process all other Future operations in the application, including Redis responses. Admittedly, having too many operations sharing the limited number of threads of scala.concurrent.ExecutionContext.Implicits.global may lead to starvation. That is why wrapping the call to the third party in a blocking block is so critical.
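A minimal sketch of that pattern (the SOAP call is simulated here with a sleep; the real one goes through the Camel CXF endpoint):

import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BlockingCallSketch extends App {
  // Stand-in for the synchronous third party SOAP call
  def callSoapEndpoint(payload: String): String = {
    Thread.sleep(500) // simulate third party latency
    s"response for $payload"
  }

  // blocking {} hints the underlying fork-join pool that it may spawn extra threads,
  // so the configured parallelism level is preserved despite the blocking operation
  val response: Future[String] = Future {
    blocking {
      callSoapEndpoint("some request")
    }
  }

  println(Await.result(response, 5.seconds))
}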

On the other hand, the Redis driver uses its own thread pool, so there is no risk of blocking threads shared with other operations.

Note: the dispatcher used by the actors plays a role similar to the event loop in languages like Node.js. Therefore, it is of paramount importance that the dispatcher threads never block, as that would stall the entire Akka machinery. All blocking calls must be delegated to some other thread pool.

When the application runs on my laptop, which has 8 processors, the number of threads is determined by the default configuration listed above:

default dispatcher: 24 threads (8 processors × parallelism-factor 3.0)

scala.concurrent.ExecutionContext.Implicits.global: 8 threads (one per processor)

java.util.concurrent.ForkJoinPool.common: 7 threads (number of processors - 1)
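These sizes follow from clamping processors × parallelism-factor between parallelism-min and parallelism-max; a quick sketch of the arithmetic (the clamp formula is a simplification of what the executors actually do):

object PoolSizes extends App {
  val cores = Runtime.getRuntime.availableProcessors // 8 on this laptop

  def scaledPoolSize(factor: Double, min: Int, max: Int): Int =
    math.min(math.max(math.ceil(cores * factor).toInt, min), max)

  println(scaledPoolSize(3.0, 8, 64))        // default dispatcher: 24
  println(scaledPoolSize(1.0, cores, cores)) // scala.concurrent global ExecutionContext: 8
  println(cores - 1)                         // java.util.concurrent.ForkJoinPool.common: 7
}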

Load Test description

The process of tuning the application relies on monitoring it under a heavy workload. To generate traffic on the application, we will use a Jmeter script driven by the following variables:

target.concurrency: number of concurrent clients calling the microservice

ramup.time: time (in seconds) to hit the concurrency target

ramup.steps: number of steps to reach the concurrency target; it determines the user arrival rate

target.time: span of time (in seconds) during which the test runs after reaching the concurrency target; therefore, the total duration of the test is ramup.time + target.time

The ramp-up time is a transitory period: the shorter it is compared to the target time, the more accurate the results will be.

Redis runs in a Docker container on localhost and its latency is of the order of a few milliseconds (which makes the third party's latency the dominant factor when it comes to blocking operations).
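For reference, a typical way to start such a container (standard Redis image and default port; the exact options used here may differ):

docker run -d --name redis -p 6379:6379 redis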

Tools

This section describes the different tools used to run and monitor the tests.

Jmeter

First of all, we need a Jmeter script to generate traffic. Here is the properties file used with the script:

# IP to connect to the service
host.ip=localhost
# Port to connect to the service
host.port=8080
# Number of concurrent clients calling the service
target.concurrency=xx
# Time (in seconds) to hit the concurrency target
ramup.time=xx
# Number of steps to reach the concurrency target
ramup.steps=xx
# Span of time (in seconds) during which the test runs after reaching the concurrency target
# Therefore, the total duration of the test is ramup.time + target.time
target.time=xx

And the command to run the script:

jmeter -n -t jmeterScript.jmx -p jmeterProperties.properties

Thread analyser

Based on the jstackSeries script, here is a thread sampler that takes thread dumps at regular intervals while the application runs and summarises how the number of threads of each type evolves. This thread sampler will help us gather information about the behaviour of the threads during the load test runs.
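The script itself is not reproduced here, but its core is essentially a loop around jstack; a minimal sketch (argument order and file naming are illustrative, not the original script's):

#!/usr/bin/env bash
# usage: ./threadSampler.sh <pid> <label> <count> <interval-seconds>
pid=$1; label=$2; count=$3; interval=$4
for i in $(seq 1 "$count"); do
  jstack "$pid" > "jstack.${label}.$(date +%H%M%S).txt"
  sleep "$interval"
done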

Thread dump analyser

In order to examine the content of the thread dumps in detail, a tool like https://github.com/irockel/tda comes in handy. It is a Java application that can be run from its JAR file:

java -jar <tda home>/tda.jar

Redis connections script

To have the entire picture, it is also necessary to monitor the number of connections to Redis. Here is the script used to count the number of connections (as mentioned, we are running Redis inside a Docker container).
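The script is not reproduced here either; a rough equivalent, assuming the container is simply named redis, is to poll the client count with redis-cli:

docker exec redis redis-cli info clients | grep connected_clients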

Test scenarios

20 clients, 20 routees, parallelism-factor=3

The command "jstackSeries.sh 2297 _20-20-3 4 30" yields this result. As expected, the number of threads increases as new concurrent clients are added during the test.

Remarkably, the thread pool ForkJoinPool-2-worker (corresponding to scala.concurrent.ExecutionContext.Implicits.global) has exceeded its maximum of 8 threads. As discussed above, this is down to the use of the blocking statement to enclose the blocking call to the SOAP endpoint.

20 clients, 1 routee, parallelism-factor=3

Results are similar to the previous ones. Therefore, it turns out that one actor can handle the same amount of traffic as 20. This makes sense, as the actor does not perform any blocking operation and is therefore lightning fast.

A further consequence is that, as only 1 thread at a time can run inside an actor, a single thread in the default dispatcher should be enough.

Moreover, given that there is only 1 routee, there is just 1 Redis connection. Again, this does not seem to penalise the performance (Redis operations take just a few milliseconds, which is negligible compared to the dominant latency of the SOAP endpoint).
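For the next scenario, the default dispatcher can be pinned to a single thread through configuration; a minimal sketch in application.conf (reducing the router to 1 routee is done separately):

akka.actor.default-dispatcher {
  fork-join-executor {
    parallelism-min = 1
    parallelism-max = 1
  }
}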

40 clients, 1 routee, 1 thread

So far the results have remained stable, meaning that the application has enough capacity to handle up to 20 concurrent clients regardless of its configuration.

However, when changing to 40 concurrent clients, the application hits the limit of 25 threads imposed by the Camel connector thread pool. Therefore, the throughput is capped at 25 requests/sec and, as a consequence, the response time goes up, as there is not enough capacity to handle 40 concurrent clients.

We will stop here, as the few examples discussed in this post give a good idea of the different factors to take into account when dealing with an Akka application (and, in general, any application running on the JVM).

Final thoughts

Although 1 actor with 1 thread has proven to be enough to handle the scenarios proposed in this post, it would be better to take advantage of all 8 processors. Admittedly, it does not make any difference for the examples considered, but it will for much higher volumes.

When running perf tests, it is very important to remember that on Prod the number of processors may be different, and therefore so will the sizes of the different thread pools. This is especially true when deploying on the cloud, as DevOps will tend to choose the smallest available VMs in order to cut down costs. As a consequence, the number of processors on Prod is likely to be smaller than on your laptop!