Search This Blog

It is not a secret that Java servlet containers aren't particularly suited for handling large amount of concurrent users. Commonly established thread-per-request model effectively limits the number of concurrent connections to the number of concurrently running threads JVM can handle. And because every new thread introduces significant increase of memory footprint and CPU utilization (context switches), handling more than 100-200 concurrent connections seems like a ridiculous idea in Java. At least it was in pre-Servlet 3.0 era.

In this article we will write scalable and robust file download server with throttled speed limit. Second version, leveraging Servlet 3.0 asynchronous processing feature, will be able to handle ten times bigger load using even less threads. No additional hardware required, just few wise design decisions.

Token bucket algorithm

Building a file download servers we have to consciously manage are our resources, especially network bandwidth. We don't want a single client to consume the whole traffic, we might even want to throttle the download limit dynamically at runtime, based on user, time of the day, etc. - and of course everything happens during heavy load. Developers love reinventing the wheel, unfortunately all our requirements are already addressed by absurdly simple token bucket algorithm.

The explanation in Wikipedia is pretty good, but since we'll adjust the algorithm a bit for our needs, here's even simpler description. First there was a bucket. In this bucket there were uniform tokens. Each token is worth 20 kiB (I will be using real values from our application) of raw data. Every time a client ask for a file, the server tries to take one token from the bucket. If it succeeds, he sends 20 kiB to the client. Repeat last two sentences. What if the server fails to obtain the token because the bucket is already empty? He waits.

So where are the tokens coming from? Background process fill the bucket from time to time. Now it becomes clear. If this background process adds 100 new tokens every 100 ms (10 times per second), each worth 20 kiB, the server is capable of sending 20 MiB/s (100 times 20 kiB times 10) max, shared amongst all the clients. Of course if the bucket is full (1000 tokens), new tokens are ignored. This works amazingly well – if bucket is empty, clients are waiting for next bucket filling cycle; and by controlling the bucket capacity we can limit total bandwidth.

Enough of talking, our simplistic implementation of token bucket starts with an interface (whole source code is available on GitHub in global-bucket branch):

takeBlocking() methods are waiting synchronously for the token to become available, while tryTake() are taking token only if it is available, returning true immediately if taken, false otherwise. Fortunately the term bucket is just an abstraction: because tokens are indistinguishable, all we need to implement bucket is an integer counter. But because the bucket is inherently multi-threaded and some waiting is involved, we need more sophisticated mechanism. Semaphore seems to be almost ideal:

Semaphore fits exactly to our requirements. bucketSize represents current amount of tokens in the bucket. bucketCapacity on the other hand limits the bucket maximum size. It is volatile because it can be modified via JMX (visibility):

As you can see I am using Spring and its support for JMX. Spring framework isn't absolutely necessary in this application, but it brings some nice features. For instance implementing a background process that periodically fills the bucket looks like this:

For the record: (1) tiny XML snippet is required to make @Scheduled annotation working and (2) – this code contains major multi-threading bug that we can ignore for the purposes of this article. It is suppose to fill the bucket up to the maximum value – will it always work?

Having token bucket abstraction and very basic implementation we can develop the actual servlet returning files. I am always returning the same fixed file with size of almost 200 kiB):

HttpRequestHandlerServlet was used here. As simple as can be: read 20 kiB of file, take the token from the bucket (waiting if unavailable), send chunk to the client, repeat until the end of file.

Believe it or not, this actually works! No matter how many (or how few) clients are concurrently accessing this servlet, total outgoing network bandwidth never exceeds 20 MiB! The algorithm works and I hope you get some basic feeling how to use it. But let's face it – global limit is way too inflexible and kind of lame – single client can actually consume your whole bandwidth.

So what if we had a separate bucket for each client? Instead of one semaphore – a map? Each client has a separate independent bandwidth limit, so there is no risk of starvation. But there is even more:

some clients might be more privileged, having bigger or no limit at all,

some might be black listed, resulting in connection rejection or very low throughput

we might try to correlate concurrent requests coming from the same client and use the same bucket for all of them to avoid cheating by opening several connections. We might also reject subsequent connections

and much more...

Our bucket interface grows allowing the implementation to take advantage of the new possibilities (see branch per-request-synch):

The implementation is very similar (full class here) except that the single semaphore was replaced by map. I am not using request object itself as a map key for various reasons but a unique request number that I am assigning manually when receiving new connection. Calling completed() is very important, otherwise the map would grow continuously leading to memory leak. All in all, the token bucket implementation haven't changed a lot, also the download servlet is almost the same (except passing request to token bucket). We are now ready for some stress testing!

Throughput testing

For the testing purposes we will use JMeter with this wonderful set of plugins. During the 20-minute testing session we warm up our server firing up one new thread (concurrent connection) every 6 seconds to reach 100 threads after 10 minutes. For the next ten minutes we will keep 100 concurrent connections to see how stable the server works:

Important note: I artificially lowered the number of HTTP worker threads to 10 in Tomcat (7.0.10 tested). This is a far from real configuration, but I wanted to emphasize some phenomena that occur with high load compared to server capabilities. With default pool size I would need several client machines running distributed JMeter session to generate enough traffic. If you have a server farm or couple of servers in the cloud (as opposed to my 3-year-old laptop), I would be delighted to see the results in more realistic environment.

Remembering how many HTTP worker threads are available in Tomcat, response times over time are far from satisfactory:

Please note the plateau at the beginning of the test: after about a minute (hint: when the number of concurrent connections exceeds 10) response times are skyrocketing to stabilize at around 10 seconds after 10 minutes (number of concurrent connections reaches one hundred). Once again: the same behavior would occur with 100 worker threads and 1000 concurrent connections – it's just a matter of scale. The response latencies graph (time between sending request and receiving first lines of response) clears any doubts:

Below magical 10 threads our application responds almost instantly. This is really important for clients as receiving only headers (especially Content-Type and Content-Length) allows them to more accurately inform the user what is going on. So what is the reason of Tomcat waiting with the response? No magic here really. We have only 10 threads and each connection requires one thread, so Tomcat (and any other pre-Servlet 3.0 container) handles 10 clients while the remaining 90 are... queued. The moment one of the 10 lucky ones is done, one connection from the queue is taken. This explains average 9 second latency whilst the servlet needs only 1 second to serve the request (200 kiB with 20 kiB/s limit). If you are still not convinced, Tomcat provides nice JMX indicators showing how many threads are occupied and how many requests are queued:

With traditional servlets there is nothing we can do. Throughput is horrible but increasing the total number of threads is not an option (think: from 100 to 1000). But you don't actually need a profiler to discover that threads aren't the true bottleneck here. Look carefully at DownloadServletHandler, where do you think most of the time is spent? Reading a file? Sending data back to the client? No, the servlet waits... And then waits even more. Non-productively hanging on semaphore – thankfully CPU is not harmed, but what if it was implemented using busy waiting? Luckily Tomcat 7 finally supports...

Servlet 3.0 asynchronous processing

We are this close to increase our server capacity by an order of magnitude, but some non-trivial changes are required (see master branch). First, download servlet needs to be marked as asynchronous (OK, this is still trivial):

The main change occurs in download handler. Instead of sending the whole file in a loop with lots of waiting (takeBlocking()) involved, we are splitting the loop into separate iterations, each wrapped inside Callable. Now we will utilize a small thread pool that will be shared by all awaiting connections. Each task in the pool is very simple: instead of waiting for a token, it asks for it in a non-blocking fashion (tryTake()). If the token is available, piece of the file is sent to the client (sendChunkWorthOneToken()). If the token is not available (bucket is empty), nothing happens. No matter whether the token was available or not, the task resubmits itself to the queue for further processing (this is essentially very fancy, multi-threaded loop). Because there is only one pool, the task lands at the end of the queue allowing other connections to be served.

I am leaving the details of Servlet 3.0 API, there are plenty of less sophisticated examples throughout the Internet. Just remember to call startAsync() and work with returned AsyncContext instead of plain request and response.

BTW creating a thread pool using Spring is childishly easy (and we get nice thread names as opposed to Executors and ExecutorService):

<task:executor id="downloadWorkersPool" pool-size="1"/>

That's right, one thread is enough to serve one hundred concurrent clients. See for yourself (the amount of HTTP worker threads is still 10 and yes, the scale is in milliseconds):

As you can see, response times when one hundred clients are downloading a file concurrently are only about 5% higher compared to the system with almost no load. Also response latencies aren't particularly harmed by increasing load. I can't push the server even further due to my limited hardware resources, but I have reasons to believe that this simple application would handle even twice as more connection: both HTTP threads and download worker thread weren't fully utilized during the whole test. This also means that we have increased our server capacity 10 times without even using all the threads!

Hope you enjoyed this article. Of course not every use case can be scaled so easily, but next time you'll notice your servlet is mainly waiting – don't waste HTTP threads and consider servlet 3.0 asynchronous processing. And test, measure and compare! The complete application source codes are available (look at different branches), including JMeter test plan.

Areas of improvement

There are still several places that require attention and improvement. If you want to, don't hesitate, fork, modify and test:

While profiling I discovered that in more than 80% of executions DownloadChunkTask does not acquire a token and only reschedules itself. This is an awful waste of CPU time that can be fixed quite easily (how?)

Consider opening a file and sending content length in a worker thread rather than in an HTTP thread (before starting asynchronous context)

How can one implement global limit on top of bandwidth limits per request? You have at least couple of choice: either limit the size of download workers pool queue and reject executions or wrap PerRequestTokenBucket with reimplemented GlobalTokenBucket (decorator pattern)

TokenBucket.tryTake() method does clearly violate Command-query separation principle. Could you suggest how it should look like to follow it? Why it is so hard?

I am aware that my test constantly reads the same small file, so the I/O performance impact is minimal. But in real life scenario some caching layer would have certainly be applied on top of disk storage. So the difference is not that big (now the application uses very small amount of memory, lots of place for cache)

Lessons learned

Loopback interface is not infinitely fast. In fact on my machine localhost was incapable of processing more than 80 MiB/s.

I don't use plain request object as a key in bucketSizeByRequestNo. First of all, there are no guarantees on equals() and hashCode() for this interface. And more importantly – read the next point...

With servlets 3.0 when processing the request you have to call completed() explicitly to flush and close the connection. Obviously after calling this method request and response objects are useless. What wasn't obvious (and I learned that the hard why) is that Tomcat reuses request objects (pooling) and some of their contents for subsequent connections. This means that the following code is incorrect and dangerous, possibly resulting in accessing/corrupting other requests' attributes or even session (?!?)

ctx.complete();
ctx.getRequest().getAttribute("SOME_KEY");

EDIT: Watch out for asynchronous executions timeouts. After creating an asynchronous invocation using startAsync() on Tomcat by default you have 30 seconds. After that Tomcat will kill your connection and getRequest() will suddenly start returning null (not very helpful). Luckily timeout can be customized and you can listen for timeout events easily (see this commit).

Thanks for pointing out this interesting article. The problem with 100-200 threads is that each thread comes with some memory footprint, not necessarily needed when all the thread does is waiting. Also note that this article mentions Servlet 3.0 as well.

Hi Tomasz,I notice in the DownloadChunkTask that if tokenBucket.tryTake(ctx.getRequest()) fails you instantly place the task back in to the queue. Would that result in a fair amount of thrashing as the task is instantly re-run only to placed back in the queue, this repeats many many times before tokenBucket.tryTake(ctx.getRequest()) succeeds. You could probably decrease CPU usage by switching downloadWorkersPool to a Executors.newScheduledThreadPool(1) and adding a little delay (downloadWorkersPool.schedule(this, 100, TimeUnit.MILLISECONDS)) before its re-processed...

I am so glad you are mentioning about that flaw. Actually if you read Areas of improvement the very first point talks about it. I haven't thought about your solution. Mine was to create a separate queue of pending chunks that we were unable to process due to empty bucket. This queue is polled only when new tokens are added to the bucket so the number of failed tryTake() is reduced as well.

Hey there, Thank you for this post, it is a great resource.I experimented it and observed that using tokens corresponding to a given buffer size is not perfectly accurate as a smaller size of bytes can be read from the stream than the buffer value (like 950 instead of 1024) which result in an actual rate a little below the expected value.Inspired by your work, I have implemented a rate limitation with overall limitation / per client and per request here. Thanks again!

Tomasz, you can probably get even better performance if you also switch to Tomcat's NIO connector which releases a request processing thread immediately after the request has been processed and not after a keepalive timeout (or connection drop) has occurred.

while the solution may improve scalability it comes at a cost of stability.

Because there is no way of knowing how fast client is actually able to read the data there is a significant risk that data gets buffered in memory instead of being sent over wire (which may lead to increase memory usage by up to bandwidth limit (bytes)/sec). Adding .flush() after every write would fix the issue with buffering, but it would also make the solution obsolete.

The real way to fix the issue would be to allow on to return InputStream to the container. (A solution which would obviously require changes to servlet spec)

Scenario 1: - Server is limited to send 100MiB/sec - There is 1client capable of downloading 99MiB/sec - There are 100clients capable of downloading 1KiB/sec/client

Thank you for your thorough and insightful comment. In scenario 2 I believe the server would block on write() (is flush() making any difference?) so Dr. Evil would successfully run DoS attack. But I don't think the system would be unstable. I hope OS observes the output buffer size and won't let one process fill up hundreds of MiB. I have no idea if the buffer is per socket, per process or per whole network subsystems.

I plan to rewrite this application to use netty and RateLimiter from Guava. Will see how well it performs and I will specifically test (mis)behaviour you describe.

Referring to third point in "Lessons Learned" which is "Tomcat reuses request objects (pooling) and some of their contents for subsequent connections. "

One solution which can be tried is to get org.apache.catalina.connector.Request object. This class has recycle() method to release the current object references and re-initialize the instance variables.

But, the question arises how can I get this object, as request object we have is ServletRequest type.

With the same number of threads per machine (400) and same number of concurrent connections (420), using wrk (C http injector), I had 15% better throughput with blocking vs. async and 15% lower cpu consumption, and probably better memory allocation, and the same latency (because i used the same timeout, 50ms for both blocking wait and for the async timeout).

Used Jetty 8 with NIO enabled. So in my test, blocking worked 13% faster than async using 8 cores instead of 10.

I might also be running longer tests (this one took just 6 minutes and seems that both systems were queing http requests).

This is somewhat counter intuitive to me, but when you think that wait/notify, sleep and the kin are not busy waits and do not keep cores, and that linux does good scheduling with waiting threads, and the fact that async servlets do one more dispatch that go through part of the container machinery, it may make sense that at least until a few thousands threads, for not very long backend processings (or those where a timeout with a default answer are acceptable), blocking is better than async.