Web Service Concurrency Shoot-out

I work mainly on web services, mainly in Ruby. Our team is pretty happy with it. We have no intention of rewriting existing Ruby code into other languages.

However, we're running into cases where Ruby is not the right tool for the job. That has led us to search for a general purpose "fast" language that we can use as an alternative.

This post focuses on a particular aspect of "fast": support for threads and concurrent execution. Ruby and Python are sometimes criticized for their global interpreter locks that prevent them from using more than one CPU at a time. But how does this matter in practice for web services, and how much better is the competition?

Before we discuss the details, let's look more closely at our use case.

Web Services

As a web service, our number one metric for throughput is requests serviced per second. To optimize throughput, we can either serve individual requests more quickly (i.e. reduce latency) or serve more requests concurrently.

Reducing latency is clearly the better option. Lower request time usually translates to happier users and decreased server costs. But this approach tends to hit diminishing returns pretty quickly.

Reducing latency in a typical web service is difficult because most of the request time is not dependent on the server. The server will do some processing, translate the request into a database query, forward it along, wait for a result, then re-encode the result and pass it back to the user. Most of the request time is spent in the "waiting for database result" step. If waiting for the database makes up 90% of the request latency, then optimizing the non-database, web-server-dependent parts will at most result in a 10% latency speedup (by Amdahl's Law).

So we turn our attention to the second option: concurrency. Here we see a big opportunity to make an impact. Since a request that's blocked waiting for a response from a remote service requires no CPU time, we could in theory support hundreds of them at once. Even tens of thousands of concurrent requests might not be outside the realm of possibility.

This is an exciting possibility. If we could process thousands of requests in parallel (compared to the 40-ish concurrent requests per CPU that our Rails servers are currently configured to support), we could serve the same amount of traffic with far fewer server instances.

But it turns out that serving ten thousand connections simultaneously is a pretty hard problem. In fact, it even has a name: the c10k problem.

Let's look at some common technologies to see how they've approached this problem and how well it scales.

MRI Ruby

Ruby 1.8

In Ruby 1.8, you could call Thread.new as often as you liked, and top would still (correctly) show no new OS threads. Threads in Ruby 1.8 were more like "green threads" or "fibers". The OS had no idea how many of these Ruby 1.8 threads your program was running at a time.

This model has some drawbacks. Normally, when a thread makes a blocking call like sleep(2) or read(2), the OS will put that thread to sleep so it can give CPU time to some other thread that isn't sleeping. This ensures the CPU is always doing work, so long as there is work to be done. But the OS can't block and resume threads it doesn't know about.

The Ruby 1.8 VM used some pretty clever tricks to work around this limitation. In that version of Ruby, Kernel#sleep did not call sleep(2). Instead, it coordinated with the VM's scheduler to yield the CPU to another Ruby thread. The same went for File#read. The Ruby 1.8 VM effectively included logic equivalent to an OS process scheduler, including a form of preemption.

Unfortunately, Ruby 1.8 was worse at this sort of thing than a typical OS. Certain gems, including an early version of the MySQL gem, would still make OS calls that blocked every thread in the process. In fact, writing gems the "wrong" way was significantly easier than writing them the right way. And even without poorly written gems, the VM's threading logic was not especially good or fast. (See this blog post from Joe Damato for all the gory details.)

So in Ruby 1.9 they did things differently.

Ruby 1.9

In Ruby 1.9, Thread.new created real OS threads. The VM still did not allow threads to run in parallel; a Ruby process was still limited to using at most one CPU at a time, thanks to the global interpreter lock. But at least the OS could see the Ruby threads as threads.

For most use cases, this was a big improvement. The Ruby VM no longer had to implement its own scheduling and preemption logic; it could let the OS schedule threads, and the OS generally did this faster and better than Ruby 1.8 had. Gems could call out to C libraries and make blocking OS calls without blocking other threads.

But there are drawbacks to this approach, too. The next section will demonstrate one of them.

Micro-benchmark: Ruby 1.8 vs. Ruby 2.x

Here is a tiny benchmark meant to test one aspect of thread scalability. The idea is to create thousands of threads and have them all go to sleep.
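The original listing isn't shown here, but a minimal version of the benchmark might look like this (the sleep duration and the default thread count are my assumptions; the numbers below came from runs with 10,000 and 100,000 threads, passed as the first argument):

```ruby
# Spawn N threads that each sleep for a second, then wait for them all.
# Run as: ruby sleep_bench.rb 100000
N = (ARGV[0] || 1_000).to_i

threads = (1..N).map do
  Thread.new { sleep 1 }
end

threads.each(&:join)
puts "joined #{threads.size} threads"
```

This runs unmodified on both Ruby 1.8.7 and Ruby 2.3, so the same file can be timed under each interpreter with `time`.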

I compiled a non-optimized copy of Ruby 1.8.7 and compared it with the Ruby 2.3 that I obtained through my package manager. Here's how long this benchmark took for each of them.

At 10,000 threads:
Ruby 1.8: 24 seconds
Ruby 2.3: 18 seconds

At 100,000 threads:
Ruby 1.8: 3 minutes
Ruby 2.3: crashed

The lesson I would take from this benchmark is that OS threads come with a price. They're really good at what they do, up to a point. Beyond that point, the scheduling and memory overheads start to get really expensive.

This matters because a web service is not far removed from this benchmark. From a scheduling point of view, a thread that's sleeping looks a lot like a thread that's waiting for a response from a database. Our ideal highly-concurrent web service would be able to support many thousands of these requests simultaneously.

Go

Go has made a name for itself in part because of its unique support for concurrency. Its goroutines are notable for their low memory usage and minimal scheduling overhead.

Goroutines are a kind of userspace thread. Go spawns OS threads in proportion to the number of CPUs available (or the GOMAXPROCS setting). The Go runtime schedules goroutines to run on these threads. The OS knows about the threads, but not the goroutines.

The runtime is also pretty smart about scheduling. Like Ruby 1.8, it implements time.Sleep not as a sleep(2) system call, but as an indication to the goroutine scheduler not to schedule that goroutine for a while. That means sleeping is really fast.

I rewrote the 100,000 sleeping thread benchmark from the Ruby section in Go:
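The Go listing isn't in the original either; a sketch of the equivalent program might look like this (taking the goroutine count from the command line is my addition):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
	"time"
)

func main() {
	// Number of sleeping goroutines; the post's benchmark used 100,000.
	n := 100000
	if len(os.Args) > 1 {
		if v, err := strconv.Atoi(os.Args[1]); err == nil {
			n = v
		}
	}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Handled by the Go scheduler, not by a per-goroutine sleep(2) call.
			time.Sleep(1 * time.Second)
		}()
	}
	wg.Wait()
	fmt.Printf("joined %d goroutines\n", n)
}
```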

Go crushed it. This program took less than 12 seconds to run. The number of OS threads remained constant throughout. Compared to Ruby, which took 3 minutes or crashed, this is quite an improvement.

But it turns out this is a case of gaming the benchmark. Or at least it demonstrates that the benchmark is poorly written. While the Go runtime does have a cheap implementation of sleep, other kinds of blocking calls are not so efficient. Here's another demonstration:
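The original listing is missing here too. It used a bufio reader (hence the bio.Readline() mentioned below), which blocked an OS thread per read on the Go versions of the time. Current runtimes route pipe I/O through the netpoller, so this sketch substitutes a raw syscall.Read, which still pins a thread per blocked read. The /proc/self/status thread count is Linux-only and my addition:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
	"syscall"
	"time"
)

const NumSleepers = 100

func main() {
	// Keep the pipe files referenced so the GC doesn't finalize (close) them;
	// a closed write end would wake the readers with EOF.
	files := make([]*os.File, 0, NumSleepers*2)

	for i := 0; i < NumSleepers; i++ {
		r, w, err := os.Pipe() // no data will ever arrive on this pipe
		if err != nil {
			panic(err)
		}
		files = append(files, r, w)
		go func(fd int) {
			buf := make([]byte, 1)
			// A direct blocking read(2): this goroutine's OS thread is stuck
			// in the kernel, so the runtime must spawn another thread to keep
			// the rest of the program running.
			syscall.Read(fd, buf)
		}(int(r.Fd()))
	}

	time.Sleep(2 * time.Second) // give every read time to block

	// Linux-only: report how many OS threads this process now has.
	if status, err := os.ReadFile("/proc/self/status"); err == nil {
		for _, line := range strings.Split(string(status), "\n") {
			if strings.HasPrefix(line, "Threads:") {
				fmt.Println(line) // roughly NumSleepers plus a few runtime threads
			}
		}
	}
	runtime.KeepAlive(files)
	os.Exit(0) // the readers are still blocked; don't wait for them
}
```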

In this case, Go actually creates 100 OS threads, in proportion to the 100 blocked goroutines. Unlike time.Sleep, the runtime can't avoid the syscall implicit in bio.Readline() quite so easily. Crank up NumSleepers to 100,000 and this program starts to panic and crash just as much as the Ruby 2.3 benchmark did in the previous section.

Unfortunately for us, the blocking I/O microbenchmark is a better simulation of a web server than the sleep test. Our web server's requests are not spending most of their time waiting on a wall clock. They're performing read(2) and write(2) calls. Go multiplexes network sockets through its netpoller, but for I/O the poller can't handle (regular files, pipes, and blocking calls into C libraries) the runtime is forced to park an OS thread per call. If it spawns too many threads, it crashes.

What we really need to scale this web service is event-driven I/O, like what Ruby 1.8 had. But we want it to be faster and more consistently implemented than that. Which brings us to the next candidate on our list.

Node.js

Enter Node.js. It runs all of your JavaScript on a single OS thread, but on that thread there are multiple contexts of execution. When one context is blocked waiting on a timer or on I/O, the runtime just context-switches to the next one.

If you read through the c10k link from earlier, you already know that a typical good solution to this problem is to use non-blocking I/O with an event loop based around a select(2)-like interface, probably wrapped in a library like libevent or libuv for convenience.

This looks synchronous, but each of those callbacks contains its own context.
And yet this program spawns a constant number of OS threads no matter how many concurrent I/O operations are in progress. It's fast, too: when I pipe /bin/yes into that program it outputs 100,000 "Done" lines and terminates in under three seconds.

So it seems like the JavaScript community got something right. Although Node.js can't make use of multiple cores as easily as some other languages, and the non-preemptive scheduling might be problematic in CPU-bound workloads, for an I/O heavy server this model is pretty nice.

Conclusions and Caveats

So Node.js is the winner, right?

Well, that depends on context. I've made a few assumptions along the way that led to this conclusion. Your context may differ.

First, I assume that your web server isn't doing a lot of CPU-intensive work, or if it does, that this work could easily be farmed out to a separate process. If you insist on doing lots of computation in the same process that handles web requests, you might want to choose something else. A server like that is likely to hit CPU bottlenecks long before it hits 10,000 concurrent requests anyway.

Second, I assume that cooperative multitasking is good enough for your use case. If your service supports even one kind of CPU-intensive request, then the Node.js cooperative multitasking model should scare you. In Node.js, a single bad request could hog the CPU and the event loop, affecting the latency of every other request in the system. Preemptive multitasking models, like OS threads or goroutines, would preempt such a request before it disrupts the rest of the system.

Finally, I've implicitly rejected lots of other viable solutions because I assume that ecosystem support is important. The c10k solutions found in other languages tend to involve non-idiomatic tricks that make the vast majority of their language's package ecosystem unavailable. Compare, for example, this IMAP client for EventMachine and the Net::IMAP implementation that ships as part of Ruby's standard library. Which of these would you expect has the larger community? But if you're attached to a language other than JavaScript and don't mind working with a smaller ecosystem, then you can take your pick of non-JavaScript event-driven frameworks. Ruby's EventMachine, Python's Twisted, and Java's Netty all seem to be perfectly fine choices, though I admit to having little first-hand experience with them.

That said, if your use case requires strong open-source ecosystem support, your workload is heavily I/O-bound, and you want to minimize the number of CPU cores needed to solve the problem, then yes, Node.js is a good choice.