Linux Kernel Tuning for C500k

Like the idea of working on large scale problems? We’re always looking for talented engineers, and would love to chat with you – check it out!

Note: Concurrency, as defined in this article, is the same as it is for The C10k problem: concurrent clients (or sockets).

At Urban Airship we recently published a blog post about scaling beyond 500,000 concurrent socket connections. Hitting these numbers was not a trivial exercise so we’re going to share what we’ve come across during our testing. This guide is specific to Linux and has some information related to Amazon EC2, but it is not EC2-centric. These principles should apply to just about any Linux platform.

For our usage, squeezing out as many possible socket connections per server is valuable. Instead of running 100 servers with 10,000 connections each, we’d rather run 2 servers with 500,000 connections apiece. To do this we made the socket servers pretty much just socket servers. Any communication between the client and server is passed through a queue and processed by a worker. Having less for the socket server to do means less code, cpu-usage, and ram-usage.

To get to these numbers we must consider the Linux kernel itself. A number of configurations needed tweaking. But first, an anecdote.

The Kernel, OOM, LOWMEM, and You

We first tested our code on a local Linux box that had Ubuntu 64-bit with 6GB of RAM, connecting with several Ubuntu VMs per client using bridged network adapters so we could ramp up our connections. We’d fire up the server and run our clients locally to see just how many connections we could hit. We noticed that we could hit 512,000 with our Java server not even breaking a sweat.

The next step was to test on EC2. We first wanted to see what sort of numbers we could get on “Small” instances, which are 1.7GB 32-bit VMs. We also had to fire up a number of other EC2 instances to act as clients.

We watched the numbers go up and up without a hitch until, seemingly randomly, the Java server fell over. It didn’t print any exceptions or die gracefully—it was killed.

We tried the same process again to see if we could replicate the behavior. Killed again.

Grepping through syslog, we found this line:

Out of Memory: Killed process 2178 java

The OOM-killer killed the Java process. Having watched the free RAM closely, this was odd because we had at least 500MB free at the time of the kill.

The next time we ran it we watched the contents of /proc/meminfo. What we noticed was a steady decline of the field “LowFree”, the amount of LOWMEM that is available. LOWMEM is the kernel-addressable RAM space used for kernel data. Data like socket buffers.

As we increased the number of sockets each socket’s buffers increased the amount of LOWMEM used. Once LOWMEM was full the kernel (instead of simply panicking) found the user process responsible for the usage and promptly killed it so it could continue to function.

On a standard EC2 Small, the configuration is such that the LOWMEM is around 717MB and the rest is “given” to the user. The kernel is smart about reallocating LOWMEM for the user, but not the other way around. The assumption is that the kernel will use very little ram, or at least a predictable finite amount, and the user should be allowed to go crazy. What we needed with our socket server was just the opposite. We needed the kernel to use all the ram it needed—our Java server rarely uses above a few hundred MB.

On a 32-bit system the kernel-addressable RAM space is 4GB. Making sure the proper space reserved for the kernel is important. But on 64-bit (x86-64) Linux the kernel-addressable space is 64TB (terabytes). At the current state of computing this is effectively limitless, and as such you will not even see LowMem in /proc/meminfo because it is all LOWMEM.

So we created some EC2 Large instances (each of which is 64-bit with 7.5GB of RAM) and ran our tests again, this time without any surprises. The sockets were added happily and the kernel took all the RAM it needed.

Long story short, you can only scale to so many sockets on a 32-bit platform.

Kernel Options

Several parameters exist to allow for tuning and tweaking of socket-related parameters. In /etc/sysctl.conf there are a few options we’ve modified.

First is fs.file-max, the maximum file descriptor limit. The default is quite low so this should be adjusted. Be careful if you’re not ready to go super high.

Second, we have the socket buffer parameters net.ipv4.tcp_rmem and net.ipv4.tcp_wmem. These are the buffers for reads and writes respectively. Each requires three integer inputs: min, default, and max. These each correspond to the number of bytes that may be buffered for a socket. Set these low with a tolerant max to reduce the amount of ram used for each socket.

Meaning that the kernel allows for 999,999 open file descriptors and each socket buffer has a minimum and default 4096-byte buffer, with a sensible max of 16MB.

We also modified /etc/security/limits.conf to allow for 999,999 open file descriptors for all users.

#
* - nofile 999999

You may want to look at the manpage for more information.
Testing

When testing, we were able to get about 64,000 connections per client by increasing the number of ephemeral ports allowed on both the client and the server.

echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range

This effectively allows every ephemeral port above 1024 be used instead of the default, which is a much lower (and typically more sane) default.

The 64k Connection Myth

It’s a common misconception that you can only accept 64,000 connections per IP address and the only way around it is to add more IPs. This is absolutely false.

The misconception begins with the premise that there are only so many ephemeral ports per IP. The truth is that the limit is based on the IP pair, or said another way, the client and server IPs together. A single client IP can connect to a server IP 64,000 times and so can another client IP.

Were this myth true it would be a significant and easy-to-exploit DDoS vector.

Scaling for Everyone

When we set out to establish half a million connections on a single server we were diving deep into water that wasn’t well documented. Sure, we know that C10k is relatively trivial, but how about an order of magnitude (and then some) above that?

Fortunately we’ve been able to achieve success without too many serious problems. Hopefully our solutions can help save time for those out there looking to solve the same problems.