Now that we have the C10K concurrent connection problem licked, how do we level up and support 10 million concurrent connections? Impossible you say. Nope, systems right now are delivering 10 million concurrent connections using techniques that are as radical as they may be unfamiliar.

Robert has a brilliant way of framing the problem that I'd never heard before. He starts with a little history, relating how Unix wasn't originally designed to be a general server OS; it was designed as a control system for a telephone network. It was the telephone network that actually transported the data, so there was a clean separation between the control plane and the data plane. The problem is we now use Unix servers as part of the data plane, which we shouldn't do at all. If we were designing a kernel to handle one application per server, we would design it very differently than a multi-user kernel.

Which is why he says the key is to understand:

The kernel isn’t the solution. The kernel is the problem.

Which means:

Don’t let the kernel do all the heavy lifting. Take packet handling, memory management, and processor scheduling out of the kernel and put it into the application, where it can be done efficiently. Let Linux handle the control plane and let the application handle the data plane.

The result will be a system that can handle 10 million concurrent connections with 200 clock cycles for packet handling and 1400 clock cycles for application logic. As a main memory access costs 300 clock cycles, it’s key to design in a way that minimizes code and cache misses.

With a data plane oriented system you can process 10 million packets per second. With a control plane oriented system you only get 1 million packets per second.

If this seems extreme keep in mind the old saying: scalability is specialization. To do something great you can’t outsource performance to the OS. You have to do it yourself.

Now, let’s learn how Robert creates a system capable of handling 10 million concurrent connections...

C10K Problem - So Last Decade

A decade ago engineers tackled the C10K scalability problem that prevented servers from handling more than 10,000 concurrent connections. It was solved by fixing OS kernels and moving away from threaded servers like Apache to event-driven servers like Nginx and Node. Moving off Apache has taken a decade, though in the last few years we’ve seen faster adoption of scalable servers.

The Apache Problem

The Apache problem is: the more connections, the worse the performance.

Key insight: performance and scalability are orthogonal concepts. They don’t mean the same thing. When people talk about scale they are often talking about performance, but there’s a difference between scale and performance, as we’ll see with Apache.

With short-term connections that last a few seconds, say a quick transaction, if you are executing 1000 TPS then you’ll only have about 1000 concurrent connections to the server.

Change the length of the transactions to 10 seconds, and at the same 1000 TPS you’ll now have 10K connections open. At that point Apache’s performance drops off a cliff, which opens you up to DoS attacks: just start a lot of slow downloads and Apache falls over.

If you are handling 5,000 connections per second and you want to handle 10K, what do you do? Say you upgrade the hardware and double the processor speed. What happens? You get double the performance, but you don’t get double the scale; the scale may only go to 6K connections per second. The same thing happens if you keep doubling: 16x the performance is great, but you still haven’t reached 10K connections. Performance is not the same as scalability.

The problem was Apache would fork a CGI process and then kill it. This didn’t scale.

Why? Servers could not handle 10K concurrent connections because of O(n^2) algorithms used in the kernel.

Two basic problems in the kernel:

Connection = thread/process. As a packet came in, the kernel would walk down the list of all 10K processes to figure out which thread should handle the packet.

Connections = select/poll (single thread). Same scalability problem: each packet had to walk a list of sockets.

Solution: fix the kernel to make lookups in constant time

Threads now context switch in constant time, regardless of the number of threads.

This came with a new scalable constant-time socket lookup: epoll() on Linux, IOCompletionPorts on Windows.

Thread scheduling still didn’t scale, so servers scaled using epoll with sockets, which led to the asynchronous programming model embodied in Node and Nginx. This shifted software to a different performance graph: even on a slower server, adding more connections doesn’t make performance drop off a cliff. At 10K connections a laptop is even faster than a 16-core server.

The C10M Problem - The Next Decade

In the very near future servers will need to handle millions of concurrent connections. With IPv6 the number of potential connections from each server is in the millions, so we need to go to the next level of scalability.

Often the people tackling Internet-scale problems ship appliances rather than servers, because they are selling hardware plus software. You buy the device and insert it into your datacenter. These devices may contain an Intel motherboard or network processors, plus specialized chips for encryption, packet inspection, etc.

x86 prices on Newegg as of Feb 2013: $5K for 40 Gbps, 32 cores, 256 GB RAM. Such servers can handle far more than 10K connections. If they can’t, it’s because you’ve made bad choices with software; it’s not the underlying hardware that’s the issue. This hardware can easily scale to 10 million concurrent connections.

What the 10M Concurrent Connection Challenge means:

10 million concurrent connections

1 million connections/second - a sustained rate at about 10 seconds per connection

10 gigabits/second connection - fast connections to the Internet.

10 million packets/second - current servers handle around 50K packets per second; this is going to a much higher level. Servers used to handle 100K interrupts per second, and every packet caused an interrupt.

10 coherent CPU cores - software should scale to larger numbers of cores. Typically software only scales easily to four cores. Servers can scale to many more cores so software needs to be rewritten to support larger core machines.

We’ve Learned Unix Not Network Programming

A generation of programmers has learned network programming by reading Unix Network Programming by W. Richard Stevens. The problem is the book is about Unix, not just network programming. It tells you to let Unix do all the heavy lifting and just write a small little server on top. But the kernel doesn’t scale. The solution is to move outside the kernel and do all the heavy lifting yourself.

An example of the impact of this is Apache’s thread-per-connection model. It means the thread scheduler determines which read() to call next, depending on which data arrives. You are using the thread scheduling system as the packet scheduling system. (I really like this; I never thought of it that way before.)

Nginx, by contrast, doesn’t use the thread scheduler as the packet scheduler; it does the packet scheduling itself. Use select/epoll to find the socket; we know it has data, so we can read immediately without blocking, and then process the data.

Lesson: Let Unix handle the network stack, but you handle everything from that point on.

How do you write software that scales?

How do you change your software to make it scale? A lot of our rules of thumb about how much hardware can handle are false. We need to know what the performance capabilities actually are.

To go to the next level the problems we need to solve are:

packet scalability

multi-core scalability

memory scalability

Packet Scaling - Write Your Own Custom Driver to Bypass the Stack

The problem with packets is they go through the Unix kernel. The network stack is complicated and slow. The path of packets to your application needs to be more direct. Don’t let the OS handle the packets.

The way to do this is to write your own driver. All the driver does is send packets to your application instead of through the stack. Available options include PF_RING, netmap, and Intel DPDK (Data Plane Development Kit). The Intel kit is closed source, but there’s a lot of support around it.

How fast? Intel has a benchmark processing 80 million packets per second (200 clock cycles per packet) on a fairly lightweight server. This is through user mode too: the packet makes its way up into user mode and then back down again to go out. Linux doesn’t do more than a million packets per second when getting UDP packets up to user mode and out again. That’s an 80:1 performance ratio of a custom driver over Linux.

For the 10 million packets per second goal, if 200 clock cycles are used getting the packet, that leaves 1400 clock cycles to implement functionality like a DNS server or IDS.

With PF_RING you get raw packets, so you have to implement your own TCP stack. People are doing user-mode stacks; for Intel there is an available TCP stack that offers really scalable performance.

Multi-Core Scalability

Multi-core scalability is not the same thing as multi-threading scalability. We’re all familiar with the idea that processors aren’t getting faster, but we are getting more of them.

Most code doesn’t scale past 4 cores. As we add more cores it’s not just that performance levels off; we can get slower and slower, because the software is written badly. We want software that scales nearly linearly as we add cores: it should get faster with more of them.

Multi-threading coding is not multi-core coding

Multi-threading:

More than one thread per CPU core

Locks to coordinate threads (done via system calls)

Each thread a different task

Multi-core:

One thread per CPU core

When two threads/cores access the same data, they don’t stop and wait for each other

All threads part of the same task

Our problem is how to spread an application across many cores.

Locks in Unix are implemented in the kernel. What happens at 4 cores using locks is that most software starts waiting for other threads to give up a lock, so the kernel starts eating up more performance than you gain from having more CPUs.

What we need is an architecture that is more like a freeway than an intersection controlled by a stop light. We want no waiting where everyone continues at their own pace with as little overhead as possible.

Solutions:

Keep data structures per core. Then on aggregation read all the counters.

Atomics. Instructions supported by the CPU that can be called from C. Guaranteed to be atomic and never conflict, but expensive, so you don’t want to use them for everything.

Lock-free data structures. Accessible by threads that never stop and wait for each other. Don’t write them yourself; making them work across different architectures is very complex.

Threading model. Pipelined vs worker thread model. It’s not just synchronization that’s the problem, but how your threads are architected.

Processor affinity. Tell the OS to use only the first two cores, then set which cores your threads run on. You can do the same thing with interrupts. Then you own these CPUs and Linux doesn’t.

Memory Scalability

The problem: if you have 20 GB of RAM and use, say, 2 KB per connection, but only have a 20 MB L3 cache, then none of that data will be in cache. It costs 300 clock cycles to go out to main memory, during which the CPU isn’t doing anything.

Think about this with our 1400 clock cycle budget per packet (remember the 200 clocks/packet overhead). That leaves room for only about 4 cache misses per packet, and that's a problem.

Co-locate Data

Don’t scribble data all over memory via pointers. Each time you follow a pointer it will be a cache miss: [hash pointer] -> [Task Control Block] -> [Socket] -> [App]. That’s four cache misses.

Keep all the data together in one chunk of memory: [TCB | Socket | App]. Prereserve memory by preallocating all the blocks. This reduces cache misses from 4 to 1.

Paging

The page tables for 32 GB of memory require 64 MB, which doesn’t fit in cache. So you get two cache misses: one for the page table and one for what it points to. This is a detail we can’t ignore when writing scalable software.

Solutions: compress data; use cache-efficient structures instead of a binary search tree, which incurs a lot of memory accesses.

NUMA architectures can double main memory access time when memory is on another socket rather than the local one.

Memory pools

Preallocate all memory all at once on startup.

Allocate on a per object, per thread, and per socket basis.

Hyper-threading

Network processors can run up to 4 threads per core; Intel only has 2.

This masks latency: for example, while one thread waits on a memory access, the other runs at full speed.

Hugepages

Reduces page table size. Reserve memory from the start and then your application manages the memory.

Summary

NIC

Problem: going through the kernel doesn’t work well.

Solution: take the adapter away from the OS by using your own driver and manage it yourself

CPU

Problem: if you use traditional kernel methods to coordinate your application it doesn’t work well.

Solution: Give Linux the first two CPUs and let your application manage the remaining CPUs. No interrupts will happen on the CPUs you don’t allow them on.

Memory

Problem: memory takes special care to make work well.

Solution: At system startup allocate most of the memory in hugepages that you manage.

The control plane is left to Linux; the data plane gets nothing. The data plane runs in application code. It never interacts with the kernel: no thread scheduling, no system calls, no interrupts, nothing.

Yet what you have is code running on Linux that you can debug normally; it’s not some weird hardware system you need custom engineering for. You get the performance of custom hardware for your data plane, but with your familiar programming and development environment.

Reader Comments (30)

This breakdown of bottlenecks is impressive, it is refreshing to read something this correct.

Now I want to play devil's advocate (mostly because I thoroughly agree w/ this guy). The solutions proposed sound like customized hardware-specific solutions, a move back to the old days when you could not just put some fairly random hardware together, slap Linux on top and go. That will be the biggest backlash to this: people fear appliance/vendor/driver lock-in, and the fear is a rational one.

What are the plans to make these very correct architectural practices available to the layman? Some sort of API is needed, so individual hardware stacks can code to it, and this API must not be a heavy-weight abstraction. This is a tough challenge.

Best of luck to the C10M movement, it is brilliant, and I would be a happy programmer if I can slap together a system that does C10M sometime in the next few years

Great article! I've always found scale fascinating, in general. Postgres 9.2 added support (real support) for up to 64 cores, which really tickled my fancy. It seems the industry switches back and forth between "just throw more cheap servers at it... they're cheap" and "let's see how high we can stack this, because it'll be tough to split it up". I prefer the latter, but a combination thereof, as well as the sort of optimizations you speak of (not simply delegating something like massive-scale networking to the kernel) tends to move us onward... toward the robocalypse :)

The universal scalability law (USL) is exhibited around 30:00 mins into the video presentation, despite his statement that scalability and performance are unrelated. Note the performance maximum induced by the onset of multicore coherency delays. Quantifying a similar effect for memcache was presented at Velocity 2010.

This is very interesting. The title may be better as "The Linux Kernel is the Problem", as this is different for other kernels. Just as an example, last time I checked, Linux took 29 stack frames to go from syscalls to start_xmit(). The illumos/Solaris kernel takes 16 for the similar path (syscall to mac_tx()). FreeBSD took 10 (syscall to ether_output()). These can vary; check your kernel version and workload. I've included them as an example of stack variance. This should also make the Linux stack more expensive -- but I'd need to analyze that (cycle based) to quantify.

Memory access is indeed the enemy, and you talk about saving cache misses, but a lot of work has been done in Linux (and other kernels) for both CPU affinity and memory locality. Is it not working, or not working well enough? Is there a bug that can be fixed? It would be great to analyze this and root cause the issue. On Linux, run perf and look at the kernel stacks for cache misses. Better, quantify kernel stacks in TCP/IP for memory access stall cycles -- and see if they add up and explain the problem.

Right now I'm working on a kernel network performance issue (I do kernel engineering). I think I've found a kernel bug that can improve (reduce) network stack latency by about 2x for the benchmarks I'm running, found by doing root cause analysis of the issue.

Such wins won't change overall potential win of bypassing the stack altogether. But it would be best to do this having understood, and root caused, what the kernel's limits were first.

Bypassing the kernel also means you may need to reinvent some of your toolset for perf analysis (I use a lot of custom tools that work between syscalls and the device, for analysing TCP latency and dropped packets, beyond what network sniffing can do).

Unix wasn't initially used for controlling telephone networks. Mostly a timesharing system for editing documents (ASCII text with markup). This is from the BSTJ paper version of the CACM paper.The CACM paper is earlier and slightly different. http://cm.bell-labs.com/cm/cs/who/dmr/cacm.html.

Since PDP-11 Unix became operational in February, 1971, over 600 installations have been put into service. Most of them are engaged in applications such as computer science education, the preparation and formatting of documents and other textual material, the collection and processing of trouble data from various switching machines within the Bell System, and recording and checking telephone service orders. Our own installation is used mainly for research in operating systems, languages, computer networks, and other topics in computer science, and also for document preparation.

Uhhhh, I hate to break it to you guys, but this problem, and its solutions, has long been known to the exokernel community. While I admit that Linux and BSD "aren't Unix" in the trademark sense of the term, they're still both predicated on an OS architecture that is no less than 60 years old (most of the stuff we take for granted came, in some form or another, from Multics). It's time for a major update. :)

One problem is choke-points (whether shared data structures or mutexes, parallelizable or not) that exist in user- and kernel- space. The exo-kernel and other approaches simply choose different ways of doing the same thing by shifting the burden around (IP stack packetting functions). Ultimately, the hardware (real or virtual) presents the most obvious, finite bottleneck on the solution.

At the ultra high end of other approaches that include silicon, http://tabula.com network-centric apps compiled w/ app-specific OS frameworks coded in a functional style. This is promising not only for niche problems, but looks like the deepest way of solving many of the traditional bottlenecks of temporal/spatial dataflow problems. For 99% of solutions, it's probably wiser to start with vertically-scaled LMAX embedded systems approaches first http://martinfowler.com/articles/lmax.html after having exhausted commercial gear.

'Locks in Unix are implemented in the kernel' claim is not absolutely right for systems of today. We have light-weight user-land locks as in http://en.wikipedia.org/wiki/Futex. Although I agree with the potential speed gains of an application doing the network packet handling, Memory management (including Paging) and CPU scheduling are hardly in the domain of an application. Those kind of things should be left to kernel developers who have a better understanding of the underlying hardware...

Interesting how often the architectural pattern of "separation of control and data" shows up. I gave a talk just the other day to a room full of executives in which I pointed out how this pattern occurred (fractally as it turns out) in the product we were developing. You can see it in the kinds of large scale mass storage systems I worked at while at NCAR (control over IP links, but data over high speed I/O channels using specialized hardware), over the kinds of large telecommunications systems I worked on while at Bell Labs (again, "signaling" over IP links, but all "bearer" traffic over specialized switching fabrics using completely different transport mechanisms), in VOIP systems (control using SIP and handled by the software, but RTP bridged as much as possible directly between endpoints with no handling by the software stack), and even in embedded systems (bearer over TDM busses directly from the A/D chips, but control via control messages over SPI serial busses). This pattern has been around at least since the 1960s, and maybe in other, non-digital-control contexts, even earlier. It's really a division of specialized labor idea, which may go back thousands of years.

But I agree that the Linux kernel should still offer substantial improvements as far as a high number of sockets is concerned. It's good to see that new versions of the Linux kernel (starting with version 3.7) come with some important improvements in terms of socket-related memory footprint. However, more optimization is necessary and possible. As mentioned in my post, the memory footprint for 12 million concurrent connections is about 36 GB. This could certainly be improved by the Linux kernel.

This is great! I have implemented a Java Web server using nio and can achieve 10K+ connection on a 8GB ram, 5 years old desktop computer - but 10M is insane. On linux the number of open sockets appears to be limited by the ulimit - and not sure if 10M is even possible without kernel tweaks.. I guess for your approach this isn't relevant.

Great post. Intel is hiring a Developer Evangelist for DPDK and network development. If you know someone interested in this space please pass on: http://jobs.intel.com/job/Santa-Clara-Networking-Developer-Evangelist-Job-CA-95050/77213300/

What kind of dumb, ambiguous way of expressing a number is this? Are you deliberately trying to confuse people? You should be using either numerals or words *consistently* throughout the article, let alone for the same figure.