GoGaRuCo '09 – Magic scaling sprinkles – Nick Kalen

With respect to the title, magic sprinkles are the fundamentals of computer science, irrespective of languages. Nick is disappointed when people point to simply a technology to solve their scaling concerns, like erlang.

Scaling comes down to 3 main issues, which are really compute science issues:

Distribution

Balance

Locality

First, nick builds a very simple echo network service, called the JokeServer. You connect to it and it echos the input. Then he does a simple load test on it, using a custom TCP benchmarking system based on apache benchmark On first try, it shows 8450 req/sec, but admittedly it does almost nothing.

If you have a simple single threaded worker that completes 2 jobs/sec, the throughput is 2 req/sec and the latency is 0.5 sec/req. Things get more interesting when we add more threads, where the latency stays approximately the same, but the throughput goes up. Latency is an efficiency question, and throughput is a scalability question.

He then modifies the JokeServer, adding the following code:

10000.times { Time.now }
sleep rand

These contrieved inefficiencies are representative of two kinds of work that application servers usually perform. The first uses a lot of system calls and object creation, and the second blocks on some I/O for a specific time.

When he reruns the benchmark, the throughput drops to around 1 req/sec. If thousands of users need to use this service at once, Nick asks the basic question: “How many of these can we run per machine, and how many per core?”

Distribution

To answer this question, he adds Statisaurus to collect statistics from the critical section of server code. It outputs timestamps, transaction ID, and 3 measurements of ellapsed time (wall clock, system time, user time). First, he points out that the wall clock != (system time + user time). The excess is refered to as “stall time”, and it can be caused by waiting on I/O or by context switching. Context switching is very expensive at the CPU level, so the goal here is to find the optimal number of processes per core.

Suppose a worker takes 0.5 sec of CPU time and 0.5 sec of “stall time”. What is the optimal scheduling for 2 workers on 1 core? It’s obviously to run process1’s CPU and process2’s stall, then vica versa. That sort of optimal scheduling is what we’re hoping to achieve in the real world by controlling the # of processes we run. In general, you want to take the wall clock time and divide it by the CPU time, which will yield the number of processes to run per core. In Nick’s example JokeServer, he shows about 6 processes per core. Since he has 2 cores, he’s going to run 10 processes total (some room for error).

What distribution strategy should we use to divide client requests amount our processes? We can try out a simple TCP proxy to divide requests between our many workers. The proxy introduces a point of failure, but is completely transparent to the server and the client. Another model is the DNS model, where the client talks to a “nameserver”, which gives the client the address of an available worker. The client can then talk directly to the server, removing the extra proxy latency. In a third model, the client uses a distributed hash table (like memcached) to determine which server to communicate with directly. The advantages are obvious, but the disadvantages can be logistical nightmares.

In his demo, Nick is going to build a simple proxy. By adding a proxy, it’s another part of giant moving system. To keep things from becoming impossible to debug, Nick suggests that you use logging with transaction ids. The proxy will generate a transaction GUID, then pass it to the backend, where it uses this ID in its logs as well (great for debugging and correlating requests).

Balancing

First, he demonstrates the “random” strategy. With a concurrency level of 10, we can now get a throughput of 4 req/sec. It’s better than non-concurrent, but random is clearly inefficient.

Next, he tries round-robin, where we load balance sequentially across each worker in order. This sound really good, but it assumes that each job takes exactly the same time. With a concurrency level of 10 req/sec, we get 7 req/sec.

Last, he points out that the jobs are not of the same duration, so round robin is not a good strategy. Instead, we try a “least busy” strategy, where the proxy forwards the request to the worker with the least # of currently open connections. With concurrency of 10 req/sec, the throughput jumps to about 8 req/sec.

Locality

By introducing memoization into the JokeServer, we never do the same work twice. This is where caching comes into play, and can reduce the response time tremendously. Nick then adds a primitive cache to the joke server, that can only store 2 cached values (to simulate resource starvation). Since there are 10 total values, we expect roughly 20%-30% cache hit ratio. When tested, that value is achieved more or less. We’d prefer to get 99-100%, of course. By associating a certain class of request with a certain servers, we can achieve that goal. This takes advantage of locality (such as writing consecutive hard disk blocks is faster than random block).

To try this out, Nick uses a sticky proxy strategy. Similar requests are funneled to the same server consistently. Our throughput jumps to several hundred req/sec as our cache hit ratio gets close to 100%.