As with some of his previous talks, this one did an excellent job of revealing a bit more about how Google builds systems and what they’ve learned in doing so.

Working at the scale they do (thousands and thousands of servers) is especially interesting because you have to really step back and change your thinking about the available building blocks and how they can best fit together to get the job done.

And to do so it helps to have a good high level mental model of the pieces and their performance. In Google’s case, this means having 40-80 servers (4-8 cores, 16GB RAM, 2TB disk) in a rack all connected via a gigabit ethernet switch.

Multiple (30+) racks then uplink to another switch to form a cluster. And services are designed to run across multiple clusters in different data centers around the world.

Design for Failure

Jeff discussed their experience with a typical cluster in its first year and presented some sobering downtime and failure statistics.

Really, all of this argues for building distributed systems that are as independent as possible and can tolerate failures, even if that means giving users partial functionality (because partial functionality is better than none).

But figuring out how to design and build scalable distributed systems isn’t easy either. Your first attempt or two may not be quite right.

There’s a good chance you’ll underestimate the cost of some operation or another.

Numbers Everyone Should Know

With that in mind, Jeff presented a chart of “Numbers Everyone Should Know” that helps put various operations into perspective.

The idea is to facilitate back of the envelope estimates of system performance. He argues that a critical skill is the ability to estimate the performance of a system without having to actually build it. And that’s where those numbers come into play.

To make talking about these easier, let’s pick a few familiar points of reference. If you spend any time playing with ping or looking at specs for hard disks, you’re probably used to thinking in milliseconds rather than nanoseconds.

A millisecond is 1,000,000 nanoseconds. So the “round trip within same datacenter” timing above is 0.5ms, which sounds about right to me.
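As a quick sketch of that conversion, here are the timings this walkthrough relies on, expressed in both units (the constants below are just the figures cited in the text):

```python
# Timings cited in this post, in nanoseconds.
MAIN_MEMORY_REFERENCE_NS = 100
DATACENTER_ROUND_TRIP_NS = 500_000
DISK_SEEK_NS = 10_000_000

NS_PER_MS = 1_000_000  # a millisecond is 1,000,000 nanoseconds

def ns_to_ms(ns):
    """Convert nanoseconds to milliseconds."""
    return ns / NS_PER_MS

print(ns_to_ms(DATACENTER_ROUND_TRIP_NS))  # 0.5 (ms)
print(ns_to_ms(DISK_SEEK_NS))              # 10.0 (ms)
```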

The exact numbers here aren’t as important as the differences in magnitude as you move up and down the list.

Let’s walk through an example: choosing between a distributed in-memory key/value store (like Redis) and a disk-based system like Berkeley DB is a decision you can quantify.

If we assume that the data set is significantly larger than the available RAM on your server, then most Berkeley DB accesses will have to hit disk and probably result in a few seeks (we’ll ignore the details of hash table vs. B-Tree for now).

Let’s call that 3 seeks per request, which is 10,000,000ns (or 10ms) each for a total of 30ms.

However, if you’re using a distributed in-memory key/value store, you need to fetch data from another machine in your cluster.

Let’s call that a few 500,000ns (or 0.5ms) round-trips in the same datacenter (assuming persistent network connections and a simple request/response protocol) and the same 3 “seeks” as before, but this time they’re 100ns main memory references on the remote machine.

Now we’re looking at a total more like 2-3ms.

Using this estimation technique, you can see that the distributed in-memory key/value store is easily 10 times faster than using a disk-based solution.
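Putting that back-of-the-envelope comparison into code (the 3 seeks and 0.5 ms round trips come straight from the walkthrough; reading “a few” round trips as 5 is my own illustrative assumption):

```python
NS_PER_MS = 1_000_000

# Figures from the latency chart used above.
disk_seek_ns = 10_000_000       # 10 ms per seek
round_trip_ns = 500_000         # 0.5 ms within the same datacenter
memory_ref_ns = 100             # main memory reference

# Disk-based store: 3 seeks per request.
disk_request_ns = 3 * disk_seek_ns

# In-memory store: assume 5 round trips per request (an illustrative
# reading of "a few"), plus the same 3 lookups as memory references.
memory_request_ns = 5 * round_trip_ns + 3 * memory_ref_ns

disk_ms = disk_request_ns / NS_PER_MS       # 30.0 ms
memory_ms = memory_request_ns / NS_PER_MS   # ~2.5 ms

print(f"disk: {disk_ms} ms, memory: {memory_ms:.1f} ms, "
      f"speedup: {disk_ms / memory_ms:.0f}x")
```

The exact speedup depends on how many round trips your protocol really makes, but the order-of-magnitude gap survives any reasonable choice.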

Of course, there’s a critical piece of information missing from these numbers: cost. And that cost comes in two flavors.

The first is the cost associated with buying enough machines to keep all the data in memory and the network gear that allows them all to talk to each other.

The second “cost” is the complexity associated with having 10 servers instead of 1 server. As we saw earlier, when the number of servers grows, the odds of failures rise as well.
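You can put a rough number on that rising failure risk too. Assuming independent failures and a purely illustrative 5% chance that any single machine fails during some window, the chance that *at least one* of your servers fails grows quickly with the fleet size:

```python
# If each server independently has probability p of failing in some
# window, the chance that at least one of n servers fails is
# 1 - (1 - p)^n.  The 5% figure is purely illustrative.
def p_any_failure(n_servers, p_single=0.05):
    return 1 - (1 - p_single) ** n_servers

print(round(p_any_failure(1), 3))   # 0.05
print(round(p_any_failure(10), 3))  # 0.401
```

Going from 1 server to 10 takes you from a 5% chance of a failure to about 40%, which is why the tolerance-for-failure design discussed earlier stops being optional.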

So you need to decide how your needs map onto the spectrum of performance and cost/complexity options.

If it’s a requirement to serve N requests per second, then you can scope out what that will take given various amounts of data.
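One way to sketch that scoping exercise, using the per-server RAM from the rack description earlier and the in-memory latency estimate from the example (the target rate, dataset size, and concurrency figure are made-up inputs, not anything from the talk):

```python
import math

RAM_PER_SERVER_GB = 16      # per-server RAM from the rack specs above
REQUEST_LATENCY_MS = 2.5    # in-memory estimate from the example above

def servers_needed(data_gb, target_rps, workers_per_server=100):
    """Back-of-the-envelope server count for an in-memory store."""
    # Enough machines to hold the whole data set in RAM...
    for_capacity = math.ceil(data_gb / RAM_PER_SERVER_GB)
    # ...and enough to sustain the request rate, assuming each server
    # handles workers_per_server concurrent requests (illustrative).
    rps_per_server = workers_per_server * 1000 / REQUEST_LATENCY_MS
    for_throughput = math.ceil(target_rps / rps_per_server)
    return max(for_capacity, for_throughput)

print(servers_needed(data_gb=1000, target_rps=50_000))  # 63
```

For a 1TB data set, capacity (not throughput) is the binding constraint here, which is exactly the kind of insight these quick estimates are meant to surface before you build anything.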

Conclusion

It’s fun to design big complicated systems, but it’s no fun at all to support them when they’re not performing or are too fragile because they break all the time.

By knowing some basic rules of thumb in advance, you can design with the right expectations in mind and spend a lot less time scratching your head (or pulling your hair out).

Are there performance or failure estimation tips you regularly use? What are they?