Thinking at Cloud Scale

18 January, 2010 8:55 pm

Scalable systems are not just bigger versions of their smaller brethren. If you take a ten-node system and just multiply it by ten, you’ll probably get a system that performs poorly. If you multiply it by a hundred, you’ll probably get a system that doesn’t work at all. Scalable systems are fundamentally and pervasively different, because they have to be. I’d been meaning to write about some aspects of this for a while, but my recent post about MaxiScale brought a couple of particular points to mind. Here are two Things People Don’t Get about building scalable systems.

Everything has to be distributed.

Change has to be handled online.

To the first point, almost everyone knows by now that any “master server” can become a limit on system scalability. What’s less obvious is that universal replication is just as bad, if not worse. For an extremely read-dominated workload, spreading reads across many nodes and not worrying about a few writes here and there might work. For most systems, though, the overhead of replicating those more-than-a-few writes will kill you. What you have to do instead is spread data around, which means that anyone who wants a piece of data has to be able to find it separately, instead of just assuming it all lives in one convenient place.
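To make the contrast concrete, here’s a minimal sketch (in Python, with invented names) of the “spread data around” approach: every client computes a key’s owner for itself, with no master server to ask and no universal replication to pay for.

```python
import hashlib

def owner(key, servers):
    # Deterministic placement: any client, anywhere, computes the same
    # owner for the same key without consulting a central directory.
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

servers = ["node-%d" % i for i in range(100)]

# Two independent clients agree on where "user:42" lives.
assert owner("user:42", servers) == owner("user:42", servers)
```

The catch, of course, is the `len(servers)` in the modulus: change the server list and most keys move to a different owner, which is exactly the kind of online change the second point is about, and a big part of why schemes like consistent hashing exist.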

The second point is a bit less obvious. As node count increases, so does the likelihood that users will represent conflicting schedules and priorities that preclude bringing down the whole system for a complete sanity check and tune-up. This is the downside of James Hamilton’s non-correlated peaks. Despite the “no good time for everyone” feature of large systems, though, change will continue to occur. Nodes will be added and removed, and possibly upgraded. Lists will get long, space will fragment, and cruft will generally accumulate. Latent errors will appear, and they’ll remain latent until they’re fixed online or until they cause a catastrophic failure. Individual nodes will reboot, and some will argue that they should be rebooted even if they seem fine, but “planned downtime” for the system as a whole will be no more than a fond memory.

As it turns out, these two rules combine in a particularly nasty way. If everything has to be distributed and then found, and changes to the system are inevitable, then you would certainly hope that your method of finding things can handle those inevitable changes. Unfortunately, this is not always the case. For example, consider one of the many systems based on Dynamo-style consistent hashing with N replicas. Now add N nodes, such that they’re all adjacent in the space between file hash X and server hash Y. Many systems support an update-or-insert operation, but if such an operation is attempted at this point it will create a new datum on the N new nodes, separate from, and inconsistent with, the existing datum at Y and its successors. This is just bad in a bunch of ways – inconsistent data on the new nodes, stale data on the old ones, perhaps even an unexpected and hard-to-resolve conflict between the two if one of the new nodes then fails. This might seem to be an unlikely scenario, but the third key lesson of scalable systems is this:

Given enough nodes and enough time, even rare scenarios become inevitable.

In other words, you can never sweep the icky bits under the rug. You have to anticipate them, deal with them, and test the way that you deal with them. I can guarantee that certain well-regarded data stores, implemented by generally competent people, mishandle the case I’ve just outlined. I’ve watched their developers hotly deny that such cases can even occur, proving only that they hadn’t thought about how to handle them. It’s not that they’re bad people, or stupid people, but they clearly weren’t thinking at cloud scale.
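For the skeptics, the scenario above can be sketched in a few lines of Python. This is a toy ring with explicitly assigned positions rather than real hashes, purely to make the failure deterministic; all the node names and positions are invented for illustration.

```python
from bisect import bisect_right

class Ring:
    """Toy consistent-hash ring with N-replica preference lists."""
    def __init__(self, replicas=3):
        self.replicas = replicas
        self.points = []              # sorted list of (position, node)

    def add(self, pos, node):
        self.points.append((pos, node))
        self.points.sort()

    def preference_list(self, key_pos):
        # The first `replicas` distinct nodes clockwise from the key.
        i = bisect_right(self.points, (key_pos, chr(0x10FFFF)))
        out = []
        for j in range(len(self.points)):
            node = self.points[(i + j) % len(self.points)][1]
            if node not in out:
                out.append(node)
            if len(out) == self.replicas:
                break
        return out

ring = Ring(replicas=3)
for pos, node in [(100, "A"), (200, "B"), (300, "C"), (400, "D")]:
    ring.add(pos, node)

key = 150                              # datum hashes between A and B
old = ring.preference_list(key)        # ["B", "C", "D"] hold the datum

# N = 3 new nodes all land between the key's hash and its old primary.
for pos, node in [(160, "N1"), (170, "N2"), (180, "N3")]:
    ring.add(pos, node)

new = ring.preference_list(key)        # ["N1", "N2", "N3"]
# An upsert routed by the new ring writes only to N1..N3, creating a
# fresh datum there while the copies on B, C, D silently go stale.
assert set(old).isdisjoint(new)
```

A membership change is only safe here if the affected key ranges are transferred (or at least fenced) before the new nodes start answering upserts; skip that step and you get exactly the split datum described above.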

Trying to reason about systems that have no single authoritative frame of reference is hard. It’s like a store where every customer tries to use a different currency, with the exchange rates changing every minute. Building systems that can never go down for more than a moment, or perhaps never at all, is hard too. It’s no wonder people have trouble with the combination. Nonetheless, that’s what people who make “cloud scale” or “web scale” or “internet scale” products have to get used to. A ten-node cloud is just a puff of warm moist air, and anyone can produce one of those.

Yes, people like Lynch and Lamport, Brewer and Vogels, have covered much of this territory before. Unfortunately, many – including some who have explicitly tried to copy systems like Dynamo – still need reminding from time to time because they keep getting the details wrong.