Notes on Scalability

We all hope that we’re going to succeed beyond our wildest expectations. Startups long for multi-billion dollar IPOs or scaling to hundreds, or even thousands, of servers. Every hosting provider is touting how their new cloud offering will help us scale up to unheard of heights. I’ve built things up and torn them down a few times over my career

Build it to Break

Everything you make is going to break, plan for it.

Whenever possible, design the individual layers of an application to operate independently and redundantly. Start with two of everything – web servers, application servers, even database servers. Once you realize that everything can and will fail, you’ll be a lot happier with your environment, especially when something goes wrong. Well designed applications are built to fail. Architects accept that failure is inevitable. It’s not something that we want to consider, but it’s something that we have to consider.

Distributed architecture patterns help move workloads out across many autonomous servers. Load balancers and web farms help us manage failure at the application server level. In the database world, we can manage failure with clustering, mirroring, and read-only replicas. Everything computer doesn’t have to be duplicated, but we have to be aware of what can fail and how we respond.

Everything is a Feature

As Jeff Atwood has famously said, performance is a feature. The main thrust of Jeff’s article is that making an application fast is a decision that you make during development. Along the same lines, it’s a conscious decision to make an application fault tolerant.

Every decision that has a trade off. Viewing the entire application as a series of trade offs leads to a better understanding about how the application will function in the real world. The difference between being able to scale up and being able to scale out can often come down to understanding key decisions that were made early on.

Scale Out, Not Up

This isn’t as axiomatic as it sounds. Consider this: cloud computing like Azure and AWS is at its most flexible when we can dynamically add servers in response to demand. To effectively scale out means that we need also to be able to scale back in.

Adding additional capacity is usually in the application tier; just add more servers. What happens when we need to scale the database? The current trend is to buy a faster server with faster disks and more memory. This process keeps repeating itself. Hopefully your demand for new servers will continue at a pace that is less than or equal to the pace of innovation. There are other problems with scaling up. As performance increases, hardware gets more expensive for smaller and smaller gains. The difference in cost between the fastest CPU and second fastest CPU is much larger than the performance gained – scaling up often comes at a tremendous cost.

The flip side to scaling up is scaling out. In a scale out environment, extra commodity servers are added to handle additional capacity. One of the easiest ways to manage scaling out the database is to use read-only replica servers to provide scale out reads. Writes are handled on a master server because scaling out writes can get painful. But what if you need to scale out writes? Thankfully, there are many techniques available to horizontally scaling the database layer – features can be broken into distinct data silos, metadata is replicated between all servers while line of business data is sharded, or automated techniques like SQL Azure’s federations can be used.

The most important thing to keep in mind is that it’s just as important to be able to contract as it is to expand. As a business grows it’s easiest to keep purchasing additional servers in response to load. Purchasing more hardware is faster and usually cheaper than tuning code. Once the application reaches a maturity level, it’s important to tune the application to run on fewer resources. Less hardware equates to less maintenance. Less hardware means less cost. Nobody wants to face the other possibility, too – the business may shrink. A user base may erode. A business’s ability to respond to changing costs can be the difference between a successful medium size business and a failed large business.

Buy More Storage

In addition to scaling out your servers, scale out your storage. If you have the opportunity to buy a few huge disks or a large number of small, fast disks give serious thought to buying the small, fast disks. A large number of small, fast drives is going to be able to rapidly respond to I/O requests. More disks working in concert means that less data will need to be read off of each disk.

The trick here is that modern databases are capable of spreading a workload across multiple database files and multiple disks. If multiple files/disks/spindles/logical drives are involved in a query, then it’s possible to read data from disk even faster than if only one very large disk were involved. The principle of scaling out vs. scaling up applies even at the level of scaling your storage – more disks are typically going to be faster than large disks.

You’re Going to Do It Wrong

No matter how smart or experienced your team is, be prepared to make mistakes. There are very few hard and fast implementation guidelines about scaling the business. Be prepared to rapidly iterate through multiple ideas before finding the right mix of techniques and technologies that work well. You may get it right on the first try. It may take a number of attempts to get it right. But, in every case, be prepared to revisit ideas.

Related

When I’m not working with databases, you’ll find me at a food truck in Portland, Oregon, or at conferences such as DevLink, Stir Trek, and OSCON. My sessions have been highly rated and I pride myself on their quality.

I couldn’t agree more about throwing a visit to Newegg or Fry’s at the problem – RAM and SSDs can solve many problems. I would hope that people will try to scale their database servers up long before they go down the route to scale them out. That can be a painful path, especially if you need ACID across the whole dataset.

That’s my first thought exactly when reading this post. Most people will never actually have to scale out the database. This is even more true in the era of enterprise class SSDs like Fusion-io.
Think about scaling out, prepare plan for it, but don’t lose your sleep over it. I have the infrastructure where single powerful database (with mentioned Fusion-io SSD) doesn’t cut it – but you will not find yourself in that place overnight (unless you are working at place like Facebook or Google, of course 🙂 ).

Great article, really enjoyed reading this. Being prepared to rewrite core functionality and make it better tends to be a stumbling block. There is an anxiety about breaking something that kinda works because it kinda works.

You hit the nail right on the head when you added “by removing unused/unnecessary instances”. Sometimes you’ll be scaling with physical servers and sometimes with virtual servers. Either way, you need to be ready to reduce your physical infrastructure when load/demand decreases. Nobody wants to pay for servers they aren’t using.

Excellent article that really hit home. I’m involved in a project right now helping redesign one of our core products for scalability and high availability. I was pleasantly surprised to see many of the ideas I had been thinking about or pushing for you covered.