Thoughts and musings on scalability, messaging, Windows Azure, and programming in general

Scaling is hard. It's the mantra of lots of startups, it's a sales point for software vendors, and it's true. It wasn't always. Up to a few years ago, scalability meant your application could handle a few extra people on a Sunday night. Nowadays, your app can go viral and have a million new users overnight. The meaning of scalability has changed, and so has the impact of a badly scalable application. There are horror stories about startups that couldn't handle the sudden load and went bust when their application died. There is nothing worse for your reputation than when your website goes down the moment a lot of people want to use it. Suddenly, being able to build something properly scalable is a marketable skill, seen by some as more important than building for correctness or robustness.

Scaling is hard for a reason. Not enough people have done it yet. Sure, there's the Facebooks, Microsofts, Amazons, and Twitters of the world, but ask an average programmer (or architect, for that matter), and they'll look at you with a puzzled expression. It's hard because we're at the start of it all, and the knowledge we have hasn't spread far enough yet. This post hopes to remedy at least a bit of that. I intend to give a few tips and things to think about, in the hopes that you will be able to apply them when needed. They're not technology-specific, or even paradigm-specific, because the rules of scalability are the same no matter what you write. So without further ado...

1. Bottlenecks are everywhere

And they're never where you expect them to be. If your userbase doubles, what part of the application will be first to give? Most will say "the load on the CPU", "not enough memory", "my database might crash", etc. In some cases, that's true. More likely, though, you'll run into other limits of your system. Maybe it'll be hard drive speed, leading to read and write faults. Maybe it's the network speed. If your application is split over multiple machines, are you sure all those machines can handle the number of requests that might come in? How many requests can your webserver handle per second, and how many can it queue? Is your cache big enough, and is the invalidation system for it robust enough?

It doesn't matter how many boxes you tick, at some point something will give. The key is to know which one is most likely to break before you scale. Do load tests, and try to simulate proper user behavior. A load test where you just hit a blank page on the web server is useless if your database cracks. Test thoroughly, and test often. Every major addition to the software, hardware or infrastructure should prompt a new battery of load tests. Call this "scalability hygiene".

Once you know what your primary bottleneck is, fix it and move on. Then fix your secondary and your tertiary, and so on until you're satisfied the application will stay up and running. Don't start fixing anything until you're sure where the problems lie. Remember: premature optimization is the root of all evil.
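To make the "primary, secondary, tertiary" idea a bit more concrete, here is a minimal sketch in Python. It assumes you have already measured, through load tests, how many operations per second each component can sustain and how many operations a typical user request causes on it. The component names and all the numbers are invented for illustration.

```python
def rank_bottlenecks(capacity_per_sec, calls_per_request):
    """Rank components by how many user requests/sec each can sustain.

    capacity_per_sec: measured max operations/sec per component.
    calls_per_request: average operations one user request causes on it.
    The first entry in the result is the primary bottleneck.
    """
    limits = {}
    for name, capacity in capacity_per_sec.items():
        calls = calls_per_request.get(name, 0)
        if calls > 0:  # components a request never touches cannot limit it
            limits[name] = capacity / calls
    return sorted(limits.items(), key=lambda item: item[1])


# Hypothetical measurements from a load test.
capacity = {"web": 2000, "database": 5000, "cache": 20000, "disk": 300}
per_request = {"web": 1, "database": 4, "cache": 10, "disk": 0.5}

ranking = rank_bottlenecks(capacity, per_request)
for name, limit in ranking:
    print(f"{name}: saturates at about {limit:.0f} requests/sec")
```

With these made-up numbers the disk gives out first, well before the database most people would have guessed. The model is deliberately naive (it ignores queuing and interactions between components), so treat it as a way to organize load test results, not as a replacement for them.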

2. Robustness is just as important

The more users you have, the more load you have, and the more machines you need, the more likely it is that something will go wrong. Once you hit a certain scale, it's a guarantee that there will always be some percentage of your infrastructure that is in a failure mode. It's your job to handle these cases gracefully. Google has mastered this: their core functionality keeps working no matter what goes on in their data centers. Some things might run slower, specific (tiny) parts might not work, but the bulk of Google's services almost never goes down.

Endeavor to do the same. Here's a thought experiment: take a UML diagram of your application's architecture on a whiteboard. Take a big red pen, and cross out one of the blocks. Now go through it all, what happens to the other blocks? How much functionality can you still provide without it? What steps do you need to take to ensure you can provide that functionality? Are there single points of failure, blocks that will kill the entire application if they fail? Can they be avoided? Then do the same for 2 blocks, and 3.
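The red-pen exercise can also be run as a small script. The sketch below assumes a very simple model: each feature lists the blocks it cannot work without. The feature and block names are hypothetical; swap in the blocks from your own diagram.

```python
from itertools import combinations

# Hypothetical architecture: feature -> blocks it cannot work without.
FEATURES = {
    "browse_catalog": {"load_balancer", "web", "database"},
    "search": {"load_balancer", "web", "search_index"},
    "checkout": {"load_balancer", "web", "database", "payment_gateway"},
}


def surviving_features(features, failed):
    """Features that still work when the given blocks are crossed out."""
    failed = set(failed)
    return sorted(name for name, needs in features.items()
                  if not (needs & failed))


def single_points_of_failure(features):
    """Blocks whose failure alone leaves no feature working."""
    blocks = set().union(*features.values())
    return sorted(b for b in blocks if not surviving_features(features, [b]))


def report(features, how_many):
    """Cross out every combination of `how_many` blocks and print the result."""
    blocks = sorted(set().union(*features.values()))
    for failed in combinations(blocks, how_many):
        print(failed, "->", surviving_features(features, failed))


print(single_points_of_failure(FEATURES))
report(FEATURES, 2)
```

In this made-up architecture, losing the database still leaves search available, while the load balancer and the web tier are single points of failure. A real system has softer dependencies (a cache miss is slower, not fatal), which this model does not capture.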

Consider using a Chaos Monkey. If none exist for your platform or infrastructure, write a simple one. Don't ever turn it off. It's a great tool for learning where potential disasters are situated, and it's unparalleled for learning about robustness.
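"Write a simple one" can be taken quite literally. The sketch below is not the Netflix tool, just the bare idea: pick a random running instance and kill it. How you list and terminate instances depends entirely on your platform, so both are passed in as functions; the in-memory fleet here is a stand-in for illustration.

```python
import random


class ChaosMonkey:
    """Terminates one randomly chosen instance per strike."""

    def __init__(self, list_instances, terminate, rng=None):
        # list_instances: callable returning the currently running instances.
        # terminate: callable that kills one instance.
        # Both are assumptions; wire them to your own platform's tooling.
        self.list_instances = list_instances
        self.terminate = terminate
        self.rng = rng or random.Random()

    def strike(self):
        instances = list(self.list_instances())  # copy, the fleet may change
        if not instances:
            return None
        victim = self.rng.choice(instances)
        self.terminate(victim)
        return victim


# Stand-in fleet; a real setup would call the hosting provider instead.
fleet = ["web-1", "web-2", "db-1"]
monkey = ChaosMonkey(lambda: fleet, fleet.remove)
victim = monkey.strike()
print("terminated", victim, "remaining:", fleet)
```

Run something like this on a schedule and watch what breaks. The interesting part is never the monkey itself, it's what your monitoring and your users notice afterwards.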

3. Load balancers are stupid

I don't mean they are a bad idea. A load balancer is the first part of your application any user will hit, and it is absolutely vital. Love your load balancer, tune it and take good care of it. What I mean is that they are not very intelligent. Because of the amount of data they're supposed to handle, they can be a blindingly stupid part of your infrastructure. The more intelligence you expect from it, the slower it will be, and the more likely that it will end in tears. It would also mean your application is harder to migrate to a different one, since you would need to be sure that the new LB can handle the exact same logic yours does now.

Keep it as dumb as possible. All a load balancer has to do is pick a machine to send a request to; anything else is a risk. This is doubly important if you're using one of the main cloud hosts. All of them have built-in load balancers that can handle staggering numbers of requests but are dumb as nails. Forget doing sticky sessions, forget advanced routing tables or content inspection: they are fire and forget. Build your application around that, by making sure every server on the LB can handle every request.
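To show just how little "pick a machine" needs to involve, here is a round-robin picker in a few lines of Python. It is a sketch of the selection logic only, not a real load balancer, and the server names are placeholders.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation and knows nothing else."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("need at least one server")
        self._cycle = itertools.cycle(servers)

    def pick(self):
        # No session state, no content inspection: just the next server.
        return next(self._cycle)


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
picks = [lb.pick() for _ in range(5)]
print(picks)
```

Because the balancer keeps no per-user state, this only works if every server can handle every request, which is exactly the constraint described above.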

4. Define a scaling unit

If you carry nothing else from this, remember this heading. A scaling unit is a vertical slice of your application on an infrastructure level. It contains everything, from your front-end server, to your database instance, to your caching and queuing servers. It's the smallest unit of your application that can provide all the functionality, with maximum usage of your resources. In simple applications, this will be a web server and a database server (or instance, for cloud hosts). In advanced applications, the smallest slice might be 15 servers and 5 switches. In Windows Azure Storage, a scaling unit ('stamp') is 10-20 racks of servers with 20 PB of storage space. Ideally, under the maximum load the unit can handle, everything in it will be at capacity, with your current primary bottleneck just about holding on. Keep the unit as small as possible, because the smaller it is, the finer you can tune your scale to your needs.

Once you have defined your scaling unit, perform load testing on it, the same testing as above. Make sure you know exactly when your scaling unit is at capacity, when it can handle more and when it will crack, and what its behavior is in the intervals. Most importantly, know exactly how many users your scaling unit can handle. Congratulations, you've defined a minimum scaling step. You now have a template that you can duplicate to scale.

Once you have this down, scaling becomes increasingly easy. If you get close to the tipping point of your first scaling unit, duplicate it. You now have 2 separate instances of your application running side by side, either behind a common load balancer or a more advanced traffic manager. Your application can now handle twice the number of users. Still not enough? Add another. If you've done your testing and building properly, every duplicate of the scaling unit you add increases your maximum userbase by the exact amount your testing showed. It won't matter if you have just one instance of your scaling unit, or a million; it'll be perfectly predictable.

The difficult part in this is making sure data gets to where it belongs. This depends a lot on your application structure. Some share the data among their scaling units through database syncing, others split the units over locations and let a traffic manager send users to the right place. Still others segregate with subdomains. There is no perfect solution, but they all fit an important criterion: they depend as little as possible on the load on the units. What good is having an application that scales perfectly and predictably, but mysteriously gives the wrong data after a certain number of scaling units? Make sure that the way you manage your units doesn't ever become a bottleneck.

5. Autoscaling

Don't bother. Not at first. It's very tempting to make scaling an automatic step that you don't have to worry about, but unless you've chosen manually when to scale up and down for a while, you won't know when to do it. Scaling too late can be a problem, but scaling the wrong way or scaling too early can end up costing a lot more money. Start doing it manually, keeping an eye on your metrics, your userbase and your logging, and learn when the best moments are and what the criteria are for needing to scale up. Learn the behavior of your users, and try to divine rules that you can apply so you can anticipate when you will need to do a big or small scaling action.

6. Don't forget to scale down

Last but not least. Not many applications grow in only one direction. Most have peaks where they serve a large number of requests and quiet moments where all is calm. Some grow, only to decline again after a long enough time. Often, your application might need to scale because users are using it differently than you thought. Remember to scale down when the load goes down. If your users' focus has changed, go back and redefine your scaling unit. Maybe you need less of server X and more of server Y in your unit, to handle the most representative load. It might save you a lot of money. And after all, money is the reason you don't just buy enough hardware to handle everybody on the planet.
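Once load testing has told you how many users one scaling unit can handle, deciding how many units to run, in either direction, is plain arithmetic. The sketch below assumes capacity grows linearly with the number of units. The 20% headroom is my own assumption so you duplicate before a unit hits its tipping point, and the user counts are invented.

```python
def units_needed(expected_users, users_per_unit, headroom_percent=20):
    """Scaling units required for the expected load, never fewer than one.

    users_per_unit comes from load testing a single scaling unit.
    headroom_percent keeps each unit below its measured tipping point.
    """
    usable = users_per_unit * (100 - headroom_percent) // 100
    if usable <= 0:
        raise ValueError("headroom leaves no usable capacity")
    # Integer ceiling division, so there is no floating point rounding.
    return max(1, -(-expected_users // usable))


def scaling_action(current_units, expected_users, users_per_unit):
    """Positive: units to add. Negative: units you can switch off."""
    return units_needed(expected_users, users_per_unit) - current_units


# One unit handled 10,000 users in (hypothetical) load tests.
print(units_needed(24_000, 10_000))       # peak load
print(scaling_action(3, 3_000, 10_000))   # quiet period with 3 units running
```

A negative result is the reminder this heading is about: during the quiet period in the example, two of the three units are costing money for nothing.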

And many more

I could keep going, but I've taken up enough of your time. There's a million more tricks to do and things to think about, but keep these things in mind and hopefully you'll be able to build an application from scratch, and be confident it will be ready for whatever the future throws at it.

"Don't do it yourself"

The primary way worker efficiency can be improved is an old principle: "Don't do it yourself". If you want a particular worker to produce tables twice as fast, stop making him saw his own wood, but supply wood pieces in the correct sizes. The worker can now specialize in composing tables from those precut parts, while another can specialize in turning raw materials into correctly sized wood pieces. The next step after that is buying the wood from someone else.

In the 19th century, interchangeable parts became the name of the game. It was no longer common for a worker or even a factory to produce everything they needed themselves from raw materials. They could go out and purchase the parts from any other manufacturer. The godfather of this evolution was Henry Maudslay. Maudslay is credited with inventing the first bench micrometer, a device capable of measuring anything his factories produced to an accuracy of 3 micrometers, or 0.003 millimetres. It gave us compatible threads and diameters on screws, nuts and bolts, and quickly became one of the biggest motors for standardization. The end result was that factories didn't need to produce custom parts for their products any more. They could buy parts with standard measurements or have things produced for them by providing the exact measurements. They could trust that whatever they bought, if it conformed to those measurements, would work in their own products.

Why a 19th-century inventor matters to software

Software hasn't reached this point yet, but we are on the right path. Open standards have always been important, but in recent years more and more software companies have jumped ship from their proprietary protocols and formats to open industry-wide standards. This goes for all levels, from how you send data through a copper wire to queueing protocols. As a direct consequence of that, loose coupling, that holy grail of robust software, is made easier. Service Oriented Architecture, CBSE, Microservices, Actor model, Communicating Sequential Processes, it doesn't matter what you call it. The principle is always the same. Build an application out of independent parts that communicate, and it will be more robust, easier to scale, and easier to maintain and improve. Open standards are now making this easier and allow us to build components in entirely different languages and paradigms, but still compose them into functional software.

The next step is to make these components so standardized that they can be switched out at will. If I can switch out components of my application at will, I can also build software that way. In theory, I won't ever need to know how code is written, or how the protocols function. I could take a bunch of components, build my flow from those, and be done. Software would go from a craft to a commodity. This goes for distributing it to clients as well. Everybody in corporate IT dreams of an install procedure that consists of exactly one step: clicking 'done'. If protocols are standardized enough, that's all that will ever be required.

Standards, standards, standards

All this means having a default way of communicating, so commands and queries can be issued without worrying about how they will get to their destination. That's the easy part. The hard part is how you communicate your intent to a component. There are no standard commands to send to a service that stores something, or that sends an invoice. We have it for e-mail in SMTP, why not for accounting software? What is the standard protocol to address a social network?

What we are missing is Mr. Maudslay. There is no software equivalent to the micrometer. There is no way to define a standard, and then run a battery of checks to see if it will behave correctly in all situations. Unit testing and TDD help us on this path, but they aren't far-reaching enough. The goal to shoot for should be being able to write any component, and validate that comprehensively against the intended standard protocol, without writing any tests yourself. It also requires a standard protocol for providing standard protocols (Yo dawg...), so someone in need of a custom component can provide the exact specifications it should meet, in the same way a factory provides a set of measurements to its suppliers.

I don't know if it is even possible to build such a micrometer. Theoretically it must be, but it requires wide adoption of a common system. This kind of wide spread has happened in the past, but it is never easy. My magic eight-ball is clear: "Ask again later."
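To give a feel for what "validate against the standard without writing any tests yourself" could look like on a toy scale, here is a sketch. Everything in it is hypothetical: the "storage standard" is invented, and three checks are nowhere near the comprehensive battery a real micrometer would need. The point is only the shape of it: the checks ship with the standard, and any implementation can be run through them unchanged.

```python
def storage_conformance_suite():
    """Checks that would be published alongside the (made-up) standard."""

    def stores_and_returns(store):
        store.put("k", "v")
        return store.get("k") == "v"

    def missing_key_is_none(store):
        return store.get("absent") is None

    def overwrite_wins(store):
        store.put("k", "old")
        store.put("k", "new")
        return store.get("k") == "new"

    return [stores_and_returns, missing_key_is_none, overwrite_wins]


def validate(make_store, suite):
    """Run every check against a fresh instance; return names of failures."""
    failures = []
    for check in suite:
        try:
            ok = check(make_store())
        except Exception:
            ok = False  # crashing counts as not conforming
        if not ok:
            failures.append(check.__name__)
    return failures


class DictStore:
    """A component that happens to conform."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)


class StrictStore(DictStore):
    """A component that raises on unknown keys, so it does not conform."""
    def get(self, key):
        return self._data[key]


suite = storage_conformance_suite()
print(validate(DictStore, suite))
print(validate(StrictStore, suite))
```

The author of either store wrote no tests. Whether a suite like this can ever be made comprehensive enough to replace trust is exactly the open question.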

"Manufacturing software"

In the physical world, a product goes through 2 major steps in its life cycle: design and manufacturing. In most industries, this is a very 'over-the-wall' process: production only starts after the design phase is completely finished, the necessary machinery has been set up, and workers have been trained to manufacture the new item. It is often argued that software does not have this structure, since there is no manufacturing step. While this is true for most in-house developed software, it does seem to be possible to identify some post-design steps in software sold either as a product or as a service. These are steps that are not part of the design phase, but need to happen again and again for every sale, and are therefore the closest we have to 'manufacturing'. Those include deployment to production servers, configuring the software, adapting other systems to properly integrate with the new piece of code, etc.

Why are these steps necessary? In what way is software so different from hardware that there is significant post-sale work to be done before it is operational? My watch did not need to be installed by a professional. I did not need to specify what colour I wanted the strap to be. It's perfectly normal for physical items to be something you buy without giving it input. The most important factor for this is simply choice. I did not need to configure my watch, because there are thousands of models available, from the very cheap to the extremely expensive. Some are very basic, others are intricately designed timepieces that work in the most extreme circumstances. All of those would fulfill my basic requirement. I never needed to configure exactly what I wanted, because I could shop around until I found the perfect watch. In the process, I saw options I never knew existed.

The current state of software is very different. Software requirements are, on average, not more complicated than the requirements we have for physical items. The difference is in choice. If I'm lucky, and what I'm looking for is a popular requirement, I could have a few dozen software systems I could buy. All of them will fill my need, but most will be heavy and expensive, and will do far more than I ever wanted. And most of them will need some form of configuration to match my IT and my corporate structure. Why are there not thousands of applications that just do contact management? Why can I not simply choose the one that matches what I need, and have that be the end of it?

The 1800s vs. now

To figure out how we can change that market and what the future might hold, let's take a look at the physical industrial revolution. The world changed a lot in the years between the middle of the 18th century and the middle of the 19th. We developed machine manufacturing, made better use of steam power and coal, and increased worker productivity by enormous amounts. We were able to refine basic elements, like iron, better and on a larger scale, and transportation systems became faster, more efficient and most of all capable of handling a far greater load. All these developments together led to a slow revolution, a time when factories started working together and becoming each other's clients on a massive scale. After those roughly 100 years, craftsmen had been replaced by factories that assemble products from components that they bought from other factories, that made them from basic materials bought from still others, etc.

The great developments of the day did not happen in isolation, of course. They played off each other, but it was only because they were all present and making great strides that the industrial revolution went so fast and was so successful. Each of them also seems to have an equivalent in the software world.

Let's start off with the basics. The first modern steam power plant was created around 1712. By 1783, James Watt had managed to improve its power output and reliability, and managed to turn the power into a rotary motion. Power to drive machinery became cheap and easily available. Coal became ubiquitous, allowing greater energy output for less work. In later years, electricity came on the scene, which was a lot more flexible and, crucially, pay-per-use. For years, we've seen the same developments in software. Computers have become more powerful and cheaper. They've also become more easily available, throughout the world, to the point where a large percentage of the population now has a computer in their pockets. We see the pay-per-use factor return in every cloud offering, and computing power of any magnitude has become available and open to everybody through cloud providers.

The increased use of coal caused major breakthroughs in metallurgy. Higher temperatures and cleaner techniques for burning improved the quality of available copper, iron and steel. To a software engineer, code is iron. A line of code is the basic element of our entire industry. Stronger computers and more advanced ways of using their power have allowed that basic element to improve in quality as well. We run garbage collectors, abstraction layers and emulation systems, all with the purpose of making our lines of code as clean and as high-quality as possible. Programming languages become more advanced by the day. New programming paradigms and models tend to be less efficient power-wise, but better for cleanliness and extensibility. We've gone from writing assembly code to Python, from COBOL to .NET.

Transportation of data has steadily become faster. 10 gigabit Ethernet is no longer an exception, 4G networks are being rolled out across the globe, and even structures as simple as a message queue are now able to handle enormous quantities of data. Meanwhile, basic connectivity has become far more reliable, too. Our faster networks enable new business models, even ones that would have been completely alien to us before. Software-As-A-Service lives by this. Nobody on dial-up would ever have considered uninstalling their local word processor for something like Google Docs. Nowadays, it's not just common; for a lot of software it's the preferred model. Even local e-mail clients are in a rapid decline.

Now what?

So far, these are all developments we've got down pat. The basic groundwork for an industrial revolution has been laid, but we're not there yet. How do we increase worker efficiency a thousandfold? How do we automate the installation and configuration procedures to the same level as an automated factory? Are we far from those goals, or are we perhaps closer than we think? I hope I'll be able to shed some light in part 3, so stick around.

What is the general verb to use when a developer does his job? Does he 'create' software? 'build'? 'engineer'? In recent years, the word 'craft' has become more fashionable. We are all 'software craftsmen', and the craftsmanship movement has gained enormous momentum. It is definitely an apt description. We focus on quality, creativity and mastery of the skill of programming. Like craftsmen in the past, every line is bespoke, every function something toiled over.

The desk my computer is on was not 'crafted'. Neither was the chair I sit in, or the computer itself. Not even complex machinery, like my car, was crafted. It was manufactured and assembled. Somewhere along the line from idea to my driveway came engineers and designers, but they would be hard-pressed to describe themselves as craftsmen. None of them worried about how to build the screws that hold it together, and yet it will work for years in changing conditions and constant use, with minor maintenance. Why is that so hard for software?

History, when seen through a vague enough lens, is fractal in nature. Parts of it repeat the whole, on shorter time scales. Until about 350 years ago, assembly lines and manufacturing processes hadn't been invented yet. The world was filled with craftsmen. Artisans, experts at their job, knew everything there was to know about their subject. They crafted products of amazing quality and beauty, some that still last to this day. What they couldn't do was handle complexity. They were unable to produce complex machinery reliably. Does that ring a bell?

All of that changed with the Industrial Revolution. Suddenly, all kinds of products flooded the market. They were cheap and reliable, and the longer time went on, the more intricate and well-designed they were. Economies of scale started playing, and it was suddenly possible to afford conveniences nobody could before. The complexity of available products shot up, but economies of scale and progressing knowledge meant they could be cheap and reliable on a level no craftsman could even imagine. Workers became able to build those complicated products, even without much schooling or in-depth knowledge of the underlying physics. The craftsmen of before morphed into engineers and designers. They did not build anything beyond prototypes themselves; they laid out plans and defined techniques. While the current generation of products is being produced, they invent the next. Workshops with a master and his apprentices ceased to exist, replaced by Research and Development, and production divisions.

Software is on the verge of the industrial revolution. Several of the important steps have already been taken, others are being worked on. In part 2, I will try to examine if a few of the most important factors in the Industrial Revolution are at play in the software industry, and what the missing factors might look like.